Monday, December 09, 2019

PyData LA 2019: Trip Report


PyData LA 2019 was last week, and I had the opportunity to attend. I also gave a presentation on NERDS (Named Entity Recognition for Data Scientists), an open source toolkit built by some colleagues from our Amsterdam office. This is my trip report.

The conference was three days long, Tuesday to Thursday, and was spread across three parallel tracks, so my report is necessarily incomplete, and limited to the talks and tutorials I attended. The first day was tutorials, and the next two days were talks. In quite a few cases, it was tough to choose between simultaneous talks. Fortunately, the talks were videotaped, and the organizers have promised to put them up in the next couple of weeks, so I am looking forward to catching up on the presentations I missed. The full schedule is here. I am guessing attendees will be notified by email when the videos are released, and I will also update this post when that happens.

Day 1: Tutorials


For the tutorials, I attended all the ones in the first track. I came in a bit late for the first, Computer Vision with PyTorch, since I miscalculated the volume (and delaying power) of LA traffic. It was fairly comprehensive, but I was already familiar with at least some of the material, so in retrospect I should probably have attended one of the other tutorials.

The second tutorial was about Kedro and MLflow, and how to combine the two to build reproducible and versioned data pipelines. I didn't know that MLflow can be used standalone outside Spark, so that is definitely something to follow up on. Kedro looks like scaffolding software that allows users to hook into specific callback points in its lifecycle.
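
Since the standalone MLflow usage was the part that was new to me, here is a minimal sketch of what that looks like. It assumes only that mlflow is pip-installed; the parameter and metric values are placeholders.

```python
import mlflow

# Standalone MLflow tracking: runs are logged to a local ./mlruns
# directory by default (no Spark or tracking server needed), and can
# be browsed later with the `mlflow ui` command.
with mlflow.start_run(run_name="baseline"):
    # hyperparameters for this run (values here are placeholders)
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("n_estimators", 100)

    # ... train and evaluate a model here ...
    accuracy = 0.92  # stand-in for a real evaluation result

    mlflow.log_metric("accuracy", accuracy)
```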

The third tutorial was a presentation on teaching a computer to play Pac-Man using Reinforcement Learning (RL). RL apps definitely have a wow factor, and I suppose RL can be useful where the environment is deterministic enough (rules of a game, laws of physics, etc.), but I often wonder whether we can use it to train agents that operate in a more uncertain "business applications"-like environment. I am not an expert on RL though, so if you have ideas on how to use RL in these areas, I would appreciate learning about them.
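
For anyone who has not seen it, the core of a demo like the Pac-Man one is often just the tabular Q-learning update. Here is a minimal sketch; the states, actions, and constants are placeholders I made up, not anything from the tutorial.

```python
import random
from collections import defaultdict

# Core of tabular Q-learning:
#   Q(s,a) += alpha * (reward + gamma * max_a' Q(s',a') - Q(s,a))
ALPHA, GAMMA, EPSILON = 0.1, 0.99, 0.1
ACTIONS = ["up", "down", "left", "right"]
Q = defaultdict(float)  # (state, action) -> estimated value

def choose_action(state):
    # epsilon-greedy: explore occasionally, otherwise exploit estimates
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

# one hypothetical step: the agent moved from s0 to s1 and ate a pellet (+1)
update("s0", choose_action("s0"), 1.0, "s1")
```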

The fourth and last tutorial of the day was on predicting Transcription Factor (TF) genes from genetic networks using Neural Networks. The data extraction process was pretty cool: it was predicated on the fact that TF genes typically occupy central positions in genetic networks, so graph-based techniques such as connectedness measures and Louvain modularity can be used to detect them. These genes form the positive samples, and standard negative sampling is used to extract the negative samples. The positive records (TFs) are then oversampled using SMOTE. Features for these genes come from an external dataset of 120 or so experiments, where each gene was subjected to these experiments and the results recorded. I thought the coolest part was using graph techniques to build up the dataset.
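
Here is roughly how I understood the dataset construction, as a hedged sketch: the specific libraries (networkx, python-louvain, imbalanced-learn), thresholds, and file names are my own stand-ins, not necessarily what the presenters used.

```python
from collections import defaultdict

import networkx as nx
import numpy as np
import community as community_louvain  # pip install python-louvain
from imblearn.over_sampling import SMOTE

# Genetic network: nodes are genes, edges are interactions (file name made up).
G = nx.read_edgelist("gene_network.edgelist")

# TF genes tend to occupy central positions, so rank genes by centrality
# within each Louvain community and treat the hubs as TF candidates.
centrality = nx.degree_centrality(G)
partition = community_louvain.best_partition(G)  # gene -> community id

by_comm = defaultdict(list)
for gene, comm in partition.items():
    by_comm[comm].append(gene)

tf_candidates = set()
for members in by_comm.values():
    members.sort(key=lambda g: centrality[g], reverse=True)
    tf_candidates.update(members[:5])  # top-5 hubs per community, cutoff made up

# Features: an external matrix of ~120 experimental results per gene;
# negatives would come from negative sampling over the remaining genes.
genes = sorted(G.nodes())
X = np.load("gene_experiment_features.npy")  # shape (n_genes, ~120), row order = genes
y = np.array([1 if g in tf_candidates else 0 for g in genes])

# The positive (TF) class is small, so oversample it with SMOTE.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
```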

Days 2 and 3: Talks


All the talks provided me with some new information in one form or another. In some cases it was a tough choice, since multiple simultaneous talks seemed equally interesting to me going in. Below I list the ones I attended and liked, in chronological order of appearance in the schedule.

  • Gradient Boosting for data with both numerical and text features -- this talk was about the CatBoost library from Yandex, and focused on how much better CatBoost performs (especially on GPU) compared to other open source Gradient Boosting libraries (LightGBM and one other that I can't recall at the moment). CatBoost definitely looks attractive, and at some point I hope to give it a try (there is a short usage sketch after this list).
  • Topological Techniques for Unsupervised Learning -- talked about how Uniform Manifold Approximation and Projection (UMAP), a topological technique for dimensionality reduction, can be used to generate very powerful embeddings and to do clustering that is competitive with t-SNE. UMAP is more fully described in this paper on arXiv (the presenter was one of the co-authors of the paper). There was one other presentation on UMAP, by another of the co-authors, which I was unable to attend. A minimal usage sketch appears after this list.
  • Guide to Modern Hyperparameter Tuning Algorithms -- presented the open source Tune hyperparameter tuning library from the Ray team. As with the previous presentation, there is a paper on arXiv that describes the library in more detail. The library provides functionality to do grid, random, Bayesian, and genetic search over the hyperparameter space. It seems to be quite powerful and easy to use, and I hope to try it out soon (see the sketch after this list).
  • Dynamic Programming for Hidden Markov Models (HMM) -- one of the clearest descriptions I have ever seen of the implementation of the Viterbi algorithm (given the model parameters and a sequence of observations, find the most likely sequence of hidden states). The objective was for the audience to understand HMMs (specifically the Viterbi algorithm) well enough to apply them to new domains. I have included a small NumPy implementation after this list.
  • RAPIDS: Open Source GPU Data Science -- I first learned about NVIDIA's RAPIDS library at KDD 2019 earlier this year. RAPIDS provides GPU-optimized drop-in replacements for NumPy, Pandas, Scikit-Learn, and NetworkX (CuPy, cuDF, cuML, and cuGraph), which run an order of magnitude faster if you have a GPU. Unfortunately, I don't have a GPU on my laptop, but the presenter said that images with RAPIDS pre-installed are available on Google Cloud (GCP), Azure, and AWS. There is a small cuDF sketch after this list.
  • Datasets and ML Model Versioning using Open Source Tools -- this was a presentation on the Data Version Control (DVC) toolkit, which gives you a set of git-like commands to put your data under version control: small metadata files are committed to git, and each links to a physical file in some storage area such as S3. We had envisioned using it internally some time back for putting our datasets and ML models under version control, so I was familiar with some of the information presented, but I thought the bit about creating versioned ML pipelines (data + model(s)) was quite interesting. A sketch of the Python API appears after this list.
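
Here is the CatBoost sketch promised above: a minimal example with a categorical column handled natively. The data is a toy I made up; as I understand it, the text-feature support the talk covered uses a similar text_features argument in recent versions, and task_type="GPU" switches on GPU training.

```python
import pandas as pd
from catboost import CatBoostClassifier, Pool

# Toy data: one numerical and one categorical feature (values made up).
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 23, 38],
    "city": ["LA", "SF", "LA", "NY", "SF", "NY"],
    "clicked": [0, 1, 1, 0, 0, 1],
})

# CatBoost handles categorical columns natively: pass their names
# instead of one-hot encoding them yourself.
train_pool = Pool(df[["age", "city"]], label=df["clicked"], cat_features=["city"])

model = CatBoostClassifier(iterations=100, learning_rate=0.1, verbose=0)
model.fit(train_pool)
print(model.predict_proba(df[["age", "city"]]))
```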
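And the UMAP sketch: the library has a scikit-learn style interface, so reducing a dataset to 2D takes just a few lines. The parameter values below are the library defaults, not anything from the talk.

```python
import umap
from sklearn.datasets import load_digits

digits = load_digits()

# Project the 64-dimensional digit images down to 2D; the embedding can
# be plotted directly or handed to a downstream clusterer.
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
embedding = reducer.fit_transform(digits.data)
print(embedding.shape)  # (1797, 2)
```

For clustering, my understanding of the suggested recipe is to reduce to a modest number of dimensions first and then run a density-based clusterer such as HDBSCAN on the embedding.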
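Next, a sketch of Tune's function-based API. The exact calls have shifted across Ray versions, so treat this as approximate; the trainable and its "loss" are made up.

```python
from ray import tune

# A trainable is just a function of a config dict; Tune runs many of
# these in parallel, each with different hyperparameter values.
def trainable(config):
    # stand-in for a real training loop; this "loss" is made up
    loss = (config["lr"] - 0.01) ** 2 + config["momentum"] * 0.1
    tune.report(loss=loss)

analysis = tune.run(
    trainable,
    config={
        "lr": tune.loguniform(1e-4, 1e-1),           # sampled search dimension
        "momentum": tune.grid_search([0.8, 0.9, 0.99]),  # exhaustive grid dimension
    },
    num_samples=10,
)
print(analysis.get_best_config(metric="loss", mode="min"))
```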
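Since the Viterbi algorithm is compact enough to fit in a few lines of NumPy, here is my own implementation of the idea from the talk; the toy parameters at the bottom are made up.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    """Most likely hidden state sequence for an observation sequence.

    pi: initial state probabilities, shape (S,)
    A:  transition probabilities, A[i, j] = P(state j | state i), shape (S, S)
    B:  emission probabilities, B[i, k] = P(obs k | state i), shape (S, O)
    """
    S, T = A.shape[0], len(obs)
    delta = np.zeros((T, S))           # best path probability ending in each state
    psi = np.zeros((T, S), dtype=int)  # backpointers to the best previous state

    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A  # (S, S): prev state x next state
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]

    # Backtrack from the most probable final state.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

# Toy usage: 2 hidden states, 3 observation symbols (numbers made up).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
print(viterbi([0, 1, 2], pi, A, B))
```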
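The cuDF sketch: the point of RAPIDS is that code like this is (mostly) just pandas with the import swapped, running on the GPU. I could not run it myself for lack of a GPU, so treat it as illustrative.

```python
# Requires an NVIDIA GPU with RAPIDS installed (e.g. via the conda
# packages or the prebuilt cloud images mentioned in the talk).
import cudf

# cuDF mirrors the pandas API, so existing code often ports with an
# import swap; the groupby below runs on the GPU.
gdf = cudf.DataFrame({
    "key": ["a", "b", "a", "b", "c"],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0],
})
print(gdf.groupby("key")["value"].mean())
print(gdf.to_pandas())  # move back to CPU/pandas when needed
```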
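Finally, the DVC sketch. The day-to-day workflow is CLI-driven (dvc add puts a small .dvc metafile under git while the data itself goes to remote storage via dvc push); the Python API below reads a specific version back, with the repo URL, file path, and revision tag as placeholders.

```python
import dvc.api

# Read one specific version of a DVC-tracked dataset; the repo URL,
# file path, and revision tag below are placeholders.
with dvc.api.open(
    "data/train.csv",
    repo="https://github.com/org/project",
    rev="v1.0",
) as f:
    header = f.readline()
    print(header)
```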

And here are the talks I would like to watch once the videos are uploaded.

  • Serverless Demo on AWS, GCP, and Azure -- this was covered in the lightning round on the second day. I think this is worth learning, since it seems to be an easy way to set up demos that work on demand. Also learned about AWS Batch, a "serverless" way to serve batch jobs (or at least non-singleton requests).
  • Supremely Light Introduction to Quantum Computing -- because Quantum Computing, about which I know nothing.
  • Introducing AutoImpute: Python package for grappling with missing data -- no explanation needed, clearly, since real life data often comes with holes, and something like this gives us access to a bunch of "standard" imputation strategies fairly painlessly.
  • Tackling Homelessness with Open Data -- I would have attended this if I had not been presenting myself. Using Open Data for social good strikes me as something we, as software people, can do to make our world a better place, so I am always interested in seeing (and cheering on) others who do it.
  • What you got is What you got -- the speaker, James Powell, is a regular at PyData conferences whom I have heard before, and he always manages to convey deep Python concepts in a most entertaining way.
  • GPU Python Libraries -- this was presented by another member of the RAPIDS team, and according to the previous presenter, focuses more on the Deep Learning aspect of RAPIDS.

And then of course there was my presentation. As I mentioned earlier, I spoke about NERDS, or more specifically my fork of NERDS, where I have made some improvements to the basic software. The improvements started as bug fixes, but by now there are quite a few significant changes, and I plan on making a few more. The slides for my talk are here. I cover why you might want to do Named Entity Recognition (NER), briefly describe various NER model types such as gazetteers, Conditional Random Fields (CRF), and various neural variations around the basic Bidirectional LSTM + CRF architecture, cover the NER models available in NERDS, and finally describe how I used them to detect entities in a biological entity dataset from BioNLP 2004.
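
To give a flavor of the unified interface, here is a sketch; the class and function names below are from my fork as I remember them, so check the repo before relying on them, and the file names are placeholders.

```python
from nerds.models import CrfNER
from nerds.utils import load_data_and_labels

# Data is sentences of tokens with IOB tags, e.g. B-protein, I-protein, O
# for the BioNLP 2004 entities (file names here are placeholders).
xtrain, ytrain = load_data_and_labels("train.iob")
xtest, ytest = load_data_and_labels("test.iob")

# Every NERDS model exposes the same scikit-learn style fit/predict
# interface, which is what makes comparing NER approaches cheap.
model = CrfNER()
model.fit(xtrain, ytrain)
ypred = model.predict(xtest)
```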

The reasons I chose to talk about NERDS were twofold. First, I had begun to get interested in NER in general in my own work, and "found" NERDS (although since it was an OSS project from my own company, not much discovery was involved :-)). I liked that NERDS does not provide "new" ML models, but rather a unified way to run many out-of-the-box NER models against your data with minimum effort. In some ways, it is a software engineering solution to a data science problem, and I thought the two disciplines coming together to solve a problem was an interesting thing in itself to talk about. Second, I feel that building custom NER models is generally considered something of a black art, and something like NERDS has the potential to democratize the process.

Overall, based on the feedback I got on LinkedIn and in person, I thought the presentation was quite well received. There was some critical feedback that I should have focused more on the intuition behind the various NER modeling techniques. While I agree that this might have been desirable, I had limited time to deliver the talk, and I would not have been able to cover as much if I had spent more time on basics. Also, since the audience level was marked as Intermediate, I risked boring at least part of the audience if I did so. But I will keep this in mind for the future.

Finally, I would be remiss if I didn't mention all the wonderful people I met at this conference. I will not call you out by name, but you know who you are. Some people think of conferences as places where a small group of people get to showcase their work in front of a larger group, but a conference is also a place where you get to meet people in your discipline working in similar or different domains, and I find it immensely helpful and interesting to share ideas and approaches for solving different problems.

And that's all I have for today. I hope you enjoyed reading my trip report.

