Saturday, December 28, 2019

Incorporating the Flair NER into NERDS


Earlier this month I was at PyData LA, where I talked about NERDS, a toolkit for Named Entity Recognition (NER) open sourced by some of my colleagues at Elsevier. You can find the slides for my talk here; unfortunately, the video doesn't seem to have been released yet. I covered some of this in my trip report already, but for those of you who may not know about NERDS, it is a toolkit that provides easy-to-use NER capabilities for data scientists. Specifically, it wraps a few (4 in the master branch, 6 in my fork -- but more on that later) third-party NER models, and provides a common API for training and evaluating them. Each model also exposes tunable hyperparameters and structural parameters, so as a NERDS user, you can prepare your data once and train many different NER models quickly and efficiently.

One of the things I had promised to talk about in my abstract was how to add new NER models to NERDS, which I ended up not doing due to shortage of time. This was doubly unfortunate, because one of my aims in giving this talk was to popularize the toolkit and to encourage open source contributions that would give future users of NERDS more choices. In any case, I recently added an NER based on the Flair project from Zalando Research into NERDS, and figured that this might be a good opportunity to describe the steps, for the benefit of those who might be interested in extending NERDS with their own favorite third-party NER model. So that's what this blog post is about.

One thing to remember, though, is that, at least for now, these instructions are valid only on my fork of NERDS. In order to support the common API, NERDS exposes a common data format across all its models, and behind the scenes converts between this format and the internal format of each model. Quite frankly, I think this is a genius idea -- an awesome application of Software Engineering principles to solve a Data Science problem. However, the common data format was somewhat baroque and a source of errors (the BiLSTM-CRF model from the Anago project on the master branch crashes intermittently because of some insidious bug which I wasn't able to crack), so I switched to a simpler data format and the bug disappeared (see the README.md for details). So we basically keep the genius idea but simplify the implementation.

Another major change is to inject parameters at construction time rather than separately during calls to fit() and predict() -- this is in line with how scikit-learn does it, which is also where we want to go for interoperability reasons. In any case, here is the full list of changes in the branch so far.
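To make this concrete, here is a minimal sketch of what that looks like from the caller's side, assuming X_train, y_train, and X_test are already in the NERDS format of lists of lists of tokens and tags (the constructor parameter names here are illustrative, not the exact signature):

from nerds.models import FlairNER

# hyperparameters go to the constructor (parameter names illustrative),
# not to fit() or predict(), mirroring the scikit-learn convention
model = FlairNER("flair-model-dir", hidden_dim=256, max_iter=10)
model.fit(X_train, y_train)     # no tuning parameters passed here
y_pred = model.predict(X_test)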

At a high level, here is the list of things you need to do to integrate your favorite NER into NERDS. I describe each step in greater detail below.

  1. Add library dependency in setup.py
  2. Figure out the third party NER API
  3. Update the __init__.py file
  4. Create the NERDS NER Model
  5. Write and run the tests
  6. Update the examples

Add library dependency in setup.py


The Flair package is installable via "pip install", so if you add it to the NERDS setup.py file as shown below, it will be added to your (development) environment the next time you run "make install". The development environment simply means that the Python runtime will point to your development directory instead of somewhere central in site-packages. That way, changes you make to the code will be reflected in the package without you having to push your changes (via an additional "make install") each time.
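The change is roughly as follows -- a sketch only, with the other setup() arguments elided:

from setuptools import setup

setup(
    name="nerds",
    # ... other arguments elided ...
    install_requires=[
        # ... existing dependencies ...
        "flair",
    ],
)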

Figure out the third party NER API


If you are looking to add a NER model whose API you are already familiar with, this step may not be needed. For me, though, the Flair NER was new, so I wanted to get familiar with its API before I tried to integrate it into NERDS. I found this Flair tutorial on Tagging your Text particularly useful.

From this tutorial, I was able to figure out that Flair provides a way to train and evaluate its SequenceTagger (which is what we will use for our NERDS Flair NER) in one go, using a Corpus object, which is a collection of training, validation, and test datasets. Each of these datasets is a collection of Flair Sentence objects, each of which represents an individual sentence. Each Sentence object contains a collection of Token objects, and each Token object contains a collection of Tag objects.
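Just to make that object model concrete, here is a tiny sketch of the hierarchy (the tokens and tags are made up; the full conversion helpers appear in the script further below):

from flair.data import Corpus, Sentence, Token

sentence = Sentence()
for word, tag in [("Elsevier", "B-ORG"), ("is", "O"), ("a", "O"), ("publisher", "O")]:
    token = Token(word)
    token.add_tag("ner", tag)    # attach the NER tag to this token
    sentence.add_token(token)

# a Corpus is just (train, dev, test) lists of Sentence objects
corpus = Corpus([sentence], [sentence], [], name="toy-corpus")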

On the NERDS side, all models extend the abstract class NERModel, which inherits from the BaseEstimator and ClassifierMixin classes from scikit-learn, and expose the following four methods -- fit, predict, save, and load -- as shown below. Here the fit(X, y) method is used for training the model, using dataset X and label set y. Conversely, the predict(X) method is meant for predicting labels for dataset X using a trained model. Clearly, therefore, the single Corpus approach will not work for us. Luckily, however, it is possible to pass an empty Sentence list for the test dataset when creating a Corpus for training, and prediction can be done directly against the test Sentence list.

from sklearn.base import BaseEstimator, ClassifierMixin

class NERModel(BaseEstimator, ClassifierMixin):
    def fit(self, X, y): pass
    def predict(self, X): pass
    def save(self, dirpath): pass
    def load(self, dirpath): pass

A typical train-save-load-predict pipeline consists of training the model with a labeled dataset, saving the trained model to disk, retrieving the saved model, and running predictions against the test set. My focus was mainly to figure out how to separate the training and prediction code blocks into their own independent chunks, so I could reuse them in fit() and predict(). Also, load() and save() can be somewhat idiosyncratic, with different models using different serialization mechanisms and writing out different artifacts, so it's good to watch those too. Another thing to note is the pair of functions sentences_to_data_labels() and data_labels_to_sentences(), which convert between the NERDS common data format (data = lists of lists of tokens, labels = lists of lists of tags) and the Sentence and Corpus based Flair data format. It's not required, of course, but I find it useful to encapsulate the conversions in their own routines; that way they can be easily ported, not only into the final NER model, but potentially reused in case I need to incorporate another NER with a similar native API.

Here is my NER train-save-load-predict pipeline that uses the Flair NER directly. The idea is to run this for a couple of epochs just to make sure it works, and then you are ready for the next step.

import flair
import os

from flair.data import Corpus, Sentence, Token
from flair.embeddings import CharacterEmbeddings, TokenEmbeddings, WordEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

from nerds.utils import load_data_and_labels

from sklearn.model_selection import train_test_split

DATA_DIR = "examples/BioNLP/data"


def data_labels_to_sentences(data, labels=None):
    sentences = []
    is_dummy_labels = False
    if labels is None:
        labels = data
        is_dummy_labels = True
    for tokens, tags in zip(data, labels):
        sentence = Sentence()
        for token, tag in zip(tokens, tags):
            t = Token(token)
            if not is_dummy_labels:
                t.add_tag("ner", tag)
            sentence.add_token(t)
        sentences.append(sentence)
    return sentences


def sentences_to_data_labels(sentences):
    data, labels = [], []
    for sentence in sentences:
        tokens = [t.text for t in sentence.tokens]
        tags = [t.tags["ner"].value for t in sentence.tokens]
        data.append(tokens)
        labels.append(tags)
    return data, labels


# training (fit)
train_filename = os.path.join(DATA_DIR, "train", "Genia4ERtask1.iob2")
train_data, train_labels = load_data_and_labels(train_filename)
trn_data, val_data, trn_labels, val_labels = train_test_split(
    train_data, train_labels, test_size=0.1)
trn_sentences = data_labels_to_sentences(trn_data, trn_labels)
val_sentences = data_labels_to_sentences(val_data, val_labels)
train_corpus = Corpus(trn_sentences, val_sentences, [], name="train-corpus")
print(train_corpus)

basedir = "flair-ner-test"
savedir = "flair-saved"
tag_dict = train_corpus.make_tag_dictionary(tag_type="ner")
embedding_types = [
    WordEmbeddings("glove"),
    CharacterEmbeddings()    
]
embeddings = StackedEmbeddings(embeddings=embedding_types)
tagger = SequenceTagger(hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dict,
    tag_type="ner",
    use_crf=True)
trainer = ModelTrainer(tagger, train_corpus)
trainer.train(basedir,
    learning_rate=0.1,
    mini_batch_size=32,
    max_epochs=2)

# model is saved by default, but let's do it again
os.makedirs(savedir, exist_ok=True)
tagger.save(os.path.join(savedir, "final-model.pt"))

# load back the model we trained
model_r = SequenceTagger.load(os.path.join(savedir, "final-model.pt"))

# prediction (predict)
test_filename = os.path.join(DATA_DIR, "test", "Genia4EReval1.iob2")
test_data, test_labels = load_data_and_labels(test_filename)
test_sentences = data_labels_to_sentences(test_data)

pred_sentences = model_r.predict(test_sentences, 
    mini_batch_size=32, 
    all_tag_prob=True)
i = 0
_, predictions = sentences_to_data_labels(pred_sentences)
for prediction in predictions:
    print(prediction)
    i += 1
    if i > 10:
        break

The resulting model is shown below. It looks similar to the word+character hybrid model proposed by Guillaume Genthial in his Sequence Tagging with Tensorflow blog post, where word embeddings (seeded with GloVe vectors) and embeddings generated from characters are concatenated and fed into an LSTM, and then the output of the LSTM is fed into a linear layer with CRF loss to produce the predictions.

SequenceTagger(
  (embeddings): StackedEmbeddings(
    (list_embedding_0): WordEmbeddings('glove')
    (list_embedding_1): CharacterEmbeddings(
      (char_embedding): Embedding(275, 25)
      (char_rnn): LSTM(25, 25, bidirectional=True)
    )
  )
  (word_dropout): WordDropout(p=0.05)
  (locked_dropout): LockedDropout(p=0.5)
  (embedding2nn): Linear(in_features=150, out_features=150, bias=True)
  (rnn): LSTM(150, 256, batch_first=True, bidirectional=True)
  (linear): Linear(in_features=512, out_features=20, bias=True)
)

Update the __init__.py file


Python's package paths are very file-oriented. For example, functions in the nerds.utils package are defined in the nerds/utils.py file. However, since NER models are typically large blocks of code, my preference (as well as that of the original authors) is to have each model in its own file. This can lead to very deep package structures; instead, we effectively flatten the package paths by importing the model classes into the nerds.models package in nerds/models/__init__.py. You can then refer to the FlairNER class defined in nerds/models/flair.py as nerds.models.FlairNER.
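Concretely, that is a one-line addition to nerds/models/__init__.py, along these lines (the existing imports for the other models are elided):

# nerds/models/__init__.py
# ... imports for the other NER models elided ...
from nerds.models.flair import FlairNER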

Create the NERDS NER model


At this point, it is fairly easy to build the FlairNER class with code chunks from the throwaway train-save-load-predict script. There are a few things to keep in mind, which have to do partly with coding style and partly with a desire for interoperability with scikit-learn and its huge library of support functions. I try to follow the guidelines in Developing scikit-learn estimators. One important deviation from the guidelines is that we don't allow **kwargs for fit() and predict(), since it's easier to track the parameters if they are all passed in via the constructor. Another important thing to note is that NERDS models are not true Estimators, since fit and predict work with lists of lists of primitive objects rather than just lists, so the check_estimator function fails on these models -- although I think this may just be because the creators of check_estimator did not anticipate this usage.

We don't have publicly available API Docs for NERDS yet, but in anticipation of that, we are using the NumPy DocString format as our guide, as advised by the Scikit-Learn coding guidelines.

Finally, in the save() function, we dump out the parameters fed into the constructor into a YAML file. This is mainly for documentation purposes, to save the user the trouble of figuring out after the fact which model was created with which hyperparameters. The class structure doesn't enforce this requirement, i.e., the NER will happily work even without this feature, but it's a single-line call to utils.write_param_file(), so it's not a lot of work for something very useful -- you just have to remember to add it in.

Here is the code for the FlairNER class. As you can see, a lot of code has been copy-pasted from the throwaway train-save-load-predict code that we built earlier. There is also some validation code, for example, to prevent predict() being run without a trained model, or to complain if the code is asked to load the model from a non-existent location, etc. Also, the private functions _convert_to_flair() and _convert_from_flair() are basically clones of the data_labels_to_sentences() and sentences_to_data_labels() functions from the earlier script.
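For orientation, here is a rough skeleton of the class -- the module paths in the imports and the constructor parameter names are my assumptions for illustration, so see the linked code for the real signature:

from nerds.models.base import NERModel       # assumed location of the base class
from nerds.utils import write_param_file     # assumed location of the helper

class FlairNER(NERModel):

    def __init__(self, basedir, hidden_dim=256, batch_size=32, max_iter=10):
        super().__init__()
        self.basedir = basedir
        self.hidden_dim = hidden_dim
        self.batch_size = batch_size
        self.max_iter = max_iter
        self.model_ = None

    def fit(self, X, y):
        # build a Corpus from X, y via _convert_to_flair(), train a
        # SequenceTagger with ModelTrainer, and keep the trained tagger
        return self

    def predict(self, X):
        # convert X to Sentences, run the tagger's predict(), then
        # return the tag lists via _convert_from_flair()
        pass

    def save(self, dirpath):
        # serialize the tagger, and dump the constructor parameters
        # to YAML via write_param_file()
        pass

    def load(self, dirpath):
        # restore the tagger with SequenceTagger.load()
        return self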

Write and run the tests


NERDS has a suite of unit tests in the nerds/tests directory. It uses the nose package for running the tests. For the NER models, we have a tiny dataset of 2 sentences, with which we train and predict. The dataset is obviously insufficient to train an NER model, so basically all we are looking for is that the code runs end-to-end without complaining about size issues, etc. Here is the code for the FlairNER tests.
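A minimal sketch of what such a test might look like (the sentences, tags, and constructor arguments here are illustrative, not the actual test data):

# nerds/tests/test_flair_ner.py (sketch)
from nose.tools import assert_equal
from nerds.models import FlairNER

def test_fit_predict():
    X = [["Pierre", "Vinken", "will", "join", "the", "board"],
         ["He", "is", "61", "years", "old"]]
    y = [["B-PER", "I-PER", "O", "O", "O", "O"],
         ["O", "O", "O", "O", "O"]]
    model = FlairNER("/tmp/flair-ner-test", max_iter=1)
    model.fit(X, y)
    y_pred = model.predict(X)
    # with only 2 sentences we check shape, not prediction quality
    assert_equal(len(y_pred), len(y))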

You can run the test individually by "nosetests nerds/tests/test_flair_ner.py" or run all tests using "make test". I like to start with running individual tests to make sure my changes are good, and then follow it up with a final "make test" to make sure my changes haven't broken something elsewhere in the system.

Update the examples


Finally, it is time to add your NER to the example code in nerds/examples. This is mainly for NERDS users, to provide them with examples of how to call NERDS, but it can also be interesting for you, to see how your new NER stacks up against the ones that are already there. There are two examples, one based on the Groningen Meaning Bank (GMB) dataset of general entities such as PERson, LOCation, etc., and another based on the BioNLP dataset for Bio-Entity recognition. As mentioned earlier, NERDS allows you to prepare your data once and reuse it across multiple models, so the code to include the FlairNER is this block here and here respectively. As can be seen from the classification reports in the respective README.md files (here and here), performance of the FlairNER is on par with the BiLSTM-CRF in the case of GMB, but closer to the CRF in the case of BioNLP.
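In outline, each example follows the same pattern -- prepare the data once, then train and evaluate each model in turn. A sketch for the BioNLP case, reusing the same data loading as the throwaway script (the FlairNER constructor arguments are illustrative):

from sklearn.model_selection import train_test_split
from nerds.models import FlairNER
from nerds.utils import load_data_and_labels

X, y = load_data_and_labels("examples/BioNLP/data/train/Genia4ERtask1.iob2")
X_trn, X_val, y_trn, y_val = train_test_split(X, y, test_size=0.1)

model = FlairNER("flair-model-dir", max_iter=10)
model.fit(X_trn, y_trn)
y_pred = model.predict(X_val)
# compare y_pred against y_val with your favorite classification report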

That's basically all it takes, code-wise, to add a new NER to NERDS. The next step is of course to do a Pull Request (PR), which I would request you to hold off on at the moment, since I am working off a fork myself, and my git-fu is not powerful enough to figure out how to handle PRs against a fork. I would prefer that my fork gets pulled into master first, and then we can handle any additional PRs. In the meantime, please queue them up on the NERDS Issues page, so they can be incorporated as they come in.

Saturday, December 21, 2019

Finding Similar Tweets with BERT and NMSLib


Since my initial explorations with vector search for images on Lucene some time back, several good libraries and products have appeared that do a better job of computing vector similarity than my home grown solutions. One of them is the Non-Metric Space Library (NMSLib), a C++ library that implements Approximate Nearest Neighbor (ANN) techniques for efficient similarity search in non-metric spaces.

The problem with doing any kind of ranking using vector search is that you must match the query vector against every document vector in the index, compute a similarity score, then order by that similarity score. This is an O(N) operation, so the query time will increase linearly with the number of records N in the index. ANN libraries try to do better, and NMSLib uses the idea of Hierarchical Navigable Small World (HNSW) Graphs, which effectively partition the data into a hierarchical set of small world graphs based on proximity. That way, searching this structure for the nearest neighbors of a given record involves searching the proximity graph of that record, which is largely independent of the total number of records.
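To make the contrast concrete, here is a minimal NumPy sketch of the brute-force O(N) baseline that an ANN index is trying to avoid -- every query is compared against every document vector:

import numpy as np

def brute_force_knn(doc_vectors, query_vector, k=10):
    # exact k nearest neighbors by cosine similarity -- O(N) per query
    doc_norms = doc_vectors / np.linalg.norm(doc_vectors, axis=1, keepdims=True)
    query_norm = query_vector / np.linalg.norm(query_vector)
    sims = doc_norms @ query_norm        # one dot product per document
    top_k = np.argsort(-sims)[:k]        # sort all N scores
    return top_k, sims[top_k]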

Over the last year or so, search has increasingly begun to move from traditional information retrieval techniques based on TF-IDF, to techniques based on neural models. To some extent, this has coincided with the release of the Bidirectional Encoder Representations from Transformers (BERT) model by Google, and its subsequent application in the NLP and Search communities for many different tasks, including similarity based document ranking. Prof. Jimmy Lin of the University of Waterloo has published an Opinion piece The Neural Hype, Justified! A Recantation at SIGIR 2019, where he explains why this is so in more detail.

In any case, this trend has led to a need for computing vector based similarity in an efficient manner, so I decided to do a little experiment with NMSLib, to get familiar with the API and with NMSLib generally, as well as check out how good BERT embeddings are for search. This post describes that experiment.

For my dataset, I selected the Health News in Twitter dataset from the UCI Machine Learning Repository. This is a dataset containing approximately 60k tweets related to 16 different Health/News organizations, and was donated to UCI as part of the paper Fuzzy Approach to Topic Discovery in Health and Medical Corpora.

I first extracted the tweet ID and text from these files and uploaded them into a SQLite3 database, so I could look the text up by ID later. I did a little cleanup of the tweet texts, to remove (shortened) URLs from them. Then I set up a BERT-as-a-service sentence encoding service on a GPU box, using BERT-base uncased as the underlying model, and generated the vectors for each of the sentences. Code for both of these steps is fairly trivial and can be easily figured out from the documentation, so I will not bother to describe it here.
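For reference, the encoding side boils down to a few lines with the bert-serving-client package once the server is running -- a rough sketch, with the server startup flags omitted and the hostname made up:

from bert_serving.client import BertClient

# assumes bert-serving-start is already running on the GPU box,
# pointed at the BERT-base uncased model
bc = BertClient(ip="gpu-box-hostname")
vectors = bc.encode(["Ebola in Guinea puts miners in lock down",
                     "1 in 3 prescriptions unfilled, Canadian study finds"])
print(vectors.shape)    # (2, 768) for BERT-base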

I then wanted to see how NMSLib scaled with changes in dataset size. I had approximately 60k tweets and their vectors, so I loaded random subsets of the dataset into the NMSLib index, then ran a batch of 50 queries (also randomly sampled from the dataset) to generate their K Nearest Neighbors for K=10. I recorded the total loading time in each case, as well as the query time averaged over the 50 queries. A sketch of the indexing and querying code follows, and then the plots for both measurements.
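The index building and batch querying is essentially the pattern from the NMSLib Quickstart, roughly as follows (variable names are mine; vectors is the NumPy array of BERT embeddings for the sampled tweets, and the index parameters are typical HNSW settings rather than the exact ones I used):

import nmslib

# build an HNSW index over the BERT vectors, using cosine similarity
index = nmslib.init(method="hnsw", space="cosinesimil")
index.addDataPointBatch(vectors)
index.createIndex({"M": 16, "efConstruction": 100}, print_progress=True)

# batch of 50 query vectors, 10 nearest neighbors each
results = index.knnQueryBatch(query_vectors, k=10, num_threads=4)
for neighbor_ids, distances in results:
    print(neighbor_ids, distances)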

[Plots: NMSLib index load time vs. number of records, and average query time (over 50 queries, K=10) vs. number of records]
What is interesting is that load time rises approximately linearly with the number of records, but the query time is sublinear (and on a completely different timescale from indexing) and eventually seems to flatten out. So we can expect to pay a penalty for a large number of records once, during indexing, but queries seem to be quite fast.

I also looked at some of the results of the similarity search, and they seem to be pretty good. It is hard to think in terms of traditional similarity in the case of tweets, since there is so little content to go by, but I include below the 10 Nearest Neighbors (or equivalently, the top 10 search results) for 3 of my query tweets, and they all intuitively seem to be good, although perhaps for different reasons. In the first case, I think it has done a good job of focusing on the main content words "Ebola" and "Guinea" and fetching results around them. In the second case, it seems to have captured the click-bait-ey spirit of the query tweet, and returned other tweets that are along the same lines. In the third case, once again, it returns results that are similar in intent to the query document, but use completely different words.

QUERY: Ebola in Guinea puts miners in lock down (451467465793355776)
--------------------------------------------------------------------------------------------------
dist.  tweet_id            tweet_text
--------------------------------------------------------------------------------------------------
0.104  542396431495987201  Junior doctors in Sierra Leone strike over lack of Ebola care
0.110  582950953625735168  Sierra Leone Ebola lockdown exposes hundreds of suspected cases
0.112  542714193137635328  Junior doctors in Sierra Leone strike over lack of #Ebola care
0.112  517254694029524992  West Africa Ebola crisis hits tourism, compounds hunger in Gambia
0.117  497018332881498112  Ebola kills nurse in Nigeria
0.119  565914210807599104  Ebola-hit Guinea asks for funds for creaking health sector
0.120  514098379408699393  Ebola virus shutdown in Sierra Leone yields 'massive awareness'
0.120  555734676833198081  Ebola kills Red Cross nurse in Sierra Leone
0.121  583431484905754624  Sierra Leone to start laying off Ebola workers as cases fall: president
0.122  499300095402065921  Ebola Shuts Down The Oldest Hospital In Liberia

QUERY: 1 in 3 prescriptions unfilled, Canadian study finds (450767264292167681)
--------------------------------------------------------------------------------------------------
dist.  tweet_id            tweet_text
--------------------------------------------------------------------------------------------------
0.105  564909963739287552  Study finds 1 in 12 children's ER visits medication related
0.108  321311688257306624  1 in 4 skin cancer survivors skips sunscreen, study finds
0.109  161460752803311617  Only 1 in 4 Young Teens Uses Sunscreen Regularly, Study Finds:
0.110  458662217651879936  Child abuse affects 1 in 3 Canadian adults, mental health study indicates
0.112  344601130124316672  1 in 6 women at fracture clinics have been abused, study shows
0.126  160184310849224704  Abortion ends one in five pregnancies worldwide, study finds
0.127  332579818543673346  1 in 8 Boomers report memory loss, large survey finds
0.127  148844725191979009  Nearly 1 in 3 Young U.S. Adults Have Arrest Records: Study:
0.129  468857557126512640  HPV Found in Two-Thirds of Americans, Survey Finds
0.129  119455268106018816  1 in 4 U.S. Adults Treated for High Blood Pressure: Report:

QUERY: Skip the elliptical machine and walk your way to better health: (296539570206568448)
--------------------------------------------------------------------------------------------------
dist.  tweet_id            tweet_text
--------------------------------------------------------------------------------------------------
0.000  295724160922038272  Skip the elliptical machine and walk your way to better health:
0.000  294033035731546112  Skip the elliptical machine and walk your way to better health:
0.126  399914443762855936  Sweat Your Way To A Healthier Brain
0.144  304496972419702784  How to exercise your way to better sex:
0.144  293262936804311041  How to exercise your way to better sex:
0.149  557233923621527552  Need a healthy push? Turn to your partner to lose weight, quit smoking
0.152  564595829152182273  You don't need a gym to torch calories! Try this 30-minute workout 3 times a week to drop winter weight:
0.152  293006265599262722  Keep hands off the treadmill bars while you walk; you're limiting your calorie burn! Boost your treadmill workout:
0.154  541676196686491649  Kickoff your weight loss plan! Learn 25 ways to cut 500 calories a day:
0.154  551943764198301696  Learn the expert tricks to FINALLY achieving your goal weight:


So anyway, that concludes my experiment with NMSLib. Overall, I have enough now to build something real using these components. As before, the code is pretty trivial; it is modeled after the code under the Example Usage section in the NMSLib Quickstart (much like the sketch above), so I am not going to repeat it here.

Of course, NMSLib is by no means the only library in this area. Others in this space that I know of are FAISS from Facebook, another similarity search library that is optimized to run on GPUs, and Vespa, a full-fledged search engine that allows for both traditional and vector search. In addition, there are plugins for vector scoring for both Elasticsearch (elasticsearch-vector-scoring) and Solr (solr-vector-scoring). So these might be options for vector search as well.

For my part, I am happy with my choice. I needed something which was easy to learn and embed, so the preference was for libraries rather than products. I also wanted it to run on a CPU based machine for cost reasons. FAISS does run on CPU as well, but based on results reported by Ben Fredrickson here, NMSLib has better performance on CPU. Also NMSLib installation is just a simple "pip install" (with the pre-compiled binary). In any case, I haven't ruled out using FAISS completely. In this system, we extract BERT embeddings for the tweets in an offline batch operation. In a real system, it is likely that such embeddings will have to be generated on demand, so such setups would be forced to include a GPU box, which could be used to also serve a FAISS index (based on Ben Fredrickson's evaluation, FAISS on GPU is approximately 7x faster than NMSLib is on CPU).




Monday, December 09, 2019

PyData LA 2019: Trip Report


PyData LA 2019 was last week, and I had the opportunity to attend. I also presented about NERDS (Named Entity Recognition for Data Scientists), an open source toolkit built by some colleagues from our Amsterdam office. This is my trip report.

The conference was three days long, Tuesday to Thursday, and was spread across 3 parallel tracks. So my report is necessarily incomplete, and limited to the talks and tutorials I attended. The first day was tutorials, and the next two days were talks. In quite a few situations, it was tough to choose between simultaneous talks. Fortunately, however, the talks were videotaped, and the organizers have promised to put them up in the next couple of weeks, so I am looking forward to catching up on the presentations I missed. The full schedule is here. I am guessing attendees will be notified by email when the videos are released, and I will also update this post when that happens.

Day 1: Tutorials


For the tutorials, I stayed in the first track. Of these, I came in a bit late for Computer Vision with PyTorch, since I miscalculated the volume (and delaying power) of LA traffic. It was fairly comprehensive, although I was familiar with at least some of the material already, so in retrospect, I should probably have attended one of the other tutorials.

The second tutorial was about Kedro and MLFlow and how to combine the two to build reproducible and versioned data pipelines. I didn't know that MLFlow can be used standalone outside Spark, so definitely something to follow up there. Kedro looks like scaffolding software which allows users to hook into specific callback points in its lifecycle.

The third tutorial was a presentation on teaching a computer to play PacMan using Reinforcement Learning (RL). RL apps definitely have a wow factor, and I suppose it can be useful where the environment is deterministic enough (rules of a game, laws of physics, etc.), but I often wonder if we can use it to train agents that can operate in a more uncertain "business applications"-like environment. I am not an expert on RL though, so if you have ideas on how to use RL in these areas, I would appreciate learning about them.

The fourth and last tutorial of the day was Predicting Transcription Factor (TF) genes from genetic networks using Neural Networks. The data extraction process was pretty cool; it was predicated on the fact that TF genes typically occupy central positions in genetic networks, so graph based algorithms such as connectedness and Louvain modularity can be used to detect them in these networks. These form the positive samples, and standard negative sampling is done to extract negative samples. The positive records (TFs) are oversampled using SMOTE. Features for these genes come from an external dataset of 120 or so experiments, where each gene was subjected to these experiments and the results recorded. I thought that the coolest part was using the graph techniques for building up the dataset.

Days 2 and 3: Talks


All the talks provided me with some new information in one form or the other. In some cases, it was a tough choice to make, since multiple simultaneous talks seemed equally interesting to me going in. Below I list the ones I attended and liked, in chronological order of appearance in the schedule.

  • Gradient Boosting for data with both numerical and text features -- the talk was about the CatBoost library from Yandex, and the talk focused on how much better CatBoost is in terms of performance (especially on GPU) compared to other open source Gradient Boosting libraries (LightGBM and one other that I can't recall at the moment). CatBoost definitely looks attractive, and at some point I hope to give it a try.
  • Topological Techniques for Unsupervised Learning -- covered how Uniform Manifold Approximation and Projection (UMAP), a topological technique for dimensionality reduction, can be used to generate very powerful embeddings and to do clustering that is competitive with t-SNE. UMAP is more fully described in this paper on arXiv (the presenter was one of the co-authors of the paper). There was one other presentation on UMAP by another of the co-authors which I was unable to attend.
  • Guide to Modern Hyperparameter Tuning Algorithms -- presented the open source Tune Hyperparameter Tuning Library from the Ray team. As with the previous presentation, there is a paper on arXiv that describes this library in more detail. The library provides functionality to do grid, random, bayesian, and genetic search over the hyperparameter space. It seems to be quite powerful and easy to use, and I hope to try it out soon.
  • Dynamic Programming for Hidden Markov Models (HMM) -- one of the clearest descriptions of the implementation of the Viterbi (given the parameters for the model and the observed states, find the most likely sequence of hidden states) algorithm that I have ever seen. The objective is for the audience to understand HMM (specifically Viterbi algorithm) well enough so they can apply it to new domains where it might be applicable.
  • RAPIDS: Open Source GPU Data Science -- I first learned about NVidia's RAPIDS library at KDD 2019 earlier this year. RAPIDS provides GPU optimized drop-in replacements for NumPy, Pandas, Scikit-Learn, and NetworkX (cuPy, cuDF, cuML, and cuGraph), which run an order of magnitude faster if you have a GPU. Unfortunately, I don't have a GPU on my laptop, but the presenter said that images with RAPIDS pre-installed are available on Google Cloud (GCP), Azure, and AWS.
  • Datasets and ML Model Versioning using Open Source Tools -- this is a presentation on the Data Version Control (DVC) toolkit, which gives you a set of git like commands to version control your metadata, and link them to a physical file on some storage area like S3. We had envisioned using it internally for putting our datasets and ML models under version control some time back, so I was familiar with some of the information provided. But I thought the bit about creating versioned ML pipelines (data + model(s)) was quite interesting.

And here are the talks I would like to watch once the videos are uploaded.

  • Serverless Demo on AWS, GCP, and Azure -- this was covered in the lightning round on the second day. I think this is worth learning, since it seems to be an easy way to set up demos that work on demand. Also learned about AWS Batch, a "serverless" way to serve batch jobs (or at least non-singleton requests).
  • Supremely Light Introduction to Quantum Computing -- because Quantum Computing which I know nothing about.
  • Introducing AutoImpute: Python package for grappling with missing data -- No explanation needed, clearly, since real life data often comes with holes, and having something like this gives us access to a bunch of "standard" strategies fairly painlessly.
  • Tackling Homelessness with Open Data -- I would have attended this if I had not been presenting myself. Using Open Data for social good strikes me as something we, as software people, can do to improve our world and make it a better place, so always interested in seeing (and cheering on) others who do it.
  • What you got is What you got -- speaker is James Powell, a regular speaker I have heard at previous PyData conferences, who always manages to convey deep Python concepts in a most entertaining way.
  • GPU Python Libraries -- this was presented by another member of the RAPIDS team, and according to the previous presenter, focuses more on the Deep Learning aspect of RAPIDS.

And then of course there was my presentation. As I mentioned earlier, I spoke of NERDS, or more specifically my fork of NERDS where I made some improvements on the basic software. The improvements started as bug fixes, but currently there are quite a few significant changes, and I plan on making a few more. The slides for my talk are here. I cover why you might want to do Named Entity Recognition (NER), briefly describe various NER model types such as gazetteers, Conditional Random Fields (CRF), and various Neural model variations around the basic Bidirectional LSTM + CRF, cover the NER models available in NERDS, and finally describe how I used them to detect entities in a Biological Entity dataset from BioNLP 2004.

The reason I chose to talk about NERDS was twofold. First, I had begun to get interested in NERs in general in my own work, and "found" NERDS (although since it was an OSS project from my own company, not much discovery was involved :-)). I liked that NERDS does not provide "new" ML models, but rather a unified way to run many out of the box NER models against your data with minimum effort. In some ways, it is a software engineering solution that addresses a data science problem, and I thought the two disciplines coming together to solve a problem was an interesting thing in itself to talk about. Second, I feel that custom NER building is generally considered something of a black art, and something like NERDS has the potential to democratize the process.

Overall, based on some of the feedback I got on LinkedIn and in person, I thought the presentation was quite well received. There was some critical feedback saying that I should have focused more on the intuition behind the various NER modeling techniques than I did. While I agree that this might be desirable, I had limited time to deliver the talk, and I would not have been able to cover as much if I spent too much time on basics. Also, since the audience level was marked as Intermediate, I risked boring at least part of the audience if I did so. But I will keep this in mind for the future.

Finally, I would be remiss if I didn't mention all the wonderful people I met at this conference. I will not call you out by name, but you know who you are. Some people think of conferences as places where a small group of people get to showcase their work in front of a larger group, but a conference is also a place where you get to meet people in your discipline working in similar or different domains, and I find it immensely helpful and interesting to share ideas and approaches for solving different problems.

And that's all I have for today. I hope you enjoyed reading my trip report.