Saturday, December 28, 2019

Incorporating the Flair NER into NERDS

Earlier this month I was at PyData LA, where I talked about NERDS, a toolkit for Named Entity Recognition (NER) open sourced by some of my colleagues at Elsevier. You can find the slides for my talk here; the video doesn't seem to have been released yet, unfortunately. I covered some of this in my trip report already, but for those of you who may not know about NERDS, it is a toolkit that provides easy to use NER capabilities for data scientists. Specifically, it wraps a few (4 in the master branch, 6 in my fork -- but more on that later) third party NER models, and provides a common API for training and evaluating them. Each model also exposes tunable hyperparameters and structural parameters, so as a NERDS user, you can prepare your data once and train many different NER models quickly and efficiently.

One of the things I had promised to talk about in my abstract was how to add new NER models to NERDS, which I ended up not doing due to shortage of time. This was doubly unfortunate, because one of my aims in giving this talk was to popularize the toolkit and to encourage Open Source contributions, which would give future users of NERDS more choices. In any case, I recently added an NER from the Flair project from Zalando Research into NERDS, and figured that this might be a good opportunity to describe the steps, for the benefit of those who might be interested in extending NERDS with their own favorite third party NER model. So that's what this blog post is about.

One thing to remember, though, is that at least for now, these instructions are valid only on my fork of NERDS. In order to support the common API, NERDS exposes a common data format across all its models, and behind the scenes converts between this format and the internal format of each model. Quite frankly, I think this is a genius idea -- an awesome application of Software Engineering principles to solve a Data Science problem. However, the common data format was somewhat baroque and a source of errors (the BiLSTM-CRF model from the Anago project on the master branch crashes intermittently because of some insidious bug which I wasn't able to crack), so I switched to a simpler data format, and the bug disappeared. So we basically kept the genius idea, but simplified the implementation.

Another major change is to inject parameters at construction time rather than separately during calls to fit() and predict() -- this is in line with how scikit-learn does it, which is also where we want to go, for interoperability reasons. In any case, here is the full list of changes in the branch so far.

At a high level, here is the list of things you need to do to integrate your favorite NER into NERDS. I describe each step in greater detail below.

  1. Add library dependency in
  2. Figure out the third party NER API
  3. Update the file
  4. Create the NERDS NER Model
  5. Write and run the tests
  6. Update the examples

Add library dependency in

The Flair package is installable via "pip install", so if you add it to the NERDS file as shown, it will be added to your (development) environment the next time you run "make install". The development environment simply means that the Python runtime will point to your development directory instead of somewhere central in site-packages. That way, changes you make to the code will be reflected in the package without you having to push them (perhaps by an additional "make install") each time.

Figure out the third party NER API

If you are looking to add a NER model whose API you are already familiar with, this step may not be needed. For me, though, the Flair NER was new, so I wanted to get familiar with its API before I tried to integrate it into NERDS. I found this Flair tutorial on Tagging your Text particularly useful.

From this tutorial, I was able to figure out that Flair provides a way to train and evaluate its SequenceTagger (which is what we will use for our NERDS Flair NER) in one go, using a Corpus object, which is a collection of training, validation, and test datasets. Each of these datasets is a collection of Flair Sentence objects, each of which represents an individual sentence. Each Sentence object contains a collection of Token objects, and each Token object can contain a collection of Tag objects.

On the NERDS side, all models extend the abstract class NERModel, which inherits from the BaseEstimator and ClassifierMixin classes from scikit-learn, and expose the following four methods -- fit, predict, save, and load, as shown below. The fit(X, y) method is used for training the model, using dataset X and label set y; conversely, the predict(X) method is meant for predicting labels for dataset X using a trained model. Therefore, clearly the single Corpus approach will not work for us. Luckily, however, it is possible to pass an empty Sentence list for the test dataset when creating a Corpus for training, and prediction can be done directly against the test Sentence list.

class NERModel(BaseEstimator, ClassifierMixin):
    def fit(self, X, y): pass
    def predict(self, X): pass
    def save(self, dirpath): pass
    def load(self, dirpath): pass
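To make the four-method contract concrete, here is a toy gazetteer-style model that implements it. This is only a sketch -- the class name and its behavior are invented for illustration, and a real NERDS model would extend NERModel (and thus the scikit-learn mixins) rather than plain object:

```python
import json
import os

class DictionaryNER:
    """Toy model illustrating the NERDS fit/predict/save/load contract.

    It memorizes the most frequent tag seen for each token during fit(),
    and emits "O" for unknown tokens at predict() time.
    """
    def __init__(self):
        self.tag_counts_ = {}

    def fit(self, X, y):
        # X: lists of lists of tokens, y: lists of lists of tags
        for tokens, tags in zip(X, y):
            for token, tag in zip(tokens, tags):
                counts = self.tag_counts_.setdefault(token, {})
                counts[tag] = counts.get(tag, 0) + 1
        return self

    def predict(self, X):
        preds = []
        for tokens in X:
            tags = []
            for token in tokens:
                counts = self.tag_counts_.get(token)
                if counts is None:
                    tags.append("O")
                else:
                    tags.append(max(counts, key=counts.get))
            preds.append(tags)
        return preds

    def save(self, dirpath):
        os.makedirs(dirpath, exist_ok=True)
        with open(os.path.join(dirpath, "model.json"), "w") as f:
            json.dump(self.tag_counts_, f)

    def load(self, dirpath):
        with open(os.path.join(dirpath, "model.json")) as f:
            self.tag_counts_ = json.load(f)
        return self
```

The important point is not the (trivial) model, but the shape of the API: X and y are always lists of lists, and save()/load() work against a directory path.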

A typical train-save-load-predict pipeline consists of training the model with a labeled dataset, saving the trained model to disk, retrieving the saved model, and running predictions against the test set. My focus was mainly on figuring out how to separate the training and prediction code blocks into their own independent chunks, so I could reuse them in fit() and predict(). Also, load() and save() can be somewhat idiosyncratic, with different models using different serialization mechanisms and writing out different artifacts, so it's good to watch those too. Another thing to note is the pair of functions sentences_to_data_labels() and data_labels_to_sentences(), which convert between the NERDS common data format (data = lists of lists of tokens, labels = lists of lists of tags) and the Sentence and Corpus based Flair data format. It's not required, of course, but I find it useful to encapsulate the conversions in their own routines; that way they can be easily ported, not only into the final NER model, but potentially reused in case I need to incorporate another NER with a similar native API.

Here is my NER train-save-load-predict pipeline that uses the Flair NER directly. The idea is to run this for a couple of epochs just to make sure it works, and then you are ready for the next step.

import flair
import os

from flair.data import Corpus, Sentence, Token
from flair.embeddings import CharacterEmbeddings, TokenEmbeddings, WordEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

from nerds.utils import load_data_and_labels

from sklearn.model_selection import train_test_split

DATA_DIR = "examples/BioNLP/data"

def data_labels_to_sentences(data, labels=None):
    sentences = []
    is_dummy_labels = False
    if labels is None:
        labels = data
        is_dummy_labels = True
    for tokens, tags in zip(data, labels):
        sentence = Sentence()
        for token, tag in zip(tokens, tags):
            t = Token(token)
            if not is_dummy_labels:
                t.add_tag("ner", tag)
            sentence.add_token(t)
        sentences.append(sentence)
    return sentences

def sentences_to_data_labels(sentences):
    data, labels = [], []
    for sentence in sentences:
        tokens = [t.text for t in sentence.tokens]
        tags = [t.tags["ner"].value for t in sentence.tokens]
        data.append(tokens)
        labels.append(tags)
    return data, labels

# training (fit)
train_filename = os.path.join(DATA_DIR, "train", "Genia4ERtask1.iob2")
train_data, train_labels = load_data_and_labels(train_filename)
trn_data, val_data, trn_labels, val_labels = train_test_split(
    train_data, train_labels, test_size=0.1)
trn_sentences = data_labels_to_sentences(trn_data, trn_labels)
val_sentences = data_labels_to_sentences(val_data, val_labels)
train_corpus = Corpus(trn_sentences, val_sentences, [], name="train-corpus")

basedir = "flair-ner-test"
savedir = "flair-saved"
tag_dict = train_corpus.make_tag_dictionary(tag_type="ner")
embedding_types = [
    WordEmbeddings("glove"),
    CharacterEmbeddings()
]
embeddings = StackedEmbeddings(embeddings=embedding_types)
tagger = SequenceTagger(hidden_size=256,
                        embeddings=embeddings,
                        tag_dictionary=tag_dict,
                        tag_type="ner",
                        use_crf=True)
trainer = ModelTrainer(tagger, train_corpus)
# a couple of epochs is enough to verify the pipeline end-to-end
trainer.train(basedir, max_epochs=2)

# model is saved by default, but let's do it again
os.makedirs(savedir, exist_ok=True)
tagger.save(os.path.join(savedir, "final-model.pt"))

# load back the model we trained
model_r = SequenceTagger.load(os.path.join(savedir, "final-model.pt"))

# prediction (predict)
test_filename = os.path.join(DATA_DIR, "test", "Genia4EReval1.iob2")
test_data, test_labels = load_data_and_labels(test_filename)
test_sentences = data_labels_to_sentences(test_data)

pred_sentences = model_r.predict(test_sentences, mini_batch_size=32)
_, predictions = sentences_to_data_labels(pred_sentences)
for i, prediction in enumerate(predictions):
    print(prediction)
    if i > 10:
        break
The resulting model is shown below. It looks similar to the word+character hybrid model proposed by Guillaume Genthial in his Sequence Tagging with Tensorflow blog post, where word embeddings (seeded with GloVe vectors) and embeddings generated from characters are concatenated and fed into an LSTM, and then the output of the LSTM is fed into a linear layer with CRF loss to produce the predictions.

SequenceTagger(
  (embeddings): StackedEmbeddings(
    (list_embedding_0): WordEmbeddings('glove')
    (list_embedding_1): CharacterEmbeddings(
      (char_embedding): Embedding(275, 25)
      (char_rnn): LSTM(25, 25, bidirectional=True)
    )
  )
  (word_dropout): WordDropout(p=0.05)
  (locked_dropout): LockedDropout(p=0.5)
  (embedding2nn): Linear(in_features=150, out_features=150, bias=True)
  (rnn): LSTM(150, 256, batch_first=True, bidirectional=True)
  (linear): Linear(in_features=512, out_features=20, bias=True)
)
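The layer sizes in the printout above line up with the word+character hybrid architecture. A quick sanity check of the dimensions (assuming the 100-dimensional GloVe vectors that Flair's 'glove' word embeddings use):

```python
# dimension walkthrough for the model printout above
glove_dim = 100                     # WordEmbeddings('glove') are 100-d
char_hidden = 25                    # char LSTM hidden size, LSTM(25, 25, ...)
char_dim = 2 * char_hidden          # bidirectional -> 50-d char embedding
stacked_dim = glove_dim + char_dim  # 150, matches embedding2nn in/out features
word_hidden = 256                   # word-level LSTM hidden size
lstm_out = 2 * word_hidden          # bidirectional -> 512, matches linear in_features
num_tags = 20                       # linear out_features, one per tag
```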

Update the file

Python's package paths are very file-oriented. For example, functions in the nerds.utils package are defined in the nerds/ file. However, since NER models are typically large blocks of code, my preference (like that of the original authors) is to have each model in its own file. This can lead to very deep package structures; alternatively, we can effectively flatten the package paths by importing them into the nerds.models package in the nerds/models/ file. You can now refer to the FlairNER class defined in nerds/models/ as nerds.models.FlairNER.
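The flattening trick is just a re-export from the package initializer. A sketch of what the relevant lines might look like (the module filename flair_ner.py is an assumption here, chosen for illustration):

```python
# nerds/models/__init__.py (sketch; module filename assumed)
from nerds.models.flair_ner import FlairNER

__all__ = ["FlairNER"]
```

With this in place, user code imports nerds.models.FlairNER without knowing which file the class lives in.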

Create the NERDS NER model

At this point, it is fairly easy to build the FlairNER class with code chunks from the throwaway train-save-load-predict script. There are a few things to keep in mind, having to do partly with coding style and partly with a desire for interoperability with scikit-learn and its huge library of support functions. I try to follow the guidelines in Developing scikit-learn estimators. One important deviation from the guidelines is that we don't allow **kwargs for fit() and predict(), since it's easier to track the parameters if they are all passed in via the constructor. Another important thing to note is that NERDS models are not true Estimators, since fit and predict work with lists of lists of primitive objects, rather than just lists, so the check_estimator function fails on these models -- although I think this may just be because the creators of check_estimator didn't anticipate this usage.

We don't have publicly available API Docs for NERDS yet, but in anticipation of that, we are using the NumPy DocString format as our guide, as advised by the Scikit-Learn coding guidelines.

Finally, in the save() function, we dump out the parameters fed into the constructor in a YAML file. This is mainly for documentation purposes, to save the user time figuring out after the fact which model was created with which hyperparameters. The class structure doesn't enforce this requirement, i.e., the NER will happily work even without this feature, but it's a single-line call to utils.write_param_file(), so it's not a lot of work for something very useful -- you just have to remember to add it in.
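For a sense of what such a parameter dump involves, here is a hypothetical sketch -- the real utils.write_param_file() may differ, and the function body, filename, and format below are all assumptions:

```python
def write_param_file(params, filepath):
    # Hypothetical sketch: serialize constructor parameters as YAML-style
    # "name: value" lines, so a user can later see exactly which
    # hyperparameters a saved model was built with.
    with open(filepath, "w") as f:
        for name in sorted(params):
            f.write("{:s}: {:s}\n".format(name, str(params[name])))

# example: dump the hyperparameters a model was constructed with
write_param_file({"hidden_size": 256, "use_crf": True}, "params.yaml")
```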

Here is the code for the FlairNER class. As you can see, a lot of the code has been copy-pasted from the throwaway train-save-load-predict script that we built earlier. There is also some validation code, for example, to prevent predict() being run without a trained model, or to complain if the code is asked to load the model from a non-existent location. Also, the private functions _convert_to_flair() and _convert_from_flair() are basically clones of the data_labels_to_sentences() and sentences_to_data_labels() functions from the earlier script.

Write and run the tests

NERDS has a suite of unit tests in the nerds/tests directory. It uses the nose package for running the tests. For the NER models, we have a tiny dataset of 2 sentences, with which we train and predict. The dataset is far too small to actually train an NER model, so basically all we are looking for is that the code runs end-to-end without complaining about size issues, etc. Here is the code for the FlairNER tests.

You can run the test individually by "nosetests nerds/tests/" or run all tests using "make test". I like to start with running individual tests to make sure my changes are good, and then follow it up with a final "make test" to make sure my changes haven't broken something elsewhere in the system.

Update the examples

Finally, it is time to add your NER to the example code in nerds/examples. This is mainly for NERDS users, to provide them with examples of how to call NERDS, but it can also be interesting for you, to see how your new NER stacks up against the ones that are already there. There are two examples, one based on the Groningen Meaning Bank (GMB) dataset of general entities such as PERson, LOCation, etc., and the other based on the BioNLP dataset for Bio-Entity recognition. As mentioned earlier, NERDS allows you to prepare your data once and reuse it across multiple models, so the code to include the FlairNER is just this block here and here respectively. As can be seen from the classification reports on the respective (here and here), the performance of the FlairNER is on par with the BiLSTM-CRF in the case of GMB, but closer to the CRF in the case of BioNLP.

That's basically all it takes, code-wise, to add a new NER to NERDS. The next step is of course to do a Pull Request (PR), which I would request you to hold off on for the moment, since I am working off a fork myself, and my git-fu is not powerful enough to figure out how to handle PRs against a fork. I would prefer that my fork gets pulled into master first, and then we can handle any additional PRs. In the meantime, please queue them up on the NERDS Issues page, so they can be incorporated as they come in.

Saturday, December 21, 2019

Finding Similar Tweets with BERT and NMSLib

Since my initial explorations with vector search for images on Lucene some time back, several good libraries and products have appeared that do a better job of computing vector similarity than my home grown solutions. One of them is the Non-Metric Space Library (NMSLib), a C++ library that implements Approximate Nearest Neighbor (ANN) techniques for efficient similarity search in non-metric spaces.

The problem with doing any kind of ranking using vector search is that you must match the query vector against every document vector in the index, compute a similarity score, then order by that similarity score. This is an O(N) operation, so the query time will increase linearly with the number of records N in the index. ANN libraries try to do better, and NMSLib uses the idea of Hierarchical Navigable Small World (HNSW) Graphs, which effectively partition the data into a hierarchical set of small world graphs based on proximity. Searching this structure for the nearest neighbors of a given record then involves searching the proximity graph of that record, which is independent of the number of records.
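To make the O(N) baseline concrete, here is the brute-force nearest neighbor search that an ANN index replaces -- a minimal standard-library sketch using cosine distance:

```python
import math

def cosine_distance(u, v):
    # 1 - cosine similarity: 0 for identical directions, up to 2 for opposite
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def brute_force_knn(query, vectors, k):
    # score EVERY vector in the index against the query -- the O(N)
    # per-query cost that HNSW-style indexes are designed to avoid
    scored = [(cosine_distance(query, v), i) for i, v in enumerate(vectors)]
    scored.sort()
    return scored[:k]
```

Every query touches all N vectors, so doubling the index doubles the query time; an HNSW index instead walks a small proximity graph around the query.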

Over the last year or so, search has increasingly begun to move from traditional information retrieval techniques based on TF-IDF, to techniques based on neural models. To some extent, this has coincided with the release of the Bidirectional Encoder Representations from Transformers (BERT) model by Google, and its subsequent application in the NLP and Search communities for many different tasks, including similarity based document ranking. Prof. Jimmy Lin of the University of Waterloo has published an Opinion piece The Neural Hype, Justified! A Recantation at SIGIR 2019, where he explains why this is so in more detail.

In any case, this trend has led to a need for computing vector based similarity in an efficient manner, so I decided to do a little experiment with NMSLib, to get familiar with the API and with NMSLib generally, as well as check out how good BERT embeddings are for search. This post describes that experiment.

For my dataset, I selected the Health News in Twitter dataset from the UCI Machine Learning Repository. This is a dataset containing approximately 60k tweets related to 16 different Health/News organizations, and was donated to UCI as part of the paper Fuzzy Approach to Topic Discovery in Health and Medical Corpora.

I first extracted the tweet ID and text from these files and uploaded them into a SQLite3 database, so I could look the text up by ID later. I did a little cleanup of the tweet texts, to remove (shortened) URLs. Then I set up a BERT-as-a-service sentence encoding service on a GPU box, using BERT-base uncased as the underlying model, and generated the vectors for each of the sentences. The code for both of these steps is fairly trivial and can be easily figured out from the documentation, so I will not bother to describe it here.

I then wanted to see how NMSLib scaled with change in dataset size. I had approximately 60k tweets and their vectors, so I loaded random subsets of the dataset into the NMSLib index, then ran a batch of 50 queries (also randomly sampled from the dataset) to generate their K Nearest Neighbors for K=10. I recorded the total loading time in each case, as well as the query time averaged over the 50 queries. Plots for both are shown below.

What is interesting is that load time rises approximately linearly with the number of records, but the query time is sublinear (and on a completely different timescale from indexing) and eventually seems to flatten out. So we can expect to pay a penalty for the large number of records once during indexing, but query seems to be quite fast.

I also looked at some of the results of the similarity search, and they seem to be pretty good. It is hard to think in terms of traditional similarity in the case of tweets, since there is so little content to go by, but I include below the 10 Nearest Neighbors (or equivalently, the top 10 search results) for 3 of my query tweets, and they all intuitively seem good, although perhaps for different reasons. In the first case, I think it has done a good job of focusing on the main content words "Ebola" and "Guinea" and fetching results around them. In the second case, it seems to have captured the click-bait-ey spirit of the query tweet, and returned other tweets along the same lines. In the third case, once again, it returns results that are similar in intent to the query document, but use completely different words.

QUERY: Ebola in Guinea puts miners in lock down (451467465793355776)
dist.  tweet_id            tweet_text
0.104  542396431495987201  Junior doctors in Sierra Leone strike over lack of Ebola care
0.110  582950953625735168  Sierra Leone Ebola lockdown exposes hundreds of suspected cases
0.112  542714193137635328  Junior doctors in Sierra Leone strike over lack of #Ebola care
0.112  517254694029524992  West Africa Ebola crisis hits tourism, compounds hunger in Gambia
0.117  497018332881498112  Ebola kills nurse in Nigeria
0.119  565914210807599104  Ebola-hit Guinea asks for funds for creaking health sector
0.120  514098379408699393  Ebola virus shutdown in Sierra Leone yields 'massive awareness'
0.120  555734676833198081  Ebola kills Red Cross nurse in Sierra Leone
0.121  583431484905754624  Sierra Leone to start laying off Ebola workers as cases fall: president
0.122  499300095402065921  Ebola Shuts Down The Oldest Hospital In Liberia

QUERY: 1 in 3 prescriptions unfilled, Canadian study finds (450767264292167681)
dist.  tweet_id            tweet_text
0.105  564909963739287552  Study finds 1 in 12 children's ER visits medication related
0.108  321311688257306624  1 in 4 skin cancer survivors skips sunscreen, study finds
0.109  161460752803311617  Only 1 in 4 Young Teens Uses Sunscreen Regularly, Study Finds:
0.110  458662217651879936  Child abuse affects 1 in 3 Canadian adults, mental health study indicates
0.112  344601130124316672  1 in 6 women at fracture clinics have been abused, study shows
0.126  160184310849224704  Abortion ends one in five pregnancies worldwide, study finds
0.127  332579818543673346  1 in 8 Boomers report memory loss, large survey finds
0.127  148844725191979009  Nearly 1 in 3 Young U.S. Adults Have Arrest Records: Study:
0.129  468857557126512640  HPV Found in Two-Thirds of Americans, Survey Finds
0.129  119455268106018816  1 in 4 U.S. Adults Treated for High Blood Pressure: Report:

QUERY: Skip the elliptical machine and walk your way to better health: (296539570206568448)
dist.  tweet_id            tweet_text
0.000  295724160922038272  Skip the elliptical machine and walk your way to better health:
0.000  294033035731546112  Skip the elliptical machine and walk your way to better health:
0.126  399914443762855936  Sweat Your Way To A Healthier Brain
0.144  304496972419702784  How to exercise your way to better sex:
0.144  293262936804311041  How to exercise your way to better sex:
0.149  557233923621527552  Need a healthy push? Turn to your partner to lose weight, quit smoking
0.152  564595829152182273  You don't need a gym to torch calories! Try this 30-minute workout 3 times a week to drop winter weight:
0.152  293006265599262722  Keep hands off the treadmill bars while you walk; you're limiting your calorie burn! Boost your treadmill workout:
0.154  541676196686491649  Kickoff your weight loss plan! Learn 25 ways to cut 500 calories a day:
0.154  551943764198301696  Learn the expert tricks to FINALLY achieving your goal weight:

So anyway, that concludes my experiment with NMSLib. Overall, I now have enough to build something real using these components. As before, the code is pretty trivial; it is modeled after the code under the Example Usage section of the NMSLib Quickstart, so I am not going to repeat it here.

Of course, NMSLib is by no means the only library in this area. Others in this space that I know of are FAISS from Facebook, another similarity search library that is optimized to run on GPUs, and Vespa, a full-fledged search engine that allows for both traditional and vector search. In addition, there are plugins for vector scoring for both Elasticsearch (elasticsearch-vector-scoring) and Solr (solr-vector-scoring). So these might be options for vector search as well.

For my part, I am happy with my choice. I needed something which was easy to learn and embed, so the preference was for libraries rather than products. I also wanted it to run on a CPU based machine for cost reasons. FAISS does run on CPU as well, but based on results reported by Ben Fredrickson here, NMSLib has better performance on CPU. Also NMSLib installation is just a simple "pip install" (with the pre-compiled binary). In any case, I haven't ruled out using FAISS completely. In this system, we extract BERT embeddings for the tweets in an offline batch operation. In a real system, it is likely that such embeddings will have to be generated on demand, so such setups would be forced to include a GPU box, which could be used to also serve a FAISS index (based on Ben Fredrickson's evaluation, FAISS on GPU is approximately 7x faster than NMSLib is on CPU).

Monday, December 09, 2019

PyData LA 2019: Trip Report

PyData LA 2019 was last week, and I had the opportunity to attend. I also presented about NERDS (Named Entity Recognition for Data Scientists), an open source toolkit built by some colleagues from our Amsterdam office. This is my trip report.

The conference was three days long, Tuesday to Thursday, and was spread across 3 parallel tracks. So my report is necessarily incomplete, and limited to the talks and tutorials I attended. The first day was tutorials, and the next two days were talks. In quite a few situations, it was tough to choose between simultaneous talks. Fortunately, however, the talks were videotaped, and the organizers have promised to put them up in the next couple of weeks, so looking forward to catching up on the presentations I missed. The full schedule is here. I am guessing attendees will be notified by email when videos are released, and I will also update this post when that happens.

Day 1: Tutorials

For the tutorials, I did all the tutorials in the first track. Of these, I came in a bit late for Computer Vision with Pytorch, since I miscalculated the volume (and delaying power) of LA Traffic. It was fairly comprehensive, although I was familiar with at least some of the material already, so in retrospect, I should probably have attended one of the other tutorials.

The second tutorial was about Kedro and MLFlow and how to combine the two to build reproducible and versioned data pipelines. I didn't know that MLFlow can be used standalone outside Spark, so definitely something to follow up there. Kedro looks like scaffolding software which allows users to hook into specific callback points in its lifecycle.

The third tutorial was a presentation on teaching a computer to play PacMan using Reinforcement Learning (RL). RL apps definitely have a wow factor, and I suppose it can be useful where the environment is deterministic enough (rules of a game, laws of physics, etc.), but I often wonder if we can use it to train agents that can operate in a more uncertain "business applications"-like environment. I am not an expert on RL though, so if you have ideas on how to use RL in these areas, I would appreciate learning about them.

The fourth and last tutorial of the day was Predicting Transcription Factor (TF) genes from genetic networks using Neural Networks. The data extraction process was pretty cool, it was predicated on the fact that TF genes typically occupy central positions in genetic networks, so graph based algorithms such as connectedness and Louvain modularity can be used to detect them in these networks. These form the positive samples, and standard negative sampling is done to extract negative samples. The positive records (TFs) are oversampled using SMOTE. Features for these genes come from an external dataset of 120 or so experiments, where each gene was subjected to these experiments and results recorded. I thought that the coolest part was using the graph techniques for building up the dataset.

Days 2 and 3: Talks

All the talks provided me with some new information in one form or the other. In some cases, it was a tough choice to make, since multiple simultaneous talks seemed equally interesting to me going in. Below I list the ones I attended and liked, in chronological order of appearance in the schedule.

  • Gradient Boosting for data with both numerical and text features -- the talk was about the CatBoost library from Yandex, and the talk focused on how much better CatBoost is in terms of performance (especially on GPU) compared to other open source Gradient Boosting libraries (LightGBM and one other that I can't recall at the moment). CatBoost definitely looks attractive, and at some point I hope to give it a try.
  • Topological Techniques for Unsupervised Learning -- talked about how the topological technique called Uniform Manifold Approximation and Projection (UMAP) for dimensionality reduction can be used for generating very powerful embeddings and for clustering that is competitive with T-SNE. UMAP is more fully described in this paper on arXiv (the presenter was one of the co-authors of this paper). There was one other presentation on UMAP by one of the other co-authors which I was unable to attend.
  • Guide to Modern Hyperparameter Tuning Algorithms -- presented the open source Tune Hyperparameter Tuning Library from the Ray team. As with the previous presentation, there is a paper on arXiv that describes this library in more detail. The library provides functionality to do grid, random, bayesian, and genetic search over the hyperparameter space. It seems to be quite powerful and easy to use, and I hope to try it out soon.
  • Dynamic Programming for Hidden Markov Models (HMM) -- one of the clearest descriptions of the implementation of the Viterbi (given the parameters for the model and the observed states, find the most likely sequence of hidden states) algorithm that I have ever seen. The objective is for the audience to understand HMM (specifically Viterbi algorithm) well enough so they can apply it to new domains where it might be applicable.
  • RAPIDS: Open Source GPU Data Science -- I first learned about NVidia's RAPIDS library at KDD 2019 earlier this year. RAPIDS provides GPU optimized drop-in replacements for NumPy, Pandas, Scikit-Learn, and NetworkX (cuPy, cuDF, cuML, and cuGraph), which run order of magnitude faster if you have a GPU. Unfortunately, I don't have a GPU on my laptop, but the presenter said that images with RAPIDS pre-installed are available on Google Cloud (GCP), Azure, and AWS.
  • Datasets and ML Model Versioning using Open Source Tools -- this is a presentation on the Data Version Control (DVC) toolkit, which gives you a set of git like commands to version control your metadata, and link them to a physical file on some storage area like S3. We had envisioned using it internally for putting our datasets and ML models under version control some time back, so I was familiar with some of the information provided. But I thought the bit about creating versioned ML pipelines (data + model(s)) was quite interesting.

And here are the talks I would like to watch once the videos are uploaded.

  • Serverless Demo on AWS, GCP, and Azure -- this was covered in the lightning round on the second day. I think this is worth learning, since it seems to be an easy way to set up demos that work on demand. Also learned about AWS Batch, a "serverless" way to serve batch jobs (or at least non-singleton requests).
  • Supremely Light Introduction to Quantum Computing -- because Quantum Computing which I know nothing about.
  • Introducing AutoImpute: Python package for grappling with missing data -- No explanation needed, clearly, since real life data often comes with holes, and having something like this gives us access to a bunch of "standard" strategies fairly painlessly.
  • Tackling Homelessness with Open Data -- I would have attended this if I had not been presenting myself. Using Open Data for social good strikes me as something we, as software people, can do to improve our world and make it a better place, so always interested in seeing (and cheering on) others who do it.
  • What you got is What you got -- speaker is James Powell, a regular speaker I have heard at previous PyData conferences, who always manages to convey deep Python concepts in a most entertaining way.
  • GPU Python Libraries -- this was presented by another member of the RAPIDS team, and according to the previous presenter, focuses more on the Deep Learning aspect of RAPIDS.

And then of course there was my presentation. As I mentioned earlier, I spoke of NERDS, or more specifically my fork of NERDS where I made some improvements on the basic software. The improvements started as bug fixes, but currently there are quite a few significant changes, and I plan on making a few more. The slides for my talk are here. I cover why you might want to do Named Entity Recognition (NER), briefly describe various NER model types such as gazetteers, Conditional Random Fields (CRF), and various Neural model variations around the basic Bidirectional LSTM + CRF, cover the NER models available in NERDS, and finally describe how I used them to detect entities in a Biological Entity dataset from BioNLP 2004.

The reason I chose to talk about NERDS was twofold. First, I had begun to get interested in NERs in general in my own work, and "found" NERDS (although since it was an OSS project from my own company, not much discovery was involved :-)). I liked that NERDS does not provide "new" ML models, but rather a unified way to run many out of the box NER models against your data with minimum effort. In some ways, it is a software engineering solution that addresses a data science problem, and I thought the two disciplines coming together to solve a problem was an interesting thing in itself to talk about. Second, I feel that custom NER building is generally considered something of a black art, and something like NERDS has the potential to democratize the process.

Overall, based on some of the feedback I got on LinkedIn and in person, I thought the presentation was quite well received. There was some critical feedback saying that I should have focused more on the intuition behind the various NER modeling techniques than I did. While I agree that this might be desirable, I had limited time to deliver the talk, and I would not have been able to cover as much if I spent too much time on basics. Also, since the audience level was marked as Intermediate, I risked boring at least part of the audience if I did so. But I will keep this in mind for the future.

Finally, I would be remiss if I didn't mention all the wonderful people I met at this conference. I will not call you out by name, but you know who you are. While some people think of conferences as places where a small group of people get to showcase their work in front of a larger group, a conference is also a place where you get to meet people in your discipline but in similar or different domains, and I find it immensely helpful and interesting to share ideas and approaches for solving different problems.

And that's all I have for today. I hope you enjoyed reading my trip report.

Sunday, October 20, 2019

Trip Report: Graphorum 2019

I just got back from Graphorum 2019, organized by the folks at Dataversity. The conference was held at The Westin Chicago River North, and colocated with the Data Architecture Summit (DAS), also organized by Dataversity. Attendees were not restricted to talks at one or the other conference; they were allowed, even encouraged, to attend talks at the other conference, perhaps in an effort to popularize Graph concepts among Data Architects, and Data Architecture best practices among Graph practitioners. Both conferences were very heavily multi-track -- at any point, you had around 3-5 choices if you restricted yourself to talks in either track. I attended only the Graphorum talks, so this trip report represents one of 322 million possible unique trip reports (and one of 142 billion possible unique reports if I had not restricted myself), with the naive assumption that all talks within each conference were independent, and at any point, I was equally likely to select any one of the talks offered.

The conference was four days long, starting on Monday last week (October 14 2019) and ending yesterday (October 17). As expected, many of the talks (at least among the ones I attended) were centered around Knowledge Graphs (KGs). Some of these focused on techniques and advice on how to build them from unstructured text, and some focused on making them more effective. Many of the presentations ended up covering the various Linked Data standards such as Resource Description Framework (RDF) for specifying semantic triples, and Web Ontology Language (OWL) for doing inference on them. More than one talk mentioned the new Shape Constraint Language (SHACL) for validating such RDF graphs. On the other hand, it looks like there is strong industry support for Labeled Property Graphs (LPG) as well, both among database vendors and users. Regardless of the graph flavor, there was also quite a lot of interest in using Machine Learning (ML) and Natural Language Processing (NLP) to leverage the signal inherent in graph structure. In the rest of this post, I will cover the talks that I liked in my 1-in-322M path through the conference. I will also cover some interesting discussions I had with some vendors at the exhibition, and some overall feedback about the conference as a whole.

I arrived late in Chicago on Sunday night, so was unable to take advantage of the early bird registration. At breakfast next morning, I found myself at the end of a fairly long line while the others were relatively empty. In an attempt to understand why, and notwithstanding Murphy's Law, I noticed that the lines were split by the first character of the last name, and unevenly sized (A-E, F-J, K-O, and P-Z). It is possible that the sizing was based on actual attendee last names, in which case I guess people with last names P-Z tend to be relatively last minute types.

Day 1

On the first day, I attended a presentation about Applications of Knowledge Graphs in Enterprise Data Management and AI by Juergen Jakobitsch from Semantic Web Company and Andreas Blumauer from Pool Party. It was very comprehensive, and included lots of background material on Knowledge Graphs. Among some standard applications were semantic search, info-box style results, product recommendations, virtual assistants, etc. One interesting application mentioned was text data augmentation, and (in the same vein) filtering results from chatbots for illogical answers. They also talked about the PoolParty pipeline for converting inputs (structured and unstructured) to Knowledge Graphs, which includes entity resolution and entity linking. This presentation, as well as others throughout the conference, also focused on the various W3C standards such as RDF, SPARQL, OWL2, and SHACL.

I also attended Graphs Transform Chemical Discovery in Pharmaceutical Research presented by Liana Kiff of Tom Sawyer Software. I had initially thought that it would talk about the actual process of Chemical Discovery using graphs, but it turned out to be (an admittedly very powerful) visualization of a graph of chemical entities from the ChEMBL database, including features that allow for manual chemical discovery.

The final presentation of the day was You Must Be This Tall: Machine Learning from Graphs, Tables, Trees, and Documents by Brian Sletten of Bosatsu Consulting, Inc. As with the previous talk, I came in hoping to learn about ML with graphs, but the talk turned out to be about what you need to do before you can do ML with your graphs. Nevertheless, the presenter was very knowledgeable, so the talk ended up being pretty interesting and educational, in spite of it covering a lot of the RDF/SPARQL/OWL ground covered by earlier talks. One good insight here was the distinction between data, knowledge, and wisdom as points in a Data Understanding vs Connectedness space, coupled with capturing relationships between entities, the context in which these relations exist, and the exploitation of this information using RDF technologies. Another was the need for data lakes to be available as a pre-requisite to effective ML in the organization. Yet another thing I liked about his presentation was his example SPARQL queries against Wikidata. He also talked about JSON-LD and how it can be a good substitute for RDF.

Day 2

The first presentation of Tuesday was Ontology Engineering for Knowledge Graphs by Elisa Kendall and Deborah McGuinness. The talk focused on the role of the Ontology Engineer, who is the technical person who talks to the Knowledge Engineer or Domain Expert. The main component of this role is being effective at interviewing the Knowledge Engineer. The talk also covered various aspects of modeling the domain, with many examples drawn from the presenters' own experiences.

The next presentation I attended was Merging Company Data from Multiple Sources with GraphDB and Ontotext Platform by Atanas Kiryakov (their CEO). As expected, the presentation was more about how Ontotext does things. However, Ontotext is one of the first graph database companies I know of that pursued a strong NLP + Graph strategy, so there was lots of interesting content. Among the functionality covered were features around Entity Recognition from text and Entity Disambiguation (from structured data as well as text), the use of different embedding technology (1-hot, 2-hot, BERT, etc), and the use of different technologies such as Random Forests and SVM for entity and document classification, BiLSTM and CRF for Named Entity Recognition, and Inductive Logic Programming (ILP) for rules engines built around the discovered entities and relations.

The final presentation on Tuesday for me was Reveal Predictive Patterns with Neo4j Graph Algorithms (Hands-on) by Amy Hodler and William Lyon. I had learned about Graph Algorithms in Neo4j (and also used some of them for my own talk later in the conference) from the O'Reilly book Graph Algorithms: Practical Examples in Apache Spark and Neo4j (Amy Hodler, one of the presenters, is also one of its authors), so some of it was old material for me. The talk started off with the need for graph algorithms as tools that exploit the additional information implicit in the graph structure, then covered the various classes of graph algorithms (Pathfinding, Centrality, Community Detection, Link Prediction and Similarity), with deep dives into specific algorithms and running them on their Neo4j Desktop product (proprietary product with 30 day free trial, but all the features covered are also available in the free community edition). I ended up learning a few new things, such as how to use virtual graphs (generated as a result of a Cypher query, sort of like views in the RDBMS world), and how to use the Strongly Connected Components algorithm as a debugging tool. They also showed off their NEuler product, which allows forms-based invocation of various algorithms, as well as some very good visualizations. Talking about visualization, William Lyon also mentioned the neo4j-contrib/neovis.js project, which seems interesting as well. Overall, lots of useful information about Neo4j and graphs.
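To make the algorithm classes concrete, here is a minimal illustration of three of them (pathfinding, centrality, and community detection) using networkx rather than Neo4j; the six-node graph and its labels are invented for the sketch.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy undirected graph: two triangles joined by a single bridge edge
G = nx.Graph()
G.add_edges_from([
    ("a", "b"), ("b", "c"), ("a", "c"),   # cluster 1
    ("d", "e"), ("e", "f"), ("d", "f"),   # cluster 2
    ("c", "d"),                           # bridge
])

# Pathfinding: shortest path across the bridge
path = nx.shortest_path(G, "a", "f")

# Centrality: PageRank score per node (scores sum to 1)
ranks = nx.pagerank(G)

# Community detection: greedy modularity maximization
communities = [tuple(sorted(c)) for c in greedy_modularity_communities(G)]

print(path)         # ['a', 'c', 'd', 'f']
print(communities)  # likely the two triangles as separate communities
```

The same three classes map onto the Neo4j procedures demonstrated in the talk; only the invocation syntax (Cypher `CALL` procedures) differs.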

I also learned about the Bridges of Chicago, based on a challenge from the presenters about using Cypher (the Neo4j query language) to find an Eulerian path similar to the Bridges of Königsberg problem. I guess I was the only one that responded, since the problem is much simpler than it appears to be at first glance.

Exhibitors started setting up their booths today, so I spent some of the coffee breaks and most of the evening talking to various exhibitors. Both Graph database vendors and consultants were well represented among the exhibitors (considering it was a graph + data architecture conference). Graph vendors I knew of included Neo4j, Ontotext, TigerGraph, DataStax, and StarDog. Among those who I learned about at this conference were PoolParty, Semantic Web Company, and Cambridge Semantics. Having attended the presentations from PoolParty, the Semantic Web Company, and Ontotext, I spent a lot of time talking with them. I also met up with the folks at TigerGraph, and let them know how helpful their Graph Gurus webinar series has been to me. I also took the opportunity to meet up with the folks at Stardog, who I had met a few years earlier at another Graph conference through a reference. Since I was a speaker here, the conversation also drifted occasionally to the subject of my talk, and what graph database I was using (Neo4j).

Day 3

Wednesday was quite a heavy day in terms of presentations, comparatively speaking. It started with two keynote presentations. The first one was Knowledge Graphs and AI: The Future of Enterprise Data by David Newman from Wells Fargo. He spoke of the progression of looking at Strings to Things to Predicted Things to Vectors, which resonated with me as well, since we are progressing along a very similar path ourselves. He led us through multiple examples involving harmonizing an entity across multiple Knowledge Graphs in the enterprise, the need for classifying entities into a taxonomy, using Knowledge Graphs to predict new relationships, using graph relations for creating digital fingerprints for ML algorithms, etc. His examples referenced the Financial Industry Business Ontology (FIBO), which provides a standard schema for the financial services industry.

The second keynote was Graph Stories: How Four Metaphors can help you decide if Graphs are right for you by Dan McCreary of Optum. While David Newman's presentation was based on RDF style graphs, Dan McCreary is a big proponent of Labeled Property Graphs (LPG), although his choice had several very pragmatic reasons. The four metaphors he described are the Neighborhood Walk, the Knowledge Triangle, the Open World Assumption, and the Jenga Tower. Specifically, the first indicates the importance of relationship traversal in your applications, the second indicates where your application is (or wants to be) on the Data / Information / Knowledge Triangle, the third indicates the ease with which new information can be incorporated into your system, and the fourth indicates the resilience of your query system to small changes in your backend. The keynote also covered the importance of graph structure ("Structure is the new gold in data mining"), and the inter-relationship of graphs with Deep Learning techniques such as Graph Convolutional Networks (GCNN) and structured learning on graphs.

The next presentation I attended was Knowledge Graphs and Model Driven Classification by Scott Henninger of SmartLogic, where he showed off the capabilities of the SmartLogic platform, which centered around Metadata tagging, document classification (based on the metadata tagging and external taxonomies), and Search Enhancement Services (SES). The bulk of the capabilities seem to be rule based, which can be good for explainability purposes. SmartLogic's KG backend is based on RDF Schema, OWL, and SHACL. An interesting functionality of SmartLogic is to allow the user to manually fine-tune the (term) weights from their classifier. I got quite excited at this, thinking that perhaps this functionality could be leveraged to produce explainable Deep Learning models by perturbing the inputs, but then realized that the intuition is similar to the idea behind LIME - Local Interpretable Model-Agnostic Explanations.

Next up was a talk on How Do You Read Millions of Documents for Meaning using Graph? by Ryan Chandler of Caterpillar, Inc. He described a system he built at Caterpillar, that allowed customer support technicians to query a large collection of internal support tickets created by other technicians. The end result is a query-able knowledge base. The text in the support tickets is tokenized and segmented into sentences, then tagged with cause, complaint, solution, note, and correction (classification). The document is decomposed into semantic frames, and the document and the associated semantic frames, along with its metadata, are stored in a Neo4j graph database. On the query side, the natural language (NL) query is converted into a graph using a dependency parse, and re-composed into a Cypher query against specific semantic frames (as indicated by the metadata). The Cypher query produces a ranked list of support tickets that best satisfy the NL query. I thought this was quite an interesting technique, although it may be somewhat dependent on the structure of the input data.

The next presentation I attended was Graph Analytics for Enterprise Applications by Melliyal Annamalai, Souripriya Das, and Matthew Perry from Oracle. I came in a few minutes late so I missed the first part, but from what I gathered, it covered Oracle's foray into graph databases -- it turns out that Oracle customers can now start working with SPARQL using SQL Developer, seamlessly against Oracle's new hybrid Property and RDF graph. The functionality is nice, but probably only useful for current and future Oracle customers.

My next presentation was Insufficient Facts always invite Danger: Combat them with a Logical Model by Michael Grove of Stardog, where he described how important it was to have a Logical Model to ensure completeness of your model, and how it can help you avoid problems later.

The evening before I had spent some time at the DataStax booth, mainly for nostalgic reasons since I worked with Cassandra (the Apache version, not the DataStax version) at my previous job, and I was curious about their graph product based on Cassandra (initially called Titan, then Janus). So I attended the presentation Graph Innovations for Distributed Data at Scale by Jonathan Lacefield. The presentation covered the evolution of their graph product, and also answered a nagging question I had about how they implemented the graph in a column-family database under the covers -- turns out that each row is basically the star graph around each node. Other interesting things in this presentation were their use of Gremlin and Spark support through their DataStax extension.

The last presentation of the day was Knowledge Graphs and GraphQL in Action: A Practical Approach to using RDF and Semantic Models for Web Applications by Irene Polikoff and Ralph Hodgson of TopQuadrant. They described their Semantic GraphQL interface which provides the user with a GraphQL interface, and converts down to a RDF, OWL, and SHACL query against a RDF triple store.

Finally, the last event of the day was a session about Open Source Knowledge Graph Tooling, which really turned out to be a group of banking folks trying to collaborate around the FIBO Ontology, but it is likely that they might expand to other industries as well in the future. There was also talk about listing out a current (non-deprecated) list of open source ontologies in various industries, applying unit tests to ontologies so they don't become stale and irrelevant, both of which were interesting to me.

The exhibitors were still around, and so I hung around for some more conversations with vendors and fellow attendees for a couple more hours after that. Among them were Cambridge Semantics, who have a fast analytics graph database called AnzoDB.

Day 4

The first presentation of the day was Unsupervised and Supervised ML on Big Graph: Case Studies by Victor Lee. He described various case studies using TigerGraph. The first one was finding influential healthcare providers in various local networks from a specialty network, and finding their influence networks. Another case study had to do with detecting spam phone calls in the China Mobile network, the training data for which consisted of 180+ graph features. The model was a Random Forest classifier. At prediction time, an incoming phone call would be placed in the call graph, and the 180+ features computed and fed into the Random Forest model to predict (in under 20 milliseconds) whether the call was spam or not spam. The third case study was for Bank Fraud, based on some synthetic data from a Kaggle competition, where TigerGraph engineers built some compound relationships based on edges discovered in the feature graph, which ended up giving good results, showing that the structure of the data provides useful signal. The talk ended with an introduction to Node2Vec, a graph embedding scheme.

The next presentation in my list was my own (Graph Techniques for Natural Language Processing). My presentation was about using Graph techniques (mostly a combination of common third party algorithms) to solve Natural Language Processing (NLP) problems. I covered four case studies that attempted to replicate academic papers (referenced from the Bibliography of Graph-Based Natural Language Processing and Information Retrieval) around document summarization, clustering using language model based vectors, word sense disambiguation, and topic finding. Graph techniques used included various graph centrality metrics (some built-in and some computed using Cypher and built-in algorithms), random walk techniques, Louvain Community Detection, Label Propagation, and Personalized PageRank. Compared to the other presentations, mine was probably a bit unusual, since it focused on NLP more than on graphs, so while I had only about 15-20 attendees, there seemed to be lots of interest, and some very good questions at the end. For those of you who weren't able to make it to the presentation but would like more details, you can find the link to my slides and code (in Jupyter notebooks, with a lot of verbose commentary) at my sujitpal/nlp-graph-examples repository.

I hung around a bit after my presentation answering questions, so I ended up being a bit late to the next presentation, even with the generous coffee break in between. This was When Time Meets Relationships: Understanding an Immutable Graph Database by Brian Platz of Fluree. He made the case that a Knowledge Graph is a snapshot at a point in time. A time-aware Knowledge Graph can be thought of as an immutable linked list, where facts are added to an append-only log, and made tamper-proof with hashing techniques, much like a private blockchain. The strategy assumes the Knowledge Graph is a triple-store of (Subject, Predicate, Object). As time passes, facts are either retracted or added, so a time-aware tuple would be (Subject, Predicate, Object, Time, Add/Retract). In addition, labeled properties, such as a scheduled expiration date, can be accommodated with an additional Metadata attribute. He also covered some indexing strategies that can make it efficient to query such a time-aware tuple-store.
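The append-only, time-aware tuple log described above can be sketched in a few lines of Python. The facts and names here are made up, and a real system like Fluree adds hashing, indexing, and the metadata attribute; this only shows the add/retract replay idea.

```python
from collections import namedtuple

# One log entry: a triple plus a timestamp and an add/retract operation
Fact = namedtuple("Fact", ["subject", "predicate", "obj", "time", "op"])

log = []  # append-only: entries are never updated in place

def assert_fact(s, p, o, t):
    log.append(Fact(s, p, o, t, "add"))

def retract_fact(s, p, o, t):
    log.append(Fact(s, p, o, t, "retract"))

def snapshot(as_of):
    """Replay the log up to a point in time to reconstruct the graph."""
    triples = set()
    for f in sorted(log, key=lambda f: f.time):
        if f.time > as_of:
            break
        key = (f.subject, f.predicate, f.obj)
        if f.op == "add":
            triples.add(key)
        else:
            triples.discard(key)
    return triples

# Hypothetical history: the CEO changes at time 5
assert_fact("acme", "hasCEO", "alice", t=1)
assert_fact("acme", "hasCEO", "bob", t=5)
retract_fact("acme", "hasCEO", "alice", t=5)

print(snapshot(as_of=3))  # {('acme', 'hasCEO', 'alice')}
print(snapshot(as_of=9))  # {('acme', 'hasCEO', 'bob')}
```

Because the log is immutable, any past state of the Knowledge Graph can be recovered exactly, which is the property the talk was built around.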

After this, there were two keynotes. The first one was News and Graphs by Peter Olson of NBC News Digital, which covered the graph structure of the NBC News Publishing pipeline, and how NBC leverages graphs to provide news with high velocity and scale. The second keynote was Knowledge Graph Pilot improves Data Quality While Providing a Customer 360 View by Bethany Swhon and Patricia Branum of Capital One, where they described how they improved the quality of their Knowledge Graph to provide a better Customer view across the enterprise.

The final presentation of the day and conference for me was Automated Encoding of Knowledge from Unstructured Text into a Graph Database by Chris Davis of Lymba. The presentation described the Lymba pipeline to convert text into a Knowledge Graph. It included the usual preprocessing, tokenizing, POS tagging, and segmentation steps other presentations covered (and in some ways seem to be standard knowledge in the text to KG NLP sub-community), but this presentation went one step further and talked about the need for Word Sense Disambiguation, Concept extraction (using gazetteers and NER models), and Syntactic (constituent) and Semantic (dependency) Parses for relation extraction. It also included Coreference Resolution, which is also quite important but usually omitted from pipelines because of its complexity. The Lymba product provides a turnkey solution plus consulting for various industries.

I had to catch a flight back, and having heard about the Chicago traffic and having faced the zero tolerance for lateness in large airports such as LAX, I didn't want to miss it. So I ended up skipping the last panel discussion on Graphs vs Tables. Turns out I didn't need to, but better safe than sorry.


As conferences go, this was quite luxurious -- attendees were treated to a sumptuous buffet breakfast every day, and a 3 course sit-down lunch for 3 of the 4 days (1 of the days was build-your-own sandwiches, but even that was quite nice). One observation is that sit-down lunches can foster really good and insightful conversations. In addition, there was coffee and snacks throughout the day, and (limited) free drinks for 2 of the 3 evenings. Swag included a Dataversity branded backpack to hold your conference materials, wool hats with the Dataversity logo, stickers, and a pen which contained a USB drive with all the presentation slides, as well as the swag vendors give out at their stalls (to potential clients).

Of course, the nicest thing about conferences (after the presentations) are the conversations with fellow attendees, and the chance to learn from their insights, and what they are doing with the tech under consideration (in this case graphs). I met people from aviation, health (insurance), finance, consulting, the government (from both the covert and the overt branches), as well as scientific publishing. In addition, it was a chance to interact with people from the vendor companies, and bounce ideas against them about specific things they do well. Two insights, both gained at lunch table conversations -- first, RDF has better inter-operability and tooling, but LPGs are easier to work with; second, certain Asian cultures believe that you can never define an object fully, which seems to warrant more of a triple-store structure than the more efficient but constrained graph structure.

Overall, it was good to see Graphs being treated with so much importance. The previous Graph conferences I have attended were much smaller affairs, rarely lasting more than a day. I suppose this might partly be because of the focus on explainable AI, advances in Knowledge Graphs, Graph CNNs and embeddings, as well as the realization that graph structure provides useful exploitable signal, all of which are causing graphs to become more and more important, and graph conferences to become more popular.

If I had to suggest an improvement, I would suggest streamlining the evaluation process. I don't know how many feedback forms were returned (I returned all four that were provided in my conference materials, but not the last global one). Each form takes approximately 5 minutes to complete, so it is tempting to skip it and go to the next session instead. And by the evening, it is harder to do, since you have to refer to your notes instead of relying on short term memory. On the other hand, someone at the door with an iPad who scans your badge and asks you to tap on a smiley versus a frowney icon provides much better coverage (although you would have to interpret the meaning of the feedback). I guess it's the tension between explicit versus implicit feedback; there are tradeoffs either way.

Sunday, October 06, 2019

Efficient Reverse Image Search on Lucene using Approximate Nearest Neighbors

I wrote last week about our Search Summit, an internal conference where engineers and product managers from Elsevier, LexisNexis, and various other organizations that make up the RELX Group, get together to share ideas and best practices on search. The Search Summit is in its third year, and given that it's a single-track conference and runs for two days, there are limited speaking slots available, and competition for these slots is quite fierce. Our acceptance rate this year was around the 40-50% mark, which I guess puts us on par with some big external conferences in terms of exclusivity. In any case, I guess that was my long-winded way of saying that this year I didn't get accepted for a speaking slot, and was offered the opportunity to present a poster instead, which I accepted.

This was the first time I made a poster for work, and quite honestly, I didn't know where to begin. My only experience with posters was back when I was at school, where I remember that we used crayons a lot, and then helping my kids build posters for their school, which involved a lot of cutting out pictures and gluing them on to a posterboard. Fortunately, as I learned from some googling, technology has progressed since then, and now one can build the poster as a single Microsoft Powerpoint (or Apple Keynote or Google Slides) slide (I used Powerpoint). Quite a few sites provide free templates for various standard poster sizes as well; I ended up choosing one from Genigraphics. Once you are done designing the poster, you save it as a PDF, and upload the PDF to sites such as Staples Print Services for printing. And that's pretty much all there is to it. Here is the poster I ended up presenting at the search summit.

Basically, the poster describes a Proof of Concept (POC) application that attempts to do reverse image search, i.e., given an image, it returns a ranked list of images, similar to the given image, in the corpus. My dataset for my experiments was a set of approximately 20,000 images from the ImageCLEF 2016 competition. The notion of similarity that I use is the one learned by a RESNET network pre-trained on ImageNet. The idea is that this sort of network, trained on a large generic image corpus such as ImageNet, learns how to recognize colors, edges, shapes, and other more complex features in images, and these features can be used to provide a notion of similarity across images, much like tokens in a document.

RESNET (and other pre-trained networks like it) generate dense and (comparatively) low-dimensional vectors of features. For example, the last layer of a trained RESNET model will generate feature vectors of size 2048. On the other hand, text is typically vectorized for Information Retrieval as a vector over all the tokens (and sometimes token combinations as well) in its vocabulary, so feature vector sizes of 50,000 or 100,000 are not uncommon. In addition, such vectors tend to be quite sparse as well, since a given token is relatively unlikely to occur in many documents. Lucene (and probably other search engines) leverages the high dimensionality and sparseness of text feature vectors using its Inverted Index data structure (aka Postings List), where the key is the token and the value is a priority queue of document ids that the token occurs in.
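A toy version of the inverted index makes this concrete. This is a hypothetical sketch with made-up captions, not Lucene's actual implementation (which adds skip lists, compression, and relevance scoring), but the key-to-postings structure is the same.

```python
from collections import defaultdict

# Made-up caption corpus: doc id -> caption text
docs = {
    0: "chest x ray left lung",
    1: "chest x ray right lung",
    2: "brain mri axial slice",
}

# Inverted index: token -> set of doc ids containing it
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.split():
        index[token].add(doc_id)

def search(query):
    """Return ids of documents containing every query token."""
    postings = [index.get(token, set()) for token in query.split()]
    return set.intersection(*postings) if postings else set()

print(sorted(search("chest lung")))  # [0, 1]
print(sorted(search("brain mri")))   # [2]
```

Lookup cost is driven by the length of the postings lists, not by the corpus size, which is what makes the sparse representation so attractive.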

In order to present a ranked list of similar images, a naive (and exhaustive) solution would be to compute some measure of vector similarity (such as Cosine similarity or Euclidean similarity) between the input image and every other image in the corpus, an O(N) operation. A better solution might be to partition the similarity space using something like a KD-tree such that comparison is done against a subset of most similar images, an O(log N) operation. However, the Lucene inverted index makes the search an almost O(1) operation -- first looking up by token, and then navigating down a priority queue of a subset of document IDs containing that token. My reasoning went as follows -- if I could somehow project the dense, low-dimensional vectors from RESNET back into a sparse, high-dimensional space similar to that used for text vectors without losing too much information, I would be able to do image search as efficiently as text search.
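For reference, the naive O(N) scan is easy to sketch in numpy. The corpus here is random stand-in data (smaller than the real 20,000 x 2048 feature matrix), not the actual RESNET features.

```python
import numpy as np

# Random stand-in for the image feature matrix (rows are images)
rng = np.random.RandomState(42)
corpus = rng.rand(1000, 256).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)  # unit-normalize rows

def most_similar(query_vec, k=10):
    """Score the query against every image and return the top k (id, score)."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = corpus @ q                 # cosine similarity via dot product
    top = np.argsort(-scores)[:k]       # O(N log N) sort over all N scores
    return list(zip(top.tolist(), scores[top].tolist()))

# Querying with a corpus vector should rank that vector itself first
results = most_similar(corpus[123])
print(results[0][0])  # 123
```

Every query touches every row of the matrix, which is exactly the cost the inverted-index projection is meant to avoid.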

I experimented with two such projection techniques -- Random K-Means and Random Projections.

  • Random K-Means involves running a bunch of K-Means clusterings over the image set, typically 50-200 of them, each with some random value for k. Each image (point in similarity space) is represented by a sequence of its cluster memberships across all the clusterings. The intuition here is that similar images, represented by two points that are close in the similarity space, will share cluster memberships across more clusterings than dissimilar images, represented by points that are further apart. A 2D schematic on the bottom center of the poster illustrates the idea.
  • Random Projections is similar, but instead of partitioning the space into a random number of circular clusters, we partition the space itself using a (D-1) dimensional hyperplane, where D is the size of the dense feature vector (2048). The number of random projections used is typically much higher than the number of K-Means, typically 1,000-10,000. This effectively results in binary clusters where points behind the hyperplane get assigned a 0 and points in front get assigned a 1. As before, each image is represented by a sequence of its membership across all the random projections. The intuition here is similar to random K-Means. The 2D schematic on the bottom right of the poster illustrates how random projections work.
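Both projection schemes above can be sketched briefly with numpy and scikit-learn. The vectors here are random stand-ins, and the function names, parameter values, and token formats are my own inventions for the sketch, not the actual POC code; the output of each function is one "document" of synthetic tokens per image, ready for indexing.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
features = rng.randn(500, 64)  # stand-in for the 2048-dim RESNET vectors

def random_kmeans_tokens(X, num_clusterings=10, seed=0):
    """Each clustering (with a random k) contributes one token per image."""
    rs = np.random.RandomState(seed)
    per_clustering = []
    for i in range(num_clusterings):
        k = rs.randint(2, 20)  # random number of clusters for this run
        labels = KMeans(n_clusters=k, random_state=i, n_init=10).fit_predict(X)
        per_clustering.append([f"c{i}_{label}" for label in labels])
    # Join each image's memberships across all clusterings into one "document"
    return [" ".join(tokens) for tokens in zip(*per_clustering)]

def random_projection_tokens(X, num_planes=64, seed=0):
    """Each random hyperplane contributes one binary token per image."""
    rs = np.random.RandomState(seed)
    planes = rs.randn(X.shape[1], num_planes)   # random hyperplane normals
    bits = (X @ planes >= 0).astype(int)        # which side of each plane
    return [" ".join(f"p{j}_{b}" for j, b in enumerate(row)) for row in bits]

km_docs = random_kmeans_tokens(features)
rp_docs = random_projection_tokens(features)
print(km_docs[0].split()[:3])  # first few membership tokens for image 0
```

Once each image is a bag of such tokens, it can be indexed and searched exactly like a text document, with similar images sharing many tokens.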

My baseline for experiments was caption based text search, i.e., similar images were found by doing text search on their captions. I took a random set of 20 images, and manually evaluated the relevance of the top 10 similar images for each on a 4-point scale. It's only one person's judgment, so it's not very objective, but I needed a place to start. The guideline I followed was to give the highest rating (4) if the image is identical, or almost identical, to the source image -- possibly either the same image, or a composite containing the same image, or a minor modification. The second highest rating (3) was for images that were basically the same thing -- for example, if the source image was an X-ray of the left lung, I would rate an X-ray of the right lung as very similar. The third rating (2) was for images that I couldn't rate 4 or 3, but which I would still expect to see in the set of similar images. Finally, the lowest rating (1) was for images that I didn't expect to see in the results, i.e., irrelevant images.

I used the ratings to compute Discounted Cumulative Gain (DCG) for k=1, 3, 5, and 10, where k is the number of top results taken from the ranked list of similar images for each image, and then averaged them across the 20 images. I then computed the Ideal DCG (IDCG) for k=10 by sorting the ranked list based on my ratings, and then divided the computed DCG by the IDCG to get the Normalized Discounted Cumulative Gain (NDCG). These numbers are shown in the table under the Results section in the poster. The bolded numbers indicate the best performers. Results for k=1 serve as a sanity check, since all the searches in my evaluation returned the source image as the top result in the ranked list of similar images. Overall, the numbers seem quite good, and Random K-Means with 250 clusters produced the best results.
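The metric computation can be sketched as follows. Note that this is a generic NDCG implementation that normalizes DCG@k by the ideal DCG at the same k; the poster's numbers normalize by the IDCG at k=10 as described above, so treat this as a close approximation rather than the exact procedure.

```python
import math

def dcg_at_k(ratings, k):
    """Discounted Cumulative Gain over the top-k results; ratings are the
    1-4 relevance judgments in ranked order."""
    return sum(r / math.log2(i + 2) for i, r in enumerate(ratings[:k]))

def ndcg_at_k(ratings, k):
    """Normalize DCG by the DCG of the ratings sorted descending (the
    ideal ranking)."""
    ideal = dcg_at_k(sorted(ratings, reverse=True), k)
    return dcg_at_k(ratings, k) / ideal if ideal > 0 else 0.0
```

Averaging `ndcg_at_k` over the 20 evaluation images at k=1, 3, 5, and 10 gives per-strategy numbers comparable to those in the poster's Results table.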

Random K-Means provided the best results in qualitative terms as well. The block of images at the center of the poster shows a brain MRI as the source image, and the top 3 results from the baseline, the best Random K-Means (250 clusters), and the best Random Projections (10,000 projections). A green box indicates a relevant result, and a red box indicates an irrelevant result. As you can see, the caption based text search baseline produces more diverse results, but doesn't do very well otherwise. Random K-Means and Random Projections produce images that are more visually similar to the source image, and between the two, Random K-Means tends to produce better results.

In the future, I hope to extend this approach to text vectors as well. Specifically, I want to capitalize on the increase in recall that I noticed with caption based text search, by building embeddings (word2vec, ELMo, BERT, etc.) from the captions and treating these embeddings as additional feature vectors for the image. On the other hand, I do notice that the trend has been to use special purpose backends for vector search, such as the Non-Metric Space Library (NMSLib), the GPU-accelerated FAISS library for efficient similarity search, or hybrid search engines such as Vespa. So while it might be intellectually satisfying to try to make a single platform serve all your search needs, it might be more worthwhile to see if one of these other toolkits provides better results.

I also wanted to acknowledge the work of Simon Hughes. The approaches used in this poster are based heavily on ideas originally proposed by him on his DiceTechJobs/VectorsInSearch project, as well as ideas and suggestions proposed by him and others on the Relevancy and Matching Technology #vectors-in-search slack channel.

Sunday, September 29, 2019

Searching for Answer Candidate Passages with Solr and Anserini

I just got back from our company's internal Search Summit at our offices in Raleigh, NC -- the conference is in its third year, and it has grown quite a bit from its humble beginnings. We even have our own conference sticker! The conference was one day of multi-track workshops, and two days of single track presentations. Our Labs team conducted a workshop on Bidirectional Encoder Representations from Transformers (or BERT), and I presented my results on BERT based Open Domain Question Answering.

Our BERT based Question Answering pipeline is inspired by the End-to-end Open-Domain Question Answering with BERTSerini paper from Prof Jimmy Lin's team at the University of Waterloo. We are using our own content from ScienceDirect, and we have been trying BERT, pre-trained variants such as BioBERT and SciBERT, and other models such as XLNet and AllenNLP BiDAF, fine-tuned with the SQuAD 1.1 and 2.0 datasets, and we are using the pipeline variants to answer our own set of questions in the scientific domain. Overall, we have gotten the best results from SciBERT+SQuAD 1.1, but we are looking at fine-tuning with SQuAD 2.0 to see if we can get additional signal when the model abstains from answering.

The figure below shows the BERTSerini pipeline (as described in the BERTSerini paper). In this post, I want to describe the Anserini Retriever component and our implementation of it as a Solr plugin. The Anserini Retriever is an open source IR toolkit described in Anserini: Enabling the Use of Lucene for Information Retrieval Research. It was originally built as a way to experiment with running things like TREC benchmarks, as described in Toward Reproducible Baselines: The Open-Source IR Reproducibility Challenge. It implements a pluggable strategy for handling question style queries against a Lucene index. The code for the Anserini project is available at castorini/anserini on Github.

Functionally, the Anserini retriever component takes as input a string representing a question, and returns a set of results that can be used as candidate passages by the Question Answering module. The pipeline consists of two steps -- query expansion and results reranking. Anserini offers multiple ways to do each step, allowing the caller to mix and match these strategies to create customized search pipelines. It also offers multiple pluggable similarity strategies, the most commonly used of which seem to be BM25 (nowadays the default similarity for Lucene and its derivative platforms) and QL (Query Likelihood). The question is parsed by the query expansion step and sent to the index; I will call this query A. Results from query A are then reranked -- the reranking is really another (filtered) query to the index, which I will call query B.
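The two-step flow can be summarized as a tiny skeleton with pluggable expansion and reranking steps. This is a sketch of the architecture only; the function names and signatures are mine, not Anserini's.

```python
def identity_rerank(question, results):
    """The Identity strategy: pass query A's results through unchanged."""
    return results

def search_pipeline(question, expand, search, rerank, k=50):
    """Anserini-style two-step retrieval: expand the question into query A,
    fetch the top-k results from the index, then hand them to the reranking
    strategy, which may issue its own query B against the index."""
    query_a = expand(question)
    results_a = search(query_a, rows=k)
    return rerank(question, results_a)
```

Swapping BoW for SDM, or RM3 for Axiomatic, is then just a matter of passing different callables for `expand` and `rerank`.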

Query Expansion strategies include Bag of Words (BoW) and the Sequential Dependency Model (SDM). Bag of Words is fairly self-explanatory: it's just an OR query of all the tokens in the query, after stopword removal and, optionally, synonym expansion. SDM is only slightly more complex; it is a weighted query with three main clauses. The first clause is a Bag of Words. The second clause is an OR query of neighboring bigram tokens where proximity and order are both important, and the third clause is an OR query of neighboring bigram tokens where proximity is relaxed and order is not important. The three clauses are combined in a weighted compound OR query, with default weights of (0.85, 0.1, 0.05).
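A rough sketch of SDM query construction is shown below. Anserini actually builds these clauses as Lucene SpanNear queries; the sloppy-phrase string syntax here only approximates the ordered/unordered distinction, and the slop values are illustrative guesses, not Anserini's defaults.

```python
def sdm_query(tokens, weights=(0.85, 0.1, 0.05),
              slop_ordered=1, slop_unordered=8):
    """Build an SDM-style query string with three weighted clauses:
    bag of words, near-ordered bigrams, and relaxed unordered bigrams."""
    bigrams = list(zip(tokens, tokens[1:]))
    bow = " ".join(tokens)
    ordered = " ".join('"{} {}"~{}'.format(a, b, slop_ordered)
                       for a, b in bigrams)
    unordered = " ".join('"{} {}"~{}'.format(a, b, slop_unordered)
                         for a, b in bigrams)
    return "({})^{} ({})^{} ({})^{}".format(
        bow, weights[0], ordered, weights[1], unordered, weights[2])
```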

The query (Query A) is sent to the index, which returns results. We take the top K results (configurable, we use K=50 as our default) and send them to the result reranking step. Anserini provides three pluggable reranking algorithms: RM3 (Relevance Model 3), Axiomatic, and Identity. RM3 computes feature vectors for the query terms and for the results from query A. Feature vectors for the results come from the top fbTerms (default 10) terms of each of the top fbDocs (default 10) documents in the result set. Query vectors and result vectors are interpolated using a multiplier alpha (default 0.5), and the resulting top scoring terms are used to construct Query B as a weighted OR query, where the weight for each term is the score computed for it. The Axiomatic strategy is similar, except it uses a mix of the top rerankCutoff results from query A and a random set of non-results to improve recall. It uses Mutual Information (MI) between terms in the query and the results to select the top terms. As with RM3, Query B for Axiomatic is a weighted OR query consisting of the terms with the highest MI, with the corresponding MI values as weights. The Identity strategy, as the name suggests, is a no-op passthrough, which passes the output of Query A unchanged. It can be useful for debugging (in a sense, "turning off" reranking), or when the results of Query A already produce sufficiently good candidates for question answering. Finally, since Query B is its own separate query, in order to ensure that it behaves as a reranker, we want to restrict the documents returned to those in the top rerankCutoff documents from Query A. In the Solr plugin, we have implemented that as a docID filter on the top results of Query A that is added to Query B.
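An RM3-style construction of query B might look like the sketch below. This is a simplification: real RM3 interpolates smoothed language-model probabilities rather than the raw relative frequencies used here, and the function name and term-weight details are illustrative. The docID filter restricting query B to query A's top results would be applied separately, e.g. as a Solr filter query.

```python
from collections import Counter

def rm3_query(query_terms, feedback_docs, fb_terms=10, fb_docs=10, alpha=0.5):
    """Interpolate original query term weights with feedback term weights
    and emit a weighted OR query. feedback_docs is a list of token lists
    from the top results of query A."""
    # uniform weight over the original query terms
    q_weights = {t: 1.0 / len(query_terms) for t in query_terms}
    # feedback weights: relative frequency of the top fb_terms terms
    # across the top fb_docs documents
    counts = Counter(t for doc in feedback_docs[:fb_docs] for t in doc)
    top = dict(counts.most_common(fb_terms))
    total = sum(top.values())
    fb_weights = {t: c / total for t, c in top.items()}
    # interpolate: alpha * original + (1 - alpha) * feedback
    terms = set(q_weights) | set(fb_weights)
    weights = {t: alpha * q_weights.get(t, 0.0)
                  + (1 - alpha) * fb_weights.get(t, 0.0) for t in terms}
    return " OR ".join("{}^{:.3f}".format(t, w) for t, w in
                       sorted(weights.items(), key=lambda kv: -kv[1]))
```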

Pluggable similarities is probably a bit of a misnomer. Way back, Lucene offered a single similarity implementation -- a variant of TF-IDF. Later they started offering BM25 as an alternative, and since Lucene 7.x (I believe), BM25 has become the default Similarity implementation. However, probably as a result of the changes needed to accommodate BM25, it became easier to add newer similarity implementations, and recent versions of Lucene offer a large variety of them, as you can see from the Javadocs for Lucene 8 Similarity. However, similarities are associated with index fields, so a pluggable similarity will only work if you indexed your field with the appropriate similarity in the first place. Anserini offers quite a few similarity implementations, corresponding to the different similarities available in Lucene. However, we noticed that in our case, we just needed BM25 and QL (Query Likelihood, corresponding to Lucene's LMDirichletSimilarity), so our Solr plugin just offers these two.

When I set out to implement the BERTSerini pipeline, my original thought was to leverage the Anserini code directly. However, I decided against it for a number of reasons. First, the scripts I saw in their repository suggested that the primary use case is running large benchmarks with different parameters in batch mode, whereas my use case (at least initially) was more interactive. Second, our index is fairly large, consisting of 4000 books from ScienceDirect, which translates to approximately 42 million records (paragraphs) and takes up approximately 150 GB of disk space, so we are constrained to build it on a cloud provider's machine (AWS in our case). With a bare Lucene index, the only way to "look inside" is Luke, a desktop application that is harder to forward to your local machine over SSH than HTTP. For these reasons I decided on using Solr as my indexing platform, and implementing the necessary search functionality as a Solr plugin.

Once I understood the functionality Anserini offered, it took just 2-3 days to implement the plugin and access it from inside a rudimentary web application. The figure below shows the candidate passages for a question that should be familiar to many readers of this blog -- How is market basket analysis related to collaborative filtering? If you look at the top 3 (visible) paragraphs returned, they seem like pretty good candidate passages. Overall, the (BM25 + BoW + RM3) strategy seems to return good passages for question answering.

While the plugin is currently usable as-is, i.e., it is responsive and produces good results, the code relies on copying functionality (and sometimes chunks of code) from the Anserini codebase rather than using Anserini as a library. In fact, the initial implementation (in the "master" branch) does not have any dependencies on the Anserini JAR. For long term viability, it makes sense to have the plugin depend on Anserini. I am currently working with Prof Lin to make that happen, and partially working code to do this is available in the "b_anserini_deps" branch.

The code for the Solr plugin (and documentation on how to install and use it) can be found in the elsevierlabs-os/anserini-solr-plugin repository on Github. My employer (Elsevier, Inc.) open sourced the software so we could (a) make it more robust, as described above, in consultation with Prof Lin's team, and (b) provide a tool for Solr users interested in exposing Anserini's excellent candidate passage generation functionality for question answering.

If you are working in this space and are looking for a good tool to retrieve candidate passages for questions, I think you will find the Solr plugin very useful. If you end up using it, please let us know what you think, including how it could be improved.