Saturday, December 21, 2019

Finding Similar Tweets with BERT and NMSLib


Since my initial explorations with vector search for images on Lucene some time back, several good libraries and products have appeared that do a better job of computing vector similarity than my home grown solutions. One of them is the Non-Metric Space Library (NMSLib), a C++ library that implements Approximate Nearest Neighbor (ANN) techniques for efficient similarity search in non-metric spaces.

The problem with doing any kind of ranking using vector search is that you must match the query vector against every document vector in the index, compute a similarity score, then order by that similarity score. This is a O(N) operation, so the query time will increase linearly with the number of records N in the index. ANN libraries try to do better, and NMSLib uses the idea of Hierarchical Navigable Small World (HNSW) Graphs, which effectively partition the data into a hierarchical set of small world graphs based on proximity. That way searching this structure for nearest neighbors of a given record will involve searching the proximity graph of that record, which is independent of the number of records.

Over the last year or so, search has increasingly begun to move from traditional information retrieval techniques based on TF-IDF, to techniques based on neural models. To some extent, this has coincided with the release of the Bidirectional Encoder Representations from Transformers (BERT) model by Google, and its subsequent application in the NLP and Search communities for many different tasks, including similarity based document ranking. Prof. Jimmy Lin of the University of Waterloo has published an Opinion piece The Neural Hype, Justified! A Recantation at SIGIR 2019, where he explains why this is so in more detail.

In any case, this trend has led to a need for computing vector based similarity in an efficient manner, so I decided to do a little experiment with NMSLib, to get familiar with the API and with NMSLib generally, as well as check out how good BERT embeddings are for search. This post describes that experiment.

For my dataset, I selected the Health News in Twitter dataset from the UCI Machine Learning Repository. This is a dataset containing approximately 60k tweets related to 16 different Health/News organizations, and was donated to UCI as part of the paper Fuzzy Approach to Topic Discovery in Health and Medical Corpora.

I first extracted the tweet ID and text from these files and uploaded them into a SQLite3 database, so I could look the text up by ID later. I did a little cleanup of the tweet texts, to remove (shortened) URLs from the texts. Then I set up a BERT-as-a-service sentence encoding service on a GPU box, using BERT-base uncased as the underlying model, and generated the vectors for each of the sentences. Code for both these are fairly trivial and can be easily figured out from the documentation, so I will not bother to describe it here.

I then wanted to see how NMSLib scaled with change in dataset size. I had approximately 60k tweets and their vectors, so I loaded random subsets of the dataset into the NMSLib index, then ran a batch of 50 queries (also randomly sampled from the dataset) to generate their K Nearest Neighbors for K=10. I recorded the total loading time in each case, as well as the query time averaged over the 50 queries. Plots for both are shown below.






What is interesting is that load time rises approximately linearly with the number of records, but the query time is sublinear (and on a completely different timescale from indexing) and eventually seems to flatten out. So we can expect to pay a penalty for the large number of records once during indexing, but query seems to be quite fast.

I also looked at some of the results of the similarity search, and they seem to be pretty good. It is hard to think in terms of traditional similarity in case of tweets, since there is so litle content to go by, but I include below the 10 Nearest Neighbors (or equivalently, the top 10 search results) for 3 of my query tweets, and they all intuitively seem to be good, although perhaps for different reasons. In the first case, I think it has done a good job on focusing on the main content "Ebola" and "Guinea" and fetching results around these content words. In the second case, it seems to have captured the click-bait-ey spirit of the query tweet, and retured other tweets that are along the same lines. In the third case, once again, it returns results that are similar in intent to the query document, but uses completely different words.

QUERY: Ebola in Guinea puts miners in lock down (451467465793355776)
--------------------------------------------------------------------------------------------------
dist.  tweet_id            tweet_text
--------------------------------------------------------------------------------------------------
0.104  542396431495987201  Junior doctors in Sierra Leone strike over lack of Ebola care
0.110  582950953625735168  Sierra Leone Ebola lockdown exposes hundreds of suspected cases
0.112  542714193137635328  Junior doctors in Sierra Leone strike over lack of #Ebola care
0.112  517254694029524992  West Africa Ebola crisis hits tourism, compounds hunger in Gambia
0.117  497018332881498112  Ebola kills nurse in Nigeria
0.119  565914210807599104  Ebola-hit Guinea asks for funds for creaking health sector
0.120  514098379408699393  Ebola virus shutdown in Sierra Leone yields 'massive awareness'
0.120  555734676833198081  Ebola kills Red Cross nurse in Sierra Leone
0.121  583431484905754624  Sierra Leone to start laying off Ebola workers as cases fall: president
0.122  499300095402065921  Ebola Shuts Down The Oldest Hospital In Liberia

QUERY: 1 in 3 prescriptions unfilled, Canadian study finds (450767264292167681)
--------------------------------------------------------------------------------------------------
dist.  tweet_id            tweet_text
--------------------------------------------------------------------------------------------------
0.105  564909963739287552  Study finds 1 in 12 children's ER visits medication related
0.108  321311688257306624  1 in 4 skin cancer survivors skips sunscreen, study finds
0.109  161460752803311617  Only 1 in 4 Young Teens Uses Sunscreen Regularly, Study Finds:
0.110  458662217651879936  Child abuse affects 1 in 3 Canadian adults, mental health study indicates
0.112  344601130124316672  1 in 6 women at fracture clinics have been abused, study shows
0.126  160184310849224704  Abortion ends one in five pregnancies worldwide, study finds
0.127  332579818543673346  1 in 8 Boomers report memory loss, large survey finds
0.127  148844725191979009  Nearly 1 in 3 Young U.S. Adults Have Arrest Records: Study:
0.129  468857557126512640  HPV Found in Two-Thirds of Americans, Survey Finds
0.129  119455268106018816  1 in 4 U.S. Adults Treated for High Blood Pressure: Report:

QUERY: Skip the elliptical machine and walk your way to better health: (296539570206568448)
--------------------------------------------------------------------------------------------------
dist.  tweet_id            tweet_text
--------------------------------------------------------------------------------------------------
0.000  295724160922038272  Skip the elliptical machine and walk your way to better health:
0.000  294033035731546112  Skip the elliptical machine and walk your way to better health:
0.126  399914443762855936  Sweat Your Way To A Healthier Brain
0.144  304496972419702784  How to exercise your way to better sex:
0.144  293262936804311041  How to exercise your way to better sex:
0.149  557233923621527552  Need a healthy push? Turn to your partner to lose weight, quit smoking
0.152  564595829152182273  You don't need a gym to torch calories! Try this 30-minute workout 3 times a week to drop winter weight:
0.152  293006265599262722  Keep hands off the treadmill bars while you walk; you're limiting your calorie burn! Boost your treadmill workout:
0.154  541676196686491649  Kickoff your weight loss plan! Learn 25 ways to cut 500 calories a day:
0.154  551943764198301696  Learn the expert tricks to FINALLY achieving your goal weight:


So anyway, that concludes my experiment with NMSLib. Overall, I have enough now to build something real using these components. As before, the code is pretty trivial, it is modeled after code under the Example Usage section in the NMSLib Quickstart, so I am not going to repeat it here.

Of course, NMSLib is by no means the only library in this area. Others in this space that I know of are FAISS from Facebook, another similarity search library that is optimized to run on GPUs, and Vespa, a full-fledged search engine that allows for both traditional and vector search. In addition, there are plugins for vector scoring for both Elasticsearch (elasticsearch-vector-scoring) and Solr (solr-vector-scoring). So these might be options for vector search as well.

For my part, I am happy with my choice. I needed something which was easy to learn and embed, so the preference was for libraries rather than products. I also wanted it to run on a CPU based machine for cost reasons. FAISS does run on CPU as well, but based on results reported by Ben Fredrickson here, NMSLib has better performance on CPU. Also NMSLib installation is just a simple "pip install" (with the pre-compiled binary). In any case, I haven't ruled out using FAISS completely. In this system, we extract BERT embeddings for the tweets in an offline batch operation. In a real system, it is likely that such embeddings will have to be generated on demand, so such setups would be forced to include a GPU box, which could be used to also serve a FAISS index (based on Ben Fredrickson's evaluation, FAISS on GPU is approximately 7x faster than NMSLib is on CPU).




Be the first to comment. Comments are moderated to prevent spam.