I just got back from our company's internal Search Summit at our offices in Raleigh, NC -- the conference is in its third year, and it has grown quite a bit from its humble beginnings. We even have our own conference sticker! The conference was one day of multi-track workshops and two days of single-track presentations. Our Labs team conducted a workshop on Bidirectional Encoder Representations from Transformers (or BERT), and I presented my results on BERT-based Open Domain Question Answering.
Our BERT-based Question Answering pipeline is inspired by the End-to-end Open-Domain Question Answering with BERTSerini paper from Prof Jimmy Lin's team at the University of Waterloo. We are using our own content from ScienceDirect, and we have been trying BERT, pre-trained variants such as BioBERT and SciBERT, and other models such as XLNet and AllenNLP BiDAF, fine-tuned with the SQuAD 1.1 and 2.0 datasets, and using the pipeline variants to answer our own set of questions in the scientific domain. Overall, we have gotten the best results from SciBERT + SQuAD 1.1, but we are looking at fine-tuning with SQuAD 2.0 to see if we can get additional signal when the model abstains from answering.
The figure below shows the BERTSerini pipeline (as described in the BERTSerini paper). In this post, I want to describe the Anserini Retriever component and our implementation of it as a Solr plugin. Anserini itself is an open source IR toolkit described in Anserini: Enabling the Use of Lucene for Information Retrieval Research. It was originally built as a way to experiment with running things like TREC benchmarks, as described in Toward Reproducible Baselines: The Open-Source IR Reproducibility Challenge. It implements a pluggable strategy for handling question-style queries against a Lucene index. The code for the Anserini project is available at castorini/anserini on Github.
Functionally, the Anserini retriever component takes as input a string representing a question, and returns a set of results that can be used as candidate passages by the Question Answering module. The pipeline consists of two steps -- query expansion and result reranking. Anserini offers multiple ways to do each step, allowing the caller to mix and match these strategies to create customized search pipelines. It also offers multiple pluggable similarity strategies, the most commonly used of which seem to be BM25 (nowadays the default similarity for Lucene and its derivative platforms) and QL (Query Likelihood). The question is parsed by the query expansion step into a query, which I will call Query A, and sent to the index. Results from Query A are then reranked -- the reranking is really another (filtered) query to the index, which I will call Query B.
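To make the two-step flow concrete, here is a minimal sketch in Java of how the pieces could fit together. This is not the actual plugin code -- the CandidateRetriever, QueryExpander, and Reranker names are hypothetical -- it just illustrates the expand / search / rerank / search-again sequence described above.

```java
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

// Hypothetical sketch of the two-step retrieval flow; names are illustrative.
interface QueryExpander {                  // e.g. Bag of Words or SDM
    Query expand(String question) throws Exception;
}

interface Reranker {                       // e.g. RM3, Axiomatic or Identity
    Query buildRerankQuery(String question, TopDocs firstPass, IndexSearcher searcher) throws Exception;
}

public class CandidateRetriever {
    private final IndexSearcher searcher;
    private final QueryExpander expander;
    private final Reranker reranker;

    public CandidateRetriever(IndexSearcher searcher, QueryExpander expander, Reranker reranker) {
        this.searcher = searcher;
        this.expander = expander;
        this.reranker = reranker;
    }

    // Expand the question into Query A, run it, build Query B from the top K
    // hits, then run Query B (restricted to those hits) to get the candidates.
    public ScoreDoc[] retrieve(String question, int k) throws Exception {
        Query queryA = expander.expand(question);
        TopDocs firstPass = searcher.search(queryA, k);
        Query queryB = reranker.buildRerankQuery(question, firstPass, searcher);
        return searcher.search(queryB, k).scoreDocs;
    }
}
```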
Query Expansion strategies include Bag of Words (BoW) and the Sequential Dependency Model (SDM). Bag of Words is fairly self-explanatory: it's just an OR query of all the tokens in the query, after stopword removal and, optionally, synonym expansion. SDM is only slightly more complex; it is a weighted query with three main clauses. The first clause is a Bag of Words. The second clause is an OR query of neighboring bigram tokens where proximity and order are both important, and the third clause is an OR query of neighboring bigram tokens where proximity is relaxed and order is not important. The three clauses are weighted in a compound OR query; the default weights are (0.85, 0.1, 0.05).
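As a rough illustration, an SDM-style query can be assembled from standard Lucene query classes along the lines of the sketch below. The field name "text" is illustrative, tokens are assumed to be already analyzed and stopworded, and the proximity windows (1 for the ordered bigrams, 8 for the unordered ones) are my own illustrative choices rather than Anserini's defaults; only the clause weights 0.85/0.1/0.05 come from the defaults mentioned above.

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.spans.SpanNearQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

// Hypothetical sketch of an SDM-style query over a field named "text".
public class SdmQueryBuilder {

    public Query build(String[] tokens) {
        // Clause 1: bag of words (OR of unigrams)
        BooleanQuery.Builder bow = new BooleanQuery.Builder();
        for (String token : tokens) {
            bow.add(new TermQuery(new Term("text", token)), BooleanClause.Occur.SHOULD);
        }
        // Clause 2: ordered bigrams, tight proximity; Clause 3: unordered, relaxed proximity
        BooleanQuery.Builder ordered = new BooleanQuery.Builder();
        BooleanQuery.Builder unordered = new BooleanQuery.Builder();
        for (int i = 0; i < tokens.length - 1; i++) {
            SpanQuery[] pair = new SpanQuery[] {
                new SpanTermQuery(new Term("text", tokens[i])),
                new SpanTermQuery(new Term("text", tokens[i + 1]))
            };
            ordered.add(new SpanNearQuery(pair, 1, true), BooleanClause.Occur.SHOULD);
            unordered.add(new SpanNearQuery(pair, 8, false), BooleanClause.Occur.SHOULD);
        }
        // Weighted compound OR of the three clauses (default weights 0.85, 0.1, 0.05)
        return new BooleanQuery.Builder()
            .add(new BoostQuery(bow.build(), 0.85f), BooleanClause.Occur.SHOULD)
            .add(new BoostQuery(ordered.build(), 0.1f), BooleanClause.Occur.SHOULD)
            .add(new BoostQuery(unordered.build(), 0.05f), BooleanClause.Occur.SHOULD)
            .build();
    }
}
```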
The query (Query A) is sent to the index, which returns results. We take the top K results (configurable, we use K=50 as our default) and send them to the result reranking step. Anserini provides three pluggable reranking algorithms: RM3 (Relevance Model 3), Axiomatic, and Identity. RM3 computes feature vectors for the query terms and for the results from Query A. Feature vectors for the results come from the top fbTerms (default 10) of each of the top fbDocs (default 10) documents in the result set. Query vectors and result vectors are interpolated using a multiplier alpha (default 0.5), and the resulting top-scoring terms are used to construct Query B as a weighted OR query, where the weight for each term is the score computed for it. The Axiomatic strategy is similar, except it uses a mix of the top rerankCutoff results from Query A and a random set of non-results to improve recall. It uses Mutual Information (MI) between query terms in the query and the results to pick the top expansion terms. As with RM3, Query B for Axiomatic is a weighted OR query consisting of the terms with the highest MI, and the corresponding weights are the MI values for those terms. The Identity strategy, as the name suggests, is a no-op passthrough, which passes the output of Query A through unchanged. It can be useful for debugging (in a sense, "turning off" reranking), or when the results of Query A already produce sufficiently good candidates for question answering. Finally, since Query B is its own separate query, in order to ensure that it behaves as a reranker, we want to restrict the documents returned to the top rerankCutoff documents from Query A. In the Solr plugin, we have implemented that as a docID filter on the top results of Query A, which is added to Query B.
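The sketch below shows roughly what such a Query B could look like in Lucene terms: a weighted OR over the expansion terms (the term-to-weight map would come from RM3 scores or Axiomatic MI values), plus a FILTER clause that restricts matches to the unique keys of the top documents from Query A. The field names "text" and "id" and the builder class are illustrative assumptions, not the plugin's actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.BoostQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermInSetQuery;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.util.BytesRef;

// Hypothetical sketch of building Query B: weighted OR of expansion terms,
// filtered down to the documents returned by Query A so it acts as a reranker.
public class RerankQueryBuilder {

    public Query build(Map<String, Float> weightedTerms, List<String> topDocIdsFromQueryA) {
        BooleanQuery.Builder builder = new BooleanQuery.Builder();
        // Weighted OR over the expansion terms (weight = RM3 score or MI value)
        for (Map.Entry<String, Float> entry : weightedTerms.entrySet()) {
            Query termQuery = new TermQuery(new Term("text", entry.getKey()));
            builder.add(new BoostQuery(termQuery, entry.getValue()), BooleanClause.Occur.SHOULD);
        }
        // Restrict to the rerankCutoff documents returned by Query A,
        // so Query B reorders those candidates rather than doing a fresh search
        List<BytesRef> ids = new ArrayList<>();
        for (String id : topDocIdsFromQueryA) {
            ids.add(new BytesRef(id));
        }
        builder.add(new TermInSetQuery("id", ids), BooleanClause.Occur.FILTER);
        return builder.build();
    }
}
```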
Pluggable similarities is probably a bit of a misnomer. Way back, Lucene offered a single similarity implementation -- a variant of TF-IDF. Later it started offering BM25 as an alternative, and since Lucene 7.x (I believe), BM25 has been the default Similarity implementation. However, probably as a result of the changes needed to accommodate BM25, it became easier to add newer similarity implementations, and recent versions of Lucene offer a large variety of them, as you can see from the Javadocs for the Lucene 8 Similarity class. However, similarities are associated with index fields, so a pluggable similarity will only work if you indexed your field with the appropriate similarity in the first place. Anserini offers quite a few similarity implementations, corresponding to the different similarities available in Lucene. However, we noticed that in our case we just needed BM25 and QL (Query Likelihood, corresponding to Lucene's LMDirichletSimilarity), so our Solr plugin offers just these two.
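For reference, switching between these two similarities at the Lucene level is just a matter of setting the right Similarity on the searcher (and, per the caveat above, using the same one at index time). The parameter values below (k1=0.9, b=0.4 for BM25 and mu=1000 for QL) are values commonly used in IR experiments, assumed here for illustration rather than taken from the plugin's configuration.

```java
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.search.similarities.LMDirichletSimilarity;

// Minimal sketch: selecting between the two similarities the plugin supports.
// Note: for consistent scoring, the same similarity should also be set on the
// IndexWriterConfig when the field is indexed.
public class SimilarityConfig {

    public static void configure(IndexSearcher searcher, String scheme) {
        if ("bm25".equals(scheme)) {
            searcher.setSimilarity(new BM25Similarity(0.9f, 0.4f));
        } else if ("ql".equals(scheme)) {
            // Query Likelihood with Dirichlet smoothing
            searcher.setSimilarity(new LMDirichletSimilarity(1000f));
        } else {
            throw new IllegalArgumentException("Unknown similarity: " + scheme);
        }
    }
}
```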
When I set out to implement the BERTSerini pipeline, my original thought was to use the Anserini code directly (it works against a raw Lucene index). However, I decided against it for a number of reasons. First, the scripts I saw in their repository suggested that the primary use case is running large benchmarks with different parameters in batch mode, whereas my use case (at least initially) was more interactive. Second, our index is fairly large, consisting of 4000 books from ScienceDirect, which translates to approximately 42 million records (paragraphs) and takes up approximately 150 GB of disk space, so we are constrained to build it on a cloud provider's machine (AWS in our case). With a raw Lucene index, the only way to "look inside" is Luke, and forwarding Luke to your local machine over SSH is harder than forwarding HTTP. For these reasons I decided on using Solr as my indexing platform, and on implementing the necessary search functionality as a Solr plugin.
Once I understood the functionality Anserini offered, it took just 2-3 days to implement the plugin and access it from inside a rudimentary web application. The figure below shows the candidate passages for a question that should be familiar to many readers of this blog -- How is market basket analysis related to collaborative filtering? If you look at the top 3 (visible) paragraphs returned, they seem like pretty good candidate passages. Overall, the (BM25 + BoW + RM3) strategy seems to return good passages for question answering.
While the plugin is currently usable as-is, i.e., it is responsive and produces good results, the code relies exclusively on copying functionality (and sometimes chunks of code) from the Anserini codebase rather than using Anserini as a library. In fact, the initial implementation (in the "master" branch) does not have any dependencies on the Anserini JAR. For long-term viability, it makes sense to have the plugin depend on Anserini. I am currently working with Prof Lin to make that happen, and some partially working code to do this is available in the branch "b_anserini_deps".
The code for the Solr plugin (and documentation on how to install and use it) can be found in the elsevierlabs-os/anserini-solr-plugin repository on Github. My employer (Elsevier, Inc.) open sourced the software so we could (a) make it more robust, as described above, in consultation with Prof Lin's team, and (b) provide a tool for Solr users interested in exposing the awesome candidate passage generation functionality for question answering that Anserini provides.
If you are working in this space and are looking for a good tool to extract candidate passages from questions, I think you will find the Solr plugin very useful. If you end up using it, please let us know what you think, including how it could be improved.