Thursday, November 26, 2015

SoDA - A Dictionary Based Entity Recognition Tool

Last month I presented a talk at Spark Summit Europe 2015 about a system I have been working on for a while. The system provides a Dictionary based Entity Recognition Microservice based on Solr, SolrTextTagger and OpenNLP. You can find the Abstract, Slides and Video for the talk here. In this post, I describe why I built it and what we are using it for.


My employer, the Reed-Elsevier (RELX) Group, is the world's leading provider of Science and Technology Information. Our charter is to build data and information solutions that help our users (usually STM researchers) achieve better results. Our group at Elsevier Labs is building a Machine Reading Pipeline to distill information from our books and journals into rich domain-specific Knowledge Graphs, which we hope can be used to make new inferences about the state of our world.

Knowledge graphs (like any other graph) consist of vertices and edges. The vertices represent concepts in the STM universe, and the edges represent the relationships between those concepts. The concepts at the vertices may be generic, such as "surgeon", or may be specific entities such as "Dr. Jane Doe". In order to build knowledge graphs, we need a way to recognize and extract concepts and entities from the text, a process known as entity recognition.
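To make the vertex and edge vocabulary concrete, here is a minimal in-memory sketch. The class name, the "is_a" relation and the identifiers are all hypothetical, chosen for illustration only; this is not how our pipeline stores its graphs.

```python
from collections import defaultdict


class KnowledgeGraph:
    """Toy graph: vertices are concepts or entities, edges are labeled relationships."""

    def __init__(self):
        self.vertices = {}              # vertex ID -> {"label": ..., "kind": ...}
        self.edges = defaultdict(list)  # source ID -> [(relation, target ID), ...]

    def add_vertex(self, vertex_id, label, kind):
        self.vertices[vertex_id] = {"label": label, "kind": kind}

    def add_edge(self, source, relation, target):
        # Refuse edges whose endpoints were never registered as vertices.
        if source not in self.vertices or target not in self.vertices:
            raise KeyError("both endpoints must be added as vertices first")
        self.edges[source].append((relation, target))

    def neighbors(self, source, relation=None):
        # Return target IDs reachable from `source`, optionally filtered by relation.
        return [t for (r, t) in self.edges.get(source, [])
                if relation is None or r == relation]


graph = KnowledgeGraph()
graph.add_vertex("c1", "surgeon", "concept")      # a generic concept
graph.add_vertex("e1", "Dr. Jane Doe", "entity")  # a specific entity
graph.add_edge("e1", "is_a", "c1")                # hypothetical relation name
```

Entity recognition is the step that would supply the vertices; relationship extraction, which this post does not cover, would supply the edges.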


The easiest way to get started with entity recognition is to use pre-trained statistical Named Entity Recognizers (NERs) available in off-the-shelf Natural Language Processing (NLP) libraries. However, these NERs are trained to recognize a very small and general class of entities such as names of people and places, organizations, etc. While there is value in recognizing these classes, we are typically interested in finding more specific subclasses of these classes (such as universities rather than just any organization) or completely different classes (such as protein names).

Further, STM content is very diverse. While there may be some overlap, entities of interest in one subject (say math) are typically very different from entities of interest in another (say biology). Fortunately, well-curated vocabularies exist for most STM disciplines, which we can leverage in our entity recognition efforts.

Because of this, our approach to NER is dictionary based. Dictionary-based entity matching is a process where snippets of text are matched against a dictionary of terms that represent entities. While this approach may not be as resilient to previously unseen entities as the statistical approach described earlier, it requires no manual tagging, and given enough data, achieves comparable coverage. Dictionary-based matching can also be used to create training data to build custom statistical NERs tailored for different domains, thus achieving the best of both worlds.
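As a rough illustration of the idea (and not SoDA's implementation), a dictionary matcher can be sketched as a greedy longest-match over token n-grams. The function names, the `max_tokens` cap and the sample entries are assumptions made for this example.

```python
import re


def build_dictionary(entries):
    """Map each lowercased term to its entity ID. `entries` is (entity_id, term) pairs."""
    return {term.lower(): entity_id for entity_id, term in entries}


def match_dictionary(text, dictionary, max_tokens=5):
    """Greedy longest-match, case-insensitive lookup of token n-grams.

    Returns (entity_id, matched_text, begin, end) tuples with character offsets.
    """
    tokens = [(m.start(), m.end()) for m in re.finditer(r"\w+", text)]
    annotations = []
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest n-gram starting at token i first, then shorter ones.
        for n in range(min(max_tokens, len(tokens) - i), 0, -1):
            begin, end = tokens[i][0], tokens[i + n - 1][1]
            candidate = text[begin:end]
            entity_id = dictionary.get(candidate.lower())
            if entity_id is not None:
                annotations.append((entity_id, candidate, begin, end))
                i += n  # skip past the matched tokens
                matched = True
                break
        if not matched:
            i += 1
    return annotations


dictionary = build_dictionary([("D001", "heart attack"), ("D002", "aspirin")])
text = "Aspirin reduces the risk of heart attack."
annotations = match_dictionary(text, dictionary)
```

This sketch only matches terms whose internal whitespace agrees with the dictionary entry, and each lookup re-scans the text, so it is meant to show the shape of the problem rather than a scalable solution.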

Dictionary-based matching techniques are usually based on the Aho-Corasick algorithm, in which the dictionary is held in a compact in-memory data structure against which input text is streamed, matching all dictionary entries simultaneously. The problem with this technique is that it breaks down for large dictionaries, since the corresponding memory requirements also become large. Duplicating the dictionary on all nodes of a Spark cluster could be difficult because of its size.
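For readers unfamiliar with the algorithm, a compact textbook-style sketch of Aho-Corasick follows. It is a plain Python illustration, not the GATE or SolrTextTagger code, and the function names are my own. Note that the whole dictionary lives in the `goto`, `fail` and `output` tables, which is exactly the memory cost described above.

```python
from collections import deque


def build_automaton(terms):
    """Build the goto, failure and output tables for a set of terms."""
    goto = [{}]    # goto[state][char] -> next state (a trie)
    fail = [0]     # fail[state] -> state to fall back to on a mismatch
    output = [[]]  # output[state] -> terms that end at this state

    # Phase 1: insert every term into the trie.
    for term in terms:
        state = 0
        for ch in term:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(term)

    # Phase 2: breadth-first pass to compute failure links.
    queue = deque(goto[0].values())  # depth-1 states fall back to the root
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit terms that end at the fallback state (proper suffixes).
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def find_all(text, automaton):
    """Stream `text` through the automaton once; return (begin, end, term) matches."""
    goto, fail, output = automaton
    state = 0
    matches = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for term in output[state]:
            matches.append((i - len(term) + 1, i + 1, term))
    return matches


automaton = build_automaton(["he", "she", "his", "hers"])
matches = find_all("ushers", automaton)
```

The text is scanned in a single pass regardless of how many terms the dictionary holds, which is what makes the technique attractive until the tables no longer fit in memory.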


Our solution is called the Solr Dictionary Annotator (SoDA). It is an HTTP REST microservice that allows a client to post a block of text and get back a list of annotations. Annotations are structured objects that contain the entity identifier, the matched text, the beginning and ending character offsets of the matched text in the input text block, and the confidence of the match. Clients can specify how accurate the match should be.
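The annotation described above could be modeled along these lines. This is a sketch only: the field names are hypothetical and may not match the keys SoDA actually emits.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Annotation:
    """One matched entity; field names are illustrative, not SoDA's wire format."""
    entity_id: str      # identifier of the dictionary entry that matched
    matched_text: str   # the text span that matched
    begin: int          # starting character offset in the input text
    end: int            # ending character offset (exclusive in this sketch)
    confidence: float   # confidence of the match


def annotation_from_span(text, entity_id, begin, end, confidence=1.0):
    """Build an Annotation whose matched_text is sliced from the input text."""
    return Annotation(entity_id, text[begin:end], begin, end, confidence)


sample_text = "Aspirin reduces the risk of heart attack."
sample = annotation_from_span(sample_text, "D002", 0, 7)
sample_dict = asdict(sample)  # plain dict, ready to serialize as JSON
```

Keeping the offsets alongside the matched text lets a client verify that an annotation still lines up with its copy of the input.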

For exact and case-insensitive matching, SoDA piggybacks on a recent development from the Lucene community. Michael McCandless, a Lucene/Solr committer, figured out a way to build finite-state transducers (FSTs) with Lucene in a very memory-efficient manner, taking advantage of the fact that the index already stores terms in sorted order. David Smiley, another Solr committer, realized that FSTs could be used for text tagging, and built the SolrTextTagger plugin for Solr. In keeping with Lucene’s tradition of memory efficiency and speed, he introduced some more strategies to keep the memory footprint low without significantly impacting retrieval speed. The original dictionary used a GATE-based implementation of the Aho-Corasick algorithm that needed 80GB of RAM to store the dictionary, while the SolrTextTagger version consumed only 198MB.

For fuzzy matching, SoDA uses OpenNLP, another open source project, to chunk incoming text into phrases. Depending on the fuzziness of the matching desired, different analysis chains are applied to the incoming phrases, and they are matched against pre-normalized dictionary entries stored in the index. We borrow several ideas from the Python library FuzzyWuzzy from SeatGeek.
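The idea of progressively looser analysis chains can be sketched as follows. The chain names, their order and the helper functions are assumptions for illustration; SoDA's real chains run inside Solr and may differ, and the OpenNLP phrase chunking step is not shown. The token-sort chain mirrors the token-sort idea popularized by FuzzyWuzzy.

```python
import re
import string

_PUNCT = re.compile("[" + re.escape(string.punctuation) + "]")


def normalize_lower(phrase):
    """Case-insensitive form."""
    return phrase.lower()


def normalize_punct(phrase):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(_PUNCT.sub(" ", phrase.lower()).split())


def normalize_sorted(phrase):
    """Punctuation-free form with tokens sorted, so word order no longer matters."""
    return " ".join(sorted(normalize_punct(phrase).split()))


# Ordered from strictest to fuzziest (hypothetical level names).
CHAINS = [("lower", normalize_lower), ("punct", normalize_punct), ("sort", normalize_sorted)]


def build_index(entries):
    """Pre-normalize every dictionary term once per chain."""
    index = {name: {} for name, _ in CHAINS}
    for entity_id, term in entries:
        for name, normalize in CHAINS:
            index[name].setdefault(normalize(term), entity_id)
    return index


def lookup(phrase, index, level):
    """Try chains from strictest up to `level`; return (entity_id, chain) or None."""
    for name, normalize in CHAINS:
        entity_id = index[name].get(normalize(phrase))
        if entity_id is not None:
            return entity_id, name
        if name == level:
            break
    return None


index = build_index([("C01", "Diabetes Mellitus, Type 2")])
```

For example, `lookup("type 2 diabetes mellitus", index, "sort")` finds the entry through the token-sort chain, while the same phrase at level `"punct"` finds nothing because the word order differs.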

SoDA exposes a JSON over HTTP interface, so it's language and platform agnostic. You compose a JSON request document containing the text to be linked and the type of matching required, and send it to the REST endpoint URL via HTTP POST (some parameterless services like the status service are accessible over HTTP GET). The server responds with another JSON document containing the entities found in the text and metadata around these entities.
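A client call might be put together roughly like this with nothing but the Python standard library. The base URL, the "/annotate" path and the "text" and "matching" keys are placeholders I made up for the sketch; check the project documentation for the actual endpoint and field names.

```python
import json
import urllib.request


def build_annotation_request(base_url, text, matching="exact"):
    """Compose an HTTP POST request carrying a JSON document.

    The "/annotate" path and the JSON keys are hypothetical placeholders.
    """
    payload = json.dumps({"text": text, "matching": matching}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/annotate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_response(body):
    """Decode the JSON document returned by the server."""
    return json.loads(body.decode("utf-8"))


request = build_annotation_request(
    "http://localhost:8080/soda", "Aspirin reduces the risk of heart attack.")

# Sending the request needs a running server, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     annotations = parse_response(response.read())
```

Because the contract is just JSON over HTTP, the same request could be issued from Scala, Java, or a shell script using curl.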


At its very core, SoDA is a Spring/Scala based web application that exposes a JSON over HTTP interface on the front end and communicates with a Solr index on the back end. A variety of matching strategies are supported, from exact and case-insensitive matching to completely fuzzy matching. The diagram below shows the components that make up the SoDA application. The client is a Spark Notebook in the Databricks cloud, where the rest of our NLP pipeline also lives.

SolrTextTagger is used to serve the exact case-sensitive and case-insensitive entity matches, and OpenNLP is used to chunk incoming text to match against the underlying Solr index for the fuzzy matches. Horizontal scalability (with linear increase in throughput) is achieved by duplicating the components and putting them behind a load balancer.


Our experiments indicate that we can achieve a sustained annotation rate of 30-35 docs/second against a dictionary with 8M+ entries, where each document is about 100MB on average, with SoDA and Solr running on 2 r3.2xlarge machines behind a load balancer. We have been using SoDA for a few months now, and it has already proven itself as a useful component in our pipeline.
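For back-of-the-envelope planning, these figures can be turned into rough estimates. The sketch below assumes the linear scaling mentioned earlier holds, which is an extrapolation from the two-machine measurement rather than something measured at larger sizes; the function names and default values are mine.

```python
import math


def hours_to_annotate(num_docs, docs_per_second):
    """Wall-clock hours needed to annotate a corpus at a sustained rate."""
    return num_docs / docs_per_second / 3600.0


def machines_needed(target_docs_per_second, measured_rate=30.0, measured_machines=2):
    """Estimate the machine count for a target rate, assuming throughput scales linearly.

    The defaults use the lower end of the 30-35 docs/second range on 2 machines.
    """
    per_machine = measured_rate / measured_machines
    return math.ceil(target_docs_per_second / per_machine)


corpus_hours = hours_to_annotate(1_000_000, 30.0)  # a hypothetical one-million-document corpus
```

At the lower end of the measured range, a hypothetical corpus of one million documents would take a little over nine hours on the two-machine setup.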

My employer has been kind enough to allow me to release SoDA to the open source community. It's available on GitHub here under an Apache 2.0 license. If you are looking for Dictionary based Entity Recognition functionality and you liked what you read so far, I encourage you to download it and give it a try. I look forward to hearing your feedback.

17 comments (moderated to prevent spam):

Xiangtao Wang said...

Thanks for open-sourcing this amazing project. It is quite a useful tool for NER.

I would like to play with the project, but ran into an issue after deploying it on Jetty.

Here is the log file from Jetty:

Could you take a look and tell me what the problem is, if possible?

By the way, when I ran sbt package, I got an error for the line below.

jetty(config = "src/main/resources/jetty.xml"): it said that jetty is invalid. I guess you are using the old version (1.0) of xsbt-web-plugin,

so I changed the line as follows:

containerConfigFile in Compile := Some(file("src/main/resources/jetty.xml"))


Sujit Pal said...

Thanks for the Jetty plugin fix for the latest version of xsbt-web-plugin. I will upgrade the project and incorporate your fix.

For the error you are seeing, this is because of a NPE in line 24 of SodaUtils. It is looking for a file called You will need to copy into and update the values in there. Thanks for pointing this out, I need to update the documentation to add this instruction.

Sujit Pal said...

@Xiangtao: I have updated the documentation and upgraded to xsbt-web-plugin 2.1.0 (the latest as of today). I did not need to specify "in Compile" for containerConfigFile. Thanks again for the feedback.

Xiangtao Wang said...

Currently I am trying to use it to load all Wikipedia articles into Solr and annotate text, then use the result for entity type inference based on the categories of the wiki articles and the hierarchical category linkage (using neo4j). Your effort helps me a lot.
Much appreciated! Thanks

Xiangtao Wang said...

By the way, I have changed the code a little to use Spring Boot (with embedded Jetty and Tomcat). It is convenient for debugging and deployment, since there is no need to manually copy the war file to the Jetty server. I would like to contribute the code but do not know how...
You may want to take a look at Spring Boot.

nk_2015 said...

Hi Sujit, I saw your blog and it has been really helpful with the work on Apache Lucene that I am currently doing. So far, I have implemented TF-IDF cosine similarity on a set of IT service desk tickets to compute similarity between tickets. Before experimenting with this, I had performed topic modeling using the LDA approach with Mallet on the same set of tickets, but could not come up with concrete topics to represent the data. It is possible that I failed to interpret the results or did not set the right parameters, such as the LDA factor, number of iterations and topic count. Hence, I decided to combine both of these models to see if I could get some useful information from unlabelled data. I found an implementation for lucene-lda, but could you please guide me with the approach, since the approach mentioned in the URL has lots of dependencies other than Lucene and Mallet? Also, as I am new to the field, I am not able to decide on a workflow to implement this or to come up with some other, easier method.

Your response would be truly appreciated....

Sujit Pal said...

@Xiangtao: Very cool idea, and you are welcome, glad I could help! For the change, if you can do a patch or a pull request and post it into the Issues at the github site, I can check it out and apply.

Sujit Pal said...

@nk_2015: thanks for sharing the link to lucene-lda, seems like an interesting approach. Could be useful for situations where the similarity needs to be more "fuzzy", for example related search terms. Regarding confusing dependencies, all of these are packaged with the project - I just did a git clone and was able to do the "ant jar" successfully. If you are using Maven or similar tool, look at the jars in lib and add them to the pom.xml or equivalent file.

nk_2015 said...

Hi Sujit, thanks for the reply. Could you please elaborate on the "fuzzy" type of similarity, or suggest a blog where I could read up on it? I just need to work out whether the technique I am applying to my dataset is worth it or not.

Xiangtao Wang said...

Hi, I have attached the code to an issue on the GitHub site.

Because it has to integrate with my company's Java project, I use Maven instead of SBT.

Please let me know if you have any issues.


Sujit Pal said...

Thanks Xiangtao, I will check it out. Because it looks like I will have to do individual diffs and then try to adapt to SBT, it might take me a while before I get back.

JohnT said...

Sujit, wonderful post. As a physician and avid reader of your blog, it left me very excited. I wish I was involved in such a wonderful project!

Anyways, I haven't looked at the code as I fire off this message: but as I understand it, I can deploy this as a RESTful server and throw a dictionary at it, say a subset of the UMLS, like relevant TUIs etc, and it should achieve pretty good tagging results for those entities?

Thanks as always for your contributions to opensource,

JohnT said...

Also, how hard would it be to deploy this thing on a local box as it stands now, not as outlined in the configuration page?

Sujit Pal said...

Hi JG, thanks for the kind words, and your understanding is correct. You load (one or more) dictionaries into the index, then throw documents at it, and it comes back with a structured list of entities found in the text - each entity is a flat structure of the IDs (TUIs in your case), matched string, start and end character offset and the confidence of the match. You can deploy this on a local box as long as you have sufficient RAM. In fact, the original setup against which I did a lot of testing consisted of a single machine with SoDA (port 8080), Solr (8983) and client code running on the same machine. However, as you point out, the setup can be scaled horizontally as needed.

Sujit Pal said...

@nk_2015: apologies for the delay in replying, just saw your comment in my queue, must have missed it when you posted it. Regarding the "fuzzy" comment above, LDA is generally used to group terms into a single topic, and these terms tend to be "related" rather than "similar".

Brian Burdick said...

Hi Sujit, This is a great find - love the idea of this project. I am picking up Scala. I can run through all of your installation instructions on Github - yet when I try and package the SODA project with SBT it fails with "build.sbt:11: error: not found: value JettyPlugin enablePlugins(JettyPlugin)". It seems like some dependency is missing, but I have re-installed a few times to make sure I had the same versions of Java, Scala, SBT, etc and searched the web to no avail. Appreciate if you have any insight on why this might be failing as I would love to use this project for dictionary tagging out of multiple taxonomies.

Sujit Pal said...

Hi Brian, sorry about the delay in responding. I also have an installation-wide plugins.sbt file at $HOME/.sbt/0.13/plugins/plugins.sbt that contains the following lines (each line should end with \n\n, that is, be followed by a blank line). In retrospect, a better approach would have been to include this in the project; I will figure out how to do this and update the repository when I have some time.

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")

addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "3.0.0")

addSbtPlugin("com.earldouglas" % "xsbt-web-plugin" % "2.1.0")

Can you please check to see if this gets you past the problem?