Saturday, January 12, 2019

New Deployment Options in SoDA v2.x


I released version 2.0 of the Solr Dictionary Annotator (SoDA), an open source project hosted by Elsevier, around the middle of last year. While some of the features were driven by user feedback from within the company and by my general dissatisfaction with some of the results, much of the impetus for the release came from the team at SWIFT Innovation Labs, who decided to use SoDA as part of their solution for Address Entity Resolution. Because the SWIFT team needed functionality that wasn't available in SoDA v1.x but seemed quite easy to add, I embarked on what ended up being a completely new version. I am also grateful to the SWIFT team for their feedback on new and existing (but rewritten) functionality, and for including me as a co-author on the paper they are in the process of publishing about their work.

From the outside, the changes in SoDA version 2.0 are mostly evolutionary in nature. The biggest one is the removal of the non-streaming API and the addition of three new matching modes to the streaming API. I removed the non-streaming API because its annotation results were not very good, and most people ended up not using it. The three new matching modes came about as a result of a conversation with my colleague Matt Corkum, in which we quite literally brainstormed our way into them. Other changes resulted from improvements in the underlying infrastructure, such as the replacement of the Memory Postings format in Solr (and SolrTextTagger) with the new FST Postings format, which effectively freed SoDA dictionary sizes from JVM heap size limitations.

A few other changes are important to mention, chief among them a much cleaner, more consistent (and unfortunately non-backwards-compatible) JSON interface. You can interact with the service using your own JSON-over-HTTP client, but a programmable API is also exposed, with clients provided for both Python and Scala. Finally, because the 2018 me did not care too much for the code written by the 2016 me, and because of the changes described above, the code ended up being fairly heavily rewritten, with the result that it is cleaner and hopefully more maintainable. A full list of changes can be found in the Change log for v2.0.
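To give a flavor of the JSON interface, here is a minimal sketch of a call from Python using the requests library. The endpoint path, lexicon name, and JSON field names below are illustrative placeholders, not the actual contract; see the installation guide for the real parameter names.

    import requests

    # Illustrative call to a SoDA annotation endpoint; the URL and the
    # field names ("lexicon", "text", "matching") are placeholders.
    payload = {
        "lexicon": "countries",   # dictionary to annotate against
        "text": "Institute of Physics, University of Sao Paulo, Brazil",
        "matching": "exact",      # one of the supported matching modes
    }
    resp = requests.post("http://localhost:8080/soda/annot.json", json=payload)
    resp.raise_for_status()
    for annotation in resp.json():
        print(annotation)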

Today's post, however, is not about the new features listed above, but about some new functionality I recently added to the project. Before I describe it, let me provide a little background to justify why I think it might be useful.

Like many companies, we have moved our data center to the cloud, specifically to Amazon Web Services (AWS), so we have a SoDA server running 24x7 in our AWS cloud, listening for annotation requests. Actual API usage is not very heavy, since annotating text is not a frequent activity to begin with, and SoDA is one of at least five available annotation engines (to my knowledge), some of which are domain specific. However, we incur AWS charges even while the server sits idle. Ideally, we would like to spend our AWS dollars more efficiently.

On the other hand, for the large annotation jobs that come up occasionally, a single server can be too limiting, and a better option would be a cluster of multiple servers behind a load balancer. Unfortunately, setting this up is a tedious and manual process, and more often than not, the people running these jobs just decide to do without the annotations available through SoDA.

The first thing I thought of was to keep the (single) SoDA server turned off while not in use and only start it on demand. The problem is that I would have to go in and manually start the services (Solr, and Jetty/Tomcat for SoDA) each time. As you can imagine, this can get tedious very quickly. One idea that occurred to me was to leverage the Unix startup/shutdown capabilities (the /etc/rc.d stuff, for all you Unix old-timers) to do this. Reading further, I found that the current way to do this is through the systemd daemon, so I used it to build an "application service" that starts and stops my two services. I describe the steps in the Automatic Startup and Shutdown section of my SoDA v2.x installation guide.
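For readers unfamiliar with systemd, here is a hedged sketch of what such a unit might look like for the SoDA side. The unit name, paths, and start/stop scripts are hypothetical stand-ins; the actual units are described in the installation guide.

    # /etc/systemd/system/soda.service -- hypothetical unit, paths and
    # script names are placeholders for illustration only.
    [Unit]
    Description=SoDA application service (Jetty serving SoDA)
    After=network.target solr.service
    Requires=solr.service

    [Service]
    Type=oneshot
    RemainAfterExit=yes
    ExecStart=/opt/soda/bin/start-soda.sh
    ExecStop=/opt/soda/bin/stop-soda.sh

    [Install]
    WantedBy=multi-user.target

With units like this in place, systemctl enable brings the services up automatically at boot and shuts them down cleanly at halt.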

Once I was able to do this, the next step was to "freeze" this instance as an Amazon Machine Image (AMI). This has a number of advantages, chief among them the ability to spin up clusters of multiple instances of the AMI, as shown in the right hand part of the figure below. It also provides a "last known checkpoint" when it comes to building up dictionaries and doing software updates, as shown in the left hand part of the figure.


SoDA v2.x now comes with two Python scripts, master_instance.py and cluster_instance.py, both callable from the command line, that spin up the configurations on the left and right of the AMI in the figure above, respectively. Details about how to call them are in the AMI Maintenance (AWS) and Spinning up Read-only Clusters (AWS) sections of the SoDA v2.x installation guide.

Both scripts use the Boto3 library, the AWS SDK for Python, to communicate with the AWS EC2 subsystem to start and stop the clusters. The master_instance.py script also has functionality to save the current state to a new AMI and to load instances from it, thus allowing the AMI to evolve as software and dictionary changes are made.
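The Boto3 calls involved are fairly straightforward. The sketch below shows the general pattern the scripts follow, under the assumption of placeholder AMI and instance identifiers; the real scripts add argument parsing, tagging, and error handling on top of this.

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Spin up N read-only instances from the frozen SoDA AMI (placeholder id).
    resp = ec2.run_instances(ImageId="ami-0123456789abcdef0",
                             InstanceType="m4.xlarge",
                             MinCount=4, MaxCount=4)
    instance_ids = [inst["InstanceId"] for inst in resp["Instances"]]

    # Block until all instances are up.
    ec2.get_waiter("instance_running").wait(InstanceIds=instance_ids)

    # Freeze the master instance into a new AMI after dictionary/software updates.
    ec2.create_image(InstanceId="i-0123456789abcdef0",   # placeholder master id
                     Name="soda-v2-with-new-dictionaries")

    # Tear the cluster down when the job is done.
    ec2.terminate_instances(InstanceIds=instance_ids)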

Taken together, these scripts now allow an annotator to spin up a cluster with a specified number of SoDA instances behind a load balancer, without having to ask someone to do it for them, and allow maintainers to manage the master SoDA instance (and AMI).

However, this still does not fully solve our original problem of having to run a SoDA instance 24x7, although it goes a long way towards making a solution possible. The most common use case for us is to access SoDA from a Spark notebook in the Databricks environment, so to support that, we would still need a server reachable at a specific host and port from within the Databricks network.

One solution we have discussed internally, but for which I did not have the necessary knowledge (I still don't, but I am in the process of acquiring it, thanks to the AWS Developer: Building on AWS course on edX), is to have an Amazon API Gateway call trigger an AWS Lambda function that starts a small cluster listening on the host and port specified in the Databricks notebook. That would finally allow us to shut down our 24x7 SoDA server, and give potential SoDA users on Databricks the ability to spin up a cluster of the designated size as needed, via a single API call.
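As a thought experiment, the Lambda function itself could be quite small, something along the lines of the Boto3 sketch below. The event fields and the AMI id are placeholders, and wiring up the API Gateway trigger and the load balancer is the part I still need to learn.

    import boto3

    def lambda_handler(event, context):
        # Hypothetical handler: spin up a SoDA cluster of the requested size.
        # "cluster_size" is an assumed field in the API Gateway request payload.
        size = int(event.get("cluster_size", 1))
        ec2 = boto3.client("ec2")
        resp = ec2.run_instances(ImageId="ami-0123456789abcdef0",  # placeholder
                                 InstanceType="m4.xlarge",
                                 MinCount=size, MaxCount=size)
        ids = [inst["InstanceId"] for inst in resp["Instances"]]
        return {"statusCode": 200, "body": {"instance_ids": ids}}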