Sunday, April 19, 2020

Inexact Matching Text against Dictionary using the Aho-Corasick Algorithm


In the past, I have written about matching text against dictionaries using David Smiley's SolrTextTagger project. SolrTextTagger uses Lucene's Finite State Transducer (FST) technology to implement a Finite State Machine (FSM) usable by the Aho-Corasick string matching algorithm. As indicated on its Wikipedia page, the Aho-Corasick algorithm has some nice performance properties that make it very attractive for text annotation, which is the use case I am interested in as well. To support the use case, I built the Solr Dictionary Annotator (SoDA) project, which wraps the SolrTextTagger service in a slightly more usable API and provides multiple levels of matching (exact, lowercase, stemmed, etc.).

To set up a SolrTextTagger index, or the Aho-Corasick string matching algorithm in general, the basic idea is to populate the FSM with the dictionary entries. Internally, this creates a Trie structure connecting the words of the dictionary entries to each other. At runtime, you stream text against this structure, and it matches phrases in the text that correspond to dictionary entries. The one major disadvantage of this approach is that it will only match phrases whose words appear in the same order as in the dictionary entry. For example, if my dictionary entry is "tuberculosis pulmonary", it will not match "pulmonary tuberculosis" in the text. Depending on your application, this may or may not be important. But I recently discovered a way to get around this limitation, thanks to the ideas presented in the paper Efficient Clinical Concept Extraction in Electronic Medical Records (Guo, Kakrania, Baldwin, and Syeda-Mahmood, IBM Research Almaden, 2017). Hat tip to my manager Ron Daniel for pointing me to the paper.
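To make the limitation concrete, here is a minimal sketch of the standard phrase-level setup using the PyAhoCorasick library; the phrase is the example above, and the concept ID is a made-up placeholder.

```python
# Minimal sketch of the standard phrase-level approach with PyAhoCorasick.
# The concept ID is a made-up placeholder for illustration.
import ahocorasick

A = ahocorasick.Automaton()
A.add_word("tuberculosis pulmonary", "CONCEPT-1")  # payload is the concept ID
A.make_automaton()

text = "the patient shows signs of pulmonary tuberculosis"
print(list(A.iter(text)))  # [] -- the word order differs, so nothing matches
```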

Specifically, the part that caught my eye was their example of being able to match the string "Chest CT" from the phrase "a subsequent CT scan of the chest", and their description of their approach, quoted below. As you can see, there is not much detail in there, so I am not sure if my approach is identical to theirs, but it got me thinking, resulting in the algorithm I describe in this post. In the spirit of reproducibility, the description is accompanied by my Jupyter notebook, which contains code to build and query an in-memory FSM using the PyAhoCorasick library. The notebook contains several examples of using the Aho-Corasick algorithm to match strings against an in-house medical dictionary in the inexact manner described, both against short queries (including the example cited in the paper) and against long paragraph texts.

To speed up the processing, we propose a novel indexing method that significantly reduces the search space while still maintaining the requisite flexibility in matching. First, each word in the vocabulary is represented by a unique prefix, the shortest sequence of letters that differentiates it from every other term. Next, an inverted index is created for the mapping from prefixes to report sentences. Starting from the representative prefix of each term in the vocabulary (or a set of prefixes in the case of a multi-word term), all relevant sentences can be easily retrieved as potential matches for the term, and post-filtering by longest common word sequence matching can be used to further refine the search results.

The big idea here is as follows. Instead of populating the FSM with phrases from the dictionary, we populate it with the words that the phrases are built out of. When we populated it with phrases, the payload associated with each entry was the concept ID. When we populate it with words, collisions are almost certain. For example, the word "tuberculosis" occurs 1,352 times in the dictionary. So the payload is now a list of (concept ID, synonym ID) pairs, much like the inverted index structure used in search engines. The concept ID is self-explanatory; the synonym ID is a running serial number (for example, synonym ID 0 is the primary name for the concept). We filter out stopwords and non-alpha words in an attempt to reduce the noise. Matching is case insensitive (both dictionary and text are lowercased) unless the word is an obvious abbreviation and given in all-caps.

In addition to the (concept ID, synonym ID) pair (CSID hereafter), we also store in the payload a floating point number representing the value of a match. The idea is that the string "tuberculosis" matched against the concept for "tuberculosis" is more valuable than a match against the concept for "pulmonary tuberculosis" or "pulmonary tuberculosis symptoms". The weight for each term in a synonym is the reciprocal of the number of words (minus stopwords) in the synonym. This will be used later to grade the quality of the matches. Thus the payload for each term is a list of (CSID, weight) pairs, one for each occurrence of the term in a synonym.
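A small sketch of this weighting scheme, with an illustrative stopword list (the notebook's actual list is longer):

```python
# Each term in a synonym gets the reciprocal of the number of
# non-stopword words in that synonym.
STOPWORDS = {"of", "the", "and", "in", "for"}  # illustrative only

def term_weight(synonym):
    words = [w for w in synonym.lower().split()
             if w.isalpha() and w not in STOPWORDS]
    return 1.0 / len(words)

print(term_weight("tuberculosis"))                     # 1.0
print(term_weight("pulmonary tuberculosis"))           # 0.5
print(term_weight("pulmonary tuberculosis symptoms"))  # 0.333...
```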

Because the PyAhoCorasick API doesn't allow us to incrementally add to the payload, we preprocess the TSV file representing the dictionary to create a dictionary mapping each term to its list of (CSID, weight) pairs. We then use this dictionary to populate an Aho-Corasick FSM. Internally, the FSM states are just characters, so when querying this structure you will have to filter the output, retaining only the matches that correspond to full terms in the query.
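Here is an illustrative sketch of that preprocessing and FSM build. It assumes the dictionary is a TSV file of concept IDs and pipe-separated synonyms (the vertices.txt format discussed in the comments); abbreviation handling is omitted for brevity, and the exact notebook code differs in details.

```python
# Sketch: build a term -> [(CSID, weight), ...] dictionary from the TSV
# file, then populate the Aho-Corasick FSM with the words.
import collections
import ahocorasick

STOPWORDS = {"of", "the", "and", "in", "for"}  # illustrative only

term_payloads = collections.defaultdict(list)
with open("vertices.txt") as f:
    for line in f:
        concept_id, synonyms = line.strip().split("\t")
        for syn_id, synonym in enumerate(synonyms.split("|")):
            words = [w for w in synonym.lower().split()
                     if w.isalpha() and w not in STOPWORDS]
            if not words:
                continue
            weight = 1.0 / len(words)
            csid = "{:s}:{:d}".format(concept_id, syn_id)
            for word in words:
                term_payloads[word].append((csid, weight))

# the payload for each term is the term itself plus its (CSID, weight) list
A = ahocorasick.Automaton()
for term, payloads in term_payloads.items():
    A.add_word(term, (term, payloads))
A.make_automaton()
```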

When querying, you have to apply the same set of transformations that you applied to the dictionary terms, namely lowercasing (unless the word is an obvious abbreviation) and removing stopwords and non-alpha words. As mentioned earlier, you also have to filter the results to retain only matches to full terms. The next step is to post-process the results to extract a set of inexact matches.
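A sketch of the query step, continuing with the automaton built above; function and variable names are illustrative, not the notebook's.

```python
# Run the query through the automaton and keep only hits that line up with
# full query tokens (the character-level automaton also fires inside
# longer words).
def query_matches(automaton, text):
    text = text.lower()  # abbreviation handling omitted for brevity
    tokens = set(w for w in text.split()
                 if w.isalpha() and w not in STOPWORDS)
    hits = []
    for end, (term, payloads) in automaton.iter(text):
        start = end - len(term) + 1
        if term in tokens:  # retain only full-term matches
            hits.append((start, end, term, payloads))
    return hits

hits = query_matches(A, "helicobacter pylori patient perception")
```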

The first step is to populate a weight matrix W using the weights from the payload of the filtered results. The rows of this matrix correspond to the terms being matched, and the columns correspond to the CSID entries across all the terms. Because terms tend to clump around CSIDs, the matrix is expected to be relatively sparse. The quality of a match of the entire query string against a particular CSID is evaluated as the sum of the weights down the corresponding column. You provide a confidence threshold, which the algorithm uses to retain only CSIDs whose column sum exceeds it. These are then moved to the candidate match matrix C. The figure below shows the offsets and weights returned for different CSIDs for the query "helicobacter pylori patient perception".
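A sketch of this scoring step; the threshold value here is arbitrary and the names are illustrative.

```python
# Rows of W are the matched query terms, columns are CSIDs; a CSID
# survives if its column sum clears the confidence threshold.
import numpy as np

def score_csids(hits, min_confidence=0.7):
    # hits are (start, end, term, payloads) tuples from the query step
    terms = sorted(set(h[2] for h in hits))
    csids = sorted(set(csid for h in hits for csid, _ in h[3]))
    W = np.zeros((len(terms), len(csids)))
    for _, _, term, payloads in hits:
        for csid, weight in payloads:
            W[terms.index(term), csids.index(csid)] = weight
    col_sums = W.sum(axis=0)  # match quality for each CSID
    return {csid: score for csid, score in zip(csids, col_sums)
            if score >= min_confidence}
```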


There are some interesting things to note here. Notice that both "helicobacter" and "pylori" have their own matches (8131809:0 and 8121505:4). In addition, both matched CSID 8121505:0. The term "patient" matches two CSIDs exactly -- 811861:0 and 811861:1. This is because the term is repeated across the synonyms, one with a capital P and one without. However, there may be cases where the same term maps to multiple different concept IDs because it is genuinely ambiguous.

The next step is to merge offsets. This is where we take terms that are consecutive or close to each other and consider them a single span. Closeness is defined by a user-specified span_slop, which is set to 5 by default. In our example above, the first two terms can be merged for CSID 8121505:0. At this point, we have moved from single terms annotated with multiple CSIDs to some multi-term spans annotated as well.
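A minimal sketch of the merge, assuming character offsets per CSID; the helper name is illustrative.

```python
# For a given CSID, sort its character spans and join spans whose gap is
# at most span_slop characters.
def merge_spans(spans, span_slop=5):
    spans = sorted(spans)
    merged = [list(spans[0])]
    for start, end in spans[1:]:
        if start - merged[-1][1] <= span_slop:
            merged[-1][1] = max(merged[-1][1], end)  # extend the current span
        else:
            merged.append([start, end])
    return [tuple(span) for span in merged]

# "helicobacter" at 0-11 and "pylori" at 13-18 merge into a single span
print(merge_spans([(0, 11), (13, 18)]))  # [(0, 18)]
```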

We now remove the mappings that are wholly contained inside longer mappings. The idea here is that a string such as "helicobacter pylori" is more specific than either "helicobacter" or "pylori" on its own, so the latter mappings should be removed in favor of the former. This is what the algorithm does next. We are now ready to pull in the primary name of the matched concept and present the final results, as shown below.
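A sketch of this containment filter, again with illustrative names and assuming (CSID, start, end) tuples from the previous step.

```python
# Drop any mapping whose span is wholly contained inside a strictly longer
# span, so "helicobacter pylori" wins over "helicobacter" or "pylori" alone.
def remove_contained(mappings):
    # mappings are (csid, start, end) tuples after offset merging
    kept = []
    for csid, start, end in mappings:
        contained = any(s <= start and end <= e and (e - s) > (end - start)
                        for _, s, e in mappings)
        if not contained:
            kept.append((csid, start, end))
    return kept
```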


And that's pretty much it. This technique is applicable not only to short strings such as the one I used for the example, but also to longer bodies of text. In my Notebook with the Code and Examples, I show the same algorithm being used to annotate entire paragraphs, although it may be better for quality reasons to annotate sentence by sentence.

At some point, I hope to fold this functionality into SoDA, my open source wrapper to SolrTextTagger mentioned earlier. But given my current queue, I don't foresee that happening anytime soon. If any of you would like to implement this idea in Scala and integrate it into SoDA, please send me a pull request, and I will be happy to merge it in and give you credit.


8 comments (moderated to prevent spam):

David Smiley said...

Hello Sujit,
Thanks for sharing this post with me. I think the SolrTextTagger (TaggerRequestHandler in Solr) could accomplish the requirement here pretty reasonably. I see two things missing:
(A) a new TokenFilter that follows ConcatenateGraphFilter (which "normally" in the tagger would be last) that produces every permutation of the concatenated string of words. I googled and found a nifty algorithm to compute permutations iteratively.
(B) An option for the tagger to continue looking for matches with some configurable slop.

The resulting dictionary would be perhaps several times larger than it is today due to all the extra permutations. I see that the approach you describe here doesn't suffer from that problem. One additional optional step to minimize this is to pre-map the set of words into a single character. Ultimately that could plug in using SynonymFilter.

Sujit Pal said...

Yes, this one is basically a character-based automaton. Would this be a good Lucene candidate? I ask because the number of nodes in the automaton is likely to be small, since they correspond to characters in the vocabulary, but the edges are going to be very dense.

David Smiley said...

> Would this be a good Lucene candidate?

I'm not sure quite what you mean. Does "Lucene candidate" mean some sort of open-source contribution to Lucene? Lucene has an Automata API and also an FST API; both of which are at a character or byte level. If you mean what I referred to at the end with the SynonymFilter mapping technique, then I'm not sure there is something to open-source right there -- I suggest more of a recipe using existing Lucene stuff (SynonymFilter).

Sujit Pal said...

Yes, I think the recipe approach might be more appropriate in this case. By Lucene candidate I meant whether this would be a good candidate to include within Lucene versus an application-level thing. Because it is a character automaton, it will match much more than term-based automatons, and many of these may be spurious matches w.r.t. the application (even though they may match valid words in the dictionary), which you will have to prune with application-specific rules.

(Also, sorry about the delay in replying, I thought I had figured out the notification system, but apparently not).

David Smiley said...

I'm not certain I understand you so I want to confirm. It seems you are asking particularly about managing the word to ID aspect. If "by the application" (and not Lucene/Solr) you mean that the app needs to produce this synonym mapping file, then yes I think so. This is because the Tagger's current design works directly off an indexed field and thus you'd have to arrange for the dictionary mapping (in the form of a synonym file) to be there in advance of sending the data.

The original Tagger design was different; it built a sidecar data file (some FSTs) adjacent to the primary index; consequently it could and actually did produce a two-level dictionary of component words and then phrases of those word ordinals (numbers). Both FSTs of course. I did this intentionally because I had an early requirement to do partial matches, but later that use-case disappeared. Amusingly/frustratingly it was a misunderstanding between the one who gave me the requirement originally and me; shrug. There was some substantial code and sophistication (relative to the Tagger as a whole) to manage that whole approach. When the requirement vanished, I did some re-thinking that led to a large redesign resulting in what we have today (and I like it a lot). Interestingly, the total data size shrank as well, which surprised me greatly because I feared the reverse. After a discussion with Rob Muir (who co-created Lucene's FST code) at a conference, I learned this was likely because my phrase FST (composed of lists of word IDs) might have been using full integer (4-byte) components/words instead of, more optimally, a variable-byte int that typically would be 2 bytes.

A note to myself: It'd be interesting to write a PostingsFormat wrapper (and I've done a wrapper before) that reads the incoming terms to build a word dictionary; save it (FST of course); then pass on the terms translated to an ID list. If only I was retired I'd take on something fun like that to entertain myself ;-P

Sujit Pal said...

Yes, that's what I meant. I wasn't thinking specifically about the synonym file, more along the lines of rules/heuristics to combine the matched tokens into multi-term matches.

Alex UK said...

Hello @Sujit,
I am planning to apply your code from gist.
My understanding is that vertices.txt is shaped as:
concept_id, list of synonyms separated by |.
Did you use UMLS (something like MRXNW_ENG, see https://www.nlm.nih.gov/research/umls/implementation_resources/query_diagrams/er6.html) to build it, or did you use another source?

Regards,
Alex

Sujit Pal said...

Hi Alex, sorry for the late reply, I stopped blogging for a bit and temporarily forgot about comments, so I am just seeing your comment now. To answer your question, assuming it's still relevant, you are correct about the format of vertices.txt. And the concept IDs are similar to UMLS but are from our internal ontology.