Sunday, April 19, 2020

Inexact Matching Text against Dictionary using the Aho-Corasick Algorithm

In the past, I have written about matching text against dictionaries using David Smiley's SolrTextTagger project. SolrTextTagger uses Lucene's Finite State Transducer (FST) technology to implement a Finite State Machine (FSM) usable by the Aho-Corasick string matching algorithm. As indicated on its Wikipedia page, the Aho-Corasick algorithm has some nice performance properties that make it very attractive for text annotation, which is the use case I am interested in as well. To support this use case, I built the Solr Dictionary Annotator (SoDA) project, which wraps the SolrTextTagger service in a slightly more usable API and provides multiple levels of matching (exact, lowercase, stemmed, etc.).

To set up a SolrTextTagger index (or an Aho-Corasick FSM in general), the basic idea is to populate the FSM with the dictionary entries. Internally, this creates a Trie structure connecting the words in the dictionary entries to each other. At runtime, you stream text against this structure, and it matches phrases in the text that match the dictionary entries. The one major disadvantage of this approach is that it will only match phrases in the text whose words appear in the same order as in the equivalent dictionary phrase. For example, if my dictionary entry is "tuberculosis pulmonary", it will not match "pulmonary tuberculosis" in the text. Depending on your application, this may or may not be important. But I recently discovered a way to get around this ordering restriction, thanks to the ideas presented in the paper Efficient Clinical Concept Extraction in Electronic Medical Records (Guo, Kakrania, Baldwin, and Syeda-Mahmood, IBM Research Almaden, 2017). Hat tip to my manager Ron Daniel for pointing me to the paper.

Specifically, the part that caught my eye was their example of being able to match the string "Chest CT" from the phrase "a subsequent CT scan of the chest", and their description of their approach in the paper, quoted below. As you can see, there is not much detail in there, so I am not sure whether my approach is identical to theirs, but it got me thinking, resulting in the algorithm I will describe in this post. In the spirit of reproducibility, the description is accompanied by my Jupyter notebook, which contains code to build an in-memory FSM with the PyAhoCorasick library and query it. The notebook contains several examples of using the Aho-Corasick algorithm to match strings against an in-house medical dictionary in the inexact manner described, both against short queries (including the example cited by the paper) and against long paragraph texts.

To speed up the processing, we propose a novel indexing method that significantly reduces the search space while still maintaining the requisite flexibility in matching. First, each word in the vocabulary is represented by a unique prefix, the shortest sequence of letters that differentiates it from every other term. Next, an inverted index is created for the mapping from prefixes to report sentences. Starting from the representative prefix of each term in the vocabulary (or a set of prefixes in the case of a multi-word term), all relevant sentences can be easily retrieved as potential matches for the term, and post-filtering by longest common word sequence matching can be used to further refine the search results.

The big idea here is as follows. Instead of populating the FSM with phrases from the dictionary, we populate it with the words that the phrases are built out of. When we populated it with phrases, the payload associated with each phrase was the concept ID. When we populate it with words, collisions are almost certain. For example, the word "tuberculosis" occurs 1,352 times in the dictionary. So the payload is now a list of (concept ID, synonym ID) pairs, much like the inverted index structure used in search engines. The concept ID is self-explanatory; the synonym ID is a running serial number (for example, synonym ID 0 is the primary name for the concept). We filter out stopwords and non-alpha words in an attempt to reduce the noise. Matching is case insensitive (both dictionary and text are lowercased) unless the word is an obvious abbreviation given in all-caps.

In addition to the concept-synonym ID pair (CSID hereafter), we also store in the payload a floating point number representing the value of a match. The idea is that the string "tuberculosis" matched against the concept for "tuberculosis" is more valuable than a match against the concept for "pulmonary tuberculosis" or "pulmonary tuberculosis symptoms". The weight for each term in a synonym is the reciprocal of the number of words (minus stopwords) in the synonym, so a term from a two-word synonym contributes 0.5. This will be used later to grade the quality of the matches. Thus the payload for each term is a list of (CSID, weight) pairs, one for each occurrence of the term in a synonym.

Because the PyAhoCorasick API doesn't allow us to incrementally add to the payload, we preprocess the TSV file representing the dictionary to create a dictionary of terms and list of (CSID, weight) pairs. We then use the dictionary to populate an Aho-Corasick FSM. Internally, the FSM states are just characters, so when querying this structure you will have to filter the output, retaining the matches that correspond to the terms in the query.

When querying, you have to apply the same set of transformations that you did on the dictionary terms, namely lowercasing unless an obvious abbreviation, removing stopwords and non-alpha words. As mentioned earlier, you have to filter the results to only retain matches to full terms. The next step is to post-process the results to extract a set of inexact matches.
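The query step can be sketched as below, assuming an automaton whose payload values are (term, postings) tuples as described above. The normalization mirrors the indexing step; the function name and stopword list are mine, not from the paper.

```python
def query_automaton(automaton, text,
                    stopwords=frozenset(["a", "an", "and", "in", "of", "the"])):
    """Run text through the FSM, applying the same transformations as at
    index time, and keep only matches aligned to whole query words."""
    words = []
    for raw in text.split():
        # Lowercase unless the word is an obvious all-caps abbreviation.
        word = raw if raw.isupper() else raw.lower()
        if raw.isalpha() and word not in stopwords:
            words.append(word)
    norm_text = " ".join(words)
    matches = []
    for end_idx, (term, payload) in automaton.iter(norm_text):
        start_idx = end_idx - len(term) + 1
        # The FSM matches at the character level, so discard hits that are
        # substrings of a query word rather than whole words.
        left_ok = start_idx == 0 or norm_text[start_idx - 1] == " "
        right_ok = (end_idx == len(norm_text) - 1
                    or norm_text[end_idx + 1] == " ")
        if left_ok and right_ok:
            matches.append((start_idx, term, payload))
    return matches
```

The whole-word check is what filters out, say, the single-letter term "h" matching inside the word "with".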

The first step is to populate a weight matrix W using the weights from the payloads of the filtered results. The rows of this matrix correspond to the terms being matched, and the columns correspond to the CSID entries across all the terms. Because terms tend to clump around CSIDs, the matrix is expected to be relatively sparse. The quality of a match of the entire query string against a particular CSID is evaluated as the sum of the weights down the corresponding column. You provide a confidence threshold which the algorithm will use to retain only matches whose column sum exceeds it. These are now moved to the candidate match matrix C. The figure below shows the offsets and weights returned for different CSIDs for the query "helicobacter pylori patient perception".
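Since the matrix is sparse and the only quantity we need from it is the per-column sum, a sketch can keep it as a dict keyed by CSID instead of materializing W. The function name and default threshold here are mine:

```python
from collections import defaultdict

def score_csids(matches, threshold=0.75):
    """Sum the weights down each CSID 'column' of the implicit term x CSID
    weight matrix, and keep CSIDs whose sum clears the confidence threshold.
    `matches` is the filtered FSM output: (offset, term, postings) tuples,
    where postings is a list of (CSID, weight) pairs."""
    col_sums = defaultdict(float)
    for _, _, postings in matches:
        for csid, weight in postings:
            col_sums[csid] += weight
    return {csid: s for csid, s in col_sums.items() if s >= threshold}
```

For the query above, both "helicobacter" and "pylori" contribute 0.5 to the column for CSID 8121505:0, giving it a full score of 1.0.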

There are some interesting things to note here. Notice that both "helicobacter" and "pylori" have their own matches (8131809:0 and 8121505:4). In addition, both matched CSID 8121505:0. The term "patient" matches two CSIDs exactly -- 811861:0 and 811861:1. This is because the term is repeated across the synonyms, once with a capital P and once without. However, there may also be cases where the same term maps to multiple different concept IDs because it is genuinely ambiguous.

The next step is to merge offsets. This is where we take terms which are consecutive or close to each other and consider them a single span. Closeness is defined by a user-specified span_slop, which is set to 5 by default. In our example above, the first two terms can be merged for CSID 8121505:0. At this point, we have moved from annotating single terms with multiple CSID pairs to annotating some multi-term spans as well.
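The merge step can be sketched as follows, where offsets are (start, end) character positions sorted by start; the function name and default slop are mine:

```python
def merge_offsets(offsets, span_slop=5):
    """Merge (start, end) offsets that lie within span_slop characters of
    each other into single spans. Assumes offsets are sorted by start."""
    if not offsets:
        return []
    merged = [list(offsets[0])]
    for start, end in offsets[1:]:
        if start - merged[-1][1] <= span_slop:
            # Close enough to the previous span: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(span) for span in merged]
```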

We now remove the mappings which are wholly contained inside longer mappings. Here the idea is that a string such as "helicobacter pylori" is more specific than either "helicobacter" or "pylori", and the latter mappings should be removed in favor of the former. So this is what the algorithm does next. We are now ready to pull in the primary name of the matched concept and present the final results, as shown below.
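A minimal sketch of this containment filter is below. It is quadratic, which is fine for the handful of spans left at this stage; the name is mine:

```python
def remove_contained(spans):
    """Drop (start, end) spans wholly contained inside a longer span, so
    that e.g. "helicobacter pylori" wins over "helicobacter" or "pylori"."""
    return [s for s in spans
            if not any(o != s and o[0] <= s[0] and s[1] <= o[1]
                       for o in spans)]
```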

And that's pretty much it. This technique is applicable not only to short strings such as the one I used for the example, but also longer bodies of text. In my Notebook with the Code and Examples, I show the same algorithm being used to map entire paragraphs, although it may be better for quality reasons to annotate sentence by sentence.

At some point, I hope to fold this functionality into SoDA, my open source wrapper to SolrTextTagger I mentioned earlier. But given my current queue, I don't foresee that happening anytime soon. If any of you would like to implement this idea in Scala and integrate it into SoDA, please send me a pull request, and I will be happy to merge it in and give you credit.

Sunday, April 05, 2020

Experiments with COVID-19 Patient Data

An interesting (and beneficial, in spite of the otherwise less than ideal situation) side-effect of the COVID-19 pandemic has been that many organizations, both commercial and academic, are coming together looking for ways in which they can work together to eradicate the disease. Many of the collaborations involve sharing datasets, the most famous of which is the COVID-19 Open Research Dataset (CORD-19), a collection of 47,000 (and growing) scientific papers about COVID-19. Some others are offering the CORD-19 dataset processed through their pipelines, or hosted using their products (for example, graph database or search engine). Some are holding seminars, and sharing their expertise with the rest of the world. At Elsevier, we have a grassroots Data Science team of more than 225 employees, looking at the problem from different disciplines and angles, and working to find solutions to address the crisis. The LinkedIn article Elsevier models for COVID19 bio-molecular mechanisms describes some contributions that were driven by work from this team using one of our tools, and hopefully there will be more soon. In addition, about 46% of the papers in the CORD-19 dataset come from Elsevier, and we are looking at ways of making more available.

In the spirit of learning everything I could about COVID-19, I attended the day-long COVID-19 and AI: A Virtual Conference organized by the Stanford Human-AI (HAI) group. One of the speakers was Prof. Nigam Shah, who spoke about his Medical Center's Data Science Response to the Pandemic, and described the types of Data Science models that can inform policy to combat the virus. In addition, he also wrote this Medium post about Profiling presenting symptoms of patients screened for SARS-Cov-2 where he used the same diagram for his unified model, which is what caught my eye. Hat tip to my colleague Helena Deus for finding and posting the link to the article on our internal Slack channel.

In any case, the Medium post describes a text processing pipeline designed by Prof. Shah's group to extract clinical observations from notes written by care providers at the Emergency Department of Stanford Health Care when screening patients for COVID-19. The pipeline is built using what look like rules based on the NegEx algorithm, among other things, with Snorkel used to train models that recognize these observations in text from these noisy rules. The frequencies of these observations were then tabulated and probabilities calculated, ultimately leading to an Excel spreadsheet, which Prof. Shah and his team were kind enough to share with the world.

There were 895 patients considered for this dataset, of which 64 tested positive for SARS-CoV-2 (the virus that causes COVID-19) and 831 tested negative. So at this point in time, the prevalence of COVID-19 in the cohort (and by extension, possibly in the broader community) was about 7.2%. The observations considered in the model were the ones that occurred at least 10 times across all the patient notes.

So what can we do with this data? My first thought was a symptom checker, which would compute the probability that a particular patient tests positive given one or more of the observations (or symptoms, although I am using the term a bit loosely, since quite a few of the observations here are not symptoms). For example, if we wanted to compute the probability of the patient testing positive given that the patient exhibits only cough and no other symptom, we would denote this as P(D=True|S0=True, S1=False, ..., S49=False).

Of course, this depends on the simplifying (and very likely incorrect) assumption that the observations are independent, i.e., that the fact that a patient has a cough is independent of the fact that they have a sore throat. The other thing to remember is that predictions from the symptom checker depend on the correct value of the current disease prevalence rate. The 7.2% value we have is only correct for the time and place where the data was collected, so it will need to be updated accordingly if we wish to use the checker, even with all its limitations. Here is a schematic of the model.

Implementation-wise, I initially considered a Bayesian Network, using SQL tables to model it as taught by Prof. Gautam Shroff in his now-defunct Web Intelligence and Big Data course on Coursera (here's a quick note on how to use SQL tables to model Bayesian Networks, since the technique, even though it's super cool, does not appear to be mainstream). But thanks to this Math StackExchange discussion on expressing Conditional Probability given multiple independent events, I realized that the formulation can be much more straightforward, as shown below, so I used that instead.

The idea behind using the proportionality relationship is to compute the unnormalized quantities P(∩Sk|D=True)·P(D=True) and P(∩Sk|D=False)·P(D=False), and divide each by their sum to get the normalized probability of a positive (or negative) test given a set of symptoms. Once that was done, it led to several additional interesting questions. First, what happens to the probability as we add more and more symptoms? Second, what happens to the probability with different prevalence rates? Finally, what is the "symptom profile" of a typical COVID-19 patient based on the data? Answers to these questions and the code to get to these answers can be found in my Github Gist here.
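A minimal sketch of the symptom checker using this formulation is below. The per-symptom conditionals are invented for illustration only; the real P(Sk|D) values come from the frequencies in the shared spreadsheet.

```python
# Invented conditional probabilities for illustration only:
# symptom -> (P(symptom | D=True), P(symptom | D=False)).
COND_PROBS = {
    "cough":       (0.70, 0.40),
    "fever":       (0.60, 0.30),
    "sore_throat": (0.20, 0.25),
}
PREVALENCE = 0.072  # P(D=True) at the time and place of data collection

def p_positive(present, cond_probs=COND_PROBS, prevalence=PREVALENCE):
    """P(D=True | symptoms), under the (simplifying, likely incorrect)
    assumption that symptoms are conditionally independent given D.
    Symptoms not listed in `present` are treated as absent."""
    num_pos, num_neg = prevalence, 1.0 - prevalence
    for symptom, (p_true, p_false) in cond_probs.items():
        if symptom in present:
            num_pos *= p_true
            num_neg *= p_false
        else:
            num_pos *= (1.0 - p_true)
            num_neg *= (1.0 - p_false)
    # Normalize the two unnormalized terms so they sum to 1.
    return num_pos / (num_pos + num_neg)
```

With made-up numbers like these, adding more positive symptoms pushes the probability up from the prior, and changing PREVALENCE shifts all the outputs, which is the kind of sensitivity explored in the Gist.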

I have said it before, and given that people might tend to grasp at straws because of the pandemic situation, I am going to say it again. This is just a model, and very likely an imperfect one. Conclusions from such models are not a substitute for medical advice. I know most of you realize this already, but just in case, please do not use the conclusions from this model to make any real life decisions without independent verification.