Salmon Run: Fuzzy String Matching against UMLS Data

Sunday, February 16, 2014

Fuzzy String Matching against UMLS Data

I started looking at the Unified Medical Language System (UMLS) Database because it is an important data source for the Apache cTakes pipeline. UMLS integrates multiple medical terminologies into a single dictionary, which can be used for recognizing medical entities in text. Our proprietary concept mapping algorithm uses a superset of UMLS (combined with other terminologies, and manually curated by our team of in-house medical folks) to extract meaning from medical text. Both cTakes and Hitex (another NLP application focused on medical text) use Metamap (via its Java API), a tool provided by the National Library of Medicine (NLM) to recognize UMLS concepts in text.

I was curious about how Metamap did it - turns out it uses a pretty complex algorithm as described in this paper by Alan Aronson, the creator of Metamap and is written in Prolog and C. Parallelly, I also happened to come across the Python fuzzywuzzy library for fuzzy string matching, and the Java OpenReconcile project (used in the OpenRefine project) for matching entities in text. This gave me an idea about an algorithm that could be used to annotate text using the UMLS database as a dictionary, which I describe here. It is very different from what we use at work, and the advantage of that one is that it has been vetted and tweaked extensively over the years to yield good results over our entire taxonomy.

My matching algorithm matches the incoming phrase against the original strings from UMLS, then the case and punctuation normalized (ie lowercased and punctuations replaced with spaces) version, then the version with all the words in the normalized string alpha sorted, and finally a version with all the words in the normalized string stemmed using Lucene's StandardAnalyzer. Each match in the sequence above assigned a score in the sequence [100, 75, 50, 25]. If a match is found at any level, the other layers are not executed. If no match is found, we construct n-grams of the phrase, where n varies from N-1 to 1. Like the incoming phrase, each n-gram is transformed and matched against the dictionary. Like the full phrase, a match short-circuits the pipeline for that n-gram. Words in n-grams that matched a dictionary entry are ignored in subsequent matches. Scores for n-grams are discounted by the number of words in the n-gram versus the number of words in the entire phrase.

I store the dictionary as a Lucene index, although I am not using any of Lucene's search capabilities, and I could just as easily have stored the dictionary in a relational database or key-value store. All matches are exact matches, although I use Lucene's StandardAnalyzer to stem the dictionary entries during indexing and the incoming phrases during searching.

To get the data out of the UMLS database into a flat file, I run the following SQL on the MRCONSO table.

mysql> select CUI, STR from MRCONSO 
...    where LAT = 'ENG' 
...    into outfile '/tmp/cuistr.csv' 
...    fields terminated by ',' 
...    enclosed by '"' lines 
...    terminated by '\n';
Query OK, 7871075 rows affected (23.30 sec)

I then removed duplicates (there are some records with the same CUI and STR values from different sources) using this Unix call. This resulted in 6,818,318 rows of input.

1	sujit@tsunami:~$ cat cuistr.csv \| sort \| uniq > cuistr1.csv

Here is the code. The buildIndex() method creates the index and the annotateConcepts() method matches the incoming phrase (as described above) to the entries in the index.

// Source: src/main/scala/com/mycompany/scalcium/umls/UmlsTagger.scala
package com.mycompany.scalcium.umls

import java.io.File
import java.io.StringReader
import java.util.concurrent.atomic.AtomicLong
import java.util.regex.Pattern

import scala.Array.canBuildFrom
import scala.Array.fallbackCanBuildFrom
import scala.collection.mutable.ArrayBuffer
import scala.io.Source

import org.apache.lucene.analysis.Analyzer
import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
import org.apache.lucene.document.Document
import org.apache.lucene.document.Field
import org.apache.lucene.document.Field.Index
import org.apache.lucene.document.Field.Store
import org.apache.lucene.index.DirectoryReader
import org.apache.lucene.index.IndexReader
import org.apache.lucene.index.IndexWriter
import org.apache.lucene.index.IndexWriterConfig
import org.apache.lucene.index.Term
import org.apache.lucene.search.BooleanClause.Occur
import org.apache.lucene.search.BooleanQuery
import org.apache.lucene.search.IndexSearcher
import org.apache.lucene.search.ScoreDoc
import org.apache.lucene.search.TermQuery
import org.apache.lucene.store.SimpleFSDirectory
import org.apache.lucene.util.Version

class UmlsTagger {

  val punctPattern = Pattern.compile("\\p{Punct}")
  val spacePattern = Pattern.compile("\\s+")
  
  def buildIndex(inputFile: File, 
      luceneDir: File): Unit = {
    // set up the index writer
    val analyzer = getAnalyzer()
    val iwconf = new IndexWriterConfig(Version.LUCENE_46, analyzer)
    iwconf.setOpenMode(IndexWriterConfig.OpenMode.CREATE)
    val writer = new IndexWriter(new SimpleFSDirectory(luceneDir), iwconf)
    // read through input file and write out to lucene
    val counter = new AtomicLong(0L)
    val linesReadCounter = new AtomicLong(0L)
    Source.fromFile(inputFile)
        .getLines()
        .foreach(line => {
      val linesRead = linesReadCounter.incrementAndGet()
      if (linesRead % 1000 == 0) Console.println("%d lines read".format(linesRead))
      val Array(cui, str) = line
        .replace("\",\"", "\t")
        .replaceAll("\"", "")
        .split("\t")
      val strNorm = normalizeCasePunct(str)
      val strSorted = sortWords(strNorm)
      val strStemmed = stemWords(strNorm)
      // write full str record 
      // str = exact string
      // str_norm = case and punct normalized, exact
      // str_sorted = str_norm sorted
      // str_stemmed = str_sorted stemmed
      val fdoc = new Document()
      val fid = counter.incrementAndGet()
      fdoc.add(new Field("id", fid.toString, Store.YES, Index.NOT_ANALYZED))
      fdoc.add(new Field("cui", cui, Store.YES, Index.NOT_ANALYZED))
      fdoc.add(new Field("str", str, Store.YES, Index.NOT_ANALYZED))
      fdoc.add(new Field("str_norm", strNorm, Store.YES, Index.NOT_ANALYZED))
      fdoc.add(new Field("str_sorted", strSorted, Store.YES, Index.NOT_ANALYZED))
      fdoc.add(new Field("str_stemmed", strStemmed, Store.YES, Index.NOT_ANALYZED))
      writer.addDocument(fdoc)
      if (fid % 1000 == 0) writer.commit()
    })
    writer.commit()
    writer.close()
  }

  def annotateConcepts(phrase: String, 
      luceneDir: File): 
      List[(Double,String,String)] = {
    val suggestions = ArrayBuffer[(Double,String,String)]()
    val reader = DirectoryReader.open(
      new SimpleFSDirectory(luceneDir)) 
    val searcher = new IndexSearcher(reader)
    // try to match full string
    suggestions ++= cascadeSearch(searcher, reader, 
      phrase, 1.0)
    if (suggestions.size == 0) {
      // no exact match found, fall back to inexact matches
      val words = normalizeCasePunct(phrase)
        .split(" ")
      val foundWords = scala.collection.mutable.Set[String]()
      for (nword <- words.size - 1 until 0 by -1) {
        words.sliding(nword)
          .map(ngram => ngram.mkString(" "))
          .foreach(ngram => {
            if (alreadySeen(foundWords, ngram)) {
              val ngramWords = ngram.split(" ")
              val ratio = ngramWords.size.toDouble / words.size
              val inexactSuggestions = cascadeSearch(
                searcher, reader, ngram, ratio)
              if (inexactSuggestions.size > 0) {
                suggestions ++= inexactSuggestions
                foundWords ++= ngramWords
              }
            }    
          })       
      }
    }
    if (suggestions.size > 0) {
      // dedup by cui, keeping the first matched
      val deduped = suggestions.groupBy(_._3)
        .map(kv => kv._2.head)
        .toList  
        .sortWith((a,b) => a._1 > b._1)
      suggestions.clear
      suggestions ++= deduped
    }
    // clean up
    reader.close()
    // return results
    suggestions.toList
  }
  
  def printConcepts(
      concepts: List[(Double,String,String)]): 
      Unit = {
    concepts.foreach(concept => 
      Console.println("[%6.2f%%] (%s) %s"
      .format(concept._1, concept._3, concept._2)))
  }
  
  def normalizeCasePunct(str: String): String = {
    val str_lps = punctPattern
      .matcher(str.toLowerCase())
      .replaceAll(" ")
    spacePattern.matcher(str_lps).replaceAll(" ")
  }

  def sortWords(str: String): String = {
    val words = str.split(" ")
    words.sortWith(_ < _).mkString(" ")
  }
  
  def stemWords(str: String): String = {
    val stemmedWords = ArrayBuffer[String]()
    val tokenStream = getAnalyzer().tokenStream(
      "str_stemmed", new StringReader(str))
    val ctattr = tokenStream.addAttribute(
      classOf[CharTermAttribute])    
    tokenStream.reset()
    while (tokenStream.incrementToken()) {
      stemmedWords += ctattr.toString()
    }
    stemmedWords.mkString(" ")
  }
  
  def getAnalyzer(): Analyzer = {
    new StandardAnalyzer(Version.LUCENE_46)
  }
  
  case class StrDocument(id: Int, 
      cui: String, str: String, 
      strNorm: String, strSorted: String, 
      strStemmed: String)
      
  def getDocument(reader: IndexReader,
      hit: ScoreDoc): StrDocument = {
    val doc = reader.document(hit.doc)
    StrDocument(doc.get("id").toInt,
      doc.get("cui"), doc.get("str"), 
      doc.get("str_norm"), doc.get("str_sorted"), 
      doc.get("str_stemmed"))
  }
  
  def cascadeSearch(searcher: IndexSearcher,
      reader: IndexReader, phrase: String,
      ratio: Double): 
      List[(Double,String,String)] = {
    val results = ArrayBuffer[(Double,String,String)]()
    // exact match (100.0%)
    val query1 = new TermQuery(new Term("str", phrase))
    val hits1 = searcher.search(query1, 1).scoreDocs
    if (hits1.size > 0) {
      results += hits1.map(hit => {
        val doc = getDocument(reader, hit)
        (100.0 * ratio, doc.str, doc.cui)
      })
      .toList
      .head
    }
    // match normalized phrase (75%)
    val normPhrase = normalizeCasePunct(phrase)
    if (results.size == 0) {
      val query2 = new TermQuery(new Term("str_norm", normPhrase))
      val hits2 = searcher.search(query2, 1).scoreDocs
      if (hits2.size > 0) {
        results += hits2.map(hit => {
          val doc = getDocument(reader, hit)
          (90.0 * ratio, doc.str, doc.cui)
        })
        .toList
        .head
      }
    }
    // match sorted phrase (50%)
    val sortedPhrase = sortWords(normPhrase)
    if (results.size == 0) {
      val query3 = new TermQuery(new Term("str_sorted", sortedPhrase))
      val hits3 = searcher.search(query3, 1).scoreDocs
      if (hits3.size > 0) {
        results += hits3.map(hit => {
          val doc = getDocument(reader, hit)
          (80.0 * ratio, doc.str, doc.cui)
        })
        .toList
        .head
      }
    }
    // match stemmed phrase (25%)
    val stemmedPhrase = stemWords(normPhrase)
    if (results.size == 0) {
      val query4 = new TermQuery(new Term("str_stemmed", stemmedPhrase))
      val hits4 = searcher.search(query4, 1).scoreDocs
      if (hits4.size > 0) {
        results += hits4.map(hit => {
          val doc = getDocument(reader, hit)
          (70.0 * ratio, doc.str, doc.cui)
        })
        .toList
        .head
      }
    }
    results.toList
  }
  
  def alreadySeen(
      refset: scala.collection.mutable.Set[String], 
      ngram: String): Boolean = {
    val words = ngram.split(" ")
    val nseen = words.filter(word => 
      !refset.contains(word))
      .size
    if (refset.size == 0) true
    else if (nseen > 0) true 
    else false
  }
}

I just used JUnit tests to run through the various functionality. Building the index takes about 40 minutes on my laptop. Here is the JUnit.

// Source: src/test/scala/com/mycompany/scalcium/umls/UmlsTaggerTest.scala
package com.mycompany.scalcium.umls

import org.junit.Test
import java.io.File
import org.junit.Assert

class UmlsTaggerTest {

  @Test
  def testSortWords(): Unit = {
    val s = "heart attack and diabetes"
    val tagger = new UmlsTagger()
    Assert.assertEquals("and attack diabetes heart", tagger.sortWords(s))
  }
  
  @Test
  def testStemWords(): Unit = {
    val s = "and attack diabetes heart"
    val tagger = new UmlsTagger()
    Assert.assertEquals("attack diabetes heart", tagger.stemWords(s))
  }

  @Test
  def testBuild(): Unit = {
    val input = new File("/home/sujit/Projects/med_data/cuistr1.csv")
    val output = new File("/home/sujit/Projects/med_data/umlsindex")
    val tagger = new UmlsTagger()
    tagger.buildIndex(input, output)
  }
  
  @Test
  def testMapSingleConcept(): Unit = {
    val luceneDir = new File("/home/sujit/Projects/med_data/umlsindex")
    val tagger = new UmlsTagger()
    val strs = List("Lung Cancer", "Heart Attack", "Diabetes")
    strs.foreach(str => {
      val concepts = tagger.annotateConcepts(str, luceneDir)
      Console.println("Query: " + str)
      tagger.printConcepts(concepts)
      Assert.assertEquals(1, concepts.size)
      Assert.assertEquals(100.0D, concepts.head._1, 0.1D)
    })
  }

  @Test
  def testMapMultipleConcepts(): Unit = {
    val luceneDir = new File("/home/sujit/Projects/med_data/umlsindex")
    val tagger = new UmlsTagger()
    val strs = List(
        "Heart Attack and diabetes",
        "carcinoma (small-cell) of lung",
        "asthma side effects")
    strs.foreach(str => {
      val concepts = tagger.annotateConcepts(str, luceneDir)
      Console.println("Query: " + str)
      tagger.printConcepts(concepts)
    })
  }
}

The last two tests in the JUnit test above return exact and inexact match results for some queries. The first 3 are exact queries and the next 3 are in-exact queries. Output from these tests is shown below:

Query: Lung Cancer
[100.00%] (C0242379) Lung Cancer

Query: Heart Attack
[100.00%] (C0027051) Heart Attack

Query: Diabetes
[100.00%] (C0011847) Diabetes

Query: Heart Attack and diabetes
[ 50.00%] (C0027051) heart attack
[ 35.00%] (C0011847) Diabetes
[ 35.00%] (C0699795) Attack

Query: carcinoma (small-cell) of lung
[ 80.00%] (C0149925) small cell carcinoma of lung

Query: asthma side effects
[ 66.67%] (C0001688) side effects
[ 33.33%] (C0004096) asthma

As you can see, the results don't look too bad. The algorithm is also not very complex, it borrows a lot from the ideas used in the fuzzywuzzy and OpenReconcile projects. Of course, I have tried this on a very small subset of queries, it remains to be seen if it produces good results as consistently as the one we use at work.

21 comments (moderated to prevent spam):

John G said...: This was meant to be posted here:

Brilliant! I had had this same idea some months back, but I dont think I could have pulled it off half as good as you. Look forward to implementing it soon!; 3/07/2014 11:52 AM
Sujit Pal said...: Thanks, John G. You may also want to take a look at this post, where I use David Smiley's SolrTextTagger - cuts out the need to ngram the query for partial matches.; 3/07/2014 12:28 PM
Anonymous said...: What version of Lucene are you using?; 3/19/2014 6:01 AM
Sujit Pal said...: I'm using Lucene 4.6.0. Here is the relevant snippet from my build.sbt:

"org.apache.lucene" % "lucene-core" % "4.6.0",
"org.apache.lucene" % "lucene-queries" % "4.6.0",
"org.apache.lucene" % "lucene-analyzers-common" % "4.6.0",
"org.apache.lucene" % "lucene-queryparser" % "4.6.0",
"org.apache.solr" % "solr-solrj" % "4.6.0",; 3/19/2014 7:45 AM
JohnT said...: Whew! I'm new to Lucene and Eclipse and setting this up was a pain, only because of my ignorance in those areas.

So, if anyone tries to recreate this great code, the following tip may help:

After importing the Lucene Core org directory AND Lucene Analysis directory (there will be errors, ignore them if your processing english), ADD the META-INF directory to the src directory for wherever you setup Lucene core (I did it as a separate project). This may seem obvious to those comfortable with the environment, but it took me some frustrating hours to figure it out. The root of this is the SPI model that lucene 4 went to as I understand it.

Im on JDK 1.6 running Eclipse Kepler with Lucene 4.6.

Hope this helps someone.

Thanks for the wonderful ideas Sujit.

JG; 3/20/2014 4:43 AM
Sujit Pal said...: I generally set up Scala projects using the giter8 (g8) template generator. Then I manually update the build.sbt to include my dependencies (and resolvers if necessary), then use "sbt eclipse" to build the project artifacts (.classpath, etc). At that point, all you have to do with Eclipse is to create a Scala project and point the IDE to the project directory.; 3/20/2014 8:37 AM
Evan Smith said...: Hello,

I am have gotten to the scala code and am unsure how to setup/compile as needed.

Can you post your eclipse project or give more details about
- updating the build.sbt
- sbt eclipse
and perhaps what version of eclipse and what your class path was (as an example)?

Looking forward to running this.

Thanks!
Evan; 5/08/2014 5:53 PM
Evan Smith said...: Hello,

Ok I have giter8 installed.

Which template to you use to setup?

Evan; 5/08/2014 6:07 PM
Sujit Pal said...: Hi Evan, I used the n8han/giter8 template. Here is the snippet from my build.sbt listing out the resolvers and dependencies:

resolvers ++= Seq(
.."spray" at "https://repo.spray.io/",
.."Neo4j-Contrib" at "http://m2.neo4j.org/content/groups/everything"
)
libraryDependencies ++= Seq(
.."org.apache.opennlp" % "opennlp-maxent" % "3.0.3",
.."org.apache.opennlp" % "opennlp-tools" % "1.5.3",
.."org.apache.lucene" % "lucene-core" % "4.6.0",
.."org.apache.lucene" % "lucene-queries" % "4.6.0",
.."org.apache.lucene" % "lucene-analyzers-common" % "4.6.0",
.."org.apache.lucene" % "lucene-queryparser" % "4.6.0",
.."org.apache.solr" % "solr-solrj" % "4.6.0",
.."org.neo4j" % "neo4j" % "1.9.6",
.."org.neo4j" % "neo4j-rest-graphdb" % "1.9",
.."nz.ac.waikato.cms.weka" % "weka-dev" % "3.7.10",
.."org.apache.commons" % "commons-lang3" % "3.0",
.."net.sourceforge.collections" % "collections-generic" % "4.01",
.."commons-beanutils" % "commons-beanutils" % "1.8.3",
.."commons-io" % "commons-io" % "2.4",
.."io.spray" %% "spray-json" % "1.2.5",
.."log4j" % "log4j" % "1.2.14",
.."com.novocode" % "junit-interface" % "0.8" % "test"
); 5/08/2014 6:47 PM
Evan Smith said...: Hello,

Progress

Ok I think I have a .sbt that is reasonable. I put in your resolvers and depends. naturally when I compile the scala code it complains on the first import of the lucene.analysis.analyzer

Where is this jar in the solr install?

Evan; 5/09/2014 7:07 AM
Sujit Pal said...: If you run "sbt eclipse" with an internet connection, it would download them all off the maven repositories you set up in the build.sbt to your $HOME/.ivy directory, as well as create your Eclipse .classpath and .project descriptors. Alternatively, if you must get them off your solr install, they are in your solr-webapp/webapp/WEB-INF/lib directory and they need to be copied into the automatically managed lib directory (under your project root).; 5/09/2014 10:32 AM
Evan Smith said...: Hello,

I have managed to setup an eclipse project (though I did create it myself). I have all the libs, and am able to compile and run the code.

I generate the json file (looks fine).

But I get no results from my solr instance on queries. I assume that I need to reingest my documents to have them indexed with you fields.

Any pointers on what fields to copy field into or how to do an ingest against your settings?

This is totally cool, and thank you for all of your help!

I will download your eclipse project this weekend.

Evan; 5/09/2014 11:38 AM
Sujit Pal said...: You have to run the testBuild JUnit test to load up the index with the data. Also the project is not publicly available yet (I plan on completing some more functionality before I release).; 5/09/2014 6:27 PM
JohnT said...: Hi Sujit - I converted this to python and added https://code.google.com/p/negex/. The thinking behind python is I do, and a lot of others do, rapid web dev with djangoproject. Your parser, now in python, plus negation, could be a very powerful but simple tool on the web. How do you feel about you and I pushing it to github together as a mini opensource project? I have been using the version I converted to python with negex with much success COMPARED to cTakes, which I also use heavily (for short strings of course, not long documents). If you're interested please email me at john travis green (spaces replaced with dots) (at) gmail.com.; 3/05/2015 7:07 AM
Sujit Pal said...: Hi John, congratulations on making the matching algorithm work with Python. Negex needs something to mark phrases of interest and you are right, this algorithm is much easier than using Metamap or cTakes. I think the combination would be a worthy open source project. But the work is all yours and I don't want to take away from it, so I would be happy if you make a mention of this page on your README file.; 3/05/2015 9:26 AM
JohnT said...: Hi Sujit, pushed the project to github - https://github.com/jtgreen/SMPP. Effectively wrapped in negex and added a more aggressive stemmer (Lancaster) as well as an optional "include all results" which is a bit of a misnomer; simply put, it continues the cascade without short circuiting in order to broaden the net... a bit of bag of words style, only you do keep the scores so you can do some simple permutations and optimize over a solution.; 4/17/2015 6:34 PM
Sujit Pal said...: Very cool, congratulations and thanks for the shout-out!; 4/18/2015 9:28 AM
Anonymous said...: Have you released this project publicly? I've searched some repositories online. This is very interesting!; 8/29/2016 6:03 PM
Sujit Pal said...: Thank you, glad you found it interesting. And yes, the code in this post is part of a public project (https://github.com/sujitpal/scalcium).; 8/29/2016 7:37 PM
Unknown said...: Hi Sujit,

Can we implement this in python?
if yes please suggest how(which libraries will help)?; 7/13/2017 10:02 PM
Sujit Pal said...: Yes, I think you can implement this in Python fairly easily. Main problem is Lucene which you can replace with a Solr/Elasticsearch server. In order to build the index, you can reuse standard analyzers that are available in Solr/ES, and the string manipulation to sort and stem the words can be easily done during preprocessing with Python as well. Same with the query side, it essentially does 4 queries and sorts the results according to weighted scores. To send and receive HTTP requests to and from Solr, you could use the requests library, and json for building and parsing the request and response.; 7/14/2017 10:02 AM

Salmon Run

Sunday, February 16, 2014

Fuzzy String Matching against UMLS Data

21 comments (moderated to prevent spam):

Posts

Labels

Blogs I Read

About me

My Nerd Rating

Visitor Map

Contact Me