Sunday, February 16, 2014

Fuzzy String Matching against UMLS Data


I started looking at the Unified Medical Language System (UMLS) Database because it is an important data source for the Apache cTakes pipeline. UMLS integrates multiple medical terminologies into a single dictionary, which can be used for recognizing medical entities in text. Our proprietary concept mapping algorithm uses a superset of UMLS (combined with other terminologies, and manually curated by our team of in-house medical folks) to extract meaning from medical text. Both cTakes and Hitex (another NLP application focused on medical text) use Metamap (via its Java API), a tool provided by the National Library of Medicine (NLM) to recognize UMLS concepts in text.

I was curious about how Metamap did it - turns out it uses a pretty complex algorithm as described in this paper by Alan Aronson, the creator of Metamap and is written in Prolog and C. Parallelly, I also happened to come across the Python fuzzywuzzy library for fuzzy string matching, and the Java OpenReconcile project (used in the OpenRefine project) for matching entities in text. This gave me an idea about an algorithm that could be used to annotate text using the UMLS database as a dictionary, which I describe here. It is very different from what we use at work, and the advantage of that one is that it has been vetted and tweaked extensively over the years to yield good results over our entire taxonomy.

My matching algorithm matches the incoming phrase against the original strings from UMLS, then the case and punctuation normalized (ie lowercased and punctuations replaced with spaces) version, then the version with all the words in the normalized string alpha sorted, and finally a version with all the words in the normalized string stemmed using Lucene's StandardAnalyzer. Each match in the sequence above assigned a score in the sequence [100, 75, 50, 25]. If a match is found at any level, the other layers are not executed. If no match is found, we construct n-grams of the phrase, where n varies from N-1 to 1. Like the incoming phrase, each n-gram is transformed and matched against the dictionary. Like the full phrase, a match short-circuits the pipeline for that n-gram. Words in n-grams that matched a dictionary entry are ignored in subsequent matches. Scores for n-grams are discounted by the number of words in the n-gram versus the number of words in the entire phrase.

I store the dictionary as a Lucene index, although I am not using any of Lucene's search capabilities, and I could just as easily have stored the dictionary in a relational database or key-value store. All matches are exact matches, although I use Lucene's StandardAnalyzer to stem the dictionary entries during indexing and the incoming phrases during searching.

To get the data out of the UMLS database into a flat file, I run the following SQL on the MRCONSO table.

1
2
3
4
5
6
7
mysql> select CUI, STR from MRCONSO 
...    where LAT = 'ENG' 
...    into outfile '/tmp/cuistr.csv' 
...    fields terminated by ',' 
...    enclosed by '"' lines 
...    terminated by '\n';
Query OK, 7871075 rows affected (23.30 sec)

I then removed duplicates (there are some records with the same CUI and STR values from different sources) using this Unix call. This resulted in 6,818,318 rows of input.

1
sujit@tsunami:~$ cat cuistr.csv | sort | uniq > cuistr1.csv

Here is the code. The buildIndex() method creates the index and the annotateConcepts() method matches the incoming phrase (as described above) to the entries in the index.

1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
// Source: src/main/scala/com/mycompany/scalcium/umls/UmlsTagger.scala
package com.mycompany.scalcium.umls

import java.io.File
import java.io.StringReader
import java.util.concurrent.atomic.AtomicLong
import java.util.regex.Pattern

import scala.Array.canBuildFrom
import scala.Array.fallbackCanBuildFrom
import scala.collection.mutable.ArrayBuffer
import scala.io.Source

import org.apache.lucene.analysis.Analyzer
import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
import org.apache.lucene.document.Document
import org.apache.lucene.document.Field
import org.apache.lucene.document.Field.Index
import org.apache.lucene.document.Field.Store
import org.apache.lucene.index.DirectoryReader
import org.apache.lucene.index.IndexReader
import org.apache.lucene.index.IndexWriter
import org.apache.lucene.index.IndexWriterConfig
import org.apache.lucene.index.Term
import org.apache.lucene.search.BooleanClause.Occur
import org.apache.lucene.search.BooleanQuery
import org.apache.lucene.search.IndexSearcher
import org.apache.lucene.search.ScoreDoc
import org.apache.lucene.search.TermQuery
import org.apache.lucene.store.SimpleFSDirectory
import org.apache.lucene.util.Version

class UmlsTagger {

  val punctPattern = Pattern.compile("\\p{Punct}")
  val spacePattern = Pattern.compile("\\s+")
  
  def buildIndex(inputFile: File, 
      luceneDir: File): Unit = {
    // set up the index writer
    val analyzer = getAnalyzer()
    val iwconf = new IndexWriterConfig(Version.LUCENE_46, analyzer)
    iwconf.setOpenMode(IndexWriterConfig.OpenMode.CREATE)
    val writer = new IndexWriter(new SimpleFSDirectory(luceneDir), iwconf)
    // read through input file and write out to lucene
    val counter = new AtomicLong(0L)
    val linesReadCounter = new AtomicLong(0L)
    Source.fromFile(inputFile)
        .getLines()
        .foreach(line => {
      val linesRead = linesReadCounter.incrementAndGet()
      if (linesRead % 1000 == 0) Console.println("%d lines read".format(linesRead))
      val Array(cui, str) = line
        .replace("\",\"", "\t")
        .replaceAll("\"", "")
        .split("\t")
      val strNorm = normalizeCasePunct(str)
      val strSorted = sortWords(strNorm)
      val strStemmed = stemWords(strNorm)
      // write full str record 
      // str = exact string
      // str_norm = case and punct normalized, exact
      // str_sorted = str_norm sorted
      // str_stemmed = str_sorted stemmed
      val fdoc = new Document()
      val fid = counter.incrementAndGet()
      fdoc.add(new Field("id", fid.toString, Store.YES, Index.NOT_ANALYZED))
      fdoc.add(new Field("cui", cui, Store.YES, Index.NOT_ANALYZED))
      fdoc.add(new Field("str", str, Store.YES, Index.NOT_ANALYZED))
      fdoc.add(new Field("str_norm", strNorm, Store.YES, Index.NOT_ANALYZED))
      fdoc.add(new Field("str_sorted", strSorted, Store.YES, Index.NOT_ANALYZED))
      fdoc.add(new Field("str_stemmed", strStemmed, Store.YES, Index.NOT_ANALYZED))
      writer.addDocument(fdoc)
      if (fid % 1000 == 0) writer.commit()
    })
    writer.commit()
    writer.close()
  }

  def annotateConcepts(phrase: String, 
      luceneDir: File): 
      List[(Double,String,String)] = {
    val suggestions = ArrayBuffer[(Double,String,String)]()
    val reader = DirectoryReader.open(
      new SimpleFSDirectory(luceneDir)) 
    val searcher = new IndexSearcher(reader)
    // try to match full string
    suggestions ++= cascadeSearch(searcher, reader, 
      phrase, 1.0)
    if (suggestions.size == 0) {
      // no exact match found, fall back to inexact matches
      val words = normalizeCasePunct(phrase)
        .split(" ")
      val foundWords = scala.collection.mutable.Set[String]()
      for (nword <- words.size - 1 until 0 by -1) {
        words.sliding(nword)
          .map(ngram => ngram.mkString(" "))
          .foreach(ngram => {
            if (alreadySeen(foundWords, ngram)) {
              val ngramWords = ngram.split(" ")
              val ratio = ngramWords.size.toDouble / words.size
              val inexactSuggestions = cascadeSearch(
                searcher, reader, ngram, ratio)
              if (inexactSuggestions.size > 0) {
                suggestions ++= inexactSuggestions
                foundWords ++= ngramWords
              }
            }    
          })       
      }
    }
    if (suggestions.size > 0) {
      // dedup by cui, keeping the first matched
      val deduped = suggestions.groupBy(_._3)
        .map(kv => kv._2.head)
        .toList  
        .sortWith((a,b) => a._1 > b._1)
      suggestions.clear
      suggestions ++= deduped
    }
    // clean up
    reader.close()
    // return results
    suggestions.toList
  }
  
  def printConcepts(
      concepts: List[(Double,String,String)]): 
      Unit = {
    concepts.foreach(concept => 
      Console.println("[%6.2f%%] (%s) %s"
      .format(concept._1, concept._3, concept._2)))
  }
  
  def normalizeCasePunct(str: String): String = {
    val str_lps = punctPattern
      .matcher(str.toLowerCase())
      .replaceAll(" ")
    spacePattern.matcher(str_lps).replaceAll(" ")
  }

  def sortWords(str: String): String = {
    val words = str.split(" ")
    words.sortWith(_ < _).mkString(" ")
  }
  
  def stemWords(str: String): String = {
    val stemmedWords = ArrayBuffer[String]()
    val tokenStream = getAnalyzer().tokenStream(
      "str_stemmed", new StringReader(str))
    val ctattr = tokenStream.addAttribute(
      classOf[CharTermAttribute])    
    tokenStream.reset()
    while (tokenStream.incrementToken()) {
      stemmedWords += ctattr.toString()
    }
    stemmedWords.mkString(" ")
  }
  
  def getAnalyzer(): Analyzer = {
    new StandardAnalyzer(Version.LUCENE_46)
  }
  
  case class StrDocument(id: Int, 
      cui: String, str: String, 
      strNorm: String, strSorted: String, 
      strStemmed: String)
      
  def getDocument(reader: IndexReader,
      hit: ScoreDoc): StrDocument = {
    val doc = reader.document(hit.doc)
    StrDocument(doc.get("id").toInt,
      doc.get("cui"), doc.get("str"), 
      doc.get("str_norm"), doc.get("str_sorted"), 
      doc.get("str_stemmed"))
  }
  
  def cascadeSearch(searcher: IndexSearcher,
      reader: IndexReader, phrase: String,
      ratio: Double): 
      List[(Double,String,String)] = {
    val results = ArrayBuffer[(Double,String,String)]()
    // exact match (100.0%)
    val query1 = new TermQuery(new Term("str", phrase))
    val hits1 = searcher.search(query1, 1).scoreDocs
    if (hits1.size > 0) {
      results += hits1.map(hit => {
        val doc = getDocument(reader, hit)
        (100.0 * ratio, doc.str, doc.cui)
      })
      .toList
      .head
    }
    // match normalized phrase (75%)
    val normPhrase = normalizeCasePunct(phrase)
    if (results.size == 0) {
      val query2 = new TermQuery(new Term("str_norm", normPhrase))
      val hits2 = searcher.search(query2, 1).scoreDocs
      if (hits2.size > 0) {
        results += hits2.map(hit => {
          val doc = getDocument(reader, hit)
          (90.0 * ratio, doc.str, doc.cui)
        })
        .toList
        .head
      }
    }
    // match sorted phrase (50%)
    val sortedPhrase = sortWords(normPhrase)
    if (results.size == 0) {
      val query3 = new TermQuery(new Term("str_sorted", sortedPhrase))
      val hits3 = searcher.search(query3, 1).scoreDocs
      if (hits3.size > 0) {
        results += hits3.map(hit => {
          val doc = getDocument(reader, hit)
          (80.0 * ratio, doc.str, doc.cui)
        })
        .toList
        .head
      }
    }
    // match stemmed phrase (25%)
    val stemmedPhrase = stemWords(normPhrase)
    if (results.size == 0) {
      val query4 = new TermQuery(new Term("str_stemmed", stemmedPhrase))
      val hits4 = searcher.search(query4, 1).scoreDocs
      if (hits4.size > 0) {
        results += hits4.map(hit => {
          val doc = getDocument(reader, hit)
          (70.0 * ratio, doc.str, doc.cui)
        })
        .toList
        .head
      }
    }
    results.toList
  }
  
  def alreadySeen(
      refset: scala.collection.mutable.Set[String], 
      ngram: String): Boolean = {
    val words = ngram.split(" ")
    val nseen = words.filter(word => 
      !refset.contains(word))
      .size
    if (refset.size == 0) true
    else if (nseen > 0) true 
    else false
  }
}

I just used JUnit tests to run through the various functionality. Building the index takes about 40 minutes on my laptop. Here is the JUnit.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
// Source: src/test/scala/com/mycompany/scalcium/umls/UmlsTaggerTest.scala
package com.mycompany.scalcium.umls

import org.junit.Test
import java.io.File
import org.junit.Assert

class UmlsTaggerTest {

  @Test
  def testSortWords(): Unit = {
    val s = "heart attack and diabetes"
    val tagger = new UmlsTagger()
    Assert.assertEquals("and attack diabetes heart", tagger.sortWords(s))
  }
  
  @Test
  def testStemWords(): Unit = {
    val s = "and attack diabetes heart"
    val tagger = new UmlsTagger()
    Assert.assertEquals("attack diabetes heart", tagger.stemWords(s))
  }

  @Test
  def testBuild(): Unit = {
    val input = new File("/home/sujit/Projects/med_data/cuistr1.csv")
    val output = new File("/home/sujit/Projects/med_data/umlsindex")
    val tagger = new UmlsTagger()
    tagger.buildIndex(input, output)
  }
  
  @Test
  def testMapSingleConcept(): Unit = {
    val luceneDir = new File("/home/sujit/Projects/med_data/umlsindex")
    val tagger = new UmlsTagger()
    val strs = List("Lung Cancer", "Heart Attack", "Diabetes")
    strs.foreach(str => {
      val concepts = tagger.annotateConcepts(str, luceneDir)
      Console.println("Query: " + str)
      tagger.printConcepts(concepts)
      Assert.assertEquals(1, concepts.size)
      Assert.assertEquals(100.0D, concepts.head._1, 0.1D)
    })
  }

  @Test
  def testMapMultipleConcepts(): Unit = {
    val luceneDir = new File("/home/sujit/Projects/med_data/umlsindex")
    val tagger = new UmlsTagger()
    val strs = List(
        "Heart Attack and diabetes",
        "carcinoma (small-cell) of lung",
        "asthma side effects")
    strs.foreach(str => {
      val concepts = tagger.annotateConcepts(str, luceneDir)
      Console.println("Query: " + str)
      tagger.printConcepts(concepts)
    })
  }
}

The last two tests in the JUnit test above return exact and inexact match results for some queries. The first 3 are exact queries and the next 3 are in-exact queries. Output from these tests is shown below:

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
Query: Lung Cancer
[100.00%] (C0242379) Lung Cancer

Query: Heart Attack
[100.00%] (C0027051) Heart Attack

Query: Diabetes
[100.00%] (C0011847) Diabetes

Query: Heart Attack and diabetes
[ 50.00%] (C0027051) heart attack
[ 35.00%] (C0011847) Diabetes
[ 35.00%] (C0699795) Attack

Query: carcinoma (small-cell) of lung
[ 80.00%] (C0149925) small cell carcinoma of lung

Query: asthma side effects
[ 66.67%] (C0001688) side effects
[ 33.33%] (C0004096) asthma

As you can see, the results don't look too bad. The algorithm is also not very complex, it borrows a lot from the ideas used in the fuzzywuzzy and OpenReconcile projects. Of course, I have tried this on a very small subset of queries, it remains to be seen if it produces good results as consistently as the one we use at work.

19 comments (moderated to prevent spam):

John G said...

This was meant to be posted here:

Brilliant! I had had this same idea some months back, but I dont think I could have pulled it off half as good as you. Look forward to implementing it soon!

Sujit Pal said...

Thanks, John G. You may also want to take a look at this post, where I use David Smiley's SolrTextTagger - cuts out the need to ngram the query for partial matches.

Anonymous said...

What version of Lucene are you using?

Sujit Pal said...

I'm using Lucene 4.6.0. Here is the relevant snippet from my build.sbt:

"org.apache.lucene" % "lucene-core" % "4.6.0",
"org.apache.lucene" % "lucene-queries" % "4.6.0",
"org.apache.lucene" % "lucene-analyzers-common" % "4.6.0",
"org.apache.lucene" % "lucene-queryparser" % "4.6.0",
"org.apache.solr" % "solr-solrj" % "4.6.0",

JohnT said...

Whew! I'm new to Lucene and Eclipse and setting this up was a pain, only because of my ignorance in those areas.

So, if anyone tries to recreate this great code, the following tip may help:

After importing the Lucene Core org directory AND Lucene Analysis directory (there will be errors, ignore them if your processing english), ADD the META-INF directory to the src directory for wherever you setup Lucene core (I did it as a separate project). This may seem obvious to those comfortable with the environment, but it took me some frustrating hours to figure it out. The root of this is the SPI model that lucene 4 went to as I understand it.

Im on JDK 1.6 running Eclipse Kepler with Lucene 4.6.

Hope this helps someone.

Thanks for the wonderful ideas Sujit.

JG

Sujit Pal said...

I generally set up Scala projects using the giter8 (g8) template generator. Then I manually update the build.sbt to include my dependencies (and resolvers if necessary), then use "sbt eclipse" to build the project artifacts (.classpath, etc). At that point, all you have to do with Eclipse is to create a Scala project and point the IDE to the project directory.

Evan Smith said...

Hello,

I am have gotten to the scala code and am unsure how to setup/compile as needed.

Can you post your eclipse project or give more details about
- updating the build.sbt
- sbt eclipse
and perhaps what version of eclipse and what your class path was (as an example)?

Looking forward to running this.

Thanks!
Evan

Evan Smith said...

Hello,

Ok I have giter8 installed.

Which template to you use to setup?

Evan

Sujit Pal said...

Hi Evan, I used the n8han/giter8 template. Here is the snippet from my build.sbt listing out the resolvers and dependencies:

resolvers ++= Seq(
.."spray" at "https://repo.spray.io/",
.."Neo4j-Contrib" at "http://m2.neo4j.org/content/groups/everything"
)
libraryDependencies ++= Seq(
.."org.apache.opennlp" % "opennlp-maxent" % "3.0.3",
.."org.apache.opennlp" % "opennlp-tools" % "1.5.3",
.."org.apache.lucene" % "lucene-core" % "4.6.0",
.."org.apache.lucene" % "lucene-queries" % "4.6.0",
.."org.apache.lucene" % "lucene-analyzers-common" % "4.6.0",
.."org.apache.lucene" % "lucene-queryparser" % "4.6.0",
.."org.apache.solr" % "solr-solrj" % "4.6.0",
.."org.neo4j" % "neo4j" % "1.9.6",
.."org.neo4j" % "neo4j-rest-graphdb" % "1.9",
.."nz.ac.waikato.cms.weka" % "weka-dev" % "3.7.10",
.."org.apache.commons" % "commons-lang3" % "3.0",
.."net.sourceforge.collections" % "collections-generic" % "4.01",
.."commons-beanutils" % "commons-beanutils" % "1.8.3",
.."commons-io" % "commons-io" % "2.4",
.."io.spray" %% "spray-json" % "1.2.5",
.."log4j" % "log4j" % "1.2.14",
.."com.novocode" % "junit-interface" % "0.8" % "test"
)

Evan Smith said...

Hello,

Progress

Ok I think I have a .sbt that is reasonable. I put in your resolvers and depends. naturally when I compile the scala code it complains on the first import of the lucene.analysis.analyzer

Where is this jar in the solr install?

Evan

Sujit Pal said...

If you run "sbt eclipse" with an internet connection, it would download them all off the maven repositories you set up in the build.sbt to your $HOME/.ivy directory, as well as create your Eclipse .classpath and .project descriptors. Alternatively, if you must get them off your solr install, they are in your solr-webapp/webapp/WEB-INF/lib directory and they need to be copied into the automatically managed lib directory (under your project root).

Evan Smith said...

Hello,

I have managed to setup an eclipse project (though I did create it myself). I have all the libs, and am able to compile and run the code.

I generate the json file (looks fine).

But I get no results from my solr instance on queries. I assume that I need to reingest my documents to have them indexed with you fields.

Any pointers on what fields to copy field into or how to do an ingest against your settings?

This is totally cool, and thank you for all of your help!

I will download your eclipse project this weekend.

Evan

Sujit Pal said...

You have to run the testBuild JUnit test to load up the index with the data. Also the project is not publicly available yet (I plan on completing some more functionality before I release).

JohnT said...

Hi Sujit - I converted this to python and added https://code.google.com/p/negex/. The thinking behind python is I do, and a lot of others do, rapid web dev with djangoproject. Your parser, now in python, plus negation, could be a very powerful but simple tool on the web. How do you feel about you and I pushing it to github together as a mini opensource project? I have been using the version I converted to python with negex with much success COMPARED to cTakes, which I also use heavily (for short strings of course, not long documents). If you're interested please email me at john travis green (spaces replaced with dots) (at) gmail.com.

Sujit Pal said...

Hi John, congratulations on making the matching algorithm work with Python. Negex needs something to mark phrases of interest and you are right, this algorithm is much easier than using Metamap or cTakes. I think the combination would be a worthy open source project. But the work is all yours and I don't want to take away from it, so I would be happy if you make a mention of this page on your README file.

JohnT said...

Hi Sujit, pushed the project to github - https://github.com/jtgreen/SMPP. Effectively wrapped in negex and added a more aggressive stemmer (Lancaster) as well as an optional "include all results" which is a bit of a misnomer; simply put, it continues the cascade without short circuiting in order to broaden the net... a bit of bag of words style, only you do keep the scores so you can do some simple permutations and optimize over a solution.

Sujit Pal said...

Very cool, congratulations and thanks for the shout-out!

Anonymous said...

Have you released this project publicly? I've searched some repositories online. This is very interesting!

Sujit Pal said...

Thank you, glad you found it interesting. And yes, the code in this post is part of a public project (https://github.com/sujitpal/scalcium).