I started looking at the Unified Medical Language System (UMLS) Database because it is an important data source for the Apache cTakes pipeline. UMLS integrates multiple medical terminologies into a single dictionary, which can be used for recognizing medical entities in text. Our proprietary concept mapping algorithm uses a superset of UMLS (combined with other terminologies, and manually curated by our team of in-house medical folks) to extract meaning from medical text. Both cTakes and Hitex (another NLP application focused on medical text) use Metamap (via its Java API), a tool provided by the National Library of Medicine (NLM) to recognize UMLS concepts in text.
I was curious about how Metamap does it - it turns out to use a fairly complex algorithm, described in this paper by Alan Aronson, the creator of Metamap, and implemented in Prolog and C. Around the same time, I also happened to come across the Python fuzzywuzzy library for fuzzy string matching, and the Java OpenReconcile project (used in the OpenRefine project) for matching entities in text. This gave me the idea for an algorithm that could annotate text using the UMLS database as a dictionary, which I describe here. It is very different from the one we use at work, whose advantage is that it has been vetted and tweaked extensively over the years to yield good results across our entire taxonomy.
My matching algorithm matches the incoming phrase against the original strings from UMLS, then against the case- and punctuation-normalized version (i.e., lowercased, with punctuation replaced by spaces), then against a version with all the words in the normalized string alpha-sorted, and finally against a version with the normalized string run through Lucene's StandardAnalyzer (which lowercases and drops stop words rather than doing true stemming, but I refer to this step as "stemming" below). Each level in this cascade is assigned a successively lower score - 100, 90, 80 and 70 in the code below - and a match at any level short-circuits the remaining levels. If no match is found, we construct n-grams of the phrase, where n varies from N-1 down to 1 (N being the number of words in the phrase). Like the incoming phrase, each n-gram is transformed and matched against the dictionary, and a match short-circuits the cascade for that n-gram. Words in n-grams that matched a dictionary entry are ignored in subsequent matches. Scores for n-grams are discounted by the ratio of the number of words in the n-gram to the number of words in the entire phrase.
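To make the n-gram discounting concrete, here is a tiny standalone sketch (the helper is purely illustrative and not part of the tagger class below) of how a partial match would be scored:

// Illustrative only: the n-gram score discount described above.
// A 2-word n-gram matching at the exact level (base score 100) inside a
// 4-word phrase scores 100 * 2/4 = 50.
def discountedScore(baseScore: Double, ngram: String, phrase: String): Double = {
  val ngramWords = ngram.split(" ").length
  val phraseWords = phrase.split(" ").length
  baseScore * ngramWords.toDouble / phraseWords
}
// discountedScore(100.0, "heart attack", "heart attack and diabetes") == 50.0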
I store the dictionary as a Lucene index, although I am not using any of Lucene's full-text search capabilities - all matches are exact term matches - so I could just as easily have stored the dictionary in a relational database or key-value store. I do, however, use Lucene's StandardAnalyzer to "stem" the dictionary entries during indexing and the incoming phrases during searching.
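As an aside, it is easy to see what StandardAnalyzer actually does to a phrase - it tokenizes, lowercases and drops English stop words, but does not do true stemming. Here is a quick sketch (Lucene 4.6, essentially the same logic as the stemWords() method further down):

import java.io.StringReader
import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
import org.apache.lucene.util.Version

object AnalyzerDemo extends App {
  // run a phrase through StandardAnalyzer and print the surviving tokens
  val analyzer = new StandardAnalyzer(Version.LUCENE_46)
  val ts = analyzer.tokenStream("str_stemmed", new StringReader("Heart Attack and Diabetes"))
  val term = ts.addAttribute(classOf[CharTermAttribute])
  ts.reset()
  while (ts.incrementToken()) print(term.toString + " ") // prints: heart attack diabetes
  ts.end()
  ts.close()
}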
To get the data out of the UMLS database into a flat file, I run the following SQL on the MRCONSO table.
mysql> select CUI, STR from MRCONSO
    ... where LAT = 'ENG'
    ... into outfile '/tmp/cuistr.csv'
    ... fields terminated by ','
    ... enclosed by '"'
    ... lines terminated by '\n';
Query OK, 7871075 rows affected (23.30 sec)
I then removed duplicates (there are some records with the same CUI and STR values from different sources) using this Unix call. This resulted in 6,818,318 rows of input.
sujit@tsunami:~$ cat cuistr.csv | sort | uniq > cuistr1.csv
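Each line of the resulting cuistr1.csv is a quoted (CUI, STR) pair, which is the format buildIndex() below expects. A couple of illustrative lines (taken from concepts that show up in the test output later in this post):

"C0027051","Heart Attack"
"C0011847","Diabetes"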
Here is the code. The buildIndex() method creates the index and the annotateConcepts() method matches the incoming phrase (as described above) to the entries in the index.
// Source: src/main/scala/com/mycompany/scalcium/umls/UmlsTagger.scala
package com.mycompany.scalcium.umls
import java.io.File
import java.io.StringReader
import java.util.concurrent.atomic.AtomicLong
import java.util.regex.Pattern
import scala.Array.canBuildFrom
import scala.Array.fallbackCanBuildFrom
import scala.collection.mutable.ArrayBuffer
import scala.io.Source
import org.apache.lucene.analysis.Analyzer
import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
import org.apache.lucene.document.Document
import org.apache.lucene.document.Field
import org.apache.lucene.document.Field.Index
import org.apache.lucene.document.Field.Store
import org.apache.lucene.index.DirectoryReader
import org.apache.lucene.index.IndexReader
import org.apache.lucene.index.IndexWriter
import org.apache.lucene.index.IndexWriterConfig
import org.apache.lucene.index.Term
import org.apache.lucene.search.BooleanClause.Occur
import org.apache.lucene.search.BooleanQuery
import org.apache.lucene.search.IndexSearcher
import org.apache.lucene.search.ScoreDoc
import org.apache.lucene.search.TermQuery
import org.apache.lucene.store.SimpleFSDirectory
import org.apache.lucene.util.Version
class UmlsTagger {
val punctPattern = Pattern.compile("\\p{Punct}")
val spacePattern = Pattern.compile("\\s+")
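// builds the dictionary index from the flat file of (CUI, STR) pairs created
// above: one Lucene document per pair, storing the original string along with
// its normalized, word-sorted and "stemmed" variants as separate exact-match fields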
def buildIndex(inputFile: File,
luceneDir: File): Unit = {
// set up the index writer
val analyzer = getAnalyzer()
val iwconf = new IndexWriterConfig(Version.LUCENE_46, analyzer)
iwconf.setOpenMode(IndexWriterConfig.OpenMode.CREATE)
val writer = new IndexWriter(new SimpleFSDirectory(luceneDir), iwconf)
// read through input file and write out to lucene
val counter = new AtomicLong(0L)
val linesReadCounter = new AtomicLong(0L)
Source.fromFile(inputFile)
.getLines()
.foreach(line => {
val linesRead = linesReadCounter.incrementAndGet()
if (linesRead % 1000 == 0) Console.println("%d lines read".format(linesRead))
val Array(cui, str) = line
.replace("\",\"", "\t")
.replaceAll("\"", "")
.split("\t")
val strNorm = normalizeCasePunct(str)
val strSorted = sortWords(strNorm)
val strStemmed = stemWords(strNorm)
// write full str record
// str = exact string
// str_norm = case and punct normalized, exact
// str_sorted = str_norm sorted
// str_stemmed = str_norm stemmed (lowercased, stop words dropped)
val fdoc = new Document()
val fid = counter.incrementAndGet()
fdoc.add(new Field("id", fid.toString, Store.YES, Index.NOT_ANALYZED))
fdoc.add(new Field("cui", cui, Store.YES, Index.NOT_ANALYZED))
fdoc.add(new Field("str", str, Store.YES, Index.NOT_ANALYZED))
fdoc.add(new Field("str_norm", strNorm, Store.YES, Index.NOT_ANALYZED))
fdoc.add(new Field("str_sorted", strSorted, Store.YES, Index.NOT_ANALYZED))
fdoc.add(new Field("str_stemmed", strStemmed, Store.YES, Index.NOT_ANALYZED))
writer.addDocument(fdoc)
if (fid % 1000 == 0) writer.commit()
})
writer.commit()
writer.close()
}
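// annotates a phrase against the index, returning (score, matched string, CUI)
// tuples, deduplicated by CUI and sorted by descending score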
def annotateConcepts(phrase: String,
luceneDir: File):
List[(Double,String,String)] = {
val suggestions = ArrayBuffer[(Double,String,String)]()
val reader = DirectoryReader.open(
new SimpleFSDirectory(luceneDir))
val searcher = new IndexSearcher(reader)
// try to match full string
suggestions ++= cascadeSearch(searcher, reader,
phrase, 1.0)
if (suggestions.size == 0) {
// no exact match found, fall back to inexact matches
val words = normalizeCasePunct(phrase)
.split(" ")
val foundWords = scala.collection.mutable.Set[String]()
for (nword <- words.size - 1 until 0 by -1) {
words.sliding(nword)
.map(ngram => ngram.mkString(" "))
.foreach(ngram => {
if (alreadySeen(foundWords, ngram)) {
val ngramWords = ngram.split(" ")
val ratio = ngramWords.size.toDouble / words.size
val inexactSuggestions = cascadeSearch(
searcher, reader, ngram, ratio)
if (inexactSuggestions.size > 0) {
suggestions ++= inexactSuggestions
foundWords ++= ngramWords
}
}
})
}
}
if (suggestions.size > 0) {
// dedup by cui, keeping the first matched
val deduped = suggestions.groupBy(_._3)
.map(kv => kv._2.head)
.toList
.sortWith((a,b) => a._1 > b._1)
suggestions.clear
suggestions ++= deduped
}
// clean up
reader.close()
// return results
suggestions.toList
}
def printConcepts(
concepts: List[(Double,String,String)]):
Unit = {
concepts.foreach(concept =>
Console.println("[%6.2f%%] (%s) %s"
.format(concept._1, concept._3, concept._2)))
}
def normalizeCasePunct(str: String): String = {
val str_lps = punctPattern
.matcher(str.toLowerCase())
.replaceAll(" ")
spacePattern.matcher(str_lps).replaceAll(" ")
}
def sortWords(str: String): String = {
val words = str.split(" ")
words.sortWith(_ < _).mkString(" ")
}
def stemWords(str: String): String = {
val stemmedWords = ArrayBuffer[String]()
val tokenStream = getAnalyzer().tokenStream(
"str_stemmed", new StringReader(str))
val ctattr = tokenStream.addAttribute(
classOf[CharTermAttribute])
tokenStream.reset()
while (tokenStream.incrementToken()) {
stemmedWords += ctattr.toString()
}
stemmedWords.mkString(" ")
}
def getAnalyzer(): Analyzer = {
new StandardAnalyzer(Version.LUCENE_46)
}
case class StrDocument(id: Int,
cui: String, str: String,
strNorm: String, strSorted: String,
strStemmed: String)
def getDocument(reader: IndexReader,
hit: ScoreDoc): StrDocument = {
val doc = reader.document(hit.doc)
StrDocument(doc.get("id").toInt,
doc.get("cui"), doc.get("str"),
doc.get("str_norm"), doc.get("str_sorted"),
doc.get("str_stemmed"))
}
def cascadeSearch(searcher: IndexSearcher,
reader: IndexReader, phrase: String,
ratio: Double):
List[(Double,String,String)] = {
val results = ArrayBuffer[(Double,String,String)]()
// exact match (100.0%)
val query1 = new TermQuery(new Term("str", phrase))
val hits1 = searcher.search(query1, 1).scoreDocs
if (hits1.size > 0) {
results += hits1.map(hit => {
val doc = getDocument(reader, hit)
(100.0 * ratio, doc.str, doc.cui)
})
.toList
.head
}
// match normalized phrase (90%)
val normPhrase = normalizeCasePunct(phrase)
if (results.size == 0) {
val query2 = new TermQuery(new Term("str_norm", normPhrase))
val hits2 = searcher.search(query2, 1).scoreDocs
if (hits2.size > 0) {
results += hits2.map(hit => {
val doc = getDocument(reader, hit)
(90.0 * ratio, doc.str, doc.cui)
})
.toList
.head
}
}
// match sorted phrase (80%)
val sortedPhrase = sortWords(normPhrase)
if (results.size == 0) {
val query3 = new TermQuery(new Term("str_sorted", sortedPhrase))
val hits3 = searcher.search(query3, 1).scoreDocs
if (hits3.size > 0) {
results += hits3.map(hit => {
val doc = getDocument(reader, hit)
(80.0 * ratio, doc.str, doc.cui)
})
.toList
.head
}
}
// match stemmed phrase (70%)
val stemmedPhrase = stemWords(normPhrase)
if (results.size == 0) {
val query4 = new TermQuery(new Term("str_stemmed", stemmedPhrase))
val hits4 = searcher.search(query4, 1).scoreDocs
if (hits4.size > 0) {
results += hits4.map(hit => {
val doc = getDocument(reader, hit)
(70.0 * ratio, doc.str, doc.cui)
})
.toList
.head
}
}
results.toList
}
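// despite its name, this returns true if the ngram contains at least one word
// that has not yet been part of a match, i.e. the ngram is still worth searching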
def alreadySeen(
refset: scala.collection.mutable.Set[String],
ngram: String): Boolean = {
val words = ngram.split(" ")
val nseen = words.filter(word =>
!refset.contains(word))
.size
if (refset.size == 0) true
else if (nseen > 0) true
else false
}
}
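If you want to compile the tagger yourself, the only Lucene artifacts it should need are lucene-core and lucene-analyzers-common (4.6.0 here); my full set of build.sbt dependencies is listed in the comments at the end of this post.

"org.apache.lucene" % "lucene-core" % "4.6.0",
"org.apache.lucene" % "lucene-analyzers-common" % "4.6.0",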
I just used JUnit tests to exercise the various bits of functionality. Building the index takes about 40 minutes on my laptop. Here is the JUnit test class.
// Source: src/test/scala/com/mycompany/scalcium/umls/UmlsTaggerTest.scala
package com.mycompany.scalcium.umls
import org.junit.Test
import java.io.File
import org.junit.Assert
class UmlsTaggerTest {
@Test
def testSortWords(): Unit = {
val s = "heart attack and diabetes"
val tagger = new UmlsTagger()
Assert.assertEquals("and attack diabetes heart", tagger.sortWords(s))
}
@Test
def testStemWords(): Unit = {
val s = "and attack diabetes heart"
val tagger = new UmlsTagger()
Assert.assertEquals("attack diabetes heart", tagger.stemWords(s))
}
@Test
def testBuild(): Unit = {
val input = new File("/home/sujit/Projects/med_data/cuistr1.csv")
val output = new File("/home/sujit/Projects/med_data/umlsindex")
val tagger = new UmlsTagger()
tagger.buildIndex(input, output)
}
@Test
def testMapSingleConcept(): Unit = {
val luceneDir = new File("/home/sujit/Projects/med_data/umlsindex")
val tagger = new UmlsTagger()
val strs = List("Lung Cancer", "Heart Attack", "Diabetes")
strs.foreach(str => {
val concepts = tagger.annotateConcepts(str, luceneDir)
Console.println("Query: " + str)
tagger.printConcepts(concepts)
Assert.assertEquals(1, concepts.size)
Assert.assertEquals(100.0D, concepts.head._1, 0.1D)
})
}
@Test
def testMapMultipleConcepts(): Unit = {
val luceneDir = new File("/home/sujit/Projects/med_data/umlsindex")
val tagger = new UmlsTagger()
val strs = List(
"Heart Attack and diabetes",
"carcinoma (small-cell) of lung",
"asthma side effects")
strs.foreach(str => {
val concepts = tagger.annotateConcepts(str, luceneDir)
Console.println("Query: " + str)
tagger.printConcepts(concepts)
})
}
}
The last two tests in the JUnit class above exercise exact and inexact matching: the first three queries match exactly, while the next three only match partially. For partial matches, each score is discounted by the fraction of the phrase's words the matched n-gram covers - for example, "side effects" covers two of the three words of "asthma side effects", so it scores 100 × 2/3 ≈ 66.67. Output from these tests is shown below:
Query: Lung Cancer
[100.00%] (C0242379) Lung Cancer
Query: Heart Attack
[100.00%] (C0027051) Heart Attack
Query: Diabetes
[100.00%] (C0011847) Diabetes
Query: Heart Attack and diabetes
[ 50.00%] (C0027051) heart attack
[ 35.00%] (C0011847) Diabetes
[ 35.00%] (C0699795) Attack
Query: carcinoma (small-cell) of lung
[ 80.00%] (C0149925) small cell carcinoma of lung
Query: asthma side effects
[ 66.67%] (C0001688) side effects
[ 33.33%] (C0004096) asthma
As you can see, the results don't look too bad. The algorithm is also not very complex; it borrows heavily from the ideas used in the fuzzywuzzy and OpenReconcile projects. Of course, I have only tried it on a very small set of queries, so it remains to be seen whether it produces good results as consistently as the one we use at work.
21 comments (moderated to prevent spam):
This was meant to be posted here:
Brilliant! I had had this same idea some months back, but I don't think I could have pulled it off half as well as you. Looking forward to implementing it soon!
Thanks, John G. You may also want to take a look at this post, where I use David Smiley's SolrTextTagger - cuts out the need to ngram the query for partial matches.
What version of Lucene are you using?
I'm using Lucene 4.6.0. Here is the relevant snippet from my build.sbt:
"org.apache.lucene" % "lucene-core" % "4.6.0",
"org.apache.lucene" % "lucene-queries" % "4.6.0",
"org.apache.lucene" % "lucene-analyzers-common" % "4.6.0",
"org.apache.lucene" % "lucene-queryparser" % "4.6.0",
"org.apache.solr" % "solr-solrj" % "4.6.0",
Whew! I'm new to Lucene and Eclipse and setting this up was a pain, only because of my ignorance in those areas.
So, if anyone tries to recreate this great code, the following tip may help:
After importing the Lucene Core org directory AND the Lucene Analysis directory (there will be errors; ignore them if you're processing English), ADD the META-INF directory to the src directory wherever you set up Lucene Core (I did it as a separate project). This may seem obvious to those comfortable with the environment, but it took me some frustrating hours to figure out. The root of this is the SPI model that Lucene 4 went to, as I understand it.
I'm on JDK 1.6 running Eclipse Kepler with Lucene 4.6.
Hope this helps someone.
Thanks for the wonderful ideas Sujit.
JG
I generally set up Scala projects using the giter8 (g8) template generator. Then I manually update the build.sbt to include my dependencies (and resolvers if necessary), then use "sbt eclipse" to build the project artifacts (.classpath, etc). At that point, all you have to do with Eclipse is to create a Scala project and point the IDE to the project directory.
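(A side note for anyone following along: the "sbt eclipse" command is provided by the sbteclipse plugin, which has to be declared in project/plugins.sbt; the version number below is illustrative, use whatever is current.)

// project/plugins.sbt
addSbtPlugin("com.typesafe.sbteclipse" % "sbteclipse-plugin" % "2.4.0")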
Hello,
I have gotten to the Scala code and am unsure how to set it up and compile it.
Can you post your eclipse project or give more details about
- updating the build.sbt
- sbt eclipse
and perhaps what version of eclipse and what your class path was (as an example)?
Looking forward to running this.
Thanks!
Evan
Hello,
Ok I have giter8 installed.
Which template do you use for the setup?
Evan
Hi Evan, I used the n8han/giter8 template. Here is the snippet from my build.sbt listing out the resolvers and dependencies:
resolvers ++= Seq(
  "spray" at "https://repo.spray.io/",
  "Neo4j-Contrib" at "http://m2.neo4j.org/content/groups/everything"
)
libraryDependencies ++= Seq(
  "org.apache.opennlp" % "opennlp-maxent" % "3.0.3",
  "org.apache.opennlp" % "opennlp-tools" % "1.5.3",
  "org.apache.lucene" % "lucene-core" % "4.6.0",
  "org.apache.lucene" % "lucene-queries" % "4.6.0",
  "org.apache.lucene" % "lucene-analyzers-common" % "4.6.0",
  "org.apache.lucene" % "lucene-queryparser" % "4.6.0",
  "org.apache.solr" % "solr-solrj" % "4.6.0",
  "org.neo4j" % "neo4j" % "1.9.6",
  "org.neo4j" % "neo4j-rest-graphdb" % "1.9",
  "nz.ac.waikato.cms.weka" % "weka-dev" % "3.7.10",
  "org.apache.commons" % "commons-lang3" % "3.0",
  "net.sourceforge.collections" % "collections-generic" % "4.01",
  "commons-beanutils" % "commons-beanutils" % "1.8.3",
  "commons-io" % "commons-io" % "2.4",
  "io.spray" %% "spray-json" % "1.2.5",
  "log4j" % "log4j" % "1.2.14",
  "com.novocode" % "junit-interface" % "0.8" % "test"
)
Hello,
Progress
OK, I think I have a .sbt that is reasonable. I put in your resolvers and dependencies. Naturally, when I compile the Scala code it complains on the first import, of lucene.analysis.Analyzer.
Where is this jar in the solr install?
Evan
If you run "sbt eclipse" with an internet connection, it would download them all off the maven repositories you set up in the build.sbt to your $HOME/.ivy directory, as well as create your Eclipse .classpath and .project descriptors. Alternatively, if you must get them off your solr install, they are in your solr-webapp/webapp/WEB-INF/lib directory and they need to be copied into the automatically managed lib directory (under your project root).
Hello,
I have managed to setup an eclipse project (though I did create it myself). I have all the libs, and am able to compile and run the code.
I generate the json file (looks fine).
But I get no results from my Solr instance on queries. I assume that I need to reingest my documents to have them indexed with your fields.
Any pointers on what fields to copyField into, or how to do an ingest against your settings?
This is totally cool, and thank you for all of your help!
I will download your eclipse project this weekend.
Evan
You have to run the testBuild JUnit test to load up the index with the data. Also the project is not publicly available yet (I plan on completing some more functionality before I release).
Hi Sujit - I converted this to python and added https://code.google.com/p/negex/. The thinking behind python is I do, and a lot of others do, rapid web dev with djangoproject. Your parser, now in python, plus negation, could be a very powerful but simple tool on the web. How do you feel about you and I pushing it to github together as a mini opensource project? I have been using the version I converted to python with negex with much success COMPARED to cTakes, which I also use heavily (for short strings of course, not long documents). If you're interested please email me at john travis green (spaces replaced with dots) (at) gmail.com.
Hi John, congratulations on making the matching algorithm work with Python. Negex needs something to mark phrases of interest and you are right, this algorithm is much easier than using Metamap or cTakes. I think the combination would be a worthy open source project. But the work is all yours and I don't want to take away from it, so I would be happy if you make a mention of this page on your README file.
Hi Sujit, pushed the project to github - https://github.com/jtgreen/SMPP. Effectively wrapped in negex and added a more aggressive stemmer (Lancaster) as well as an optional "include all results" which is a bit of a misnomer; simply put, it continues the cascade without short circuiting in order to broaden the net... a bit of bag of words style, only you do keep the scores so you can do some simple permutations and optimize over a solution.
Very cool, congratulations and thanks for the shout-out!
Have you released this project publicly? I've searched some repositories online. This is very interesting!
Thank you, glad you found it interesting. And yes, the code in this post is part of a public project (https://github.com/sujitpal/scalcium).
Hi Sujit,
Can we implement this in Python?
If yes, please suggest how (which libraries will help)?
Yes, I think you can implement this in Python fairly easily. The main problem is Lucene, which you can replace with a Solr/Elasticsearch server. To build the index, you can reuse the standard analyzers available in Solr/ES, and the string manipulation to sort and stem the words can easily be done during preprocessing in Python as well. Same with the query side: it essentially does 4 queries and sorts the results according to weighted scores. To send and receive HTTP requests to and from Solr, you could use the requests library, and json for building and parsing the request and response.