Monday, July 22, 2013

Dictionary Backed Named Entity Recognition with Lucene and LingPipe


Domain-specific Concept Search (such as ours) typically involves recognizing entities in the query and matching them up to entities that make sense in the particular domain - in our case, the entities correspond to concepts in our medical taxonomy. This does mean that we fall short when we try to solve for a slightly different use case such as doctor search.

By doctor search, I mean the search interface that health insurance sites have for their members to find a doctor near them. Typical search patterns are by zip code, city, provider name (doctor or hospital), specialty, symptom, or some combination thereof. Our Named Entity Recognition (NER) system is very good at extracting medical concepts, such as specialties and symptoms, from queries, and we can draw useful inferences based on relationship graphs, but we don't have a good way of recognizing names and addresses. So the idea is to pass the query through an additional pre-processing step through a chain of entity extractors which will identify and extract the different name and address fields.

This has been on my to-do list for a while. The doctor search project came about while I was busy with another one, and someone else built the product using different techniques. So a doctor search product already exists at the moment. This is a proof of concept for my additional pre-processing NER idea mentioned above, and it remains to be seen whether its performance compares favorably with the existing product or not.

NERs can be either regex based, dictionary based or model based. Regex based NERs match incoming text against one or more predefined regular expressions. Dictionary based NERs, also known as gazetteer based NERs, match text against a dictionary of (term, category) pairs. Model based NERs use a training set of (term, category) pairs to train a model, and then use the model to predict the category of new (potentially previously unseen) terms.

Since a doctor search application is closed, ie, the list of doctors are finite and all relevant attributes about them are known, and a search for an unknown doctor or doctor attributes returning no results is expected and desired, we only use regex based and dictionary based NERs. In this post, I describe 3 NERs built using the LingPipe API, one regex based, and two dictionary based, one in-memory and one using Lucene.

LingPipe provides a RegExChunker, which takes a regular expression and a category string, which we use to build a Zip Code NER. For the dictionary NERs, we use LingPipe's ExactDictionaryChunker which takes a MapDictionary object and a category. In the first case, the MapDictionary is in-memory and in the second, we extend MapDictionary to use a pre-populated Lucene index (via Solr). The ExactDictionaryChunker implements the Aho-Corasick algorithm for scalable (O(n)) string matching. There is also an ApproximateDictionaryChunker which returns matches within a given edit distance, but I haven't used it.


Since most of the classes are already provided by LingPipe, the only code we need to write is the interface by which our application (currently a JUnit test, but perhaps a custom QParser in the future) calls the NERs, and the custom MapDictionary subclass that uses a Solr index. These are shown in the code below:

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
package com.mycompany.solr4extras.ner

import scala.collection.JavaConversions.{asJavaIterator, asScalaIterator, asScalaSet, mapAsJavaMap}

import org.apache.solr.client.solrj.SolrServer
import org.apache.solr.common.params.{CommonParams, MapSolrParams}

import com.aliasi.chunk.{Chunk, Chunker}
import com.aliasi.dict.{DictionaryEntry, MapDictionary}

/**
 * Pass in a list of Chunkers during construction, and then
 * call the chunk method with the text to be chunked. Returns
 * a set of Chunk objects of various types in the text.
 */
class Chunkers(val chunkers: List[Chunker]) {

  def chunk(text: String): Set[Chunk] = chunkers.
    map(chunker => chunker.chunk(text).chunkSet.toList).
    flatten.
    toSet[Chunk]

  def mkString(text: String, chunk: Chunk): String = {
    val pair = mkPair(text, chunk)
    pair._1 + "/" + pair._2
  }
  
  def mkPair(text: String, chunk: Chunk): (String,String) = 
    (text.substring(chunk.start(), chunk.end()), 
      chunk.`type`())
}

/**
 * Custom MapDictionary backed by a Solr index. This is 
 * used by our Dictionary based NER (ExactMatchDictionaryChunker)
 * for large dictionaries of entity names. Dictionary entries
 * are stored as (category, value) pairs in Solr fields
 * (nercat, nerval).
 */
class SolrMapDictionary(
    val solr: SolrServer, val nrows: Int, val category: String) 
    extends MapDictionary[String] {

  override def addEntry(entry: DictionaryEntry[String]) = {} 
  
  override def iterator(): 
      java.util.Iterator[DictionaryEntry[String]] = {
    phraseEntryIt("*:*")
  }
  
  override def phraseEntryIt(phrase: String): 
      java.util.Iterator[DictionaryEntry[String]] = {
    val params = new MapSolrParams(Map(
      CommonParams.Q -> phrase,
      CommonParams.FQ -> ("nercat:" + category),
      CommonParams.FL -> "nerval",
      CommonParams.START -> "0", 
      CommonParams.ROWS -> String.valueOf(nrows)))
    val rsp = solr.query(params)
    rsp.getResults().iterator().
      toList.
      map(doc =>  new DictionaryEntry[String](
        doc.getFieldValue("nerval").asInstanceOf[String], 
        category, 1.0D)).
      iterator
  }
}

The SolrMapDictionary needs two fields nercat and nerval, that hold the NER category and the NER value respectively. These need to be defined in the schema.xml file like so:

1
2
   <field name="nercat" type="string" indexed="true" stored="true" multiValued="true"/>
   <field name="nerval" type="text_general" indexed="true" stored="true"/>

The Chunkers.chunk() method passes in an Array of Chunkers, one each for the different entities that we want to recognize, and the text to be chunked. The JUnit test below shows example calls for 3 of our NERs.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
package com.mycompany.solr4extras.ner

import scala.Array.canBuildFrom

import org.apache.solr.client.solrj.impl.{ConcurrentUpdateSolrServer, HttpSolrServer}
import org.junit.{After, Assert, Test}

import com.aliasi.chunk.RegExChunker
import com.aliasi.dict.{DictionaryEntry, ExactDictionaryChunker, MapDictionary}
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory

class ChunkersTest {

  val solrUrl = "http://localhost:8983/solr/"
  val solrWriter = new ConcurrentUpdateSolrServer(solrUrl, 10, 1)
  val solrReader = new HttpSolrServer(solrUrl)
  
  val texts = Array[String]( 
    "Cardiologist Berkeley 94701 Dr Chen",
    "Herpetologist 94015-1234",
    "Cost of care $1000 wtf",
    "San Francisco points of interest",
    "Liberty Island, New York"
  )
  val cities = Array[String]("Berkeley", "San Francisco", "New York")
  val specialists = Array[String]("cardiologist", "herpetologist")
  
  @After def teardown(): Unit = {
    solrWriter.shutdown()
    solrReader.shutdown()
  }

  @Test def testRegexChunking(): Unit = {
    val zipCodeChunker = new RegExChunker(
      "\\d{5}-\\d{4}|\\d{5}", "zipcode", 1.0D)
    val chunkers = new Chunkers(List(zipCodeChunker))
    val expected = Array[Int](1, 1, 0, 0, 0)
    val actuals = texts.map(text => {
      val chunkSet = chunkers.chunk(text)
      chunkSet.foreach(chunk => 
        Console.println(chunkers.mkString(text, chunk)))
      chunkSet.size
    })
    Assert.assertArrayEquals(expected, actuals)
  }
  
  @Test def testInMemoryDictChunking(): Unit = {
    val dict = new MapDictionary[String]()
    specialists.foreach(specialist => 
      dict.addEntry(
      new DictionaryEntry[String](specialist, "specialist", 1.0D)))
    val specialistChunker = new ExactDictionaryChunker(
      dict, IndoEuropeanTokenizerFactory.INSTANCE, false, false)   
    val chunkers = new Chunkers(List(specialistChunker))
    val expected = Array[Int](1, 1, 0, 0, 0)
    val actuals = texts.map(text => {
      val chunkSet = chunkers.chunk(text)
      chunkSet.foreach(chunk => 
        Console.println(chunkers.mkString(text, chunk)))
      chunkSet.size
    })
    Assert.assertArrayEquals(expected, actuals)
  }
  
  @Test def testSolrDictChunking(): Unit = {
    val dict = new SolrMapDictionary(solrReader, 18000, "city")
    val cityChunker = new ExactDictionaryChunker(
      dict, IndoEuropeanTokenizerFactory.INSTANCE, false, false)
    val chunkers = new Chunkers(List(cityChunker))
    val expected = Array[Int](1, 0, 0, 1, 2)
    val actuals = texts.map(text => {
      val chunkSet = chunkers.chunk(text)
      chunkSet.foreach(chunk => 
        Console.println(chunkers.mkString(text, chunk)))
      chunkSet.size
    })
    Assert.assertArrayEquals(expected, actuals)
  }
}

The results from the run are as follows (edited slightly for readability), and show the snippets of the input text that were matched (the entities or chunks) and what entity category they were matched to. As you can see, there is a false positive - Liberty is being considered a city.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
94701/zipcode
94015-1234/zipcode

Cardiologist/specialist
Herpetologist/specialist

Berkeley/city
San Francisco/city
Liberty/city
New York/city

While the SolrMapDictionary returns the expected results, I found that it calls the iterator() method once for the entire run and thus needs to have the full dataset of 17k+ cities loaded in one shot, which kind of defeats the purpose of having an external store - I would have expected the phraseEntryIt() method to be called with specific phrases instead. I haven't had a chance to investigate this in depth yet, will update the post when I find out more.

References



8 comments (moderated to prevent spam):

David Smiley said...

Hi Sujit,
To support a dictionary/gazetteer with millions of entries, I recommend you use the SolrTextTagger:
https://github.com/OpenSextant/SolrTextTagger
I hope you will find my Lucene Revolution presentation on it interesting:
http://www.youtube.com/watch?v=3kQyYbTyXfc

Sujit Pal said...

Thanks David, excellent and very informative presentation, many thanks for doing this. I do remember seeing discussions about the Lucene FST on the Solr/Lucene ML, but I wasn't following it closely, so missed the correspondence (that both this and the LingPipe ExactDictionaryChunker implements an FST). I will download your SolrTagger code from github tonight and take a look.

BTW, not sure if you already figured out why your layered FST takes more memory than having a single one, but its because there is no duplication with a single FST (duplicate words are in the same node in the graph), when you layer it, you are adding extra single word nodes which probably account for the extra 41%. Also BTW, I've also tried the MySQL approach (for a different project consisting of about 3M vocabulary phrases) and abandoned for the same reason (performance sucked) :-).

Regarding your final comment about using ShingleFilter - I think you would need it only if you didn't have an FST and you stored your vocabulary in a dictionary. I used a ShingleFilter to build a moving/contracting window over text to find candidate matches for the MySQL based approach.

Sorry about posting this (last 2 paras) here. The comment system on Youtube wouldn't let me comment even though I was logged into Google and wouldn't tell me what else it needed for me to authenticate myself.

David Smiley said...

I'm glad you found my presentation informative.

Your theory on the extra word nodes makes sense. Another possibility suggested by Rob Muir is that it's the way I've configured the 2nd FST -- using 4-byte int word inputs. I did this because I had integers but Rob said I could still use 1 byte words and consumed them 4 at a time from the FST. That would have helped.

Any way, in my presentation I indicated that in hindsight I could instead use the Lucene index itself, particularly with the Memory postings format, to get better compression results and to reduce the code I maintain. Last night I implemented this major change to the SolrTextTagger in the "MemPF" branch, tracked by github issue #8. I'm anxiously awaiting to see how it impacts memory and performance.

Sujit Pal said...

Thanks, I will check it out. I have at least 2 applications that would benefit from the use of SolrTextTagger (this and another one).

Saurabh Garg said...

Sir, can you give a link to the code of NER system which is good at extracting symptoms? Needed so that I can continue my project.

For example:

She denies any COUGH or sputum production

Output - cough, sputum production

Thanks in advance. :)

Sujit Pal said...

Hi Saurabh, that NER system is our proprietary one, it doesn't have a public API. There is a similar service called Metamap from NIH, for which a Java API exists. Maybe take a look at that? Also check out Alchemy and OpenCalais, I believe both can recognize medical entities (although probably not as well as Metamap or our proprietary one.

Manuel LeNormand said...

Do you have any insights about the different available regex extractors: 1. LingPipe's RegExChunker
2. Stanford TokensRegex
3. UIMA RegexAnnotator
4. GATE JAPE

IS there any real difference? optimization, scalability?

As my system is written in .Net, what are the drawbacks of writing the regex rule base on my own? Is there something to be aware of, ie performance or error handling, that the above frameworks take care of?

Sujit Pal said...

Hi Manuel, I am familiar with #1 and #3 only. No real insights except that both (as far as I remember) take a list of regexes and apply them to the string in sequence. short circuiting on the first match (so order matters). At least for regex based chunking I don't see much need to use them in preference to something you could write yourself, except for convenience. But you should probably ask this question on the mailing lists of these projects, you will probably get a more authoritative answer.