Domain-specific Concept Search (such as ours) typically involves recognizing entities in the query and matching them to entities that make sense in the particular domain - in our case, the entities correspond to concepts in our medical taxonomy. This does mean that we fall short when we try to tackle a slightly different use case, such as doctor search.
By doctor search, I mean the search interface that health insurance sites provide for their members to find a doctor near them. Typical search patterns are by zip code, city, provider name (doctor or hospital), specialty, symptom, or some combination thereof. Our Named Entity Recognition (NER) system is very good at extracting medical concepts, such as specialties and symptoms, from queries, and we can draw useful inferences based on relationship graphs, but we don't have a good way of recognizing names and addresses. So the idea is to add a pre-processing step that passes the query through a chain of entity extractors, which identify and extract the different name and address fields.
This has been on my to-do list for a while. The doctor search project came up while I was busy with another one, and someone else built the product using different techniques, so a doctor search product already exists. This post is a proof of concept for the additional pre-processing NER idea mentioned above, and it remains to be seen whether its performance compares favorably with the existing product.
NERs can be regex based, dictionary based, or model based. Regex based NERs match incoming text against one or more predefined regular expressions. Dictionary based NERs, also known as gazetteer based NERs, match text against a dictionary of (term, category) pairs. Model based NERs use a training set of (term, category) pairs to train a model, and then use the model to predict the category of new (potentially previously unseen) terms.
Since a doctor search application is closed, i.e., the list of doctors is finite and all relevant attributes about them are known, and a search for an unknown doctor or doctor attribute is expected (and desired) to return no results, we use only regex based and dictionary based NERs. In this post, I describe three NERs built using the LingPipe API: one regex based and two dictionary based, one in-memory and one using Lucene.
LingPipe provides a RegExChunker, which takes a regular expression and a category string; we use it to build a Zip Code NER. For the dictionary NERs, we use LingPipe's ExactDictionaryChunker, which takes a MapDictionary object and a tokenizer factory. In the first case the MapDictionary is in-memory, and in the second we extend MapDictionary to use a pre-populated Lucene index (via Solr). The ExactDictionaryChunker implements the Aho-Corasick algorithm for scalable (O(n)) string matching. There is also an ApproximateDictionaryChunker which returns matches within a given edit distance, but I haven't used it.
Since most of the classes are already provided by LingPipe, the only code we need to write is the interface by which our application (currently a JUnit test, but perhaps a custom QParser in the future) calls the NERs, and the custom MapDictionary subclass that uses a Solr index. These are shown in the code below:
package com.mycompany.solr4extras.ner

import scala.collection.JavaConversions.{asJavaIterator, asScalaIterator, asScalaSet, mapAsJavaMap}

import org.apache.solr.client.solrj.SolrServer
import org.apache.solr.common.params.{CommonParams, MapSolrParams}

import com.aliasi.chunk.{Chunk, Chunker}
import com.aliasi.dict.{DictionaryEntry, MapDictionary}

/**
 * Pass in a list of Chunkers during construction, and then
 * call the chunk method with the text to be chunked. Returns
 * a set of Chunk objects of various types in the text.
 */
class Chunkers(val chunkers: List[Chunker]) {

  def chunk(text: String): Set[Chunk] = chunkers.
    map(chunker => chunker.chunk(text).chunkSet.toList).
    flatten.
    toSet[Chunk]

  def mkString(text: String, chunk: Chunk): String = {
    val pair = mkPair(text, chunk)
    pair._1 + "/" + pair._2
  }

  def mkPair(text: String, chunk: Chunk): (String,String) =
    (text.substring(chunk.start(), chunk.end()),
      chunk.`type`())
}

/**
 * Custom MapDictionary backed by a Solr index. This is
 * used by our Dictionary based NER (ExactDictionaryChunker)
 * for large dictionaries of entity names. Dictionary entries
 * are stored as (category, value) pairs in Solr fields
 * (nercat, nerval).
 */
class SolrMapDictionary(
    val solr: SolrServer, val nrows: Int, val category: String)
    extends MapDictionary[String] {

  override def addEntry(entry: DictionaryEntry[String]) = {}

  override def iterator():
      java.util.Iterator[DictionaryEntry[String]] = {
    phraseEntryIt("*:*")
  }

  override def phraseEntryIt(phrase: String):
      java.util.Iterator[DictionaryEntry[String]] = {
    val params = new MapSolrParams(Map(
      CommonParams.Q -> phrase,
      CommonParams.FQ -> ("nercat:" + category),
      CommonParams.FL -> "nerval",
      CommonParams.START -> "0",
      CommonParams.ROWS -> String.valueOf(nrows)))
    val rsp = solr.query(params)
    rsp.getResults().iterator().
      toList.
      map(doc => new DictionaryEntry[String](
        doc.getFieldValue("nerval").asInstanceOf[String],
        category, 1.0D)).
      iterator
  }
}
The SolrMapDictionary needs two fields, nercat and nerval, which hold the NER category and the NER value respectively. These need to be defined in the schema.xml file like so:
<field name="nercat" type="string" indexed="true" stored="true" multiValued="true"/>
<field name="nerval" type="text_general" indexed="true" stored="true"/>
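The post does not cover how the city dictionary gets into the nercat/nerval fields in the first place. Below is a minimal, hypothetical loader sketch using SolrJ; the cities.txt file (one city name per line), the id field, and the document ids are assumptions for illustration, not part of the actual setup.

package com.mycompany.solr4extras.ner

import scala.io.Source

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer
import org.apache.solr.common.SolrInputDocument

object CityDictionaryLoader extends App {

  val solrWriter = new ConcurrentUpdateSolrServer(
    "http://localhost:8983/solr/", 10, 1)

  // hypothetical input: one city name per line
  Source.fromFile("cities.txt").getLines().zipWithIndex.foreach {
    case (city, i) =>
      val doc = new SolrInputDocument()
      doc.addField("id", "city-" + i)    // unique key, assumed to exist in the schema
      doc.addField("nercat", "city")     // NER category
      doc.addField("nerval", city.trim)  // NER value matched by SolrMapDictionary
      solrWriter.add(doc)
  }
  solrWriter.commit()
  solrWriter.shutdown()
}

With 17k+ cities this is a one-time load, after which the SolrMapDictionary can pull them back by filtering on nercat.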
The Chunkers class is constructed with a List of Chunkers, one for each type of entity we want to recognize; its chunk() method is then called with the text to be chunked. The JUnit test below shows example calls for three of our NERs.
package com.mycompany.solr4extras.ner

import scala.Array.canBuildFrom

import org.apache.solr.client.solrj.impl.{ConcurrentUpdateSolrServer, HttpSolrServer}
import org.junit.{After, Assert, Test}

import com.aliasi.chunk.RegExChunker
import com.aliasi.dict.{DictionaryEntry, ExactDictionaryChunker, MapDictionary}
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory

class ChunkersTest {

  val solrUrl = "http://localhost:8983/solr/"
  val solrWriter = new ConcurrentUpdateSolrServer(solrUrl, 10, 1)
  val solrReader = new HttpSolrServer(solrUrl)

  val texts = Array[String](
    "Cardiologist Berkeley 94701 Dr Chen",
    "Herpetologist 94015-1234",
    "Cost of care $1000 wtf",
    "San Francisco points of interest",
    "Liberty Island, New York"
  )
  val cities = Array[String]("Berkeley", "San Francisco", "New York")
  val specialists = Array[String]("cardiologist", "herpetologist")

  @After def teardown(): Unit = {
    solrWriter.shutdown()
    solrReader.shutdown()
  }

  @Test def testRegexChunking(): Unit = {
    val zipCodeChunker = new RegExChunker(
      "\\d{5}-\\d{4}|\\d{5}", "zipcode", 1.0D)
    val chunkers = new Chunkers(List(zipCodeChunker))
    val expected = Array[Int](1, 1, 0, 0, 0)
    val actuals = texts.map(text => {
      val chunkSet = chunkers.chunk(text)
      chunkSet.foreach(chunk =>
        Console.println(chunkers.mkString(text, chunk)))
      chunkSet.size
    })
    Assert.assertArrayEquals(expected, actuals)
  }

  @Test def testInMemoryDictChunking(): Unit = {
    val dict = new MapDictionary[String]()
    specialists.foreach(specialist =>
      dict.addEntry(
        new DictionaryEntry[String](specialist, "specialist", 1.0D)))
    val specialistChunker = new ExactDictionaryChunker(
      dict, IndoEuropeanTokenizerFactory.INSTANCE, false, false)
    val chunkers = new Chunkers(List(specialistChunker))
    val expected = Array[Int](1, 1, 0, 0, 0)
    val actuals = texts.map(text => {
      val chunkSet = chunkers.chunk(text)
      chunkSet.foreach(chunk =>
        Console.println(chunkers.mkString(text, chunk)))
      chunkSet.size
    })
    Assert.assertArrayEquals(expected, actuals)
  }

  @Test def testSolrDictChunking(): Unit = {
    val dict = new SolrMapDictionary(solrReader, 18000, "city")
    val cityChunker = new ExactDictionaryChunker(
      dict, IndoEuropeanTokenizerFactory.INSTANCE, false, false)
    val chunkers = new Chunkers(List(cityChunker))
    val expected = Array[Int](1, 0, 0, 1, 2)
    val actuals = texts.map(text => {
      val chunkSet = chunkers.chunk(text)
      chunkSet.foreach(chunk =>
        Console.println(chunkers.mkString(text, chunk)))
      chunkSet.size
    })
    Assert.assertArrayEquals(expected, actuals)
  }
}
The results from the run are as follows (edited slightly for readability), and show the snippets of the input text that were matched (the entities or chunks) and the entity category they were matched to. As you can see, there is one false positive - Liberty (from "Liberty Island") is being treated as a city.
94701/zipcode
94015-1234/zipcode
Cardiologist/specialist
Herpetologist/specialist
Berkeley/city
San Francisco/city
Liberty/city
New York/city
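To connect this back to the custom QParser idea mentioned earlier, here is a hypothetical sketch of how the (value, category) pairs produced by Chunkers.mkPair() could be turned into Solr filter queries against the doctor index. The field names on the right-hand side of the map (zipcode, city, specialty) are assumptions for illustration only, not the fields of the existing product.

package com.mycompany.solr4extras.ner

object NerFilterQueries {

  // hypothetical mapping from NER categories to fields in the doctor index
  val categoryToField = Map(
    "zipcode"    -> "zipcode",
    "city"       -> "city",
    "specialist" -> "specialty")

  // turn each recognized chunk into a Solr fq clause, e.g. city:"San Francisco"
  def toFilterQueries(chunkers: Chunkers, text: String): List[String] =
    chunkers.chunk(text).toList.flatMap { chunk =>
      val (value, category) = chunkers.mkPair(text, chunk)
      // quote the value so multi-word entities stay phrases
      categoryToField.get(category).map(field => field + ":\"" + value + "\"")
    }
}

With all three chunkers combined in a single Chunkers instance, the first test sentence would yield filter queries along the lines of specialty:"Cardiologist", city:"Berkeley" and zipcode:"94701", which a QParser could attach to the member's doctor search request.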
While the SolrMapDictionary returns the expected results, I found that it calls the iterator() method once for the entire run, and thus needs to have the full dataset of 17k+ cities loaded in one shot - presumably because the ExactDictionaryChunker compiles the dictionary into its Aho-Corasick data structure up front. This kind of defeats the purpose of having an external store; I would have expected the phraseEntryIt() method to be called with specific phrases instead. I haven't had a chance to investigate this in depth yet, and will update the post when I find out more.
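A quick way to verify this behavior (a throwaway sketch, not part of the code above) is to wrap SolrMapDictionary with a counter and check how many times iterator() actually gets invoked over a full test run:

package com.mycompany.solr4extras.ner

import org.apache.solr.client.solrj.SolrServer

import com.aliasi.dict.DictionaryEntry

// hypothetical instrumentation: counts how often the chunker pulls the
// full dictionary iterator from the Solr-backed dictionary
class CountingSolrMapDictionary(
    solr: SolrServer, nrows: Int, category: String)
    extends SolrMapDictionary(solr, nrows, category) {

  var iteratorCalls = 0

  override def iterator(): java.util.Iterator[DictionaryEntry[String]] = {
    iteratorCalls += 1
    super.iterator()
  }
}

Building the ExactDictionaryChunker on a CountingSolrMapDictionary, chunking all five test sentences, and then inspecting iteratorCalls shows how often the full dictionary is pulled from Solr.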
References
- Building Search Applications: Lucene, LingPipe and GATE by Dr Manu Konchady has part of a chapter about building NERs using LingPipe Chunkers.
- This Stack Overflow Page contains suggestions on how to extend the MapDictionary object, which I followed to build my custom SolrMapDictionary object.
- The LingPipe NER Tutorial contains information about how to use the Chunker classes.