Friday, February 21, 2014

Fuzzy String Matching with SolrTextTagger


As you know, I've been experimenting with approaches for associating entities from a dictionary with text. David Smiley (Solr/Lucene committer best known for his work on Solr Spatial) pointed me to SolrTextTagger, part of his OpenSextant project, an application that recognizes place names in large blocks of plain text very efficiently by using Lucene's Finite State Transducer (FST) technology. Apart from the OpenSextant project, SolrTextTagger is also used in Apache Stanbol, which provides a set of reusable components for semantic content management. This post describes a variant of the fuzzy string matching algorithm I described in my previous post, but using SolrTextTagger and Solr instead of Lucene directly.

The idea is similar to how I used Redis and Aho-Corasick, but probably far more efficient in terms of speed and memory usage. In terms of my previous algorithm, it meant that I could eliminate building and querying phrase n-grams in client code for partial searches, and replace them with a single pass of the query phrase against the entire dictionary of 6.8 million terms. In addition, in my current algorithm (described below), I collapsed the cascading search across different variants of the phrase into a single search, further reducing the round trips to Lucene per query.

The use of Solr seemed a bit heavyweight at first, but one advantage of Solr is that the lookup resource can be shared across a network. Further, the SolrTextTagger service supports an fq parameter, so you could potentially serve many dictionaries (backed by a single FST) from the same Solr instance (although I didn't use that feature in this case).
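For example, if each dictionary record carried a field identifying which dictionary it belongs to, a tag request could be restricted to just that dictionary via fq. Here is a minimal sketch (the "source" field and its value are hypothetical, and the parameter setup mirrors the tag() method shown later in this post):

import org.apache.solr.common.params.CommonParams
import org.apache.solr.common.params.ModifiableSolrParams

// Hypothetical sketch: restrict tagging to a single dictionary using fq.
// Assumes each indexed record has a "source" field naming its dictionary.
val params = new ModifiableSolrParams()
params.add("overlaps", "LONGEST_DOMINANT_RIGHT")
params.add(CommonParams.FQ, "source:umls")
// pass these params to the /tag request exactly as in the tag() method below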

The current algorithm shares many ideas with the previous one. Incoming text is partitioned into (noun) phrases, and each phrase is sent through the algorithm. It first tries to match the exact phrase, and some standard transformations of it (punctuation and case normalized, alpha sorted, and stemmed), against pre-transformed String (non-tokenized) fields in the index. If a match is found, it is reported and the algorithm exits. If no match is found, the phrase is sent to the SolrTextTagger (tag) service, which matches the punctuation and case normalized phrase against the pre-built Lucene FST and returns phrase substrings that match entities in the dictionary. Results are deduped by the CUI field and the first (highest ranking) record per CUI is retained.
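In outline, the per-phrase flow looks like the sketch below. This is just a condensed view of the annotateConcepts() method shown in full later in the post, written as if it were a method of the UmlsTagger2 class (so it reuses the select() and tag() helpers and the Suggestion case class defined there):

// Condensed sketch of the per-phrase flow (see UmlsTagger2.annotateConcepts below).
def annotate(phrase: String): List[Suggestion] =
  select(phrase) match {                // exact match against descr, descr_norm,
    case Some(full) => List(full)       //   descr_sorted, descr_stemmed string fields
    case None => tag(phrase) match {    // otherwise send the phrase to the /tag service,
      case Some(partials) => partials   //   dedupe by CUI, keep the best per concept
      case None => Nil
    }
  }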

Scoring is manual, and this is deliberate: I tried using raw Lucene scores, but they were hard to interpret. Here is how the scoring works. In the case of the full phrase, a match implies a match against either the original or one of the three transformed versions. We calculate a measure of overlap based on Levenshtein distance and find the closest variant. Depending on which variant matched best, we scale the overlap by the corresponding level score (100, 75, 50 or 25). We do the same thing for partial matches, except that we calculate the overlap against the substring and its transforms, and further discount the overlap by the ratio of the number of words in the substring to the number of words in the original phrase.
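To make the arithmetic concrete, here is a small standalone sketch of the scoring computation. It mirrors the computeScore() method shown later in the post; the inputs and numbers are purely illustrative:

// Illustrative sketch of the manual scoring described above.
import org.apache.commons.lang3.StringUtils

object ScoringSketch extends App {
  // overlap = 1 - (Levenshtein distance / length of the longer string)
  def overlap(a: String, b: String): Float = {
    val maxlen = Math.max(a.length, b.length).toFloat
    1.0f - StringUtils.getLevenshteinDistance(a, b) / maxlen
  }

  val levels = List(100.0f, 75.0f, 50.0f, 25.0f)

  // Full-phrase case: take the best overlap across (original, normalized,
  // alpha-sorted, stemmed) and weight it by that variant's level score.
  val candidates = List("Heart Attack", "heart attack", "attack heart", "heart attack")
  val (best, idx) = candidates.map(overlap(_, "Heart Attack")).zipWithIndex.maxBy(_._1)
  println(levels(idx) * best)   // 100.0 for an exact match at the top level

  // Partial-match case: discount further by (words in substring / words in phrase).
  // Hypothetical example: a two-word substring matched at the normalized level
  // inside a four-word phrase.
  println((2.0f / 4.0f) * levels(1) * overlap("kidney stone", "kidney stone"))  // 37.5
}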

Here are the steps to replicate my setup.

Download and Build SolrTextTagger


The code for SolrTextTagger resides on GitHub, so to download and build the custom Solr JAR, execute the following sequence of commands. This will create a solr-text-tagger-1.3-SNAPSHOT.jar file in the target subdirectory of the SolrTextTagger project.

sujit@tsunami:~/Downloads$ git clone \
    https://github.com/OpenSextant/SolrTextTagger.git
sujit@tsunami:~/Downloads$ cd SolrTextTagger
sujit@tsunami:~/Downloads/SolrTextTagger$ mvn test
sujit@tsunami:~/Downloads/SolrTextTagger$ mvn package

Download and Customize Solr


Solr is available for download from the Apache Solr site. After downloading, you will need to unpack it locally, then update the schema.xml and solrconfig.xml in the conf subdirectory as shown below:

sujit@tsunami:~/Downloads$ tar xvzf solr-4.6.1.tgz
sujit@tsunami:~/Downloads$ cd solr-4.6.1/example/solr/collection1/conf

Update the schema.xml to replace the field definitions with our own. Our field list and the definition of the field type "tag" (copied from the SolrTextTagger documentation) are shown below. The "id" field is just an integer sequence (the unique key for Solr), "cui" and "descr" come from the CUI and STR fields of the UMLS database, and descr_norm, descr_sorted and descr_stemmed are the case/punctuation normalized, alpha sorted and stemmed versions of STR, respectively. The descr_tagged field holds the same content as descr_norm but is analyzed differently, as specified below.

<fields>
    <field name="id" type="string" indexed="true" stored="true" 
      required="true"/>
    <field name="cui" type="string" indexed="true" stored="true"/>
    <field name="descr" type="string" indexed="true" stored="true"/>
    <field name="descr_norm" type="string" indexed="true" stored="true"/>
    <field name="descr_sorted" type="string" indexed="true" stored="true"/>
    <field name="descr_stemmed" type="string" indexed="true" stored="true"/>
    <field name="descr_tagged" type="tag" indexed="true" stored="false" 
         omitTermFreqAndPositions="true" omitNorms="true"/>
    <copyField source="descr_norm" dest="descr_tagged"/>
    <dynamicField name="*" type="string" indexed="true" stored="true"/>
  </fields>
  <uniqueKey>id</uniqueKey>
  <types>
    <fieldType name="tag" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.EnglishPossessiveFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
    ...
  </types>
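To make the three transformed fields concrete, here is roughly what they contain for a sample description. This is a sketch using the normalizeCasePunct(), sortWords() and stemWords() helpers defined later in the post; note that stemWords() runs the text through Lucene's StandardAnalyzer, which (with its default English stopword set) lowercases and drops stopwords rather than doing true stemming:

// Sample values of the transformed fields for the description "Carcinoma of Lung".
val tagger = new UmlsTagger2("http://localhost:8983/solr")
tagger.normalizeCasePunct("Carcinoma of Lung")   // "carcinoma of lung"
tagger.sortWords("carcinoma of lung")            // "carcinoma lung of"
tagger.stemWords("carcinoma of lung")            // "carcinoma lung"  (stopword "of" dropped)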

We then add the requestHandler definition for SolrTextTagger's tag service to the solrconfig.xml file (also in conf). The definition is shown below:

<requestHandler name="/tag" 
      class="org.opensextant.solrtexttagger.TaggerRequestHandler">
    <str name="indexedField">descr_tagged</str>
    <str name="storedField">descr_norm</str>
    <bool name="partialMatches">false</bool>
    <int name="valueMaxLen">5000</int>
    <str name="cacheFile">taggerCache.dat</str>
  </requestHandler>

Finally, we create a lib directory and copy over the solr-text-tagger-1.3-SNAPSHOT.jar into it. Then go up to the example directory and start Solr. Solr is now listening on port 8983 on localhost.

sujit@tsunami:~/Downloads/solr-4.6.1/example/solr/collection1$ mkdir lib
sujit@tsunami:~/Downloads/solr-4.6.1/example/solr/collection1$ cp \
    ~/Downloads/SolrTextTagger/target/*jar lib/
sujit@tsunami:~/Downloads/solr-4.6.1/example/solr/collection1$ cd ../..
sujit@tsunami:~/Downloads/solr-4.6.1/example$ java -jar start.jar

Load Data and Build FST


We use the same cuistr1.csv file that we downloaded from our MySQL UMLS database. I guess I could have written custom code to load the data into the index, but I had started experimenting with SolrTextTagger using curl, so I just wrote some code that converted the (CUI,STR) CSV format into JSON, with additional fields created by our case/punctuation normalization, alpha sort and stemming. I used the same Scala code since I already had the transformations coded up from last week. Once I generated the JSON file (cuistr1.json), I uploaded it into Solr and built the FST using the following curl commands.
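For reference, each record in cuistr1.json comes out roughly like the one below (formatted for readability; the field values are produced by the normalization, alpha sort and stemming transformations described earlier):

{"id":0, "cui":"C0027051", "descr":"Heart Attack",
 "descr_norm":"heart attack", "descr_sorted":"attack heart",
 "descr_stemmed":"heart attack"}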

sujit@tsunami:~/Downloads$ curl \
    "http://localhost:8983/solr/update/json?commit=true" \
    --data-binary @cuistr1.json -H 'Content-type:application/json'
sujit@tsunami:~/Downloads$ curl "http://localhost:8983/solr/tag?build=true"

The data is now ready to use; the code for the algorithm is shown below. The buildIndexJson() method was used to create the cuistr1.json file. The annotateConcepts() method is used to match a phrase against the dictionary.

// Source: src/main/scala/com/mycompany/scalcium/umls/UmlsTagger2.scala
package com.mycompany.scalcium.umls

import java.io.File
import java.io.FileWriter
import java.io.PrintWriter
import java.io.StringReader
import java.util.regex.Pattern

import scala.collection.JavaConversions.asScalaIterator
import scala.collection.mutable.ArrayBuffer
import scala.io.Source

import org.apache.commons.lang3.StringUtils
import org.apache.lucene.analysis.Analyzer
import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
import org.apache.lucene.util.Version
import org.apache.solr.client.solrj.SolrRequest
import org.apache.solr.client.solrj.impl.HttpSolrServer
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest
import org.apache.solr.common.SolrDocumentList
import org.apache.solr.common.params.CommonParams
import org.apache.solr.common.params.ModifiableSolrParams
import org.apache.solr.common.util.ContentStreamBase

class UmlsTagger2(val solrServerUrl: String) {

  val punctPattern = Pattern.compile("\\p{Punct}")
  val spacePattern = Pattern.compile("\\s+")
  
  case class Suggestion(val score: Float, 
    val descr: String, val cui: String)
      
  val solrServer = new HttpSolrServer(solrServerUrl)
  
  def buildIndexJson(inputFile: File, 
      outputFile: File): Unit = {
    val writer = new PrintWriter(new FileWriter(outputFile))
    writer.println("[")
    var i = 0
    Source.fromFile(inputFile)
      .getLines()
      .foreach(line => {
        val Array(cui, str) = line
          .replace("\",\"", "\t")
          .replaceAll("\"", "")
          .replaceAll("\\\\", "")
          .split("\t")
        val strNorm = normalizeCasePunct(str)
        val strSorted = sortWords(strNorm)
        val strStemmed = stemWords(strNorm)
        val obuf = new StringBuilder()
        if (i > 0) obuf.append(",")
        obuf.append("{")
          .append("\"id\":").append(i).append(",")
          .append("\"cui\":\"").append(cui).append("\",")
          .append("\"descr\":\"").append(str).append("\",")
          .append("\"descr_norm\":\"").append(strNorm).append("\",")
          .append("\"descr_sorted\":\"").append(strSorted).append("\",")
          .append("\"descr_stemmed\":\"").append(strStemmed).append("\"")
          .append("}")
        writer.println(obuf.toString)
        i += 1
      })
    writer.println("]")
    writer.flush()
    writer.close()
  }

  def annotateConcepts(phrase: String): 
      List[Suggestion] = {
    // check for full match
    val suggestions = ArrayBuffer[Suggestion]()
    select(phrase) match {
      case Some(suggestion) => suggestions += suggestion
      case None => tag(phrase) match {
        case Some(subSuggs) => suggestions ++= subSuggs
        case None => {}
      }
    }
    suggestions.toList
  }

  ///////////// phrase munging methods //////////////
  
  def normalizeCasePunct(str: String): String = {
    val str_lps = punctPattern
      .matcher(str.toLowerCase())
      .replaceAll(" ")
    spacePattern.matcher(str_lps).replaceAll(" ")
  }

  def sortWords(str: String): String = {
    val words = str.split(" ")
    words.sortWith(_ < _).mkString(" ")
  }
  
  def stemWords(str: String): String = {
    val stemmedWords = ArrayBuffer[String]()
    val tokenStream = getAnalyzer().tokenStream(
      "str_stemmed", new StringReader(str))
    val ctattr = tokenStream.addAttribute(
      classOf[CharTermAttribute])    
    tokenStream.reset()
    while (tokenStream.incrementToken()) {
      stemmedWords += ctattr.toString()
    }
    stemmedWords.mkString(" ")
  }
  
  def getAnalyzer(): Analyzer = {
    new StandardAnalyzer(Version.LUCENE_46)
  }
  
  ///////////////// solr search methods //////////////
  
  def select(phrase: String): Option[Suggestion] = {
    val phraseNorm = normalizeCasePunct(phrase)
    val phraseSorted = sortWords(phraseNorm)
    val phraseStemmed = stemWords(phraseNorm)
    // construct query
    val query = """descr:"%s" descr_norm:"%s" descr_sorted:"%s" descr_stemmed:"%s""""
      .format(phrase, phraseNorm, phraseSorted, phraseStemmed)
    val params = new ModifiableSolrParams()
    params.add(CommonParams.Q, query)
    params.add(CommonParams.ROWS, String.valueOf(1))
    params.add(CommonParams.FL, "*,score")
    val rsp = solrServer.query(params)
    val results = rsp.getResults()
    if (results.getNumFound() > 0L) {
      val sdoc = results.get(0)
      val descr = sdoc.getFieldValue("descr").asInstanceOf[String]
      val cui = sdoc.getFieldValue("cui").asInstanceOf[String]
      val score = computeScore(descr, 
        List(phrase, phraseNorm, phraseSorted, phraseStemmed))
      Some(Suggestion(score, descr, cui))
    } else None
  }

  def tag(phrase: String): Option[List[Suggestion]] = {
    val phraseNorm = normalizeCasePunct(phrase)
    val params = new ModifiableSolrParams()
    params.add("overlaps", "LONGEST_DOMINANT_RIGHT")
    val req = new ContentStreamUpdateRequest("")
    req.addContentStream(new ContentStreamBase.StringStream(phrase))
    req.setMethod(SolrRequest.METHOD.POST)
    req.setPath("/tag")
    req.setParams(params)
    val rsp = req.process(solrServer)
    val results = rsp.getResponse()
      .get("matchingDocs")
      .asInstanceOf[SolrDocumentList]
    val nwordsInPhrase = phraseNorm.split(" ").length.toFloat
    val suggestions = results.iterator().map(sdoc => {
        val descr = sdoc.getFieldValue("descr").asInstanceOf[String]
        val cui = sdoc.getFieldValue("cui").asInstanceOf[String]
        val nWordsInDescr = descr.split(" ").length.toFloat
        val descrNorm = normalizeCasePunct(descr)
        val descrSorted = sortWords(descrNorm)
        val descrStemmed = stemWords(descrNorm)
        val nwords = descrNorm.split(" ").length.toFloat
        val score = (nwords / nwordsInPhrase) * 
          computeScore(descr, 
          List(descr, descrNorm, descrSorted, descrStemmed))
        Suggestion(score, descr, cui)      
      })
      .toList
      .groupBy(_.cui) // dedup by cui
      .map(_._2.toList.head)
      .toList
      .sortWith((a,b) => a.score > b.score) // sort by score
    Some(suggestions)
  }

  def computeScore(s: String, 
      candidates: List[String]): Float = {
    val levels = List(100.0F, 75.0F, 50.0F, 25.0F)
    val candLevels = candidates.zip(levels).toMap
    val topscore = candidates.map(candidate => {
        val maxlen = Math.max(candidate.length(), s.length()).toFloat
        val dist = StringUtils.getLevenshteinDistance(candidate, s).toFloat
        (candidate, 1.0F - (dist / maxlen))
      })    
      .sortWith((a, b) => a._2 > b._2)
      .head
    val level = candLevels.getOrElse(topscore._1, 0.0F)
    level * topscore._2
  }

  //////////////// misc methods ////////////////
  
  def formatSuggestion(sugg: Suggestion): String = {
    "[%6.2f%%] (%s) %s"
      .format(sugg.score, sugg.cui, sugg.descr)
  }
}

To run the code, we use the following JUnit test (commenting and uncommenting tests as needed).

// Source: src/test/scala/com/mycompany/scalcium/umls/UmlsTagger2Test.scala 
package com.mycompany.scalcium.umls

import java.io.File

import org.junit.Assert
import org.junit.Test

class UmlsTagger2Test {

  @Test
  def testBuildIndex(): Unit = {
    val tagger = new UmlsTagger2("")
    tagger.buildIndexJson(
      new File("/home/sujit/Projects/med_data/cuistr1.csv"), 
      new File("/home/sujit/Projects/med_data/cuistr1.json"))
  }
  
  @Test
  def testGetFull(): Unit = {
    val tagger = new UmlsTagger2("http://localhost:8983/solr")
    val phrases = List("Lung Cancer", "Heart Attack", "Diabetes")
    phrases.foreach(phrase => {
      Console.println()
      Console.println("Query: %s".format(phrase))
      val suggestions = tagger.select(phrase)
      suggestions match {
        case Some(suggestion) => {
          Console.println(tagger.formatSuggestion(suggestion))
          Assert.assertNotNull(suggestion.cui)
        }
        case None =>
          Assert.fail("No results for [%s]".format(phrase))
      }
    })
  }
  
  @Test
  def testGetPartial(): Unit = {
    val tagger = new UmlsTagger2("http://localhost:8983/solr")
    val phrases = List(
        "Heart Attack and diabetes",
        "carcinoma (small-cell) of lung",
        "asthma side effects")
    phrases.foreach(phrase => {
      Console.println()
      Console.println("Query: %s".format(phrase))
      val suggestions = tagger.tag(phrase)
      suggestions match {
        case Some(psuggs) => {
          psuggs.foreach(psugg => {
            Console.println(tagger.formatSuggestion(psugg))    
          })
          Assert.assertNotNull(psuggs)
        }
        case None =>
          Assert.fail("No results for [%s]".format(phrase))
      }
    })
  }
  
  @Test
  def testAnnotateConcepts(): Unit = {
    val tagger = new UmlsTagger2("http://localhost:8983/solr")
    val phrases = List("Lung Cancer", 
        "Heart Attack", 
        "Diabetes",
        "Heart Attack and diabetes",
        "carcinoma (small-cell) of lung",
        "asthma side effects"
    )
    phrases.foreach(phrase => {
      Console.println()
      Console.println("Query: %s".format(phrase))
      val suggestions = tagger.annotateConcepts(phrase)
      suggestions.foreach(suggestion => {
        Console.println(tagger.formatSuggestion(suggestion))
      })
    })
  }
}

The results of testAnnotateConcepts() are shown below. I've used the same query terms as with the previous algorithm, and the results are consistent with what I saw there.

Query: Lung Cancer
[100.00%] (C0242379) Lung Cancer

Query: Heart Attack
[100.00%] (C0027051) Heart Attack

Query: Diabetes
[100.00%] (C0011847) Diabetes

Query: Heart Attack and diabetes
[ 25.00%] (C0002838) AND
[ 25.00%] (C1515981) And
[ 25.00%] (C0011847) Diabetes
[ 25.00%] (C1706368) And
[ 12.50%] (C1550557) and
[ 12.50%] (C0027051) heart attack
[  6.25%] (C0011849) diabetes
[  6.25%] (C0011860) diabetes

Query: carcinoma (small-cell) of lung
[ 26.79%] (C0149925) small cell carcinoma of lung

Query: asthma side effects
[ 33.33%] (C2984299) Asthma
[ 16.67%] (C0001688) side effects
[  8.33%] (C0004096) asthma

That's all I have for today. Hopefully next week I will talk about something else :-).

25 comments (moderated to prevent spam):

David Smiley said...

Sujit,

It’s wonderful to see one of my favorite bloggers try out my SolrTextTagger! I hope I didn’t seem nagging in the past but I knew you would find it really useful.

I’d like to point out some things to anyone reading this post — things you may very well already realize, Sujit, but others may not:

* Solr offers many advantages: network/remote access (optional; OpenSextant and Stanbol embed Solr instead), ease of configuring text-analysis and any other aspect of the index, ease of loading data into it (CSV? JSON? XML? all good), experimental search / browse-ability of the corpus from HTTP, ease of configuring the document cache (particularly relevant to the text tagger), search (text tagging request) response time statistics, and more. It’s all the reasons people talk to Solr or ElasticSearch these days versus writing lots of inflexible code to talk to Lucene directly. But all this said, anyone bent on using it without Solr can totally do it; there’s just one class (the request handler) that you’d need to port to work the way you want it to.

* The 1.x version of the text tagger indeed uses Lucene FSTs directly. The 2.x version (MemPF branch at time of writing) uses them implicitly by telling Lucene to use the “Memory” postings format. This has a lot of advantages: the RAM footprint is about half, there’s less code to maintain, and you don’t need to explicitly “build” a data structure as you do in 1.x — Lucene amortizes that as you index (although you’ll likely want to optimize at the end), and finally there’s no custom file to write — it’s managed by the Lucene index itself. The 2.x version has yet to include a big feature in 1.x which allows a wider variety of text analysis configuration — see issue #20 to track progress. Credit to Rupert Westenthaler at Stanbol for awesome work on the “phrase builder” in 1.x.

* It’s unclear why, but in this blog I see you (Sujit) are “manually” doing the stemming and other text analysis, then putting that in Solr, such that by the time Solr sees it, all the text analysis has been done already. That is… odd to me. Part of the value proposition of the text tagger is that you don’t have to do that — it’s the text tagger’s job. Simply configure the text analysis as desired in the schema for the field you’ve marked as the “indexedField” in the TaggerRequestHandler’s configuration and then submit your original text. Perhaps you’re not doing this because you want to see the analyzed text as part of your score computation. Hmmm; interesting. I could imagine a new feature in which the tagger emits the analyzed text for the match.

Sujit Pal said...

Thank you for the kind words David - it's great knowing that someone of your stature considers me one of his favorite bloggers :-). And thank you for suggesting SolrTextTagger - the single-pass substring matching it offers makes it a very attractive candidate for replacing our current proprietary memory-based NER (although we do have a lot of legacy issues to consider).

Regarding the NRT features you described for SolrTextTagger v2, our current workflow wouldn't benefit from this since we do full (i.e., not incremental) releases of our taxonomy. However, this would come in handy if we ever moved to a GraphDB-backed taxonomy where we could publish changes to Solr as they are saved into the DB.

To answer your question about why I am doing manual stemming: my use case is slightly different from OpenSextant's - instead of matching substrings in large chunks of text, I need to match noun phrases, either exactly (or almost exactly) against an entry in my taxonomy, or failing that, match substrings of the phrase (almost exactly, but stricter than for the full phrase) to entries in the taxonomy. In the case of the full phrase match, I match it (the full phrase, not the tokens) exactly, then with case and punctuation normalized, then with the words in the normalized phrase alpha-sorted, then against the stemmed version of the normalized words. Had I let Solr handle the stemming it would also have tokenized the strings, so to get the kind of matching I wanted I would have had to enforce exact phrase matches and deal with false positives returned by Solr where there are none.

In the case of the substring matches (where I use your SolrTextTagger's tag service), we tend to get back lots of false positives from alpha-sorted and stemmed matches, since there is less context, so we just use case/punctuation normalized entries as input to the FST and pass in a case/punctuation normalized version of the phrase during search. I suppose I could have used a Lucene/Solr analyzer chain to do this and used the original phrase instead, but I already had the code to do the normalization for the full phrase match case, so it just seemed simpler to use that.

Evan Smith said...

Hello,

I'm looking for the code that converts from CSV to JSON. I have now compiled the Scala code; I just need to have the right input.

I have the csv file, but want to be sure I have the same json format you are expecting.

Thanks,
Evan

referring to
"just wrote some code that converted the (CUI,STR) CSV format into JSON, with additional fields created by our case/punctuation normalization, alpha sort and stemming"

Sujit Pal said...

Looks like I don't have the code anymore, but I think it was something fairly simple and throwaway that converts the CSV file (you have this already) to an equivalent JSON format (look at books.csv and books.json in exampledocs in the Solr distribution - that should give you an idea of the transformation required).

JohnT said...

Will do

Anonymous said...

Hello Sujit,
I read your two posts concerning text tagging and I would like to get your advice on my use case.

I need to extract named entities from text, based on a proprietary dictionary (10 classes of entities, each containing 1-10M entities of 1-5 words each) and on regular expressions (15 of them).
I'm considering using Stanbol, SolrTextTagger, and LingPipe.
I am looking for the simplest service to update, because it is basically for non-programmer use after the initial installation, and the use case requires only extraction to an external field, i.e. no indexing.

What is your recommendation?

Sujit Pal said...

I would recommend using LingPipe's RegExChunker for the regular expressions as the first filter, followed by SolrTextTagger. If simplicity of update is a huge focus, using LingPipe's DictionaryChunker instead of SolrTextTagger may make more sense (because update simply means adding a line to your dictionary flat file and restarting) but you will need to accommodate 150M entries in memory. However, even with SolrTextTagger, updates are not a huge deal as long as you have processes laid out.

Unknown said...

Hi Sujit,

Great work! I am trying to add more features on top of your work, including integration with cTAKES, running on a Spark cluster, and so on.

Do you mind if I push it to GitHub as a Maven project with an Apache License version 2? I think it will be great if others are interested in this project.

Sujit Pal said...

Thanks Zhiwei. And yes, that would be great, please let me know the URL of your GitHub project and I can post it here. I am doing something at work that involves calling SolrTextTagger from Spark; it would be interesting to see your approach and compare notes. Also, you might want to post a message on the SolrTextTagger GitHub site - I think David Smiley (the main author of SolrTextTagger) would be interested in knowing about it.

marc u said...

This post started long ago, but just to address the applications of SolrTextTagger in light of your use case, Sujit, I thought I should mention the XTax tagger in Xponents (https://github.com/OpenSextant/Xponents/tree/master/Extraction). Your discussion of tagging specialized lexicons, using exact matches, and pre- and post-filtering for false positives is all in scope for XTax.

I'll be refactoring some of this to be better documented and make use of Solr 5.x. This is very much a live and active community. We're constantly trying to improve and simplify.

Spark + SolrTextTagger applications is where I am heading next/now.

cheers,
Marc

Sujit Pal said...

Thanks for the pointer to XTax Marc. I built an interface from Spark to SolrTextTagger for use within my company and they allowed me to open source it here. I am doing a presentation about it at Spark Summit EU 2015 next week. I will send David the links to the slides and presentation as they become available (using the Issues feature on SolrTextTagger).

Unknown said...

Hi, Sujit,

Thanks for the great post! I am going to use the idea of "Fuzzy String Matching with SolrTextTagger" in a paper, but I can't find any formal citation for it (a paper usually needs a formal citation to a publication, not a blog address). Could you send me a citation for this blog post if you have one, or your authorization if you agree to my using it in a paper? email: z.w.chan.jason at gmail.com

Thank you very much.

Sujit Pal said...

Hi Zhiwei, I uploaded the slides of my presentation about this at Spark Summit EU 2015 at data.mendeley.com.
https://data.mendeley.com/datasets/4xdkh7xdtt/1

You can use the citation from that.
Pal, Sujit (2015), “Dictionary Based Annotation at Scale with Spark SolrTextTagger and OpenNLP”, Mendeley Data, v1
http://dx.doi.org/10.17632/4xdkh7xdtt.1

Unknown said...

Hi Sujit, your project is cool. I also use Spark (MLlib) to discover new terms from lots of text. In fact, the result of the fuzzy string matching is just one of the features I use for clustering and classification. After we finish our project, we will open the code on GitHub too.

Thank you very much.

Sujit Pal said...

Hi Zhiwei, that is very good news. Look forward to seeing your approach.

Unknown said...

Hi,

There is a minor bug in the code: the "desc_stemmed" in line 61 is not the same as the query field "descr_stemmed" in line 123. This causes the stemmed form to fail to match anything.

61: .append("\"desc_stemmed\":\"").append(strStemmed).append("\"")
123: val query = """descr:"%s" descr_norm:"%s" descr_sorted:"%s" descr_stemmed:"%s""""

Sujit Pal said...

Thanks Zhiwei, good catch. Line 61 should be descr_stemmed. I have updated the code in the post.

Sanjeev said...

This is a great article, thanks for sharing. I am trying to evaluate whether SolrTextTagger would be useful in my case. I would have large dictionaries as input; these dictionaries could contain city names, people names, numbers, etc., and they will hold millions of records. I would also have documents to scan against these dictionaries. My documents may have data that is not easy to tokenize. For example, a document might contain theworldisaniceplace (read as: the world is a nice place). My dictionary could have an entry for the word "world", so I need to report my document as a match here. I am wondering if this is a proper use case for SolrTextTagger.

Sujit Pal said...

Thanks Sanjeev, and no, I don't think SolrTextTagger would tokenize these compound strings. The lowest unit in SolrTextTagger is words, so it will match words in the incoming text stream that are in its dictionary. Since it looks like your setup includes Solr or Lucene, you may want to take a look at compound word token filters that it provides.

Sanjeev said...

Hi Sujit, thanks for clarifying. My current setup does not have Solr/Lucene; currently I use an Aho-Corasick dictionary match, which works just great except for large dictionaries, where it consumes an unacceptable amount of heap, so I am exploring alternate options.

marc u said...

Sanjeev,
Next to the SolrTextTagger project are some projects that are very similar to yours, and they use SolrTextTagger quite well.
Xponents employs a dictionary of 17 million place names and tags 10 KB/sec with 2.5 GB of RAM.

More details to share.....

Sanjeev said...

Marc- Does that work with documents which cannot be tokenized?

David Smiley said...

This use case seems a bit challenging -- a fun brain teaser for me. But nonetheless, the SolrTextTagger can probably do it thanks to the versatility of the analyzers in Lucene. The query-time analysis chain (analyzing the incoming documents/queries to be tagged) could end with an NGramTokenFilter of size 1, so that at each letter we start looking for a new match. The index-time chain would need a PatternReplaceFilter after the concatenating one that inserts a space char after each non-space (to produce, e.g., "n e w y o r k c i t y"). There are some inefficiencies here that could be improved (for example, the space after each char will make for a larger dictionary), but in the scheme of things not a big deal.

marc u said...

Sanjeev --
Since Xponents uses SolrTextTagger for this capability and scale of tagging, it is limited to what SolrTextTagger can do. As of now, I have not tested such cases where you have need for novel sub-word tokenization. I was mostly responding to your need to use a large dictionary of terms to tag, but not exceed practical JVM/OS memory limits.

Anyhow, to respond to David's solution -- you might want to employ multiple taggers in parallel, given the potential performance implication/inefficiency of tokenizing to the char.

Things like the Xponents TaxonMatcher (basically a keyword tagger using SolrTextTagger, "STT") offer a blueprint for making an application out of STT. But for your purposes I can't say how well it would work, since there is a good amount of assumption about reasonably clean, natural language text as input. You have fundamentally unique data (text).

Links - https://github.com/OpenSextant/Xponents/tree/master/Extraction/src/main/java/org/opensextant/extractors/xtax ... various READMEs around there. But specifically, you'd be adapting the schema.xml as David suggests:
https://github.com/OpenSextant/Xponents/blob/master/solr/solr4/taxcat/conf/schema.xml#L124 - the phrase and phrase_tag fields here store the values of your dictionary; phrase_tag is of type "name_tag", which is to be configured according to your tokenization and filtering rules so that you produce the unit of data (token) that is fed to the FST.


Other uses of STT: https://stanbol.apache.org/docs/trunk/components/enhancer/engines/lucenefstlinking

But indeed consider starting with:
https://github.com/OpenSextant/SolrTextTagger



Sanjeev,
I think the points to consider are --
Does this solve your memory limitation?
Is the complexity of this technique in line with what you expect to solve your problem?
What other ways are there to solve this other than Aho-Corasick?


great discussion.

Marc

Sanjeev said...

David and Marc-
At this moment I have limited flexibility to experiment with this. In the meantime, to keep moving, I have found a low-footprint in-memory trie implementation; however, with a large number of dictionaries, I think I will hit the threshold there as well. So, long term, I will need a SolrTextTagger-like tool to do the lookup on a separate JVM. In that direction, I will look at the options suggested by both of you.

Thanks for detailed responses.