As you know, I've been experimenting with approaches for matching entities from a dictionary against text. David Smiley (Solr/Lucene committer best known for his work on Solr spatial search) pointed me to SolrTextTagger, a component of his OpenSextant project that very efficiently recognizes place names in large blocks of plain text using Lucene's Finite State Transducer (FST) support. Apart from OpenSextant, SolrTextTagger is also used in Apache Stanbol, which provides a set of reusable components for semantic content management. This post describes a variant of the fuzzy string matching algorithm I described in my previous post, but using SolrTextTagger and Solr instead of Lucene.
The idea is similar to how I used Redis and Aho-Corasick, but probably far, far more efficient in terms of speed and memory usage. In terms of my previous algorithm, it meant I could eliminate building and querying with phrase n-grams in client code for partial searches, and replace that with a single pass of the query phrase against the entire dictionary of 6.8 million terms. In addition, in my current algorithm (described below), I replaced the cascading search across different variants of the phrase with a single search, further reducing the round trips to Lucene per query.
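For example, for the phrase "Lung Cancer", the single combined query built by the select() method (shown further down) looks like this:

descr:"Lung Cancer" descr_norm:"lung cancer" descr_sorted:"cancer lung" descr_stemmed:"lung cancer"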
The use of Solr seemed a bit heavyweight at first, but one advantage of Solr is that the lookup resource can be shared across the network. Further, the SolrTextTagger service supports an fq parameter, so you could potentially have many dictionaries (backed by a single FST) served from the same Solr instance (although I didn't use that here).
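As a purely hypothetical example, if each indexed record also carried a field identifying its source dictionary (say a field called dict, which the schema below does not define), a tag request could be restricted to a single dictionary with something like:

curl "http://localhost:8983/solr/tag?fq=dict:umls" \
  --data-binary @input.txt -H 'Content-type:text/plain'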
The current algorithm shares many ideas with the previous one. Incoming text is partitioned into (noun) phrases, and each phrase is sent into the algorithm. It first tries to match the exact phrase, and some standard transformations of it (punctuation and case normalized, alpha sorted, and stemmed), against pre-transformed String (non-tokenized) fields in the index. If a match is found, it is reported and the algorithm exits. If no match is found, the phrase is sent to the SolrTextTagger (tag) service, which matches the punctuation and case normalized phrase against the pre-built Lucene FST and returns phrase substrings that match entities in the dictionary. Results are deduped by the CUI field and the first (highest ranking) record is retained.
Scoring is done manually, and this is deliberate: I tried using raw Lucene scores, but found them hard to interpret. Here is how the scoring works. In the case of the full phrase, a match implies a match against either the original or one of the three transformed versions. We calculate a measure of overlap based on Levenshtein distance and find the closest one. Depending on which variant matched best, we discount the overlap by that variant's level score (100, 75, 50, 25). We do the same thing for the partial matches, except that we calculate the overlap against the substring and its transforms, and discount the overlap further by the ratio of the number of words in the substring to the number of words in the original phrase.
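To make the scoring concrete, here is a small standalone sketch (using the same commons-lang3 Levenshtein call as the code further down) of how the full-phrase match for "Lung Cancer" works out to 100%:

import org.apache.commons.lang3.StringUtils

val descr = "Lung Cancer"                       // matched dictionary entry
val variants = List("Lung Cancer", "lung cancer", "cancer lung", "lung cancer")
val levels = List(100.0f, 75.0f, 50.0f, 25.0f)  // original, normalized, sorted, stemmed

// overlap = 1 - (edit distance / length of the longer string), per variant
val overlaps = variants.map { v =>
  val maxlen = math.max(v.length, descr.length).toFloat
  (v, 1.0f - StringUtils.getLevenshteinDistance(v, descr) / maxlen)
}
val (best, overlap) = overlaps.maxBy(_._2)                    // ("Lung Cancer", 1.0)
val score = variants.zip(levels).toMap.apply(best) * overlap  // 100.0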
Here are the steps to replicate my setup.
Download and Build SolrTextTagger
The code for SolrTextTagger resides on GitHub, so to download and build the custom Solr JAR, execute the following sequence of commands. This will create a solr-text-tagger-1.3-SNAPSHOT.jar file in the target subdirectory of the SolrTextTagger project.
sujit@tsunami:~/Downloads$ git clone \
https://github.com/OpenSextant/SolrTextTagger.git
sujit@tsunami:~/Downloads$ cd SolrTextTagger
sujit@tsunami:~/Downloads/SolrTextTagger$ mvn test
sujit@tsunami:~/Downloads/SolrTextTagger$ mvn package
Download and Customize Solr
Solr is available for download from the Apache Solr website. After downloading it you will need to expand it locally, then update the schema.xml and solrconfig.xml in the conf subdirectory as shown below:
sujit@tsunami:~/Downloads$ tar xvzf solr-4.6.1.tgz
sujit@tsunami:~/Downloads$ cd solr-4.6.1/example/solr/collection1/conf
Update the schema.xml to replace the field definitions with our own. Our fields list and the definition of the field type "tag" (copied from the SolrTextTagger documentation) are shown below. The "id" field is just an integer sequence (the unique key for Solr), the "cui" and "descr" fields come from the CUI and STR fields of the UMLS database, and descr_norm, descr_sorted and descr_stemmed are the case/punctuation normalized, alpha sorted and stemmed versions of STR. The descr_tagged field is identical to descr_norm but is analyzed differently, as specified below.
<fields>
<field name="id" type="string" indexed="true" stored="true"
required="true"/>
<field name="cui" type="string" indexed="true" stored="true"/>
<field name="descr" type="string" indexed="true" stored="true"/>
<field name="descr_norm" type="string" indexed="true" stored="true"/>
<field name="descr_sorted" type="string" indexed="true" stored="true"/>
<field name="descr_stemmed" type="string" indexed="true" stored="true"/>
<field name="descr_tagged" type="tag" indexed="true" stored="false"
omitTermFreqAndPositions="true" omitNorms="true"/>
<copyField source="descr_norm" dest="descr_tagged"/>
<dynamicField name="*" type="string" indexed="true" stored="true"/>
</fields>
<uniqueKey>id</uniqueKey>
<types>
<fieldType name="tag" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
...
</types>
We then add the requestHandler definition for SolrTextTagger's tag service to the solrconfig.xml file (also in conf). The definition is shown below:
<requestHandler name="/tag"
class="org.opensextant.solrtexttagger.TaggerRequestHandler">
<str name="indexedField">descr_tagged</str>
<str name="storedField">descr_norm</str>
<bool name="partialMatches">false</bool>
<int name="valueMaxLen">5000</int>
<str name="cacheFile">taggerCache.dat</str>
</requestHandler>
Finally, we create a lib directory and copy over the solr-text-tagger-1.3-SNAPSHOT.jar into it. Then go up to the example directory and start Solr. Solr is now listening on port 8983 on localhost.
sujit@tsunami:~/Downloads/solr-4.6.1/example/solr/collection1$ mkdir lib
sujit@tsunami:~/Downloads/solr-4.6.1/example/solr/collection1$ cp \
~/Downloads/SolrTextTagger/target/*jar lib/
sujit@tsunami:~/Downloads/solr-4.6.1/example/solr/collection1$ cd ../..
sujit@tsunami:~/Downloads/solr-4.6.1/example$ java -jar start.jar
Load Data and Build FST
We use the same cuistr1.csv file that we extracted earlier from our MySQL UMLS database. I could have written custom code to load the data directly into the index, but I had started experimenting with SolrTextTagger using curl, so I just wrote some code that converts the (CUI,STR) CSV format into JSON, with additional fields created by our case/punctuation normalization, alpha sort and stemming. I used the same Scala code since I already had the transformations coded up from last week. Once I generated the JSON file (cuistr1.json), I uploaded it into Solr and built the FST using the following curl commands.
sujit@tsunami:~/Downloads$ curl \
"http://localhost:8983/solr/update/json?commit=true" \
--data-binary @cuistr1.json -H 'Content-type:application/json'
sujit@tsunami:~/Downloads$ curl "http://localhost:8983/solr/tag?build=true"
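For reference, each record in the generated cuistr1.json file (the file is one big JSON array of such records) looks roughly like the following; the field values shown here are illustrative:

{"id":0, "cui":"C0242379", "descr":"Lung Cancer", "descr_norm":"lung cancer", "descr_sorted":"cancer lung", "descr_stemmed":"lung cancer"}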
The data is now ready to use; the code for the algorithm is shown below. The buildIndexJson() method was used to create the cuistr1.json file, and the annotateConcepts() method is used to match a phrase against the dictionary.
// Source: src/main/scala/com/mycompany/scalcium/umls/UmlsTagger2.scala
package com.mycompany.scalcium.umls
import java.io.File
import java.io.FileWriter
import java.io.PrintWriter
import java.io.StringReader
import java.util.regex.Pattern
import scala.collection.JavaConversions.asScalaIterator
import scala.collection.mutable.ArrayBuffer
import scala.io.Source
import org.apache.commons.lang3.StringUtils
import org.apache.lucene.analysis.Analyzer
import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
import org.apache.lucene.util.Version
import org.apache.solr.client.solrj.SolrRequest
import org.apache.solr.client.solrj.impl.HttpSolrServer
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest
import org.apache.solr.common.SolrDocumentList
import org.apache.solr.common.params.CommonParams
import org.apache.solr.common.params.ModifiableSolrParams
import org.apache.solr.common.util.ContentStreamBase
class UmlsTagger2(val solrServerUrl: String) {
val punctPattern = Pattern.compile("\\p{Punct}")
val spacePattern = Pattern.compile("\\s+")
case class Suggestion(val score: Float,
val descr: String, val cui: String)
val solrServer = new HttpSolrServer(solrServerUrl)
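// converts the (CUI,STR) CSV input into a JSON array of documents suitable
// for posting to Solr's /update/json handler (see the curl commands above)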
def buildIndexJson(inputFile: File,
outputFile: File): Unit = {
val writer = new PrintWriter(new FileWriter(outputFile))
writer.println("[")
var i = 0
Source.fromFile(inputFile)
.getLines()
.foreach(line => {
val Array(cui, str) = line
.replace("\",\"", "\t")
.replaceAll("\"", "")
.replaceAll("\\\\", "")
.split("\t")
val strNorm = normalizeCasePunct(str)
val strSorted = sortWords(strNorm)
val strStemmed = stemWords(strNorm)
val obuf = new StringBuilder()
if (i > 0) obuf.append(",")
obuf.append("{")
.append("\"id\":").append(i).append(",")
.append("\"cui\":\"").append(cui).append("\",")
.append("\"descr\":\"").append(str).append("\",")
.append("\"descr_norm\":\"").append(strNorm).append("\",")
.append("\"descr_sorted\":\"").append(strSorted).append("\",")
.append("\"descr_stemmed\":\"").append(strStemmed).append("\"")
.append("}")
writer.println(obuf.toString)
i += 1
})
writer.println("]")
writer.flush()
writer.close()
}
def annotateConcepts(phrase: String):
List[Suggestion] = {
// check for full match
val suggestions = ArrayBuffer[Suggestion]()
select(phrase) match {
case Some(suggestion) => suggestions += suggestion
case None => tag(phrase) match {
case Some(subSuggs) => suggestions ++= subSuggs
case None => {}
}
}
suggestions.toList
}
///////////// phrase munging methods //////////////
def normalizeCasePunct(str: String): String = {
val str_lps = punctPattern
.matcher(str.toLowerCase())
.replaceAll(" ")
spacePattern.matcher(str_lps).replaceAll(" ")
}
def sortWords(str: String): String = {
val words = str.split(" ")
words.sortWith(_ < _).mkString(" ")
}
def stemWords(str: String): String = {
val stemmedWords = ArrayBuffer[String]()
val tokenStream = getAnalyzer().tokenStream(
"str_stemmed", new StringReader(str))
val ctattr = tokenStream.addAttribute(
classOf[CharTermAttribute])
tokenStream.reset()
while (tokenStream.incrementToken()) {
stemmedWords += ctattr.toString()
}
stemmedWords.mkString(" ")
}
def getAnalyzer(): Analyzer = {
new StandardAnalyzer(Version.LUCENE_46)
}
///////////////// solr search methods //////////////
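// full-phrase lookup: ORs the phrase and its three transformed variants
// against the pre-transformed string fields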
def select(phrase: String): Option[Suggestion] = {
val phraseNorm = normalizeCasePunct(phrase)
val phraseSorted = sortWords(phraseNorm)
val phraseStemmed = stemWords(phraseNorm)
// construct query
val query = """descr:"%s" descr_norm:"%s" descr_sorted:"%s" descr_stemmed:"%s""""
.format(phrase, phraseNorm, phraseSorted, phraseStemmed)
val params = new ModifiableSolrParams()
params.add(CommonParams.Q, query)
params.add(CommonParams.ROWS, String.valueOf(1))
params.add(CommonParams.FL, "*,score")
val rsp = solrServer.query(params)
val results = rsp.getResults()
if (results.getNumFound() > 0L) {
val sdoc = results.get(0)
val descr = sdoc.getFieldValue("descr").asInstanceOf[String]
val cui = sdoc.getFieldValue("cui").asInstanceOf[String]
val score = computeScore(descr,
List(phrase, phraseNorm, phraseSorted, phraseStemmed))
Some(Suggestion(score, descr, cui))
} else None
}
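// partial-phrase lookup: posts the raw phrase to SolrTextTagger's /tag
// handler, then scores, dedupes (by CUI) and sorts the matched substrings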
def tag(phrase: String): Option[List[Suggestion]] = {
val phraseNorm = normalizeCasePunct(phrase)
val params = new ModifiableSolrParams()
params.add("overlaps", "LONGEST_DOMINANT_RIGHT")
val req = new ContentStreamUpdateRequest("")
req.addContentStream(new ContentStreamBase.StringStream(phrase))
req.setMethod(SolrRequest.METHOD.POST)
req.setPath("/tag")
req.setParams(params)
val rsp = req.process(solrServer)
val results = rsp.getResponse()
.get("matchingDocs")
.asInstanceOf[SolrDocumentList]
val nwordsInPhrase = phraseNorm.split(" ").length.toFloat
val suggestions = results.iterator().map(sdoc => {
val descr = sdoc.getFieldValue("descr").asInstanceOf[String]
val cui = sdoc.getFieldValue("cui").asInstanceOf[String]
val nWordsInDescr = descr.split(" ").length.toFloat
val descrNorm = normalizeCasePunct(descr)
val descrSorted = sortWords(descrNorm)
val descrStemmed = stemWords(descrNorm)
val nwords = descrNorm.split(" ").length.toFloat
val score = (nwords / nwordsInPhrase) *
computeScore(descr,
List(descr, descrNorm, descrSorted, descrStemmed))
Suggestion(score, descr, cui)
})
.toList
.groupBy(_.cui) // dedup by cui
.map(_._2.toList.head)
.toList
.sortWith((a,b) => a.score > b.score) // sort by score
Some(suggestions)
}
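// overlap = 1 - (Levenshtein distance / max length) against the closest
// variant, discounted by that variant's level (100/75/50/25)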
def computeScore(s: String,
candidates: List[String]): Float = {
val levels = List(100.0F, 75.0F, 50.0F, 25.0F)
val candLevels = candidates.zip(levels).toMap
val topscore = candidates.map(candidate => {
val maxlen = Math.max(candidate.length(), s.length()).toFloat
val dist = StringUtils.getLevenshteinDistance(candidate, s).toFloat
(candidate, 1.0F - (dist / maxlen))
})
.sortWith((a, b) => a._2 > b._2)
.head
val level = candLevels.getOrElse(topscore._1, 0.0F)
level * topscore._2
}
//////////////// misc methods ////////////////
def formatSuggestion(sugg: Suggestion): String = {
"[%6.2f%%] (%s) %s"
.format(sugg.score, sugg.cui, sugg.descr)
}
}
To run the code, we use the following JUnit test (commenting and uncommenting tests as needed).
// Source: src/test/scala/com/mycompany/scalcium/umls/UmlsTagger2Test.scala
package com.mycompany.scalcium.umls
import java.io.File
import org.junit.Assert
import org.junit.Test
class UmlsTagger2Test {
@Test
def testBuildIndex(): Unit = {
val tagger = new UmlsTagger2("")
tagger.buildIndexJson(
new File("/home/sujit/Projects/med_data/cuistr1.csv"),
new File("/home/sujit/Projects/med_data/cuistr1.json"))
}
@Test
def testGetFull(): Unit = {
val tagger = new UmlsTagger2("http://localhost:8983/solr")
val phrases = List("Lung Cancer", "Heart Attack", "Diabetes")
phrases.foreach(phrase => {
Console.println()
Console.println("Query: %s".format(phrase))
val suggestions = tagger.select(phrase)
suggestions match {
case Some(suggestion) => {
Console.println(tagger.formatSuggestion(suggestion))
Assert.assertNotNull(suggestion.cui)
}
case None =>
Assert.fail("No results for [%s]".format(phrase))
}
})
}
@Test
def testGetPartial(): Unit = {
val tagger = new UmlsTagger2("http://localhost:8983/solr")
val phrases = List(
"Heart Attack and diabetes",
"carcinoma (small-cell) of lung",
"asthma side effects")
phrases.foreach(phrase => {
Console.println()
Console.println("Query: %s".format(phrase))
val suggestions = tagger.tag(phrase)
suggestions match {
case Some(psuggs) => {
psuggs.foreach(psugg => {
Console.println(tagger.formatSuggestion(psugg))
})
Assert.assertNotNull(psuggs)
}
case None =>
Assert.fail("No results for [%s]".format(phrase))
}
})
}
@Test
def testAnnotateConcepts(): Unit = {
val tagger = new UmlsTagger2("http://localhost:8983/solr")
val phrases = List("Lung Cancer",
"Heart Attack",
"Diabetes",
"Heart Attack and diabetes",
"carcinoma (small-cell) of lung",
"asthma side effects"
)
phrases.foreach(phrase => {
Console.println()
Console.println("Query: %s".format(phrase))
val suggestions = tagger.annotateConcepts(phrase)
suggestions.foreach(suggestion => {
Console.println(tagger.formatSuggestion(suggestion))
})
})
}
}
The results of testAnnotateConcepts() are shown below. I've used the same query terms as with the previous algorithm, and the results are consistent with what I saw there.
Query: Lung Cancer
[100.00%] (C0242379) Lung Cancer
Query: Heart Attack
[100.00%] (C0027051) Heart Attack
Query: Diabetes
[100.00%] (C0011847) Diabetes
Query: Heart Attack and diabetes
[ 25.00%] (C0002838) AND
[ 25.00%] (C1515981) And
[ 25.00%] (C0011847) Diabetes
[ 25.00%] (C1706368) And
[ 12.50%] (C1550557) and
[ 12.50%] (C0027051) heart attack
[ 6.25%] (C0011849) diabetes
[ 6.25%] (C0011860) diabetes
Query: carcinoma (small-cell) of lung
[ 26.79%] (C0149925) small cell carcinoma of lung
Query: asthma side effects
[ 33.33%] (C2984299) Asthma
[ 16.67%] (C0001688) side effects
[ 8.33%] (C0004096) asthma
That's all I have for today. Next week hopefully I will talk about something else :-).
Sujit,
It’s wonderful to see one of my favorite bloggers try out my SolrTextTagger! I hope I didn’t seem nagging in the past but I knew you would find it really useful.
I’d like to point out some things to anyone reading this post — things you may very well already realize, Sujit, but others may not:
* Solr offers many advantages: network/remote access (optional; OpenSextant and Stanbol embed Solr instead), ease of configuring text-analysis and any other aspect of the index, ease of loading data into it (CSV? JSON? XML? all good), experimental search / browse-ability of the corpus from HTTP, ease of configuring the document cache (particularly relevant to the text tagger), search (text tagging request) response time statistics, and more. It’s all the reasons people talk to Solr or ElasticSearch these days versus writing lots of inflexible code to talk to Lucene directly. But all this said, anyone bent on using it without Solr can totally do it; there’s just one class (the request handler) that you’d need to port to work the way you want it to.
* The 1.x version of the text tagger indeed uses Lucene FSTs directly. The 2.x version (MemPF branch at time of writing) uses them implicitly by telling Lucene to use the “Memory” postings format. This has a lot of advantages: the RAM footprint is about half, there’s less code to maintain, and you don’t need to explicitly “build” a data structure as you do in 1.x — Lucene amortizes that as you index (although you’ll likely want to optimize at the end), and finally there’s no custom file to write — it’s managed by the Lucene index itself. The 2.x version has yet to include a big feature in 1.x which allows a wider variety of text analysis configuration — see issue #20 to track progress. Credit to Rupert Westenthaler at Stanbol for awesome work on the “phrase builder” in 1.x.
* It’s unclear why, but in this blog I see you (Sujit) are “manually” doing the stemming and other text analysis, then putting that in Solr such that by the time Solr sees it, all the text analysis has been done already. That is… odd to me. Part of the value proposition of the text tagger is that you don’t have to do that — it’s the text tagger’s job. Simply configure the text analysis as desired in the schema for the field you’ve marked as the “indexedField” in the TaggerRequestHandler’s configuration and then submit your original text. Perhaps you’re not doing this because you want to see the analyzed text as part of your score computation. Hmmm; interesting. I could imagine a new feature in which the tagger emits the analyzed text for the match.
Thank you for the kind words David - it's great knowing that someone of your stature considers me one of his favorite bloggers :-). And thank you for suggesting SolrTextTagger - the single pass substring matching it offers makes it a very attractive candidate for replacing our current proprietary memory-based NER (although we do have a lot of legacy issues to consider).
Regarding the NRT features you described for SolrTextTagger v2, our current workflow wouldn't benefit from this since we do full releases (i.e. not incremental) of our taxonomy. However, this would come in handy if we ever moved to a GraphDB-backed taxonomy where we could publish changes to Solr as they are saved into the DB.
To answer your question about why I am doing manual stemming: my use case is slightly different from OpenSextant's - instead of matching substrings in large chunks of text, I need to match noun phrases, either exactly (or almost exactly) against an entry in my taxonomy, or failing that, match substrings in the phrase (almost exactly, but more strictly than the full phrase) to entries in the taxonomy. In case of the full phrase match, I match it (the full phrase, not the tokens) exactly, then with case and punctuation normalized, then with the words in the normalized phrase alpha-sorted, then against the stemmed version of the normalized words. Had I let Solr handle the stemming it would also have tokenized the strings, so to get the kind of matching I wanted I would have had to enforce exact phrase matches and deal with false positives returned by Solr when there are none.
In case of the substring matches (where I use your SolrTextTagger's tag service), we tend to get back lots of false positives with alpha-sorted and stemmed matches, since there is less context, so we just use case/punctuation normalized entries as input to the FST and pass in a case/punctuation normalized version of the phrase during search. I suppose I could have used a Lucene/Solr analyzer chain to do this and used the original phrase instead, but I already had the code to do the normalization for the full phrase match case, so it just seemed simpler to use that.
Hello,
Looking for the code related to converting from CSV to JSON. I have now compiled the Scala code; just need to have the right input.
I have the csv file, but want to be sure I have the same json format you are expecting.
Thanks,
Evan
referring to
"just wrote some code that converted the (CUI,STR) CSV format into JSON, with additional fields created by our case/punctuation normalization, alpha sort and stemming"
Looks like I don't have the code anymore, but I think it was something fairly simple and throwaway that converts the CSV file (you have this already) to an equivalent JSON format (look at books.csv and books.json in exampledocs in the Solr distribution; that should give you an idea of the transformation required).
Will do
ReplyDeleteHello Sujit,
I read your two posts concerning text tagging and I would like to get your advice on my use case.
I need to extract named entities from text, based on a proprietary dictionary (10 classes of entities, each containing 1-10M entities, 1-5 words each) and based on regular expressions (15 of them).
I'm considering using Stanbol, SolrTextTagger, or LingPipe.
I am looking for the simplest service to update, because basically it is for non-programmer use after the first installation, and the use case requires only extraction to an external field, e.g. no indexing.
What is your recommendation?
I would recommend using LingPipe's RegExChunker for the regular expressions as the first filter, followed by SolrTextTagger. If simplicity of update is a huge focus, using LingPipe's DictionaryChunker instead of SolrTextTagger may make more sense (because update simply means adding a line to your dictionary flat file and restarting) but you will need to accommodate 150M entries in memory. However, even with SolrTextTagger, updates are not a huge deal as long as you have processes laid out.
ReplyDeleteHi Sujit,
Great work! I am trying to add more features based on your work, including integrating with cTAKES and running on a Spark cluster, and so on.
Do you mind if I push it to GitHub as a Maven project under the Apache License, version 2? I think it will be great if other guys are interested in this project.
Thanks Zhiwei. And yes, that would be great - please let me know the URL of your GitHub project and I can post it here. I am doing something at work which involves calling SolrTextTagger from Spark; it would be interesting to see your approach and compare notes. Also you might want to post a message on the SolrTextTagger GitHub site - I think David Smiley (main author of SolrTextTagger) would be interested in knowing about it.
This post started long ago. But just to address the applications of SolrTextTagger in light of your use case, Sujit, I thought I should mention the XTax tagger in Xponents (https://github.com/OpenSextant/Xponents/tree/master/Extraction). Your discussion of tagging specialized lexicons, using exact matches and post-filtering, pre-filtering for false positives is all in scope for XTax.
ReplyDeleteI'll be refactoring some of this to be better documented and make use of Solr 5.x. This is very much a live and active community. We're constantly trying to improve and simplify.
Spark + SolrTextTagger applications are where I am heading next/now.
cheers,
Marc
Thanks for the pointer to XTax Marc. I built an interface from Spark to SolrTextTagger for use within my company and they allowed me to open source it here. I am doing a presentation about it at Spark Summit EU 2015 next week. I will send David the links to the slides and presentation as they become available (using the Issues feature on SolrTextTagger).
ReplyDeleteHi, Sujit,
Thanks for the great post! I am going to use the idea "Fuzzy String Matching with SolrTextTagger" in a paper, but I can't find any formal citation for it (a paper usually needs a formal citation of a publication, not a blog address). Could you send me a citation for this blog post if you have one, or your authorization if you agree to let me use it in a paper? email: z.w.chan.jason at gmail.com
Thank you very much.
Hi Zhiwei, I uploaded the slides of my presentation about this at Spark Summit EU 2015 at data.mendeley.com.
https://data.mendeley.com/datasets/4xdkh7xdtt/1
You can use the citation from that.
Pal, Sujit (2015), “Dictionary Based Annotation at Scale with Spark SolrTextTagger and OpenNLP”, Mendeley Data, v1
http://dx.doi.org/10.17632/4xdkh7xdtt.1
Hi, Sujit, your project is cool. I also use Spark (MLlib) to discover new terms from lots of text. In fact, the result of fuzzy string matching is just one of the features that I use for clustering and classification. After we finish our project, we will open-source the code on GitHub too.
Thank you very much.
Hi Zhiwei, that is very good news. Look forward to seeing your approach.
ReplyDeleteHi,
There is a minor bug in the code: the "desc_stemmed" in line 61 is not the same as the query term "descr_stemmed" in line 123. This causes the stemmed form to fail to match anything.
61: .append("\"desc_stemmed\":\"").append(strStemmed).append("\"")
123: val query = """descr:"%s" descr_norm:"%s" descr_sorted:"%s" descr_stemmed:"%s""""
Thanks Zhiwei, good catch. Line 61 should be descr_stemmed. I have updated the code in the post.
ReplyDeleteThis is great article, thanks for sharing. I am trying to evaluate if SolrTextTagger would be useful in my case. I would have large dictionaries as input, these dictionary could contain city names, people names, numbers etc and they will hold millions of records. I would also have documents to scan against these dictionaries. My documents may have data that is not easy to tokenize. For example, the document might contain theworldisaniceplace (read as the world is a nice place). My dictionary could have an entry for the word "world", so I need to report my document as a match here. I am wondering if this is a proper usecase for SolrTextTagger.
Thanks Sanjeev, and no, I don't think SolrTextTagger would tokenize these compound strings. The lowest unit in SolrTextTagger is the word, so it will match words in the incoming text stream that are in its dictionary. Since it looks like your setup includes Solr or Lucene, you may want to take a look at the compound word token filters it provides.
ReplyDeleteHi Sujit, thanks for clarifying. My current setup does not have Solr/Lucene, currently, I use Aho-Corasick dictionary match which works just great except for large dictionaries where it consumes unacceptable amount of heap, so I am exploring alternate options.
ReplyDeleteSanjeev,
Next to the SolrTextTagger project are some projects that are very similar to yours, and they use SolrTextTagger quite well.
Xponents employs a dictionary of 17 million place names and tags 10kb/sec with 2.5 GB of RAM.
More details to share.....
Marc- Does that work with documents which cannot be tokenized?
ReplyDeleteThis use case seems a bit challenging -- a fun brain teaser for me. But nonetheless, the SolrTextTagger can probably do it thanks to the versatility of the analyzers in Lucene. The query time analysis chain (analyzing the incoming documents / queries to be tagged ) could end with an NGramTokenFilter of size 1 and thus at each letter we start looking for a new match. The index time chain would need to have a PatternReplaceFilter after the concatenating one that inserts a space char after each non-space (to produce ex: "n e w y o r k c i t y". There are some inefficiencies here that could be improved like the space after each char will make for a larger dictionary, but in the scheme of things not a big deal.
Sanjeev --
Since Xponents uses SolrTextTagger for this capability and scale of tagging, it is limited to what SolrTextTagger can do. As of now, I have not tested such cases where you need novel sub-word tokenization. I was mostly responding to your need to use a large dictionary of terms to tag, but not exceed practical JVM/OS memory limits.
Anyhow, to respond to David's solution -- you might want to employ multiple taggers in parallel, given the potential performance implications/inefficiency of tokenizing down to the character.
Things like Xponents' TaxonMatcher (basically a keyword tagger using SolrTextTagger, "STT") offer a blueprint for making an application out of STT. But for your purposes I can't say how well it would work, since there is a good amount of assumption about reasonably clean, natural language text as input. You have fundamentally unique data (text).
Links - https://github.com/OpenSextant/Xponents/tree/master/Extraction/src/main/java/org/opensextant/extractors/xtax ... various READMEs around there. But specifically, you'd be adapting the schema.xml as David suggests:
https://github.com/OpenSextant/Xponents/blob/master/solr/solr4/taxcat/conf/schema.xml#L124 The phrase and phrase_tag fields here store the values of your dictionary; phrase_tag is of type "name_tag", which is to be configured according to your tokenization and filtering rules so that you produce the unit of data (token) that is fed to the FST.
Other uses of STT: https://stanbol.apache.org/docs/trunk/components/enhancer/engines/lucenefstlinking
But indeed consider starting with:
https://github.com/OpenSextant/SolrTextTagger
Sanjeev,
I think the points to consider are --
Does this solve your memory limitation?
Is the complexity of this technique in line with what you expect to solve your problem?
What other ways are there to solve this other than Aho-Corasick?
great discussion.
Marc
David and Marc-
At this moment I have limited flexibility to experiment with this. In the meantime, to keep moving, I have found a low-footprint in-memory trie implementation; however, with a large number of dictionaries, I think I will hit the threshold here as well. So, long term, I will need a SolrTextTagger-like tool to do the lookup on a separate JVM. In that direction, I will look at the options suggested by both of you.
Thanks for the detailed responses.