My last post was a little over a month ago, a record gap for me - I generally try to post every week, or at least every other week. The reason for the delay is that I got stuck on an idea which turned out to be not very workable. The problem with these situations is that the idea kind of eats at me until I am able to resolve it, or realize it's completely unworkable and abandon it. I haven't completely given up hope on the idea yet, but I couldn't think of any ways to solve it either, so I decided to put it aside and catch up on my reading [1] instead.
In the meantime, at work we have started using UIMAFit for a new NLP pipeline we are building. I had experimented with UIMA in the past, but gave up because its heavy dependence on XML became a pain after a while. UIMAFit does not completely get rid of XML - you still need to define the types in XML and generate the code using JCasGen - but the Analysis Engines no longer need to be described in XML.
Generally, I try to experiment with tools before proposing them at work, and since I do all my (JVM based) personal projects in Scala nowadays, I initially thought of using UIMAFit with Scala. However, using UIMAFit would make my (personal) project a mixture of Java and Scala (JCasGen generates Java classes for the XML types), something I wanted to avoid if possible. Luckily, I came across the uimaScala project, which provides a Scala interface to UIMAFit and, as an added bonus, eliminates XML altogether (it uses a Scala DSL to specify the types instead).
Unfortunately, the project had been written using Scala 2.9 and built with SBT 0.12, and I was using Scala 2.10 and SBT 0.13. My attempts to just use the project based on the instructions in the project's README.md failed, as did my attempts to build it locally. So I contacted the author, who was kind enough to make the necessary changes so that it worked with Scala 2.11. So currently I am using Scala 2.11 for this project; there are still quite a few Scala 2.10 based projects like Spark and Scalding that I use, so I can't do a wholesale upgrade yet. This post describes an annotator, built using uimaScala, that marks up a text with PERSON and ORGANIZATION tags using OpenNLP's Named Entity Recognizer.
[Edit (2014-07-07): the uimaScala project now also offers a JAR built with Scala 2.10. I was able to compile and run my project by updating my scalaVersion to 2.10.2 and removing the dependency on scala-xml (split out into its own library in 2.11) in my build.sbt file.]
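For reference, here is roughly what the relevant build.sbt settings looked like after that change (a sketch only - the uimaScala coordinates and version numbers below are from memory and may not be exact):

// build.sbt (sketch): building the project against Scala 2.10
scalaVersion := "2.10.2"

libraryDependencies ++= Seq(
  // illustrative coordinates - check the uimaScala README for the real ones
  "com.github.jenshaase.uimascala" %% "uimascala-core" % "0.4.0",
  "org.apache.opennlp" % "opennlp-tools" % "1.5.3"
)
// note: no scala-xml dependency here - it was only split out of the
// standard library in Scala 2.11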
First, the NameFinder. My pipeline doesn't actually need an NER that recognizes PERSON and ORGANIZATION, but I've been meaning to figure out how to do this with OpenNLP for a while, so I built it anyway. Here's the code:
// Source: src/main/scala/com/mycompany/scalcium/utils/NameFinder.scala
package com.mycompany.scalcium.utils

import java.io.File
import java.io.FileInputStream

import org.apache.commons.io.IOUtils

import opennlp.tools.namefind.NameFinderME
import opennlp.tools.namefind.TokenNameFinderModel

class NameFinder {

  val ModelDir = "src/main/resources/opennlp/models"
  val tokenizer = Tokenizer.getTokenizer("opennlp")
  val personME = buildME("en_ner_person.bin")
  val orgME = buildME("en_ner_organization.bin")

  // run the given finder over all sentences of a document, clearing
  // the finder's adaptive data once the document is done
  def find(finder: NameFinderME, doc: List[String]):
      List[List[(String,Int,Int)]] = {
    try {
      doc.map(sent => find(finder, sent))
    } finally {
      clear(finder)
    }
  }

  // returns (entity text, start token index, end token index) triples
  def find(finder: NameFinderME, sent: String):
      List[(String,Int,Int)] = {
    val words = tokenizer.wordTokenize(sent).toArray
    finder.find(words).map(span => {
      val start = span.getStart()
      val end = span.getEnd()
      val text = words.slice(start, end).mkString(" ")
      (text, start, end)
    }).toList
  }

  def clear(finder: NameFinderME): Unit = finder.clearAdaptiveData()

  def buildME(model: String): NameFinderME = {
    var pfin: FileInputStream = null
    try {
      pfin = new FileInputStream(new File(ModelDir, model))
      new NameFinderME(new TokenNameFinderModel(pfin))
    } finally {
      IOUtils.closeQuietly(pfin)
    }
  }
}
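To make the return shape concrete, here is a quick sketch (untested) of how NameFinder is meant to be called; the sentences are from the test case further down:

// sketch: exercising NameFinder standalone
val nf = new NameFinder()
val sents = List(
  "Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .",
  "Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields PLC , was named a director .")
// one inner list per sentence, each element a
// (entity text, start token index, end token index) triple
val persons = nf.find(nf.personME, sents)
persons.foreach(println)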
The Annotator uses the NameFinder and a previously written Tokenizer (not shown here - it's a thin wrapper on top of OpenNLP's tokenizers that provides methods which work like NLTK's tokenizer methods). Note that this is generally not the way I would structure my annotator - I would prefer to have a pipeline with a sentence tokenizer ahead of this one, and make the NameFinderAnnotator work on sentences instead - but in the interests of time and space I decided to have it accept the full text and tokenize it inside the process method.
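Since the Tokenizer itself isn't shown, here is a hypothetical sketch of the interface the code in this post assumes (the real class wraps OpenNLP's sentence detector and tokenizer behind these NLTK-style method names):

// hypothetical sketch of the Tokenizer wrapper assumed in this post
trait Tokenizer {
  def sentTokenize(text: String): List[String] // text -> sentences
  def wordTokenize(sent: String): List[String] // sentence -> words
}
object Tokenizer {
  // factory keyed by implementation name, e.g. "opennlp"
  def getTokenizer(name: String): Tokenizer = ???
}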
// Source: src/main/scala/com/mycompany/scalcium/pipeline/NameFinderAnnotator.scala
package com.mycompany.scalcium.pipeline

import org.apache.uima.jcas.JCas
import com.github.jenshaase.uimascala.core.SCasAnnotator_ImplBase
import com.mycompany.scalcium.utils.NameFinder
import com.mycompany.scalcium.utils.Tokenizer

class NameFinderAnnotator extends SCasAnnotator_ImplBase {

  val tokenizer = Tokenizer.getTokenizer("opennlp")
  val namefinder = new NameFinder()

  override def process(jcas: JCas): Unit = {
    val text = jcas.getDocumentText()
    val sentences = tokenizer.sentTokenize(text)
    // running total of sentence lengths, used to convert sentence-relative
    // character offsets to document-relative ones
    val soffsets = sentences.map(sentence => sentence.length())
      .scanLeft(0)(_ + _)
    // person annotations
    val allPersons = namefinder.find(namefinder.personME, sentences)
    applyAnnotations(jcas, allPersons, sentences, soffsets, "PER")
    // organization annotations
    val allOrgs = namefinder.find(namefinder.orgME, sentences)
    applyAnnotations(jcas, allOrgs, sentences, soffsets, "ORG")
  }

  def applyAnnotations(jcas: JCas,
      allEnts: List[List[(String,Int,Int)]], sentences: List[String],
      soffsets: List[Int], tag: String): Unit = {
    var sindex = 0
    allEnts.map(ents => { // all entities in each sentence
      ents.map(ent => {   // single entity
        // the extra sindex accounts for the separator character
        // between consecutive sentences
        val coffset = charOffset(soffsets(sindex) + sindex,
          sentences(sindex), ent)
        val entity = new Entity(jcas, coffset._1, coffset._2)
        entity.setEntityType(tag)
        entity.addToIndexes()
      })
      sindex += 1
    })
  }

  def charOffset(soffset: Int, sentence: String, ent: (String,Int,Int)):
      (Int,Int) = {
    // rebuild the entity string from its token span, then locate it in
    // the sentence to recover character offsets
    val estring = tokenizer.wordTokenize(sentence)
      .slice(ent._2, ent._3)
      .mkString(" ")
    val cbegin = soffset + sentence.indexOf(estring)
    val cend = cbegin + estring.length()
    (cbegin, cend)
  }
}
The Entity annotation is described using the following Scala DSL. It defines an annotation that has the standard fields (begin, end) and an additional entityType property. Unfortunately, my Scala-IDE (customized Eclipse) is not able to recognize it as valid Scala; however, it all compiles and runs fine from SBT on the command line. Very likely I have to let Scala-IDE know about the paradise compiler plugin (see the uimaScala README for setting up the compiler plugin in your build.sbt). But hey, it's better than having to write the types in XML!
// Source: src/main/scala/com/mycompany/scalcium/pipeline/TypeSystem.scala
package com.mycompany.scalcium.pipeline

import com.github.jenshaase.uimascala.core.description._
import org.apache.uima.jcas.tcas.Annotation
import org.apache.uima.cas.Feature

@TypeSystemDescription
object TypeSystem {

  // Entity gets the standard (begin, end) offsets plus an entityType feature
  val Entity = Annotation {
    val entityType = Feature[String]
  }
}
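For what it's worth, the registration the README describes boils down to a single line in build.sbt - this is the standard macro paradise incantation (the version number shown is illustrative), and pointing Scala-IDE at the same plugin should in principle fix the red squiggles:

// build.sbt (sketch): register the macro paradise compiler plugin,
// which the @TypeSystemDescription macro annotation relies on
addCompilerPlugin(
  "org.scalamacros" % "paradise" % "2.0.1" cross CrossVersion.full)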
The uimaScala README recommends using its scalaz-stream based DSL to construct and execute pipelines. I haven't tried that yet; my JUnit test below is based on patterns similar to the Java JUnit tests for my UIMAFit based pipeline at work. The test takes a block of text and prints out the Entity annotations created by the NameFinderAnnotator.
// Source: src/test/scala/com/mycompany/scalcium/pipeline/NameFinderAnnotatorTest.scala
package com.mycompany.scalcium.pipeline

import org.junit.Test
import org.apache.uima.fit.factory.AnalysisEngineFactory
import org.apache.uima.fit.util.JCasUtil

import scala.collection.JavaConversions._

class NameFinderAnnotatorTest {

  val text = """
Pierre Vinken , 61 years old , will join the board as a nonexecutive
director Nov. 29 . Mr. Vinken is chairman of Elsevier N.V. , the Dutch
publishing group . Rudolph Agnew , 55 years old and former chairman of
Consolidated Gold Fields PLC , was named a director of this British
industrial conglomerate ."""

  @Test
  def testPipeline(): Unit = {
    val ae = AnalysisEngineFactory.createEngine(classOf[NameFinderAnnotator])
    val jcas = ae.newJCas()
    jcas.setDocumentText(text)
    ae.process(jcas)
    JCasUtil.select(jcas, classOf[Entity]).foreach(entity => {
      Console.println("(%d, %d): %s/%s".format(
        entity.getBegin(), entity.getEnd(),
        text.substring(entity.getBegin(), entity.getEnd()),
        entity.getEntityType()))
    })
  }
}
The output of this test is shown below. It seems to have missed "Mr. Vinken" and "Elsevier N.V." as PERSON and ORGANIZATION respectively, but this looks like a limitation of the OpenNLP NameFinder (or maybe not even a limitation - it's a model-based tagger after all, so the results depend on what it was trained with).
(0, 13): Pierre Vinken/PER
(159, 172): Rudolph Agnew/PER
(211, 239): Consolidated Gold Fields PLC/ORG
And that's all I have for today. Hopefully it was worth the wait :-).
[1]: In case you are curious about what I was reading while I was not posting articles last month, here is the list of books I read over the last month. The last one was specifically so I could learn how to make the uimaScala code compile under Scala 2.10, but it turned out to be unnecessary - many thanks to Jens Haase (the author of uimaScala) for that.
- The Theory that wouldn't die: How Bayes' Rule cracked the Enigma Code, hunted down Russian submarines...
- Learning NumPy Array
- NumPy Cookbook
- Learning SciPy for Numerical and Scientific Computing
- Getting Started with SBT for Scala
Update (2014-09-02): I recently tried the Stanford NER because I had heard good things about it, and I am happy to say it vastly outperforms OpenNLP in terms of tagging quality, at the expense of a very slight increase in processing time (3755ms for Stanford NER vs 3746ms for OpenNLP on my 3 sentence test above). OpenNLP has pre-trained models for PERSON and ORGANIZATION entity detection, while Stanford NER can recognize PERSON, LOCATION, ORGANIZATION and MISC. I show the results from OpenNLP and Stanford for my 3 sentences below for comparison.
==== OpenNLP ====
Pierre Vinken, 61 years old, will join the board as a nonexecutive director
Nov. 29.
(0,13): Pierre Vinken / PERSON
Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group based
at Amsterdam.
Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields
PLC, was named a director of this British industrial conglomerate.
(0,13): Rudolph Agnew / PERSON
(52,80): Consolidated Gold Fields PLC / ORGANIZATION
==== Stanford ====
Pierre Vinken, 61 years old, will join the board as a nonexecutive director
Nov. 29.
(0,13): Pierre Vinken / PERSON
Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group based
at Amsterdam.
(0,10): Mr. Vinken / PERSON
(26,39): Elsevier N.V. / ORGANIZATION
(45,50): Dutch / MISC
(77,86): Amsterdam / LOCATION
Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields
PLC, was named a director of this British industrial conglomerate.
(0,13): Rudolph Agnew / PERSON
(52,80): Consolidated Gold Fields PLC / ORGANIZATION
(111,118): British / MISC
My code to call the Stanford NER and extract entities from it is shown below. It takes a list of sentences and returns, for each sentence, a List of triples containing the entity tag and the start and end character offsets.
package com.mycompany.scalcium.names

import java.io.File

import scala.collection.JavaConversions._

import com.mycompany.scalcium.tokenizers.Tokenizer

import edu.stanford.nlp.ie.AbstractSequenceClassifier
import edu.stanford.nlp.ie.crf.CRFClassifier
import edu.stanford.nlp.ling.CoreLabel

class StanfordNameFinder extends NameFinder {

  val ModelDir = "src/main/resources/stanford"
  val tokenizer = Tokenizer.getTokenizer("opennlp")
  val classifier = buildClassifier(
    "english.conll.4class.distsim.crf.ser.gz")

  override def find(sentences: List[String]):
      List[List[(String,Int,Int)]] = {
    // classifyToCharacterOffsets returns (tag, begin, end) triples with
    // character offsets relative to the sentence
    sentences.map(sentence =>
      classifier.classifyToCharacterOffsets(sentence)
        .map(triple => (triple.first,
          triple.second.toInt, triple.third.toInt))
        .toList)
  }

  def buildClassifier(model: String):
      AbstractSequenceClassifier[CoreLabel] = {
    val modelfile = new File(ModelDir, model)
    CRFClassifier.getClassifier(modelfile)
  }
}
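Calling it is symmetrical with the OpenNLP version - a quick (untested) sketch:

// sketch: running the Stanford name finder over a sentence
val snf = new StanfordNameFinder()
val ents = snf.find(List(
  "Mr. Vinken is chairman of Elsevier N.V. , the Dutch publishing group ."))
// each triple is (tag, begin, end) with sentence-relative character
// offsets, e.g. ("PERSON", 0, 10), ("ORGANIZATION", 26, 39), ...
ents.foreach(println)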
Hi Sujit!
Great post (as usual). Looking at your code: do you basically use the OpenNLP models for person/organization detection? If so, have they been good enough for your purposes?
Thanks Dmitry. No, we haven't had a need for detecting person and organization names in our pipeline (we are more into diseases, drugs, symptoms, etc), but I wanted to check out the OpenNLP NameFinder, which is why I used it for this example. From my very limited testing (with the 3 sentences in the post), it seems to miss quite a lot.
Thanks, Sujit. It probably depends on the feature set used during model training as well. But I'm not sure if OpenNLP provides a way to pick your own features during training.
Yes, I was using the OpenNLP pre-trained models. I couldn't find what corpus was used to train these models, but OpenNLP does allow you to train your own models and pick your own features (documented here).
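For reference, the training call itself is fairly compact. Something like this sketch (based on my reading of the OpenNLP 1.5.x docs, not tested; the training file is one sentence per line with entities marked up as <START:person> ... <END>, and the output filename is just an example):

// sketch: training a custom name finder model with OpenNLP 1.5.x
import java.io.{FileInputStream, FileOutputStream}
import opennlp.tools.namefind.{NameFinderME, NameSampleDataStream}
import opennlp.tools.util.{PlainTextByLineStream, TrainingParameters}
import opennlp.tools.util.featuregen.AdaptiveFeatureGenerator

val lines = new PlainTextByLineStream(
  new FileInputStream("train.txt"), "UTF-8")
val samples = new NameSampleDataStream(lines)
val model = NameFinderME.train("en", "person", samples,
  TrainingParameters.defaultParams(),
  null.asInstanceOf[AdaptiveFeatureGenerator], // use default features
  java.util.Collections.emptyMap[String, Object]())
model.serialize(new FileOutputStream("en_ner_custom.bin"))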
Hi Dmitry, just an update, in case this is still an open issue for you: Stanford NER appears to be much better at detecting entities (although slightly slower). I've included some output and code at the bottom of the post.
Hi Sujit! Thanks for sharing your experience with the Stanford NER; it looks more promising than OpenNLP (at least for English).
ReplyDeleteIn fact, my original task is to find a suitably qualitative tool / framework that could be trained for Russian. I have got a parser that would be able to output POS tags and lemmas + Object tags, like LOCATION, PERSON, CITY, COUNTRY etc for a sentence. I'm hoping this could be used as a training data to build a classifier. I'm really hoping to get there and blog about it :)
The reason I don't want to use the parser directly is that it is quite bound to the Windows platform, and may be slower since it performs a larger computation to construct a dependency tree.
Hi Dmitry, interesting approach, and it makes sense from a performance POV. I once built a Maxent classifier based NER that recognized consumer electronics terms from data I tagged myself - it's described here; may be of help perhaps.
ReplyDeleteHi Sujit,
Thanks for sharing the link to another great post. Do you happen to remember how the shape of a word is formalized? The link to the paper describing it gave a 403 (Forbidden).
I happened to have trained a MaxEnt classifier for English in a Coursera NLP course, where the feature list was rather extensive, to cover many patterns of people's surnames. IIRC, the accuracy was also in the 90+ ballpark.
I like your idea of building a binary classifier for a specific class of named entities. Then the logic of resolving conflicts could perhaps be moved to the classifier ensemble level.
Thanks for your kind words Dmitry, and thanks for pointing out the bad link for the word-shapes paper - it's "Content Characterization using Word Shape Tokens" by Sibun and Farrar - I've updated the link in the post as well. Regarding the idea of using multiple one-vs-all classifiers: in the CE NER I just used a single binary classifier, but Peter Flach's Machine Learning book has some discussion of strategies for resolving conflicts within an ensemble.
ReplyDeleteHi Sujit,
First of all, your posts are very useful, so thank you so much for sharing this information.
I have one question. I am using OpenNLP for parsing unstructured data, and I have created a corpus of 4 million records. When I create a model out of the corpus, the OpenNLP process takes around 3 hours to build the model using the OpenNLP APIs.
So my question is: is there any way I can speed up this process? Because it takes 3 hours, I am not able to experiment with it frequently.
As you are using OpenNLP, have you ever come across this kind of problem?
Please share some expertise on this.
Thanks
Nikhil Jain
Thanks for the kind words, Nikhil, and you're welcome, glad you found it useful.
3 hours for 4M records works out to 370.4 recs/s, which I think is quite good. One way to speed up development cycles is to work with a smaller sample (say a random sample of 1000 recs, which would get processed under a minute each time) during development and then move to the full set when your code is ready. Another option, if what you are doing with the documents is independent of other documents, and you have the processing power, is to parallelize the processing across multiple machines. We do the former during development and the latter during processing runs.
Thanks Sujit for the feedback.
1. One way to speed up development cycles is to work with a smaller sample (say a random sample of 1000 recs, which would get processed under a minute each time) during development and then move to the full set when your code is ready:
If I take a random sample of, say, 15000 records, no doubt it takes much less time and works fine, but I don't think we can say that a model built on 15000 records will give the same performance (in processing runs) as a model built on 4 million records. This is what is happening to me: if I take a small training set, the model performs well in processing runs, but the model built on 4 million records behaves differently, and also takes 3 hours to build.
2. Another option, if what you are doing with the documents is independent of other documents, and you have the processing power, is to parallelize the processing across multiple machines:
Yes, I am trying to implement OpenNLP on Spark, which can give me processing power and divide the work across multiple nodes, but I did not find good Java examples on the web, so I am struggling a bit. I read on the web that implementing OpenNLP on Spark may be possible with UIMAFit, but I don't know how. Need examples.
Thanks
Nikhil
I think our pipelines look a bit different - mine is basically a set of annotators operating on a single document at a time, where we use OpenNLP to do sentence segmentation and phrase chunking (and implicitly also word segmentation and POS tagging). Periodically we also do large batch runs across the full corpus for quality checking. UIMA also has a batch component (CPE, the Collection Processing Engine); we built our version using Storm.
In your case, it appears that you are trying to build a model from the corpus. Based on that, your pipeline is probably not as "embarrassingly parallel" as ours. Could you elaborate a bit on what you are doing and how OpenNLP and UIMA fit into it?
So, my problem is: when I create a model (e.g. like en_ner_organization.bin) from the training set, the OpenNLP process takes a good amount of time because my training set contains 4 million records. I am not concerned about the pipeline and parsing yet. I am in the initial phase, where I am building a model; once the model is built, I will think about how to parse the information coming from different sources.
What I want to do is build the model from the training set and then run it on test data. The thing is, if I take a small training set and create a model from it, the model builds in less time, but its results differ from those of the model built on 4 million records when I apply both models to the test data. So I cannot say that if a model based on a small training set works fine on test data, then a model based on the large training set will also work on it.
So either I need to do something in the OpenNLP Java code - add some parameters, give it some more memory, something like that - or try to run the code on Spark so that it can create my model in less time.
Here I want to use Spark for building the model, not for parsing the documents, and as far as I know, in order to run OpenNLP on Spark I need to wrap the OpenNLP code in UIMAFit, but I am not sure how.
I hope you understand my problem now.
Wow, you are lucky to have 4M tagged sentences :-). Yes, in that case I guess you will have to take the hit. However, more training data may not necessarily result in a better model. If you train your model with various training set sizes and plot the model's error rate on the test set (or validation set) against training set size, you should see the typical hockey-stick shaped curve, and the optimum training set size may very well be much smaller than 4M - it may be worth looking at this.
I don't know Spark that well (planning on remedying that soon), but from what I know, it's not going to be much help in this case, since the model building is not a parallel operation. But you may want to ask on the respective mailing lists for a more authoritative opinion.
Thanks Sujit for the suggestions. I will try plotting the model's error rate against the test set (or validation set) for different training set sizes, and will look for the optimum training set size.
Thanks again,
Nikhil
You're welcome, happy to help.