Thursday, July 03, 2014

A uimaScala Annotator for Named Entity Recognition


My last post was a little over a month ago, a record for me - I generally try to post every week or at least every other week. The reason for the delay is that I got stuck on an idea that turned out to be not very workable. The problem with these situations is that the idea eats at me until I can either resolve it or conclude it's completely unworkable and abandon it. I haven't completely given up hope on the idea yet, but I couldn't think of a way to solve it either, so I decided to put it aside and catch up on my reading [1] instead.

In the meantime, at work we have started using UIMAFit for a new NLP pipeline we are building. I had experimented with UIMA in the past, but gave up because its heavy dependence on XML became a pain after a while. UIMAFit does not completely get rid of XML: you still need to define the types in XML and generate the code using JCasGen, but the Analysis Engines no longer need to be described in XML.

Generally, I try to experiment with tools before proposing them at work, and since I do all my (JVM based) personal projects with Scala nowadays, I initially thought of using UIMAFit with Scala. However, using UIMAFit would make (my personal) project a mixture of Java and Scala (JCasGen would generate Java classes for the XML types), something I wanted to avoid if possible. Luckily I came across the uimaScala project, which provides a Scala interface to UIMAFit, and eliminates XML altogether as an added bonus (it uses a Scala DSL instead to specify the types).

Unfortunately, the project had been written with Scala 2.9 and built with SBT 0.12, while I was using Scala 2.10 and SBT 0.13. My attempts to use the project based on the instructions in its README.md failed, as did my attempts to build it locally. So I contacted the author, who was kind enough to make the necessary changes so it worked with Scala 2.11. I am currently using Scala 2.11 for this project; there are still quite a few Scala 2.10 based projects that I use, such as Spark and Scalding, so I can't do a wholesale upgrade. This post describes an annotator built using uimaScala that marks up a text with PERSON and ORGANIZATION tags using OpenNLP's Named Entity Recognizer.

[Edit (2014-07-07): the uimaScala project now also offers a JAR built with Scala 2.10. I was able to compile and run my project by updating my scalaVersion to 2.10.2 and removing the dependency on scala-xml (split out into its own library in 2.11) in my build.sbt file.]
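For reference, the relevant parts of my build.sbt look roughly like the fragment below. The version numbers and artifact coordinates here are illustrative guesses, not authoritative; check the uimaScala README for the exact ones.

```scala
// Illustrative build.sbt fragment; the uimascala coordinates and all
// version numbers are assumptions -- see the uimaScala README for the
// authoritative values.
scalaVersion := "2.10.2"

libraryDependencies ++= Seq(
  "com.github.jenshaase.uimascala" %% "uimascala-core" % "0.4.0",
  "org.apache.opennlp" % "opennlp-tools" % "1.5.3",
  "commons-io" % "commons-io" % "2.4",
  "junit" % "junit" % "4.11" % "test"
)
```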

First the Name Finder. My pipeline actually doesn't have a need for a NER that recognizes PERSON and ORGANIZATION, but I've been meaning to figure out how to do this with OpenNLP for a while, so I built it anyway. Here's the code:

// Source: src/main/scala/com/mycompany/scalcium/utils/NameFinder.scala
package com.mycompany.scalcium.utils

import java.io.File
import java.io.FileInputStream

import org.apache.commons.io.IOUtils

import opennlp.tools.namefind.NameFinderME
import opennlp.tools.namefind.TokenNameFinderModel
import opennlp.tools.util.Span

class NameFinder {

  val ModelDir = "src/main/resources/opennlp/models"
  
  val tokenizer = Tokenizer.getTokenizer("opennlp")
  val personME = buildME("en_ner_person.bin")
  val orgME = buildME("en_ner_organization.bin")
  
  def find(finder: NameFinderME, doc: List[String]): 
      List[List[(String,Int,Int)]] = {
    try {
      doc.map(sent => find(finder, sent))
    } finally {
      clear(finder)
    }
  }
  
  def find(finder: NameFinderME, sent: String): 
    List[(String,Int,Int)] = {
    val words = tokenizer.wordTokenize(sent)
                         .toArray
    finder.find(words).map(span => {
      val start = span.getStart()
      val end = span.getEnd()
      val text = words.slice(start, end).mkString(" ")
      (text, start, end)
    }).toList
  }
  
  def clear(finder: NameFinderME): Unit = finder.clearAdaptiveData()
  
  def buildME(model: String): NameFinderME = {
    var pfin: FileInputStream = null
    try {
      pfin = new FileInputStream(new File(ModelDir, model))
      new NameFinderME(new TokenNameFinderModel(pfin))
    } finally {
      IOUtils.closeQuietly(pfin)
    }
  }
}
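
The triples returned by find are (entityText, startToken, endToken), where the token indices come from OpenNLP's Span and the end index is exclusive. A quick usage sketch (it assumes the pre-trained model files are present under src/main/resources/opennlp/models, so it won't run without them):

```scala
// Hypothetical usage of NameFinder (requires the OpenNLP model files on disk).
val nf = new NameFinder()
val sents = List("Pierre Vinken , 61 years old , will join the board .")
val persons = nf.find(nf.personME, sents)
// persons is a List (one element per sentence) of
// (entityText, startTokenIndex, endTokenIndexExclusive) triples,
// e.g. something like List(List(("Pierre Vinken", 0, 2)))
```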

The Annotator uses the NameFinder and a previously written Tokenizer (not shown here; it's a thin wrapper on top of OpenNLP's tokenizers) that provides methods that work like NLTK's text tokenization methods. Note that this is generally not how I would structure my annotator; I would prefer to have a pipeline with a sentence tokenizer ahead of this one and make the NameFinderAnnotator work on sentences instead. But in the interest of time and space I decided to make it accept the full text and tokenize it inside the process method.
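
Since the Tokenizer isn't shown, here is purely an illustration of what such a wrapper might look like; the class layout and model file names are my guesses, not the actual project code, but the sentTokenize/wordTokenize signatures match how it is called above:

```scala
// Hypothetical sketch of the Tokenizer wrapper (not the post's actual code).
// It loads OpenNLP's pre-trained sentence detector and word tokenizer models
// and exposes NLTK-style sentTokenize/wordTokenize methods returning Lists.
package com.mycompany.scalcium.utils

import java.io.FileInputStream

import opennlp.tools.sentdetect.{SentenceDetectorME, SentenceModel}
import opennlp.tools.tokenize.{TokenizerME, TokenizerModel}

object Tokenizer {
  def getTokenizer(name: String): Tokenizer = name match {
    case "opennlp" => new OpenNLPTokenizer()
    case _ => throw new IllegalArgumentException("Unknown tokenizer: " + name)
  }
}

trait Tokenizer {
  def sentTokenize(text: String): List[String]
  def wordTokenize(sentence: String): List[String]
}

class OpenNLPTokenizer extends Tokenizer {
  val ModelDir = "src/main/resources/opennlp/models"
  // model file names below are assumed to follow the same convention
  // as the NER models used in NameFinder
  val sentDetector = new SentenceDetectorME(new SentenceModel(
    new FileInputStream(ModelDir + "/en_sent.bin")))
  val wordTokenizer = new TokenizerME(new TokenizerModel(
    new FileInputStream(ModelDir + "/en_token.bin")))

  override def sentTokenize(text: String): List[String] =
    sentDetector.sentDetect(text).toList

  override def wordTokenize(sentence: String): List[String] =
    wordTokenizer.tokenize(sentence).toList
}
```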

// Source: src/main/scala/com/mycompany/scalcium/pipeline/NameFinderAnnotator.scala
package com.mycompany.scalcium.pipeline

import org.apache.uima.jcas.JCas
import com.github.jenshaase.uimascala.core.SCasAnnotator_ImplBase
import com.mycompany.scalcium.utils.NameFinder
import com.mycompany.scalcium.utils.Tokenizer

class NameFinderAnnotator extends SCasAnnotator_ImplBase {

  val tokenizer = Tokenizer.getTokenizer("opennlp")
  val namefinder = new NameFinder()
  
  override def process(jcas: JCas): Unit = {
    val text = jcas.getDocumentText()
    val sentences = tokenizer.sentTokenize(text)
    val soffsets = sentences.map(sentence => sentence.length())
                            .scanLeft(0)(_ + _)
    // people annotations
    val allPersons = namefinder.find(namefinder.personME, sentences)
    applyAnnotations(jcas, allPersons, sentences, soffsets, "PER")
    // organization annotations
    val allOrgs = namefinder.find(namefinder.orgME, sentences)
    applyAnnotations(jcas, allOrgs, sentences, soffsets, "ORG")
  }
  
  def applyAnnotations(jcas: JCas, 
      allEnts: List[List[(String,Int,Int)]], sentences: List[String], 
      soffsets: List[Int], tag: String): Unit = {
    var sindex = 0
    allEnts.map(ents => { // all entities in each sentence
      ents.map(ent => {   // entity
        val coffset = charOffset(soffsets(sindex) + sindex,
          sentences(sindex), ent)
        val entity = new Entity(jcas, coffset._1, coffset._2)
        entity.setEntityType(tag)
        entity.addToIndexes()
      })
      sindex += 1
    })
  }

  def charOffset(soffset: Int, sentence: String, ent: (String,Int,Int)): 
      (Int,Int) = {
    val estring = tokenizer.wordTokenize(sentence)
      .slice(ent._2, ent._3)
      .mkString(" ")
    val cbegin = soffset + sentence.indexOf(estring)
    val cend = cbegin + estring.length()
    (cbegin, cend)
  }
}
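
The scanLeft in process() does the sentence-offset bookkeeping: for sentences of lengths n1, n2, ..., it produces the running start offsets 0, n1, n1+n2, and so on (the extra + sindex in applyAnnotations then accounts for the single separator character between consecutive sentences). A standalone illustration:

```scala
// Standalone illustration of the scanLeft offset arithmetic from process().
// Sentence lengths are 8, 16 and 4, so the running start offsets are
// 0, 0+8, 8+16 and 24+4.
val sentences = List("One two.", "Three four five.", "Six.")
val soffsets = sentences.map(_.length).scanLeft(0)(_ + _)
println(soffsets)  // prints: List(0, 8, 24, 28)
```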

The Entity annotation is described using the following Scala DSL. It defines an annotation that has the standard fields (begin, end) and an additional entityType property. Unfortunately my Scala-IDE (a customized Eclipse) is not able to recognize it as valid Scala; however, it all compiles and runs fine from SBT on the command line. Very likely I need to make Scala-IDE aware of the paradise compiler plugin (see the uimaScala README.md for setting up the compiler plugin in your build.sbt). But hey, it's better than having to write the types in XML!

// Source: src/main/scala/com/mycompany/scalcium/pipeline/TypeSystem.scala
package com.mycompany.scalcium.pipeline

import com.github.jenshaase.uimascala.core.description._
import org.apache.uima.jcas.tcas.Annotation
import org.apache.uima.cas.Feature

@TypeSystemDescription
object TypeSystem {

  val Entity = Annotation {
    val entityType = Feature[String]
  }
}

The uimaScala README recommends using its scalaz-stream based DSL to construct and execute pipelines. I haven't tried that yet; my JUnit test is instead based on patterns similar to the Java JUnit tests for my UIMAFit based pipeline at work. The test below takes a block of text and outputs the Entity annotations produced by the NameFinderAnnotator.

// Source: src/test/scala/com/mycompany/scalcium/pipeline/NameFinderAnnotatorTest.scala
package com.mycompany.scalcium.pipeline

import org.junit.Test
import org.apache.uima.fit.factory.AnalysisEngineFactory
import org.apache.uima.fit.util.JCasUtil
import scala.collection.JavaConversions._

class NameFinderAnnotatorTest {

  val text = """
    Pierre Vinken , 61 years old , will join the board as a nonexecutive 
    director Nov. 29 . Mr. Vinken is chairman of Elsevier N.V. , the Dutch 
    publishing group . Rudolph Agnew , 55 years old and former chairman of 
    Consolidated Gold Fields PLC , was named a director of this British 
    industrial conglomerate ."""

  @Test
  def testPipeline(): Unit = {
    val ae = AnalysisEngineFactory.createEngine(classOf[NameFinderAnnotator])
    val jcas = ae.newJCas()
    jcas.setDocumentText(text)
    ae.process(jcas)
    JCasUtil.select(jcas, classOf[Entity]).foreach(entity => {
      Console.println("(%d, %d): %s/%s".format(
        entity.getBegin(), entity.getEnd(),
        text.substring(entity.getBegin(), entity.getEnd()),
        entity.getEntityType()))
    })
  }
}

The output of this test is shown below. It seems to have missed "Mr. Vinken" as a PERSON and "Elsevier N.V." as an ORGANIZATION, but this appears to be an issue with the OpenNLP NameFinder (or maybe not even an issue; it's a model-based recognizer after all, so its output depends on what it was trained with).

(0, 13): Pierre Vinken/PER
(159, 172): Rudolph Agnew/PER
(211, 239): Consolidated Gold Fields PLC/ORG

And that's all I have for today. Hopefully it was worth the wait :-).

[1]: In case you are curious about what I read while I was not posting articles, here is the list of books I read over the last month. The last one was picked specifically so I could learn how to make the uimaScala code compile under Scala 2.10, but it turned out to be unnecessary; many thanks to Jens Haase (the author of uimaScala) for that.


4 comments (moderated to prevent spam):

Dmitry Kan said...

Hi Sujit!

Great post (as usual). Looking at your code: do you basically use OpenNLP's models for person/organization detection? If so, have they been good enough for your purposes?

Sujit Pal said...

Thanks Dmitry. No, we haven't had a need for detecting person and organization names in our pipeline (we are more into diseases, drugs, symptoms, etc), but I wanted to check out the OpenNLP NameFinder, which is why I used it for this example. From my very limited testing (with the 3 sentences in the post), it seems to miss quite a lot.

Dmitry Kan said...

Thanks, Sujit. It probably also depends on the feature set used during model training. But I'm not sure if OpenNLP provides a way to pick your own features during training.

Sujit Pal said...

Yes, I was using the OpenNLP pre-trained models. I couldn't find what corpus was used for training these models, but OpenNLP does allow you to train your own models and pick your own features (documented here).