Friday, September 05, 2014

Tokenizing and Named Entity Recognition with Stanford CoreNLP



I got into NLP using Java, but I was already using Python at the time, and soon came across the Natural Language Toolkit (NLTK) and fell in love with the elegance of its API. So much so that when I started working with Scala, I figured it would be a good idea to build an NLP toolkit with an API similar to NLTK's, primarily as a way to learn NLP and Scala, but also to build something that would be as enjoyable to work with as NLTK while having the benefit of Java's rich ecosystem.

The project is perennially under construction, and serves as a test bed for my NLP experiments. In the past, I have used OpenNLP and LingPipe to build Tokenizer implementations that expose an API similar to NLTK's. More recently, I have built a Named Entity Recognizer (NER) with OpenNLP's NameFinder. At the recommendation of one of my readers, I decided to take a look at Stanford CoreNLP, with which I ended up building a Tokenizer and a NER implementation. This post describes that work.

The code for the Tokenizer is shown below. The Tokenizer trait exposes a factory method, so this implementation is obtained by calling Tokenizer.getTokenizer("stanford").

// Source: src/main/scala/com/mycompany/scalcium/tokenizers/StanfordTokenizer.scala
package com.mycompany.scalcium.tokenizers

import java.util.Properties

import scala.collection.JavaConversions._
import scala.collection.mutable.ArrayBuffer

import edu.stanford.nlp.ling.CoreAnnotations.PartOfSpeechAnnotation
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.pipeline.StanfordCoreNLP
import edu.stanford.nlp.trees.Tree
import edu.stanford.nlp.trees.TreeCoreAnnotations.TreeAnnotation

class StanfordTokenizer extends Tokenizer {

  val props = new Properties()
  props("annotators") = "tokenize, ssplit, pos, parse"
  val pipeline = new StanfordCoreNLP(props)

  override def sentTokenize(para: String): List[String] = {
    val doc = new Annotation(para)
    pipeline.annotate(doc)
    doc.get(classOf[SentencesAnnotation])
      .map(coremap => coremap.get(classOf[TextAnnotation]))
      .toList
  }
  
  override def wordTokenize(sentence: String): List[String] = {
    val sent = new Annotation(sentence)
    pipeline.annotate(sent)
    sent.get(classOf[SentencesAnnotation])
      .head
      .get(classOf[TokensAnnotation])
      .map(corelabel => corelabel.get(classOf[TextAnnotation]))
      .toList
  }
  
  override def posTag(sentence: String): List[(String,String)]= {
    val sent = new Annotation(sentence)
    pipeline.annotate(sent)
    sent.get(classOf[SentencesAnnotation])
      .head
      .get(classOf[TokensAnnotation])
      .map(corelabel => {
        val word = corelabel.get(classOf[TextAnnotation])
        val tag = corelabel.get(classOf[PartOfSpeechAnnotation])
        (word, tag)
      })
      .toList
  }
  
  override def phraseChunk(sentence: String): List[(String,String)] = {
    val sent = new Annotation(sentence)
    pipeline.annotate(sent)
    val tree = sent.get(classOf[SentencesAnnotation])
      .head
      .get(classOf[TreeAnnotation])
    val chunks = ArrayBuffer[(String,String)]()
    extractChunks(tree, chunks)
    chunks.toList
  }
  
  // Walk the parse tree depth-first, collecting the lowest-level phrases.
  def extractChunks(tree: Tree, chunks: ArrayBuffer[(String,String)]): Unit = {
    tree.children().foreach(child => {
      val tag = child.value()
      if (child.isPhrasal() && hasOnlyLeaves(child)) {
        // concatenate words into a phrase if none of the children of
        // this phrase are phrases themselves
        val phrase = child.getLeaves[Tree]()
          .flatMap(leaf => leaf.yieldWords())
          .map(word => word.word())
          .mkString(" ")
        chunks += ((phrase, tag))
      } else {
        // dig deeper
        extractChunks(child, chunks)
      }
    })
  }
  
  // True if no child of this node is itself a phrase.
  def hasOnlyLeaves(tree: Tree): Boolean = 
    !tree.children().exists(child => child.isPhrasal())
}

Most of the calls are straightforward. The one exception is the phraseChunk() method, which was originally built as a wrapper around OpenNLP's shallow phrase chunker. The Stanford parser only does a deep parse into a Tree, so my code walks that tree and extracts a list of phrases and their phrase types.
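To make the tree walk easier to follow without pulling in CoreNLP, here is a minimal sketch of the same idea on a toy tree type. The Node, Leaf, PreTerminal and Phrase types and the ChunkSketch object are hypothetical stand-ins invented for this illustration, not CoreNLP classes, and the sample tree is hand-built rather than parser output. The rule is the one described above: emit a phrase only when none of its children is itself a phrase, otherwise recurse.

```scala
// Hypothetical stand-in for a constituency parse tree (NOT CoreNLP's Tree).
sealed trait Node
case class Leaf(word: String) extends Node
case class PreTerminal(tag: String, leaf: Leaf) extends Node // POS tag over one word
case class Phrase(label: String, children: List[Node]) extends Node

object ChunkSketch {
  // A phrase is "flat" when none of its children is itself a phrase.
  def isFlat(p: Phrase): Boolean = !p.children.exists(_.isInstanceOf[Phrase])

  // Collect the words under a node, left to right.
  def words(n: Node): List[String] = n match {
    case Leaf(w)              => List(w)
    case PreTerminal(_, leaf) => List(leaf.word)
    case Phrase(_, cs)        => cs.flatMap(words)
  }

  // Emit (phrase text, phrase label) for every flat phrase; recurse otherwise.
  def extractChunks(n: Node): List[(String, String)] = n match {
    case p @ Phrase(label, cs) =>
      if (isFlat(p)) List((words(p).mkString(" "), label))
      else cs.flatMap(extractChunks)
    case _ => Nil // tokens that are not inside a flat phrase are dropped
  }

  private def pt(tag: String, word: String): Node = PreTerminal(tag, Leaf(word))

  // Hand-built tree for "Pierre Vinken , 61 years old will join the board".
  val sample: Node =
    Phrase("S", List(
      Phrase("NP", List(pt("NNP", "Pierre"), pt("NNP", "Vinken"))),
      pt(",", ","),
      Phrase("ADJP", List(
        Phrase("NP", List(pt("CD", "61"), pt("NNS", "years"))),
        pt("JJ", "old"))),
      Phrase("VP", List(
        pt("MD", "will"),
        Phrase("VP", List(
          pt("VB", "join"),
          Phrase("NP", List(pt("DT", "the"), pt("NN", "board")))))))))
}
```

Calling ChunkSketch.extractChunks(ChunkSketch.sample) returns the three innermost noun phrases. Words such as "old" and "will" do not appear because they sit directly under a phrase that also contains another phrase, which mirrors how the output of phraseChunk() skips them.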

>>> text = """
Pierre Vinken, 61 years old, will join the board as a nonexecutive director 
Nov. 29. Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group
based at Amsterdam. 
Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields 
PLC, was named a nonexecutive director of this British industrial conglomerate.
"""
>>> tokenizer = Tokenizer.getTokenizer("stanford")
>>> sentences = tokenizer.sentTokenize(text)
List(Pierre Vinken, 61 years old, will join the board as a nonexecutive 
  director Nov. 29.,
  Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group
  based at Amsterdam.,
  Rudolph Agnew, 55 years old and former chairman of Consolidated Gold 
  Fields PLC, was named a nonexecutive director of this British industrial 
  conglomerate.)
>>> words = tokenizer.wordTokenize(sentences(0))
List(Pierre, Vinken, ,, 61, years, old, ,, will, join, the, board, as,
  a, nonexecutive, director, Nov., 29, .)
>>> postags = tokenizer.posTag(sentences(0))
List((Pierre,NNP), (Vinken,NNP), (,,,), (61,CD), (years,NNS), (old,JJ),
  (,,,), (will,MD), (join,VB), (the,DT), (board,NN), (as,IN), (a,DT),
  (nonexecutive,JJ), (director,NN), (Nov.,NNP), (29,CD), (.,.))
>>> phrases = tokenizer.phraseTokenize(sentences(0))
List(Pierre Vinken, 61 years, the board, a nonexecutive director, Nov. 29)
>>> chunks = tokenizer.phraseChunk(sentences(0))
List((Pierre Vinken,NP), (61 years,NP), (the board,NP), 
  (a nonexecutive director,NP), (Nov. 29,NP-TMP))
>>>
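The Tokenizer trait and its factory are not shown in this post, so the following is only a sketch of what such a factory could look like. The trait here is cut down to two methods, and NaiveTokenizer is a hypothetical regex-based stand-in used so that the example runs without any NLP library; the real trait has more methods and real implementations behind it.

```scala
// Cut-down version of the trait, for illustration only.
trait Tokenizer {
  def sentTokenize(para: String): List[String]
  def wordTokenize(sentence: String): List[String]
}

// Hypothetical stand-in implementation (not part of the project described here).
class NaiveTokenizer extends Tokenizer {
  // Split after sentence-ending punctuation that is followed by whitespace.
  override def sentTokenize(para: String): List[String] =
    para.trim.split("(?<=[.!?])\\s+").map(_.trim).filter(_.nonEmpty).toList

  // Split on runs of whitespace.
  override def wordTokenize(sentence: String): List[String] =
    sentence.trim.split("\\s+").filter(_.nonEmpty).toList
}

// Companion object acting as the factory: callers pick an implementation by name.
object Tokenizer {
  def getTokenizer(name: String): Tokenizer = name match {
    case "naive" => new NaiveTokenizer()
    // a case such as "stanford" would map to new StanfordTokenizer() here
    case other   => throw new IllegalArgumentException(s"Unknown tokenizer: $other")
  }
}
```

With this in place, Tokenizer.getTokenizer("naive").wordTokenize("the Dutch publishing group") returns the four words as a List, and client code never needs to name the concrete class.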

The Stanford CoreNLP based NER follows a similar approach to the Tokenizer: it is instantiated by calling NameFinder.getNameFinder("stanford"), a factory method on the NameFinder trait. Here is the code.

// Source: src/main/scala/com/mycompany/scalcium/names/StanfordNameFinder.scala
package com.mycompany.scalcium.names

import java.util.Properties

import scala.collection.JavaConversions._

import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.pipeline.StanfordCoreNLP

class StanfordNameFinder extends NameFinder {

  val props = new Properties()
  props("annotators") = "tokenize, ssplit, pos, lemma, ner"
  props("ssplit.isOneSentence") = "true"
  val pipeline = new StanfordCoreNLP(props)

  override def find(sentences: List[String]): List[List[(String,Int,Int)]] = {
    sentences.map(sentence => {
      val sent = new Annotation(sentence)
      pipeline.annotate(sent)
      sent.get(classOf[SentencesAnnotation])
        .head
        .get(classOf[TokensAnnotation])
        .map(corelabel => (corelabel.ner(), corelabel.beginPosition(), 
          corelabel.endPosition()))
        .filter(triple => (! "O".equals(triple._1)))
        // note: grouping by tag merges all tokens that share a tag into a
        // single span, even if they belong to separate mentions
        .groupBy(triple => triple._1)
        .map(kv => {
          val key = kv._1
          val list = kv._2
          val begin = list.map(x => x._2).min
          val end = list.map(x => x._3).max
          (key, begin, end)
        })
        .toList
    })
    .toList
  }
}

The previous version of my Stanford based NER used the Stanford NER library and the 4-class classifier model directly. This was definitely an improvement over the OpenNLP NameFinder, as described here (scroll down to the end). The code above creates an NER that can recognize 7 classes and uses code very similar to the Tokenizer (although arguably I could have created a 7-class NER by using the appropriate classifier model). Here is some output from the NER.

>>> namefinder = NameFinder.getNameFinder("stanford")
>>> entities = namefinder.find(sentences)
List(List((PERSON,0,13), (DURATION,15,27), (DATE,76,83)),
  List((PERSON,4,10), (MISC,45,50), (LOCATION,77,86), (ORGANIZATION,26,39)),
  List((PERSON,0,13), (MISC,111,118), (DURATION,16,28), (ORGANIZATION,52,80)))
>>> prettyPrint(sentences, entities)
Pierre Vinken, 61 years old, will join the board as a nonexecutive director 
Nov. 29.
  (0,13): Pierre Vinken / PERSON
  (15,27): 61 years old / DURATION
  (76,83): Nov. 29 / DATE
Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group based 
at Amsterdam.
  (4,10): Vinken / PERSON
  (45,50): Dutch / MISC
  (77,86): Amsterdam / LOCATION
  (26,39): Elsevier N.V. / ORGANIZATION
Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields 
PLC, was named a director of this British industrial conglomerate.
  (0,13): Rudolph Agnew / PERSON
  (111,118): British / MISC
  (16,28): 55 years old / DURATION
  (52,80): Consolidated Gold Fields PLC / ORGANIZATION
>>>
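One property of the find() method above is worth spelling out: because it groups tokens by entity tag, a sentence with two separate PERSON mentions would yield a single span running from the start of the first to the end of the second. The example sentences happen to contain at most one entity per tag, so the output is unaffected. A possible alternative, sketched below under the assumption that tokens arrive in document order as (tag, begin, end) triples, merges only runs of adjacent tokens that carry the same tag. The EntitySpans object and its merge function are hypothetical names, and the token offsets in the usage note are made up by hand.

```scala
object EntitySpans {
  type Span = (String, Int, Int) // (entity tag, begin offset, end offset)

  // Merge runs of adjacent tokens with the same non-"O" tag into one span.
  // Tokens tagged "O" stay in the accumulator while folding so that they
  // separate mentions, and are filtered out at the end.
  def merge(tokens: List[Span]): List[Span] = {
    val folded = tokens.foldLeft(List.empty[Span]) {
      case ((tag, begin, _) :: rest, (t, _, end)) if tag == t && tag != "O" =>
        (tag, begin, end) :: rest // extend the current span
      case (acc, token) =>
        token :: acc              // start a new span
    }
    folded.reverse.filter(_._1 != "O")
  }
}
```

For the hand-tagged tokens of "Pierre Vinken met Rudolph Agnew.", merge returns two PERSON spans, (PERSON,0,13) and (PERSON,18,31), listed in document order, whereas grouping by tag would return one span covering (0,31). Two same-tag mentions with no "O" token between them would still be merged, which is a limitation of working from plain per-token tags.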

When I last looked at the Stanford parser, I found the API somewhat hard to understand. The CoreNLP API is much simpler and seems more unified, possibly at the cost of some compile-time type checking.

Overall, I was quite impressed by Stanford CoreNLP's accuracy. Performance-wise, however, Stanford CoreNLP seems to be consistently slower than either OpenNLP or LingPipe, although not by much (on my limited set of examples). In all fairness, though, Stanford CoreNLP is designed to work in batch mode, where you run the pipeline once over the text and then walk through the annotations it generates, as determined by the Properties object passed in to the constructor.