Saturday, September 20, 2014

Coreference Resolution with Stanford CoreNLP and LingPipe


Up until recently, I had no use for Entity Recognition apart from the entities in our medical ontology. As we move away from published literature and into the realm of patient records, recognizing names, locations, etc. is becoming a necessity. And once you get into Entity Recognition, the need for Coreference Resolution can't be very far behind. Accordingly, I decided to investigate what kind of support was available for Coreference Resolution among the three NLP toolkits I am reasonably familiar with - Apache OpenNLP, LingPipe and Stanford CoreNLP.

Having recently experimented with Stanford CoreNLP, I was aware that it supported Coreference Resolution. I was also aware that OpenNLP offered limited support, based on this blog post by D P Dearing. I didn't know about LingPipe, but this post on the LingPipe blog also indicated some sort of support. So I decided to investigate and build implementations using each of the three toolkits - this post describes the results of that effort.

Coreference Resolution initially seemed (to me) to be something of a black art involving linguistics and regular expression hackery, but this quote from the LingPipe blog post (referenced above) gave me a rough mental model of how one can go about doing it. Hopefully it helps you too.

LingPipe's heuristic coreference package is based on Breck's thesis work (called CogNIAC). If you troll through the code or read Breck's paper, you'll see that it's essentially a greedy online algorithm that visits the entity mentions in a document in order, and for each mention either links it to a previous linked chain of mentions, or it starts a new chain consisting only of the current mention. Essentially, it's a greedy online clustering algorithm.

The resolution of a mention is guided by matchers and anti-matchers that score candidate antecedent mention chains based on properties such as the closest matching alias (using a complex comparison allowing for missing tokens), known alias associations, discourse proximity (how far away the last mention in a chain is and how many are intervening), entity type (person, location, and organization, and for persons, gender).
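To make that mental model concrete, here is a minimal sketch of the greedy online chaining the quote describes. It is not LingPipe's code: the `Mention` class, the entity type labels and the toy `score` function (shared-token count, with an entity type mismatch acting as an anti-matcher) are all assumptions made up for illustration.

```scala
object GreedyCorefSketch {
  // A mention as found in the text; entityType is a coarse label such as
  // "PERSON" or "LOCATION" (hypothetical labels for this sketch).
  case class Mention(text: String, entityType: String)

  private def tokens(s: String): Set[String] =
    s.toLowerCase.split("\\s+").toSet

  // Toy scorer: 0 if the entity types disagree (anti-matcher), otherwise the
  // largest number of tokens shared with any mention already in the chain.
  def score(mention: Mention, chain: Vector[Mention]): Int =
    if (chain.exists(_.entityType != mention.entityType)) 0
    else chain.map(m => (tokens(m.text) & tokens(mention.text)).size).max

  // Visit mentions in document order; link each one to the best-scoring
  // existing chain, or start a new chain when nothing scores above zero.
  def cluster(mentions: Seq[Mention]): Vector[Vector[Mention]] =
    mentions.foldLeft(Vector.empty[Vector[Mention]]) { (chains, mention) =>
      val scored = chains.zipWithIndex.map {
        case (chain, i) => (score(mention, chain), i)
      }
      val best = if (scored.isEmpty) None else Some(scored.maxBy(_._1))
      best match {
        case Some((s, i)) if s > 0 => chains.updated(i, chains(i) :+ mention)
        case _ => chains :+ Vector(mention)
      }
    }
}
```

Because each decision is made once and never revisited, an early wrong link stays wrong, which is the price of the greedy approach.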

Anyway, on to my implementation. Like the Tokenizer and NameFinder, I built this CorefResolver trait that all my implementations must extend, and a factory using which I could get one of its implementations by name. I also implemented a case class to hold a Coreference mention (the string itself, and its start and end character offsets in the input text).

// Source: src/main/scala/com/mycompany/scalcium/coref/CorefResolver.scala
package com.mycompany.scalcium.coref

object CorefResolver {
  def getResolver(name: String): CorefResolver = {
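    name.toLowerCase() match {
      case "stanford" => new StanfordCorefResolver()
      case "lingpipe" => new LingPipeCorefResolver()
      case "opennlp" => new OpenNLPCorefResolver()
      // fail with a clear message instead of a MatchError
      case other => throw new IllegalArgumentException(
        "Unknown coreference resolver: " + other)
    }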
  }
}

trait CorefResolver {
  def resolve(text: String): List[(CorefTriple,List[CorefTriple])]
}

case class CorefTriple(text: String, begin: Int, end: Int)
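As an illustration of how the return type of resolve() might be consumed, the sketch below replaces every coreferent mention with the text of its representative mention. The `substitute` helper is hypothetical (it is not part of the project), it redeclares CorefTriple so that it stands alone, and it assumes that offsets are relative to the input text and that mentions do not overlap.

```scala
object CorefSubstitutionSketch {
  // Same shape as the case class above, redeclared to keep the sketch standalone.
  case class CorefTriple(text: String, begin: Int, end: Int)

  // Replace each coreferent mention with its representative mention's text.
  def substitute(text: String,
      resolved: List[(CorefTriple, List[CorefTriple])]): String = {
    // pair every coreferent span with the replacement text
    val replacements = for {
      (ref, corefs) <- resolved
      coref <- corefs
    } yield (coref.begin, coref.end, ref.text)
    // apply from right to left so that earlier offsets remain valid
    replacements.sortBy(r => -r._1).foldLeft(text) {
      case (acc, (begin, end, repl)) =>
        acc.substring(0, begin) + repl + acc.substring(end)
    }
  }
}
```

Working from the rightmost span backwards avoids having to recompute offsets after each replacement changes the length of the string.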

The Stanford CoreNLP toolkit has the best API support for Coreference Resolution among the three (in my opinion). The CoreNLP API implements a pipeline of named processes, and getting the coreferences is simply a matter of reading the appropriate annotations the pipeline has placed on the text. This StackOverflow discussion provided me with most of the pointers for my implementation.

// Source: src/main/scala/com/mycompany/scalcium/coref/StanfordCorefResolver.scala
package com.mycompany.scalcium.coref

import java.util.Properties

import scala.collection.JavaConversions._
import scala.collection.mutable.ArrayBuffer

import edu.stanford.nlp.pipeline.StanfordCoreNLP
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.dcoref.CorefCoreAnnotations.CorefChainAnnotation
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation
import edu.stanford.nlp.ling.CoreAnnotations.TokensAnnotation
import edu.stanford.nlp.util.IntPair
import edu.stanford.nlp.dcoref.CorefChain.CorefMention
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation

class StanfordCorefResolver extends CorefResolver {

  val props = new Properties()
  props("annotators") = "tokenize, ssplit, pos, lemma, ner, parse, dcoref"
  val pipeline = new StanfordCoreNLP(props)
  
  override def resolve(text: String): List[(CorefTriple,List[CorefTriple])] = {
    val doc = new Annotation(text)
    pipeline.annotate(doc)
    val sentences = doc.get(classOf[SentencesAnnotation])
      .map(coremap => coremap.get(classOf[TextAnnotation]))
      .toList
    val sentenceOffsets = buildSentenceOffsets(sentences)
    val graph = doc.get(classOf[CorefChainAnnotation])
    graph.values.map(chain => {
      val mention = chain.getRepresentativeMention()
      val ref = toTuple(mention, doc, sentenceOffsets)
      val comentions = chain.getMentionsInTextualOrder()
      val corefs = comentions.map(coref => toTuple(coref, doc, sentenceOffsets))
                             .filter(triple => ! ref.text.equals(triple.text))
                             .toList
      (ref, corefs)
    })
    .filter(tuple => tuple._2.size > 0)
    .toList
  }

  def max(a: Int, b: Int) = if (a > b) a else b
  def min(a: Int, b: Int) = if (a < b) a else b

  def toTuple(coref: CorefMention, doc: Annotation, soffsets: Map[Int,Int]): 
   CorefTriple = {
    val sbegin = soffsets(coref.sentNum - 1)
    val mtriple = doc.get(classOf[SentencesAnnotation])
      // get sentence being analyzed
      .get(coref.sentNum - 1)
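      // get all tokens in sentence with character offsets
      // NOTE: beginPosition() and endPosition() already appear to be
      // document-level offsets, so adding sbegin shifts the reported offsets
      // for mentions in sentences after the first (see the output below)
      .get(classOf[TokensAnnotation])
      .map(token => ((token.originalText().toString(), 
          sbegin + token.beginPosition(), sbegin + token.endPosition())))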
      // sublist the coreference part
      .subList(coref.startIndex - 1, coref.endIndex - 1)
      // join adjacent tokens into a single mention triple
      .foldLeft(("", Int.MaxValue, 0))((a, b) => 
        (List(a._1, b._1).mkString(" "), min(a._2, b._2), max(a._3, b._3)))
    CorefTriple(mtriple._1.trim(), mtriple._2, mtriple._3)
  }
  
  def buildSentenceOffsets(sentences: List[String]): Map[Int,Int] = {
    val slengths = sentences
      .zipWithIndex
      .map(si => (si._2, si._1.length()))
      .sortWith(_._1 > _._1)
    val soffsets = ArrayBuffer[(Int,Int)]()
    for (sindex <- slengths.map(_._1)) {
      val offset = if (sindex == 0) ((sindex, 0))
      else {
        val rest = slengths.drop(slengths.size - sindex)
        val offset = rest.map(_._2).foldLeft(0)(_ + _)
        ((sindex, offset))
      }
      soffsets += offset
    }
    soffsets.toMap[Int,Int]
  }
}
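The buildSentenceOffsets method above computes, for each sentence index, the summed length of all preceding sentences. A more compact way to get the same map is a running total, sketched below with scanLeft. This is only an alternative formulation, wrapped in a hypothetical object so it stands alone; like the original, it counts sentence text only and ignores any whitespace between sentences.

```scala
object SentenceOffsetsSketch {
  // Maps each sentence index to the summed length of the sentences before it.
  def buildSentenceOffsets(sentences: List[String]): Map[Int, Int] =
    sentences
      .map(_.length)
      .scanLeft(0)(_ + _) // running totals: 0, len0, len0 + len1, ...
      .init               // drop the grand total, which is not an offset
      .zipWithIndex
      .map { case (offset, index) => (index, offset) }
      .toMap
}
```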

The JUnit test runs through seven input sentence groups (some of them single sentences) and prints the coreferences in a human-readable format. Here is the code to test the Stanford CoreNLP based CorefResolver; the others are similar.

// Source: src/test/scala/com/mycompany/scalcium/coref/CorefResolverTest.scala
package com.mycompany.scalcium.coref

import org.junit.Test
import scala.io.Source
import java.io.File

class CorefResolverTest {

  // long inputs are concatenated so each one stays a single-line string
  val texts = List(
    "The atom is a basic unit of matter, it consists of a dense central " +
      "nucleus surrounded by a cloud of negatively charged electrons.",
    "The Revolutionary War occurred during the 1700s and it was the first " +
      "war in the United States.",
    "Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.",
    "The project leader is refusing to help. The jerk thinks only of himself.",
    "A final resting place for another legend, Anna Pavlova, the Russian " +
      "ballerina who spent her final years in London, may be less than " +
      "secure. For 65 years, Ms. Pavlova's ashes have been in a white urn " +
      "at Golder's Green cemetery, where they are likely to remain " +
      "according to a director of the crematorium.",
    "Another icon of the '60s, Che Guevara, has been turned into a " +
      "capitalist tool 28 years after he was gunned down in Bolivia.",
    "I am Sam. Sam I am. I like green eggs and ham."
  )
      
  @Test
  def testStanfordCorefResolver(): Unit = {
    val scr = CorefResolver.getResolver("stanford")
    texts.foreach(text => {
      val x = scr.resolve(text)
      prettyPrint(text, x)
    })
  }
  
  def prettyPrint(text: String,
      result: List[(CorefTriple,List[CorefTriple])]): Unit = {
    Console.println(text)
    result.foreach(refcorefs => {
      val ref = refcorefs._1
      val corefs = refcorefs._2
      Console.println("(%d,%d): %s".format(ref.begin, ref.end, ref.text))
      corefs.foreach(coref => 
        Console.println("  (%d,%d): %s".format(coref.begin, coref.end, 
          coref.text)))
    })
    Console.println()
  }
}

The results from this implementation are quite good. I used a set of seven sentence groups (most of them single sentences) to test them out, and it identified coreferences that seemed valid, although it didn't get all of them. The downside is the tremendous amount of system resources it consumes, but that may not be a huge factor given adequate ability to parallelize operations. I have manually highlighted the mentions that it found in the input sentence group.

The atom is a basic unit of matter, it consists of a dense central nucleus 
surrounded by a cloud of negatively charged electrons.
(12,34): a basic unit of matter
  (0,8): The atom
  (36,38): it

The Revolutionary War occurred during the 1700s and it was the first war in 
the United States.
(0,21): The Revolutionary War
  (52,54): it
  (59,93): the first war in the United States

Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
(26,39): Elsevier N.V.
  (41,67): the Dutch publishing group
(0,10): Mr. Vinken
  (14,67): chairman of Elsevier N.V. , the Dutch publishing group

The project leader is refusing to help. The jerk thinks only of himself.
(79,87): The jerk
  (103,110): himself

A final resting place for another legend, Anna Pavlova, the Russian ballerina 
who spent her final years in London, may be less than secure. For 65 years, 
Ms. Pavlova's ashes have been in a white urn at Golder's Green cemetery, 
where they are likely to remain according to a director of the crematorium.
(293,306): Ms. Pavlova 's
  (42,54): Anna Pavlova
  (88,91): her

Another icon of the '60s, Che Guevara, has been turned into a capitalist tool 
28 years after he was gunned down in Bolivia.
(26,37): Che Guevara
  (93,95): he

I am Sam. Sam I am. I like green eggs and ham.
(19,24): Sam I
  (0,1): I
  (5,8): Sam
  (23,24): I
  (38,39): I

I wasn't able to make my OpenNLP implementation work, although I probably didn't try hard enough. While writing the code (heavily influenced by D P Dearing's blog post), I realized that support for Coreference Resolution in OpenNLP was a bit half-baked, and even if I got it working, it probably wouldn't be too useful for me.

While I was able to make the LingPipe implementation work using the CorefDemo.java in the source distribution as a guide, it seems to need larger chunks of input text to be able to find coreferences in them, and it misses quite a few. Here is the code:

// Source: src/main/scala/com/mycompany/scalcium/coref/LingPipeCorefResolver.scala
package com.mycompany.scalcium.coref

import java.io.File
import java.io.FileInputStream
import java.io.ObjectInputStream

import scala.collection.JavaConversions._
import scala.util.matching.Regex

import com.aliasi.chunk.Chunk
import com.aliasi.chunk.ChunkFactory
import com.aliasi.chunk.Chunker
import com.aliasi.coref.EnglishMentionFactory
import com.aliasi.coref.WithinDocCoref
import com.aliasi.sentences.MedlineSentenceModel
import com.aliasi.sentences.SentenceChunker
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory
import com.aliasi.util.Streams
import com.aliasi.coref.Tags

class LingPipeCorefResolver extends CorefResolver {

  val ModelDir = "src/main/resources/lingpipe/models"
  val ChunkerModelFile = "ne-en-news-muc6.AbstractCharLmRescoringChunker"
  val MalePronouns = "(?i)\\b(he|him|his)\\b".r
  val FemalePronouns = "(?i)\\b(she|her|hers)\\b".r
  val NeuterPronouns = "(?i)\\b(it)\\b".r
    
  val tokenizerFactory = new IndoEuropeanTokenizerFactory()
  val sentenceModel = new MedlineSentenceModel()
  val sentenceChunker = new SentenceChunker(tokenizerFactory, sentenceModel)
  val entityChunker = readObject(new File(ModelDir, ChunkerModelFile))
    .asInstanceOf[Chunker]
    
  override def resolve(text: String): List[(CorefTriple,List[CorefTriple])] = {
    val mentionFactory = new EnglishMentionFactory()
    val coref = new WithinDocCoref(mentionFactory)
    val sentenceChunking = sentenceChunker.chunk(text.toCharArray, 0, text.length)
    val mentions = sentenceChunking.chunkSet()
      .zipWithIndex
      .map(chunkIndexPair => {
        val schunk = chunkIndexPair._1
        val sentence = text.substring(schunk.start, schunk.end)  
        // find entities in sentence
        val mentionChunking = entityChunker.chunk(sentence)
        val mentions = mentionChunking.chunkSet().toSet
        // add different types of pronoun entities
        val malePronouns = buildPronounMentions(
          MalePronouns, sentence, "MALE_PRONOUN", mentions)
        val femalePronouns = buildPronounMentions(
          FemalePronouns, sentence, "FEMALE_PRONOUN", mentions)
        val neuterPronouns = buildPronounMentions(
          NeuterPronouns, sentence, "NEUTER_PRONOUN", mentions)
        val allMentions = 
          mentions ++ malePronouns ++ femalePronouns ++ neuterPronouns
        // resolve coreferences
        allMentions.map(chunk => {
          val chstart = chunk.start
          val chend = chunk.end
          val chtext = sentence.substring(chstart, chend)
          val chtype = chunk.`type`
          val mention = mentionFactory.create(chtext, chtype)
          val mentionId = coref.resolveMention(mention, chunkIndexPair._2)
          (mentionId, (schunk.start + chstart, schunk.start + chend, chtext))
        })
      })
      .flatten
      .groupBy(pair => pair._1) // {mentionId => Set((mentionId, (chunk))
      .filter(kv => kv._2.size > 1) // filter out single mentions
      .map(kv => kv._2.map(x => CorefTriple(x._2._3, x._2._1, x._2._2)).toList)
      .toList                   // List[List[CorefTriple]]
    mentions.map(mention => {
      val head = mention.head
      val rest = mention.tail
      (head, rest)
    })
  }
  
  def readObject(f: File): Object = {
    val oistream = new ObjectInputStream(new FileInputStream(f))
    val obj = oistream.readObject
    Streams.closeQuietly(oistream)
    obj
  }
  
  def buildPronounMentions(regex: Regex, sentence: String, tag: String, 
      mentions: Set[Chunk]): Set[Chunk] =
    regex.findAllMatchIn(sentence)
      .map(m => ChunkFactory.createChunk(m.start, m.end, tag))
      .filter(pronoun => ! overlaps(mentions, pronoun))
      .toSet
  
  def overlaps(mentions: Set[Chunk], pronoun: Chunk): Boolean = {
    val pstart = pronoun.start
    val pend = pronoun.end
    // two spans overlap when the larger start precedes the smaller end
    mentions.exists(mention => {
      val maxStart = if (mention.start < pstart) pstart else mention.start
      val minEnd = if (mention.end < pend) mention.end else pend
      maxStart < minEnd
    })
  }
}
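The pronoun handling above boils down to two steps: find pronoun spans with a regular expression, then discard any that collide with a span the entity chunker already claimed. The sketch below isolates that logic without the LingPipe classes. The `Span` class and `pronounSpans` helper are hypothetical stand-ins for Chunk and buildPronounMentions, and spans are assumed to be half-open (end exclusive), consistent with the substring calls above.

```scala
object SpanOverlapSketch {
  // Half-open character span [start, end); a stand-in for a LingPipe Chunk.
  case class Span(start: Int, end: Int)

  // Two spans overlap when the larger start precedes the smaller end.
  def overlaps(a: Span, b: Span): Boolean =
    math.max(a.start, b.start) < math.min(a.end, b.end)

  // Pronoun spans found by regex that do not collide with known entity spans.
  def pronounSpans(sentence: String, entities: Set[Span]): List[Span] = {
    val pronouns = "(?i)\\b(she|her|hers)\\b".r
    pronouns.findAllMatchIn(sentence)
      .map(m => Span(m.start, m.end))
      .filter(p => !entities.exists(e => overlaps(e, p)))
      .toList
  }
}
```

The overlap filter matters for cases like "Her Majesty", where the chunker may already have tagged the capitalized pronoun as part of a named entity.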

In my test set, the only input for which it caught any coreferences was the Anna Pavlova sentence group, and even there it reported fewer coreferences than the Stanford CoreNLP implementation. It is, however, orders of magnitude faster.

The atom is a basic unit of matter, it consists of a dense central nucleus 
surrounded by a cloud of negatively charged electrons.

The Revolutionary War occurred during the 1700s and it was the first war in 
the United States.

Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.

The project leader is refusing to help. The jerk thinks only of himself.

A final resting place for another legend, Anna Pavlova, the Russian ballerina 
who spent her final years in London, may be less than secure. For 65 years, 
Ms. Pavlova's ashes have been in a white urn at Golder's Green cemetery, 
where they are likely to remain according to a director of the crematorium.
(42,54): Anna Pavlova
  (88,91): her

Another icon of the '60s, Che Guevara, has been turned into a capitalist tool 
28 years after he was gunned down in Bolivia.

I am Sam. Sam I am. I like green eggs and ham.

That's all I have for today. I hope you found it useful. Overall, my initial reaction is that neither implementation (Stanford or LingPipe) would work out too well for me. But if the feature were really required, I would probably pay the performance price and go with Stanford CoreNLP.
