Friday, April 08, 2011

A UIMA Sentence Annotator using OpenNLP

Recently, a colleague pointed out that our sentence splitting code (written by me using Java's BreakIterator) was rather naive. More specifically, it was incorrectly breaking the text on abbreviation dots within a sentence. I had not seen this behavior before, and I was under the impression that BreakIterator's rule-based FSA specifically handled these cases, so I decided to investigate.

I've also been planning to write a UIMA sentence annotator as part of a larger application, so I figured this exercise would also help me choose the best approach to use in the annotator, making it a twofer.

In this post, I present the results of my investigation, along with the code and descriptors for my UIMA Sentence Annotator. As you can see from the title, I ended up choosing OpenNLP. Read on to find out why.

Sentence Boundary Detector Comparison

For test data, I used the sentence list from my JTMT test case, and augmented it with example sentences from the MorphAdorner Sentence Splitter Heuristics page, the LingPipe Sentence Detection Tutorial Page and the OpenNLP Sentence Detector Page.

The BreakIterator code is quite simple; it's really just the standard usage described in the Javadocs. It is shown below:

  @Test
  public void testSentenceBoundaryDetectWithBreakIterators() throws Exception {
    BreakIterator bi = BreakIterator.getSentenceInstance();
    bi.setText(TEST_STRING);
    int pos = 0;
    while (bi.next() != BreakIterator.DONE) {
      String sentence = TEST_STRING.substring(pos, bi.current());
      System.out.println("sentence: " + sentence);
      pos = bi.current();
    }
  }

Running this reveals at least one class of pattern which the BreakIterator wrongly detects as a sentence boundary: a punctuation character immediately followed by a capitalized word, as in this input:

Mrs. Smith was here earlier. At 5 p.m. I had to go to the bank.

which gets incorrectly split into:

sentence: Mrs.
sentence: Smith was here earlier.
sentence: At 5 p.m.
sentence: I had to go to the bank.
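For completeness, the test above can be packaged as a standalone program that reproduces this behavior (TEST_STRING is replaced by the problem sentence here; the exact splits may vary with the JDK's locale data):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Standalone reproduction of the BreakIterator sentence splitting behavior.
public class BreakIteratorDemo {

  // Split text into sentences using the JDK's rule-based sentence iterator.
  public static List<String> split(String text) {
    List<String> sentences = new ArrayList<String>();
    BreakIterator bi = BreakIterator.getSentenceInstance(Locale.US);
    bi.setText(text);
    int pos = 0;
    while (bi.next() != BreakIterator.DONE) {
      sentences.add(text.substring(pos, bi.current()));
      pos = bi.current();
    }
    return sentences;
  }

  public static void main(String[] args) {
    String text = "Mrs. Smith was here earlier. At 5 p.m. I had to go to the bank.";
    for (String sentence : split(text)) {
      System.out.println("sentence: " + sentence);
    }
  }
}
```

Note that the boundaries returned are always contiguous, so concatenating the sentences reproduces the input exactly; only the break positions are (sometimes wrongly) chosen.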

I then ran the test set through LingPipe and OpenNLP. Both of these sentence boundary detectors are model based (i.e., you train the detector with a list of sentences from your corpus), but both supply pre-built models, so I just used those. Here is the sentence detection code for LingPipe and OpenNLP.

  @Test
  public void testSentenceBoundaryDetectWithLingpipe() throws Exception {
    TokenizerFactory tokenizerFactory = IndoEuropeanTokenizerFactory.FACTORY;
    com.aliasi.sentences.SentenceModel sentenceModel = 
      new MedlineSentenceModel();
    List<String> tokens = new ArrayList<String>();
    List<String> whitespace = new ArrayList<String>();
    char[] ch = TEST_STRING.toCharArray();
    Tokenizer tokenizer = tokenizerFactory.tokenizer(ch, 0, ch.length);
    tokenizer.tokenize(tokens, whitespace);
    int[] sentenceBoundaries = sentenceModel.boundaryIndices(
      tokens.toArray(new String[tokens.size()]), 
      whitespace.toArray(new String[whitespace.size()]));
    if (sentenceBoundaries.length > 0) {
      int tokStart = 0;
      int tokEnd = 0;
      int charStart = 0;
      int charLen = 0;
      for (int i = 0; i < sentenceBoundaries.length; ++i) {
        tokEnd = sentenceBoundaries[i];
        for (int j = tokStart; j <= tokEnd; j++) {
          charLen += tokens.get(j).length() + 
            whitespace.get(j + 1).length();
        }
        String currentSentence = 
          TEST_STRING.substring(charStart, charStart + charLen); 
        System.out.println("sentence: " + currentSentence);
        // advance the token and character cursors past this sentence
        tokStart = tokEnd + 1;
        charStart += charLen;
        charLen = 0;
      }
    }
  }
  
  @Test
  public void testSentenceBoundaryDetectWithOpenNlp() throws Exception {
    InputStream data = new FileInputStream(".../en_sent.bin");
    SentenceModel model = new SentenceModel(data);
    SentenceDetectorME sentenceDetector = new SentenceDetectorME(model);
    String[] sentences = sentenceDetector.sentDetect(TEST_STRING);
    for (int i = 0; i < sentences.length; i++) {
      System.out.println("sentence: " + sentences[i]);
    }
    data.close();
  }

LingPipe had the same problem as BreakIterator with the input data. OpenNLP parsed everything correctly, except for text inside embedded HTML tags in the input sentences. So a sentence such as:

I have a <a href="http://www.funny.com/funnyurl">funny url</a> to share.

gets (rather bizarrely) tokenized to:

sentence: I have a <a href="http://www.funny.com/funnyurl">funny
sentence:  url</a> to share.

Performance-wise, LingPipe came in fastest (6 ms on my input data), followed by OpenNLP (8 ms) and BreakIterator (9 ms). However, LingPipe's commercial license is quite expensive for the limited use I would make of it, so I went with OpenNLP. The failing test case described above is not really a concern, since by the time the input text reaches the sentence splitter, it will have been converted to plain text.

UIMA Sentence Annotator

My UIMA Sentence Annotator expects its input CAS to have annotations identifying text blocks in the document text (HTML or plain text), set by an upstream annotator. I don't describe the text annotator here because it's a bit fluid at the moment; maybe I will describe it in a future post.

The XML descriptor for the Sentence Annotation Type is shown below:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Source: src/main/java/com/mycompany/myapp/uima/annotators/sentence/Sentence.xml -->
<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <name>Sentence</name>
  <description>Annotates text blocks into sentences.</description>
  <version>1.0</version>
  <vendor/>
  <types>
    <typeDescription>
      <name>com.mycompany.myapp.uima.annotators.sentence.SentenceAnnotation</name>
      <description/>
      <supertypeName>uima.tcas.Annotation</supertypeName>
    </typeDescription>
  </types>
</typeSystemDescription>

The Sentence Annotator loops through each of the pre-annotated text blocks and annotates sentence boundaries within each block. The sentence annotation start and end indexes must be relative to the document, so the spans returned by the detector (which are relative to the text block) are offset by the start index of the containing Text annotation.
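To make the offset arithmetic concrete, here is a small self-contained illustration (the document string and offsets are made up for the example):

```java
// Illustration of the offset arithmetic: a sentence span detected within
// a text block is block-relative, so it must be shifted by the block's
// begin offset to yield document-relative indices.
public class OffsetDemo {

  // Map a block-relative span to document-relative begin/end offsets.
  public static int[] toDocOffsets(int blockBegin, int spanStart, int spanEnd) {
    return new int[] { blockBegin + spanStart, blockBegin + spanEnd };
  }

  public static void main(String[] args) {
    String document = "<p>First sentence. Second sentence.</p>";
    int blockBegin = 3; // the text block covers "First sentence. Second sentence."
    String blockText = document.substring(blockBegin, blockBegin + 32);
    // suppose the sentence detector returns span [16, 32) within blockText
    int[] docSpan = toDocOffsets(blockBegin, 16, 32);
    System.out.println(blockText.substring(16, 32));           // block-relative
    System.out.println(document.substring(docSpan[0], docSpan[1])); // document-relative
    // both print: Second sentence.
  }
}
```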

There is also a reference to AnnotatorUtils.whiteout(String), which replaces spans of text like "<...>" with whitespace. This preserves the offsets for index computations while getting rid of the issues with incorrect handling of embedded XML/HTML tags in the text. Here is the code:

// Source: src/main/java/com/mycompany/myapp/uima/annotators/sentence/SentenceAnnotator.java
package com.mycompany.myapp.uima.annotators.sentence;

import java.io.InputStream;
import java.util.Iterator;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.util.Span;

import org.apache.uima.UimaContext;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.FSIndex;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceInitializationException;

import com.mycompany.myapp.uima.annotators.text.TextAnnotation;
import com.mycompany.myapp.utils.AnnotatorUtils;

public class SentenceAnnotator extends JCasAnnotator_ImplBase {

  private SentenceDetectorME sentenceDetector;
  
  @Override
  public void initialize(UimaContext ctx) 
      throws ResourceInitializationException {
    super.initialize(ctx);
    try {
      InputStream stream = getContext().getResourceAsStream("SentenceModel");
      SentenceModel model = new SentenceModel(stream);
      sentenceDetector = new SentenceDetectorME(model);
    } catch (Exception e) {
      throw new ResourceInitializationException(e);
    }
  }
  
  @Override
  public void process(JCas jcas) throws AnalysisEngineProcessException {
    FSIndex index = jcas.getAnnotationIndex(TextAnnotation.type);
    for (Iterator<TextAnnotation> it = index.iterator(); it.hasNext(); ) {
      TextAnnotation inputAnnotation = it.next();
      int start = inputAnnotation.getBegin();
      String text = AnnotatorUtils.whiteout(
        inputAnnotation.getCoveredText());
      Span[] spans = sentenceDetector.sentPosDetect(text);
      for (int i = 0; i < spans.length; i++) {
        SentenceAnnotation annotation = new SentenceAnnotation(jcas);
        annotation.setBegin(start + spans[i].getStart());
        annotation.setEnd(start + spans[i].getEnd());
        annotation.addToIndexes(jcas);
      }
    }
  }
}
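The whiteout() method itself is not shown above; a minimal sketch of what such a method might look like follows (this is just my illustration of the idea, not the production implementation, which may handle more cases such as stray angle brackets):

```java
// Sketch of an AnnotatorUtils.whiteout()-style method: blanks out every
// character inside <...> spans (including the angle brackets) with spaces,
// so the string length, and therefore all annotation offsets, is preserved.
public class AnnotatorUtilsSketch {

  public static String whiteout(String text) {
    char[] chars = text.toCharArray();
    boolean inTag = false;
    for (int i = 0; i < chars.length; i++) {
      char c = chars[i];
      if (c == '<') inTag = true;
      if (inTag) chars[i] = ' ';
      if (c == '>') inTag = false;
    }
    return new String(chars);
  }

  public static void main(String[] args) {
    String input = "I have a <a href=\"http://www.funny.com/funnyurl\">funny url</a> to share.";
    String output = whiteout(input);
    System.out.println(output);
    System.out.println("lengths equal: " + (input.length() == output.length()));
  }
}
```

Because only the tag characters are blanked, the sentence detector sees plain text padded with spaces, and the span indices it returns still line up with the original markup.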

And finally, the XML descriptor for the annotator.

<?xml version="1.0" encoding="UTF-8"?>
<!-- Source: src/main/java/com/mycompany/myapp/uima/annotators/sentence/SentenceAE.xml -->
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>true</primitive>
  <annotatorImplementationName>com.mycompany.myapp.uima.annotators.sentence.SentenceAnnotator</annotatorImplementationName>
  <analysisEngineMetaData>
    <name>SentenceAE</name>
    <description>Annotates Sentences.</description>
    <version>1.0</version>
    <vendor/>
    <configurationParameters/>
    <configurationParameterSettings/>
    <typeSystemDescription>
      <types>
        <typeDescription>
          <name>com.mycompany.myapp.uima.annotators.text.TextAnnotation</name>
          <description/>
          <supertypeName>uima.tcas.Annotation</supertypeName>
        </typeDescription>
        <typeDescription>
          <name>com.mycompany.myapp.uima.annotators.sentence.SentenceAnnotation</name>
          <description/>
          <supertypeName>uima.tcas.Annotation</supertypeName>
        </typeDescription>
      </types>
    </typeSystemDescription>
    <typePriorities/>
    <fsIndexCollection/>
    <capabilities>
      <capability>
        <inputs>
          <type allAnnotatorFeatures="true">com.mycompany.myapp.uima.annotators.text.TextAnnotation</type>
          <feature>com.mycompany.myapp.uima.annotators.text.TextAnnotation:tagname</feature>
        </inputs>
        <outputs>
          <type allAnnotatorFeatures="true">com.mycompany.myapp.uima.annotators.sentence.SentenceAnnotation</type>
        </outputs>
        <languagesSupported/>
      </capability>
    </capabilities>
    <operationalProperties>
      <modifiesCas>true</modifiesCas>
      <multipleDeploymentAllowed>true</multipleDeploymentAllowed>
      <outputsNewCASes>false</outputsNewCASes>
    </operationalProperties>
  </analysisEngineMetaData>
  <externalResourceDependencies>
    <externalResourceDependency>
      <key>SentenceModel</key>
      <description>OpenNLP Sentence Model</description>
      <optional>false</optional>
    </externalResourceDependency>
  </externalResourceDependencies>
  <resourceManagerConfiguration>
    <externalResources>
      <externalResource>
        <name>SentenceModelSerFile</name>
        <description/>
        <fileResourceSpecifier>
          <fileUrl>file:com/mycompany/myapp/uima/annotators/sentence/en_sent.bin</fileUrl>
        </fileResourceSpecifier>
      </externalResource>
    </externalResources>
    <externalResourceBindings>
      <externalResourceBinding>
        <key>SentenceModel</key>
        <resourceName>SentenceModelSerFile</resourceName>
      </externalResourceBinding>
    </externalResourceBindings>
  </resourceManagerConfiguration>
</analysisEngineDescription>

To test this, we create an aggregate AE descriptor containing the TextAnnotator and the SentenceAnnotator, then call the AE using our standard TestUtils calls (getAE() and runAE()). I am not showing the JUnit test because it is trivial. The aggregate AE descriptor is shown below:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Source: src/main/java/com/mycompany/myapp/uima/annotators/aggregates/TestAE.xml -->
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>false</primitive>
  <delegateAnalysisEngineSpecifiers>
    <delegateAnalysisEngine key="TextAE">
      <import location="../text/TextAE.xml"/>
    </delegateAnalysisEngine>
    <delegateAnalysisEngine key="SentenceAE">
      <import location="../sentence/SentenceAE.xml"/>
    </delegateAnalysisEngine>
  </delegateAnalysisEngineSpecifiers>
  <analysisEngineMetaData>
    <name>TestAE</name>
    <description/>
    <version>1.0</version>
    <vendor/>
    <configurationParameters/>
    <configurationParameterSettings/>
    <flowConstraints>
      <fixedFlow>
        <node>TextAE</node>
        <node>SentenceAE</node>
      </fixedFlow>
    </flowConstraints>
    <fsIndexCollection/>
    <capabilities>
      <capability>
        <inputs/>
        <outputs>
          <type allAnnotatorFeatures="true">
            com.mycompany.myapp.uima.annotators.text.TextAnnotation
          </type>
          <type allAnnotatorFeatures="true">
            com.mycompany.myapp.uima.annotators.sentence.SentenceAnnotation
          </type>
        </outputs>
        <languagesSupported/>
      </capability>
    </capabilities>
    <operationalProperties>
      <modifiesCas>true</modifiesCas>
      <multipleDeploymentAllowed>true</multipleDeploymentAllowed>
      <outputsNewCASes>false</outputsNewCASes>
    </operationalProperties>
  </analysisEngineMetaData>
  <resourceManagerConfiguration/>
</analysisEngineDescription>

Conclusion

In the past, I have spent quite a lot of time trying to develop text mining tools (as well as my understanding of the underlying theory and techniques) from first principles, and my preference has been for rule- or heuristic-based approaches rather than model-based ones. One clear advantage of model-based approaches is that it is relatively simple to scale the application to another (human) language. The obvious disadvantage is that it is almost impossible to guarantee that special rules are accommodated if your training set does not reflect the pattern often enough, short of pre- or post-processing the data.

Another thing I am trying to avoid going forward is rolling my own text mining/NLP solution from scratch when a tool or framework already provides one. Paradoxically, this is harder to do, since you now have to understand both the problem space and the framework API used to solve it, but I think it is a more effective approach: these frameworks are built by experts in their respective fields who have spent time working around corner cases I wouldn't even know about, so the resulting application is likely to be more robust.

7 comments (moderated to prevent spam):

Anonymous said...

Hi Sujit,

You have some great content here! I'm a content curator for a popular site for developers with over 500,000 registered users.

After looking through your blog, we'd like to invite you to join our Most Valuable Blogger program. There are several significant benefits to joining the program.

If you'd be interested in hearing more about the MVB program, email me at: Katie (at) dzone (dot) com

I'll be looking forward to hearing from you!

-Katie McKinsey-

Sujit Pal said...

Hi Katie, I believe I've registered for the DZone MVB program already in response to a prior email from one of your curators.

debovis said...

Hey Sujit,

I have been working on some NLP tools and understanding more about NLP in general using different Java technologies. Everytime I try to find some documentation on a tool, I always end up refering to your blog and SF code.

Thanks for the help!


Sujit Pal said...

Thanks for the kind words, Debovis, glad it helped you.

Anonymous said...

Hi Sujit,
Thanks for the post.

I was trying to use your codes to recreate a project on my machine, but Eclipse is complaining that the TextAnnotation class is missing.

Do you have the source code of this class?

TIA

-Swirl-

Sujit Pal said...

Hi Swirl, the code is still work-in-progress at the moment, so I have not posted it anywhere publicly, but it's been a while since I last worked on it, so I guess I might as well do it. I am going to try to do this over the weekend on my github page; look for a project called "tgni".