Saturday, August 27, 2011

A UIMA Noun Phrase POS Annotator using OpenNLP

Some time ago, I described a UIMA Sentence Annotator that parsed a block of plain text using OpenNLP's Sentence Detector and its prebuilt Maximum Entropy based sentence model.

This Sentence Annotator is used in my TGNI application to split a block of text into sentences, which are then further split up into shingles and matched against the Lucene/Neo4j representation of the taxonomy. This approach works in general, but yields a fair number of false positives.

One class of false positives consists of words such as "Be" or "As", which you would normally expect to be stopworded out, but which match the chemical name synonyms for the elements Beryllium and Arsenic respectively. Another class consists of words used in an incorrect context, for example "lead" in the sense "lead a team" rather than the metal. One approach to handling both classes would be to use only the noun phrases from the sentences - the non-noun portions are the ones that generally contain the ambiguous usages described above.

I decided to investigate if I could do this with OpenNLP, since I was already using it. The last time I used it was only for sentence detection, and documentation was quite sparse at that time. Fortunately, this time round, I stumbled on these two posts in Davelog: Getting starting with OpenNLP 1.5.0 - Sentence Detection and Tokenizing and Part of Speech (POS) Tagging with OpenNLP 1.5.0, both of which were enormously useful in getting me started.

So I decided to replace my SentenceAnnotator (which annotated the text with sentence annotation markers) with a NounPhraseAnnotator. This one also first splits the input text into sentences using the SentenceDetector. Then, for each sentence, it tokenizes the sentence into words using the Tokenizer and finds POS tags for each token using the POSTagger. Using the tokens and their associated tags, it then uses the Chunker to break up the sentence into phrase chunks. For each chunk, it checks the chunk type, and only noun phrases (NP) are annotated. The SentenceDetector, Tokenizer, POSTagger and Chunker are all OpenNLP components, each backed by its own maximum entropy based model. Pre-built versions of these models are available for download from here.

A UIMA primitive Analysis Engine (AE) consists of an annotation descriptor (specified as XML), an annotator (specified as a Java class) and its associated AE descriptor (also specified as XML).

Annotation XML Descriptor

There is nothing to this annotation, really. It's just a regular Annotation without any extra properties. Here it is.

<?xml version="1.0" encoding="UTF-8"?>
<!-- src/main/resources/descriptors/NounPhrase.xml -->
<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <name>NounPhrase</name>
  <description>Annotation to represent Noun Phrase sequences in a body of text.</description>
  <version>1.0</version>
  <vendor/>
  <types>
    <typeDescription>
      <name>com.mycompany.tgni.uima.annotators.nlp.NounPhraseAnnotation</name>
      <description/>
      <supertypeName>uima.tcas.Annotation</supertypeName>
    </typeDescription>
  </types>
</typeSystemDescription>

Annotator

I have already described what the annotator does above. Ultimately, it will replace the SentenceAnnotator, so it should consume TextAnnotation objects placed by the upstream TextAnnotator. For now, for quick development and testing, I have modeled it as a primitive AE which consumes text blocks. Here is the code for the NounPhraseAnnotator.

// Source: src/main/java/com/mycompany/tgni/uima/annotators/nlp/NounPhraseAnnotator.java
package com.mycompany.tgni.uima.annotators.nlp;

import java.io.InputStream;

import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;

import org.apache.commons.io.IOUtils;
import org.apache.uima.UimaContext;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceInitializationException;

/**
 * Annotate noun phrases in sentences from within blocks of
 * text (marked up with TextAnnotation) from either HTML or
 * plain text documents. Using the OpenNLP library and models,
 * the incoming text is tokenized into sentences, then each 
 * sentence is tokenized to words and POS tagged, and finally
 * tokens are grouped together into chunks. Of these chunks,
 * only the noun phrases are annotated. 
 */
public class NounPhraseAnnotator extends JCasAnnotator_ImplBase {

  private SentenceDetectorME sentenceDetector;
  private TokenizerME tokenizer;
  private POSTaggerME posTagger;
  private ChunkerME chunker;
  
  @Override
  public void initialize(UimaContext ctx) 
      throws ResourceInitializationException {
    super.initialize(ctx);
    InputStream smis = null;
    InputStream tmis = null;
    InputStream pmis = null;
    InputStream cmis = null;
    try {
      smis = getContext().getResourceAsStream("SentenceModel");
      SentenceModel smodel = new SentenceModel(smis);
      sentenceDetector = new SentenceDetectorME(smodel);
      smis.close();
      tmis = getContext().getResourceAsStream("TokenizerModel");
      TokenizerModel tmodel = new TokenizerModel(tmis);
      tokenizer = new TokenizerME(tmodel);
      tmis.close();
      pmis = getContext().getResourceAsStream("POSModel");
      POSModel pmodel = new POSModel(pmis);
      posTagger = new POSTaggerME(pmodel);
      pmis.close();
      cmis = getContext().getResourceAsStream("ChunkerModel");
      ChunkerModel cmodel = new ChunkerModel(cmis);
      chunker = new ChunkerME(cmodel);
      cmis.close();
    } catch (Exception e) {
      throw new ResourceInitializationException(e);
    } finally {
      IOUtils.closeQuietly(cmis);
      IOUtils.closeQuietly(pmis);
      IOUtils.closeQuietly(tmis);
      IOUtils.closeQuietly(smis);
    }
  }
  
  @Override
  public void process(JCas jcas) throws AnalysisEngineProcessException {
    String text = jcas.getDocumentText();
    Span[] sentSpans = sentenceDetector.sentPosDetect(text);
    for (Span sentSpan : sentSpans) {
      String sentence = sentSpan.getCoveredText(text).toString();
      int start = sentSpan.getStart();
      Span[] tokSpans = tokenizer.tokenizePos(sentence);
      String[] tokens = new String[tokSpans.length];
      for (int i = 0; i < tokens.length; i++) {
        tokens[i] = tokSpans[i].getCoveredText(sentence).toString();
      }
      String[] tags = posTagger.tag(tokens);
      Span[] chunks = chunker.chunkAsSpans(tokens, tags);
      for (Span chunk : chunks) {
        if ("NP".equals(chunk.getType())) {
          NounPhraseAnnotation annotation = new NounPhraseAnnotation(jcas);
          annotation.setBegin(start + 
            tokSpans[chunk.getStart()].getStart());
          annotation.setEnd(
            start + tokSpans[chunk.getEnd() - 1].getEnd());
          annotation.addToIndexes(jcas);
        }
      }
    }
  }
}

The various OpenNLP components are all initialized once at startup in the initialize() method - as you can see, the code pattern is quite repetitive. The process() method splits the text into sentences and the sentences into tokens, POS tags the tokens, then uses the tokens and tags to chunk each sentence. Only noun-phrase chunks are annotated. One important thing to note is that the chunk spans report their start and end offsets in terms of token positions (not character positions), so they have to be mapped back to characters through the token spans. Another thing to note is that the NounPhraseAnnotation start and end offsets are character offsets relative to the start of the incoming block of text, which is why the sentence start offset is added to each token offset.
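To make the offset arithmetic concrete, here is a minimal sketch that runs without OpenNLP. The Range class is a hypothetical stand-in for opennlp.tools.util.Span, and the token and chunk ranges are written by hand (an assumption, chosen to be consistent with the "Dr Johnson will lead the team" output shown later) rather than produced by the models.

```java
import java.util.ArrayList;
import java.util.List;

public class ChunkOffsetSketch {

  /** Hypothetical stand-in for opennlp.tools.util.Span: a half-open [start, end) range. */
  static final class Range {
    final int start;
    final int end;
    final String type; // chunk type such as "NP"; null for token ranges

    Range(int start, int end, String type) {
      this.start = start;
      this.end = end;
      this.type = type;
    }
  }

  /**
   * Maps token-indexed NP chunks to character offsets relative to the whole text.
   * Token ranges are character offsets within the sentence; chunk ranges are
   * token indices within the sentence.
   */
  static List<int[]> nounPhraseOffsets(int sentenceStart, Range[] tokens, Range[] chunks) {
    List<int[]> result = new ArrayList<>();
    for (Range chunk : chunks) {
      if (!"NP".equals(chunk.type)) {
        continue; // only noun phrases are kept
      }
      // First token of the chunk gives the begin offset,
      // last token (end index is exclusive) gives the end offset.
      int begin = sentenceStart + tokens[chunk.start].start;
      int end = sentenceStart + tokens[chunk.end - 1].end;
      result.add(new int[] { begin, end });
    }
    return result;
  }

  public static void main(String[] args) {
    String text = "Dr Johnson will lead the team";
    // Character ranges of the six tokens within the sentence (hand-written).
    Range[] tokens = {
      new Range(0, 2, null), new Range(3, 10, null), new Range(11, 15, null),
      new Range(16, 20, null), new Range(21, 24, null), new Range(25, 29, null)
    };
    // Token-index ranges of the two noun phrase chunks (hand-written).
    Range[] chunks = { new Range(0, 2, "NP"), new Range(4, 6, "NP") };
    for (int[] np : nounPhraseOffsets(0, tokens, chunks)) {
      System.out.println("(" + np[0] + "," + np[1] + "): " + text.substring(np[0], np[1]));
    }
    // prints:
    // (0,10): Dr Johnson
    // (21,29): the team
  }
}
```

Applying the chunk's token indices directly to the sentence string, instead of going through the token ranges, would pick out the wrong characters (for example, characters 0 to 2 instead of tokens 0 to 2).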

AE Descriptor

Finally, here is the AE descriptor for the NounPhraseAnnotator. This is also pretty vanilla; the only non-standard block is the resource manager configuration, which relates the OpenNLP model files to the symbolic names used by the annotator. One other thing that you may notice is the reference to TextAnnotation - as mentioned above, the ultimate goal is to replace the SentenceAnnotator, which consumes TextAnnotations - that's why it's here.

<?xml version="1.0" encoding="UTF-8"?>
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
  <primitive>true</primitive>
  <annotatorImplementationName>
    com.mycompany.tgni.uima.annotators.nlp.NounPhraseAnnotator
  </annotatorImplementationName>
  <analysisEngineMetaData>
    <name>NounPhraseAE</name>
    <description>Annotates Noun Phrases in sentences</description>
    <version>1.0</version>
    <vendor/>
    <configurationParameters/>
    <configurationParameterSettings/>
    <typeSystemDescription>
      <types>
        <typeDescription>
          <name>com.mycompany.tgni.uima.annotators.text.TextAnnotation</name>
          <description/>
          <supertypeName>uima.tcas.Annotation</supertypeName>
        </typeDescription>
        <typeDescription>
          <name>com.mycompany.tgni.uima.annotators.nlp.NounPhraseAnnotation</name>
          <description/>
          <supertypeName>uima.tcas.Annotation</supertypeName>
        </typeDescription>
      </types>
    </typeSystemDescription>
    <typePriorities/>
    <fsIndexCollection/>
    <capabilities>
      <capability>
        <inputs>
          <type allAnnotatorFeatures="true">com.mycompany.tgni.uima.annotators.text.TextAnnotation</type>
          <feature>com.mycompany.tgni.uima.annotators.text.TextAnnotation:tagname</feature>
        </inputs>
        <outputs>
          <type allAnnotatorFeatures="true">com.mycompany.tgni.uima.annotators.nlp.NounPhraseAnnotation</type>
        </outputs>
        <languagesSupported/>
      </capability>
    </capabilities>
    <operationalProperties>
      <modifiesCas>true</modifiesCas>
      <multipleDeploymentAllowed>true</multipleDeploymentAllowed>
      <outputsNewCASes>false</outputsNewCASes>
    </operationalProperties>
  </analysisEngineMetaData>
  <externalResourceDependencies>
    <externalResourceDependency>
      <key>SentenceModel</key>
      <description>OpenNLP Sentence Model</description>
      <optional>false</optional>
    </externalResourceDependency>
    <externalResourceDependency>
      <key>TokenizerModel</key>
      <description>OpenNLP Tokenizer Model</description>
      <optional>false</optional>
    </externalResourceDependency>
    <externalResourceDependency>
      <key>POSModel</key>
      <description>OpenNLP POS Tagging Model</description>
      <optional>false</optional>
    </externalResourceDependency>
    <externalResourceDependency>
      <key>ChunkerModel</key>
      <description>OpenNLP Chunker Model</description>
      <optional>false</optional>
    </externalResourceDependency>
  </externalResourceDependencies>
  <resourceManagerConfiguration>
    <externalResources>
      <externalResource>
        <name>SentenceModelSerFile</name>
        <description/>
        <fileResourceSpecifier>
          <fileUrl>file:@tgni.home@/conf/models/en-sent.bin</fileUrl>
        </fileResourceSpecifier>
      </externalResource>
      <externalResource>
        <name>TokenizerModelSerFile</name>
        <description/>
        <fileResourceSpecifier>
          <fileUrl>file:@tgni.home@/conf/models/en-token.bin</fileUrl>
        </fileResourceSpecifier>
      </externalResource>
      <externalResource>
        <name>POSModelSerFile</name>
        <description/>
        <fileResourceSpecifier>
          <fileUrl>file:@tgni.home@/conf/models/en-pos-maxent.bin</fileUrl>
        </fileResourceSpecifier>
      </externalResource>
      <externalResource>
        <name>ChunkerModelSerFile</name>
        <description/>
        <fileResourceSpecifier>
          <fileUrl>file:@tgni.home@/conf/models/en-chunker.bin</fileUrl>
        </fileResourceSpecifier>
      </externalResource>
    </externalResources>
    <externalResourceBindings>
      <externalResourceBinding>
        <key>SentenceModel</key>
        <resourceName>SentenceModelSerFile</resourceName>
      </externalResourceBinding>
      <externalResourceBinding>
        <key>TokenizerModel</key>
        <resourceName>TokenizerModelSerFile</resourceName>
      </externalResourceBinding>
      <externalResourceBinding>
        <key>POSModel</key>
        <resourceName>POSModelSerFile</resourceName>
      </externalResourceBinding>
      <externalResourceBinding>
        <key>ChunkerModel</key>
        <resourceName>ChunkerModelSerFile</resourceName>
      </externalResourceBinding>
    </externalResourceBindings>
  </resourceManagerConfiguration>
</analysisEngineDescription>

Testing Code and Results

The following JUnit test runs the NounPhraseAnnotator primitive AE against some input sentences.

// Source: src/test/java/com/mycompany/tgni/uima/annotators/nlp/NounPhraseAnnotatorTest.java
package com.mycompany.tgni.uima.annotators.nlp;

import java.util.Iterator;

import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.cas.FSIndex;
import org.apache.uima.jcas.JCas;
import org.junit.Test;

import com.mycompany.tgni.uima.utils.UimaUtils;

/**
 * Tests for the Noun Phrase Annotator.
 */
public class NounPhraseAnnotatorTest {

  private static final String[] INPUTS = new String[] { ... };

  @Test
  public void testNounPhraseAnnotation() throws Exception {
    AnalysisEngine ae = UimaUtils.getAE(
      "conf/descriptors/NounPhraseAE.xml", null);
    for (String input : INPUTS) {
      System.out.println("text: " + input);
      JCas jcas = UimaUtils.runAE(ae, input, UimaUtils.MIMETYPE_TEXT);
      FSIndex index = jcas.getAnnotationIndex(NounPhraseAnnotation.type);
      for (Iterator<NounPhraseAnnotation> it = index.iterator(); it.hasNext();) {
        NounPhraseAnnotation annotation = it.next();
        System.out.println("...(" + annotation.getBegin() + "," + 
          annotation.getEnd() + "): " + 
          annotation.getCoveredText());
      }
    }
  }
}

And here are some test inputs and the associated noun phrases that were annotated. The annotation consists of the start and end character positions and the actual string covered.

    [junit] text: Be that as it may, the show must go on.
    [junit] ...(11,13): it
    [junit] ...(19,27): the show
    [junit] text: As I was telling you, he will not attend the meeting.
    [junit] ...(3,4): I
    [junit] ...(17,20): you
    [junit] ...(22,24): he
    [junit] ...(41,52): the meeting
    [junit] text: Dr Johnson will lead the team
    [junit] ...(0,10): Dr Johnson
    [junit] ...(21,29): the team
    [junit] text: Lead is the lead cause of lead poisoning.
    [junit] ...(0,4): Lead
    [junit] ...(8,22): the lead cause
    [junit] ...(26,40): lead poisoning

As you can see, the "Be" and "As" are no longer in the set of strings to be matched. The "lead" as a verb in the third example is also taken care of. The fourth example does return "the lead cause" which will still need to be taken care of somehow.
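As a rough illustration of how a downstream matcher could use these annotations, the sketch below keeps a candidate match only if its character range falls inside an annotated noun phrase. The class and method names are hypothetical, and the idea that candidate matches carry character offsets is an assumption; the noun phrase ranges are copied from the first test output above.

```java
import java.util.Arrays;
import java.util.List;

public class NounPhraseFilterSketch {

  /** Returns true if [begin, end) lies entirely inside one of the noun phrase ranges. */
  static boolean insideNounPhrase(int begin, int end, List<int[]> nounPhrases) {
    for (int[] np : nounPhrases) {
      if (begin >= np[0] && end <= np[1]) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    String text = "Be that as it may, the show must go on.";
    // Noun phrase ranges reported for this sentence: "it" and "the show".
    List<int[]> nounPhrases = Arrays.asList(new int[] { 11, 13 }, new int[] { 19, 27 });
    // Hypothetical candidate matches: "Be", "as" and "show".
    int[][] candidates = { { 0, 2 }, { 8, 10 }, { 23, 27 } };
    for (int[] c : candidates) {
      System.out.println(text.substring(c[0], c[1]) + " -> "
        + insideNounPhrase(c[0], c[1], nounPhrases));
    }
    // prints:
    // Be -> false
    // as -> false
    // show -> true
  }
}
```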

14 comments (moderated to prevent spam):

Anonymous said...

There is more documentation here: http://incubator.apache.org/opennlp/documentation.html

Also about the official UIMA integration.

Nice post!

Sujit Pal said...

Thanks, both for the kind words and the link to the documentation!

Anonymous said...

Sujit, if you are processing the entire text, have you thought about discarding the co-references so that you will not have to worry about you, his, me etc., which come up as nouns? Generally they show up as [NP his/PRP$ first/JJ ] or [NP whose/WP$ father/NN ] or [NP I/PRP ] or [NP the/DT process/NN ] in the chunker output (as it does shallow parsing, including pronouns and determiners as NPs).

Ravi Kiran Bhaskar
Principal Software Engineer
Washington Post Digital
1150 15th Street NW, Washington, DC 20071

Sujit Pal said...

Thanks Ravi, that is a good idea. For my particular application, though, I don't have concepts called "you", "his", etc., so these terms will not match. My hope with coreference resolution would be to recognize and score these terms against the original entity rather than ignoring them.

Satya said...

Hi Sujit, this is a very nice article for a beginner. Could you also please post the NounPhraseAnnotation code.

Sujit Pal said...

Hi Satya, the NounPhraseAnnotation is autogenerated by UIMA from the XML descriptor. If you have UIMA installed, you should be able to generate it from the XML descriptor NounPhrase.xml - as you can see, it's actually a vanilla annotation that has no other purpose than to give the annotation a name.

Anonymous said...

How can I do this without UIMA?

Sujit Pal said...

Hi, to do this without UIMA, create a class that is initialized similar to the NounPhraseAnnotator.initialize(), and pass your input text to a method similar to the NounPhraseAnnotator.process(). Instead of passing in a JCas, pass in your text directly (the process() method gets the text using JCas.getDocumentText()).

Anonymous said...

Hi Sujit,

I use OpenNLP without UIMA. I created a function for each tool, but my chunking function does not work. Could you help me please? Here is my main function:

public static void main(String[] args) throws FileNotFoundException {
String input="This article provides a review of the literature on clinical correlates of awareness in dementia. Most inconsistencies were found with regard to an association between depression and higher levels of awareness. Dysthymia, but not major depression, is probably related to higher levels of awareness. Anxiety also appears to be related to higher levels of awareness. Apathy and psychosis are frequently present in patients with less awareness, and may share common neuropathological substrates with awareness. Furthermore, unawareness seems to be related to difficulties in daily life functioning, increased caregiver burden, and deterioration in global dementia severity. Factors that may be of influence on the inconclusive data are discussed, as are future directions of research.";
System.out.println("text : " + input);
NounPhraseAnnotator2 ann=new NounPhraseAnnotator2();
String [] sentences=ann.SentenceDetector(input);
for(String sentence:sentences){
String tokens[]=ann.SentenceTokenizer(sentence);
String tags[]=ann.POSTagger(tokens);
//for(int i=0;i<tokens.length;i++)
//System.out.println(tokens[i]+" - "+tags[i]);
Span chunks[]=ann.Chunking(tokens, tags);
//System.out.println(sentence);
for (Span chunk : chunks) {
if ("NP".equals(chunk.getType())) {
System.out.println(chunk.getStart()+" "+chunk.getEnd()+" "+chunk.getCoveredText(sentence));
}
}
}
// Parse[] parse=ann.parser(input);
//for(Parse pars:parse) pars.show();
System.out.println("FIN");
}

Here are some of the results:

0 2 Th
3 5 s
6 8 rt
9 11 cl
12 13
14 15 r
0 2 Mo
5 6 i
7 9 co
10 14 sist
15 16 n
0 1 D
3 6 thy
11 13 bu
14 15
0 1 A
7 9 a
10 11 s
2 3 a
7 8 a
9 11 d
15 18 hos
19 20 s
2 3 r
8 9 o
10 13 e,
15 17 aw
19 20 e
21 24 ess
..........

many thanks

Sujit Pal said...

Blogger messes up the formatting so it's hard to read. From a quick look I don't see anything glaring (but I didn't look that hard). I suspect the problems may be with your models. I have written a JUnit test (which you can convert to a main almost 1 to 1 if you need) that produces results that seem more believable. Both the code snippet and results can be found here: http://pastebin.com/bUDY7fb0

Vivek Kumar said...

Hi,

Can you please describe the use of UimaUtils in your example? Do I need to create any custom code to run this example?

Regards
Vivek

Sujit Pal said...

Hi Vivek, sorry, I did not include the source for UimaUtils.java; here it is.

Vivek Kumar said...

Hi Sujit,

I have two versions of a noun phrase extractor, one using UIMA and one without, and both give the same result. Apart from running time, on what basis should I choose between them?

Sujit Pal said...

I think you may want to use UIMA if you want to put your extractor into a pipeline of other UIMA components, or if you want to use UIMA's built-in features, for example if you want to change the model and have the extractor automatically refresh itself using a reinit() call from your client. Otherwise, it may make more sense to build a non-UIMA pipeline and bypass the pain of UIMA configuration.