Some time ago, I described a UIMA Sentence Annotator that parsed a block of plain text using OpenNLP's Sentence Detector and its prebuilt Maximum Entropy based sentence model.
This Sentence Annotator is used in my TGNI application to split a block of text into sentences, which are then further split up into shingles and matched against the Lucene/Neo4j representation of the taxonomy. This approach works in general, but yields a fair number of false positives.
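To make the shingling step concrete, here is a minimal sketch of word shingling. It is not the TGNI code: the class name Shingler, the maxSize parameter and the whitespace-based word splitting are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of TGNI): builds word shingles from a sentence. */
final class Shingler {

  /**
   * Returns every run of 1..maxSize consecutive words in the sentence.
   * Splitting on whitespace is a simplification made for this sketch.
   */
  static List<String> shingles(String sentence, int maxSize) {
    List<String> result = new ArrayList<>();
    String trimmed = sentence.trim();
    if (trimmed.isEmpty()) {
      return result; // nothing to shingle
    }
    String[] words = trimmed.split("\\s+");
    for (int start = 0; start < words.length; start++) {
      StringBuilder sb = new StringBuilder();
      // Grow the shingle one word at a time, stopping at maxSize or end of sentence.
      for (int size = 1; size <= maxSize && start + size <= words.length; size++) {
        if (size > 1) {
          sb.append(' ');
        }
        sb.append(words[start + size - 1]);
        result.add(sb.toString());
      }
    }
    return result;
  }

  public static void main(String[] args) {
    for (String shingle : shingles("Dr Johnson will lead the team", 2)) {
      System.out.println(shingle);
    }
  }
}
```

With a maximum size of 2, the words "lead the team" would produce the shingles "lead", "lead the", "the", "the team" and "team", each of which becomes a candidate for matching against the taxonomy.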
One class of false positives consists of words such as "Be" or "As", which you would normally expect to be stopworded out, but which match the chemical symbols stored as synonyms for the elements Beryllium and Arsenic respectively. Another class consists of words used in a different sense, for example "lead" in the sense "lead a team" rather than the metal. One way to address both classes is to match only the noun phrases from the sentences - the non-noun portions are the ones that generally contain the ambiguous usages described above.
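The idea can be sketched as a filter over candidate matches: a match is kept only if its character range falls inside some noun phrase range. The names below (NounPhraseFilter, Range, insideNounPhrases) are hypothetical and only illustrate the approach, they are not part of TGNI, UIMA or OpenNLP.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: keep only candidate matches that lie inside a noun phrase. */
final class NounPhraseFilter {

  /** Half-open character range [begin, end), like an annotation's begin/end pair. */
  static final class Range {
    final int begin;
    final int end;

    Range(int begin, int end) {
      this.begin = begin;
      this.end = end;
    }

    /** True if the other range lies completely within this one. */
    boolean contains(Range other) {
      return begin <= other.begin && other.end <= end;
    }
  }

  /** Returns the candidates that are fully covered by at least one noun phrase. */
  static List<Range> insideNounPhrases(List<Range> candidates, List<Range> nounPhrases) {
    List<Range> kept = new ArrayList<>();
    for (Range candidate : candidates) {
      for (Range np : nounPhrases) {
        if (np.contains(candidate)) {
          kept.add(candidate);
          break; // one covering noun phrase is enough
        }
      }
    }
    return kept;
  }
}
```

For the sentence "Be that as it may, the show must go on.", suppose the noun phrases are "it" at (11,13) and "the show" at (19,27). A candidate match on "Be" at (0,2) is then dropped, while a candidate match on "show" at (23,27) is kept.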
I decided to investigate whether I could do this with OpenNLP, since I was already using it. The last time around I used it only for sentence detection, and documentation was quite sparse at that time. Fortunately, this time round, I stumbled on these two posts in Davelog: "Getting started with OpenNLP 1.5.0 - Sentence Detection and Tokenizing" and "Part of Speech (POS) Tagging with OpenNLP 1.5.0", both of which were enormously useful in getting me started.
So I decided to replace my SentenceAnnotator (which annotated the text with sentence annotation markers) with a NounPhraseAnnotator. This one also first splits the input text into sentences using the SentenceDetector. It then tokenizes each sentence into words using the Tokenizer and finds the POS tag for each token using the POSTagger. Using the tokens and their associated tags, it then uses the Chunker to break up the sentence into phrase chunks. For each chunk, it checks the type, and only noun phrases (NP) are annotated. The SentenceDetector, Tokenizer, POSTagger and Chunker are all OpenNLP components, each backed by its own maximum entropy based model. Pre-built versions of these models are available for download from the OpenNLP project site.
A UIMA primitive Analysis Engine (AE) consists of an annotation descriptor (specified as XML), an annotator (specified as a Java class) and its associated AE descriptor (also specified as XML).
Annotation XML Descriptor
There is nothing to this annotation, really. It's just a regular Annotation without any extra properties. Here it is.
<?xml version="1.0" encoding="UTF-8"?>
<!-- src/main/resources/descriptors/NounPhrase.xml -->
<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
<name>NounPhrase</name>
<description>Annotation to represent Noun Phrase sequences in a body of text.</description>
<version>1.0</version>
<vendor/>
<types>
<typeDescription>
<name>com.mycompany.tgni.uima.annotators.nlp.NounPhraseAnnotation</name>
<description/>
<supertypeName>uima.tcas.Annotation</supertypeName>
</typeDescription>
</types>
</typeSystemDescription>
Annotator
I have already described what the annotator does above. Ultimately, it will replace the SentenceAnnotator, so it should consume TextAnnotation objects placed by the upstream TextAnnotator. For now, for quick development and testing, I have modeled it as a primitive AE which consumes text blocks. Here is the code for the NounPhraseAnnotator.
// Source: src/main/java/com/mycompany/tgni/uima/annotators/nlp/NounPhraseAnnotator.java
package com.mycompany.tgni.uima.annotators.nlp;
import java.io.InputStream;
import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;
import org.apache.commons.io.IOUtils;
import org.apache.uima.UimaContext;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceInitializationException;
/**
* Annotate noun phrases in sentences from within blocks of
* text (marked up with TextAnnotation) from either HTML or
* plain text documents. Using the OpenNLP library and models,
* the incoming text is tokenized into sentences, then each
* sentence is tokenized to words and POS tagged, and finally
* tokens are grouped together into chunks. Of these chunks,
* only the noun phrases are annotated.
*/
public class NounPhraseAnnotator extends JCasAnnotator_ImplBase {
private SentenceDetectorME sentenceDetector;
private TokenizerME tokenizer;
private POSTaggerME posTagger;
private ChunkerME chunker;
@Override
public void initialize(UimaContext ctx)
throws ResourceInitializationException {
super.initialize(ctx);
InputStream smis = null;
InputStream tmis = null;
InputStream pmis = null;
InputStream cmis = null;
try {
smis = getContext().getResourceAsStream("SentenceModel");
SentenceModel smodel = new SentenceModel(smis);
sentenceDetector = new SentenceDetectorME(smodel);
smis.close();
tmis = getContext().getResourceAsStream("TokenizerModel");
TokenizerModel tmodel = new TokenizerModel(tmis);
tokenizer = new TokenizerME(tmodel);
tmis.close();
pmis = getContext().getResourceAsStream("POSModel");
POSModel pmodel = new POSModel(pmis);
posTagger = new POSTaggerME(pmodel);
pmis.close();
cmis = getContext().getResourceAsStream("ChunkerModel");
ChunkerModel cmodel = new ChunkerModel(cmis);
chunker = new ChunkerME(cmodel);
cmis.close();
} catch (Exception e) {
throw new ResourceInitializationException(e);
} finally {
IOUtils.closeQuietly(cmis);
IOUtils.closeQuietly(pmis);
IOUtils.closeQuietly(tmis);
IOUtils.closeQuietly(smis);
}
}
@Override
public void process(JCas jcas) throws AnalysisEngineProcessException {
String text = jcas.getDocumentText();
Span[] sentSpans = sentenceDetector.sentPosDetect(text);
for (Span sentSpan : sentSpans) {
String sentence = sentSpan.getCoveredText(text).toString();
int start = sentSpan.getStart();
Span[] tokSpans = tokenizer.tokenizePos(sentence);
String[] tokens = new String[tokSpans.length];
for (int i = 0; i < tokens.length; i++) {
tokens[i] = tokSpans[i].getCoveredText(sentence).toString();
}
String[] tags = posTagger.tag(tokens);
Span[] chunks = chunker.chunkAsSpans(tokens, tags);
for (Span chunk : chunks) {
if ("NP".equals(chunk.getType())) {
NounPhraseAnnotation annotation = new NounPhraseAnnotation(jcas);
annotation.setBegin(start +
tokSpans[chunk.getStart()].getStart());
annotation.setEnd(
start + tokSpans[chunk.getEnd() - 1].getEnd());
annotation.addToIndexes(jcas);
}
}
}
}
}
The various OpenNLP components are all initialized (once at startup) in the initialize() method - as you can see, the code pattern is quite repetitive. The process() method splits the text into sentences, sentences into tokens, POS tags the tokens, then uses the tokens and tags to chunk each sentence. Only noun-phrase chunks are annotated. Two things are worth noting. First, the chunk spans report their start and end offsets in terms of token positions (not character positions), and the end is exclusive, which is why the code looks up the token at chunk.getEnd() - 1. Second, the NounPhrase annotation start and end offsets are character offsets relative to the start of the incoming block of text, so the sentence's start offset has to be added to the token's offsets within the sentence.
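The offset arithmetic is easy to get wrong, so here is a small standalone sketch of it. The Span class below is a hypothetical stand-in that mimics only the start and end of opennlp.tools.util.Span, and the token offsets are written out by hand rather than produced by a real tokenizer.

```java
/** Hypothetical sketch of the token-span to character-span conversion. */
final class ChunkOffsets {

  /** Minimal stand-in for a span with an inclusive start and an exclusive end. */
  static final class Span {
    final int start;
    final int end;

    Span(int start, int end) {
      this.start = start;
      this.end = end;
    }
  }

  /**
   * Converts a chunk expressed in token positions into character offsets
   * relative to the whole text block.
   *
   * @param sentenceStart character offset of the sentence within the text block
   * @param tokSpans      token character spans, relative to the sentence
   * @param chunk         chunk span in token positions, end exclusive
   */
  static Span toCharSpan(int sentenceStart, Span[] tokSpans, Span chunk) {
    int begin = sentenceStart + tokSpans[chunk.start].start;
    int end = sentenceStart + tokSpans[chunk.end - 1].end;
    return new Span(begin, end);
  }

  public static void main(String[] args) {
    String text = "It rained. Dr Johnson will lead the team";
    int sentenceStart = 11; // the second sentence starts at character 11
    Span[] tokSpans = {
      new Span(0, 2), new Span(3, 10), new Span(11, 15),
      new Span(16, 20), new Span(21, 24), new Span(25, 29)
    };
    // A chunk covering tokens 4 and 5, that is "the" and "team".
    Span np = toCharSpan(sentenceStart, tokSpans, new Span(4, 6));
    System.out.println(text.substring(np.start, np.end)); // prints "the team"
  }
}
```

Forgetting to add the sentence start would still give correct results for the first sentence of a block, which makes this kind of bug easy to miss when testing with single-sentence inputs.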
AE Descriptor
Finally, here is the AE descriptor for the NounPhraseAnnotator. This is also pretty vanilla; the only non-standard block is the resource manager configuration, which binds the OpenNLP model files to the symbolic names used by the annotator. One other thing that you may notice is the reference to TextAnnotation - as mentioned above, the ultimate goal is to replace the SentenceAnnotator, which consumes TextAnnotations, and that's why it's here.
<?xml version="1.0" encoding="UTF-8"?>
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
<frameworkImplementation>org.apache.uima.java</frameworkImplementation>
<primitive>true</primitive>
<annotatorImplementationName>
com.mycompany.tgni.uima.annotators.nlp.NounPhraseAnnotator
</annotatorImplementationName>
<analysisEngineMetaData>
<name>NounPhraseAE</name>
<description>Annotates Noun Phrases in sentences</description>
<version>1.0</version>
<vendor/>
<configurationParameters/>
<configurationParameterSettings/>
<typeSystemDescription>
<types>
<typeDescription>
<name>com.mycompany.tgni.uima.annotators.text.TextAnnotation</name>
<description/>
<supertypeName>uima.tcas.Annotation</supertypeName>
</typeDescription>
<typeDescription>
<name>com.mycompany.tgni.uima.annotators.nlp.NounPhraseAnnotation</name>
<description/>
<supertypeName>uima.tcas.Annotation</supertypeName>
</typeDescription>
</types>
</typeSystemDescription>
<typePriorities/>
<fsIndexCollection/>
<capabilities>
<capability>
<inputs>
<type allAnnotatorFeatures="true">com.mycompany.tgni.uima.annotators.text.TextAnnotation</type>
<feature>com.mycompany.tgni.uima.annotators.text.TextAnnotation:tagname</feature>
</inputs>
<outputs>
<type allAnnotatorFeatures="true">com.mycompany.tgni.uima.annotators.nlp.NounPhraseAnnotation</type>
</outputs>
<languagesSupported/>
</capability>
</capabilities>
<operationalProperties>
<modifiesCas>true</modifiesCas>
<multipleDeploymentAllowed>true</multipleDeploymentAllowed>
<outputsNewCASes>false</outputsNewCASes>
</operationalProperties>
</analysisEngineMetaData>
<externalResourceDependencies>
<externalResourceDependency>
<key>SentenceModel</key>
<description>OpenNLP Sentence Model</description>
<optional>false</optional>
</externalResourceDependency>
<externalResourceDependency>
<key>TokenizerModel</key>
<description>OpenNLP Tokenizer Model</description>
<optional>false</optional>
</externalResourceDependency>
<externalResourceDependency>
<key>POSModel</key>
<description>OpenNLP POS Tagging Model</description>
<optional>false</optional>
</externalResourceDependency>
<externalResourceDependency>
<key>ChunkerModel</key>
<description>OpenNLP Chunker Model</description>
<optional>false</optional>
</externalResourceDependency>
</externalResourceDependencies>
<resourceManagerConfiguration>
<externalResources>
<externalResource>
<name>SentenceModelSerFile</name>
<description/>
<fileResourceSpecifier>
<fileUrl>file:@tgni.home@/conf/models/en-sent.bin</fileUrl>
</fileResourceSpecifier>
</externalResource>
<externalResource>
<name>TokenizerModelSerFile</name>
<description/>
<fileResourceSpecifier>
<fileUrl>file:@tgni.home@/conf/models/en-token.bin</fileUrl>
</fileResourceSpecifier>
</externalResource>
<externalResource>
<name>POSModelSerFile</name>
<description/>
<fileResourceSpecifier>
<fileUrl>file:@tgni.home@/conf/models/en-pos-maxent.bin</fileUrl>
</fileResourceSpecifier>
</externalResource>
<externalResource>
<name>ChunkerModelSerFile</name>
<description/>
<fileResourceSpecifier>
<fileUrl>file:@tgni.home@/conf/models/en-chunker.bin</fileUrl>
</fileResourceSpecifier>
</externalResource>
</externalResources>
<externalResourceBindings>
<externalResourceBinding>
<key>SentenceModel</key>
<resourceName>SentenceModelSerFile</resourceName>
</externalResourceBinding>
<externalResourceBinding>
<key>TokenizerModel</key>
<resourceName>TokenizerModelSerFile</resourceName>
</externalResourceBinding>
<externalResourceBinding>
<key>POSModel</key>
<resourceName>POSModelSerFile</resourceName>
</externalResourceBinding>
<externalResourceBinding>
<key>ChunkerModel</key>
<resourceName>ChunkerModelSerFile</resourceName>
</externalResourceBinding>
</externalResourceBindings>
</resourceManagerConfiguration>
</analysisEngineDescription>
Testing Code and Results
The following JUnit test runs the NounPhraseAnnotator primitive AE against some input sentences.
// Source: src/test/java/com/mycompany/tgni/uima/annotators/nlp/NounPhraseAnnotatorTest.java
package com.mycompany.tgni.uima.annotators.nlp;
import java.util.Iterator;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.cas.FSIndex;
import org.apache.uima.jcas.JCas;
import org.junit.Test;
import com.mycompany.tgni.uima.utils.UimaUtils;
/**
* Tests for the Noun Phrase Annotator.
*/
public class NounPhraseAnnotatorTest {
private static final String[] INPUTS = new String[] { ... };
@Test
public void testNounPhraseAnnotation() throws Exception {
AnalysisEngine ae = UimaUtils.getAE(
"conf/descriptors/NounPhraseAE.xml", null);
for (String input : INPUTS) {
System.out.println("text: " + input);
JCas jcas = UimaUtils.runAE(ae, input, UimaUtils.MIMETYPE_TEXT);
FSIndex index = jcas.getAnnotationIndex(NounPhraseAnnotation.type);
for (Iterator<NounPhraseAnnotation> it = index.iterator(); it.hasNext();) {
NounPhraseAnnotation annotation = it.next();
System.out.println("...(" + annotation.getBegin() + "," +
annotation.getEnd() + "): " +
annotation.getCoveredText());
}
}
}
}
And here are some test inputs and the associated noun phrases that were annotated. The annotation consists of the start and end character positions and the actual string covered.
[junit] text: Be that as it may, the show must go on.
[junit] ...(11,13): it
[junit] ...(19,27): the show
[junit] text: As I was telling you, he will not attend the meeting.
[junit] ...(3,4): I
[junit] ...(17,20): you
[junit] ...(22,24): he
[junit] ...(41,52): the meeting
[junit] text: Dr Johnson will lead the team
[junit] ...(0,10): Dr Johnson
[junit] ...(21,29): the team
[junit] text: Lead is the lead cause of lead poisoning.
[junit] ...(0,4): Lead
[junit] ...(8,22): the lead cause
[junit] ...(26,40): lead poisoning
As you can see, "Be" and "As" are no longer in the set of strings to be matched. The verb "lead" in the third example is also taken care of. The fourth example, however, still returns "the lead cause", where "lead" is not the metal, so this case will still need to be taken care of somehow.