Lately, I have been looking at Named Entity Recognition (NER). As I see it, NER can improve the search experience in at least two ways. First, it can be incorporated into a custom Lucene analyzer, so that "known" entities are protected from stemming during both indexing and search. Second, it can be used to parse a query string into an intelligent boolean multi-field query.
The two frameworks I looked at for this were the General Architecture for Text Engineering (GATE) and the Apache Unstructured Information Management Architecture (UIMA). GATE is a huge and comprehensive framework; it took me a while to get my head around it, and I still don't think I got it all. During this time, I happened to stumble across UIMA. I liked the fact that it is all Java (compared to GATE's JAPE, as powerful as it is), and I liked the way it is built (small components roll up nicely into larger components, compared to GATE's Language/Processing Resources approach). Maybe it's just me, but I felt that GATE is aimed more at linguists (many prebuilt components, but relatively harder to build your own) and UIMA more at programmers (relatively fewer components, but a well-defined API for people to build their own fairly easily).
Anyway, I decided to get familiar with the UIMA API by solving a toy problem. Assume a website that allows searching for names of people and organizations, with optional (and partial) addresses to narrow the search. Behind the scenes, assume an index that stores city, state and zipcode as separate indexed fields. The query string is parsed using a UIMA aggregate analysis engine (AE) composed of a pipeline of three primitive AEs, which parse the zipcode, state and city respectively. The end result of the analysis is the matched term, with its offset information, for each of these entities. I haven't gone as far as the query parser (a CAS Consumer in UIMA), so in this post I show the various descriptors and annotator code that parse the query string and extract the entities from it.
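To make the end goal concrete, here is a minimal sketch of how extracted entities might eventually be turned into a fielded query. It is not the query parser (which this post does not build), and it uses no UIMA or Lucene classes. The Entity holder, the toQuery() method and the "name" field are assumptions made up for illustration; only the city, state and zipcode fields come from the index described above.

```java
import java.util.ArrayList;
import java.util.List;

public class FieldedQuerySketch {

  /** Hypothetical holder for one recognized entity: target field, value, offsets. */
  static final class Entity {
    final String field;
    final String value;
    final int begin;
    final int end;

    Entity(String field, String value, int begin, int end) {
      this.field = field;
      this.value = value;
      this.begin = begin;
      this.end = end;
    }
  }

  /**
   * Blanks out the recognized spans in the query text, treats whatever is
   * left as the name, and ANDs it with one field:"value" clause per entity.
   */
  static String toQuery(String text, List<Entity> entities) {
    StringBuilder rest = new StringBuilder(text);
    List<String> clauses = new ArrayList<String>();
    for (Entity e : entities) {
      for (int i = e.begin; i < e.end; i++) {
        rest.setCharAt(i, ' ');
      }
      clauses.add(e.field + ":\"" + e.value + "\"");
    }
    // Collapse the leftover commas and whitespace into single spaces.
    String name = rest.toString().replaceAll("[\\s,]+", " ").trim();
    StringBuilder query = new StringBuilder();
    if (name.length() > 0) {
      query.append("name:\"").append(name).append("\"");
    }
    for (String clause : clauses) {
      if (query.length() > 0) {
        query.append(" AND ");
      }
      query.append(clause);
    }
    return query.toString();
  }

  public static void main(String[] args) {
    List<Entity> entities = new ArrayList<Entity>();
    entities.add(new Entity("city", "Cupertino", 24, 33));
    entities.add(new Entity("state", "CA", 35, 37));
    entities.add(new Entity("zipcode", "95014", 38, 43));
    System.out.println(
        toQuery("Apple, 1 Infinite Loop, Cupertino, CA 95014", entities));
  }
}
```

Note that the street address ends up in the name clause here, since nothing in this pipeline recognizes it; a real query parser would need to decide what to do with such leftovers.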
UIMA Background
For those not familiar with UIMA, it's a framework developed by IBM and donated to Apache. UIMA is currently in the Apache incubator. For details, you should refer to the UIMA Tutorial and Developer's Guide, but if you want a really quick (and possibly incomplete) tour, here it is. The basic building block that you build is a primitive Analysis Engine (AE). Each primitive AE needs to have an annotation type and an annotator. The type is defined in an XML file, and a tool called JCasGen is used to generate the POJOs representing the type and annotation. The annotator is written next, and an XML descriptor is created for it. The framework instantiates the annotator using the AE XML descriptor. Aggregate AEs are defined as XML files that describe chains of primitive AEs.
UIMA comes with an Eclipse plug-in, which provides tools to build the XML using fill-in forms. It's probably advisable to use it, because the XML is quite complex, at least initially.
Zip Code Annotator
The Zip Code Annotator uses regular expressions to find zip codes in the input text. As mentioned before it needs a ZipCode type, which is defined by the following XML file:
<?xml version="1.0" encoding="UTF-8"?>
<!-- Source: src/main/java/com/mycompany/myapp/uima/annotators/zipcode/ZipCode.xml -->
<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
<name>ZipCode</name>
<description>Defines the zipcode type</description>
<version>1.0</version>
<vendor>MyCompany, Inc.</vendor>
<types>
<typeDescription>
<name>com.mycompany.myapp.uima.annotators.zipcode.ZipCodeAnnotation</name>
<description>ZipCode</description>
<supertypeName>uima.tcas.Annotation</supertypeName>
</typeDescription>
</types>
</typeSystemDescription>
Running JCasGen creates two files, ZipCodeAnnotation_Type.java and ZipCodeAnnotation.java (the annotation class name is specified in the XML file). We then write the annotator, which looks like this:
// Source: src/main/java/com/mycompany/myapp/uima/annotators/zipcode/ZipCodeAnnotator.java
package com.mycompany.myapp.uima.annotators.zipcode;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;
public class ZipCodeAnnotator extends JCasAnnotator_ImplBase {
// five digits, optionally followed by a single four-digit extension
private Pattern zipCodePattern = Pattern.compile("\\d{5}(-\\d{4})?");
@Override
public void process(JCas jCAS) throws AnalysisEngineProcessException {
String text = jCAS.getDocumentText();
Matcher matcher = zipCodePattern.matcher(text);
int pos = 0;
while (matcher.find(pos)) {
ZipCodeAnnotation annotation = new ZipCodeAnnotation(jCAS);
annotation.setBegin(matcher.start());
annotation.setEnd(matcher.end());
annotation.addToIndexes();
pos = matcher.end();
}
}
}
Finally, we build a component descriptor for the annotator as shown below:
<?xml version="1.0" encoding="UTF-8" ?>
<!-- Source: src/main/java/com/mycompany/myapp/uima/annotators/zipcode/ZipCodeAE.xml -->
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
<frameworkImplementation>org.apache.uima.java</frameworkImplementation>
<primitive>true</primitive>
<annotatorImplementationName>
com.mycompany.myapp.uima.annotators.zipcode.ZipCodeAnnotator
</annotatorImplementationName>
<analysisEngineMetaData>
<name>Zip Code Annotator</name>
<description>Recognize and annotate zip code in text</description>
<version>1.0</version>
<vendor>MyCompany, Inc.</vendor>
<configurationParameters></configurationParameters>
<configurationParameterSettings></configurationParameterSettings>
<typeSystemDescription>
<imports>
<import location="ZipCode.xml"/>
</imports>
</typeSystemDescription>
<typePriorities></typePriorities>
<fsIndexCollection></fsIndexCollection>
<capabilities>
<capability>
<inputs></inputs>
<outputs>
<type>com.mycompany.myapp.uima.annotators.zipcode.ZipCodeAnnotation</type>
</outputs>
<languagesSupported></languagesSupported>
</capability>
</capabilities>
<operationalProperties>
<modifiesCas>true</modifiesCas>
<multipleDeploymentAllowed>true</multipleDeploymentAllowed>
<outputsNewCASes>false</outputsNewCASes>
</operationalProperties>
</analysisEngineMetaData>
<externalResourceDependencies></externalResourceDependencies>
<resourceManagerConfiguration></resourceManagerConfiguration>
</analysisEngineDescription>
For each annotator, I build a unit test to make sure it functions properly. To keep the size of the post down, I will show the unit test for only the aggregate AE I create out of these primitives. The beauty of UIMA is that the Java code to call and run an aggregate AE is the same as that for a primitive AE.
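The pattern matching itself can also be checked without any UIMA machinery. The sketch below is one way to do that with the JDK alone. The class and method names are hypothetical, and the word boundaries (\b) around the pattern are an added assumption that the annotator above does not make; they stop a longer run of digits from being partially matched as a zip code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ZipCodePatternSketch {

  // Five digits with an optional four-digit extension. The \b anchors are
  // an assumption added for this sketch.
  private static final Pattern ZIP =
      Pattern.compile("\\b\\d{5}(-\\d{4})?\\b");

  /** Returns each match formatted as "text begin:end". */
  static List<String> findZipCodes(String text) {
    List<String> found = new ArrayList<String>();
    Matcher m = ZIP.matcher(text);
    // find() with no argument continues from the end of the previous match.
    while (m.find()) {
      found.add(m.group() + " " + m.start() + ":" + m.end());
    }
    return found;
  }

  public static void main(String[] args) {
    System.out.println(
        findZipCodes("Apple, 1 Infinite Loop, Cupertino, CA 95014"));
    System.out.println(findZipCodes("Call 5551234567 today"));
  }
}
```

The begin and end values are the same character offsets that the annotator passes to setBegin() and setEnd().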
City Annotator
The city annotator follows a slightly different approach. Rather than use a regular expression, it uses a list of US cities stored in a database table. When the analysis engine is instantiated, UIMA calls the annotator's initialize() method, which loads the city names into an in-memory Set. The text is passed through a Lucene ShingleFilter, and the generated tokens are matched against the contents of the set. There is an additional tweak to remove city tokens that are subsumed within longer city tokens. For example, if both "Brunswick" and "South Brunswick" are recognized and the first lies within the second, the first token is removed.
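The shingle-and-lookup idea can be sketched with the JDK alone, which may help when reading the annotator code. This is an illustration, not the annotator: it replaces Lucene's tokenizer and ShingleFilter with a hand-rolled loop, and the class name, method name and two-entry city set are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CityShingleSketch {

  private static final int MAX_SHINGLE_SIZE = 3;
  // A "word" is any run of characters that is neither whitespace nor punctuation.
  private static final Pattern WORD = Pattern.compile("[^\\s\\p{Punct}]+");

  /** Returns "text begin:end" for each known city, preferring longer matches. */
  static List<String> findCities(String text, Set<String> cityNames) {
    // Tokenize, remembering the character offsets of every word.
    List<int[]> words = new ArrayList<int[]>();
    Matcher m = WORD.matcher(text);
    while (m.find()) {
      words.add(new int[] {m.start(), m.end()});
    }
    // Build shingles of 1..MAX_SHINGLE_SIZE consecutive words and look them up.
    List<int[]> hits = new ArrayList<int[]>();
    for (int i = 0; i < words.size(); i++) {
      for (int n = 1; n <= MAX_SHINGLE_SIZE && i + n <= words.size(); n++) {
        int begin = words.get(i)[0];
        int end = words.get(i + n - 1)[1];
        String shingle = text.substring(begin, end)
            .replaceAll("[\\s\\p{Punct}]+", " ")
            .toLowerCase(Locale.ROOT);
        if (cityNames.contains(shingle)) {
          hits.add(new int[] {begin, end});
        }
      }
    }
    // Drop any hit that lies inside a strictly longer hit.
    List<String> result = new ArrayList<String>();
    for (int[] a : hits) {
      boolean subsumed = false;
      for (int[] b : hits) {
        boolean contains = b[0] <= a[0] && a[1] <= b[1];
        boolean longer = (b[1] - b[0]) > (a[1] - a[0]);
        if (contains && longer) {
          subsumed = true;
          break;
        }
      }
      if (!subsumed) {
        result.add(text.substring(a[0], a[1]) + " " + a[0] + ":" + a[1]);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Set<String> cities = new HashSet<String>(
        Arrays.asList("brunswick", "south brunswick"));
    System.out.println(findCities("Jane Doe, South Brunswick, NJ", cities));
  }
}
```

Requiring the containing match to be strictly longer means two hits with identical offsets never remove each other, which is a small design choice of this sketch.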
As before, we need an annotation type and an annotator. The XML descriptor for the type is shown below:
<?xml version="1.0" encoding="UTF-8"?>
<!-- Source: src/main/java/com/mycompany/myapp/uima/annotators/city/City.xml -->
<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
<name>City</name>
<description>US Cities</description>
<version>1.0</version>
<vendor>MyCompany, Inc.</vendor>
<types>
<typeDescription>
<name>com.mycompany.myapp.uima.annotators.city.CityAnnotation</name>
<description>US City</description>
<supertypeName>uima.tcas.Annotation</supertypeName>
</typeDescription>
</types>
</typeSystemDescription>
We then run JCasGen to generate the Type and Annotation classes, and write the City Annotator, the code for which is shown below:
// Source: src/main/java/com/mycompany/myapp/uima/annotators/city/CityAnnotator.java
package com.mycompany.myapp.uima.annotators.city;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;
import org.apache.commons.lang.StringUtils;
import org.apache.commons.lang.math.IntRange;
import org.apache.commons.lang.math.Range;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.uima.UimaContext;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.FSIndex;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceInitializationException;
import com.mycompany.myapp.utils.DbUtils;
public class CityAnnotator extends JCasAnnotator_ImplBase {
private static final int MAX_SHINGLE_SIZE = 3;
private Set<String> cityNames = new HashSet<String>();
@Override
public void initialize(UimaContext ctx)
throws ResourceInitializationException {
super.initialize(ctx);
try {
List<Map<String,Object>> rows = DbUtils.queryForList(
"select name from us_cities", null);
for (Map<String,Object> row : rows) {
cityNames.add(StringUtils.lowerCase((String) row.get("name")));
}
} catch (Exception e) {
throw new ResourceInitializationException(e);
}
}
@SuppressWarnings("unchecked")
@Override
public void process(JCas jcas) throws AnalysisEngineProcessException {
String text = jcas.getDocumentText();
text = text.replaceAll("\\p{Punct}", " ");
WhitespaceTokenizer tokenizer = new WhitespaceTokenizer(
new StringReader(text));
TokenStream tokenStream = new LowerCaseFilter(tokenizer);
tokenStream = new ShingleFilter(tokenStream, MAX_SHINGLE_SIZE);
try {
while (tokenStream.incrementToken()) {
TermAttribute term = tokenStream.getAttribute(TermAttribute.class);
OffsetAttribute offset = tokenStream.getAttribute(OffsetAttribute.class);
String shingle = term.term();
if (cityNames.contains(shingle)) {
CityAnnotation annotation = new CityAnnotation(jcas);
annotation.setBegin(offset.startOffset());
annotation.setEnd(offset.endOffset());
annotation.addToIndexes();
}
}
// remove city entities that are subsumed within other
// city entities (such as Concord => North Concord, we
// should prefer the longer match).
// NOTE: this is an O(n**2) operation! If there are a large
// number of annotations, then this can be expensive.
// Subsumed annotations are collected first and removed after
// the iteration, so the index is not modified while it is
// being traversed.
FSIndex index = jcas.getAnnotationIndex(CityAnnotation.type);
Set<CityAnnotation> subsumed = new HashSet<CityAnnotation>();
for (Iterator<CityAnnotation> it1 = index.iterator(); it1.hasNext(); ) {
CityAnnotation ca1 = it1.next();
Range r1 = new IntRange(ca1.getBegin(), ca1.getEnd());
for (Iterator<CityAnnotation> it2 = index.iterator(); it2.hasNext(); ) {
CityAnnotation ca2 = it2.next();
if (ca1.getAddress() == ca2.getAddress()) {
continue;
}
Range r2 = new IntRange(ca2.getBegin(), ca2.getEnd());
if (r1.containsRange(r2)) {
subsumed.add(ca2);
}
}
}
for (CityAnnotation ca : subsumed) {
ca.removeFromIndexes();
}
} catch (IOException e) {
throw new AnalysisEngineProcessException(e);
}
}
}
And finally, there is the XML descriptor for the City AE, which is shown below:
<?xml version="1.0" encoding="UTF-8" ?>
<!-- Source: src/main/java/com/mycompany/myapp/uima/annotators/city/CityAE.xml -->
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
<frameworkImplementation>org.apache.uima.java</frameworkImplementation>
<primitive>true</primitive>
<annotatorImplementationName>
com.mycompany.myapp.uima.annotators.city.CityAnnotator
</annotatorImplementationName>
<analysisEngineMetaData>
<name>US Cities Annotator</name>
<description>Recognize and annotate city names in text</description>
<version>1.0</version>
<vendor>MyCompany, Inc.</vendor>
<configurationParameters></configurationParameters>
<configurationParameterSettings></configurationParameterSettings>
<typeSystemDescription>
<imports>
<import location="City.xml"/>
</imports>
</typeSystemDescription>
<typePriorities></typePriorities>
<fsIndexCollection></fsIndexCollection>
<capabilities>
<capability>
<inputs></inputs>
<outputs>
<type>com.mycompany.myapp.uima.annotators.city.CityAnnotation</type>
</outputs>
<languagesSupported></languagesSupported>
</capability>
</capabilities>
<operationalProperties>
<modifiesCas>true</modifiesCas>
<multipleDeploymentAllowed>true</multipleDeploymentAllowed>
<outputsNewCASes>false</outputsNewCASes>
</operationalProperties>
</analysisEngineMetaData>
<externalResourceDependencies></externalResourceDependencies>
<resourceManagerConfiguration></resourceManagerConfiguration>
</analysisEngineDescription>
State Annotator
The state annotator uses a combination of pattern matching and name-based lookup to find both state abbreviations and full state names. Since the addresses in our (hypothetical) index contain the states as abbreviations, we add the abbreviation as an attribute of the annotated state names. The code first searches for two-letter uppercase patterns (CA, OR, etc.) and looks them up in a list of state abbreviations. It then shingles the input and looks up the shingles in a list of state names. The two lists are generated from a database table that is loaded into in-memory data structures in the initialize() method.
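The two-pass lookup can be sketched outside UIMA as follows. This is a simplified illustration under stated assumptions: the class and method names are made up, the word boundaries around the abbreviation pattern are an addition, and the second pass searches for each state name with a case-insensitive regular expression instead of shingling the input as the annotator does.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StateLookupSketch {

  // The \b anchors are an assumption of this sketch: they keep "MI" from
  // matching inside a longer uppercase word such as "MIT".
  private static final Pattern ABBR = Pattern.compile("\\b[A-Z]{2}\\b");

  /** Returns lines such as: Michigan abbreviation="MI" 28:36 */
  static List<String> findStates(String text, Map<String, String> nameToAbbr) {
    List<String> found = new ArrayList<String>();
    // Pass 1: two-letter uppercase words that are known abbreviations.
    Matcher m = ABBR.matcher(text);
    while (m.find()) {
      if (nameToAbbr.containsValue(m.group())) {
        found.add(format(m.group(), m.group(), m.start(), m.end()));
      }
    }
    // Pass 2: full state names, matched as whole words, ignoring case.
    for (Map.Entry<String, String> e : nameToAbbr.entrySet()) {
      Pattern p = Pattern.compile(
          "\\b" + Pattern.quote(e.getKey()) + "\\b", Pattern.CASE_INSENSITIVE);
      Matcher nm = p.matcher(text);
      while (nm.find()) {
        found.add(format(nm.group(), e.getValue(), nm.start(), nm.end()));
      }
    }
    return found;
  }

  private static String format(String text, String abbr, int begin, int end) {
    return text + " abbreviation=\"" + abbr + "\" " + begin + ":" + end;
  }

  public static void main(String[] args) {
    // Lowercased name -> abbreviation, standing in for the database table.
    Map<String, String> states = new LinkedHashMap<String, String>();
    states.put("michigan", "MI");
    states.put("new york", "NY");
    for (String line : findStates(
        "Dr Goldwater, University of Michigan, Ann Arbor, MI 01234", states)) {
      System.out.println(line);
    }
  }
}
```

Both passes attach the abbreviation to the match, which is what lets a query against an abbreviation-only index work whether the user typed "Michigan" or "MI".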
Here is the XML descriptor for the State type. We have defined the "abbreviation" feature here, which triggers creation of getters and setters in the StateAnnotation POJO.
<?xml version="1.0" encoding="UTF-8"?>
<!-- Source: src/main/java/com/mycompany/myapp/uima/annotators/state/State.xml -->
<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
<name>State</name>
<description>US State</description>
<version>1.0</version>
<vendor>MyCompany, Inc.</vendor>
<types>
<typeDescription>
<name>com.mycompany.myapp.uima.annotators.state.StateAnnotation</name>
<description>US State</description>
<supertypeName>uima.tcas.Annotation</supertypeName>
<features>
<featureDescription>
<name>abbreviation</name>
<description>State Abbreviation</description>
<rangeTypeName>uima.cas.String</rangeTypeName>
</featureDescription>
</features>
</typeDescription>
</types>
</typeSystemDescription>
The code for the State Annotator is shown below.
// Source: src/main/java/com/mycompany/myapp/uima/annotators/state/StateAnnotator.java
package com.mycompany.myapp.uima.annotators.state;
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.lang.StringUtils;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.uima.UimaContext;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceInitializationException;
import com.mycompany.myapp.utils.DbUtils;
public class StateAnnotator extends JCasAnnotator_ImplBase {
private static final Pattern STATE_PATTERN =
Pattern.compile("[A-Z]{2}");
private static final int SHINGLE_SIZE = 2;
private Set<String> stateAbbrs = new HashSet<String>();
private Map<String,String> stateNames = new HashMap<String,String>();
@Override
public void initialize(UimaContext ctx)
throws ResourceInitializationException {
super.initialize(ctx);
try {
List<Map<String,Object>> rows = DbUtils.queryForList(
"select abbr, name from us_states", null);
for (Map<String,Object> row : rows) {
stateAbbrs.add((String) row.get("abbr"));
stateNames.put(
StringUtils.lowerCase((String) row.get("name")),
(String) row.get("abbr"));
}
} catch (Exception e) {
throw new ResourceInitializationException(e);
}
}
@Override
public void process(JCas jcas) throws AnalysisEngineProcessException {
String text = jcas.getDocumentText();
// look for words with two uppercase letters, then check
// against the abbreviation set
Matcher stateAbbrMatcher = STATE_PATTERN.matcher(text);
int pos = 0;
while (stateAbbrMatcher.find(pos)) {
int start = stateAbbrMatcher.start();
int end = stateAbbrMatcher.end();
String abbr = text.substring(start, end);
if (stateAbbrs.contains(abbr)) {
StateAnnotation annotation = new StateAnnotation(jcas);
annotation.setBegin(start);
annotation.setEnd(end);
annotation.setAbbreviation(abbr);
annotation.addToIndexes();
}
pos = stateAbbrMatcher.end();
}
// now look for shingles of 1-2 words (see SHINGLE_SIZE), looking for a match
// against the state names
// preprocess the text so we remove punctuation
text = text.replaceAll("\\p{Punct}", " ");
WhitespaceTokenizer tokenizer = new WhitespaceTokenizer(
new StringReader(text));
TokenStream tokenStream = new LowerCaseFilter(tokenizer);
tokenStream = new ShingleFilter(tokenStream, SHINGLE_SIZE);
try {
while (tokenStream.incrementToken()) {
TermAttribute term = tokenStream.getAttribute(TermAttribute.class);
OffsetAttribute offset = tokenStream.getAttribute(OffsetAttribute.class);
String shingle = term.term();
if (stateNames.containsKey(shingle)) {
StateAnnotation annotation = new StateAnnotation(jcas);
annotation.setBegin(offset.startOffset());
annotation.setEnd(offset.endOffset());
annotation.setAbbreviation(stateNames.get(shingle));
annotation.addToIndexes();
}
}
} catch (IOException e) {
throw new AnalysisEngineProcessException(e);
}
}
}
And finally, the XML descriptor for the State AE. The abbreviation feature has to be declared in this XML as well.
<?xml version="1.0" encoding="UTF-8" ?>
<!-- Source: src/main/java/com/mycompany/myapp/uima/annotators/state/StateAE.xml -->
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
<frameworkImplementation>org.apache.uima.java</frameworkImplementation>
<primitive>true</primitive>
<annotatorImplementationName>
com.mycompany.myapp.uima.annotators.state.StateAnnotator
</annotatorImplementationName>
<analysisEngineMetaData>
<name>US States Annotator</name>
<description>Annotate state abbreviations and names in text</description>
<version>1.0</version>
<vendor>MyCompany, Inc.</vendor>
<configurationParameters></configurationParameters>
<configurationParameterSettings></configurationParameterSettings>
<typeSystemDescription>
<imports>
<import location="State.xml"/>
</imports>
</typeSystemDescription>
<typePriorities></typePriorities>
<fsIndexCollection></fsIndexCollection>
<capabilities>
<capability>
<inputs></inputs>
<outputs>
<type>com.mycompany.myapp.uima.annotators.state.StateAnnotation</type>
<feature>com.mycompany.myapp.uima.annotators.state.StateAnnotation:abbreviation</feature>
</outputs>
<languagesSupported></languagesSupported>
</capability>
</capabilities>
<operationalProperties>
<modifiesCas>true</modifiesCas>
<multipleDeploymentAllowed>true</multipleDeploymentAllowed>
<outputsNewCASes>false</outputsNewCASes>
</operationalProperties>
</analysisEngineMetaData>
<externalResourceDependencies></externalResourceDependencies>
<resourceManagerConfiguration></resourceManagerConfiguration>
</analysisEngineDescription>
Putting it all together - Unit test
As mentioned before, each AE has its own unit test to make sure it is working. Unit tests are especially important in this kind of setup, because a real-life aggregate AE pipeline will consist of a set of cooperating primitive or aggregate AEs. Since there are likely to be interdependencies, unit tests are a way to ensure that new functionality does not break something that worked before the change. Of course, you should use Assert.assertXXX() calls instead of System.out.println() as I am doing here.
So I created an aggregate AE which recognizes tokens in address snippets, and I call it the AddressAE. There is no Java code for this AE, only an XML descriptor that chains the previous primitive AEs together. Here it is:
<?xml version="1.0" encoding="UTF-8" ?>
<!-- Source: src/main/java/com/mycompany/myapp/uima/annotators/aggregates/AddressAE.xml -->
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
<frameworkImplementation>org.apache.uima.java</frameworkImplementation>
<primitive>false</primitive>
<delegateAnalysisEngineSpecifiers>
<delegateAnalysisEngine key="ZipCode">
<import location="../zipcode/ZipCodeAE.xml"/>
</delegateAnalysisEngine>
<delegateAnalysisEngine key="State">
<import location="../state/StateAE.xml"/>
</delegateAnalysisEngine>
<delegateAnalysisEngine key="City">
<import location="../city/CityAE.xml"/>
</delegateAnalysisEngine>
</delegateAnalysisEngineSpecifiers>
<analysisEngineMetaData>
<name>AddressAE</name>
<description>Runs the delegate AEs together</description>
<version>1.0</version>
<vendor>MyCompany, Inc.</vendor>
<flowConstraints>
<fixedFlow>
<node>ZipCode</node>
<node>State</node>
<node>City</node>
</fixedFlow>
</flowConstraints>
<configurationParameters></configurationParameters>
<configurationParameterSettings></configurationParameterSettings>
<fsIndexCollection></fsIndexCollection>
<capabilities>
<capability>
<inputs></inputs>
<outputs>
<type allAnnotatorFeatures="true">
com.mycompany.myapp.uima.annotators.zipcode.ZipCodeAnnotation
</type>
<type allAnnotatorFeatures="true">
com.mycompany.myapp.uima.annotators.state.StateAnnotation
</type>
<type allAnnotatorFeatures="true">
com.mycompany.myapp.uima.annotators.city.CityAnnotation
</type>
</outputs>
<languagesSupported>en</languagesSupported>
</capability>
</capabilities>
<operationalProperties>
<modifiesCas>true</modifiesCas>
<multipleDeploymentAllowed>true</multipleDeploymentAllowed>
<outputsNewCASes>false</outputsNewCASes>
</operationalProperties>
</analysisEngineMetaData>
<externalResourceDependencies></externalResourceDependencies>
<resourceManagerConfiguration></resourceManagerConfiguration>
</analysisEngineDescription>
I create a TestUtils class which exposes three static methods: one that gets an AE from the UIMA framework given its XML descriptor, one that runs the AE, and one that prints the results. This code was derived from JUnit test code in the AlchemyAPI UIMA sandbox component.
// Source: src/test/java/com/mycompany/myapp/uima/TestUtils.java
package com.mycompany.myapp.uima;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import org.apache.commons.lang.StringUtils;
import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.FSIndex;
import org.apache.uima.cas.Feature;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;
import org.apache.uima.resource.ResourceInitializationException;
import org.apache.uima.util.InvalidXMLException;
import org.apache.uima.util.ProcessTrace;
import org.apache.uima.util.ProcessTraceEvent;
import org.apache.uima.util.XMLInputSource;
public class TestUtils {
public static AnalysisEngine getAE(
String descriptor, Map<String,Object> params)
throws IOException, InvalidXMLException,
ResourceInitializationException {
AnalysisEngine ae = null;
try {
XMLInputSource in = new XMLInputSource(descriptor);
AnalysisEngineDescription desc =
UIMAFramework.getXMLParser().
parseAnalysisEngineDescription(in);
if (params != null) {
for (String key : params.keySet()) {
desc.getAnalysisEngineMetaData().
getConfigurationParameterSettings().
setParameterValue(key, params.get(key));
}
}
ae = UIMAFramework.produceAnalysisEngine(desc);
} catch (Exception e) {
throw new ResourceInitializationException(e);
}
return ae;
}
public static JCas runAE(AnalysisEngine ae, String text)
throws AnalysisEngineProcessException,
ResourceInitializationException {
JCas jcas = ae.newJCas();
jcas.setDocumentText(text);
ProcessTrace trace = ae.process(jcas);
for (ProcessTraceEvent evt : trace.getEvents()) {
if (evt != null && evt.getResultMessage() != null &&
evt.getResultMessage().contains("error")) {
throw new AnalysisEngineProcessException(
new Exception(evt.getResultMessage()));
}
}
return jcas;
}
public static void printResults(JCas jcas) {
FSIndex index = jcas.getAnnotationIndex();
for (Iterator<Annotation> it = index.iterator(); it.hasNext(); ) {
Annotation annotation = it.next();
List<Feature> features = new ArrayList<Feature>();
if (annotation.getType().getName().contains("com.mycompany")) {
features = annotation.getType().getFeatures();
}
List<String> fasl = new ArrayList<String>();
for (Feature feature : features) {
if (feature.getName().contains("com.mycompany")) {
String name = feature.getShortName();
String value = annotation.getStringValue(feature);
fasl.add(name + "=\"" + value + "\"");
}
}
System.out.println(
annotation.getType().getShortName() + ": " +
annotation.getCoveredText() + " " +
(fasl.size() > 0 ? StringUtils.join(fasl.iterator(), ",") : "") + " " +
annotation.getBegin() + ":" + annotation.getEnd());
}
System.out.println("==");
}
}
The JUnit test for the AddressAE is simple (and follows the same pattern as the JUnit test cases for the primitive AEs). Here it is:
// Source: src/test/java/com/mycompany/myapp/uima/annotators/aggregates/AddressAnnotatorTest.java
package com.mycompany.myapp.uima.annotators.aggregates;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.jcas.JCas;
import org.junit.Test;
import com.mycompany.myapp.uima.TestUtils;
public class AddressAnnotatorTest {
private final String[] TEST_STRINGS = new String[] {
"Dr Goldwater, University of Michigan, Ann Arbor, MI 01234",
"Microsoft, 1 Microsoft Way, Redmond, WA",
"Apple, 1 Infinite Loop, Cupertino, CA 95014",
"IBM, 1 New Orchard Road, Armonk, NY 10504",
"Google, 1600 Amphitheater Parkway, Mountain View, CA 94043",
"Healthline, 600 3rd Street, San Francisco, CA 94107",
"Jane Doe, Lake Tahoe, California",
"Miss Liberty, Empire State Building, New York, NY"
};
@Test
public void testAddressAE() throws Exception {
AnalysisEngine ae = TestUtils.getAE(
"src/main/java/com/mycompany/myapp/uima/annotators/aggregates/AddressAE.xml",
null);
for (String text : TEST_STRINGS) {
JCas jcas = TestUtils.runAE(ae, text);
TestUtils.printResults(jcas);
}
}
}
And here are the results of this test. UIMA treats each test string as a document, so the first line for each string is its DocumentAnnotation. Below it are the annotations produced by each of the primitive AEs described above. I also report the begin and end offsets along with the annotated text, in case I ever want to produce a Lucene tokenizer out of this. The next step is to create multi-field Lucene queries that query individual fields in the index.
DocumentAnnotation: Dr Goldwater, University of Michigan, Ann Arbor, MI 01234 0:57
StateAnnotation: Michigan abbreviation="MI" 28:36
CityAnnotation: Ann Arbor 38:47
StateAnnotation: MI abbreviation="MI" 49:51
ZipCodeAnnotation: 01234 52:57
DocumentAnnotation: Microsoft, 1 Microsoft Way, Redmond, WA 0:39
CityAnnotation: Redmond 28:35
StateAnnotation: WA abbreviation="WA" 37:39
DocumentAnnotation: Apple, 1 Infinite Loop, Cupertino, CA 95014 0:43
CityAnnotation: Cupertino 24:33
StateAnnotation: CA abbreviation="CA" 35:37
ZipCodeAnnotation: 95014 38:43
DocumentAnnotation: IBM, 1 New Orchard Road, Armonk, NY 10504 0:41
CityAnnotation: Armonk 25:31
StateAnnotation: NY abbreviation="NY" 33:35
ZipCodeAnnotation: 10504 36:41
DocumentAnnotation: Google, 1600 Amphitheater Parkway, Mountain View, CA 94043 0:58
StateAnnotation: CA abbreviation="CA" 50:52
ZipCodeAnnotation: 94043 53:58
DocumentAnnotation: Healthline, 600 3rd Street, San Francisco, CA 94107 0:51
CityAnnotation: San Francisco 28:41
StateAnnotation: CA abbreviation="CA" 43:45
ZipCodeAnnotation: 94107 46:51
DocumentAnnotation: Jane Doe, Lake Tahoe, California 0:32
StateAnnotation: California abbreviation="CA" 22:32
DocumentAnnotation: Miss Liberty, Empire State Building, New York, NY 0:49
StateAnnotation: New York abbreviation="NY" 37:45
CityAnnotation: New York 37:45
StateAnnotation: NY abbreviation="NY" 47:49
Conclusion
As you can see, UIMA provides a nice framework for NER, allowing you to specify which tokens should be protected. All the programmer has to do is specify the algorithms by which the tokens are recognized. The results show that there is still quite a lot of room for improvement, though. For example, Michigan in "University of Michigan" is recognized as a state, which points to the need to recognize universities as entities. Also, "New York" is recognized as both a city and a state, which points to the need for the city and state annotators to be aware of each other (i.e., a city and a state are usually collocated).
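One possible way to act on the collocation idea is sketched below. It is a heuristic of my own choosing for illustration, not something the pipeline above implements: when a city span and a state span cover exactly the same offsets, keep the city reading if another state annotation follows it, and keep the state reading otherwise. The Span class and the "City" and "State" type labels are hypothetical stand-ins for the real annotation classes.

```java
import java.util.ArrayList;
import java.util.List;

public class OverlapResolverSketch {

  /** Hypothetical stand-in for an annotation: a type label plus offsets. */
  static final class Span {
    final String type;
    final int begin;
    final int end;

    Span(String type, int begin, int end) {
      this.type = type;
      this.begin = begin;
      this.end = end;
    }

    @Override
    public String toString() {
      return type + " " + begin + ":" + end;
    }
  }

  /** Returns the spans with City/State duplicates over the same text resolved. */
  static List<Span> resolve(List<Span> spans) {
    List<Span> kept = new ArrayList<Span>();
    for (Span s : spans) {
      if (!isAmbiguous(s, spans)) {
        kept.add(s);
        continue;
      }
      // A state that starts after this span suggests this span is the city.
      boolean stateFollows = false;
      for (Span t : spans) {
        if (t.type.equals("State") && t.begin >= s.end) {
          stateFollows = true;
        }
      }
      boolean isCity = s.type.equals("City");
      if (isCity == stateFollows) {
        kept.add(s);
      }
    }
    return kept;
  }

  /** True if a City and a State span share exactly the same offsets. */
  private static boolean isAmbiguous(Span s, List<Span> spans) {
    for (Span t : spans) {
      boolean cityAndState =
          (s.type.equals("City") && t.type.equals("State"))
              || (s.type.equals("State") && t.type.equals("City"));
      if (cityAndState && t.begin == s.begin && t.end == s.end) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // Offsets taken from the "Miss Liberty ... New York, NY" result above.
    List<Span> spans = new ArrayList<Span>();
    spans.add(new Span("State", 37, 45));
    spans.add(new Span("City", 37, 45));
    spans.add(new Span("State", 47, 49));
    System.out.println(resolve(spans));
  }
}
```

In UIMA terms, logic like this would probably live in a fourth annotator placed after the city and state annotators in the aggregate's fixed flow, but that placement is an assumption I have not tried.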
There is obviously much more to UIMA than this. I plan on taking a look at the UIMA sandbox components, either using some of them as-is, or leveraging the ideas in there to make my code smarter.
Thanks for this. I am new to UIMA and have been trying to get my head around it by writing simple annotators. Do you have any more examples of how UIMA can be used in more complex cases, like detecting sentiments in sentences? Also, what is the benefit of combining UIMA with OpenNLP?
You are welcome Gautam, glad it helped. I have been using UIMA for a toy/skunkworks project for a while now. It's a system for concept mapping text against our medical taxonomy; you can find some examples in my more recent posts.
I initially used OpenNLP to break the input text into sentences. Bit of an overkill I know, but sentence parsing turned out to be not as easy as it sounds. Anyway, OpenNLP offered the best performance/cost characteristics: it handles edge cases which I would have to explicitly handle in my own code, and it's free.
More recently I have used OpenNLP for noun phrase extraction, which makes the concept mapping more accurate.
That's a great post. I wonder if you have the source available so that I can download it directly, without hiccups, and get started with your example code before delving deeper into UIMA.
Thanks, but no, I don't have the source code in downloadable format (actually I don't have the source code anymore; it was deleted during refactoring). This was while I was learning UIMA for the skunkworks thing I am working on presently. I needed a toy application to write some UIMA code to teach myself, and this was it.