In his Porter Stemming Makes Me Rage post, Ted Dziuba points out two classes of terms where the Porter Stemmer (the stemming algorithm used in most Lucene installations, including ours) overstems and causes poor results. At the end of the post, he concludes:
What's the answer to this? If you're a company with millions in VC lottery winnings, you can pay Basistech $100,000 for a 3-year license of their context sensitive stemmer. If you're me, though, you make exclusion lists. Big ones.
Coincidentally, Solr (since version 3.1) provides filters that let you customize stemming by supplying exclusion lists of words the Porter stemmer should ignore. Both Solr filters in this category are driven by flat files, i.e., someone has to list the set of protected words in these input files.
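For readers who only need the flat-file route, the two filter factories in question can be wired into a Solr field type roughly like this (a sketch; the file names are the conventional defaults, not taken from any particular installation):

```xml
<fieldType name="text_protected" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- words listed in protwords.txt bypass the stemmer entirely -->
    <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
    <!-- stemdict.txt maps specific words to hand-picked stems -->
    <filter class="solr.StemmerOverrideFilterFactory" dictionary="stemdict.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
```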
But taking it one step further, what if you could detect exclusions algorithmically? Take abbreviations, for example. In the absence of exclusion lists, abbreviations such as "AIDS" would be stemmed to "aid". The space of possible abbreviations is potentially infinite, however, so we are stuck with the onerous task of manually tracking and maintaining abbreviations as they show up. An alternative is to rely on regex patterns to detect them. Detecting an abbreviation seems quite simple (at least until you hit the corner cases): either a sequence of uppercase letters each followed by a period, or at least 2 uppercase letters followed by more uppercase letters, lowercase letters, digits or hyphens, etc.
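To make the pattern idea concrete, here is a minimal self-contained sketch (the class and method names are mine, not part of the project described below) that applies the two abbreviation regexes to a sentence:

```java
// AbbrevDetectDemo.java - standalone illustration of the two abbreviation
// patterns described above; names here are illustrative only.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AbbrevDetectDemo {
  // dotted form such as "U.S.A." and block form such as "AIDS" or "HIV-1"
  private static final Pattern[] ABBREV_PATTERNS = {
    Pattern.compile("([A-Z]\\.)+"),
    Pattern.compile("[A-Z]{2}[A-Za-z0-9-]*")
  };

  public static List<String> findAbbreviations(String text) {
    List<String> matches = new ArrayList<String>();
    for (Pattern p : ABBREV_PATTERNS) {
      Matcher m = p.matcher(text);
      while (m.find()) {
        matches.add(m.group());
      }
    }
    return matches;
  }

  public static void main(String[] args) {
    // prints [HIV-1, AIDS]
    System.out.println(findAbbreviations("The HIV-1 virus is an AIDS carrier"));
  }
}
```

A stemmer that skipped these matches would leave "AIDS" alone instead of reducing it to "aid".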
Generalizing a bit further, we could have other candidates for exclusion, such as named entities, which may rely on a dictionary match or some other computation rather than a simple pattern match. A UIMA aggregate analysis engine would be a natural fit for such a processing step (where multiple types of entities are recognized and annotated in a body of text). The results of this processing need to be visible to a Lucene Porter Stemmer TokenFilter, so there needs to be a way to wrap all this in a Lucene TokenFilter.
UIMA Side - Creating an Abbreviation Analysis Engine
I have been building up a few components for a larger project, so some names might seem a bit wonky given the limited context presented here. Essentially, the annotator that recognizes abbreviations is a simple pattern-based annotator which creates keyword annotations in the input text. Our analysis engine consists of this single annotator. If, down the road, we want to replace it with an aggregate analysis engine, we just point to a different XML descriptor file.
The keyword annotation represents one (or more) words that are matched by a pattern (either regular expression or dictionary term) in the input text. Its XML descriptor looks like this:
<!-- Source: src/main/java/com/mycompany/myapp/uima/annotators/keyword/Keyword.xml -->
<?xml version="1.0" encoding="UTF-8"?>
<typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
<name>Keyword</name>
<description/>
<version>1.0</version>
<vendor/>
<types>
<typeDescription>
<name>com.mycompany.myapp.uima.annotators.keyword.KeywordAnnotation</name>
<description/>
<supertypeName>uima.tcas.Annotation</supertypeName>
<features>
<featureDescription>
<name>keywordType</name>
<description>The keyword type</description>
<rangeTypeName>uima.cas.String</rangeTypeName>
</featureDescription>
<featureDescription>
<name>keywordValue</name>
<description>The keyword value (can be empty)</description>
<rangeTypeName>uima.cas.String</rangeTypeName>
</featureDescription>
</features>
</typeDescription>
</types>
</typeSystemDescription>
The XML file above is used by JCasGen to generate a KeywordAnnotation.java bean. PatternAnnotator.java uses one or more regex patterns to detect different "kinds" of keywords in the text. In our case, we are looking for abbreviations, so we test for the following two regexes: {"([A-Z]\.)+", "[A-Z]{2}[A-Za-z0-9-]*"}. Here is the code for the PatternAnnotator.
// Source: src/main/java/com/mycompany/myapp/uima/annotators/keyword/PatternAnnotator.java
package com.mycompany.myapp.uima.annotators.keyword;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.lang.StringUtils;
import org.apache.uima.UimaContext;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceInitializationException;
import com.mycompany.myapp.utils.DbUtils;
public class PatternAnnotator extends JCasAnnotator_ImplBase {
private String annotationName;
private List<Pattern> patterns;
private List<String> keywordValues;
@Override
public void initialize(UimaContext ctx)
throws ResourceInitializationException {
super.initialize(ctx);
annotationName = (String) ctx.getConfigParameterValue("annotationName");
List<Map<String,Object>> rows = DbUtils.queryForList(
"select prop_name, prop_val from config where ann_name=?",
new Object[] {annotationName});
patterns = new ArrayList<Pattern>(rows.size());
keywordValues = new ArrayList<String>(rows.size());
for (Map<String,Object> row : rows) {
patterns.add(Pattern.compile((String) row.get("prop_name")));
String patternValue = (String) row.get("prop_val");
if (StringUtils.isEmpty(patternValue)) {
patternValue = "";
}
keywordValues.add(patternValue);
}
}
@Override
public void process(JCas jcas)
throws AnalysisEngineProcessException {
String text = jcas.getDocumentText();
int pcnt = 0;
for (Pattern pattern : patterns) {
Matcher matcher = pattern.matcher(text);
int pos = 0;
while (matcher.find(pos)) {
pos = matcher.end();
KeywordAnnotation annotation = new KeywordAnnotation(jcas);
annotation.setBegin(matcher.start());
annotation.setEnd(pos);
annotation.setKeywordType(annotationName);
// also record the (possibly empty) value configured for this pattern
annotation.setKeywordValue(keywordValues.get(pcnt));
annotation.addToIndexes();
}
pcnt++;
}
}
}
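The annotator above reads its patterns from a config table via DbUtils, a project-specific helper not shown here. Assuming a table with the columns referenced in the query, the seed rows for the abbreviation patterns might look like this (hypothetical data, matching the two regexes and the "pattern_abbr" annotation name used later in the JUnit test):

```sql
-- hypothetical seed rows for the config table read by PatternAnnotator
insert into config (ann_name, prop_name, prop_val)
  values ('pattern_abbr', '([A-Z]\.)+', '');
insert into config (ann_name, prop_name, prop_val)
  values ('pattern_abbr', '[A-Z]{2}[A-Za-z0-9-]*', '');
```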
And the corresponding XML descriptor (that will be used to build the analysis engine) looks like this:
<!-- Source: src/main/java/com/mycompany/myapp/uima/annotators/keyword/PatternAE.xml -->
<?xml version="1.0" encoding="UTF-8"?>
<analysisEngineDescription xmlns="http://uima.apache.org/resourceSpecifier">
<frameworkImplementation>org.apache.uima.java</frameworkImplementation>
<primitive>true</primitive>
<annotatorImplementationName>com.mycompany.myapp.uima.annotators.keyword.PatternAnnotator</annotatorImplementationName>
<analysisEngineMetaData>
<name>PatternAE</name>
<description/>
<version>1.0</version>
<vendor/>
<configurationParameters>
<configurationParameter>
<name>annotationName</name>
<description/>
<type>String</type>
<multiValued>false</multiValued>
<mandatory>true</mandatory>
</configurationParameter>
</configurationParameters>
<configurationParameterSettings/>
<typeSystemDescription>
<imports>
<import location="Keyword.xml"/>
</imports>
</typeSystemDescription>
<typePriorities/>
<fsIndexCollection/>
<capabilities>
<capability>
<inputs/>
<outputs>
<type>com.mycompany.myapp.uima.annotators.keyword.KeywordAnnotation</type>
<feature>com.mycompany.myapp.uima.annotators.keyword.KeywordAnnotation:keywordType</feature>
<feature>com.mycompany.myapp.uima.annotators.keyword.KeywordAnnotation:keywordValue</feature>
</outputs>
<languagesSupported/>
</capability>
</capabilities>
<operationalProperties>
<modifiesCas>true</modifiesCas>
<multipleDeploymentAllowed>true</multipleDeploymentAllowed>
<outputsNewCASes>false</outputsNewCASes>
</operationalProperties>
</analysisEngineMetaData>
<resourceManagerConfiguration/>
</analysisEngineDescription>
Lucene Side - Creating a UIMA Annotation TokenFilter
To hook this up on the Lucene side, I decided to build a custom attribute that holds all the information computed in the UIMA layer. I could have just reused the attributes already available, but I needed the attribute to carry some more information. Here is the code for the UimaAnnotationAttribute interface.
// Source: src/main/java/com/mycompany/myapp/lucene/UimaAnnotationAttribute.java
package com.mycompany.myapp.lucene;
import java.util.Map;
import org.apache.lucene.util.Attribute;
public interface UimaAnnotationAttribute extends Attribute {
public UimaAnnotationType getType();
public void setType(UimaAnnotationType type);
public int getBegin();
public void setBegin(int begin);
public int getEnd();
public void setEnd(int end);
public String getCoveredText();
public void setCoveredText(String coveredText);
public Map<String,String> getProperties();
public void setProperties(Map<String,String> properties);
}
And the corresponding attribute implementation. Note that the implementation extends the abstract class AttributeImpl and also implements the interface defined above.
// Source: src/main/java/com/mycompany/myapp/lucene/UimaAnnotationAttributeImpl.java
package com.mycompany.myapp.lucene;
import java.util.HashMap;
import java.util.Map;
import org.apache.commons.lang.StringUtils;
import org.apache.lucene.util.AttributeImpl;
public class UimaAnnotationAttributeImpl extends AttributeImpl
implements UimaAnnotationAttribute {
private static final long serialVersionUID = -623971024279939739L;
private UimaAnnotationType type;
private int begin;
private int end;
private String coveredText;
private Map<String,String> properties = new HashMap<String,String>();
//////////////// UimaAnnotationAttribute methods /////////////////
@Override
public UimaAnnotationType getType() {
return type;
}
@Override
public void setType(UimaAnnotationType type) {
this.type = type;
}
@Override
public int getBegin() {
return begin;
}
@Override
public void setBegin(int begin) {
this.begin = begin;
}
@Override
public int getEnd() {
return end;
}
@Override
public void setEnd(int end) {
this.end = end;
}
@Override
public String getCoveredText() {
return coveredText;
}
@Override
public void setCoveredText(String coveredText) {
this.coveredText = coveredText;
}
@Override
public Map<String, String> getProperties() {
return properties;
}
@Override
public void setProperties(Map<String, String> properties) {
this.properties = properties;
}
////////////////// AttributeImpl methods /////////////////////
@Override
public void clear() {
this.type = UimaAnnotationType.NONE;
this.begin = 0;
this.end = 0;
this.coveredText = null;
this.properties.clear();
}
@Override
public void copyTo(AttributeImpl target) {
// copy this attribute's state INTO the target, per the AttributeImpl contract
UimaAnnotationAttribute u = (UimaAnnotationAttribute) target;
u.setType(this.type);
u.setBegin(this.begin);
u.setEnd(this.end);
u.setCoveredText(this.coveredText);
u.setProperties(new HashMap<String,String>(this.properties));
}
@Override
public int hashCode() {
return (this.type == null ? 0 : this.type.hashCode()) +
(31 * (this.begin)) +
((31 * 31) * (this.end)) +
((31 * 31 * 31) * (this.coveredText == null ? 0 : this.coveredText.hashCode())) +
((31 * 31 * 31 * 31) * this.properties.hashCode());
}
@Override
public boolean equals(Object other) {
if (other == this) {
return true;
}
if (other instanceof UimaAnnotationAttribute) {
UimaAnnotationAttribute that = (UimaAnnotationAttribute) other;
return ((this.type == that.getType()) &&
(this.begin == that.getBegin()) &&
(this.end == that.getEnd()) &&
(StringUtils.equals(this.coveredText, that.getCoveredText())) &&
(mapEquals(this.properties, that.getProperties())));
}
return false;
}
private boolean mapEquals(Map<String,String> m1,
Map<String,String> m2) {
if (m1 == null && m2 == null) {
return true;
}
if (m1 == null ^ m2 == null) {
return false;
}
if (m1.size() != m2.size()) {
return false;
}
boolean isEqual = true;
for (String key : m1.keySet()) {
String v1 = m1.get(key);
String v2 = m2.get(key);
if (! StringUtils.equals(v1, v2)) {
isEqual = false;
break;
}
}
return isEqual;
}
}
The token filter itself expects the full string to be fed to it (with whitespace and punctuation), so your best bet is to build an analyzer that starts with a KeywordTokenizer. The token filter calls the configured UIMA analysis engine to apply the annotation(s) to the string, then emits the recognized entities (and the remaining unrecognized chunks) as tokens.
// Source: src/main/java/com/mycompany/myapp/lucene/UimaAnnotationTokenFilter.java
package com.mycompany.myapp.lucene;
import java.io.IOException;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.Map;
import org.apache.commons.lang.math.IntRange;
import org.apache.commons.lang.math.Range;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.cas.FSIndex;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;
import com.mycompany.myapp.uima.annotators.keyword.KeywordAnnotation;
import com.mycompany.myapp.utils.UimaUtils;
public class UimaAnnotationTokenFilter extends TokenFilter {
private int annotationType;
private TermAttribute terms;
private UimaAnnotationAttribute attrib;
private AnalysisEngine ae;
private LinkedList<TypedRange> ranges;
protected UimaAnnotationTokenFilter(TokenStream input,
String aeDescriptor, Map<String,Object> params,
int annotationType) {
super(input);
attrib = addAttribute(UimaAnnotationAttribute.class);
terms = addAttribute(TermAttribute.class);
try {
ae = UimaUtils.getAE(aeDescriptor, params);
} catch (Exception e) {
throw new RuntimeException(e);
}
this.annotationType = annotationType;
}
@Override
public boolean incrementToken() throws IOException {
while (input.incrementToken()) {
String text = terms.term();
if (ranges == null) {
ranges = new LinkedList<TypedRange>();
try {
JCas jcas = UimaUtils.runAE(ae, text);
FSIndex<? extends Annotation> fsindex =
jcas.getAnnotationIndex(annotationType);
int pos = 0;
for (Iterator<? extends Annotation> it = fsindex.iterator();
it.hasNext(); ) {
KeywordAnnotation annot = (KeywordAnnotation) it.next();
int begin = annot.getBegin();
int end = annot.getEnd();
if (pos < begin) {
ranges.add(new TypedRange(UimaAnnotationType.NONE,
new IntRange(pos, begin), text.substring(pos, begin)));
}
ranges.add(new TypedRange(UimaAnnotationType.ABBR,
new IntRange(begin, end), annot.getCoveredText()));
pos = end;
}
// take care of trailing part
if (pos < text.length()) {
ranges.add(new TypedRange(UimaAnnotationType.NONE,
new IntRange(pos, text.length()), text.substring(pos)));
}
} catch (Exception e) {
throw new IOException(e);
}
}
}
if (ranges == null || ranges.isEmpty()) {
return false;
}
TypedRange typedRange = ranges.removeFirst();
attrib.setType(typedRange.type);
attrib.setBegin(typedRange.range.getMinimumInteger());
attrib.setEnd(typedRange.range.getMaximumInteger());
attrib.setCoveredText(typedRange.covered);
// also expose the chunk text as the term so downstream filters see it
terms.setTermBuffer(typedRange.covered);
return true;
}
private class TypedRange {
public UimaAnnotationType type;
public Range range;
public String covered;
public TypedRange(UimaAnnotationType type,
Range range, String covered) {
this.type = type;
this.range = range;
this.covered = covered;
}
public String toString() {
return type + "(" + range + ")";
}
}
}
As you can see from the code above, this TokenFilter is not a streaming one, i.e., it reads the entire text in before it starts annotating, and only then sends the text out as tokens.
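The splitting logic is easier to see outside the Lucene plumbing. Here is a standalone sketch (class and method names are mine) of what incrementToken() does with the annotation offsets: annotated spans become ABBR chunks, and the gaps between them become NONE chunks.

```java
// ChunkSplitDemo.java - standalone illustration of the range-splitting
// logic inside incrementToken(). Uses the block-form abbreviation regex
// directly instead of a UIMA annotation index.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ChunkSplitDemo {
  public static List<String> split(String text) {
    Pattern abbrev = Pattern.compile("[A-Z]{2}[A-Za-z0-9-]*");
    List<String> chunks = new ArrayList<String>();
    Matcher m = abbrev.matcher(text);
    int pos = 0;
    while (m.find()) {
      if (pos < m.start()) {                 // unannotated gap before the match
        chunks.add("NONE:" + text.substring(pos, m.start()));
      }
      chunks.add("ABBR:" + m.group());       // the annotated span itself
      pos = m.end();
    }
    if (pos < text.length()) {               // take care of the trailing part
      chunks.add("NONE:" + text.substring(pos));
    }
    return chunks;
  }

  public static void main(String[] args) {
    // prints [NONE:Born in the , ABBR:USA, NONE: I was...]
    System.out.println(split("Born in the USA I was..."));
  }
}
```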
Bringing it together - JUnit test case
The JUnit test below illustrates how to use it. The one thing to keep in mind is that you must pass the entire text into the UIMA TokenFilter for it to do its job, so I went with a KeywordTokenizer at the head of my analyzer chain.
// Source: src/test/java/com/mycompany/myapp/lucene/UimaAnnotationTokenFilterTest.java
package com.mycompany.myapp.lucene;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.KeywordTokenizer;
import org.apache.lucene.analysis.TokenStream;
import org.junit.Test;
import com.mycompany.myapp.uima.annotators.keyword.KeywordAnnotation;
public class UimaAnnotationTokenFilterTest {
@Test
public void testUimaKeywordTokenFilter() throws Exception {
String[] testStrings = new String[] {
"Born in the USA I was...",
"Born in the U.S.A., I was...",
"CSC and IBM are Fortune 500 companies.",
"Linux is embraced by the Oracles and IBMs of the world",
"PET scans are uncomfortable.",
"The HIV-1 virus is an AIDS carrier",
"Unstructured Information Management Application (UIMA) is fantastic!"
};
Analyzer analyzer = getAnalyzer();
for (String s : testStrings) {
System.out.println("input=" + s);
TokenStream tokenStream = analyzer.tokenStream("f", new StringReader(s));
while (tokenStream.incrementToken()) {
UimaAnnotationAttribute attr = (UimaAnnotationAttribute)
tokenStream.getAttribute(UimaAnnotationAttribute.class);
System.out.println("term=" + attr.getCoveredText() +
" (" + attr.getBegin() + "," + attr.getEnd() +
") [" + attr.getType().name() + "]");
}
System.out.println();
}
}
private Analyzer getAnalyzer() {
final Map<String,Object> params = new HashMap<String,Object>();
params.put("annotationName", "pattern_abbr");
return new Analyzer() {
@Override
public TokenStream tokenStream(String fieldName, Reader reader) {
KeywordTokenizer tokenizer = new KeywordTokenizer(reader);
UimaAnnotationTokenFilter filter =
new UimaAnnotationTokenFilter(tokenizer,
"/path/to/descriptors/PatternAE.xml",
params, KeywordAnnotation.type);
return filter;
}
};
}
}
And here is the output of this test:
input=Born in the USA I was...
term=Born in the (0,12) [NONE]
term=USA (12,15) [ABBR]
term= I was... (15,24) [NONE]
input=Born in the U.S.A., I was...
term=Born in the (0,12) [NONE]
term=U.S.A. (12,18) [ABBR]
term=, I was... (18,28) [NONE]
input=CSC and IBM are Fortune 500 companies.
term=CSC (0,3) [ABBR]
term= and (3,8) [NONE]
term=IBM (8,11) [ABBR]
term= are Fortune 500 companies. (11,38) [NONE]
input=Linux is embraced by the Oracles and IBMs of the world
term=Linux is embraced by the Oracles and (0,37) [NONE]
term=IBMs (37,41) [ABBR]
term= of the world (41,54) [NONE]
input=PET scans are uncomfortable.
term=PET (0,3) [ABBR]
term= scans are uncomfortable. (3,28) [NONE]
input=The HIV-1 virus is an AIDS carrier
term=The (0,4) [NONE]
term=HIV-1 (4,9) [ABBR]
term= virus is an (9,22) [NONE]
term=AIDS (22,26) [ABBR]
term= carrier (26,34) [NONE]
input=Unstructured Information Management Application (UIMA) is fantastic!
term=Unstructured Information Management Application ( (0,49) [NONE]
term=UIMA (49,53) [ABBR]
term=) is fantastic! (53,68) [NONE]
As you can see, abbreviations are recognized and marked. The tokens could now be fed to an annotation-aware standard TokenFilter to break the non-annotated chunks into word terms, and then to an annotation-aware Porter stem TokenFilter that stems only the terms which are not annotated.
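That last routing step can be sketched in isolation. In the sketch below, stem() is a deliberately trivial stand-in (it just strips a final "s"), not the real Porter algorithm; the point is only the logic that passes ABBR tokens through untouched.

```java
// SelectiveStemDemo.java - sketch of annotation-aware stemming: tokens
// tagged ABBR bypass the stemmer, everything else gets stemmed. The
// stem() method is a placeholder, NOT the Porter algorithm.
import java.util.ArrayList;
import java.util.List;

public class SelectiveStemDemo {
  static String stem(String word) {          // trivial stand-in stemmer
    return (word.length() > 3 && word.endsWith("s"))
        ? word.substring(0, word.length() - 1) : word;
  }

  // takes tokens in "TYPE:term" form, as produced in the earlier sketch
  public static List<String> stemTokens(List<String> typedTokens) {
    List<String> out = new ArrayList<String>();
    for (String t : typedTokens) {
      String[] parts = t.split(":", 2);
      if ("ABBR".equals(parts[0])) {
        out.add(parts[1]);                   // protected: pass through as-is
      } else {
        out.add(stem(parts[1]));
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // prints [AIDS, carrier]
    System.out.println(stemTokens(java.util.Arrays.asList(
        "ABBR:AIDS", "NONE:carriers")));
  }
}
```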
Comments:
Hi Sujit,
thank you very much for your interesting post. This looks very promising as I am looking for something similar. Unfortunately, with all the solutions I tried before, I had the problem that the highlighter always showed the original input instead of the extracted text. Does your approach work well with the highlighter? Can you highlight something like all Named Entities recognized as persons?
Best regards,
Hannes
Hi Hannes, thanks for the kind words.
I haven't tried my output with the highlighter yet but I suspect it will also show the original text. The replaced text is a property of the annotation, so I would expect it to appear as such in highlighter output.