Last week, I wrote about a UIMA Aggregate Analysis Engine (AE) that annotates keywords in a body of text, optionally inserting synonyms, using a combination of pattern matching and dictionary lookups. The idea is that this analysis will be done on text on its way into a Lucene index. So this week, I describe the Lucene Analyzer chain that I built around the AE I described last week.
A picture is worth a thousand words, so here is one that shows what I am (or will be soon, in much greater detail) talking about.
As you can imagine, most of the work happens in the UimaAETokenizer. The tokenizer is a buffering (non-streaming) Tokenizer, i.e., the entire text is read from the Reader and analyzed by the UIMA AE, and the individual tokens are then returned on successive calls to its incrementToken() method. I decided to use the new (to me) AttributeSource.State object to keep track of the tokenizer's state between calls to incrementToken() (I found out about it while digging through the Synonym filter example in the LIA2 book).
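To make the buffering idea concrete, here is a minimal sketch that uses only the JDK. It is not the UimaAETokenizer: the class name BufferingTokenizerSketch and its next() and tokenize() methods are made up for illustration, and plain whitespace splitting stands in for the UIMA analysis step. The point is only the control flow: the first call reads and analyzes everything, and later calls hand out buffered tokens one at a time.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical stand-in for a buffering tokenizer; not a Lucene class. */
public class BufferingTokenizerSketch {
  private final Reader reader;
  private final Deque<String> buffer = new ArrayDeque<String>();
  private boolean analyzed = false;

  public BufferingTokenizerSketch(Reader reader) {
    this.reader = reader;
  }

  /** Returns the next token, or null once the buffer is exhausted. */
  public String next() {
    if (!analyzed) {
      // first call: slurp the whole input and "analyze" it in one go
      StringBuilder text = new StringBuilder();
      char[] chunk = new char[1024];
      try {
        int n;
        while ((n = reader.read(chunk)) != -1) {
          text.append(chunk, 0, n);
        }
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
      // whitespace splitting stands in for the UIMA analysis step
      for (String tok : text.toString().trim().split("\\s+")) {
        if (!tok.isEmpty()) {
          buffer.add(tok);
        }
      }
      analyzed = true;
    }
    // later calls: just hand out the buffered tokens
    return buffer.poll();
  }

  /** Convenience helper that drains the tokenizer into a list. */
  public static List<String> tokenize(String text) {
    BufferingTokenizerSketch t =
      new BufferingTokenizerSketch(new StringReader(text));
    List<String> out = new ArrayList<String>();
    for (String tok = t.next(); tok != null; tok = t.next()) {
      out.add(tok);
    }
    return out;
  }
}
```

Calling BufferingTokenizerSketch.tokenize("Born in the USA") yields the four words in order, and an empty input yields an empty list. The trade-off of this design is memory: the whole text is held at once, which is fine for short documents but not for arbitrarily large streams.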
After the (UIMA) analysis, the annotated tokens are marked as keywords, and any transformed values for the annotation are set into the SynonymMap (for use by the synonym filter, next in the chain). Text that is not annotated is split up (by punctuation and whitespace) and returned as plain Lucene Term (or CharTerm since Lucene 3.x) tokens. Here is the code for the Tokenizer class.
// Source: src/main/java/com/mycompany/tgni/lucene/UimaAETokenizer.java
package com.mycompany.tgni.lucene;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.commons.io.IOUtils;
import org.apache.commons.lang.StringUtils;
import org.apache.commons.lang.math.IntRange;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.cas.FSIndex;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;
import com.mycompany.tgni.uima.annotators.keyword.KeywordAnnotation;
import com.mycompany.tgni.uima.utils.UimaUtils;
/**
* Tokenizes a block of text from the passed in reader and
* annotates it with the specified UIMA Analysis Engine. Terms
* in the text that are not annotated by the Analysis Engine
* are split on whitespace and punctuation. Attributes available:
* CharTermAttribute, OffsetAttribute, PositionIncrementAttribute
* and KeywordAttribute.
*/
public final class UimaAETokenizer extends Tokenizer {
private final CharTermAttribute termAttr;
private final OffsetAttribute offsetAttr;
private final PositionIncrementAttribute posIncAttr;
private final KeywordAttribute keywordAttr;
private AttributeSource.State current;
private AnalysisEngine ae;
private SynonymMap synmap;
private LinkedList<IntRange> rangeList;
private Map<IntRange,Object> rangeMap;
private Reader reader = null;
private boolean eof = false;
private static final Pattern PUNCT_OR_SPACE_PATTERN =
Pattern.compile("[\\p{Punct}\\s+]");
private static final String SYN_DELIMITER = "__";
public UimaAETokenizer(Reader input,
String aeDescriptor, Map<String,Object> aeParams,
SynonymMap synonymMap) {
super(input);
// validate inputs
try {
ae = UimaUtils.getAE(aeDescriptor, aeParams);
} catch (Exception e) {
throw new RuntimeException(e);
}
if (synonymMap == null) {
throw new RuntimeException(
"Need valid (non-null) reference to a SynonymMap");
}
synmap = synonymMap;
reader = new BufferedReader(input);
// set available attributes
termAttr = addAttribute(CharTermAttribute.class);
offsetAttr = addAttribute(OffsetAttribute.class);
posIncAttr = addAttribute(PositionIncrementAttribute.class);
keywordAttr = addAttribute(KeywordAttribute.class);
// initialize variables
rangeList = new LinkedList<IntRange>();
rangeMap = new HashMap<IntRange,Object>();
}
@Override
public boolean incrementToken() throws IOException {
if (rangeList.size() > 0) {
populateAttributes();
current = captureState();
restoreState(current);
if (rangeList.size() == 0) {
eof = true;
}
return true;
}
// if no more tokens, return
if (eof) {
return false;
}
// analyze input and buffer tokens
clearAttributes();
rangeList.clear();
rangeMap.clear();
try {
List<String> texts = IOUtils.readLines(reader);
for (String text : texts) {
JCas jcas = UimaUtils.runAE(ae, text);
FSIndex<? extends Annotation> fsindex =
jcas.getAnnotationIndex(KeywordAnnotation.type);
int pos = 0;
for (Iterator<? extends Annotation> it = fsindex.iterator();
it.hasNext(); ) {
KeywordAnnotation annotation = (KeywordAnnotation) it.next();
int begin = annotation.getBegin();
int end = annotation.getEnd();
if (pos < begin) {
// this is plain text, split this up by whitespace
// into individual terms
addNonAnnotatedTerms(pos, text.substring(pos, begin));
}
IntRange range = new IntRange(begin, end);
mergeAnnotationInfo(range, annotation);
pos = end;
}
if (pos < text.length()) {
addNonAnnotatedTerms(pos, text.substring(pos));
}
current = captureState();
}
} catch (Exception e) {
throw new IOException(e);
}
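// return the first term from rangeList; if the analysis
// produced no terms at all, the stream is exhausted
if (rangeList.isEmpty()) {
eof = true;
return false;
}
populateAttributes();
if (rangeList.isEmpty()) {
// single-term input: do not re-read the spent reader next time
eof = true;
}
return true;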
}
private void populateAttributes() {
if (rangeList.size() == 0) {
return;
}
// return buffered tokens one by one. If the current
// token has an associated KeywordAnnotation, then mark
// it as a keyword in addition to setting the term
IntRange range = rangeList.removeFirst();
if (rangeMap.containsKey(range)) {
Object rangeValue = rangeMap.get(range);
if (rangeValue instanceof KeywordAnnotation) {
// this is a UIMA Keyword annotation
KeywordAnnotation annotation = (KeywordAnnotation) rangeValue;
String term = annotation.getCoveredText();
String transformedValue = annotation.getTransformedValue();
if (StringUtils.isNotEmpty(transformedValue)) {
List<Token> tokens = SynonymMap.makeTokens(
Arrays.asList(StringUtils.split(
transformedValue, SYN_DELIMITER)));
// rather than add all the synonym tokens in a single
// add, we have to do this separately to ensure that
// the position increment attribute is set to 0 for
// all the synonyms, not just the first one
for (Token token : tokens) {
synmap.add(Arrays.asList(term), Arrays.asList(token), true, true);
}
}
offsetAttr.setOffset(annotation.getBegin(),
annotation.getEnd());
termAttr.copyBuffer(term.toCharArray(), 0, term.length());
termAttr.setLength(term.length());
keywordAttr.setKeyword(true);
posIncAttr.setPositionIncrement(1);
} else {
// this is a plain text term
String term = (String) rangeValue;
termAttr.copyBuffer(term.toCharArray(), 0, term.length());
termAttr.setLength(term.length());
offsetAttr.setOffset(range.getMinimumInteger(),
range.getMaximumInteger());
keywordAttr.setKeyword(false);
posIncAttr.setPositionIncrement(1);
}
}
}
private void addNonAnnotatedTerms(int pos, String snippet) {
int start = 0;
Matcher m = PUNCT_OR_SPACE_PATTERN.matcher(snippet);
while (m.find(start)) {
int begin = m.start();
int end = m.end();
if (start == begin) {
// this is a punctuation character, skip it
start = end;
continue;
}
IntRange range = new IntRange(pos + start, pos + begin);
rangeList.add(range);
rangeMap.put(range, snippet.substring(start, begin));
start = end;
}
// take care of trailing string in snippet
if (start < snippet.length()) {
IntRange range = new IntRange(pos + start, pos + snippet.length());
rangeList.add(range);
rangeMap.put(range, snippet.substring(start));
}
}
private void mergeAnnotationInfo(IntRange range,
KeywordAnnotation annotation) {
// verify if the range has not already been recognized.
// this is possible if multiple AEs recognize and act
// on the same pattern/dictionary entry
if (rangeMap.containsKey(range) &&
rangeMap.get(range) instanceof KeywordAnnotation) {
KeywordAnnotation prevAnnotation =
(KeywordAnnotation) rangeMap.get(range);
Set<String> synonyms = new HashSet<String>();
if (StringUtils.isNotEmpty(
prevAnnotation.getTransformedValue())) {
synonyms.addAll(Arrays.asList(StringUtils.split(
prevAnnotation.getTransformedValue(), SYN_DELIMITER)));
}
if (StringUtils.isNotEmpty(annotation.getTransformedValue())) {
synonyms.addAll(Arrays.asList(StringUtils.split(
annotation.getTransformedValue(), SYN_DELIMITER)));
}
annotation.setTransformedValue(StringUtils.join(
synonyms.iterator(), SYN_DELIMITER));
rangeMap.put(range, annotation);
} else {
rangeList.add(range);
rangeMap.put(range, annotation);
}
}
}
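The offset bookkeeping in the tokenizer is easier to follow in isolation. The sketch below is a JDK-only approximation of the main loop plus addNonAnnotatedTerms(): annotated spans are kept whole and flagged, and the gaps between them are split on punctuation and whitespace, with every offset shifted back into the coordinates of the full text. The class GapSplitterSketch, the int[][] span representation, and the "term[begin,end]" output format (with a trailing * for annotated terms) are all assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical, JDK-only model of the span/gap logic; not the real tokenizer. */
public class GapSplitterSketch {
  private static final Pattern PUNCT_OR_SPACE =
    Pattern.compile("[\\p{Punct}\\s]");

  /** Formats a term with its absolute offsets; "*" marks an annotated term. */
  static String format(String term, int begin, int end, boolean keyword) {
    return term + "[" + begin + "," + end + "]" + (keyword ? "*" : "");
  }

  /** Splits snippet on punctuation/whitespace; offsets are shifted by pos. */
  static void addPlainTerms(int pos, String snippet, List<String> out) {
    int start = 0;
    Matcher m = PUNCT_OR_SPACE.matcher(snippet);
    while (m.find(start)) {
      if (m.start() > start) {
        // text between two delimiters is a plain term
        out.add(format(snippet.substring(start, m.start()),
          pos + start, pos + m.start(), false));
      }
      start = m.end();
    }
    // take care of the trailing string in the snippet
    if (start < snippet.length()) {
      out.add(format(snippet.substring(start),
        pos + start, pos + snippet.length(), false));
    }
  }

  /** spans: sorted, non-overlapping {begin, end} pairs marking annotations. */
  public static List<String> tokenize(String text, int[][] spans) {
    List<String> out = new ArrayList<String>();
    int pos = 0;
    for (int[] span : spans) {
      if (pos < span[0]) {
        addPlainTerms(pos, text.substring(pos, span[0]), out);
      }
      out.add(format(text.substring(span[0], span[1]),
        span[0], span[1], true));
      pos = span[1];
    }
    if (pos < text.length()) {
      addPlainTerms(pos, text.substring(pos), out);
    }
    return out;
  }
}
```

For the text "The HIV-1 virus" with a single span {4, 9}, this returns The[0,3], HIV-1[4,9]* and virus[10,15]. Note that the hyphen inside HIV-1 survives only because the span is emitted whole; the same characters in un-annotated text would be split.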
The UimaUtils class is a simple utilities class that wraps common UIMA operations such as building an Analysis Engine from a descriptor, running an Analysis Engine, etc. Its code is shown below:
// Source: ./src/main/java/com/mycompany/tgni/uima/utils/UimaUtils.java
package com.mycompany.tgni.uima.utils;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import org.apache.commons.lang.StringUtils;
import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.FSIndex;
import org.apache.uima.cas.Feature;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;
import org.apache.uima.resource.ResourceInitializationException;
import org.apache.uima.util.InvalidXMLException;
import org.apache.uima.util.ProcessTrace;
import org.apache.uima.util.ProcessTraceEvent;
import org.apache.uima.util.XMLInputSource;
/**
* Largely copied from the TestUtils class in UIMA Sandbox component
* AlchemyAPIAnnotator.
*/
public class UimaUtils {
public static AnalysisEngine getAE(
String descriptor, Map<String,Object> params)
throws IOException, InvalidXMLException,
ResourceInitializationException {
AnalysisEngine ae = null;
try {
XMLInputSource in = new XMLInputSource(descriptor);
AnalysisEngineDescription desc =
UIMAFramework.getXMLParser().
parseAnalysisEngineDescription(in);
if (params != null) {
for (String key : params.keySet()) {
desc.getAnalysisEngineMetaData().
getConfigurationParameterSettings().
setParameterValue(key, params.get(key));
}
}
ae = UIMAFramework.produceAnalysisEngine(desc);
} catch (Exception e) {
throw new ResourceInitializationException(e);
}
return ae;
}
public static JCas runAE(AnalysisEngine ae, String text)
throws AnalysisEngineProcessException,
ResourceInitializationException {
JCas jcas = ae.newJCas();
jcas.setDocumentText(text);
ProcessTrace trace = ae.process(jcas);
for (ProcessTraceEvent evt : trace.getEvents()) {
if (evt != null && evt.getResultMessage() != null &&
evt.getResultMessage().contains("error")) {
throw new AnalysisEngineProcessException(
new Exception(evt.getResultMessage()));
}
}
return jcas;
}
public static void printResults(JCas jcas) {
FSIndex<Annotation> index = jcas.getAnnotationIndex();
for (Iterator<Annotation> it = index.iterator(); it.hasNext(); ) {
Annotation annotation = it.next();
List<Feature> features = new ArrayList<Feature>();
if (annotation.getType().getName().contains("com.mycompany")) {
features = annotation.getType().getFeatures();
}
List<String> fasl = new ArrayList<String>();
for (Feature feature : features) {
if (feature.getName().contains("com.mycompany")) {
String name = feature.getShortName();
String value = annotation.getStringValue(feature);
fasl.add(name + "=\"" + value + "\"");
}
}
System.out.println(
annotation.getType().getShortName() + ": " +
annotation.getCoveredText() + " " +
(fasl.size() > 0 ? StringUtils.join(fasl.iterator(), ",") : "") + " " +
annotation.getBegin() + ":" + annotation.getEnd());
}
System.out.println("==");
}
}
The next filter in the chain is the (Lucene-provided, since 3.0 I think) SynonymFilter. It needs a reference to a SynonymMap. An empty SynonymMap was provided to the UimaAETokenizer, which populated it, and now it is available for use by the SynonymFilter. And yes, I do realize that this sort of pass-by-reference stuff (having the tokenizer modify an object that its caller passed in) is frowned upon in the Java world, but at least in this case, it keeps the code simple and easy to understand.
At the end of this step, the SynonymFilter will have emitted the synonym terms at the same offsets as the original term, with their position increment set to 0.
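As a rough illustration of what "same offsets, position increment 0" means, here is a small JDK-only sketch. It does not use Lucene's SynonymFilter or SynonymMap; the Tok holder, the expand() method, and the plain Map of synonyms are hypothetical stand-ins chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical model of synonym stacking; not Lucene's SynonymFilter. */
public class SynonymStackingSketch {

  /** Minimal token holder standing in for Lucene's term attributes. */
  public static final class Tok {
    public final String term;
    public final int start;
    public final int end;
    public final int posInc;

    public Tok(String term, int start, int end, int posInc) {
      this.term = term;
      this.start = start;
      this.end = end;
      this.posInc = posInc;
    }

    @Override
    public String toString() {
      return term + "@" + start + "-" + end + "+" + posInc;
    }
  }

  /** Emits each token followed by its synonyms, stacked at the same position. */
  public static List<Tok> expand(List<Tok> in,
      Map<String, List<String>> synonyms) {
    List<Tok> out = new ArrayList<Tok>();
    for (Tok tok : in) {
      out.add(tok);
      List<String> syns = synonyms.get(tok.term);
      if (syns == null) {
        continue;
      }
      for (String syn : syns) {
        // same offsets as the original, position increment 0
        out.add(new Tok(syn, tok.start, tok.end, 0));
      }
    }
    return out;
  }
}
```

Given the tokens AIDS (offsets 22 to 26, increment 3) and carrier (27 to 34, increment 1) and a map from AIDS to its expansion, the result is AIDS@22-26+3, Acquired Immunity Deficiency Syndrome@22-26+0, carrier@27-34+1. Because the synonym occupies the same position as the original, a phrase query should match whichever variant the user types.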
The next two filters are the LowerCaseFilter and StopFilter, which lowercase the tokens and remove stopwords respectively. I wanted them to skip the tokens generated from the UIMA annotations in my UimaAETokenizer, similar to how the PorterStemFilter operates in Lucene 4.0. Specifically, with PorterStemFilter, it is possible to mark certain terms as keywords using KeywordAttribute.setKeyword(true), and these terms will be skipped for stemming.
However, this functionality does not exist in Lucene (yet) for these two filters, so I have opened a JIRA (LUCENE-3236) with the necessary patches for this. Hopefully it will be incorporated into Lucene at some point. In the interim, you can use the versions below, which are functionally identical to the patched versions I provided in the JIRA.
// Source: src/main/java/com/mycompany/tgni/lucene/LowerCaseFilter.java
package com.mycompany.tgni.lucene;
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;
import org.apache.lucene.analysis.util.CharacterUtils;
import org.apache.lucene.util.Version;
public final class LowerCaseFilter extends TokenFilter {
private final CharacterUtils charUtils;
private final CharTermAttribute termAtt =
addAttribute(CharTermAttribute.class);
private final KeywordAttribute keywordAtt =
addAttribute(KeywordAttribute.class);
private boolean ignoreKeyword = false;
/**
* Extra constructor to trigger new keyword-aware behavior.
*/
public LowerCaseFilter(Version matchVersion, TokenStream in,
boolean ignoreKeyword) {
super(in);
charUtils = CharacterUtils.getInstance(matchVersion);
this.ignoreKeyword = ignoreKeyword;
}
/**
* Old ctor.
*/
public LowerCaseFilter(Version matchVersion, TokenStream in) {
this(matchVersion, in, false);
}
@Override
public final boolean incrementToken() throws IOException {
if (input.incrementToken()) {
if (ignoreKeyword && keywordAtt.isKeyword()) {
// do nothing
return true;
}
final char[] buffer = termAtt.buffer();
final int length = termAtt.length();
for (int i = 0; i < length;) {
i += Character.toChars(
Character.toLowerCase(charUtils.codePointAt(buffer, i)), buffer, i);
}
return true;
} else
return false;
}
}
The only real changes are an extra constructor to trigger keyword-aware behavior, the addition of the KeywordAttribute (so this filter is now keyword aware) and a little if condition in the incrementToken() method to short-circuit the lowercasing in case the term is marked as a keyword.
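The two ideas in that method, the keyword short circuit and the in-place lowercasing by code point, can be tried out without Lucene. The sketch below is an approximation under stated assumptions: KeywordAwareLowerCaseSketch and its lowerCase() method are hypothetical names, the keyword flag is passed in as a boolean instead of being read from a KeywordAttribute, and the JDK's Character.codePointAt() replaces Lucene's CharacterUtils.

```java
/** Hypothetical, JDK-only model of the keyword-aware lowercasing step. */
public class KeywordAwareLowerCaseSketch {

  /**
   * Lowercases term unless it is a keyword and keywords are to be ignored.
   * isKeyword stands in for KeywordAttribute.isKeyword().
   */
  public static String lowerCase(String term, boolean isKeyword,
      boolean ignoreKeyword) {
    if (ignoreKeyword && isKeyword) {
      // short circuit: leave keyword terms untouched
      return term;
    }
    char[] buffer = term.toCharArray();
    for (int i = 0; i < buffer.length; ) {
      // work on whole code points so surrogate pairs are not split;
      // like the filter, this assumes the lowercase form needs the
      // same number of chars as the original
      int cp = Character.codePointAt(buffer, i);
      i += Character.toChars(Character.toLowerCase(cp), buffer, i);
    }
    return new String(buffer);
  }
}
```

With ignoreKeyword set to true, lowerCase("IBM", true, true) returns "IBM" unchanged while lowerCase("Fortune", false, true) returns "fortune". With ignoreKeyword set to false the old behavior is preserved and both terms are lowercased.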
Similarly, the StopFilter below is also almost identical to the stock Lucene StopFilter. Like the custom version of the LowerCaseFilter, the only changes are the extra constructor (to trigger keyword-aware behavior), the addition of a KeywordAttribute to its list of recognized attributes and an extra condition in the (protected) accept() method.
// Source: src/main/java/com/mycompany/tgni/lucene/StopFilter.java
package com.mycompany.tgni.lucene;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.analysis.util.FilteringTokenFilter;
import org.apache.lucene.util.Version;
public final class StopFilter extends FilteringTokenFilter {
private final CharArraySet stopWords;
private final CharTermAttribute termAtt =
addAttribute(CharTermAttribute.class);
private final KeywordAttribute keywordAtt =
addAttribute(KeywordAttribute.class);
private boolean ignoreKeyword = false;
/**
* New ctor to trigger keyword-aware behavior.
*/
public StopFilter(Version matchVersion, TokenStream input, Set<?> stopWords,
boolean ignoreCase, boolean ignoreKeyword) {
super(true, input);
this.stopWords = stopWords instanceof CharArraySet ?
(CharArraySet) stopWords :
new CharArraySet(matchVersion, stopWords, ignoreCase);
this.ignoreKeyword = ignoreKeyword;
}
/**
* Old ctor for current behavior.
*/
public StopFilter(Version matchVersion, TokenStream input, Set<?> stopWords,
boolean ignoreCase) {
this(matchVersion, input, stopWords, ignoreCase, false);
}
public StopFilter(Version matchVersion, TokenStream in, Set<?> stopWords) {
this(matchVersion, in, stopWords, false);
}
public static Set<Object> makeStopSet(Version matchVersion,
String... stopWords) {
return makeStopSet(matchVersion, stopWords, false);
}
public static Set<Object> makeStopSet(Version matchVersion,
List<?> stopWords) {
return makeStopSet(matchVersion, stopWords, false);
}
public static Set<Object> makeStopSet(Version matchVersion,
String[] stopWords, boolean ignoreCase) {
CharArraySet stopSet = new CharArraySet(matchVersion, stopWords.length,
ignoreCase);
stopSet.addAll(Arrays.asList(stopWords));
return stopSet;
}
public static Set<Object> makeStopSet(Version matchVersion, List<?> stopWords,
boolean ignoreCase){
CharArraySet stopSet = new CharArraySet(matchVersion, stopWords.size(),
ignoreCase);
stopSet.addAll(stopWords);
return stopSet;
}
@Override
protected boolean accept() throws IOException {
return (ignoreKeyword && keywordAtt.isKeyword()) ||
!stopWords.contains(termAtt.buffer(), 0, termAtt.length());
}
}
And finally, my analyzer contains the PorterStemFilter, which already recognizes keywords, so no changes needed there.
To test this analyzer, I wrote a little JUnit test that takes the snippets of text that I used to test my UIMA AEs before.
// Source: src/test/java/com/mycompany/tgni/lucene/UimaAETokenizerTest.java
package com.mycompany.tgni.lucene;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import org.apache.commons.lang.StringUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.synonym.SynonymFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.Version;
import org.junit.Test;
import com.mycompany.tgni.uima.annotators.keyword.KeywordAnnotatorsTest;
public class UimaAETokenizerTest {
private Analyzer analyzer;
@Test
public void testUimaKeywordTokenizer() throws Exception {
analyzer = getAnalyzer();
for (String s : KeywordAnnotatorsTest.TEST_STRINGS) {
System.out.println("input=" + s);
TokenStream tokenStream = analyzer.tokenStream("f", new StringReader(s));
while (tokenStream.incrementToken()) {
CharTermAttribute termAttr =
tokenStream.getAttribute(CharTermAttribute.class);
OffsetAttribute offsetAttr =
tokenStream.getAttribute(OffsetAttribute.class);
System.out.print("output term=" +
new String(termAttr.buffer(), 0, termAttr.length()) +
", offset=" + offsetAttr.startOffset() + "," +
offsetAttr.endOffset());
KeywordAttribute keywordAttr =
tokenStream.getAttribute(KeywordAttribute.class);
System.out.print(", keyword?" + keywordAttr.isKeyword());
PositionIncrementAttribute posIncAttr =
tokenStream.getAttribute(PositionIncrementAttribute.class);
System.out.print(", posinc=" + posIncAttr.getPositionIncrement());
System.out.println();
}
}
}
private Analyzer getAnalyzer() throws Exception {
if (analyzer == null) {
List<String> stopwords = new ArrayList<String>();
BufferedReader swreader = new BufferedReader(
new FileReader(new File(
"src/main/resources/stopwords.txt")));
String line;
while ((line = swreader.readLine()) != null) {
if (StringUtils.isEmpty(line) || line.startsWith("#")) {
continue;
}
stopwords.add(StringUtils.trim(line));
}
swreader.close();
final Set<?> stopset = StopFilter.makeStopSet(
Version.LUCENE_40, stopwords);
analyzer = new Analyzer() {
@Override
public TokenStream tokenStream(String fieldName, Reader reader) {
SynonymMap synonymsMap = new SynonymMap();
TokenStream input = new UimaAETokenizer(reader,
"src/main/resources/descriptors/TaxonomyMappingAE.xml",
null, synonymsMap);
input = new SynonymFilter(input, synonymsMap);
input = new LowerCaseFilter(Version.LUCENE_40, input, true);
input = new StopFilter(Version.LUCENE_40, input, stopset, false, true);
input = new PorterStemFilter(input);
return input;
}
};
}
return analyzer;
}
}
The output (edited for readability) of the test shows that the analyzer works as expected. You can see the effects of each of the filters in our analyzer in the different examples below.
input=Born in the USA I was...
output term=born, offset=0,4, keyword?false, posinc=1
output term=USA, offset=12,15, keyword?true, posinc=3
output term=i, offset=16,17, keyword?false, posinc=1
input=CSC and IBM are Fortune 500 companies.
output term=CSC, offset=0,3, keyword?true, posinc=1
output term=IBM, offset=8,11, keyword?true, posinc=2
output term=fortun, offset=16,23, keyword?false, posinc=2
output term=500, offset=24,27, keyword?false, posinc=1
output term=compani, offset=28,37, keyword?false, posinc=1
input=Linux is embraced by the Oracles and IBMs of the world
output term=linux, offset=0,5, keyword?false, posinc=1
output term=embrac, offset=9,17, keyword?false, posinc=2
output term=oracl, offset=25,32, keyword?false, posinc=3
output term=IBMs, offset=37,41, keyword?true, posinc=2
output term=IBM, offset=37,41, keyword?true, posinc=0
output term=world, offset=49,54, keyword?false, posinc=3
input=PET scans are uncomfortable.
output term=PET, offset=0,3, keyword?true, posinc=1
output term=scan, offset=4,9, keyword?false, posinc=1
output term=uncomfort, offset=14,27, keyword?false, posinc=2
input=The HIV-1 virus is an AIDS carrier
output term=HIV-1, offset=4,9, keyword?true, posinc=2
output term=HIV 1, offset=4,9, keyword?true, posinc=0
output term=HIV1, offset=4,9, keyword?true, posinc=0
output term=viru, offset=10,15, keyword?false, posinc=1
output term=AIDS, offset=22,26, keyword?true, posinc=3
output term=Acquired Immunity Deficiency Syndrome, offset=22,26, keyword?true, posinc=0
output term=carrier, offset=27,34, keyword?false, posinc=1
input=Unstructured Information Management Application (UIMA) is fantastic!
output term=unstructur, offset=0,12, keyword?false, posinc=1
output term=inform, offset=13,24, keyword?false, posinc=1
output term=manag, offset=25,35, keyword?false, posinc=1
output term=applic, offset=36,47, keyword?false, posinc=1
output term=UIMA, offset=49,53, keyword?true, posinc=1
output term=fantast, offset=58,67, keyword?false, posinc=2
input=Born in the U.S.A., I was...
output term=born, offset=0,4, keyword?false, posinc=1
output term=U.S.A., offset=12,18, keyword?true, posinc=3
output term=USA, offset=12,18, keyword?true, posinc=0
output term=i, offset=20,21, keyword?false, posinc=1
input=He is a free-wheeling kind of guy.
output term=he, offset=0,2, keyword?false, posinc=1
output term=free-wheeling, offset=8,21, keyword?true, posinc=3
output term=freewheeling, offset=8,21, keyword?true, posinc=0
output term=free wheeling, offset=8,21, keyword?true, posinc=0
output term=kind, offset=22,26, keyword?false, posinc=1
output term=gui, offset=30,33, keyword?false, posinc=2
input=Magellan was one of our great mariners
output term=magellan, offset=0,8, keyword?false, posinc=1
output term=on, offset=13,16, keyword?false, posinc=2
output term=our, offset=20,23, keyword?false, posinc=2
output term=great, offset=24,29, keyword?false, posinc=1
output term=mariners, offset=30,38, keyword?true, posinc=1
input=Get your daily dose of Vitamin A here!
output term=get, offset=0,3, keyword?false, posinc=1
output term=your, offset=4,8, keyword?false, posinc=1
output term=daili, offset=9,14, keyword?false, posinc=1
output term=dose, offset=15,19, keyword?false, posinc=1
output term=Vitamin A, offset=23,32, keyword?true, posinc=2
output term=here, offset=33,37, keyword?false, posinc=1
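The posinc values greater than 1 in the output above come from the StopFilter: each removed stopword leaves a gap that is added to the position increment of the next surviving token. The sketch below reproduces that accounting with JDK classes only. It is an approximation, not the filter chain itself: StopPosIncSketch and filter() are hypothetical names, keywords are passed in as a plain Set instead of a KeywordAttribute, and the stopword set used in the usage note is a guess (the real stopwords.txt is not shown in this post).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical model of keyword-aware lowercasing plus stopword removal. */
public class StopPosIncSketch {

  /** Returns "term:posInc" strings for the tokens that survive filtering. */
  public static List<String> filter(List<String> terms,
      Set<String> keywords, Set<String> stopwords) {
    List<String> out = new ArrayList<String>();
    int skipped = 0;
    for (String term : terms) {
      boolean isKeyword = keywords.contains(term);
      // keywords bypass lowercasing, everything else is normalized
      String normalized = isKeyword ? term : term.toLowerCase(Locale.ROOT);
      if (!isKeyword && stopwords.contains(normalized)) {
        // a dropped stopword widens the gap before the next token
        skipped++;
        continue;
      }
      out.add(normalized + ":" + (1 + skipped));
      skipped = 0;
    }
    return out;
  }
}
```

For the terms Born, in, the, USA, I, was, with USA as the only keyword and in, the, was assumed to be stopwords, the result is born:1, USA:3, i:1, which lines up with the first example in the test output. A keyword that happens to look like a stopword is kept: with the and who as stopwords and WHO as a keyword, the terms The, WHO come out as WHO:2.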
So anyway, that's about it for today. This information is probably not all that useful unless you are trying to do something along similar lines, but hopefully it was interesting :-). Next week, I hope to incorporate this analyzer into Neo4j's Lucene-based IndexService (for looking up nodes in a graph).
Sujit, great work.
I have been playing with UIMA and wanted to test some of your examples, but I could not find the source for one of your libraries:
tgni.uima.utils.UimaUtils;
I also checked your SourceForge repository, but I think it has not been updated lately, or this code is somewhere else that I could not find.
Could you share "tgni.uima.utils.UimaUtils"?
Thanks
Hi Cem, thanks for the kind words. I have added the UimaUtils class to my post. Currently I am working with a local git repository - this is a skunkworks project I am doing on my own time that attempts to solve a problem at work. When complete, I plan on giving my company first dibs on the project, if they don't want it then I will open-source it - so currently there is no public repo for this code.