Saturday, June 25, 2011

Running a UIMA Analysis Engine in a Lucene Analyzer Chain

Last week, I wrote about a UIMA Aggregate Analysis Engine (AE) that annotates keywords in a body of text, optionally inserting synonyms, using a combination of pattern matching and dictionary lookups. The idea is that this analysis will be done on text on its way into a Lucene index. So this week, I describe the Lucene Analyzer chain that I built around the AE I described last week.

A picture is worth a thousand words, so here is one that shows what I am (or will be soon, in much greater detail) talking about.


As you can imagine, most of the work happens in the UimaAETokenizer. The tokenizer is a buffering (non-streaming) Tokenizer, i.e., the entire text is read from the Reader and analyzed by the UIMA AE, and the individual tokens are then returned on successive calls to its incrementToken() method. I decided to use the new (to me) AttributeSource.State object to keep track of the tokenizer's state between calls to incrementToken() (I found out about it by digging through the Synonym filter example in the LIA2 book).
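
To make the buffering idea concrete, here is a minimal, Lucene-free sketch. The class name BufferingTokenizerSketch and its next() method are hypothetical names used only for illustration (they are not part of Lucene or UIMA), and a plain whitespace split stands in for the UIMA analysis step. The point is only the control flow: consume the whole Reader once, then hand out buffered tokens one call at a time.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of a buffering (non-streaming) tokenizer.
class BufferingTokenizerSketch {
  private final Reader reader;
  private final Deque<String> buffer = new ArrayDeque<>();
  private boolean analyzed = false;

  BufferingTokenizerSketch(Reader reader) {
    this.reader = reader;
  }

  // Returns the next buffered token, or null once everything has been
  // returned (the analogue of incrementToken() returning false).
  String next() {
    if (!analyzed) {
      analyzed = true; // the reader is consumed exactly once
      StringBuilder text = new StringBuilder();
      char[] chunk = new char[1024];
      try {
        int n;
        while ((n = reader.read(chunk)) != -1) {
          text.append(chunk, 0, n);
        }
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
      // assumption: a whitespace split stands in for the UIMA AE here
      for (String token : text.toString().trim().split("\\s+")) {
        if (!token.isEmpty()) {
          buffer.add(token);
        }
      }
    }
    return buffer.poll();
  }
}
```

Because the "already analyzed" flag is set before any token is returned, an empty input or a single-token input simply yields null on the following call instead of triggering a second analysis pass.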

After the (UIMA) analysis, the annotated tokens are marked as keywords, and any transformed values for an annotation are added to the SynonymMap (for use by the synonym filter, next in the chain). Text that is not annotated is split up (on punctuation and whitespace) and returned as plain Lucene Term (or CharTerm since Lucene 3.x) tokens. Here is the code for the Tokenizer class.

// Source: src/main/java/com/mycompany/tgni/lucene/UimaAETokenizer.java
package com.mycompany.tgni.lucene;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.io.IOUtils;
import org.apache.commons.lang.StringUtils;
import org.apache.commons.lang.math.IntRange;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.cas.FSIndex;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;

import com.mycompany.tgni.uima.annotators.keyword.KeywordAnnotation;
import com.mycompany.tgni.uima.utils.UimaUtils;

/**
 * Tokenizes a block of text from the passed in reader and
 * annotates it with the specified UIMA Analysis Engine. Terms
 * in the text that are not annotated by the Analysis Engine
 * are split on whitespace and punctuation. Attributes available:
 * CharTermAttribute, OffsetAttribute, PositionIncrementAttribute
 * and KeywordAttribute. 
 */
public final class UimaAETokenizer extends Tokenizer {

  private final CharTermAttribute termAttr;
  private final OffsetAttribute offsetAttr;
  private final PositionIncrementAttribute posIncAttr;
  private final KeywordAttribute keywordAttr;

  private AttributeSource.State current;
  private AnalysisEngine ae;
  private SynonymMap synmap;
  private LinkedList<IntRange> rangeList;
  private Map<IntRange,Object> rangeMap;
  private Reader reader = null;
  private boolean eof = false;

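  // split on any single punctuation or whitespace character
  // (a '+' inside a character class is a literal plus sign, and
  // that character is already covered by \p{Punct})
  private static final Pattern PUNCT_OR_SPACE_PATTERN = 
    Pattern.compile("[\\p{Punct}\\s]");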
  private static final String SYN_DELIMITER = "__";
  
  public UimaAETokenizer(Reader input, 
      String aeDescriptor, Map<String,Object> aeParams,
      SynonymMap synonymMap) {
    super(input);
    // validate inputs
    try {
      ae = UimaUtils.getAE(aeDescriptor, aeParams);
    } catch (Exception e) {
      throw new RuntimeException(e);
    }
    if (synonymMap == null) {
      throw new RuntimeException(
        "Need valid (non-null) reference to a SynonymMap");
    }
    synmap = synonymMap;
    reader = new BufferedReader(input);
    // set available attributes
    termAttr = addAttribute(CharTermAttribute.class);
    offsetAttr = addAttribute(OffsetAttribute.class);
    posIncAttr = addAttribute(PositionIncrementAttribute.class);
    keywordAttr = addAttribute(KeywordAttribute.class);
    // initialize variables
    rangeList = new LinkedList<IntRange>();
    rangeMap = new HashMap<IntRange,Object>();
  }
  
  @Override
  public boolean incrementToken() throws IOException {
    if (rangeList.size() > 0) {
      populateAttributes();
      current = captureState();
      restoreState(current);
      if (rangeList.size() == 0) {
        eof = true;
      }
      return true;
    }
    // if no more tokens, return
    if (eof) {
      return false;
    }
    // analyze input and buffer tokens
    clearAttributes();
    rangeList.clear();
    rangeMap.clear();
    try {
      List<String> texts = IOUtils.readLines(reader);
      for (String text : texts) {
        JCas jcas = UimaUtils.runAE(ae, text);
        FSIndex<? extends Annotation> fsindex = 
          jcas.getAnnotationIndex(KeywordAnnotation.type);
        int pos = 0;
        for (Iterator<? extends Annotation> it = fsindex.iterator();
            it.hasNext(); ) {
          KeywordAnnotation annotation = (KeywordAnnotation) it.next();
          int begin = annotation.getBegin();
          int end = annotation.getEnd();
          if (pos < begin) {
            // this is plain text, split this up by whitespace
            // into individual terms
            addNonAnnotatedTerms(pos, text.substring(pos, begin));
          }
          IntRange range = new IntRange(begin, end);
          mergeAnnotationInfo(range, annotation);     
          pos = end;
        }
        if (pos < text.length()) {
          addNonAnnotatedTerms(pos, text.substring(pos));
        }
        current = captureState();
      }
    } catch (Exception e) {
      throw new IOException(e);
    }
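    // the reader is exhausted at this point, so never analyze again;
    // return the first term from rangeList, or false if the input
    // produced no terms at all
    eof = true;
    if (rangeList.isEmpty()) {
      return false;
    }
    populateAttributes();
    return true;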
  }
  
  private void populateAttributes() {
    if (rangeList.size() == 0) {
      return;
    }
    // return buffered tokens one by one. If current
    // token has an associated UimaAnnotationAttribute,
    // then set the attribute in addition to term
    IntRange range = rangeList.removeFirst();
    if (rangeMap.containsKey(range)) {
      Object rangeValue = rangeMap.get(range);
      if (rangeValue instanceof KeywordAnnotation) {
        // this is a UIMA Keyword annotation
        KeywordAnnotation annotation = (KeywordAnnotation) rangeValue;
        String term = annotation.getCoveredText();
        String transformedValue = annotation.getTransformedValue();
        if (StringUtils.isNotEmpty(transformedValue)) {
          List<Token> tokens = SynonymMap.makeTokens(
            Arrays.asList(StringUtils.split(
            transformedValue, SYN_DELIMITER)));
          // rather than add all the synonym tokens in a single
          // add, we have to do this separately to ensure that
          // the position increment attribute is set to 0 for
          // all the synonyms, not just the first one
          for (Token token : tokens) {
            synmap.add(Arrays.asList(term), Arrays.asList(token), true, true);
          }
        }
        offsetAttr.setOffset(annotation.getBegin(), 
          annotation.getEnd());
        termAttr.copyBuffer(term.toCharArray(), 0, term.length());
        termAttr.setLength(term.length());
        keywordAttr.setKeyword(true);
        posIncAttr.setPositionIncrement(1);
      } else {
        // this is a plain text term
        String term = (String) rangeValue;
        termAttr.copyBuffer(term.toCharArray(), 0, term.length());
        termAttr.setLength(term.length());
        offsetAttr.setOffset(range.getMinimumInteger(), 
          range.getMaximumInteger());
        keywordAttr.setKeyword(false);
        posIncAttr.setPositionIncrement(1);
      }
    }
  }

  private void addNonAnnotatedTerms(int pos, String snippet) {
    int start = 0;
    Matcher m = PUNCT_OR_SPACE_PATTERN.matcher(snippet);
    while (m.find(start)) {
      int begin = m.start();
      int end = m.end();
      if (start == begin) {
        // this is a punctuation character, skip it
        start = end;
        continue;
      }
      IntRange range = new IntRange(pos + start, pos + begin);
      rangeList.add(range);
      rangeMap.put(range, snippet.substring(start, begin));
      start = end; 
    }
    // take care of trailing string in snippet
    if (start < snippet.length()) {
      IntRange range = new IntRange(pos + start, pos + snippet.length());
      rangeList.add(range);
      rangeMap.put(range, snippet.substring(start));
    }
  }

  private void mergeAnnotationInfo(IntRange range, 
      KeywordAnnotation annotation) {
    // verify if the range has not already been recognized.
    // this is possible if multiple AEs recognize and act
    // on the same pattern/dictionary entry
    if (rangeMap.containsKey(range) &&
        rangeMap.get(range) instanceof KeywordAnnotation) {
      KeywordAnnotation prevAnnotation = 
        (KeywordAnnotation) rangeMap.get(range);
      Set<String> synonyms = new HashSet<String>();
      if (StringUtils.isNotEmpty(
          prevAnnotation.getTransformedValue())) {
        synonyms.addAll(Arrays.asList(StringUtils.split(
          prevAnnotation.getTransformedValue(), SYN_DELIMITER)));
      }
      if (StringUtils.isNotEmpty(annotation.getTransformedValue())) {
        synonyms.addAll(Arrays.asList(StringUtils.split(
          annotation.getTransformedValue(), SYN_DELIMITER)));
      }
      annotation.setTransformedValue(StringUtils.join(
        synonyms.iterator(), SYN_DELIMITER));
      rangeMap.put(range, annotation);
    } else {
      rangeList.add(range);
      rangeMap.put(range, annotation);
    }
  }
}
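
The way the tokenizer interleaves annotated spans with the plain text between them can be tried out in isolation. The following is only a sketch under stated assumptions: GapSplitSketch and its methods are hypothetical names, the annotated spans are passed in as {begin, end} pairs (sorted and non-overlapping) instead of coming from a UIMA AE, and the "K:" and "T:" prefixes are just a way of showing which tokens would be keywords and which plain terms.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch: emit annotated spans whole, split the gaps.
class GapSplitSketch {
  private static final Pattern PUNCT_OR_SPACE =
    Pattern.compile("[\\p{Punct}\\s]+");

  // 'annotated' holds {begin, end} offsets, standing in for the
  // begin/end of each KeywordAnnotation.
  static List<String> tokenize(String text, int[][] annotated) {
    List<String> out = new ArrayList<>();
    int pos = 0;
    for (int[] span : annotated) {
      if (pos < span[0]) {
        // plain text before this annotation
        addPlainTerms(text.substring(pos, span[0]), out);
      }
      out.add("K:" + text.substring(span[0], span[1]));
      pos = span[1];
    }
    if (pos < text.length()) {
      // trailing plain text after the last annotation
      addPlainTerms(text.substring(pos), out);
    }
    return out;
  }

  private static void addPlainTerms(String snippet, List<String> out) {
    for (String term : PUNCT_OR_SPACE.split(snippet)) {
      if (!term.isEmpty()) { // split() can yield a leading empty string
        out.add("T:" + term);
      }
    }
  }
}
```

Calling tokenize("The HIV-1 virus is an AIDS carrier", new int[][] {{4, 9}, {22, 26}}) keeps "HIV-1" and "AIDS" intact as keyword tokens, even though "HIV-1" contains a punctuation character, while everything else is split into individual terms.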

The UimaUtils class is a simple utilities class that wraps common UIMA operations such as building an Analysis Engine from a descriptor, running an Analysis Engine, etc. Its code is shown below:

// Source: ./src/main/java/com/mycompany/tgni/uima/utils/UimaUtils.java
package com.mycompany.tgni.uima.utils;

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.commons.lang.StringUtils;
import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.FSIndex;
import org.apache.uima.cas.Feature;
import org.apache.uima.jcas.JCas;
import org.apache.uima.jcas.tcas.Annotation;
import org.apache.uima.resource.ResourceInitializationException;
import org.apache.uima.util.InvalidXMLException;
import org.apache.uima.util.ProcessTrace;
import org.apache.uima.util.ProcessTraceEvent;
import org.apache.uima.util.XMLInputSource;

/**
 * Largely copied from the TestUtils class in UIMA Sandbox component
 * AlchemyAPIAnnotator.
 */
public class UimaUtils {

  public static AnalysisEngine getAE(
      String descriptor, Map<String,Object> params) 
      throws IOException, InvalidXMLException, 
      ResourceInitializationException {
    AnalysisEngine ae = null;
    try {
      XMLInputSource in = new XMLInputSource(descriptor);
      AnalysisEngineDescription desc = 
        UIMAFramework.getXMLParser().
        parseAnalysisEngineDescription(in);
      if (params != null) {
        for (String key : params.keySet()) {
          desc.getAnalysisEngineMetaData().
            getConfigurationParameterSettings().
            setParameterValue(key, params.get(key));
        }
      }
      ae = UIMAFramework.produceAnalysisEngine(desc);
    } catch (Exception e) {
      throw new ResourceInitializationException(e);
    }
    return ae;
  }
  
  public static JCas runAE(AnalysisEngine ae, String text)
      throws AnalysisEngineProcessException,
      ResourceInitializationException {
    JCas jcas = ae.newJCas();
    jcas.setDocumentText(text);
    ProcessTrace trace = ae.process(jcas);
    for (ProcessTraceEvent evt : trace.getEvents()) {
      if (evt != null && evt.getResultMessage() != null &&
          evt.getResultMessage().contains("error")) {
        throw new AnalysisEngineProcessException(
          new Exception(evt.getResultMessage()));
      }
    }
    return jcas;
  }
  
  public static void printResults(JCas jcas) {
    FSIndex<Annotation> index = jcas.getAnnotationIndex();
    for (Iterator<Annotation> it = index.iterator(); it.hasNext(); ) {
      Annotation annotation = it.next();
      List<Feature> features = new ArrayList<Feature>();
      if (annotation.getType().getName().contains("com.mycompany")) {
        features = annotation.getType().getFeatures();
      }
      List<String> fasl = new ArrayList<String>();
      for (Feature feature : features) {
        if (feature.getName().contains("com.mycompany")) {
          String name = feature.getShortName();
          String value = annotation.getStringValue(feature);
          fasl.add(name + "=\"" + value + "\"");
        }
      }
      System.out.println(
        annotation.getType().getShortName() + ": " +
        annotation.getCoveredText() + " " +
        (fasl.size() > 0 ? StringUtils.join(fasl.iterator(), ",") : "") + " " +
        annotation.getBegin() + ":" + annotation.getEnd());
    }
    System.out.println("==");
  }
}

The next filter in the chain is the (Lucene provided, since 3.0 I think) SynonymFilter. It needs a reference to a SynonymMap. An empty SynonymMap was provided to the UimaAETokenizer, which it populated, and now it is available for use by the SynonymFilter. And yes, I do realize that this sort of pass-by-reference stuff is frowned upon in the Java world, but at least in this case, it keeps the code simple and easy to understand.

At the end of this step, the SynonymFilter will have emitted the synonym terms with the same offsets as the original term, each with a position increment of 0.
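
What a position increment of 0 means for the synonyms can be shown with a small, Lucene-free sketch. Everything here is an assumption made for illustration: SynonymStackSketch is a hypothetical class, and a plain Map plays the role of the SynonymMap that the tokenizer populates and the filter later reads. Each output entry is written as term@position.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of synonym "stacking" at a single position.
class SynonymStackSketch {
  static List<String> expand(List<String> terms,
      Map<String, List<String>> synonyms) {
    List<String> out = new ArrayList<>();
    int position = -1;
    for (String term : terms) {
      position += 1; // the original term advances the position by 1
      out.add(term + "@" + position);
      List<String> syns =
        synonyms.getOrDefault(term, Collections.<String>emptyList());
      for (String syn : syns) {
        // increment of 0: the synonym shares the original's position
        out.add(syn + "@" + position);
      }
    }
    return out;
  }
}
```

With a map entry from "HIV-1" to ["HIV 1", "HIV1"], the terms ["HIV-1", "virus"] expand to HIV-1@0, HIV 1@0, HIV1@0, virus@1, so a phrase query can match any of the variants followed by "virus".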

The next two filters are the LowerCaseFilter and StopFilter, which lowercase the tokens and remove stopwords respectively. I wanted them to leave alone the tokens generated from the UIMA annotations in my UimaAETokenizer, similar to how the PorterStemFilter behaves in Lucene 4.0. Specifically, with PorterStemFilter, it is possible to mark certain terms as keywords using KeywordAttribute.setKeyword(true), and these terms will be skipped for stemming.

However, this functionality does not exist in Lucene (yet), so I have opened a JIRA issue (LUCENE-3236) with the necessary patches. Hopefully it will be incorporated into Lucene at some point. In the interim, you can use the versions below, which are functionally identical to the patched versions I provided in the JIRA.

// Source: src/main/java/com/mycompany/tgni/lucene/LowerCaseFilter.java
package com.mycompany.tgni.lucene;

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;
import org.apache.lucene.analysis.util.CharacterUtils;
import org.apache.lucene.util.Version;

public final class LowerCaseFilter extends TokenFilter {
  private final CharacterUtils charUtils;
  private final CharTermAttribute termAtt = 
    addAttribute(CharTermAttribute.class);
  private final KeywordAttribute keywordAtt = 
    addAttribute(KeywordAttribute.class);
  
  private boolean ignoreKeyword = false;

  /**
   * Extra constructor to trigger new keyword-aware behavior.
   */
  public LowerCaseFilter(Version matchVersion, TokenStream in, 
      boolean ignoreKeyword) {
    super(in);
    charUtils = CharacterUtils.getInstance(matchVersion);
    this.ignoreKeyword = ignoreKeyword;
  }

  /**
   * Old ctor.
   */
  public LowerCaseFilter(Version matchVersion, TokenStream in) {
    this(matchVersion, in, false);
  }
  
  @Override
  public final boolean incrementToken() throws IOException {
    if (input.incrementToken()) {
      if (ignoreKeyword && keywordAtt.isKeyword()) {
        // do nothing
        return true;
      }
      final char[] buffer = termAtt.buffer();
      final int length = termAtt.length();
      for (int i = 0; i < length;) {
       i += Character.toChars(
         Character.toLowerCase(charUtils.codePointAt(buffer, i)), buffer, i);
      }
      return true;
    } else
      return false;
  }
}

The only real changes are an extra constructor to trigger keyword-aware behavior, the addition of the KeywordAttribute (so this filter is now keyword-aware), and a small if condition in the incrementToken() method that short-circuits the lowercasing when the term is marked as a keyword.

Similarly, the StopFilter below is also almost identical to the stock Lucene StopFilter. Like the custom version of the LowerCaseFilter, the only changes are the extra constructor (to trigger keyword-aware behavior), the addition of a KeywordAttribute to its list of recognized attributes and an extra condition in the (protected) accept() method.

// Source: src/main/java/com/mycompany/tgni/lucene/StopFilter.java
package com.mycompany.tgni.lucene;

import java.io.IOException;
import java.util.Arrays;
import java.util.List;
import java.util.Set;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.analysis.util.FilteringTokenFilter;
import org.apache.lucene.util.Version;

public final class StopFilter extends FilteringTokenFilter {

  private final CharArraySet stopWords;
  private final CharTermAttribute termAtt = 
    addAttribute(CharTermAttribute.class);
  private final KeywordAttribute keywordAtt =
    addAttribute(KeywordAttribute.class);

  private boolean ignoreKeyword = false;
  
  /**
   * New ctor to trigger keyword-aware behavior.
   */
  public StopFilter(Version matchVersion, TokenStream input, Set<?> stopWords, 
      boolean ignoreCase, boolean ignoreKeyword) {
    super(true, input);
    this.stopWords = stopWords instanceof CharArraySet ? 
      (CharArraySet) stopWords : 
      new CharArraySet(matchVersion, stopWords, ignoreCase);
    this.ignoreKeyword = ignoreKeyword;
  }

  /**
   * Old ctor for current behavior.
   */
  public StopFilter(Version matchVersion, TokenStream input, Set<?> stopWords, 
      boolean ignoreCase) {
    this(matchVersion, input, stopWords, ignoreCase, false);
  }
  
  public StopFilter(Version matchVersion, TokenStream in, Set<?> stopWords) {
    this(matchVersion, in, stopWords, false);
  }

  public static Set<Object> makeStopSet(Version matchVersion, 
      String... stopWords) {
    return makeStopSet(matchVersion, stopWords, false);
  }
  
  public static Set<Object> makeStopSet(Version matchVersion, 
      List<?> stopWords) {
    return makeStopSet(matchVersion, stopWords, false);
  }
    
  public static Set<Object> makeStopSet(Version matchVersion, 
      String[] stopWords, boolean ignoreCase) {
    CharArraySet stopSet = new CharArraySet(matchVersion, stopWords.length, 
      ignoreCase);
    stopSet.addAll(Arrays.asList(stopWords));
    return stopSet;
  }
  
  public static Set<Object> makeStopSet(Version matchVersion, List<?> stopWords,
       boolean ignoreCase){
    CharArraySet stopSet = new CharArraySet(matchVersion, stopWords.size(),
      ignoreCase);
    stopSet.addAll(stopWords);
    return stopSet;
  }
  
  @Override
  protected boolean accept() throws IOException {
    return (ignoreKeyword && keywordAtt.isKeyword()) || 
      !stopWords.contains(termAtt.buffer(), 0, termAtt.length());
  }
}

And finally, my analyzer contains the PorterStemFilter, which already recognizes keywords, so no changes needed there.
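
The keyword-aware behavior shared by these filters boils down to one rule: if the token is flagged as a keyword, pass it through untouched. Here is a rough sketch of that rule without any Lucene classes. KeywordAwareFilterSketch and Tok are hypothetical names, Tok is a stand-in for a token carrying a term and a keyword flag, and stemming is left out to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of keyword-aware lowercasing and stop removal.
class KeywordAwareFilterSketch {
  // Stand-in for a token with a term and a keyword flag.
  static final class Tok {
    final String term;
    final boolean keyword;
    Tok(String term, boolean keyword) {
      this.term = term;
      this.keyword = keyword;
    }
  }

  static List<String> filter(List<Tok> tokens, Set<String> stopwords) {
    List<String> out = new ArrayList<>();
    for (Tok t : tokens) {
      if (t.keyword) {
        // keywords bypass both lowercasing and stopword removal
        out.add(t.term);
        continue;
      }
      String lower = t.term.toLowerCase(Locale.ROOT);
      if (!stopwords.contains(lower)) {
        out.add(lower);
      }
    }
    return out;
  }
}
```

The practical effect is that an acronym such as "IT", once marked as a keyword, survives even when "it" is in the stopword list, while the ordinary word "It" is lowercased and dropped.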

To test this analyzer, I wrote a little JUnit test that takes the snippets of text that I used to test my UIMA AEs before.

// Source: src/test/java/com/mycompany/tgni/lucene/UimaAETokenizerTest.java
package com.mycompany.tgni.lucene;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

import org.apache.commons.lang.StringUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.synonym.SynonymFilter;
import org.apache.lucene.analysis.synonym.SynonymMap;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.Version;
import org.junit.Test;

import com.mycompany.tgni.uima.annotators.keyword.KeywordAnnotatorsTest;

public class UimaAETokenizerTest {

  private Analyzer analyzer;
  
  @Test
  public void testUimaKeywordTokenizer() throws Exception {
    analyzer = getAnalyzer();
    for (String s : KeywordAnnotatorsTest.TEST_STRINGS) {
      System.out.println("input=" + s);
      TokenStream tokenStream = analyzer.tokenStream("f", new StringReader(s));
      while (tokenStream.incrementToken()) {
        CharTermAttribute termAttr = 
          tokenStream.getAttribute(CharTermAttribute.class);
        OffsetAttribute offsetAttr = 
          tokenStream.getAttribute(OffsetAttribute.class);
        System.out.print("output term=" + 
          new String(termAttr.buffer(), 0, termAttr.length()) +
          ", offset=" + offsetAttr.startOffset() + "," + 
          offsetAttr.endOffset());
        KeywordAttribute keywordAttr = 
          tokenStream.getAttribute(KeywordAttribute.class);
        System.out.print(", keyword?" + keywordAttr.isKeyword());
        PositionIncrementAttribute posIncAttr = 
          tokenStream.getAttribute(PositionIncrementAttribute.class);
        System.out.print(", posinc=" + posIncAttr.getPositionIncrement());
        System.out.println();
      }
    }
  }

  private Analyzer getAnalyzer() throws Exception {
    if (analyzer == null) {
      List<String> stopwords = new ArrayList<String>();
      BufferedReader swreader = new BufferedReader(
        new FileReader(new File(
        "src/main/resources/stopwords.txt")));
      String line;
      while ((line = swreader.readLine()) != null) {
        if (StringUtils.isEmpty(line) || line.startsWith("#")) {
          continue;
        }
        stopwords.add(StringUtils.trim(line));
      }
      swreader.close();
      final Set<?> stopset = StopFilter.makeStopSet(
        Version.LUCENE_40, stopwords);
      analyzer = new Analyzer() {
        @Override
        public TokenStream tokenStream(String fieldName, Reader reader) {
          SynonymMap synonymsMap = new SynonymMap();
          TokenStream input = new UimaAETokenizer(reader,
            "src/main/resources/descriptors/TaxonomyMappingAE.xml",
            null, synonymsMap);
          input = new SynonymFilter(input, synonymsMap);
          input = new LowerCaseFilter(Version.LUCENE_40, input, true);
          input = new StopFilter(Version.LUCENE_40, input, stopset, false, true);
          input = new PorterStemFilter(input);
          return input;
        }
      };
    }
    return analyzer;
  }
}

The output (edited for readability) of the test shows that the analyzer works as expected. You can see the effects of each of the filters in our analyzer in the different examples below.

input=Born in the USA I was...
  output term=born, offset=0,4, keyword?false, posinc=1
  output term=USA, offset=12,15, keyword?true, posinc=3
  output term=i, offset=16,17, keyword?false, posinc=1
input=CSC and IBM are Fortune 500 companies.
  output term=CSC, offset=0,3, keyword?true, posinc=1
  output term=IBM, offset=8,11, keyword?true, posinc=2
  output term=fortun, offset=16,23, keyword?false, posinc=2
  output term=500, offset=24,27, keyword?false, posinc=1
  output term=compani, offset=28,37, keyword?false, posinc=1
input=Linux is embraced by the Oracles and IBMs of the world
  output term=linux, offset=0,5, keyword?false, posinc=1
  output term=embrac, offset=9,17, keyword?false, posinc=2
  output term=oracl, offset=25,32, keyword?false, posinc=3
  output term=IBMs, offset=37,41, keyword?true, posinc=2
    output term=IBM, offset=37,41, keyword?true, posinc=0
  output term=world, offset=49,54, keyword?false, posinc=3
input=PET scans are uncomfortable.
  output term=PET, offset=0,3, keyword?true, posinc=1
  output term=scan, offset=4,9, keyword?false, posinc=1
  output term=uncomfort, offset=14,27, keyword?false, posinc=2
input=The HIV-1 virus is an AIDS carrier
  output term=HIV-1, offset=4,9, keyword?true, posinc=2
    output term=HIV 1, offset=4,9, keyword?true, posinc=0
    output term=HIV1, offset=4,9, keyword?true, posinc=0
  output term=viru, offset=10,15, keyword?false, posinc=1
  output term=AIDS, offset=22,26, keyword?true, posinc=3
    output term=Acquired Immunity Deficiency Syndrome, offset=22,26, keyword?true, posinc=0
  output term=carrier, offset=27,34, keyword?false, posinc=1
input=Unstructured Information Management Application (UIMA) is fantastic!
  output term=unstructur, offset=0,12, keyword?false, posinc=1
  output term=inform, offset=13,24, keyword?false, posinc=1
  output term=manag, offset=25,35, keyword?false, posinc=1
  output term=applic, offset=36,47, keyword?false, posinc=1
  output term=UIMA, offset=49,53, keyword?true, posinc=1
  output term=fantast, offset=58,67, keyword?false, posinc=2
input=Born in the U.S.A., I was...
  output term=born, offset=0,4, keyword?false, posinc=1
  output term=U.S.A., offset=12,18, keyword?true, posinc=3
    output term=USA, offset=12,18, keyword?true, posinc=0
  output term=i, offset=20,21, keyword?false, posinc=1
input=He is a free-wheeling kind of guy.
  output term=he, offset=0,2, keyword?false, posinc=1
  output term=free-wheeling, offset=8,21, keyword?true, posinc=3
    output term=freewheeling, offset=8,21, keyword?true, posinc=0
    output term=free wheeling, offset=8,21, keyword?true, posinc=0
  output term=kind, offset=22,26, keyword?false, posinc=1
  output term=gui, offset=30,33, keyword?false, posinc=2
input=Magellan was one of our great mariners
  output term=magellan, offset=0,8, keyword?false, posinc=1
  output term=on, offset=13,16, keyword?false, posinc=2
  output term=our, offset=20,23, keyword?false, posinc=2
  output term=great, offset=24,29, keyword?false, posinc=1
  output term=mariners, offset=30,38, keyword?true, posinc=1
input=Get your daily dose of Vitamin A here!
  output term=get, offset=0,3, keyword?false, posinc=1
  output term=your, offset=4,8, keyword?false, posinc=1
  output term=daili, offset=9,14, keyword?false, posinc=1
  output term=dose, offset=15,19, keyword?false, posinc=1
  output term=Vitamin A, offset=23,32, keyword?true, posinc=2
  output term=here, offset=33,37, keyword?false, posinc=1
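
When reading the posinc values in this output, it can help to turn them into absolute positions. The helper below is a small sketch (PositionSketch is a hypothetical name, not a Lucene class) that assumes the usual Lucene convention of starting from -1, so that a first token with an increment of 1 lands on position 0.

```java
// Hypothetical helper: convert position increments to positions.
class PositionSketch {
  static int[] absolutePositions(int[] posIncs) {
    int[] positions = new int[posIncs.length];
    int position = -1; // first token with posinc=1 lands on position 0
    for (int i = 0; i < posIncs.length; i++) {
      position += posIncs[i];
      positions[i] = position;
    }
    return positions;
  }
}
```

For the first input, the increments 1, 3, 1 become positions 0, 3, 4: the two skipped slots are where the stopwords "in" and "the" were removed. For the HIV-1 input, the increments 2, 0, 0, 1, 3, 0, 1 become 1, 1, 1, 2, 5, 5, 6, which shows the synonyms sitting on the same position as the term they expand.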

So anyway, that's about it for today. This information is probably not all that useful unless you are trying to do something along similar lines, but hopefully it was interesting :-). Next week, I hope to incorporate this analyzer into Neo4j's Lucene-based IndexService (for looking up nodes in a graph).

2 comments:

  1. Sujit, Great work..
    I have been playing with UIMA and wanted to test some of your examples and could not find the source for one of your libraries
    tgni.uima.utils.UimaUtils;
    I also checked your SourceForge repository but it was not updated lately I think, or this code was somewhere else, I could not find
    Could you share "tgni.uima.utils.UimaUtil" ?
    Thanks

  2. Hi Cem, thanks for the kind words. I have added the UimaUtils class to my post. Currently I am working with a local git repository - this is a skunkworks project I am doing on my own time that attempts to solve a problem at work. When complete, I plan on giving my company first dibs on the project, if they don't want it then I will open-source it - so currently there is no public repo for this code.
