Saturday, December 26, 2009

A Lucene POS Tagging TokenFilter

Of late, I've been busy converting a fairly large application from Lucene 2.4 to Lucene 2.9. There are quite a few improvements - see this white paper from Lucid Imagination for a quick overview (sign-up required). Along with the many performance improvements, there are some API changes and deprecations. Some of these deprecations came with a fair amount of warning, such as the Hits object, which is finally going away in Lucene 3.0, but others are scheduled more aggressively - Lucene 2.9 introduces a brand new TokenStream API, and the old one will also be removed in Lucene 3.0.

The new API is actually quite nice - it's more flexible, so you can add custom properties to a Token during analysis simply by adding the required TokenFilter. A TokenFilter stores user-created Attributes for the current Token, which can be retrieved during analysis. Prior to this Attribute-based API, you would probably have used the Token's payload to do the same thing. In any case, the new API is described in detail in the Javadocs. I thought it would be interesting to actually build a TokenFilter using the docs, so I recycled some old ideas to create a Part of Speech tagging TokenFilter.
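
For contrast, here is a minimal sketch of what the payload approach might have looked like; the helper class and the choice of storing the Pos ordinal as a single byte are my own assumptions, not code from the application.

// Hypothetical helper showing the pre-2.9 payload approach: the POS is
// serialized into the token's payload instead of a custom Attribute.
import net.sf.jtmt.postaggers.Pos;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.index.Payload;

public class PosPayloadHelper {

  /** Stores the Pos enum's ordinal as a single-byte payload on the token. */
  public static void setPos(Token token, Pos pos) {
    token.setPayload(new Payload(new byte[] {(byte) pos.ordinal()}));
  }

  /** Recovers the Pos from the token's payload, defaulting to OTHER. */
  public static Pos getPos(Token token) {
    Payload payload = token.getPayload();
    return payload == null ? Pos.OTHER : Pos.values()[payload.getData()[0]];
  }
}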

Custom Attribute - Interface and Implementation

The first step is to create a custom attribute to hold the information about the Part of Speech. The interface simply defines getters and setters for the Pos object.

// Source: src/main/java/net/sf/jtmt/postaggers/lucene/PosAttribute.java
package net.sf.jtmt.postaggers.lucene;

import net.sf.jtmt.postaggers.Pos;

import org.apache.lucene.util.Attribute;

/**
 * Part of Speech Attribute.
 */
public interface PosAttribute extends Attribute {
  public void setPos(Pos pos);
  public Pos getPos();
}

The implementation class is named by suffixing the interface name with Impl, a convention enforced by the AttributeSource class, which instantiates the implementation by reflection. The class also extends AttributeImpl.

// Source: src/main/java/net/sf/jtmt/postaggers/lucene/PosAttributeImpl.java
package net.sf.jtmt.postaggers.lucene;

import org.apache.lucene.util.AttributeImpl;

import net.sf.jtmt.postaggers.Pos;

public class PosAttributeImpl extends AttributeImpl implements PosAttribute {

  private static final long serialVersionUID = -8416041956010464591L;

  private Pos pos = Pos.OTHER;
  
  public Pos getPos() {
    return pos;
  }

  public void setPos(Pos pos) {
    this.pos = pos;
  }

  @Override
  public void clear() {
    this.pos = Pos.OTHER;
  }

  @Override
  public void copyTo(AttributeImpl target) {
    ((PosAttributeImpl) target).setPos(pos);
  }

  @Override
  public boolean equals(Object other) {
    if (other == this) {
      return true;
    }
    if (other instanceof PosAttributeImpl) {
      return pos == ((PosAttributeImpl) other).getPos();
    }
    return false;
  }

  @Override
  public int hashCode() {
    return pos.ordinal();
  }
}
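
With the interface and implementation in place, the naming convention can be seen in action with a throwaway snippet like the one below (mine, not part of the project) - AttributeSource locates and instantiates PosAttributeImpl when you register the PosAttribute interface.

// Throwaway demo: registering the PosAttribute interface makes
// AttributeSource reflectively instantiate PosAttributeImpl.
import net.sf.jtmt.postaggers.Pos;

import org.apache.lucene.util.AttributeSource;

public class PosAttributeDemo {
  public static void main(String[] args) {
    AttributeSource source = new AttributeSource();
    PosAttribute posAttr = (PosAttribute) source.addAttribute(PosAttribute.class);
    System.out.println(posAttr.getClass().getName());  // ...PosAttributeImpl
    posAttr.setPos(Pos.NOUN);
    System.out.println(posAttr.getPos());              // NOUN
  }
}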

TokenFilter

The Part of Speech TokenFilter is actually the meat of the implementation. The constructor takes the parameters needed to set it up, and the incrementToken() method implements the logic for the POS tagging. As I said before, I am just recycling old ideas here, so refer to my old post if you have trouble following the code.

Essentially, we open two TokenStreams, one positioned at the current token and one at the next token (since we cannot peek ahead, we need two streams). For each current token, we query Wordnet for the POS. If a single POS is returned, we just use it. If no POS is returned, we use suffix rules to guess the POS. If multiple POS are returned, we use a table of inter-POS transition probabilities to determine the "most likely" POS for the current term.
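
For concreteness, here is the shape of the two input files, inferred from the parsing code below: fields are tab-separated, and lines starting with # are comments. The specific suffixes and probabilities are illustrative only, and the five-value Pos ordering is an assumption.

# pos_suffixes.txt: suffix<TAB>POS (the longest matching suffix wins)
ly	ADVERB
en	VERB
ness	NOUN

# pos_trans_prob.txt: row i, column j = P(POS j follows POS i), with rows
# and columns in Pos.values() order, e.g. NOUN VERB ADJECTIVE ADVERB OTHER
0.20	0.30	0.10	0.05	0.35
0.35	0.10	0.15	0.10	0.30
# (one row for each remaining Pos value)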

// Source: src/main/java/net/sf/jtmt/postaggers/lucene/PosTaggingFilter.java
package net.sf.jtmt.postaggers.lucene;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

import net.sf.jtmt.clustering.ByValueComparator;
import net.sf.jtmt.postaggers.Pos;

import org.apache.commons.collections15.keyvalue.MultiKey;
import org.apache.commons.lang.StringUtils;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

import edu.mit.jwi.Dictionary;
import edu.mit.jwi.IDictionary;
import edu.mit.jwi.item.IIndexWord;

/**
 * Sets the POS tag for each token in the token stream. Uses Wordnet and some
 * grammar rules to make an initial determination. If Wordnet returns multiple
 * POS possibilities, we use the context surrounding the word (previous and
 * next characters) and a table of pre-calculated probabilities (from the
 * Brown corpus) to determine the most likely POS. If Wordnet returns a single 
 * POS, that is accepted. If Wordnet cannot determine the POS, then word 
 * suffix rules are used to guess the POS.
 */
public class PosTaggingFilter extends TokenFilter {

  private String prevTerm = null;
  private String nextTerm = null;

  private Pos prevPos = null;

  private TokenStream suffixStream;
  private IDictionary wordnetDictionary;
  private Map<String,Pos> suffixPosMap;
  private Map<MultiKey<Pos>,Double> transitionProbabilities;
  
  private PosAttribute posAttr;
  private TermAttribute termAttr;
  private TermAttribute suffixTermAttr;
  
  protected PosTaggingFilter(TokenStream input, 
      TokenStream suffixStream,
      String wordnetDictionaryPath, String suffixRulesPath, 
      String posTransitionProbabilitiesPath) throws Exception {
    super(input);
    // declare the POS attribute for both streams
    this.posAttr = (PosAttribute) addAttribute(PosAttribute.class);
    this.termAttr = (TermAttribute) addAttribute(TermAttribute.class);
    this.suffixTermAttr = 
      (TermAttribute) suffixStream.addAttribute(TermAttribute.class);
    this.prevTerm = null;
    this.suffixStream = suffixStream;
    this.input.reset();
    this.suffixStream.reset();
    // advance the pointer to the next token for the suffix stream
    if (this.suffixStream.incrementToken()) {
      this.nextTerm = suffixTermAttr.term();
    }
    // create artifacts for doing the POS tagging
    this.wordnetDictionary = initWordnetDictionary(wordnetDictionaryPath);
    this.suffixPosMap = initSuffixPosMap(suffixRulesPath); 
    this.transitionProbabilities = initTransitionProbabilities(
      posTransitionProbabilitiesPath);
  }

  public final boolean incrementToken() throws IOException {
    String currentTerm = null;
    if (input.incrementToken()) {
      currentTerm = termAttr.term();
    } else {
      return false;
    }
    if (suffixStream.incrementToken()) {
      this.nextTerm = suffixTermAttr.term();
    } else {
      this.nextTerm = null;
    }
    termAttr.setTermBuffer(currentTerm);
    // find the POS of the current word from Wordnet
    List<Pos> currentPoss = getPosFromWordnet(currentTerm);
    if (currentPoss.size() == 1) {
      // unambiguous match, look no further
      posAttr.setPos(currentPoss.get(0));
    } else if (currentPoss.size() == 0) {
      // wordnet could not find a POS, fall back to heuristics
      Pos pos = null;
      if (prevTerm != null && Character.isUpperCase(currentTerm.charAt(0))) {
        // this is not the first word, so capitalization signals a proper noun
        pos = Pos.NOUN;
      }
      if (pos == null) {
        // no capitalization clue, use suffix rules to guess the POS
        pos = getPosFromSuffixRules(currentTerm);
      }
      posAttr.setPos(pos);
    } else {
      // wordnet reported multiple POS, find the best one
      Pos pos = getMostProbablePos(currentPoss, nextTerm, prevPos);
      posAttr.setPos(pos);
    }
    this.prevTerm = currentTerm;
    this.prevPos = posAttr.getPos();
    return true;
  }
  
  private IDictionary initWordnetDictionary(String wordnetDictionaryPath) 
      throws Exception {
    IDictionary wordnetDictionary = new Dictionary(
      new URL("file", null, wordnetDictionaryPath));
    wordnetDictionary.open();
    return wordnetDictionary;
  }

  private Map<String, Pos> initSuffixPosMap(String suffixRulesPath) 
      throws Exception {
    Map<String,Pos> suffixPosMap = new TreeMap<String, Pos>(
      new Comparator<String>() {
        public int compare(String s1, String s2) {
          int l1 = s1 == null ? 0 : s1.length();
          int l2 = s2 == null ? 0 : s2.length();
          if (l1 == l2) {
            // fall back to lexicographic order so distinct equal-length
            // suffixes don't collide (and overwrite each other) in the TreeMap
            return s1 == null ? 0 : s1.compareTo(s2);
          } else {
            // longer suffixes sort first, so the most specific rule matches first
            return (l2 > l1 ? 1 : -1);
          }
        }
    });
    BufferedReader reader = new BufferedReader(new FileReader(suffixRulesPath));
    String line = null;
    while ((line = reader.readLine()) != null) {
      if (StringUtils.isEmpty(line) || line.startsWith("#")) {
        continue;
      }
      String[] suffixPosPair = StringUtils.split(line, "\t");
      suffixPosMap.put(suffixPosPair[0], Pos.valueOf(suffixPosPair[1]));
    }
    reader.close();
    return suffixPosMap;
  }

  private Map<MultiKey<Pos>,Double> initTransitionProbabilities(
      String transitionProbabilitiesPath) throws Exception {
    Map<MultiKey<Pos>,Double> transitionProbabilities = 
      new HashMap<MultiKey<Pos>,Double>();
    BufferedReader reader = new BufferedReader(
      new FileReader(transitionProbabilitiesPath));
    String line = null;
    int row = 0;
    while ((line = reader.readLine()) != null) {
      if (StringUtils.isEmpty(line) || line.startsWith("#")) {
        continue;
      }
      String[] cols = StringUtils.split(line, "\t");
      for (int col = 0; col < cols.length; col++) {
        MultiKey<Pos> key = new MultiKey<Pos>(Pos.values()[row], Pos.values()[col]);
        transitionProbabilities.put(key, Double.valueOf(cols[col]));
      }
      row++;
    }
    reader.close();
    return transitionProbabilities;
  }

  private List<Pos> getPosFromWordnet(String currentTerm) {
    List<Pos> poss = new ArrayList<Pos>();
    for (Pos pos : Pos.values()) {
      try {
        IIndexWord indexWord = wordnetDictionary.getIndexWord(
          currentTerm, Pos.toWordnetPos(pos));
        if (indexWord != null) {
          poss.add(pos);
        }
      } catch (NullPointerException e) {
        // JWI throws NPE if it cannot find a word in dictionary
        continue;
      }
    }
    return poss;
  }

  private Pos getPosFromSuffixRules(String currentTerm) {
    for (String suffix : suffixPosMap.keySet()) {
      if (StringUtils.lowerCase(currentTerm).endsWith(suffix)) {
        return suffixPosMap.get(suffix);
      }
    }
    return Pos.OTHER;
  }

  private Pos getMostProbablePos(List<Pos> currentPoss, String nextTerm,
      Pos prevPos) {
    Map<Pos,Double> posProbs = new HashMap<Pos,Double>();
    // find the possible POS values for the previous and current term
    if (prevPos != null) {
      for (Pos currentPos : currentPoss) {
        MultiKey<Pos> key = new MultiKey<Pos>(prevPos, currentPos);
        double prob = transitionProbabilities.get(key);
        if (posProbs.containsKey(currentPos)) {
          posProbs.put(currentPos, posProbs.get(currentPos) + prob);
        } else {
          posProbs.put(currentPos, prob);
        }
      }
    }
    // find the possible POS values for the current and previous term
    if (nextTerm != null) {
      List<Pos> nextPoss = getPosFromWordnet(nextTerm);
      if (nextPoss.size() == 0) {
        nextPoss.add(Pos.OTHER);
      }
      for (Pos currentPos : currentPoss) {
        for (Pos nextPos : nextPoss) {
          MultiKey<Pos> key = new MultiKey<Pos>(currentPos, nextPos);
          double prob = transitionProbabilities.get(key);
          if (posProbs.containsKey(currentPos)) {
            posProbs.put(currentPos, posProbs.get(currentPos) + prob);
          } else {
            posProbs.put(currentPos, prob);
          }
        }
      }
    }
    // now find the current Pos with the maximum probability
    if (posProbs.size() == 0) {
      return Pos.OTHER;
    } else {
      ByValueComparator<Pos,Double> bvc = 
        new ByValueComparator<Pos,Double>(posProbs);
      List<Pos> posList = new ArrayList<Pos>();
      posList.addAll(posProbs.keySet());
      Collections.sort(posList, Collections.reverseOrder(bvc));
      return posList.get(0);
    }
  }
}

Analyzer

For the Analyzer, I used a bare Analyzer that returns a StandardTokenizer, and tacked the POS TokenFilter onto its output. So as the stream spits out each Token with the term in it, it also carries the PosAttribute, which the user can access if desired. The code is written as a JUnit test, since I was really testing the TokenFilter.

// Source: src/test/java/net/sf/jtmt/postaggers/lucene/PosTaggingFilterTest.java
package net.sf.jtmt.postaggers.lucene;

import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.lang.StringUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;
import org.junit.Test;

/**
 * Test for the Lucene POS Tagging Filter.
 */
public class PosTaggingFilterTest {

  private String[] INPUT_TEXTS = {
    "The growing popularity of Linux in Asia, Europe, and the US is " +
    "a major concern for Microsoft.",
    "Jaguar will sell its new XJ-6 model in the US for a small fortune.",
    "The union is in a sad state.",
    "Please do not state the obvious.",
    "I am looking forward to the state of the union address.",
    "I have a bad cold today.",
    "The cold war was over long ago."
  };
  
  private Analyzer analyzer = new Analyzer() {
    @Override
    public TokenStream tokenStream(String fieldName, Reader reader) {
      return new StandardTokenizer(Version.LUCENE_CURRENT, reader);
    }
  };
  
  @Test
  public void testPosTagging() throws Exception {
    for (String inputText : INPUT_TEXTS) {
      System.out.println("Input: " + inputText);
      List<String> tags = new ArrayList<String>();
      TokenStream input = analyzer.tokenStream(
        "f", new StringReader(inputText));
      TokenStream suffixStream = 
        analyzer.tokenStream("f", new StringReader(inputText));
      input = new PosTaggingFilter(input, suffixStream,
        "/opt/wordnet-3.0/dict", "src/main/resources/pos_suffixes.txt",
        "src/main/resources/pos_trans_prob.txt");
      TermAttribute termAttribute = 
        (TermAttribute) input.addAttribute(TermAttribute.class);
      PosAttribute posAttribute = 
        (PosAttribute) input.addAttribute(PosAttribute.class);
      while (input.incrementToken()) {
        tags.add(termAttribute.term() + "/" + posAttribute.getPos());
      }
      input.end();
      input.close();
      StringBuilder tagBuf = new StringBuilder();
      tagBuf.append("Tagged: ").append(StringUtils.join(tags.iterator(), " "));
      System.out.println(tagBuf.toString());
    }
  }
}

Results

The test inputs are identical to those in the old post, although the new code seems to have difficulty figuring out verbs. Here are the results (cleaned up a bit for display).

Input:  The growing popularity of Linux in Asia, Europe, and the US is 
        a major concern for Microsoft.
Tagged: The/OTHER growing/ADJECTIVE popularity/NOUN of/OTHER Linux/NOUN 
        in/ADJECTIVE Asia/NOUN Europe/NOUN and/OTHER the/OTHER US/NOUN 
        is/OTHER a/NOUN major/ADJECTIVE concern/NOUN for/OTHER 
        Microsoft/OTHER

Input:  Jaguar will sell its new XJ-6 model in the US for a small fortune.
Tagged: Jaguar/NOUN will/NOUN sell/NOUN its/OTHER new/ADVERB XJ-6/OTHER 
        model/ADJECTIVE in/NOUN the/OTHER US/NOUN for/OTHER a/NOUN 
        small/ADJECTIVE fortune/NOUN

Input:  The union is in a sad state.
Tagged: The/OTHER union/NOUN is/OTHER in/ADJECTIVE a/NOUN sad/ADJECTIVE 
        state/NOUN

Input:  Please do not state the obvious.
Tagged: Please/VERB do/VERB not/ADVERB state/NOUN the/OTHER obvious/ADJECTIVE

Input:  I am looking forward to the state of the union address.
Tagged: I/ADJECTIVE am/NOUN looking/ADJECTIVE forward/NOUN to/OTHER the/OTHER
         state/NOUN of/OTHER the/OTHER union/ADJECTIVE address/NOUN

Input:  I have a bad cold today.
Tagged: I/ADJECTIVE have/NOUN a/NOUN bad/ADJECTIVE cold/NOUN today/NOUN

Input:  The cold war was over long ago.
Tagged: The/OTHER cold/ADJECTIVE war/NOUN was/OTHER over/ADVERB long/VERB 
        ago/ADJECTIVE

Although the old TokenStream API is on a short deprecation fuse, it's probably not such a big deal, since most people don't use it directly. From a migration point of view, even if you do use the TokenStream API directly, it's probably going to be less code than, say, search or indexing code, so I guess it's all good.

8 comments (moderated to prevent spam):

Clay Fink said...

Nice work. Thanks for making me aware of the Attribute interface. I tried something like this with Lucene and developed a TokenFilter that did named entity tagging by wrapping an existing NER tagger (the Stanford NLP group's tagger). Attribute would make this much easier.

Sujit Pal said...

Thanks Clay, glad the post helped.

amacinho said...

Thanks for the post!

I wonder if you have some other pointers on uses of the new Attribute API. I haven't been able to figure out whether it's possible to use these custom attributes in indexing and searching.

What I want is being able to issue queries like

1) "read/VERB book/NOUN"
2) "read/ADJECTIVE book/NOUN"
3) "read/* book/*"

which would find

1) All occurrences of read&book where the former is a verb and the latter is a noun, e.g. "I'm reading book."
2) All occurrences of read&book where the former is an adjective, e.g. "This is a nice bedtime reading book".
3) All occurrences of read&book without any constraints on POS tags, e.g. "I've spent all morning reading and booking." (assuming "and" is stop-filtered).

I could ignore the POS tag, do a general search, and then do the filtering outside Lucene, but that would render the POS tag attribute useless. I can't find a way to utilize this new interface.

Sujit Pal said...

Hi amacinho, interesting use case!

I think if you do the POS tagging during indexing and store the terms as word/POS pairs, then you could probably do the search like you are proposing. For the read/* query, your query would have to be rewritten to read/NOUN OR read/VERB OR...
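
A minimal sketch of that rewriting, assuming terms were indexed as word/POS pairs; the field name "f" and the helper class are hypothetical, not from the thread.

// Hypothetical sketch: rewrite "word/*" into an OR over all POS variants,
// assuming terms were indexed as word/POS pairs in field "f".
import net.sf.jtmt.postaggers.Pos;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class PosWildcardRewriter {

  /** Expands word/* into word/NOUN OR word/VERB OR ... */
  public static Query rewrite(String word) {
    BooleanQuery query = new BooleanQuery();
    for (Pos pos : Pos.values()) {
      query.add(new TermQuery(new Term("f", word + "/" + pos)),
        BooleanClause.Occur.SHOULD);
    }
    return query;
  }
}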

amacinho said...

Thanks for the suggestion.

In the meantime, I bought the Lucene in Action book and got another lead on the problem. It looks like the PositionIncrementAttribute interface provides a way to put multiple tokens in the same position. http://lucene.apache.org/java/3_0_0/api/core/org/apache/lucene/analysis/Token.html#setPositionIncrement(int)

I guess this is originally meant for synonym injection in the indexing phase, but it could also help me with another related problem: I can put the POS tag of a word in the same position as the word, so I can also do searches like "DET JJ dog", which would retrieve documents containing "the angry dog", "a frustrated dog", etc.

Unknown said...

hey - awesome stuff - thanks Sujit.

I've successfully ported this to .NET, but I have a quick question - in your pos_suffixes file the "en" suffix appears twice - once as an adjective ending and once as a verb ending.

My Java isn't too hot, but as far as I can see your code doesn't deal with that case? Or am I missing something?

Thanks very much!

James

Sujit Pal said...

Thanks for pointing that out... I think this may be a typo from when I copied over the suffixes from the sites mentioned - the first one I can no longer get to, btw. I think the mapping of "en" to adjective is incorrect - I can think of verbs such as "listen", "thicken", etc., but no adjectives...

e.p.f said...

Incidentally, I was looking for a solution to amacinho's problem, and if I understand correctly, the new tokenstream API doesn't help.

Instead, zero position increments allow injecting part-of-speech tags so that a search for "JJ dog" returns all documents where the word dog appears after an adjective. However, this solution does not allow searching for words with a specific part-of-speech tag, that is, searching for book/VERB instead of book/NOUN. I think a way to achieve this could be to inject the tags into the text with normal position increments, each tag right after the word it refers to.

If you then need to search for book/NOUN you just search for "book NOUN", and if you need to search for "JJ book" you search for "JJ book"~1. Except that "JJ book"~1 is equivalent to "book JJ"~1 :(
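
A minimal sketch of the normal-increment injection idea from the last comment, reusing the PosAttribute above; the class is hypothetical and assumes an upstream PosTaggingFilter has already populated the attribute.

// Hypothetical filter: emits each POS tag as a separate token immediately
// after the word it refers to, with the default position increment of 1.
package net.sf.jtmt.postaggers.lucene;

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

public class PosInjectingFilter extends TokenFilter {

  private TermAttribute termAttr;
  private PosAttribute posAttr;
  private String pendingTag = null;

  public PosInjectingFilter(TokenStream input) {
    super(input);
    this.termAttr = (TermAttribute) addAttribute(TermAttribute.class);
    this.posAttr = (PosAttribute) addAttribute(PosAttribute.class);
  }

  public final boolean incrementToken() throws IOException {
    if (pendingTag != null) {
      // emit the tag token recorded for the previous word
      termAttr.setTermBuffer(pendingTag);
      pendingTag = null;
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    // remember the POS so it can be emitted as the next token
    pendingTag = String.valueOf(posAttr.getPos());
    return true;
  }
}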