Friday, November 14, 2008

IR Math in Java : Rule based POS Tagger

In my previous post, I described an HMM-based Part of Speech (POS) tagger. This post describes a rule-based POS tagger loosely based on ideas underlying the design of the Brill Tagger, as described in Chapter 4 of the TMAP book.

The rule-based tagger provides a single method, tagSentence(), which takes a sentence as input and returns the same sentence with each word tagged with the appropriate part of speech. I was too lazy to add a convenience method to return the POS for a given word in a sentence, since it's really simple, and I have already done this in the code in my previous post.

Wordnet is used to find the part of speech for each word in the sentence. We use the MIT Java Wordnet Interface (JWI) to access the Wordnet database. Wordnet can only recognize the following four parts of speech - NOUN, VERB, ADJECTIVE and ADVERB. Therefore, our POS tagging is restricted to these four and OTHER. We enhance the Pos enum class from our previous post with methods to convert between Wordnet's POS and our Pos enum, as shown below:

// Source: src/main/java/com/mycompany/myapp/postaggers/Pos.java
package com.mycompany.myapp.postaggers;

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.commons.collections15.BidiMap;
import org.apache.commons.collections15.bidimap.DualHashBidiMap;
import org.apache.commons.lang.StringUtils;

import edu.mit.jwi.item.POS;

public enum Pos {

  NOUN, VERB, ADJECTIVE, ADVERB, OTHER;

  private static Map<String,Pos> bmap = null;
  private static BidiMap<Pos,POS> wmap = null;
  private static final String translationFile = 
    "src/main/resources/brown_tags.txt";
  
  public static Pos fromBrownTag(String btag) throws Exception {
    // .. omitted for brevity, see previous post for body
  }

  public static Pos fromWordnetPOS(POS pos) {
    if (wmap == null) {
      wmap = buildPosBidiMap();
    }
    return wmap.getKey(pos);
  }
  
  public static POS toWordnetPos(Pos pos) {
    if (wmap == null) {
      wmap = buildPosBidiMap();
    }
    return wmap.get(pos);
  }

  private static BidiMap<Pos,POS> buildPosBidiMap() {
    wmap = new DualHashBidiMap<Pos,POS>();
    wmap.put(Pos.NOUN, POS.NOUN);
    wmap.put(Pos.VERB, POS.VERB);
    wmap.put(Pos.ADJECTIVE, POS.ADJECTIVE);
    wmap.put(Pos.ADVERB, POS.ADVERB);
    wmap.put(Pos.OTHER, null);
    return wmap;
  }
}

When Wordnet is asked for the POS of a particular word, it can return one of the following three results, each of which is handled differently as described below:

  1. No POS found for word
  2. Single Unique POS found for word
  3. Multiple POS found for word

No POS found for Word

In this case, Wordnet may not know about the word being checked, or it could be a proper noun. We use a combination of word pattern rules to try to guess the POS for this word. If none of the patterns match, the POS is considered to be OTHER.

First, we check whether the first letter is uppercase; if so, we assume the word is a proper noun and tag it with Pos.NOUN.

If not, we check if the word ends with one of the known suffixes in our suffix-to-POS mapping, longest suffix first. The POS corresponding to the first matched suffix is used to tag the word.

If not, the word is tagged as Pos.OTHER.
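The longest-suffix-first ordering matters: a short suffix like "t" would otherwise shadow a longer, more specific one like "ment". Here is a minimal sketch of that lookup; the class name, the suffix entries and the POS strings below are illustrative only (the real mappings live in the suffix data file described later), but the comparator mirrors the one used by the tagger:

```java
import java.util.Map;
import java.util.TreeMap;

public class SuffixOrderSketch {

  // Keys are ordered longest-first (ties broken alphabetically), so a
  // longer, more specific suffix is always tried before a shorter one.
  static final Map<String,String> SUFFIX_POS =
    new TreeMap<String,String>(new java.util.Comparator<String>() {
      public int compare(String s1, String s2) {
        return s1.length() == s2.length()
          ? s1.compareTo(s2) : s2.length() - s1.length();
      }
    });

  static {
    // illustrative entries only, not the real suffix file contents
    SUFFIX_POS.put("ment", "NOUN");
    SUFFIX_POS.put("ly", "ADVERB");
    SUFFIX_POS.put("t", "VERB"); // contrived: would shadow "ment" if tried first
  }

  static String guessPos(String word) {
    for (Map.Entry<String,String> entry : SUFFIX_POS.entrySet()) {
      if (word.toLowerCase().endsWith(entry.getKey())) {
        return entry.getValue();
      }
    }
    return "OTHER";
  }

  public static void main(String[] args) {
    // "ment" (length 4) is tried before "t" (length 1), so we get NOUN
    System.out.println(guessPos("government")); // NOUN
    System.out.println(guessPos("quickly"));    // ADVERB
  }
}
```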

Single Unique POS found for Word

In this case, there is no confusion -- Wordnet tells us that there is a single POS found for the word, so we tag the word with this POS, and continue on with our life...err, I mean, the next word.

Multiple POS found for Word

When Wordnet reports multiple POS possibilities for a word, it means that the word can function as different parts of speech depending on where it occurs in the sentence - in other words, the context determines the POS.

The tagger considers the context as the word trigram surrounding the word (i.e. before/current/after). Two rules are fired, the word-backward rule and the word-forward rule, unless the word happens to occur at the beginning or end of the sentence, in which case only one rule is fired. The objective of these rules is to find the most likely POS for the word based on the POS of its anterior and posterior neighbors.

The probabilities used to compute the likelihood come from the transition probabilities (A values) that were computed from the Brown Corpus in my previous post.

Each rule finds the highest probability of a particular POS pair (before/current or current/after) occurring and associates it with the word's POS. The two probabilities are then added, and the candidate POS with the highest total is used to tag the word.

In addition, we could have used the initial probabilities (Π) from our last post to similarly resolve an ambiguous POS for the first word in the sentence, but we did not do this.
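To make the arithmetic concrete, here is a hypothetical sketch of the disambiguation step, simplified to a single POS per neighbor (the real tagger also sums over all candidate POS of an ambiguous neighbor). The class name is mine, and the matrix values are the A values from the transition probability file shown later, with rows and columns in the order NOUN, VERB, ADJECTIVE, ADVERB, OTHER:

```java
public class DisambiguationSketch {

  enum Pos { NOUN, VERB, ADJECTIVE, ADVERB, OTHER }

  // tp[i][j] = P(next POS is j | current POS is i), rows in enum order
  static final double[][] TP = {
    {0.155, 0.156, 0.019, 0.025, 0.645},
    {0.095, 0.195, 0.168, 0.094, 0.449},
    {0.639, 0.024, 0.148, 0.005, 0.183},
    {0.052, 0.228, 0.111, 0.041, 0.569},
    {0.206, 0.199, 0.205, 0.039, 0.351}
  };

  // For each candidate POS of the ambiguous word, add the backward-rule
  // probability TP[prev][cand] and the forward-rule probability
  // TP[cand][next], then pick the candidate with the highest total.
  static Pos disambiguate(Pos prev, Pos[] candidates, Pos next) {
    Pos best = null;
    double bestScore = -1.0;
    for (Pos cand : candidates) {
      double score = TP[prev.ordinal()][cand.ordinal()]
                   + TP[cand.ordinal()][next.ordinal()];
      if (score > bestScore) {
        bestScore = score;
        best = cand;
      }
    }
    return best;
  }

  public static void main(String[] args) {
    // "state" can be a NOUN or a VERB; suppose the previous word resolved
    // to ADVERB and the next word to OTHER. Then:
    //   NOUN: 0.052 + 0.645 = 0.697
    //   VERB: 0.228 + 0.449 = 0.677
    System.out.println(
      disambiguate(Pos.ADVERB, new Pos[] {Pos.NOUN, Pos.VERB}, Pos.OTHER));
  }
}
```

With these particular neighbors the NOUN reading wins by a small margin; a different context (or an ambiguous neighbor contributing several terms) can tip the balance the other way.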

Tagger code

The code for the tagger is shown below:

// Source: src/main/java/com/mycompany/myapp/postaggers/RuleBasedTagger.java
package com.mycompany.myapp.postaggers;

import java.io.BufferedReader;
import java.io.FileReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

import org.apache.commons.lang.StringUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

import com.mycompany.myapp.clustering.ByValueComparator;
import com.mycompany.myapp.tokenizers.Token;
import com.mycompany.myapp.tokenizers.TokenType;
import com.mycompany.myapp.tokenizers.WordTokenizer;

import edu.mit.jwi.Dictionary;
import edu.mit.jwi.IDictionary;
import edu.mit.jwi.item.IIndexWord;

public class RuleBasedTagger {

  private final Log log = LogFactory.getLog(getClass());
  
  private class Context {
    public String prev;
    public List<Pos> prevPos;
    public String curr;
    public List<Pos> nextPos;
    public String next;
    public String toString() {
      return StringUtils.join(new String[] {prev,curr,next}, "/");
    }
  };
  
  private IDictionary wordnetDictionary;
  private Map<String,Pos> suffixPosMap;
  private double[][] tp;
  
  public void setWordnetDictLocation(String wordnetDictLocation) 
      throws Exception {
    this.wordnetDictionary = new Dictionary(
      new URL("file", null, wordnetDictLocation));
    this.wordnetDictionary.open();
  }

  public void setSuffixMappingLocation(String suffixMappingLocation) 
      throws Exception {
    String line;
    this.suffixPosMap = new TreeMap<String,Pos>(
      new Comparator<String>() {
        public int compare(String s1, String s2) {
          int l1 = s1.length();
          int l2 = s2.length();
          if (l1 == l2) {
            return s1.compareTo(s2);
          } else {
            return (l2 > l1 ? 1 : -1);
          }
        }
      }
    );
    BufferedReader reader = new BufferedReader(
      new FileReader(suffixMappingLocation));
    while ((line = reader.readLine()) != null) {
      if (StringUtils.isEmpty(line) || line.startsWith("#")) {
        continue;
      }
      String[] suffixPosPair = StringUtils.split(line, "\t");
      suffixPosMap.put(suffixPosPair[0], Pos.valueOf(suffixPosPair[1]));
    }
    reader.close();
  }

  public void setTransitionProbabilityDatafile(
      String transitionProbabilityDatafile) throws Exception {
    int numPos = Pos.values().length;
    tp = new double[numPos][numPos];
    BufferedReader reader = new BufferedReader(
      new FileReader(transitionProbabilityDatafile));
    int i = 0; // row
    String line;
    while ((line = reader.readLine()) != null) {
      if (StringUtils.isEmpty(line) || line.startsWith("#")) {
        continue;
      }
      String[] parts = StringUtils.split(line, "\t");
      for (int j = 0; j < parts.length; j++) {
        tp[i][j] = Double.valueOf(parts[j]);
      }
      i++;
    }
    reader.close();
  }

  public String tagSentence(String sentence) throws Exception {
    StringBuilder taggedSentenceBuilder = new StringBuilder();
    WordTokenizer wordTokenizer = new WordTokenizer();
    wordTokenizer.setText(sentence);
    List<Token> tokens = new ArrayList<Token>();
    Token token = null;
    while ((token = wordTokenizer.nextToken()) != null) {
      tokens.add(token);
    }
    // extract the words from the tokens
    List<String> words = new ArrayList<String>();
    for (Token tok : tokens) {
      if (tok.getType() == TokenType.WORD) {
        words.add(tok.getValue());
      }
    }
    // for each word, find the pos
    int position = 0;
    for (String word : words) {
      Pos partOfSpeech = getPartOfSpeech(words, word, position);
      taggedSentenceBuilder.append(word).
        append("/").
        append(partOfSpeech.name()).
        append(" ");
      position++;
    }
    return taggedSentenceBuilder.toString();
  }

  private Pos getPartOfSpeech(List<String> wordList, String word, 
      int position) {
    List<Pos> partsOfSpeech = getPosFromWordnet(word);
    int numPos = partsOfSpeech.size();
    if (numPos == 0) {
      // unknown Pos, apply word rules to figure out Pos
      if (startsWithUppercase(word)) {
        return Pos.NOUN;
      }
      Pos pos = getPosBasedOnSuffixRules(word);
      if (pos != null) {
        return pos;
      } else {
        return Pos.OTHER;
      }
    } else if (numPos == 1) {
      // unique Pos, return
      return partsOfSpeech.get(0);
    } else {
      // ambiguous Pos, apply disambiguation rules
      Context context = getContext(wordList, position);
      Map<Pos,Double> posProbs = new HashMap<Pos,Double>();
      if (context.prev != null) {
        // backward neighbor rule
        accumulatePosProbabilities(posProbs, word, partsOfSpeech, 
          context.prev, context.prevPos, false);
      }
      if (context.next != null) {
        // forward neighbor rule
        accumulatePosProbabilities(posProbs, word, partsOfSpeech, 
          context.next, context.nextPos, true);
      }
      if (posProbs.size() == 0) {
        return Pos.OTHER;
      } else {
        ByValueComparator<Pos,Double> bvc = 
          new ByValueComparator<Pos,Double>(posProbs);
        List<Pos> poslist = new ArrayList<Pos>();
        poslist.addAll(posProbs.keySet());
        Collections.sort(poslist, Collections.reverseOrder(bvc));
        return poslist.get(0);
      }
    }
  }

  private List<Pos> getPosFromWordnet(String word) {
    List<Pos> poslist = new ArrayList<Pos>();
    for (Pos pos : Pos.values()) {
      try {
        IIndexWord indexWord = 
          wordnetDictionary.getIndexWord(word, Pos.toWordnetPos(pos));
        if (indexWord != null) {
          poslist.add(pos);
        }
      } catch (NullPointerException e) {
        // JWI throws this if it cannot find the word in its dictionary
        // so we just dont add anything to the poslist.
        continue;
      }
    }
    return poslist;
  }

  private boolean startsWithUppercase(String word) {
    // Character.isUpperCase() tests the character itself; comparing
    // against Character.UPPERCASE_LETTER would compare a char to a
    // character-category constant, which is always false here.
    return Character.isUpperCase(word.charAt(0));
  }

  private Pos getPosBasedOnSuffixRules(String word) {
    for (String suffix : suffixPosMap.keySet()) {
      if (StringUtils.lowerCase(word).endsWith(suffix)) {
        return suffixPosMap.get(suffix);
      }
    }
    return null;
  }

  private Context getContext(List<String> words, int wordPosition) {
    Context context = new Context();
    if ((wordPosition - 1) >= 0) {
      context.prev = words.get(wordPosition - 1);
      context.prevPos = getPosFromWordnet(context.prev);
    }
    context.curr = words.get(wordPosition);
    if ((wordPosition + 1) < words.size()) {
      context.next = words.get(wordPosition + 1);
      context.nextPos = getPosFromWordnet(context.next);
    }
    return context;
  }
  
  private void accumulatePosProbabilities(
      Map<Pos,Double> posProbabilities,
      String word, List<Pos> wordPosList, String neighbor, 
      List<Pos> neighborPosList, boolean isForwardRule) {
    if (isForwardRule) {
      for (Pos wordPos : wordPosList) {
        for (Pos neighborPos : neighborPosList) {
          double prob = 
            tp[wordPos.ordinal()][neighborPos.ordinal()];
          updatePosProbabilities(posProbabilities, wordPos, prob);
        }
      }
    } else {
      for (Pos neighborPos : neighborPosList) {
        for (Pos wordPos : wordPosList) {
          double prob = 
            tp[neighborPos.ordinal()][wordPos.ordinal()];
          updatePosProbabilities(posProbabilities, wordPos, prob);
        }
      }
    }
  }

  private void updatePosProbabilities(
      Map<Pos,Double> posProbabilities,
      Pos wordPos, double prob) {
    Double origProb = posProbabilities.get(wordPos);
    if (origProb == null) {
      posProbabilities.put(wordPos, prob);
    } else {
      posProbabilities.put(wordPos, prob + origProb);
    }
  }
}

Test Code and Data Files

The test code for this is really simple. All we do is instantiate the RuleBasedTagger and configure it with the locations of the Wordnet dictionary, the suffix mapping data file, and the transition probabilities (A values) from the HMM data file that we built in our previous post. The files are described in more detail below. Once instantiated and set up, we feed it a set of sentences and get back POS-tagged sentences. Here is the code for the JUnit test.

// Source: src/test/java/com/mycompany/myapp/postaggers/RuleBasedTaggerTest.java
package com.mycompany.myapp.postaggers;

import org.junit.Test;

public class RuleBasedTaggerTest {

  private String[] INPUT_TEXTS = {
    "The growing popularity of Linux in Asia, Europe, and the US is " +
    "a major concern for Microsoft.",
    "Jaguar will sell its new XJ-6 model in the US for a small fortune.",
    "The union is in a sad state.",
    "Please do not state the obvious.",
    "I am looking forward to the state of the union address.",
    "I have a bad cold today.",
    "The cold war was over long ago."
  };

  @Test
  public void testTagSentence() throws Exception {
    // the tagger is stateless across sentences, so build it once
    RuleBasedTagger tagger = new RuleBasedTagger();
    tagger.setWordnetDictLocation("/opt/wordnet-3.0/dict");
    tagger.setSuffixMappingLocation("src/main/resources/pos_suffixes.txt");
    tagger.setTransitionProbabilityDatafile(
      "src/main/resources/pos_trans_prob.txt");
    for (String sentence : INPUT_TEXTS) {
      String taggedSentence = tagger.tagSentence(sentence);
      System.out.println("Original: " + sentence);
      System.out.println("Tagged:   " + taggedSentence);
    }
  }
}

I am using Wordnet-3.0 data files which can be downloaded from here.

The suffix-to-POS mapping file was created manually from data available here and here. The suffix and POS are tab-separated. A partial listing follows:

# Source: src/main/resources/pos_suffixes.txt
# POS Suffixes
#SUFFIX POS
dom     NOUN
ity     NOUN
ment    NOUN
sion    NOUN
tion    NOUN
ness    NOUN
ance    NOUN
ence    NOUN
er      NOUN
...

The transition probabilities (A values) are taken from the HMM text file generated from the Brown Corpus, as described in my previous post. It is a tab-separated file of observed probabilities of transitioning from one POS to another; the numbers across each line should add up to 1. Here is what the file looks like:

# Source: src/main/resources/pos_trans_prob.txt
# NOUN  VERB    ADJECTIVE  ADVERB  OTHER
0.155   0.156   0.019      0.025   0.645
0.095   0.195   0.168      0.094   0.449
0.639   0.024   0.148      0.005   0.183
0.052   0.228   0.111      0.041   0.569
0.206   0.199   0.205      0.039   0.351
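Since each row is a probability distribution over the next POS, a quick sanity check is to verify the row sums. A minimal sketch (the class name and the 0.01 tolerance are mine; the slack allows for the values being rounded to three decimal places):

```java
public class TransitionProbabilityCheck {

  // the transition matrix from pos_trans_prob.txt, rows and columns
  // in the order NOUN, VERB, ADJECTIVE, ADVERB, OTHER
  static final double[][] TP = {
    {0.155, 0.156, 0.019, 0.025, 0.645},
    {0.095, 0.195, 0.168, 0.094, 0.449},
    {0.639, 0.024, 0.148, 0.005, 0.183},
    {0.052, 0.228, 0.111, 0.041, 0.569},
    {0.206, 0.199, 0.205, 0.039, 0.351}
  };

  // each row should sum to 1, give or take rounding error
  static boolean rowSumsToOne(double[] row) {
    double sum = 0.0;
    for (double p : row) {
      sum += p;
    }
    return Math.abs(sum - 1.0) < 0.01;
  }

  public static void main(String[] args) {
    for (int i = 0; i < TP.length; i++) {
      System.out.println("row " + i + " sums to ~1: " + rowSumsToOne(TP[i]));
    }
  }
}
```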

Results

The results of running the test are shown below (edited for readability by adding newlines to break up the original and tagged sentences so words don't break across lines). As you can see, the tagging is fairly accurate.

Original: The growing popularity of Linux in Asia, Europe, and the US is 
          a major concern for Microsoft.
Tagged:   The/OTHER growing/ADJECTIVE popularity/NOUN of/OTHER Linux/NOUN 
          in/ADJECTIVE Asia/NOUN Europe/NOUN and/OTHER the/OTHER US/NOUN 
          is/OTHER a/NOUN major/ADJECTIVE concern/NOUN for/NOUN 
          Microsoft/OTHER 

Original: Jaguar will sell its new XJ-6 model in the US for a small fortune.
Tagged:   Jaguar/NOUN will/NOUN sell/VERB its/OTHER new/OTHER XJ-6/OTHER
          model/ADJECTIVE in/NOUN the/OTHER US/NOUN for/NOUN a/NOUN 
          small/ADJECTIVE fortune/NOUN 

Original: The union is in a sad state.
Tagged:   The/OTHER union/OTHER is/OTHER in/ADJECTIVE a/NOUN sad/ADJECTIVE
          state/NOUN 

Original: Please do not state the obvious.
Tagged:   Please/VERB do/VERB not/ADVERB state/VERB the/OTHER 
          obvious/ADJECTIVE 

Original: I am looking forward to the state of the union address.
Tagged:   I/ADJECTIVE am/NOUN looking/ADJECTIVE forward/NOUN to/OTHER 
          the/OTHER state/OTHER of/OTHER the/OTHER union/ADJECTIVE 
          address/NOUN 

Original: I have a bad cold today.
Tagged:   I/ADJECTIVE have/NOUN a/NOUN bad/ADJECTIVE cold/NOUN today/NOUN 

Original: The cold war was over long ago.
Tagged:   The/OTHER cold/ADJECTIVE war/NOUN was/OTHER over/ADVERB 
          long/VERB ago/ADJECTIVE 

Conclusion

Unlike the HMM approach, where all the information was built into the model at the outset, the rule-based approach takes advantage of the fact that we can use a tagged dictionary (Wordnet) to tell us the POS for a word as we encounter it, and that most words resolve to a single POS. For words that are not recognized by Wordnet, we use simple word pattern matching to deduce the POS. For words that resolve to multiple POS, we use collocation probability rules to disambiguate.

And now, for something completely different...

On a completely different note, regular (i.e. more than first-time) readers may have noticed some new widgets on my blog. This week, I added a tag cloud widget, spiffed up the "Blogs I read" widget with a new one from Blogger which displays the favicon of the blog and reports on the last time it was updated, added social bookmarking links at the bottom of the post, and added URLs for Atom and RSS Syndication feeds for my blog. Hope you like them, and no, this is not some sinister black-hat SEO bid to pollute/enrich the World Wide Web with my blog contents. For my reasons, read on...

I've wanted to get a tag cloud ever since I wrote about how to build one with Python. I think it is an awesome way to provide a snapshot of the entire site contents at a glance. However, Blogger does not give you a Tag Cloud Widget, although it does give you a Label widget on which this one is based, and I never had the time to mess around with the HTML Template until last week.

For the social bookmarking links, I noticed that one of you had submitted one of my posts to Digg. Whoever it was, thank you, especially since you had to do this manually. For those of you who like my posts enough to submit them to your favorite social bookmarking site, there is now a bank of links under my byline and tags which you can click, and which will pre-populate the title and URL so you don't have to cut and paste.

The syndication URLs just seemed like a good idea because I've written quite a few times about ROME and RSS, and it's funny that I didn't offer a syndication feed URL myself before this. So now I offer two, one for RSS and another for Atom - take your pick :-).

Update 2009-04-26: In recent posts, I have been building on code written and described in previous posts, so there were (and rightly so) quite a few requests for the code. So I've created a project on Sourceforge to host the code. You will find the complete source code built so far in the project's SVN repository.

14 comments (moderated to prevent spam):

lavanya said...

dear sir,
I am LAVANYA from india , karnataka,bangalore.I am in pesit cse brnch. i am doing project on word classification using jwi. i want to know what i hav to do coz am new to java. plz do reply me i want have yr kind help . my mail id is lavanya2290@gmail.com

thanking u,
LAVANYA

Sujit Pal said...

Hi Lavanya, not sure I can be of much help here... If you are new to Java and you need to learn it for your project, you should. The JWI project provides some documentation as well, so depending on your needs, you will probably find what you need there. I am not an expert on either word classification or JWI, but I have done some work with both, so if you have specific questions, feel free to ask and I can try to help.

allanctan said...

Hi Sujit, Thank you for your POS posts... they are very useful. Just wondering which of the two approaches (rule-based vs. HMM) is better? more accurate?

Sujit Pal said...

Hi allanctan, thanks for your appreciation. I think it may make sense to figure out the POS transition probabilities using HMM over a representative training set, then use that as one of the inputs to a rule based tagger.

Shahid said...

Respected Sir
I just went through your code, just amazing. I was trying to make it run. The problem is I'm unable to download the Brown corpus from nltk.org. The site allows to download the corpus using python CLI. The problem is I'm behind a proxy server of the college and python CLI doesn't allows me to download. I don't know anything about python so unable to use it. Kindly help me out in either providing me some other link to 'Brown Corpus' or configuring the python to let it download behind the proxy.
I will be highly obliged if you lend me some help in this regard.

Thanking you
Shahid

Sujit Pal said...

Hi Shahid, if you are behind a proxy server that places limits on how much you can download, I don't think sending you another URL (if I knew of one, which I don't) would help. Perhaps just talk to your college admin to download it for you?

Azeem said...

Dear Sir,

Thank you for this post and it helped a lot to understand the implementation level procedures. I am currently doing my final year project and I'm developing a Chatterbot to obtain concise information to human query from online repositories. Currently I'm implementing POS tagger. I have resources to use only Wordnet as the lexical database. I'm planning to use Penn Treebank tagset instead of Wordnet tags. Is it possible to do tag words with PTB tagset using Wordnet? I'm thinking about building a finite state transducer... Also I'd like to have your advice over choosing techniques (rule bases, HMM, and Brill Tagger) for this project.

Thank You

Sujit Pal said...

Thanks Azeem, and you're welcome. I am not an expert in this area, so my advice may be incorrect. So if your intuition does not agree with what I suggest, then your intuition is probably correct :-).

My preference (as I have described) is to use HMM to identify the transition probabilities from a test corpus, then use use Wordnet or PTB to tag the non-ambiguous cases, then disambiguate the rest using transition probabilities for the most likely POS.

I haven't used PTB, so can't say for sure if the mapping would be possible.

Aarthi said...

Hi Sujit, I am working in question categorization. Can u tel me how pos tagging can be applied to the questions?

Sujit Pal said...

Hi Aarthi, you could decide (based on your knowledge of the question data) what part(s) of speech in your questions are good discriminators, and then extract only those for categorization. You could even run some analysis on words in various POS categories to find which ones are good candidates.

Java Developer said...

Hi Sujit I am folling your blogs last 7 months and its really helping my work.but currently i am stuck at a problem while making Hirarchy of Noun Pharses in a sentence using Nlp. I am not finding any alogirithm for same

I want to make a hirarchy like this.

nike -> company, nike -> shoes, shoes -> boots, boots -> ugg boots

Sujit Pal said...

I don't know if you can do this without reference to an external knowledge base which "knows" about all or part of the hierarchy. I haven't used any myself (the taxonomy I speak about off and on is one maintained by my company, and we reference external datastores as well as do manual curation). You may find part of your hierarchy in Wordnet, and maybe some other part in something like OpenCalais, but not 100% sure about that.

Vikram Jadhav said...

And what is the use of hmm_tagger.dat file i didn't get that.

Sujit Pal said...

Hi Vikram did you post to the wrong blog post? Dont see a hmm_tagger.dat file here...