Salmon Run: IR Math in Java : HMM Based POS Tagger/Recognizer

Saturday, November 08, 2008

IR Math in Java : HMM Based POS Tagger/Recognizer

As you know, I have been slowly working my way through Dr Konchady's TMAP book, and coding up the algorithms in Java. By doing so, I hope to understand the techniques and mathematical models presented, so I can apply them to real-life problems in the future. In this post I describe an implementation of a Hidden Markov Model based Part of Speech recognizer/tagger, based on the material presented in Chapter 4 of the TMAP book.

Background

What follows is my take on what an HMM is and how it can be used for Part of Speech (POS) tagging. For a more detailed, math-heavy, and possibly more accurate description of HMM and their internals, read the Wikipedia article or Dr Rabiner's tutorial or the TMAP book if you happen to own it. A Hidden Markov Model can be thought of as a probabilistic finite state machine. Its states can be represented by the set S = {S₁, S₂, ..., S_n} which are not directly visible. What is visible is a set of Observations O = {O₁, O₂, ..., O_m} which are the result of the machine moving from one state to the other. The probabilities of the machine starting in one of the states S_i is specified by the one-dimensional matrix Π of size n. The probabilities of the machine moving from one state to another is specified by a two dimensional matrix A of size n*n. Finally, the probability of an observation being observed when the machine is in a certain state is given by the two dimensional matrix B of size n*m. More succintly,

  H = f(Π, A, B)
  where:
    H = the HMM
    Π = start probabilities. The element Π_i represents 
        the probability that the HMM starts a sequence in State
        State S_i, where i in (0..n-1).
    A = transition probabilities. The element A_i,j represents
        the probability of a transition from State S_i to
        State S_j, where i and j in (0..n-1).
    B = emission probabilities. The element B_i,j represents
        the probability of an Observation O_j occurring while
        the machine is in State S_i, where i in (0..n-1)
        and j in (0..m-1).
    n = number of states.
    m = number of unique observations.

The objective of POS tagging is to tag each word of a sentence with its part-of-speech tag. While some words can be unambiguously tagged, ie their is only one POS that the word is ever used for, there are quite a few which can represent different POS depending on its usage in the sentence. For example, cold can be both a noun and an adjective, and catch can be both a noun and a verb. The fact that the word exists in the sentence is known, while the POS for the word is unknown. Therefore the HMM built for POS tagging would model the words as visible observations and the set of possible POS as the hidden states.

As far as POS tagging is concerned, the main problems that can be solved by HMMs are as follows. Given an HMM,

Finding the most likely state sequence for a given observation sequence. In this case, we pass in a sentence, and tag each word with its most likely POS.
Finding the most likely state for a given observation in a sequence. This is useful for word sense disambiguation (WSD), so we can tell the most likely POS that a particular word in a sentence belongs to.

The problems above are identical from the point of view of a HMM, and are solved using the Viterbi algorithm.

The second problem is to find the probability of a certain sequence of observations. This can be used to answer questions such as whether a sentence such as "I am going to the market" is more common than one such as "To the market I go". The Forward-Backward Algorithm is used to solve this kind of problem. This can be useful for applications that predict the most likely outcome given a set of input observations, but probably is not important from the perspective of POS tagging. Both the above problems need to have a HMM built from (manually) tagged data.

A third problem of HMMs is how to build one given a corpus of untagged text. Such an HMM would allow us to solve the second type of problem. However, since the HMM has not been fed with tagged words, it must depend on a clustering algorithm such as K-Means to cluster the words into undefined hidden states, which are of no use when attempting to solve the second type of problem. The two learning algorithms used here are the K-Means Algorithm to build the initial HMM and the Baum-Welch Algorithm to refine it. As with the second type of problem, this does not have much applications where POS taggers are concerned.

I used the Java HMM library Jahmm to do all of the heavy computational lifting. It has implementations of the algorithms mentioned above, as well as several utility methods and classes to model various kinds of Observation.

Building the HMM from Tagged Data

For my tagged corpus, I used the Brown Corpus, downloading the data from the Natural Language Toolkit Project (NTLP). The corpus is a set of about 500 files containing one sentence per line, each manually tagged with a very comprehensive set of POS tags described here. Since I plan to use Wordnet at some point with this data, and Wordnet only categorizes words as one of 4 categories, I set up my own Part of Speech Enum called Pos which has 5 categories, the 4 from Wordnet and OTHER. As a result, I had to the dumb the Brown tags down using the translation table shown below:

BTAG	POS	BTAG	POS	BTAG	POS	BTAG	POS
(	OTHER	FW-CS	OTHER	MD	VERB	RBR+CS	ADVERB
)	OTHER	FW-DT	OTHER	MD*	VERB	RBT	ADVERB
*	OTHER	FW-DT+BEZ	OTHER	MD+HV	VERB	RN	ADVERB
,	OTHER	FW-DTS	OTHER	MD+PPSS	VERB	RP	ADVERB
--	OTHER	FW-HV	VERB	MD+TO	VERB	RP+IN	ADVERB
.	OTHER	FW-IN	OTHER	NN	NOUN	TO	OTHER
:	OTHER	FW-IN+AT	OTHER	NN$	NOUN	TO+VB	VERB
ABL	OTHER	FW-IN+NN	OTHER	NN+BEZ	NOUN	UH	OTHER
ABN	OTHER	FW-IN+NP	OTHER	NN+HVD	NOUN	VB	VERB
ABX	OTHER	FW-JJ	ADJECTIVE	NN+HVZ	NOUN	VB+AT	VERB
AP	OTHER	FW-JJR	ADJECTIVE	NN+IN	NOUN	VB+IN	VERB
AP$	OTHER	FW-JJT	ADJECTIVE	NN+MD	NOUN	VB+JJ	VERB
AP+AP	OTHER	FW-NN	NOUN	NN+NN	NOUN	VB+PPO	VERB
AT	ADJECTIVE	FW-NN$	NOUN	NNS	NOUN	VB+RP	VERB
BE	VERB	FW-NNS	NOUN	NNS$	NOUN	VB+TO	VERB
BED	VERB	FW-NP	NOUN	NNS+MD	NOUN	VB+VB	VERB
BED*	VERB	FW-NPS	NOUN	NP	NOUN	VBD	VERB
BEDZ	VERB	FW-NR	NOUN	NP$	NOUN	VBG	VERB
BEDZ*	VERB	FW-OD	NOUN	NP+BEZ	NOUN	VBG+TO	VERB
BEG	VERB	FW-PN	OTHER	NP+HVZ	NOUN	VBN	VERB
BEM	VERB	FW-PP$	OTHER	NP+MD	NOUN	VBN+TO	VERB
BEM*	VERB	FW-PPL	OTHER	NPS	NOUN	VBZ	VERB
BEN	VERB	FW-PPL+VBZ	OTHER	NPS$	NOUN	WDT	OTHER
BER	VERB	FW-PPO	OTHER	NR	NOUN	WDT+BER	OTHER
BER*	VERB	FW-PPO+IN	OTHER	NR$	NOUN	WDT+BER+PP	OTHER
BEZ	VERB	FW-PPS	OTHER	NR+MD	NOUN	WDT+BEZ	OTHER
BEZ*	VERB	FW-PPSS	OTHER	NRS	NOUN	WDT+DO+PPS	OTHER
CC	OTHER	FW-PPSS+HV	OTHER	OD	NOUN	WDT+DOD	OTHER
CD	NOUN	FW-QL	OTHER	PN	OTHER	WDT+HVZ	OTHER
CD$	NOUN	FW-RB	ADVERB	PN$	OTHER	WP$	OTHER
CS	OTHER	FW-RB+CC	ADVERB	PN+BEZ	OTHER	WPO	OTHER
DO	VERB	FW-TO+VB	VERB	PN+HVD	OTHER	WPS	OTHER
DO*	VERB	FW-UH	OTHER	PN+HVZ	OTHER	WPS+BEZ	OTHER
DO+PPSS	VERB	FW-VB	VERB	PN+MD	OTHER	WPS+HVD	OTHER
DOD	VERB	FW-VBD	VERB	PP$	OTHER	WPS+HVZ	OTHER
DOD*	VERB	FW-VBG	VERB	PP$$	OTHER	WPS+MD	OTHER
DOZ	VERB	FW-VBN	VERB	PPL	OTHER	WQL	OTHER
DOZ*	VERB	FW-VBZ	VERB	PPLS	OTHER	WRB	ADVERB
DT	OTHER	FW-WDT	OTHER	PPO	OTHER	WRB+BER	ADVERB
DT$	OTHER	FW-WPO	OTHER	PPS	OTHER	WRB+BEZ	ADVERB
DT+BEZ	OTHER	FW-WPS	OTHER	PPS+BEZ	OTHER	WRB+DO	ADVERB
DT+MD	OTHER	HV	VERB	PPS+HVD	OTHER	WRB+DOD	ADVERB
DTI	OTHER	HV*	VERB	PPS+HVZ	OTHER	WRB+DOD*	ADVERB
DTS	OTHER	HV+TO	VERB	PPS+MD	OTHER	WRB+DOZ	ADVERB
DTS+BEZ	OTHER	HVD	VERB	PPSS	OTHER	WRB+IN	ADVERB
DTX	OTHER	HVD*	VERB	PPSS+BEM	OTHER	WRB+MD	ADVERB
EX	VERB	HVG	VERB	PPSS+BER	OTHER	-	-
EX+BEZ	VERB	HVN	VERB	PPSS+BEZ	OTHER	-	-
EX+HVD	VERB	HVZ	VERB	PPSS+BEZ*	OTHER	-	-
EX+HVZ	VERB	HVZ*	VERB	PPSS+HV	OTHER	-	-
EX+MD	VERB	IN	OTHER	PPSS+HVD	OTHER	-	-
FW-*	OTHER	IN+IN	OTHER	PPSS+MD	OTHER	-	-
FW-AT	ADJECTIVE	IN+PPO	OTHER	PPSS+VB	OTHER	-	-
FW-AT+NN	ADJECTIVE	JJ	ADJECTIVE	QL	OTHER	-	-
FW-AT+NP	ADJECTIVE	JJ$	ADJECTIVE	QLP	OTHER	-	-
FW-BE	VERB	JJ+JJ	ADJECTIVE	RB	ADVERB	-	-
FW-BER	VERB	JJR	ADJECTIVE	RB$	ADVERB	-	-
FW-BEZ	VERB	JJR+CS	ADJECTIVE	RB+BEZ	ADVERB	-	-
FW-CC	OTHER	JJS	ADJECTIVE	RB+CS	ADVERB	-	-
FW-CD	NOUN	JJT	ADJECTIVE	RBR	ADVERB	-	-

You may notice that some of the mappings are not correct. Unfortunately, my knowledge of formal English grammar is not as good as I would like it to be, owing to having been educated in an environment that posited that a person can learn to recognize incorrect grammar better by reading and writing enough gramatically correct sentences rather than through a study of language rules. My grandfather, obviously regarding all this as crazy talk, briefly attempted to rectify that, armed with a Wren and Martin and a 18" ruler, but as you can probably see, it did not work out all that well :-).

The code for the Pos Enum is shown below. As mentioned earlier, it exposes a set of 5 POS values, and has a convenience method to convert the Brown tag into corresponding Pos.

// Source: src/main/java/com/mycompany/myapp/postaggers/Pos.java
package com.mycompany.myapp.postaggers;

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.commons.lang.StringUtils;

/**
 * Enumeration of Parts of Speech being considered. Conversions from
 * Brown Tags and Wordnet tags are handled by convenience methods.
 */
public enum Pos {

  NOUN, VERB, ADJECTIVE, ADVERB, OTHER;

  private static Map<String,Pos> bmap = null;
  private static final String translationFile = 
    "src/main/resources/brown_tags.txt";
  
  public static Pos fromBrownTag(String btag) throws Exception {
    if (bmap == null) {
      bmap = new HashMap<String,Pos>();
      BufferedReader reader = new BufferedReader(new InputStreamReader(
          new FileInputStream(translationFile)));
      String line;
      while ((line = reader.readLine()) != null) {
        if (line.startsWith("#")) {
          continue;
        }
        String[] cols = StringUtils.split(line, "\t");
        bmap.put(StringUtils.lowerCase(cols[0]), Pos.valueOf(cols[1])); 
      }
      reader.close();
    }
    Pos pos = bmap.get(btag);
    if (pos == null) {
      return Pos.OTHER;
    }
    return pos;
  }
}

The BrownCorpusReader reads through each tagged file in the Brown Corpus directory, extracts the word and the tag out of each tagged word, converts the Brown tag to its equivalent Pos value, and accumulates the occurrences into internal counters. Once all files are processed, the counters are normalized into the three probability matrices Π, A and B that we spoke about earlier.

Since the number of words tagged in any corpus is potentially quite large, we represent the words (or observations) in the HMM as an integer. That is why the BrownCorpusReader also dumps out a list of unique words it found in the corpus into a flat file which can be pulled back into memory later to do the mapping between the word and the integer observation Id.

// Source: src/main/java/com/mycompany/myapp/postaggers/BrownCorpusReader.java
package com.mycompany.myapp.postaggers;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileWriter;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

import org.apache.commons.collections15.Bag;
import org.apache.commons.collections15.bag.HashBag;
import org.apache.commons.lang.StringUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

import Jama.Matrix;

/**
 * Reads a file or directory of tagged text from Brown Corpus and
 * computes the various probability matrices for the HMM.
 */
public class BrownCorpusReader {

  private final Log log = LogFactory.getLog(getClass());
  
  private String dataFilesLocation;
  private String wordDictionaryLocation;
  private boolean debug;

  private Bag<String> piCounts = new HashBag<String>();
  private Bag<String> aCounts = new HashBag<String>();
  private Map<String,Double[]> wordPosMap = 
    new HashMap<String,Double[]>();
  
  private Matrix pi;
  private Matrix a;
  private Matrix b;
  private List<String> words;
  
  public void setDataFilesLocation(String dataFilesLocation) {
    this.dataFilesLocation = dataFilesLocation;
  }
  
  public void setWordDictionaryLocation(String wordDictionaryLocation) {
    this.wordDictionaryLocation = wordDictionaryLocation;
  }
  
  public void setDebug(boolean debug) {
    this.debug = debug;
  }

  public void read() throws Exception {
    File location = new File(dataFilesLocation);
    File[] inputs;
    if (location.isDirectory()) {
      inputs = location.listFiles();
    } else {
      inputs = new File[] {location};
    }
    int currfile = 0;
    int totfiles = inputs.length;
    for (File input : inputs) {
      currfile++;
      log.info("Processing file (" + currfile + "/" + totfiles + "): " + 
        input.getName());
      BufferedReader reader = new BufferedReader(new InputStreamReader(
        new FileInputStream(input)));
      String line;
      while ((line = reader.readLine()) != null) {
        if (StringUtils.isEmpty(line)) {
          continue;
        }
        StringTokenizer tok = new StringTokenizer(line, " ");
        int wordIndex = 0;
        Pos prevPos = null;
        while (tok.hasMoreTokens()) {
          String taggedWord = tok.nextToken();
          String[] wordTagPair = StringUtils.split(
            StringUtils.lowerCase(StringUtils.trim(taggedWord)), "/");
          if (wordTagPair.length != 2) {
            continue;
          }
          Pos pos = Pos.fromBrownTag(wordTagPair[1]);
          if (! wordPosMap.containsKey(wordTagPair[0])) {
            // create an entry
            Double[] posProbs = new Double[Pos.values().length];
            for (int i = 0; i < posProbs.length; i++) {
              posProbs[i] = new Double(0.0D);
            }
            wordPosMap.put(wordTagPair[0], posProbs);
          }
          Double[] posProbs = wordPosMap.get(wordTagPair[0]);
          posProbs[pos.ordinal()] += 1.0D;
          wordPosMap.put(wordTagPair[0], posProbs);
          if (wordIndex == 0) {
            // first word, update piCounts
            piCounts.add(pos.name());
          } else {
            aCounts.add(StringUtils.join(new String[] {
              prevPos.name(), pos.name()}, ":"));
          }
          prevPos = pos;
          wordIndex++;
        }
      }
      reader.close();
    }
    // normalize counts to probabilities
    int numPos = Pos.values().length;
    // compute pi
    pi = new Matrix(numPos, 1);
    for (int i = 0; i < numPos; i++) {
      pi.set(i, 0, piCounts.getCount((Pos.values()[i]).name()));
    }
    pi = pi.times(1 / pi.norm1());
    // compute a
    a = new Matrix(numPos, numPos);
    for (int i = 0; i < numPos; i++) {
      for (int j = 0; j < numPos; j++) {
        a.set(i, j, aCounts.getCount(StringUtils.join(new String[] {
          (Pos.values()[i]).name(), (Pos.values()[j]).name()
        }, ":")));
      }
    }
    // compute b
    int numWords = wordPosMap.size();
    words = new ArrayList<String>();
    words.addAll(wordPosMap.keySet());
    b = new Matrix(numPos, numWords);
    for (int i = 0; i < numPos; i++) {
      for (int j = 0; j < numWords; j++) {
        String word = words.get(j);
        b.set(i, j, wordPosMap.get(word)[i]);
      }
    }
    // normalize across rows for a and b (sum of cols in each row == 1.0)
    for (int i = 0; i < numPos; i++) {
      double rowSumA = 0.0D;
      for (int j = 0; j < numPos; j++) {
        rowSumA += a.get(i, j);
      }
      for (int j = 0; j < numPos; j++) {
        a.set(i, j, (a.get(i, j) / rowSumA));
      }
      double rowSumB = 0.0D;
      for (int j = 0; j < numWords; j++) {
        rowSumB += b.get(i, j);
      }
      for (int j = 0; j < numWords; j++) {
        b.set(i, j, (b.get(i, j) / rowSumB));
      }
    }
    // write out brown word dictionary for later use
    writeDictionary();
    // debug
    if (debug) {
      pi.print(8, 4);
      a.print(8, 4);
      b.print(8, 4);
      System.out.println(words.toString());
    }
  }
  
  public List<String> getWords() {
    return words;
  }
  
  public double[] getPi() {
    double[] pia = new double[pi.getRowDimension()];
    for (int i = 0; i < pia.length; i++) {
      pia[i] = pi.get(i, 0);
    }
    return pia;
  }
  
  public double[][] getA() {
    double[][] aa = new double[a.getRowDimension()][a.getColumnDimension()];
    for (int i = 0; i < a.getRowDimension(); i++) {
      for (int j = 0; j < a.getColumnDimension(); j++) {
        aa[i][j] = a.get(i, j);
      }
    }
    return aa;
  }
  
  public double[][] getB() {
    double[][] ba = new double[b.getRowDimension()][b.getColumnDimension()];
    for (int i = 0; i < b.getRowDimension(); i++) {
      for (int j = 0; j < b.getColumnDimension(); j++) {
        ba[i][j] = b.get(i, j);
      }
    }
    return ba;
  }

  private void writeDictionary() throws Exception {
    FileWriter dictWriter = new FileWriter(wordDictionaryLocation);
    for (String word : words) {
      dictWriter.write(word + "\n");
    }
    dictWriter.flush();
    dictWriter.close();
  }
}

We generate the HMM and serialize it to disk as a flat file. That decouples the building of the HMM from the actual usage, and saves a few CPU cycles and makes the tests run a bit faster. In addition, if this solution was to be used in a real-life situation, it would be much faster to load the HMM from a flat file than to build it from a tagged corpus. Our serialized HMM file looks like this (edited to truncate the number of observations for readability).

On a quick side note, the Jahmm example uses the ObservationDiscrete class based on an Enum to model a small finite set of observations. This works well if the number of observations in your set are small and well known. In our case, we consider a unique word as an observation, and we have approximately 3900 of them, so we used the ObservationInteger class to model the observation, and our flat file serves as a mapping between the integer id for the Observation to the actual word.

Hmm v1.0

NbStates 5

State
Pi 0.127
A 0.155 0.156 0.019 0.025 0.645 
IntegerOPDF [0 0 0.00002 0.00003 0 0 0.00001 0 0.00001 ...]

State
Pi 0.057
A 0.095 0.195 0.168 0.094 0.449 
IntegerOPDF [0 0 0 0 0 0.00001 0 0 0 0 0 0 0.00001 0.00005 ...]

State
Pi 0.164
A 0.639 0.024 0.148 0.005 0.183 
IntegerOPDF [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...]

State
Pi 0.083
A 0.052 0.228 0.111 0.041 0.569 
IntegerOPDF [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...]

State
Pi 0.569
A 0.206 0.199 0.205 0.039 0.351 
IntegerOPDF [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0032 0.00033 0.00064 ...]

The following JUnit test snippet shows how we use the HmmTagger class (described below) to call the BrownCorpusReader and build and persist the HMM.

  @Test
  public void testBuildFromBrownAndWrite() throws Exception {
    HmmTagger hmmTagger = new HmmTagger();
    hmmTagger.setDataDir("/opt/brown-2.0");
    hmmTagger.setDictionaryLocation("src/test/resources/brown_dict.txt");
    hmmTagger.setHmmFileName("src/test/resources/hmm_tagger.dat");
    Hmm<ObservationInteger> hmm = hmmTagger.buildFromBrownCorpus();
    hmmTagger.saveToFile(hmm);
  }

HMM Tagger class

I then create a HmmTagger class that can build an HMM from the BrownCorpusReader as well as from a serialized HMM file shown above. The HmmTagger contains all the methods that are needed to solve the common HMM problems listed above. The code for the HmmTagger is as follows:

// Source: src/main/java/com/mycompany/myapp/postaggers/HmmTagger.java
package com.mycompany.myapp.postaggers;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;
import java.text.DecimalFormat;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.commons.lang.StringUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

import be.ac.ulg.montefiore.run.jahmm.ForwardBackwardCalculator;
import be.ac.ulg.montefiore.run.jahmm.Hmm;
import be.ac.ulg.montefiore.run.jahmm.ObservationInteger;
import be.ac.ulg.montefiore.run.jahmm.OpdfInteger;
import be.ac.ulg.montefiore.run.jahmm.OpdfIntegerFactory;
import be.ac.ulg.montefiore.run.jahmm.ViterbiCalculator;
import be.ac.ulg.montefiore.run.jahmm.io.HmmReader;
import be.ac.ulg.montefiore.run.jahmm.io.HmmWriter;
import be.ac.ulg.montefiore.run.jahmm.io.OpdfIntegerReader;
import be.ac.ulg.montefiore.run.jahmm.io.OpdfReader;
import be.ac.ulg.montefiore.run.jahmm.io.OpdfWriter;
import be.ac.ulg.montefiore.run.jahmm.learn.BaumWelchLearner;
import be.ac.ulg.montefiore.run.jahmm.learn.KMeansLearner;
import be.ac.ulg.montefiore.run.jahmm.toolbox.KullbackLeiblerDistanceCalculator;

/**
 * HMM based POS Tagger.
 */
public class HmmTagger {

  private static final DecimalFormat OBS_FORMAT = 
    new DecimalFormat("##.#####");
  
  private final Log log = LogFactory.getLog(getClass());

  private String dataDir;
  private String dictionaryLocation;
  private String hmmFileName;
  
  private Map<String,Integer> words = 
    new HashMap<String,Integer>();
  
  public void setDataDir(String brownDataDir) {
    this.dataDir = brownDataDir;
  }

  public void setDictionaryLocation(String dictionaryLocation) {
    this.dictionaryLocation = dictionaryLocation;
  }

  public void setHmmFileName(String hmmFileName) {
    this.hmmFileName = hmmFileName;
  }

  /**
   * Builds up an HMM where states are parts of speech given by the Pos
   * Enum, and the observations are actual words in the tagged Brown
   * corpus. Each integer observation corresponds to the position of 
   * a word found in the Brown corpus.
   * @return an HMM.
   * @throws Exception if one is thrown.
   */
  public Hmm<ObservationInteger> buildFromBrownCorpus() 
      throws Exception {
    BrownCorpusReader brownReader = new BrownCorpusReader();
    brownReader.setDataFilesLocation(dataDir);
    brownReader.setWordDictionaryLocation(dictionaryLocation);
    brownReader.read();
    int nbStates = Pos.values().length;
    OpdfIntegerFactory factory = new OpdfIntegerFactory(nbStates);
    Hmm<ObservationInteger> hmm = 
      new Hmm<ObservationInteger>(nbStates, factory); 
    double[] pi = brownReader.getPi();
    for (int i = 0; i < nbStates; i++) {
      hmm.setPi(i, pi[i]);
    }
    double[][] a = brownReader.getA();
    for (int i = 0; i < nbStates; i++) {
      for (int j = 0; j < nbStates; j++) {
        hmm.setAij(i, j, a[i][j]);
      }
    }
    double[][] b = brownReader.getB();
    for (int i = 0; i < nbStates; i++) {
      for (int j = 0; j < nbStates; j++) {
        hmm.setOpdf(i, new OpdfInteger(b[i]));
      }
    }
    int seq = 0;
    for (String word : brownReader.getWords()) {
      words.put(word, seq);
      seq++;
    }
    return hmm;
  }
  
  /**
   * Builds an HMM from a formatted file describing the HMM. The format is
   * specified by the Jahmm project, and it has utility methods to read and
   * write HMMs from and to text files. We use this because the builder that
   * builds an HMM from the Brown corpus is computationally intensive and
   * this strategy provides us a way to partition the process.
   * @return a HMM
   * @throws Exception if one is thrown.
   */
  public Hmm<ObservationInteger> buildFromHmmFile() throws Exception {
    File hmmFile = new File(hmmFileName);
    if (! hmmFile.exists()) {
      throw new Exception("HMM File: " + hmmFile.getName() + 
        " does not exist");
    }
    FileReader fileReader = new FileReader(hmmFile);
    OpdfReader<OpdfInteger> opdfReader = new OpdfIntegerReader();
    Hmm<ObservationInteger> hmm = 
      HmmReader.read(fileReader, opdfReader);
    return hmm;
  }
  
  /**
   * Utility method to save an HMM into a formatted text file describing the
   * HMM. The format is specified by the Jahmm project, which also provides
   * utility methods to write a HMM to the text file.
   * @param hmm the HMM to write.
   * @throws Exception if one is thrown.
   */
  public void saveToFile(Hmm<ObservationInteger> hmm) 
      throws Exception {
    FileWriter fileWriter = new FileWriter(hmmFileName);
    // we create our own impl of the OpdfIntegerWriter because we want
    // to control the formatting of the opdf probabilities. With the 
    // default OpdfIntegerWriter, small probabilities get written in 
    // the exponential format, ie 1.234..E-4, which the HmmReader does
    // not recognize.
    OpdfWriter<OpdfInteger> opdfWriter = 
      new OpdfWriter<OpdfInteger>() {
        @Override
        public void write(Writer writer, OpdfInteger opdf) 
            throws IOException {
          String s = "IntegerOPDF [";
          for (int i = 0; i < opdf.nbEntries(); i++)
            s += OBS_FORMAT.format(opdf.probability(
              new ObservationInteger(i))) + " ";
            writer.write(s + "]\n");
          }
    };
    HmmWriter.write(fileWriter, opdfWriter, hmm);
    fileWriter.flush();
    fileWriter.close();
  }

  /**
   * Given the HMM, returns the probability of observing the sequence 
   * of words specified in the sentence. Uses the Forward-Backward 
   * algorithm to compute the probability.
   * @param sentence the sentence to check.
   * @param hmm a reference to a prebuilt HMM.
   * @return the probability of observing this sequence.
   * @throws Exception if one is thrown.
   */
  public double getObservationProbability(String sentence, 
      Hmm<ObservationInteger> hmm) throws Exception {
    String[] tokens = tokenizeSentence(sentence);
    List<ObservationInteger> observations = getObservations(tokens);
    ForwardBackwardCalculator fbc = 
      new ForwardBackwardCalculator(observations, hmm);
    return fbc.probability();
  }

  /**
   * Given an HMM and an untagged sentence, tags each word with the part of
   * speech it is most likely to belong in. Uses the Viterbi algorithm.
   * @param sentence the sentence to tag.
   * @param hmm the HMM to use.
   * @return a tagged sentence.
   * @throws Exception if one is thrown.
   */
  public String tagSentence(String sentence, 
      Hmm<ObservationInteger> hmm) throws Exception {
    String[] tokens = tokenizeSentence(sentence);
    List<ObservationInteger> observations = getObservations(tokens);
    ViterbiCalculator vc = new ViterbiCalculator(observations, hmm);
    int[] ids = vc.stateSequence();
    StringBuilder tagBuilder = new StringBuilder();
    for (int i = 0; i < ids.length; i++) {
      tagBuilder.append(tokens[i]).
        append("/").
        append((Pos.values()[ids[i]]).name()).
        append(" ");
    }
    return tagBuilder.toString();
  }
  
  /**
   * Given an HMM, a sentence and a word within the sentence which needs to 
   * be disambiguated, returns the most likely Pos for the specified word.
   * @param word the word to find the Pos for.
   * @param sentence the sentence.
   * @param hmm the HMM.
   * @return the most likely POS.
   * @throws Exception if one is thrown.
   */
  public Pos getMostLikelyPos(String word, String sentence, 
      Hmm<ObservationInteger> hmm) throws Exception {
    if (words == null || words.size() == 0) {
      loadWordsFromDictionary();
    }
    String[] tokens = tokenizeSentence(sentence);
    List<ObservationInteger> observations = getObservations(tokens);
    int wordPos = -1;
    for (int i = 0; i < tokens.length; i++) {
      if (tokens[i].equalsIgnoreCase(word)) {
        wordPos = i;
        break;
      }
    }
    if (wordPos == -1) {
      throw new IllegalArgumentException("Word [" + word + 
        "] does not exist in sentence [" + sentence + "]");
    }
    ViterbiCalculator vc = new ViterbiCalculator(observations, hmm);
    int[] ids = vc.stateSequence();
    return Pos.values()[ids[wordPos]];
  }

  /**
   * Given an existing HMM, this method will send in a List of sentences from
   * a possibly different untagged source, to refine the HMM.
   * @param sentences the List of sentences to teach.
   * @return a HMM that has been taught using the observation sequences.
   * @throws Exception if one is thrown.
   */
  public Hmm<ObservationInteger> teach(List<String> sentences)
      throws Exception {
    if (words == null || words.size() == 0) {
      loadWordsFromDictionary();
    }
    OpdfIntegerFactory factory = new OpdfIntegerFactory(words.size());
    List<List<ObservationInteger>> sequences = 
      new ArrayList<List<ObservationInteger>>();
    for (String sentence : sentences) {
      List<ObservationInteger> sequence = 
        getObservations(tokenizeSentence(sentence));
      sequences.add(sequence);
    }
    KMeansLearner<ObservationInteger> kml = 
      new KMeansLearner<ObservationInteger>(
      Pos.values().length, factory, sequences);
    Hmm<ObservationInteger> hmm = kml.iterate();
    // refine it with Baum-Welch Learner
    BaumWelchLearner bwl = new BaumWelchLearner();
    Hmm<ObservationInteger> refinedHmm = bwl.iterate(hmm, sequences);
    return refinedHmm;
  }
  
  /**
   * Convenience method to compute the distance between two HMMs. This can 
   * be used to stop the teaching process once more teaching is not
   * producing any appreciable improvement in the HMM, ie, the HMM
   * converges. The caller will need to match the result of this method 
   * with a number based on experience.
   * @param hmm1 the original HMM.
   * @param hmm2 the HMM that was most recently taught.
   * @return the difference measure between the two HMMs.
   * @throws Exception if one is thrown.
   */
  public double difference(Hmm<ObservationInteger> hmm1,
      Hmm<ObservationInteger> hmm2) throws Exception {
    KullbackLeiblerDistanceCalculator kdc = 
      new KullbackLeiblerDistanceCalculator();
    return kdc.distance(hmm1, hmm2);
  }
  
  private String[] tokenizeSentence(String sentence) {
    String[] tokens = StringUtils.split(
      StringUtils.lowerCase(StringUtils.trim(sentence)), " ");
    return tokens;
  }
  
  private List<ObservationInteger> getObservations(String[] tokens)
      throws Exception {
    if (words == null || words.size() == 0) {
      loadWordsFromDictionary();
    }
    List<ObservationInteger> observations = 
      new ArrayList<ObservationInteger>();
    for (String token : tokens) {
      observations.add(new ObservationInteger(words.get(token)));
    }
    return observations;
  }
  
  private void loadWordsFromDictionary() throws Exception {
    BufferedReader reader = new BufferedReader(
      new FileReader(dictionaryLocation));
    String word;
    int seq = 0;
    while ((word = reader.readLine()) != null) {
      words.put(word, seq);
      seq++;
    }
    reader.close();
  }
}

Word Sense Disambiguation

Given a sentence, a human user can figure the correct POS for each word almost immediately, but with an HMM, we can only tell which is the most likely POS for the word given the sequence of words in the sentence. Obviously, this depends on how large and accurate the HMM's training data set was. Here is how the HmmTagger is called to determine the most likely POS for a word in the sentence.

  @Test
  public void testWordSenseDisambiguation() throws Exception {
    HmmTagger hmmTagger = new HmmTagger();
    hmmTagger.setDataDir("/opt/brown-2.0");
    hmmTagger.setDictionaryLocation("src/test/resources/brown_dict.txt");
    hmmTagger.setHmmFileName("src/test/resources/hmm_tagger.dat");
    Hmm<ObservationInteger> hmm = 
      hmmTagger.buildFromHmmFile();
    String[] testSentences = new String[] {
      "The dog ran after the cat .",
      "Failure dogs his path .",
      "The cold steel cuts through the flesh .",
      "He had a bad cold .",
      "He will catch the ball .",
      "Salmon is the catch of the day ."
    };
    String[] testWords = new String[] {
      "dog",
      "dogs",
      "cold",
      "cold",
      "catch",
      "catch"
    };
    for (int i = 0; i < testSentences.length; i++) {
      System.out.println("Original sentence: " + testSentences[i]);
      Pos wordPos = hmmTagger.getMostLikelyPos(testWords[i], 
        testSentences[i], hmm); 
      System.out.println("Pos(" + testWords[i] + ")=" + wordPos);
    }
  }

And here are the results. As you can see, the HMM did well on all but the second sentence.

Original sentence: The dog ran after the cat .
Pos(dog)=NOUN

Original sentence: Failure dogs his path .
Pos(dogs)=NOUN

Original sentence: The cold steel cuts through the flesh .
Pos(cold)=ADJECTIVE

Original sentence: He had a bad cold .
Pos(cold)=NOUN

Original sentence: He will catch the ball .
Pos(catch)=VERB

Original sentence: Salmon is the catch of the day .
Pos(catch)=NOUN

POS Tagging

POS Tagging uses the same algorithm as Word Sense Disambiguation. Given a HMM trained with a sufficiently large and accurate corpus of tagged words, we can now use it to automatically tag sentences from a similar corpus. Here is the JUnit code snippet to do tag the sentences we used in our previous test.

  @Test
  public void testPosTagging() throws Exception {
    HmmTagger hmmTagger = new HmmTagger();
    hmmTagger.setDataDir("/opt/brown-2.0");
    hmmTagger.setDictionaryLocation("src/test/resources/brown_dict.txt");
    hmmTagger.setHmmFileName("src/test/resources/hmm_tagger.dat");
    Hmm<ObservationInteger> hmm = hmmTagger.buildFromHmmFile();
    // POS tagging
    String[] testSentences = new String[] {...};
    for (int i = 0; i < testSentences.length; i++) {
      System.out.println("Original sentence: " + testSentences[i]);
      System.out.println("Tagged sentence: " + 
        hmmTagger.tagSentence(testSentences[i], hmm));
    }
  }

And here are the results.

Original sentence: The dog ran after the cat .
Tagged sentence: the/ADJECTIVE dog/NOUN ran/VERB after/OTHER 
  the/ADJECTIVE cat/NOUN ./OTHER 

Original sentence: Failure dogs his path .
Tagged sentence: failure/NOUN dogs/NOUN his/OTHER path/NOUN ./OTHER 

Original sentence: The cold steel cuts through the flesh .
Tagged sentence: the/ADJECTIVE cold/ADJECTIVE steel/NOUN cuts/NOUN 
  through/OTHER the/ADJECTIVE flesh/NOUN ./OTHER 

Original sentence: He had a bad cold .
Tagged sentence: he/OTHER had/VERB a/ADJECTIVE bad/ADJECTIVE cold/NOUN 
  ./OTHER 

Original sentence: He will catch the ball .
Tagged sentence: he/OTHER will/VERB catch/VERB the/ADJECTIVE ball/NOUN 
  ./OTHER 

Original sentence: Salmon is the catch of the day .
Tagged sentence: salmon/NOUN is/VERB the/ADJECTIVE catch/NOUN of/OTHER 
  the/ADJECTIVE day/NOUN ./OTHER

Sentence Likelihood

HMMs can be used to predict if one sentence is more likely to occur than another one, by comparing the observation probability of a certain sequence of words with another sequence. So for example, we find that the HMM believes that sentences spoken by Master Yoda of Star Wars fame are less likely to occur in "normal" English than sentences expressing similar meaning that you or I would speak.

  @Test
  public void testObservationProbability() throws Exception {
    HmmTagger hmmTagger = new HmmTagger();
    hmmTagger.setDataDir("/opt/brown-2.0");
    hmmTagger.setDictionaryLocation("src/test/resources/brown_dict.txt");
    hmmTagger.setHmmFileName("src/test/resources/hmm_tagger.dat");
    Hmm<ObservationInteger> hmm = hmmTagger.buildFromHmmFile();
    System.out.println("P(I am worried)=" + 
      hmmTagger.getObservationProbability("I am worried", hmm));
    System.out.println("P(Worried I am)=" +  
      hmmTagger.getObservationProbability("Worried I am", hmm));
  }

As expected, our results indicate that the HMM understands us better than it understands Master Yoda.

1 2	P(I am worried)=5.446081633660202E-11 P(Worried I am)=1.2623833954125002E-11

Unsupervised Learning

The final problem we can solve with a HMM is to build one from a set of untagged data. This HMM can then be used for solving the Sentence Likelihood problem, but not the POS Tagging or the WSD problems. To set this up, I picked up a bunch of of Yoda quotes from this page and fed it into a newly instantiated HMM. I then took the same two sentences and asked the HMM which was more probable. Here is the test code snippet to do that:

  @Test
  public void testTeachYodaAndObserveProbabilities() throws Exception {
    List<String> sentences = Arrays.asList(new String[] {
      "Powerful you have become .",
      "The dark side I sense in you .",
      "Grave danger you are in .",
      "Impatient you are .",
      "Try not .",
      "Do or do not there is no try .",
      "Consume you the dark path will .",
      "Always in motion is the future .",
      "Around the survivors a perimeter create .",
      "Size matters not .",
      "Blind we are if creation of this army we could not see .",
      "Help you I can yes .",
      "Strong am I with the force .",
      "Agree with you the council does .",
      "Your apprentice he will be .",
      "Worried I am .",
      "Always two there are .",
      "When 900 years you reach look as good you will not ."
    });
    HmmTagger hmmTagger = new HmmTagger();
    hmmTagger.setDataDir("/opt/brown-2.0");
    hmmTagger.setDictionaryLocation("src/test/resources/brown_dict.txt");
    Hmm<ObservationInteger> learnedHmm = hmmTagger.teach(sentences);
    System.out.println("P(Worried I am)=" +  
      hmmTagger.getObservationProbability("Worried I am", learnedHmm));
    System.out.println("P(I am worried)=" + 
      hmmTagger.getObservationProbability("I am worried", learnedHmm));
  }

Now, as you can see, this new HMM understands Yoda better than it understands us :-).

1 2	P(Worried I am)=4.455273233553778E-6 P(I am worried)=2.4569521508568634E-6

Conclusions

Personally, this learning curve was quite a steep one for me. The theory was fairly easy to grasp from an intuitive standpoint, but then understanding how to model the POS tagging problem as a HMM took me a while. Once I crossed that hurdle, it took me a fair bit of effort to figure out how to use Jahmm to build and solve a HMM.

I think it was worth it, though. HMMs are a very powerful modeling tool for text mining, and can be used to model a variety of real life situations. Using a library such as Jahmm means that you just have to figure out how to model your problem and to solve it using the tools provided.

Hopefully, if you've been reading this far, and you started out not knowing or with a vague idea of what an HMM was and how it could be used for POS tagging (as was my situation couple of months ago), this post has provided some information as well as an example of using the Jahmm API to build and solve an HMM.

Update 2009-04-26: In recent posts, I have been building on code written and described in previous posts, so there were (and rightly so) quite a few requests for the code. So I've created a project on Sourceforge to host the code. You will find the complete source code built so far in the project's SVN repository.

44 comments (moderated to prevent spam):

Anonymous said...: great application, well explained.
I am student in France in computing and language analysis.

I understand your program but i don't really see how are files like brown_tags.txt or brown_dict.txt ...

I would like see an example if it's possible.

Thanks for all.; 1/22/2009 3:53 AM
Sujit Pal said...: Hi, thanks for the kind words, and I am glad you liked the post. The brown_tags.txt is basically a flat file version of the table which I got from the web page referenced, and the brown_dict.txt is the dictionary that I got off the NTLP project.; 1/24/2009 10:30 AM
Kishor Shamra said...: Hi Paul ,
Great application .......I was also trying to use the application but i could n't found brown_dict.txt file anywhere can u please help me??; 2/11/2009 4:36 AM
Kishor Shamra said...: gr8 explanation!!!!!! very good blog, I was trying to develop the same application but can't find the file brown_dict.txt brown_tags.txt
can u please help me on this??; 2/11/2009 4:48 AM
Sujit Pal said...: Hi sharma, thanks for your appreciation. I got the brown_dict.txt file from the NLTK Project, and the brown_tags.txt is a flat file version of the table in this page.; 2/14/2009 3:32 PM
Anonymous said...: Hello,I am a student from China,I wanna say it's a great application,and now i am trying to use the application,but there is an error,i can't find the class POS,could you help me?
Thank you!; 6/07/2009 5:21 AM
Sujit Pal said...: Thank you. Are you trying out the code in this post? I did not see any reference to the POS class in the code here. I have used the POS class elsewhere, its from the MIT Java-Wordnet Interface jar file (jwi-*.jar). Google for the full name and you will find the project page and thence the jar file.; 6/09/2009 1:36 PM
allanctan said...: Hi Sujit,

Just noticed in HMMTagger.java line#92-#97... the second loop (j) is not necessary?; 9/16/2009 5:45 AM
Sujit Pal said...: Thanks, you are right...removing this in the svn version and checking in.; 9/18/2009 11:42 PM
Anonymous said...: Hi. I'm also using Jahmm but I'm encountering a problem. I want to make a voice recognition
software and use HMM to classify the data.

An example of the feature vector(training data) I used is like this:

739.095423 -68.9217791 -41.94023848 190.343314 70.14624796
19.4320868 101.1893881 60.00366733 1.083977616 31.19825832
44.07300049 19.78601438

To learn the HMMs, my code is:

public Hmm learnBleh(List>
sequences){
int numberOfHiddenStates = 12;
do{

KMeansLearner kml = new
KMeansLearner(numberOfHiddenStates, new
OpdfGaussianFactory(), sequences);

trainedHmm = kml.learn();
BaumWelchLearner bwl = new BaumWelchLearner();
bwl.setNbIterations(20);
trainedHmm = bwl.learn(trainedHmm, sequences);
numberOfHiddenStates++;
}while(Double.isNaN(trainedHmm.getPi(0)) && numberOfHiddenStates
<50);

return trainedHmm;
}

How come I get an error:
Exception in thread "main" java.lang.IllegalArgumentException: Variance
must be positive
at
be.ac.ulg.montefiore.run.distributions.GaussianDistribution.(Gaussian
Distribution.java:43)
at
be.ac.ulg.montefiore.run.jahmm.OpdfGaussian.fit(OpdfGaussian.java:124)
at
be.ac.ulg.montefiore.run.jahmm.OpdfGaussian.fit(OpdfGaussian.java:93)
at
be.ac.ulg.montefiore.run.jahmm.learn.KMeansLearner.learnOpdf(KMeansLearner.
java:165)
at
be.ac.ulg.montefiore.run.jahmm.learn.KMeansLearner.iterate(KMeansLearner.ja
va:67)
at
be.ac.ulg.montefiore.run.jahmm.learn.KMeansLearner.learn(KMeansLearner.java
:96)

I don't know what the error means. Have you ever encountered an error like this? Thanks.
All help is greatly appreciated!; 1/14/2010 1:39 AM
Ashwin Jayaprakash said...: Hi Sujit, I got to read and appreciate this old post only today. Nice work, especially the "Yoda speak" example :)

BTW, are you following the Mahout project? It looks like something you'd enjoy.

Ashwin.; 1/16/2010 2:15 PM
Sujit Pal said...: @Anonymous: I think I have seen this error before, but not sure about the solution anymore, this was a while ago. I will try to find out what I did or if I can reproduce this, and get back.

@Ashwin: Thanks, I /am/ following the Mahout project, sort of. Last I looked it was too early stage unless one is a developer on that project, but I know that there has been some activity recently - I should probably look at it again. Thanks for the reminder.; 1/21/2010 2:37 PM
Zoloo said...: Hello Sujit. It is very helpful to me. But I have a question? Which is it bi-gram or tri-gram.

Thanks.; 7/29/2010 7:02 AM
Sujit Pal said...: Thanks Zoloo. Not sure about the question though?; 8/06/2010 7:10 PM
Barnabas said...: Hi, Sujit. Thanks for your amazingly good explanation ^^ it was really helpful to me.

I'm now focusing on Named Entity Recognition problem, and I want to use HMM (with Jahmm library) to solve this problem. I made training corpus which is Named Entity marked by hand, so that I can use it as a training data. But the thing is that, how to train it automatically.
I've looking for training method in Jahmm, but there are only 2 way to train, BaumWelchLearner and KMeansLearner, and they only gets Observation Sequences as a parameter.
For me, this is the problem. Because, for the training, I want to mark Named Entity Type to the words when process the training.

This is long and boring question, sorry about this.;;

In summary, can I train to make all Hmm variables (pi, transition prob. and observation prob. ) automatically using internal method of Jahmm ? if not, how can I train them without making this Hmm variables with my own hand.

Thanks for all, in advance. ^^;; 8/10/2010 12:37 PM
Sujit Pal said...: Thanks Barnabas. Unfortunately I don't think I know enough to answer your question. From what I understand about HMMs, you need a set of observation sequences to train it, but I probably have a lot to learn about this stuff. I would suggest asking the author - he has put his code on googlecode, and that has a discussion list as well.; 8/17/2010 12:09 PM
Unknown said...: Thanks for your comment, Sujit.
I've found another good library, mallet (http://mallet.cs.umass.edu/) it supports supervised learning of HMM, CRF, MEMM. By using this library, I will implement NER system. If there's some progress, I'll let you know, thanks ^^; 8/29/2010 11:11 AM
Sujit Pal said...: Hi Barnabas, best of luck, and thanks for the pointer to mallet, definitely want to take a look at it.; 8/30/2010 12:19 PM
Shruti said...: Hey Sujit,
Great article! I was just wondering what you considered as the initial probabilities in the HMM. Do we consider random numbers that add up to 1 or is there some other way to set initial probabilities?; 2/13/2011 10:19 PM
Sujit Pal said...: Thanks Shruti. The initial probabilities were calculated based on the actual occurrences of a word and the POS in the tagged Brown corpus.; 3/01/2011 10:27 AM
Shantanu said...: Nice application. But i can't set the brown corpus...Please help me to build it for bengali langage..; 7/18/2011 12:16 AM
Sujit Pal said...: Hi Shantanu, you got the wrong guy, sorry :-). Even though I am Bengali, I can't read or write the language (well not completely true, but annotating bengali text with grammar is way past my capabilities), spent most of my life outside Bengal.; 7/29/2011 2:30 PM
Swati said...: hi Sujeet...i tried working with the tagger.but when i give my own input file it gives an error.Does it not work with unknown words not present in the dictionary?; 9/19/2011 8:37 AM
Sujit Pal said...: Hi Swati, yes, I believe you need the dictionary to contain one or more occurrences of the words you are seeking to tag. It figures out the POS of such words by looking at the context of these words in your corpus and finding the most likely one.; 10/07/2011 11:25 AM
Deuz said...: Hi Sujit,

Nice blog you have here. Do the source code is still available for download?

Thanks; 4/28/2012 8:21 AM
Sujit Pal said...: Thanks Deuz, and yes, most of the code from this time is checked into jtmt.sf.net. If you don't find it there its probably not available, but you can copy paste the code from the blog post if it helps (line numbers are not included in the paste).; 4/28/2012 11:14 AM
unni said...: sir how can i run your code?; 11/21/2014 9:44 PM
Sujit Pal said...: Hi unni, the post contains JUnit test snippets (search for @Test in this page) that contains to build an HMM model from the Brown corpus, use the model to detect POS for words, and rank sentences in order of plausibility. You can either build a JUnit test class with these, or put the code inside a main() method and execute them.; 11/22/2014 9:11 AM
Anonymous said...: Sir thanks for this great app and tutorial. I wonder what should be the content of hmm_tagger.dat file.; 4/28/2015 11:03 PM
Sujit Pal said...: Hi, I put them all in my JTMT project on sourceforge (link to project at bottom of post) - the hmm_tagger.dat file I used is here. Deleting your other two comments since they are the same thing, sorry about the delay in replying.; 4/29/2015 10:40 AM
Anonymous said...: I tried running your code. but i always get error. Am i doing something wrong? it always fail on the observationpart; 10/07/2015 2:08 AM
Anonymous said...: how do you add words in the brown_dict.txt ??; 10/07/2015 3:25 AM
Sujit Pal said...: Its been a while since I wrote this so can't promise that I will be able to figure it out. Do you get an error message? Also, you may want to download the code from the jtmt.sf.net project SVN repo - they should be the same as whats on the post, but just to be sure.

I think the brown_dict.txt is just the words from the brown_tags.txt file. I built it because JAHMM needed it as a list of words.; 10/08/2015 4:51 PM
Anonymous said...: I did try to add words on the brown_dict file but everything got messed up. Do you add words manually or is there a way to add them using jahmm?; 10/11/2015 12:20 AM
Anonymous said...: how did you create the tagger.dat file ?; 10/11/2015 12:23 AM
Sujit Pal said...: I don't recall, but I am almost certain that I didn't, since the most likely thing I would have done would be to transform the Brown Corpus data into the format JAHMM or JAHMM based code requires. The tagger file is created using HmmTagger, look at the saveToFile method. For how its called take a look at HmmTaggerTest#testBuildFromBrownAndWrite. Take a look at the code, its reasonably well commented, and you should find all your answers in there - its all in the postaggers (src/main/java and src/test/java) package.; 10/11/2015 12:57 PM
Unknown said...: Hi, I liked the idea.
But what if I want to tag any input sentence with more than five tagsets(NOUN, VERB, ADJECTIVE, ADVERB, OTHER)?
Thanks; 10/30/2015 10:57 AM
Sujit Pal said...: Hi Vikram, you will need training data for whatever you want to tag with. In my case the training data comes from the brown corpus which is tagged with the 5 POS tags, so thats why the classification happens only with 5 tags. If you want to tag with more POS tags, you will have to get a training set that has more tags - Penn Treebank has more, but the full corpus is not free.; 10/30/2015 3:04 PM
Unknown said...: Hello, sujit sir
Where can i get documentation of JHMM lib.
I did not find any resources except this.
Thanks; 11/08/2015 1:44 AM
Sujit Pal said...: Hi Vikram, its been a while so don't remember exactly, but I believe I initially used the JAHMM example here then started looking through the source code, specifically the unit tests to see how to do various things.; 11/08/2015 7:24 PM
Anonymous said...: How can I find the unknown word i.e the word which is not in the corpus, are there any algorithms that I can refer or anything else.

Thanks!; 1/10/2016 7:56 PM
Sujit Pal said...: Do you mean predict the probability of an unknown word? There is nothing directly applicable AFAIK, but you could probably use smoothing techniques to include _UNK_ (representative token for unknown words) and then find the probability of _UNK_. But not completely sure if this will work for you.; 1/11/2016 5:59 AM
Lalise said...: I got a problem when I was trying to use a corpus which is tagged (POS) to work on WSD using java. The code perform well when I use untagged corpus. I tried to change the format of the data, by putting '_' , '/' to split the tag name from the word . What should I do now?; 5/29/2018 12:13 AM
Sujit Pal said...: First off, apologies for the delayed answer. In case you haven't figured this out yourself or found another way to solve your problem by now, fix is to replace "/" in BrownCorpusReader.java line 84 to "_".; 4/29/2020 2:29 PM

Salmon Run

Saturday, November 08, 2008

IR Math in Java : HMM Based POS Tagger/Recognizer

Background

Building the HMM from Tagged Data

HMM Tagger class

Word Sense Disambiguation

POS Tagging

Sentence Likelihood

Unsupervised Learning

Conclusions

44 comments (moderated to prevent spam):

Posts

Labels

Blogs I Read

About me

My Nerd Rating

Visitor Map

Contact Me