Saturday, November 08, 2008

IR Math in Java : HMM Based POS Tagger/Recognizer

As you know, I have been slowly working my way through Dr Konchady's TMAP book, and coding up the algorithms in Java. By doing so, I hope to understand the techniques and mathematical models presented, so I can apply them to real-life problems in the future. In this post I describe an implementation of a Hidden Markov Model based Part of Speech recognizer/tagger, based on the material presented in Chapter 4 of the TMAP book.

Background

What follows is my take on what an HMM is and how it can be used for Part of Speech (POS) tagging. For a more detailed, math-heavy, and possibly more accurate description of HMM and their internals, read the Wikipedia article or Dr Rabiner's tutorial or the TMAP book if you happen to own it. A Hidden Markov Model can be thought of as a probabilistic finite state machine. Its states can be represented by the set S = {S1, S2, ..., Sn} which are not directly visible. What is visible is a set of Observations O = {O1, O2, ..., Om} which are the result of the machine moving from one state to the other. The probabilities of the machine starting in one of the states Si is specified by the one-dimensional matrix Π of size n. The probabilities of the machine moving from one state to another is specified by a two dimensional matrix A of size n*n. Finally, the probability of an observation being observed when the machine is in a certain state is given by the two dimensional matrix B of size n*m. More succintly,

```  H = f(Π, A, B)
where:
H = the HMM
Π = start probabilities. The element Πi represents
the probability that the HMM starts a sequence in State
State Si, where i in (0..n-1).
A = transition probabilities. The element Ai,j represents
the probability of a transition from State Si to
State Sj, where i and j in (0..n-1).
B = emission probabilities. The element Bi,j represents
the probability of an Observation Oj occurring while
the machine is in State Si, where i in (0..n-1)
and j in (0..m-1).
n = number of states.
m = number of unique observations.
```

The objective of POS tagging is to tag each word of a sentence with its part-of-speech tag. While some words can be unambiguously tagged, ie their is only one POS that the word is ever used for, there are quite a few which can represent different POS depending on its usage in the sentence. For example, cold can be both a noun and an adjective, and catch can be both a noun and a verb. The fact that the word exists in the sentence is known, while the POS for the word is unknown. Therefore the HMM built for POS tagging would model the words as visible observations and the set of possible POS as the hidden states.

As far as POS tagging is concerned, the main problems that can be solved by HMMs are as follows. Given an HMM,

1. Finding the most likely state sequence for a given observation sequence. In this case, we pass in a sentence, and tag each word with its most likely POS.
2. Finding the most likely state for a given observation in a sequence. This is useful for word sense disambiguation (WSD), so we can tell the most likely POS that a particular word in a sentence belongs to.

The problems above are identical from the point of view of a HMM, and are solved using the Viterbi algorithm.

The second problem is to find the probability of a certain sequence of observations. This can be used to answer questions such as whether a sentence such as "I am going to the market" is more common than one such as "To the market I go". The Forward-Backward Algorithm is used to solve this kind of problem. This can be useful for applications that predict the most likely outcome given a set of input observations, but probably is not important from the perspective of POS tagging. Both the above problems need to have a HMM built from (manually) tagged data.

A third problem of HMMs is how to build one given a corpus of untagged text. Such an HMM would allow us to solve the second type of problem. However, since the HMM has not been fed with tagged words, it must depend on a clustering algorithm such as K-Means to cluster the words into undefined hidden states, which are of no use when attempting to solve the second type of problem. The two learning algorithms used here are the K-Means Algorithm to build the initial HMM and the Baum-Welch Algorithm to refine it. As with the second type of problem, this does not have much applications where POS taggers are concerned.

I used the Java HMM library Jahmm to do all of the heavy computational lifting. It has implementations of the algorithms mentioned above, as well as several utility methods and classes to model various kinds of Observation.

Building the HMM from Tagged Data

For my tagged corpus, I used the Brown Corpus, downloading the data from the Natural Language Toolkit Project (NTLP). The corpus is a set of about 500 files containing one sentence per line, each manually tagged with a very comprehensive set of POS tags described here. Since I plan to use Wordnet at some point with this data, and Wordnet only categorizes words as one of 4 categories, I set up my own Part of Speech Enum called Pos which has 5 categories, the 4 from Wordnet and OTHER. As a result, I had to the dumb the Brown tags down using the translation table shown below:

You may notice that some of the mappings are not correct. Unfortunately, my knowledge of formal English grammar is not as good as I would like it to be, owing to having been educated in an environment that posited that a person can learn to recognize incorrect grammar better by reading and writing enough gramatically correct sentences rather than through a study of language rules. My grandfather, obviously regarding all this as crazy talk, briefly attempted to rectify that, armed with a Wren and Martin and a 18" ruler, but as you can probably see, it did not work out all that well :-).

The code for the Pos Enum is shown below. As mentioned earlier, it exposes a set of 5 POS values, and has a convenience method to convert the Brown tag into corresponding Pos.

 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45``` ```// Source: src/main/java/com/mycompany/myapp/postaggers/Pos.java package com.mycompany.myapp.postaggers; import java.io.BufferedReader; import java.io.FileInputStream; import java.io.InputStreamReader; import java.util.HashMap; import java.util.Map; import org.apache.commons.lang.StringUtils; /** * Enumeration of Parts of Speech being considered. Conversions from * Brown Tags and Wordnet tags are handled by convenience methods. */ public enum Pos { NOUN, VERB, ADJECTIVE, ADVERB, OTHER; private static Map bmap = null; private static final String translationFile = "src/main/resources/brown_tags.txt"; public static Pos fromBrownTag(String btag) throws Exception { if (bmap == null) { bmap = new HashMap(); BufferedReader reader = new BufferedReader(new InputStreamReader( new FileInputStream(translationFile))); String line; while ((line = reader.readLine()) != null) { if (line.startsWith("#")) { continue; } String[] cols = StringUtils.split(line, "\t"); bmap.put(StringUtils.lowerCase(cols[0]), Pos.valueOf(cols[1])); } reader.close(); } Pos pos = bmap.get(btag); if (pos == null) { return Pos.OTHER; } return pos; } } ```

The BrownCorpusReader reads through each tagged file in the Brown Corpus directory, extracts the word and the tag out of each tagged word, converts the Brown tag to its equivalent Pos value, and accumulates the occurrences into internal counters. Once all files are processed, the counters are normalized into the three probability matrices Π, A and B that we spoke about earlier.

Since the number of words tagged in any corpus is potentially quite large, we represent the words (or observations) in the HMM as an integer. That is why the BrownCorpusReader also dumps out a list of unique words it found in the corpus into a flat file which can be pulled back into memory later to do the mapping between the word and the integer observation Id.

 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209``` ```// Source: src/main/java/com/mycompany/myapp/postaggers/BrownCorpusReader.java package com.mycompany.myapp.postaggers; import java.io.BufferedReader; import java.io.File; import java.io.FileInputStream; import java.io.FileWriter; import java.io.InputStreamReader; import java.util.ArrayList; import java.util.HashMap; import java.util.List; import java.util.Map; import java.util.StringTokenizer; import org.apache.commons.collections15.Bag; import org.apache.commons.collections15.bag.HashBag; import org.apache.commons.lang.StringUtils; import org.apache.commons.logging.Log; import org.apache.commons.logging.LogFactory; import Jama.Matrix; /** * Reads a file or directory of tagged text from Brown Corpus and * computes the various probability matrices for the HMM. */ public class BrownCorpusReader { private final Log log = LogFactory.getLog(getClass()); private String dataFilesLocation; private String wordDictionaryLocation; private boolean debug; private Bag piCounts = new HashBag(); private Bag aCounts = new HashBag(); private Map wordPosMap = new HashMap(); private Matrix pi; private Matrix a; private Matrix b; private List words; public void setDataFilesLocation(String dataFilesLocation) { this.dataFilesLocation = dataFilesLocation; } public void setWordDictionaryLocation(String wordDictionaryLocation) { this.wordDictionaryLocation = wordDictionaryLocation; } public void setDebug(boolean debug) { this.debug = debug; } public void read() throws Exception { File location = new File(dataFilesLocation); File[] inputs; if (location.isDirectory()) { inputs = location.listFiles(); } else { inputs = new File[] {location}; } int currfile = 0; int totfiles = inputs.length; for (File input : inputs) { currfile++; log.info("Processing file (" + currfile + "/" + totfiles + "): " + input.getName()); BufferedReader reader = new BufferedReader(new InputStreamReader( new FileInputStream(input))); String line; while ((line = reader.readLine()) != null) { if (StringUtils.isEmpty(line)) { continue; } StringTokenizer tok = new StringTokenizer(line, " "); int wordIndex = 0; Pos prevPos = null; while (tok.hasMoreTokens()) { String taggedWord = tok.nextToken(); String[] wordTagPair = StringUtils.split( StringUtils.lowerCase(StringUtils.trim(taggedWord)), "/"); if (wordTagPair.length != 2) { continue; } Pos pos = Pos.fromBrownTag(wordTagPair[1]); if (! wordPosMap.containsKey(wordTagPair[0])) { // create an entry Double[] posProbs = new Double[Pos.values().length]; for (int i = 0; i < posProbs.length; i++) { posProbs[i] = new Double(0.0D); } wordPosMap.put(wordTagPair[0], posProbs); } Double[] posProbs = wordPosMap.get(wordTagPair[0]); posProbs[pos.ordinal()] += 1.0D; wordPosMap.put(wordTagPair[0], posProbs); if (wordIndex == 0) { // first word, update piCounts piCounts.add(pos.name()); } else { aCounts.add(StringUtils.join(new String[] { prevPos.name(), pos.name()}, ":")); } prevPos = pos; wordIndex++; } } reader.close(); } // normalize counts to probabilities int numPos = Pos.values().length; // compute pi pi = new Matrix(numPos, 1); for (int i = 0; i < numPos; i++) { pi.set(i, 0, piCounts.getCount((Pos.values()[i]).name())); } pi = pi.times(1 / pi.norm1()); // compute a a = new Matrix(numPos, numPos); for (int i = 0; i < numPos; i++) { for (int j = 0; j < numPos; j++) { a.set(i, j, aCounts.getCount(StringUtils.join(new String[] { (Pos.values()[i]).name(), (Pos.values()[j]).name() }, ":"))); } } // compute b int numWords = wordPosMap.size(); words = new ArrayList(); words.addAll(wordPosMap.keySet()); b = new Matrix(numPos, numWords); for (int i = 0; i < numPos; i++) { for (int j = 0; j < numWords; j++) { String word = words.get(j); b.set(i, j, wordPosMap.get(word)[i]); } } // normalize across rows for a and b (sum of cols in each row == 1.0) for (int i = 0; i < numPos; i++) { double rowSumA = 0.0D; for (int j = 0; j < numPos; j++) { rowSumA += a.get(i, j); } for (int j = 0; j < numPos; j++) { a.set(i, j, (a.get(i, j) / rowSumA)); } double rowSumB = 0.0D; for (int j = 0; j < numWords; j++) { rowSumB += b.get(i, j); } for (int j = 0; j < numWords; j++) { b.set(i, j, (b.get(i, j) / rowSumB)); } } // write out brown word dictionary for later use writeDictionary(); // debug if (debug) { pi.print(8, 4); a.print(8, 4); b.print(8, 4); System.out.println(words.toString()); } } public List getWords() { return words; } public double[] getPi() { double[] pia = new double[pi.getRowDimension()]; for (int i = 0; i < pia.length; i++) { pia[i] = pi.get(i, 0); } return pia; } public double[][] getA() { double[][] aa = new double[a.getRowDimension()][a.getColumnDimension()]; for (int i = 0; i < a.getRowDimension(); i++) { for (int j = 0; j < a.getColumnDimension(); j++) { aa[i][j] = a.get(i, j); } } return aa; } public double[][] getB() { double[][] ba = new double[b.getRowDimension()][b.getColumnDimension()]; for (int i = 0; i < b.getRowDimension(); i++) { for (int j = 0; j < b.getColumnDimension(); j++) { ba[i][j] = b.get(i, j); } } return ba; } private void writeDictionary() throws Exception { FileWriter dictWriter = new FileWriter(wordDictionaryLocation); for (String word : words) { dictWriter.write(word + "\n"); } dictWriter.flush(); dictWriter.close(); } } ```

We generate the HMM and serialize it to disk as a flat file. That decouples the building of the HMM from the actual usage, and saves a few CPU cycles and makes the tests run a bit faster. In addition, if this solution was to be used in a real-life situation, it would be much faster to load the HMM from a flat file than to build it from a tagged corpus. Our serialized HMM file looks like this (edited to truncate the number of observations for readability).

On a quick side note, the Jahmm example uses the ObservationDiscrete class based on an Enum to model a small finite set of observations. This works well if the number of observations in your set are small and well known. In our case, we consider a unique word as an observation, and we have approximately 3900 of them, so we used the ObservationInteger class to model the observation, and our flat file serves as a mapping between the integer id for the Observation to the actual word.

 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28``` ```Hmm v1.0 NbStates 5 State Pi 0.127 A 0.155 0.156 0.019 0.025 0.645 IntegerOPDF [0 0 0.00002 0.00003 0 0 0.00001 0 0.00001 ...] State Pi 0.057 A 0.095 0.195 0.168 0.094 0.449 IntegerOPDF [0 0 0 0 0 0.00001 0 0 0 0 0 0 0.00001 0.00005 ...] State Pi 0.164 A 0.639 0.024 0.148 0.005 0.183 IntegerOPDF [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...] State Pi 0.083 A 0.052 0.228 0.111 0.041 0.569 IntegerOPDF [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...] State Pi 0.569 A 0.206 0.199 0.205 0.039 0.351 IntegerOPDF [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0.0032 0.00033 0.00064 ...] ```

The following JUnit test snippet shows how we use the HmmTagger class (described below) to call the BrownCorpusReader and build and persist the HMM.

 ```1 2 3 4 5 6 7 8 9``` ``` @Test public void testBuildFromBrownAndWrite() throws Exception { HmmTagger hmmTagger = new HmmTagger(); hmmTagger.setDataDir("/opt/brown-2.0"); hmmTagger.setDictionaryLocation("src/test/resources/brown_dict.txt"); hmmTagger.setHmmFileName("src/test/resources/hmm_tagger.dat"); Hmm hmm = hmmTagger.buildFromBrownCorpus(); hmmTagger.saveToFile(hmm); } ```

HMM Tagger class

I then create a HmmTagger class that can build an HMM from the BrownCorpusReader as well as from a serialized HMM file shown above. The HmmTagger contains all the methods that are needed to solve the common HMM problems listed above. The code for the HmmTagger is as follows:

 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312``` ```// Source: src/main/java/com/mycompany/myapp/postaggers/HmmTagger.java package com.mycompany.myapp.postaggers; import java.io.BufferedReader; import java.io.File; import java.io.FileReader; import java.io.FileWriter; import java.io.IOException; import java.io.Writer; import java.text.DecimalFormat; import java.util.ArrayList; import java.util.HashMap; import java.util.List; import java.util.Map; import org.apache.commons.lang.StringUtils; import org.apache.commons.logging.Log; import org.apache.commons.logging.LogFactory; import be.ac.ulg.montefiore.run.jahmm.ForwardBackwardCalculator; import be.ac.ulg.montefiore.run.jahmm.Hmm; import be.ac.ulg.montefiore.run.jahmm.ObservationInteger; import be.ac.ulg.montefiore.run.jahmm.OpdfInteger; import be.ac.ulg.montefiore.run.jahmm.OpdfIntegerFactory; import be.ac.ulg.montefiore.run.jahmm.ViterbiCalculator; import be.ac.ulg.montefiore.run.jahmm.io.HmmReader; import be.ac.ulg.montefiore.run.jahmm.io.HmmWriter; import be.ac.ulg.montefiore.run.jahmm.io.OpdfIntegerReader; import be.ac.ulg.montefiore.run.jahmm.io.OpdfReader; import be.ac.ulg.montefiore.run.jahmm.io.OpdfWriter; import be.ac.ulg.montefiore.run.jahmm.learn.BaumWelchLearner; import be.ac.ulg.montefiore.run.jahmm.learn.KMeansLearner; import be.ac.ulg.montefiore.run.jahmm.toolbox.KullbackLeiblerDistanceCalculator; /** * HMM based POS Tagger. */ public class HmmTagger { private static final DecimalFormat OBS_FORMAT = new DecimalFormat("##.#####"); private final Log log = LogFactory.getLog(getClass()); private String dataDir; private String dictionaryLocation; private String hmmFileName; private Map words = new HashMap(); public void setDataDir(String brownDataDir) { this.dataDir = brownDataDir; } public void setDictionaryLocation(String dictionaryLocation) { this.dictionaryLocation = dictionaryLocation; } public void setHmmFileName(String hmmFileName) { this.hmmFileName = hmmFileName; } /** * Builds up an HMM where states are parts of speech given by the Pos * Enum, and the observations are actual words in the tagged Brown * corpus. Each integer observation corresponds to the position of * a word found in the Brown corpus. * @return an HMM. * @throws Exception if one is thrown. */ public Hmm buildFromBrownCorpus() throws Exception { BrownCorpusReader brownReader = new BrownCorpusReader(); brownReader.setDataFilesLocation(dataDir); brownReader.setWordDictionaryLocation(dictionaryLocation); brownReader.read(); int nbStates = Pos.values().length; OpdfIntegerFactory factory = new OpdfIntegerFactory(nbStates); Hmm hmm = new Hmm(nbStates, factory); double[] pi = brownReader.getPi(); for (int i = 0; i < nbStates; i++) { hmm.setPi(i, pi[i]); } double[][] a = brownReader.getA(); for (int i = 0; i < nbStates; i++) { for (int j = 0; j < nbStates; j++) { hmm.setAij(i, j, a[i][j]); } } double[][] b = brownReader.getB(); for (int i = 0; i < nbStates; i++) { for (int j = 0; j < nbStates; j++) { hmm.setOpdf(i, new OpdfInteger(b[i])); } } int seq = 0; for (String word : brownReader.getWords()) { words.put(word, seq); seq++; } return hmm; } /** * Builds an HMM from a formatted file describing the HMM. The format is * specified by the Jahmm project, and it has utility methods to read and * write HMMs from and to text files. We use this because the builder that * builds an HMM from the Brown corpus is computationally intensive and * this strategy provides us a way to partition the process. * @return a HMM * @throws Exception if one is thrown. */ public Hmm buildFromHmmFile() throws Exception { File hmmFile = new File(hmmFileName); if (! hmmFile.exists()) { throw new Exception("HMM File: " + hmmFile.getName() + " does not exist"); } FileReader fileReader = new FileReader(hmmFile); OpdfReader opdfReader = new OpdfIntegerReader(); Hmm hmm = HmmReader.read(fileReader, opdfReader); return hmm; } /** * Utility method to save an HMM into a formatted text file describing the * HMM. The format is specified by the Jahmm project, which also provides * utility methods to write a HMM to the text file. * @param hmm the HMM to write. * @throws Exception if one is thrown. */ public void saveToFile(Hmm hmm) throws Exception { FileWriter fileWriter = new FileWriter(hmmFileName); // we create our own impl of the OpdfIntegerWriter because we want // to control the formatting of the opdf probabilities. With the // default OpdfIntegerWriter, small probabilities get written in // the exponential format, ie 1.234..E-4, which the HmmReader does // not recognize. OpdfWriter opdfWriter = new OpdfWriter() { @Override public void write(Writer writer, OpdfInteger opdf) throws IOException { String s = "IntegerOPDF ["; for (int i = 0; i < opdf.nbEntries(); i++) s += OBS_FORMAT.format(opdf.probability( new ObservationInteger(i))) + " "; writer.write(s + "]\n"); } }; HmmWriter.write(fileWriter, opdfWriter, hmm); fileWriter.flush(); fileWriter.close(); } /** * Given the HMM, returns the probability of observing the sequence * of words specified in the sentence. Uses the Forward-Backward * algorithm to compute the probability. * @param sentence the sentence to check. * @param hmm a reference to a prebuilt HMM. * @return the probability of observing this sequence. * @throws Exception if one is thrown. */ public double getObservationProbability(String sentence, Hmm hmm) throws Exception { String[] tokens = tokenizeSentence(sentence); List observations = getObservations(tokens); ForwardBackwardCalculator fbc = new ForwardBackwardCalculator(observations, hmm); return fbc.probability(); } /** * Given an HMM and an untagged sentence, tags each word with the part of * speech it is most likely to belong in. Uses the Viterbi algorithm. * @param sentence the sentence to tag. * @param hmm the HMM to use. * @return a tagged sentence. * @throws Exception if one is thrown. */ public String tagSentence(String sentence, Hmm hmm) throws Exception { String[] tokens = tokenizeSentence(sentence); List observations = getObservations(tokens); ViterbiCalculator vc = new ViterbiCalculator(observations, hmm); int[] ids = vc.stateSequence(); StringBuilder tagBuilder = new StringBuilder(); for (int i = 0; i < ids.length; i++) { tagBuilder.append(tokens[i]). append("/"). append((Pos.values()[ids[i]]).name()). append(" "); } return tagBuilder.toString(); } /** * Given an HMM, a sentence and a word within the sentence which needs to * be disambiguated, returns the most likely Pos for the specified word. * @param word the word to find the Pos for. * @param sentence the sentence. * @param hmm the HMM. * @return the most likely POS. * @throws Exception if one is thrown. */ public Pos getMostLikelyPos(String word, String sentence, Hmm hmm) throws Exception { if (words == null || words.size() == 0) { loadWordsFromDictionary(); } String[] tokens = tokenizeSentence(sentence); List observations = getObservations(tokens); int wordPos = -1; for (int i = 0; i < tokens.length; i++) { if (tokens[i].equalsIgnoreCase(word)) { wordPos = i; break; } } if (wordPos == -1) { throw new IllegalArgumentException("Word [" + word + "] does not exist in sentence [" + sentence + "]"); } ViterbiCalculator vc = new ViterbiCalculator(observations, hmm); int[] ids = vc.stateSequence(); return Pos.values()[ids[wordPos]]; } /** * Given an existing HMM, this method will send in a List of sentences from * a possibly different untagged source, to refine the HMM. * @param sentences the List of sentences to teach. * @return a HMM that has been taught using the observation sequences. * @throws Exception if one is thrown. */ public Hmm teach(List sentences) throws Exception { if (words == null || words.size() == 0) { loadWordsFromDictionary(); } OpdfIntegerFactory factory = new OpdfIntegerFactory(words.size()); List> sequences = new ArrayList>(); for (String sentence : sentences) { List sequence = getObservations(tokenizeSentence(sentence)); sequences.add(sequence); } KMeansLearner kml = new KMeansLearner( Pos.values().length, factory, sequences); Hmm hmm = kml.iterate(); // refine it with Baum-Welch Learner BaumWelchLearner bwl = new BaumWelchLearner(); Hmm refinedHmm = bwl.iterate(hmm, sequences); return refinedHmm; } /** * Convenience method to compute the distance between two HMMs. This can * be used to stop the teaching process once more teaching is not * producing any appreciable improvement in the HMM, ie, the HMM * converges. The caller will need to match the result of this method * with a number based on experience. * @param hmm1 the original HMM. * @param hmm2 the HMM that was most recently taught. * @return the difference measure between the two HMMs. * @throws Exception if one is thrown. */ public double difference(Hmm hmm1, Hmm hmm2) throws Exception { KullbackLeiblerDistanceCalculator kdc = new KullbackLeiblerDistanceCalculator(); return kdc.distance(hmm1, hmm2); } private String[] tokenizeSentence(String sentence) { String[] tokens = StringUtils.split( StringUtils.lowerCase(StringUtils.trim(sentence)), " "); return tokens; } private List getObservations(String[] tokens) throws Exception { if (words == null || words.size() == 0) { loadWordsFromDictionary(); } List observations = new ArrayList(); for (String token : tokens) { observations.add(new ObservationInteger(words.get(token))); } return observations; } private void loadWordsFromDictionary() throws Exception { BufferedReader reader = new BufferedReader( new FileReader(dictionaryLocation)); String word; int seq = 0; while ((word = reader.readLine()) != null) { words.put(word, seq); seq++; } reader.close(); } } ```

Word Sense Disambiguation

Given a sentence, a human user can figure the correct POS for each word almost immediately, but with an HMM, we can only tell which is the most likely POS for the word given the sequence of words in the sentence. Obviously, this depends on how large and accurate the HMM's training data set was. Here is how the HmmTagger is called to determine the most likely POS for a word in the sentence.

 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31``` ``` @Test public void testWordSenseDisambiguation() throws Exception { HmmTagger hmmTagger = new HmmTagger(); hmmTagger.setDataDir("/opt/brown-2.0"); hmmTagger.setDictionaryLocation("src/test/resources/brown_dict.txt"); hmmTagger.setHmmFileName("src/test/resources/hmm_tagger.dat"); Hmm hmm = hmmTagger.buildFromHmmFile(); String[] testSentences = new String[] { "The dog ran after the cat .", "Failure dogs his path .", "The cold steel cuts through the flesh .", "He had a bad cold .", "He will catch the ball .", "Salmon is the catch of the day ." }; String[] testWords = new String[] { "dog", "dogs", "cold", "cold", "catch", "catch" }; for (int i = 0; i < testSentences.length; i++) { System.out.println("Original sentence: " + testSentences[i]); Pos wordPos = hmmTagger.getMostLikelyPos(testWords[i], testSentences[i], hmm); System.out.println("Pos(" + testWords[i] + ")=" + wordPos); } } ```

And here are the results. As you can see, the HMM did well on all but the second sentence.

 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17``` ```Original sentence: The dog ran after the cat . Pos(dog)=NOUN Original sentence: Failure dogs his path . Pos(dogs)=NOUN Original sentence: The cold steel cuts through the flesh . Pos(cold)=ADJECTIVE Original sentence: He had a bad cold . Pos(cold)=NOUN Original sentence: He will catch the ball . Pos(catch)=VERB Original sentence: Salmon is the catch of the day . Pos(catch)=NOUN ```

POS Tagging

POS Tagging uses the same algorithm as Word Sense Disambiguation. Given a HMM trained with a sufficiently large and accurate corpus of tagged words, we can now use it to automatically tag sentences from a similar corpus. Here is the JUnit code snippet to do tag the sentences we used in our previous test.

 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15``` ``` @Test public void testPosTagging() throws Exception { HmmTagger hmmTagger = new HmmTagger(); hmmTagger.setDataDir("/opt/brown-2.0"); hmmTagger.setDictionaryLocation("src/test/resources/brown_dict.txt"); hmmTagger.setHmmFileName("src/test/resources/hmm_tagger.dat"); Hmm hmm = hmmTagger.buildFromHmmFile(); // POS tagging String[] testSentences = new String[] {...}; for (int i = 0; i < testSentences.length; i++) { System.out.println("Original sentence: " + testSentences[i]); System.out.println("Tagged sentence: " + hmmTagger.tagSentence(testSentences[i], hmm)); } } ```

And here are the results.

 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22``` ```Original sentence: The dog ran after the cat . Tagged sentence: the/ADJECTIVE dog/NOUN ran/VERB after/OTHER the/ADJECTIVE cat/NOUN ./OTHER Original sentence: Failure dogs his path . Tagged sentence: failure/NOUN dogs/NOUN his/OTHER path/NOUN ./OTHER Original sentence: The cold steel cuts through the flesh . Tagged sentence: the/ADJECTIVE cold/ADJECTIVE steel/NOUN cuts/NOUN through/OTHER the/ADJECTIVE flesh/NOUN ./OTHER Original sentence: He had a bad cold . Tagged sentence: he/OTHER had/VERB a/ADJECTIVE bad/ADJECTIVE cold/NOUN ./OTHER Original sentence: He will catch the ball . Tagged sentence: he/OTHER will/VERB catch/VERB the/ADJECTIVE ball/NOUN ./OTHER Original sentence: Salmon is the catch of the day . Tagged sentence: salmon/NOUN is/VERB the/ADJECTIVE catch/NOUN of/OTHER the/ADJECTIVE day/NOUN ./OTHER ```

Sentence Likelihood

HMMs can be used to predict if one sentence is more likely to occur than another one, by comparing the observation probability of a certain sequence of words with another sequence. So for example, we find that the HMM believes that sentences spoken by Master Yoda of Star Wars fame are less likely to occur in "normal" English than sentences expressing similar meaning that you or I would speak.

 ``` 1 2 3 4 5 6 7 8 9 10 11 12``` ``` @Test public void testObservationProbability() throws Exception { HmmTagger hmmTagger = new HmmTagger(); hmmTagger.setDataDir("/opt/brown-2.0"); hmmTagger.setDictionaryLocation("src/test/resources/brown_dict.txt"); hmmTagger.setHmmFileName("src/test/resources/hmm_tagger.dat"); Hmm hmm = hmmTagger.buildFromHmmFile(); System.out.println("P(I am worried)=" + hmmTagger.getObservationProbability("I am worried", hmm)); System.out.println("P(Worried I am)=" + hmmTagger.getObservationProbability("Worried I am", hmm)); } ```

As expected, our results indicate that the HMM understands us better than it understands Master Yoda.

 ```1 2``` ```P(I am worried)=5.446081633660202E-11 P(Worried I am)=1.2623833954125002E-11 ```

Unsupervised Learning

The final problem we can solve with a HMM is to build one from a set of untagged data. This HMM can then be used for solving the Sentence Likelihood problem, but not the POS Tagging or the WSD problems. To set this up, I picked up a bunch of of Yoda quotes from this page and fed it into a newly instantiated HMM. I then took the same two sentences and asked the HMM which was more probable. Here is the test code snippet to do that:

 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31``` ``` @Test public void testTeachYodaAndObserveProbabilities() throws Exception { List sentences = Arrays.asList(new String[] { "Powerful you have become .", "The dark side I sense in you .", "Grave danger you are in .", "Impatient you are .", "Try not .", "Do or do not there is no try .", "Consume you the dark path will .", "Always in motion is the future .", "Around the survivors a perimeter create .", "Size matters not .", "Blind we are if creation of this army we could not see .", "Help you I can yes .", "Strong am I with the force .", "Agree with you the council does .", "Your apprentice he will be .", "Worried I am .", "Always two there are .", "When 900 years you reach look as good you will not ." }); HmmTagger hmmTagger = new HmmTagger(); hmmTagger.setDataDir("/opt/brown-2.0"); hmmTagger.setDictionaryLocation("src/test/resources/brown_dict.txt"); Hmm learnedHmm = hmmTagger.teach(sentences); System.out.println("P(Worried I am)=" + hmmTagger.getObservationProbability("Worried I am", learnedHmm)); System.out.println("P(I am worried)=" + hmmTagger.getObservationProbability("I am worried", learnedHmm)); } ```

Now, as you can see, this new HMM understands Yoda better than it understands us :-).

 ```1 2``` ```P(Worried I am)=4.455273233553778E-6 P(I am worried)=2.4569521508568634E-6 ```

Conclusions

Personally, this learning curve was quite a steep one for me. The theory was fairly easy to grasp from an intuitive standpoint, but then understanding how to model the POS tagging problem as a HMM took me a while. Once I crossed that hurdle, it took me a fair bit of effort to figure out how to use Jahmm to build and solve a HMM.

I think it was worth it, though. HMMs are a very powerful modeling tool for text mining, and can be used to model a variety of real life situations. Using a library such as Jahmm means that you just have to figure out how to model your problem and to solve it using the tools provided.

Hopefully, if you've been reading this far, and you started out not knowing or with a vague idea of what an HMM was and how it could be used for POS tagging (as was my situation couple of months ago), this post has provided some information as well as an example of using the Jahmm API to build and solve an HMM.

Update 2009-04-26: In recent posts, I have been building on code written and described in previous posts, so there were (and rightly so) quite a few requests for the code. So I've created a project on Sourceforge to host the code. You will find the complete source code built so far in the project's SVN repository.

42 comments (moderated to prevent spam):

Anonymous said...

great application, well explained.
I am student in France in computing and language analysis.

I understand your program but i don't really see how are files like brown_tags.txt or brown_dict.txt ...

I would like see an example if it's possible.

Thanks for all.

Sujit Pal said...

Hi, thanks for the kind words, and I am glad you liked the post. The brown_tags.txt is basically a flat file version of the table which I got from the web page referenced, and the brown_dict.txt is the dictionary that I got off the NTLP project.

sharma said...

Hi Paul ,
Great application .......I was also trying to use the application but i could n't found brown_dict.txt file anywhere can u please help me??

sharma said...

gr8 explanation!!!!!! very good blog, I was trying to develop the same application but can't find the file brown_dict.txt brown_tags.txt

Sujit Pal said...

Hi sharma, thanks for your appreciation. I got the brown_dict.txt file from the NLTK Project, and the brown_tags.txt is a flat file version of the table in this page.

Anonymous said...

Hello,I am a student from China,I wanna say it's a great application,and now i am trying to use the application,but there is an error,i can't find the class POS,could you help me?
Thank you!

Sujit Pal said...

Thank you. Are you trying out the code in this post? I did not see any reference to the POS class in the code here. I have used the POS class elsewhere, its from the MIT Java-Wordnet Interface jar file (jwi-*.jar). Google for the full name and you will find the project page and thence the jar file.

allanctan said...

Hi Sujit,

Just noticed in HMMTagger.java line#92-#97... the second loop (j) is not necessary?

Sujit Pal said...

Thanks, you are right...removing this in the svn version and checking in.

Anonymous said...

Hi. I'm also using Jahmm but I'm encountering a problem. I want to make a voice recognition
software and use HMM to classify the data.

An example of the feature vector(training data) I used is like this:

739.095423 -68.9217791 -41.94023848 190.343314 70.14624796
19.4320868 101.1893881 60.00366733 1.083977616 31.19825832
44.07300049 19.78601438

To learn the HMMs, my code is:

public Hmm learnBleh(List>
sequences){
int numberOfHiddenStates = 12;
do{

KMeansLearner kml = new
KMeansLearner(numberOfHiddenStates, new
OpdfGaussianFactory(), sequences);

trainedHmm = kml.learn();
BaumWelchLearner bwl = new BaumWelchLearner();
bwl.setNbIterations(20);
trainedHmm = bwl.learn(trainedHmm, sequences);
numberOfHiddenStates++;
}while(Double.isNaN(trainedHmm.getPi(0)) && numberOfHiddenStates
<50);

return trainedHmm;
}

How come I get an error:
Exception in thread "main" java.lang.IllegalArgumentException: Variance
must be positive
at
be.ac.ulg.montefiore.run.distributions.GaussianDistribution.(Gaussian
Distribution.java:43)
at
be.ac.ulg.montefiore.run.jahmm.OpdfGaussian.fit(OpdfGaussian.java:124)
at
be.ac.ulg.montefiore.run.jahmm.OpdfGaussian.fit(OpdfGaussian.java:93)
at
be.ac.ulg.montefiore.run.jahmm.learn.KMeansLearner.learnOpdf(KMeansLearner.
java:165)
at
be.ac.ulg.montefiore.run.jahmm.learn.KMeansLearner.iterate(KMeansLearner.ja
va:67)
at
be.ac.ulg.montefiore.run.jahmm.learn.KMeansLearner.learn(KMeansLearner.java
:96)

I don't know what the error means. Have you ever encountered an error like this? Thanks.
All help is greatly appreciated!

Ashwin Jayaprakash said...

Hi Sujit, I got to read and appreciate this old post only today. Nice work, especially the "Yoda speak" example :)

BTW, are you following the Mahout project? It looks like something you'd enjoy.

Ashwin.

Sujit Pal said...

@Anonymous: I think I have seen this error before, but not sure about the solution anymore, this was a while ago. I will try to find out what I did or if I can reproduce this, and get back.

@Ashwin: Thanks, I /am/ following the Mahout project, sort of. Last I looked it was too early stage unless one is a developer on that project, but I know that there has been some activity recently - I should probably look at it again. Thanks for the reminder.

Zoloo said...

Hello Sujit. It is very helpful to me. But I have a question? Which is it bi-gram or tri-gram.

Thanks.

Sujit Pal said...

Thanks Zoloo. Not sure about the question though?

Barnabas said...

Hi, Sujit. Thanks for your amazingly good explanation ^^ it was really helpful to me.

I'm now focusing on Named Entity Recognition problem, and I want to use HMM (with Jahmm library) to solve this problem. I made training corpus which is Named Entity marked by hand, so that I can use it as a training data. But the thing is that, how to train it automatically.
I've looking for training method in Jahmm, but there are only 2 way to train, BaumWelchLearner and KMeansLearner, and they only gets Observation Sequences as a parameter.
For me, this is the problem. Because, for the training, I want to mark Named Entity Type to the words when process the training.

In summary, can I train to make all Hmm variables (pi, transition prob. and observation prob. ) automatically using internal method of Jahmm ? if not, how can I train them without making this Hmm variables with my own hand.

Thanks for all, in advance. ^^;

Sujit Pal said...

Thanks Barnabas. Unfortunately I don't think I know enough to answer your question. From what I understand about HMMs, you need a set of observation sequences to train it, but I probably have a lot to learn about this stuff. I would suggest asking the author - he has put his code on googlecode, and that has a discussion list as well.

Barnabas said...

I've found another good library, mallet (http://mallet.cs.umass.edu/) it supports supervised learning of HMM, CRF, MEMM. By using this library, I will implement NER system. If there's some progress, I'll let you know, thanks ^^

Sujit Pal said...

Hi Barnabas, best of luck, and thanks for the pointer to mallet, definitely want to take a look at it.

Shruti said...

Hey Sujit,
Great article! I was just wondering what you considered as the initial probabilities in the HMM. Do we consider random numbers that add up to 1 or is there some other way to set initial probabilities?

Sujit Pal said...

Thanks Shruti. The initial probabilities were calculated based on the actual occurrences of a word and the POS in the tagged Brown corpus.

Shantanu said...

Nice application. But i can't set the brown corpus...Please help me to build it for bengali langage..

Sujit Pal said...

Hi Shantanu, you got the wrong guy, sorry :-). Even though I am Bengali, I can't read or write the language (well not completely true, but annotating bengali text with grammar is way past my capabilities), spent most of my life outside Bengal.

Swati said...

hi Sujeet...i tried working with the tagger.but when i give my own input file it gives an error.Does it not work with unknown words not present in the dictionary?

Sujit Pal said...

Hi Swati, yes, I believe you need the dictionary to contain one or more occurrences of the words you are seeking to tag. It figures out the POS of such words by looking at the context of these words in your corpus and finding the most likely one.

Deuz said...

Hi Sujit,

Nice blog you have here. Do the source code is still available for download?

Thanks

Sujit Pal said...

Thanks Deuz, and yes, most of the code from this time is checked into jtmt.sf.net. If you don't find it there its probably not available, but you can copy paste the code from the blog post if it helps (line numbers are not included in the paste).

unni said...

sir how can i run your code?

Sujit Pal said...

Hi unni, the post contains JUnit test snippets (search for @Test in this page) that contains to build an HMM model from the Brown corpus, use the model to detect POS for words, and rank sentences in order of plausibility. You can either build a JUnit test class with these, or put the code inside a main() method and execute them.

Anonymous said...

Sir thanks for this great app and tutorial. I wonder what should be the content of hmm_tagger.dat file.

Sujit Pal said...

Hi, I put them all in my JTMT project on sourceforge (link to project at bottom of post) - the hmm_tagger.dat file I used is here. Deleting your other two comments since they are the same thing, sorry about the delay in replying.

Anonymous said...

I tried running your code. but i always get error. Am i doing something wrong? it always fail on the observationpart

Anonymous said...

how do you add words in the brown_dict.txt ??

Sujit Pal said...

Its been a while since I wrote this so can't promise that I will be able to figure it out. Do you get an error message? Also, you may want to download the code from the jtmt.sf.net project SVN repo - they should be the same as whats on the post, but just to be sure.

I think the brown_dict.txt is just the words from the brown_tags.txt file. I built it because JAHMM needed it as a list of words.

Anonymous said...

I did try to add words on the brown_dict file but everything got messed up. Do you add words manually or is there a way to add them using jahmm?

Anonymous said...

how did you create the tagger.dat file ?

Sujit Pal said...

I don't recall, but I am almost certain that I didn't, since the most likely thing I would have done would be to transform the Brown Corpus data into the format JAHMM or JAHMM based code requires. The tagger file is created using HmmTagger, look at the saveToFile method. For how its called take a look at HmmTaggerTest#testBuildFromBrownAndWrite. Take a look at the code, its reasonably well commented, and you should find all your answers in there - its all in the postaggers (src/main/java and src/test/java) package.

Hi, I liked the idea.
But what if I want to tag any input sentence with more than five tagsets(NOUN, VERB, ADJECTIVE, ADVERB, OTHER)?
Thanks

Sujit Pal said...

Hi Vikram, you will need training data for whatever you want to tag with. In my case the training data comes from the brown corpus which is tagged with the 5 POS tags, so thats why the classification happens only with 5 tags. If you want to tag with more POS tags, you will have to get a training set that has more tags - Penn Treebank has more, but the full corpus is not free.

Hello, sujit sir
Where can i get documentation of JHMM lib.
I did not find any resources except this.
Thanks

Sujit Pal said...

Hi Vikram, its been a while so don't remember exactly, but I believe I initially used the JAHMM example here then started looking through the source code, specifically the unit tests to see how to do various things.

Anonymous said...

How can I find the unknown word i.e the word which is not in the corpus, are there any algorithms that I can refer or anything else.

Thanks!

Sujit Pal said...

Do you mean predict the probability of an unknown word? There is nothing directly applicable AFAIK, but you could probably use smoothing techniques to include _UNK_ (representative token for unknown words) and then find the probability of _UNK_. But not completely sure if this will work for you.