In my previous post, I described an HMM-based Part of Speech (POS) tagger. This post describes a rule-based POS tagger loosely based on ideas underlying the design of the Brill Tagger, as described in Chapter 4 of the TMAP book.
The rule-based tagger provides a single method, tagSentence(), which takes a sentence as input and returns the same sentence tagged with the appropriate part of speech. I was too lazy to add a convenience method to return the POS for a given word in a sentence, since it's really simple, and I have already done this in the code in my previous post.
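To make the API concrete before diving into the implementation, here is a minimal usage sketch; the setter calls and file locations are the ones used in the JUnit test shown later in this post:

// A usage sketch; configuration values are from the test code below
RuleBasedTagger tagger = new RuleBasedTagger();
tagger.setWordnetDictLocation("/opt/wordnet-3.0/dict");
tagger.setSuffixMappingLocation("src/main/resources/pos_suffixes.txt");
tagger.setTransitionProbabilityDatafile(
  "src/main/resources/pos_trans_prob.txt");
// returns each word tagged as word/POS, e.g.
// The/OTHER union/OTHER is/OTHER in/ADJECTIVE a/NOUN sad/ADJECTIVE state/NOUN
String taggedSentence = tagger.tagSentence("The union is in a sad state.");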
Wordnet is used to find the part of speech for each word in the sentence. We use the MIT Java Wordnet Interface (JWI) to access the Wordnet database. Wordnet can only recognize the following four parts of speech - NOUN, VERB, ADJECTIVE and ADVERB - so our POS tagging is restricted to these four and OTHER. We enhance the Pos enum class from our previous post with methods to convert Wordnet POS to and from our Pos enum, as shown below:
// Source: src/main/java/com/mycompany/myapp/postaggers/Pos.java
package com.mycompany.myapp.postaggers;

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.commons.collections15.BidiMap;
import org.apache.commons.collections15.bidimap.DualHashBidiMap;
import org.apache.commons.lang.StringUtils;

import edu.mit.jwi.item.POS;

public enum Pos {

  NOUN, VERB, ADJECTIVE, ADVERB, OTHER;

  private static Map<String,Pos> bmap = null;
  private static BidiMap<Pos,POS> wmap = null;

  private static final String translationFile =
    "src/main/resources/brown_tags.txt";

  public static Pos fromBrownTag(String btag) throws Exception {
    // .. omitted for brevity, see previous post for body
  }

  public static Pos fromWordnetPOS(POS pos) {
    if (wmap == null) {
      wmap = buildPosBidiMap();
    }
    return wmap.getKey(pos);
  }

  public static POS toWordnetPos(Pos pos) {
    if (wmap == null) {
      wmap = buildPosBidiMap();
    }
    return wmap.get(pos);
  }

  private static BidiMap<Pos,POS> buildPosBidiMap() {
    wmap = new DualHashBidiMap<Pos,POS>();
    wmap.put(Pos.NOUN, POS.NOUN);
    wmap.put(Pos.VERB, POS.VERB);
    wmap.put(Pos.ADJECTIVE, POS.ADJECTIVE);
    wmap.put(Pos.ADVERB, POS.ADVERB);
    wmap.put(Pos.OTHER, null);
    return wmap;
  }
}
When Wordnet is asked for the POS of a particular word, it can return one of the following three results, each of which is handled differently as described below:
- No POS found for word
- Single Unique POS found for word
- Multiple POS found for word
No POS found for Word
In this case, Wordnet may not know about the word being checked, or it could be a proper noun. We use a combination of word-pattern rules to try to guess the POS for the word. If none of the patterns match, the POS is considered to be OTHER.
First, we check whether the first letter is uppercase; if so, we assume the word is a proper noun and tag it with Pos.NOUN.
If not, we check whether the word ends with one of the known suffixes in our suffix-to-POS mapping, trying the longest suffixes first. The POS corresponding to the first matched suffix is used to tag the word, as illustrated in the sketch below.
If no suffix matches either, the word is tagged as Pos.OTHER.
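The longest-suffix-first ordering matters when one mapped suffix is a trailing substring of another. Here is a minimal, self-contained sketch of the idea - the two entries ("ly" as ADVERB and "y" as ADJECTIVE) are hypothetical examples, not entries from the actual mapping file - showing how the length-first comparator (the same one used in the tagger below) makes "quickly" match the longer suffix:

// Sketch: longest-suffix-first lookup with hypothetical entries
import java.util.Comparator;
import java.util.TreeMap;

public class SuffixOrderDemo {
  public static void main(String[] args) {
    TreeMap<String,String> suffixPosMap = new TreeMap<String,String>(
      new Comparator<String>() {
        public int compare(String s1, String s2) {
          // longer suffixes sort first; ties break alphabetically
          return s1.length() == s2.length()
            ? s1.compareTo(s2) : s2.length() - s1.length();
        }
      });
    suffixPosMap.put("y", "ADJECTIVE");  // hypothetical mapping
    suffixPosMap.put("ly", "ADVERB");    // hypothetical mapping
    for (String suffix : suffixPosMap.keySet()) {
      if ("quickly".endsWith(suffix)) {
        // prints ly => ADVERB, not y => ADJECTIVE
        System.out.println(suffix + " => " + suffixPosMap.get(suffix));
        break;
      }
    }
  }
}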
Single Unique POS found for Word
In this case, there is no confusion -- Wordnet tells us that there is a single POS found for the word, so we tag the word with this POS, and continue on with our life...err, I mean, the next word.
Multiple POS found for Word
When Wordnet reports multiple POS possibilities for a word, it means that the word can be used as different POS depending on where it is used in the sentence - in other words, the context determines the POS.
The tagger considers the context as the word trigram surrounding the word (i.e. before/current/after). Two rules are fired, the word-backward rule and the word-forward rule, unless the word happens to occur at the beginning or end of the sentence, in which case only one rule is fired. The objective of these rules is to find the most likely POS for the word based on the POS of its anterior and posterior neighbors.
The probabilities used to compute the likelihood come from the transition probabilities (the A values) that were computed from the Brown Corpus in my previous post.
Each rule finds the highest probability of a particular POS pair (before/current or current/after) occurring and associates it with the word's POS. The two probabilities are then added, and the word is tagged with the POS corresponding to the highest total.
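As a concrete example, consider disambiguating "state" in the sentence "The union is in a sad state." from the Results section below. Wordnet reports two possible POS for "state" (NOUN and VERB), while its backward neighbor "sad" resolves uniquely to ADJECTIVE; since "state" is the last word, only the backward rule fires. Reading the ADJECTIVE row of the A matrix shown later in this post, P(ADJECTIVE -> NOUN) = 0.639 easily beats P(ADJECTIVE -> VERB) = 0.024, so the word is tagged state/NOUN.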
In addition, we could have used the initial probabilities (Π) from our last post to similarly resolve an ambiguous POS for the first word in the sentence, but we did not do this.
Tagger code
The code for the tagger is shown below:
// Source: src/main/java/com/mycompany/myapp/postaggers/RuleBasedTagger.java
package com.mycompany.myapp.postaggers;

import java.io.BufferedReader;
import java.io.FileReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

import org.apache.commons.lang.StringUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

import com.mycompany.myapp.clustering.ByValueComparator;
import com.mycompany.myapp.tokenizers.Token;
import com.mycompany.myapp.tokenizers.TokenType;
import com.mycompany.myapp.tokenizers.WordTokenizer;

import edu.mit.jwi.Dictionary;
import edu.mit.jwi.IDictionary;
import edu.mit.jwi.item.IIndexWord;

public class RuleBasedTagger {

  private final Log log = LogFactory.getLog(getClass());

  private class Context {
    public String prev;
    public List<Pos> prevPos;
    public String curr;
    public List<Pos> nextPos;
    public String next;
    public String toString() {
      return StringUtils.join(new String[] {prev, curr, next}, "/");
    }
  };

  private IDictionary wordnetDictionary;
  private Map<String,Pos> suffixPosMap;
  private double[][] tp;

  public void setWordnetDictLocation(String wordnetDictLocation)
      throws Exception {
    this.wordnetDictionary = new Dictionary(
      new URL("file", null, wordnetDictLocation));
    this.wordnetDictionary.open();
  }

  public void setSuffixMappingLocation(String suffixMappingLocation)
      throws Exception {
    String line;
    this.suffixPosMap = new TreeMap<String,Pos>(
      new Comparator<String>() {
        public int compare(String s1, String s2) {
          int l1 = s1.length();
          int l2 = s2.length();
          if (l1 == l2) {
            return s1.compareTo(s2);
          } else {
            return (l2 > l1 ? 1 : -1);
          }
        }
      });
    BufferedReader reader = new BufferedReader(
      new FileReader(suffixMappingLocation));
    while ((line = reader.readLine()) != null) {
      if (StringUtils.isEmpty(line) || line.startsWith("#")) {
        continue;
      }
      String[] suffixPosPair = StringUtils.split(line, "\t");
      suffixPosMap.put(suffixPosPair[0], Pos.valueOf(suffixPosPair[1]));
    }
    reader.close();
  }

  public void setTransitionProbabilityDatafile(
      String transitionProbabilityDatafile) throws Exception {
    int numPos = Pos.values().length;
    tp = new double[numPos][numPos];
    BufferedReader reader = new BufferedReader(
      new FileReader(transitionProbabilityDatafile));
    int i = 0; // row
    String line;
    while ((line = reader.readLine()) != null) {
      if (StringUtils.isEmpty(line) || line.startsWith("#")) {
        continue;
      }
      String[] parts = StringUtils.split(line, "\t");
      for (int j = 0; j < parts.length; j++) {
        tp[i][j] = Double.valueOf(parts[j]);
      }
      i++;
    }
    reader.close();
  }

  public String tagSentence(String sentence) throws Exception {
    StringBuilder taggedSentenceBuilder = new StringBuilder();
    WordTokenizer wordTokenizer = new WordTokenizer();
    wordTokenizer.setText(sentence);
    List<Token> tokens = new ArrayList<Token>();
    Token token = null;
    while ((token = wordTokenizer.nextToken()) != null) {
      tokens.add(token);
    }
    // extract the words from the tokens
    List<String> words = new ArrayList<String>();
    for (Token tok : tokens) {
      if (tok.getType() == TokenType.WORD) {
        words.add(tok.getValue());
      }
    }
    // for each word, find the pos
    int position = 0;
    for (String word : words) {
      Pos partOfSpeech = getPartOfSpeech(words, word, position);
      taggedSentenceBuilder.append(word).
        append("/").
        append(partOfSpeech.name()).
        append(" ");
      position++;
    }
    return taggedSentenceBuilder.toString();
  }

  private Pos getPartOfSpeech(List<String> wordList, String word,
      int position) {
    List<Pos> partsOfSpeech = getPosFromWordnet(word);
    int numPos = partsOfSpeech.size();
    if (numPos == 0) {
      // unknown Pos, apply word rules to figure out Pos
      if (startsWithUppercase(word)) {
        return Pos.NOUN;
      }
      Pos pos = getPosBasedOnSuffixRules(word);
      if (pos != null) {
        return pos;
      } else {
        return Pos.OTHER;
      }
    } else if (numPos == 1) {
      // unique Pos, return
      return partsOfSpeech.get(0);
    } else {
      // ambiguous Pos, apply disambiguation rules
      Context context = getContext(wordList, position);
      Map<Pos,Double> posProbs = new HashMap<Pos,Double>();
      if (context.prev != null) {
        // backward neighbor rule
        accumulatePosProbabilities(posProbs, word, partsOfSpeech,
          context.prev, context.prevPos, false);
      }
      if (context.next != null) {
        // forward neighbor rule
        accumulatePosProbabilities(posProbs, word, partsOfSpeech,
          context.next, context.nextPos, true);
      }
      if (posProbs.size() == 0) {
        return Pos.OTHER;
      } else {
        ByValueComparator<Pos,Double> bvc =
          new ByValueComparator<Pos,Double>(posProbs);
        List<Pos> poslist = new ArrayList<Pos>();
        poslist.addAll(posProbs.keySet());
        Collections.sort(poslist, Collections.reverseOrder(bvc));
        return poslist.get(0);
      }
    }
  }

  private List<Pos> getPosFromWordnet(String word) {
    List<Pos> poslist = new ArrayList<Pos>();
    for (Pos pos : Pos.values()) {
      try {
        IIndexWord indexWord =
          wordnetDictionary.getIndexWord(word, Pos.toWordnetPos(pos));
        if (indexWord != null) {
          poslist.add(pos);
        }
      } catch (NullPointerException e) {
        // JWI throws this if it cannot find the word in its dictionary
        // so we just don't add anything to the poslist.
        continue;
      }
    }
    return poslist;
  }
  private boolean startsWithUppercase(String word) {
    // Character.UPPERCASE_LETTER is a character-category constant,
    // not a character, so comparing a char against it is a bug;
    // Character.isUpperCase() is the correct test.
    return Character.isUpperCase(word.charAt(0));
  }
  private Pos getPosBasedOnSuffixRules(String word) {
    for (String suffix : suffixPosMap.keySet()) {
      if (StringUtils.lowerCase(word).endsWith(suffix)) {
        return suffixPosMap.get(suffix);
      }
    }
    return null;
  }

  private Context getContext(List<String> words, int wordPosition) {
    Context context = new Context();
    if ((wordPosition - 1) >= 0) {
      context.prev = words.get(wordPosition - 1);
      context.prevPos = getPosFromWordnet(context.prev);
    }
    context.curr = words.get(wordPosition);
    if ((wordPosition + 1) < words.size()) {
      context.next = words.get(wordPosition + 1);
      context.nextPos = getPosFromWordnet(context.next);
    }
    return context;
  }

  private void accumulatePosProbabilities(
      Map<Pos,Double> posProbabilities,
      String word, List<Pos> wordPosList, String neighbor,
      List<Pos> neighborPosList, boolean isForwardRule) {
    if (isForwardRule) {
      for (Pos wordPos : wordPosList) {
        for (Pos neighborPos : neighborPosList) {
          double prob =
            tp[wordPos.ordinal()][neighborPos.ordinal()];
          updatePosProbabilities(posProbabilities, wordPos, prob);
        }
      }
    } else {
      for (Pos neighborPos : neighborPosList) {
        for (Pos wordPos : wordPosList) {
          double prob =
            tp[neighborPos.ordinal()][wordPos.ordinal()];
          updatePosProbabilities(posProbabilities, wordPos, prob);
        }
      }
    }
  }

  private void updatePosProbabilities(
      Map<Pos,Double> posProbabilities,
      Pos wordPos, double prob) {
    Double origProb = posProbabilities.get(wordPos);
    if (origProb == null) {
      posProbabilities.put(wordPos, prob);
    } else {
      posProbabilities.put(wordPos, prob + origProb);
    }
  }
}
Test Code and Data Files
The test code for this is really simple. All we do is instantiate the RuleBasedTagger and point it at the location of the Wordnet dictionary, the suffix mapping data file, and the transition probabilities (A values) from the HMM data file that we built in our previous post. The files are described in more detail below. Once instantiated and set up, we feed it a set of sentences and get back POS-tagged sentences. Here is the code for the JUnit test.
// Source: src/test/java/com/mycompany/myapp/postaggers/RuleBasedTaggerTest.java
package com.mycompany.myapp.postaggers;

import org.junit.Test;

public class RuleBasedTaggerTest {

  private String[] INPUT_TEXTS = {
    "The growing popularity of Linux in Asia, Europe, and the US is " +
    "a major concern for Microsoft.",
    "Jaguar will sell its new XJ-6 model in the US for a small fortune.",
    "The union is in a sad state.",
    "Please do not state the obvious.",
    "I am looking forward to the state of the union address.",
    "I have a bad cold today.",
    "The cold war was over long ago."
  };

  @Test
  public void testTagSentence() throws Exception {
    for (String sentence : INPUT_TEXTS) {
      RuleBasedTagger tagger = new RuleBasedTagger();
      tagger.setWordnetDictLocation("/opt/wordnet-3.0/dict");
      tagger.setSuffixMappingLocation("src/main/resources/pos_suffixes.txt");
      tagger.setTransitionProbabilityDatafile(
        "src/main/resources/pos_trans_prob.txt");
      String taggedSentence = tagger.tagSentence(sentence);
      System.out.println("Original: " + sentence);
      System.out.println("Tagged: " + taggedSentence);
    }
  }
}
I am using the Wordnet 3.0 data files, which can be downloaded from the Wordnet site.
The suffix-to-POS mapping file was created manually from data available here and here. The suffix and POS are tab-separated. A partial listing follows:
# Source: src/main/resources/pos_suffixes.txt
# POS Suffixes
#SUFFIX POS
dom NOUN
ity NOUN
ment NOUN
sion NOUN
tion NOUN
ness NOUN
ance NOUN
ence NOUN
er NOUN
...
The transition probabilities (A values) are taken from the HMM text file, which was generated from the Brown Corpus as described in my previous post. The file is a tab-separated file of observed probabilities for transitioning from one POS to another; the numbers across each line should add up to 1. Here is what the file looks like:
# Source: src/main/resources/pos_trans_prob.txt
# NOUN VERB ADJECTIVE ADVERB OTHER
0.155 0.156 0.019 0.025 0.645
0.095 0.195 0.168 0.094 0.449
0.639 0.024 0.148 0.005 0.183
0.052 0.228 0.111 0.041 0.569
0.206 0.199 0.205 0.039 0.351
Results
The results of running the test are shown below (edited for readability by adding newlines to break up the original and tagged sentences so words don't break across lines). As you can see, the tagging is fairly accurate.
Original: The growing popularity of Linux in Asia, Europe, and the US is
a major concern for Microsoft.
Tagged: The/OTHER growing/ADJECTIVE popularity/NOUN of/OTHER Linux/NOUN
in/ADJECTIVE Asia/NOUN Europe/NOUN and/OTHER the/OTHER US/NOUN
is/OTHER a/NOUN major/ADJECTIVE concern/NOUN for/NOUN
Microsoft/OTHER
Original: Jaguar will sell its new XJ-6 model in the US for a small fortune.
Tagged: Jaguar/NOUN will/NOUN sell/VERB its/OTHER new/OTHER XJ-6/OTHER
model/ADJECTIVE in/NOUN the/OTHER US/NOUN for/NOUN a/NOUN
small/ADJECTIVE fortune/NOUN
Original: The union is in a sad state.
Tagged: The/OTHER union/OTHER is/OTHER in/ADJECTIVE a/NOUN sad/ADJECTIVE
state/NOUN
Original: Please do not state the obvious.
Tagged: Please/VERB do/VERB not/ADVERB state/VERB the/OTHER
obvious/ADJECTIVE
Original: I am looking forward to the state of the union address.
Tagged: I/ADJECTIVE am/NOUN looking/ADJECTIVE forward/NOUN to/OTHER
the/OTHER state/OTHER of/OTHER the/OTHER union/ADJECTIVE
address/NOUN
Original: I have a bad cold today.
Tagged: I/ADJECTIVE have/NOUN a/NOUN bad/ADJECTIVE cold/NOUN today/NOUN
Original: The cold war was over long ago.
Tagged: The/OTHER cold/ADJECTIVE war/NOUN was/OTHER over/ADVERB
long/VERB ago/ADJECTIVE
Conclusion
Unlike the HMM approach, where all the information was built into the model at the outset, the rule-based approach takes advantage of the fact that we can use a tagged dictionary (Wordnet) to tell the POS for a word as we encounter it, and that most words resolve to a single POS. For words that are not recognized by Wordnet, we use simple word-pattern matching to deduce the POS. For words that resolve to multiple POS, we use collocation probability rules to disambiguate them.
And now, for something completely different...
On a completely different note, regular (i.e. more than first-time) readers may have noticed some new widgets on my blog. This week, I added a tag cloud widget, spiffed up the "Blogs I read" widget with a new one from Blogger which displays the favicon of the blog and reports on the last time it was updated, added social bookmarking links at the bottom of the post, and added URLs for Atom and RSS Syndication feeds for my blog. Hope you like them, and no, this is not some sinister black-hat SEO bid to pollute/enrich the World Wide Web with my blog contents. For my reasons, read on...
I've wanted to get a tag cloud ever since I wrote about how to build one with Python. I think it is an awesome way to provide a snapshot of the entire site contents at a glance. However, Blogger does not give you a Tag Cloud Widget, although it does give you a Label widget on which this one is based, and I never had the time to mess around with the HTML Template until last week.
For the social bookmarking links, I noticed that one of you had submitted one of my posts to Digg. Whoever it was, thank you, especially since you had to do it manually. For those of you who like my posts enough to submit them to your favorite social bookmarking site, there is now a bank of links under my byline and tags which you can click, and which will pre-populate the title and URL so you don't have to cut and paste.
The syndication URLs just seemed like a good idea because I've written quite a few times about ROME and RSS, and it's funny that I didn't offer a syndication feed URL myself before this. So now I offer two, one for RSS and another for Atom - take your pick :-).
Update 2009-04-26: In recent posts, I have been building on code written and described in previous posts, so there were (and rightly so) quite a few requests for the code. So I've created a project on Sourceforge to host the code. You will find the complete source code built so far in the project's SVN repository.
Dear Sir,
I am Lavanya from Bangalore, Karnataka, India. I am in the CSE branch at PESIT and am doing a project on word classification using JWI. I want to know what I have to do, because I am new to Java. Please do reply; I would like your kind help. My mail id is lavanya2290@gmail.com.
Thanking you,
LAVANYA
Hi Lavanya, not sure I can be of much help here... If you are new to Java and you need to learn it for your project, you should. The JWI project provides some documentation as well, so depending on your needs, you will probably find what you need there. I am not an expert on either word classification or JWI, but I have done some work with both, so if you have specific questions, feel free to ask and I can try to help.
Hi Sujit, thank you for your POS posts... they are very useful. Just wondering, which of the two approaches (rule-based vs. HMM) is better or more accurate?
Hi allanctan, thanks for your appreciation. I think it may make sense to figure out the POS transition probabilities using HMM over a representative training set, then use that as one of the inputs to a rule-based tagger.
ReplyDeleteRespected Sir
I just went through your code - it's amazing. I was trying to make it run, but the problem is that I'm unable to download the Brown corpus from nltk.org. The site allows downloading the corpus using the Python CLI, but I'm behind my college's proxy server and the Python CLI doesn't let me download. I don't know anything about Python, so I am unable to use it. Kindly help me out by either providing some other link to the Brown Corpus or explaining how to configure Python to download from behind the proxy.
I will be highly obliged if you lend me some help in this regard.
Thanking you
Shahid
Hi Shahid, if you are behind a proxy server that places limits on how much you can download, I don't think sending you another URL (if I knew of one, which I don't) would help. Perhaps just talk to your college admin to download it for you?
Dear Sir,
Thank you for this post; it helped a lot in understanding the implementation-level procedures. I am currently doing my final year project, developing a chatterbot to obtain concise answers to human queries from online repositories, and am implementing the POS tagger now. I have resources to use only Wordnet as the lexical database, but I'm planning to use the Penn Treebank (PTB) tagset instead of Wordnet tags. Is it possible to tag words with the PTB tagset using Wordnet? I'm thinking about building a finite state transducer... Also, I'd like your advice on choosing among techniques (rule-based, HMM, and Brill Tagger) for this project.
Thank You
Thanks Azeem, and you're welcome. I am not an expert in this area, so my advice may be incorrect. So if your intuition does not agree with what I suggest, then your intuition is probably correct :-).
My preference (as I have described) is to use an HMM to identify the transition probabilities from a test corpus, then use Wordnet or PTB to tag the non-ambiguous cases, and then disambiguate the rest using the transition probabilities for the most likely POS.
I haven't used PTB, so can't say for sure if the mapping would be possible.
Hi Sujit, I am working on question categorization. Can you tell me how POS tagging can be applied to questions?
Hi Aarthi, you could decide (based on your knowledge of the question data) what part(s) of speech in your questions are good discriminators, and then extract only those for categorization. You could even run some analysis on words in various POS categories to find which ones are good candidates.
Hi Sujit, I have been following your blog for the last 7 months and it is really helping my work, but currently I am stuck on a problem while building a hierarchy of noun phrases in a sentence using NLP. I am not finding any algorithm for this.
I want to make a hierarchy like this:
nike -> company, nike -> shoes, shoes -> boots, boots -> ugg boots
I don't know if you can do this without reference to an external knowledge base which "knows" about all or part of the hierarchy. I haven't used any myself (the taxonomy I speak about off and on is one maintained by my company, and we reference external datastores as well as do manual curation). You may find part of your hierarchy in Wordnet, and maybe some other part in something like OpenCalais, but not 100% sure about that.
And what is the use of the hmm_tagger.dat file? I didn't get that.
Hi Vikram, did you post to the wrong blog post? I don't see an hmm_tagger.dat file here...