Last week I was handed an interesting task - flagging non-nouns from single word synonyms for concepts in our Medical Ontology. Now, the Part of Speech (POS) tag is not really an intrinsic attribute of a word - rather, the POS is determined by the context in which the word appears. For example, consider the two sentences - "He had an infectious laugh", where laugh is a Noun, and "He will laugh at the joke", where the same word is now a Verb.
However, a word with multiple senses (usually) has one dominant sense, so in the absence of my context, we could call that the word's POS. In order to calculate the dominant POS for words, I decided to use the words in the Brown Corpus, a manually tagged text corpus of about 500 articles. The Brown Corpus is available (after nltk.download()) as part of the Natural Language Processing Toolkit (NLTK).
The idea is to create a frequency distribution of (word, tag) using all the words and their tags in the Brown Corpus. Then for each word, we normalize the frequencies across the different POS tags to derive the probability of a word having each of these tags. Output is in a flat file. Heres the code to build this data.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 | # Source: dict_build.py
import nltk
from nltk.corpus import brown
DELIM = "_|_"
NORMED_TAGS = ["NN", "VB", "JJ", "RB", "DT", "IN", "OT"]
POSTAGS = {
"NN" : "noun",
"VB" : "verb",
"JJ" : "adjective",
"RB" : "adverb",
"DT" : "determiner",
"IN" : "preposition",
"OT" : "other"
}
def normalize_brown_postags():
brown_tags = open("../data/brown_tags.csv", 'rb')
tag_map = dict()
for line in brown_tags:
line = line.strip()
if len(line) == 0 or line.startswith("#"):
continue
tag_name, tag_description = line.split("\t")[0:2]
tag_desc_words = set(nltk.word_tokenize(tag_description.lower()))
is_tagged = False
for normed_tag in NORMED_TAGS[:-1]:
desc_pattern = POSTAGS[normed_tag]
if desc_pattern in tag_desc_words:
tag_map[tag_name] = normed_tag
is_tagged = True
if not is_tagged:
tag_map[tag_name] = "OT"
brown_tags.close()
return tag_map
def retag_brown_words(tag_map):
wordpos_fd = nltk.FreqDist()
for word, tag in brown.tagged_words():
if tag_map.has_key(tag):
normed_pos = tag_map[tag]
retagged_word = DELIM.join([word.lower(), normed_pos])
wordpos_fd.inc(retagged_word)
return wordpos_fd
def compose_record(word, wordpos_fd):
freqs = []
for tag in NORMED_TAGS:
wordpos = DELIM.join([word, tag])
freqs.append(wordpos_fd[wordpos])
sum_freqs = float(sum(freqs))
nf = [float(f) / sum_freqs for f in freqs]
return "%s\t%s\n" % (word, "\t".join(["%5.3f" % (x) for x in nf]))
# main
tag_map = normalize_brown_postags()
wordpos_fd = retag_brown_words(tag_map)
already_seen_words = set()
brown_dict = open("../data/brown_dict.csv", 'wb')
brown_dict.write("#WORD\t%s\n" % ("\t".join(NORMED_TAGS)))
for wordpos in wordpos_fd.keys():
word, tag = wordpos.split(DELIM)
if word in already_seen_words:
continue
brown_dict.write(compose_record(word, wordpos_fd))
already_seen_words.add(word)
brown_dict.close()
|
I got the Brown tag descriptions from this page, I copied the table into OpenOffice Calc and wrote it out as a CVS file (brown_tags.csv) referenced above. I reduced the large set of Brown tags to a small set of 7 tags - Nouns (NN), Verbs (VB), Adjectives (JJ), Adverbs (RB), Determiners (DT), Prepositions (IN), and Other (OT). I had originally reduced it just 4 tags (similar to Wordnet) - NN, VB, JJ and RB - but extended it to account for the phrase sequences (see below). The output of the code above is a file that looks like this:
1 2 3 4 5 6 7 | #WORD NN VB JJ RB DT IN OT
as 0.000 0.000 0.000 0.000 0.000 0.017 0.983
his 0.000 0.000 0.000 0.000 0.995 0.000 0.005
be 0.000 1.000 0.000 0.000 0.000 0.000 0.000
on 0.000 0.000 0.000 0.083 0.000 0.917 0.000
at 0.000 0.000 0.000 0.000 0.000 1.000 0.000
...
|
The data is used from Java, for maintainability (we are a Java shop), and because part of the work involves checking against a middleware component that is Java-only, but here is some testing code that illustrates how this information can be used.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 | # Source: predict.py
from __future__ import division
import numpy as np
import nltk
NORMTAGS = ["NN", "VB", "JJ", "RB", "DT", "IN", "OT"]
def load_word_dict(dict_file):
word_dict = {}
fdict = open(dict_file, "rb")
for line in fdict:
line = line.strip()
if len(line) == 0 or line.startswith("#"):
continue
cols = line.split("\t")
word = cols[0]
probs = [float(x) for x in cols[1:]]
word_dict[word] = probs
fdict.close()
return word_dict
def assert_true(fn, message):
if fn != True:
print "Assert failed:", message
def predict_pos(word, word_dict):
if word_dict.has_key(word):
probs = np.array(word_dict[word])
return NORMTAGS[np.argmax(probs)]
else:
return "OT"
def predict_if_noun(word, word_dict):
return predict_pos(word, word_dict) == "NN"
# test cases for individual words
word_dict = load_word_dict("../data/brown_dict.csv")
assert_true(predict_if_noun("hypothalamus", word_dict),
"Hypothalamus == NOUN!")
assert_true(not predict_if_noun("intermediate", word_dict),
"Intermediate != NOUN!")
assert_true(predict_if_noun("laugh", word_dict), "Laugh ~= NOUN!")
|
Having done this, I got to thinking that perhaps we could extend the idea for multi-word phrases as well, ie, we could flag noun phrases instead of nouns. Like their single word counterparts, these phrases are also synonyms of concepts, and do not provide any context to decide what kind of phrases they are.
A phrase is a sequence of POS tags. I decided to use the Penn Treebank (PTB) corpus. The full PTB is not free, but NLTK provides a small subset of the PTB for educational purposes. The idea here is to scan the chunks, filtering out noun phrases and creating a frequency distribution of the tag sequences for noun phrases.
Since not all possible English words are possible in the Brown Corpus, we create a transition matrix that will predict what POS tag will appear given the current POS tag. That way, we will be able to assign a likely POS tag for words that are not in our dictionary. As before, I reduce the PTB tags down to our set of 7 tags. The code for generating the POS sequences and the POS transition matrix is shown below:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 | # Source:phrase_seqs.py
from __future__ import division
import nltk
import numpy as np
from nltk.corpus import treebank_chunk
NORMTAGS = ["NN", "VB", "JJ", "RB", "DT", "IN", "OT"]
POSTAGS = {
"NN" : "noun",
"VB" : "verb",
"JJ" : "adjective",
"RB" : "adverb",
"DT" : "determiner",
"IN" : "preposition",
"OT" : "other"
}
def normalize_ptb_tags():
tag_map = {}
ptb_tags = open("../data/ptb_tags.csv", 'rb')
for line in ptb_tags:
line = line.strip()
if len(line) == 0 or line.startswith("#"):
continue
tag, desc = line.split("\t")
desc_words = nltk.word_tokenize(desc.lower().replace("-", " "))
is_tagged = False
for key in NORMTAGS[:-1]:
postag_desc = POSTAGS[key]
if postag_desc in desc_words:
tag_map[tag] = key
is_tagged = True
if not is_tagged:
tag_map[tag] = "OT"
ptb_tags.close()
return tag_map
def get_chunks(tree, phrase_type, tags):
try:
tree.node
except AttributeError:
return
else:
if tree.node == phrase_type:
tags.append(tree)
else:
for child in tree:
get_chunks(child, phrase_type, tags)
def index_of(tag):
if tag == "START":
return 0
elif tag == "END":
return len(NORMTAGS) + 1
else:
return NORMTAGS.index(tag) + 1
def update_trans_freqs(trans_freqs, tag_seq):
tags = ["START"]
tags.extend(tag_seq.split(" "))
tags.append("END")
bigrams = nltk.bigrams(tags)
for bigram in bigrams:
row = index_of(bigram[0])
col = index_of(bigram[1])
trans_freqs[row, col] += 1
# generate phrases as a sequence of (normalized) POS tags and
# transition probabilities across POS tags.
tag_map = normalize_ptb_tags()
np_fd = nltk.FreqDist()
trans_freqs = np.zeros((len(NORMTAGS) + 2, len(NORMTAGS) + 2))
for tree in treebank_chunk.chunked_sents():
chunks = []
get_chunks(tree, "NP", chunks)
for chunk in chunks:
tagged_poss = [tagged_word[1] for tagged_word in chunk]
normed_tags = []
for tagged_pos in tagged_poss:
try:
normed_tags.append(tag_map[tagged_pos])
except KeyError:
normed_tags.append("OT")
np_fd.inc(" ".join(normed_tags))
fout = open("../data/np_tags.csv", 'wb')
for tag_seq in np_fd.keys():
fout.write("%s\t%d\n" % (tag_seq, np_fd[tag_seq]))
update_trans_freqs(trans_freqs, tag_seq)
fout.close()
# normalize so they are all probablities (by row sum)
trans_probs = trans_freqs / np.linalg.norm(trans_freqs, axis=1)[:, np.newaxis]
trans_probs[~np.isfinite(trans_probs)] = 0.0
np.savetxt("../data/pos_trans.csv", trans_probs, fmt="%7.5f", delimiter="\t")
|
Thanks to this presentation and this project for teaching me how to parse a NLTK tree. The output of the code above are the following two files, the first a list of inter-POS transition probabilities to figure out what an unknown POS tag should be given the current one.
1 2 3 4 5 6 7 8 9 10 | # START NN VB JJ RB DT IN OT END
0.00000 0.43902 0.06724 0.21358 0.04351 0.78312 0.02769 0.37574 0.00000 # START
0.00000 0.71857 0.07099 0.13159 0.00346 0.00693 0.00000 0.36708 0.57139 # NN
0.00000 0.83146 0.03079 0.18477 0.15397 0.01540 0.00000 0.27715 0.41573 # VB
0.00000 0.94283 0.06168 0.20266 0.00441 0.01322 0.00000 0.22029 0.13217 # JJ
0.00000 0.08762 0.04381 0.65716 0.04381 0.00000 0.00000 0.04381 0.74478 # RB
0.00000 0.79556 0.14763 0.50030 0.04921 0.03281 0.00000 0.29526 0.06561 # DT
0.00000 0.68599 0.00000 0.34300 0.00000 0.17150 0.00000 0.34300 0.51450 # IN
0.00000 0.66386 0.11246 0.33374 0.02177 0.06893 0.01814 0.59131 0.28296 # OT
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 # END
|
and the other is a list of POS tag sequences in the limited PTB corpus, alongwith their counts.
1 2 3 4 5 6 7 | DT JJ NN 1162
OT NN 1097
DT NN NN 875
DT 830
NN NN NN 547
JJ 359
...
|
Once again, the consumer of this data is in the Java layer, but we do the following tests for our own satisfaction.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 | def load_phrase_tags(phrase_tag_file):
phrase_tags = set()
ftags = open(phrase_tag_file, 'rb')
for line in ftags:
line = line.strip()
if len(line) == 0 or line.startswith("#"):
continue
phrase, count = line.split("\t")
phrase_tags.add(phrase)
ftags.close()
return phrase_tags
def assert_true(fn, message):
if fn != True:
print "Assert failed:", message
def tag_to_index(tag):
if tag == "START":
return 0
elif tag == "END":
return len(NORMTAGS) + 1
else:
return NORMTAGS.index(tag) + 1
def index_to_tag(index):
if index == 0:
return "START"
elif index == len(NORMTAGS) + 1:
return "END"
else:
return NORMTAGS[index - 1]
def predict_likely_pos(prev_tag, trans_probs):
row = tag_to_index(prev_tag)
probs = trans_probs[row, :]
return index_to_tag(np.argmax(probs))
def predict_if_noun_phrase(phrase, trans_probs, phrase_tags):
words = nltk.word_tokenize(phrase)
tags = []
for word in words:
if word_dict.has_key(word):
tags.append(predict_pos(word, word_dict))
else:
prev_tag = "START" if len(tags) == 0 else tags[-1]
tags.append(predict_likely_pos(prev_tag, trans_probs))
return " ".join(tags) in phrase_tags
# test cases for phrases
phrase_tags = load_phrase_tags("../data/np_tags.csv")
trans_probs = np.loadtxt("../data/pos_trans.csv", delimiter="\t")
assert_true(predict_if_noun_phrase("time flies", trans_probs, phrase_tags),
"time flies == NP!")
assert_true(not predict_if_noun_phrase(
"were spoken", trans_probs, phrase_tags),
"were spoken == VP!")
|
The phrase stuff seems to be a bit too loose at the moment, it matches almost any phrase you throw at it as a noun-phrase. One possibility is to manually combine rules into a smaller number of regex rules, with a cutoff mandating a minimum required frequency of a tag sequence for it to be considered. Alternatively, we could extract Verb, Adjective and Adverb phrases and attempt to match against all of them, choosing the one with the highest frequency.
One other problem with this approach is that the data sets we have considered are a bit on the small side, and they don't match our domain. An option is to POS tag and chunk a large amount of relevant text with something like OpenNLP, then use the above approach to find free-standing noun phrases.
8 comments (moderated to prevent spam):
Interesting !! Got me thinking, what if you could use Stanford NLP which gives more richer POS tagging than OpenNLP
http://nlp.stanford.edu/software/dependencies_manual.pdf
And also combine it with the dependency tree to glean more about the semantic structure...Note that it will be a memory hog though.
http://nlp.stanford.edu/pubs/LREC06_dependencies.pdf
http://nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/semgraph/SemanticGraphFactory.html
Let me know what you think. Am I barking up the wrong tree ??
Ravi Kiran Bhaskar
Thanks Ravi. This is the second time someone recommended the Stanford Parser, so I guess I really should try it again, found it a bit slow the last time I did. In my current scenario, I am trying to "normalize" the tags to a smaller set so I can (manually, after counting and sorting by frequency) construct POS sequence templates that translate to different phrase types, so a richer set of tags may not necessarily help.
I dont know your exact scenario or design so I wont comment on what you should use, However, I will layout my thought process (atleast I might learn something form you that I overlooked :-) ).
At least from the limited testing I did HMM based taggers (opennlp) are slightly less accurate than the CRF based pos taggers. As you noted Stanford NLP is memory hungry and slower.
coming to paring down the nouns...I believe, sometimes to narrow down you need increased richness in tagging so you can really strip down to really correct ones
Lets take for example "Genetically modified food" could not be detected as a whole by comon HMM taggers however Stanford Parser can tag it as "Adverbial modifier"
similarly an "adjectival modifier" adds more context for you to work with...look at the first link you might get some ideas.
Thanks,
Ravi Kiran Bhaskar
Hi Ravi, thanks for the detailed comment. Yes, completely agree, from my very limited testing of Stanford parser, it looks to be much more accurate than OpenNLP (and that in itself may be a good argument for using Stanford).
For context, the application is to flag possible "bad" entries from a large (approx 20M) list of free standing words (or phrases) by computing the POS (or phrase type) - the intuition is that nouns and noun phrases are "good" and others should be reviewed manually.
OpenNLP came up because the tagged corpora in NLTK is limited, and I figured I could manufacture some training data by tagging/chunking a largish corpus like PubMed using OpenNLP's OOB trained model. As you rightly pointed out, I could also use the Stanford Parser as well and get more accurate and richer tagging.
However, the reason I said it may be of limited use is because my current approach is to reduce the PTB (and in future the OpenNLP/Stanford) tags to a smaller set (for example NN, NNS, NNP, etc all become NN), in order to generalize them into candidate POS sequences for matching. I haven't had great success with this approach (at least not so far), but I have a few alternatives in mind - one is to find candidate sequences for other phrase types and then do a bake-off instead of a yes/no decision I have in place currently. Another is to get more training data (thats where OpenNLP came in :-)).
I am beginning to see your point about Stanford being able to tag at a more granular level, and I guess it sort of makes sense to do now even with my current approach (as long as the performance is acceptable and the resulting accuracy cuts down manual review time drastically) - in the sense that its better to round off things as late as possible to keep errors low.
BTW dont forget to check the caseless models. I found them very useful when the nouns in the text are not properly camel cased. :-)
Ravi Kiran Bhaskar
Thanks Ravi. I am guessing these should be used as a fallback to report additional tags right?
Sujit I used caseless as primary itself, not just as fallback. I sent you sample code to your email.
Ravi Kiran Bhaskar
I see, thank you. Let me look at the code and I will ask if I have questions.
Post a Comment