Some time back, on LinkedIn's Natural Language Processing People group, I came across a question about possible approaches to building a Named Entity Recognizer (NER) for the Consumer Electronics (CE) domain. I had just finished reading the NLTK Book and had some ideas, but I wanted to test my understanding, so I decided to build one. This post describes that effort.
The approach is actually quite portable and not tied to NLTK or Python; you could, for example, build a Java/Scala based NER using components from OpenNLP and Weka with the same approach. But NLTK provides all the components you need in a single package, and I wanted to get familiar with it, so I ended up using NLTK and Python.
The idea is that you take some Consumer Electronics text, mark the chunks (words/phrases) you think should be Named Entities, then train a (binary) classifier on it. Each word in the training set, along with features such as its part of speech (POS), shape, etc., becomes a training input to the classifier. If the word is part of a CE Named Entity (NE) chunk, its class is True, otherwise it is False. You then use this classifier to predict the class (CE NE or not) of words in previously unseen text from the Consumer Electronics domain.
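To make this concrete, here is what a single training instance might look like for the word "Galaxy" in a sentence containing the phrase "Samsung Galaxy S3". The values are illustrative (the exact POS tags depend on the tagger), but the feature names match the word_features() function shown later in this post:

# one (features, label) training pair for "Galaxy" in the sentence
# "The Samsung Galaxy S3 has a great screen ." (illustrative values)
features = {
  "word": "Galaxy", "pos": "NP",
  "prevword": "Samsung", "prevpos": "NP",
  "nextword": "S3", "nextpos": "NP",
  "shape": "AxAxxg"
}
label = True  # "Galaxy" is inside a CE NE chunk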
Tagging
For training text, I copy-pasted bodies of text from CNET Reviews across a variety of CE subdomains such as Cell Phones, Cameras, Laptops, TVs, etc. My first attempt at tagging the text was to do it manually, which in retrospect was mostly a waste of time: I ended up discarding the work because I changed my mind several times during tagging about what constitutes a CE NE. It wasn't a complete waste, though, because I did gain some insight into how a CE NE "looks", and I used that insight to write code that bootstrapped the tags for me. I guess this will make most NLP practitioners cringe a bit, but the NER was just a learning exercise for me, and I couldn't face the prospect of spending another week re-tagging the corpus manually the "right way".
Here's the code that extracts candidate CE NEs from the corpus. It looks for contiguous runs of title-cased words and numbers and writes the CE NEs to STDOUT, where I use Unix tools to produce a sorted, unique set. The code also sets up certain exclusions (common title-cased words and month names) so as to retrieve good CE NE chunks.
#!/usr/bin/python
# Source: src/cener/bootstrap.py
from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
import re

# title-cased words that should never start or continue a CE NE chunk
stopwords = set(["The", "This", "Though", "While",
  "Using", "It", "Its", "A", "An", "As", "Now",
  "At", "But", "Although", "Am", "Perhaps",
  "January", "February", "March", "April", "May", "June",
  "July", "August", "September", "October", "November", "December"])

def iotag(token):
  # remove stopwords
  if token in stopwords:
    return False
  if (re.match("^[A-Z].*", token) or
      re.match("^[a-z][A-Z].*", token) or
      re.search("[0-9]", token) or
      token == ",s"):
    return True
  else:
    return False

# if current iotag == "I" and (prev iotag == "I" or next iotag == "I")
# then keep the iotag value, else flip it
def modify_tags(pairs):
  output_tags = []
  idx = 0
  for pair in pairs:
    if pair[1]:
      if idx == 0:
        output_tags.append((pair[0], pair[1] and pairs[idx+1][1]))
      elif idx == len(pairs) - 1:
        output_tags.append((pair[0], pair[1] and pairs[idx-1][1]))
      else:
        output_tags.append((pair[0], pair[1] and
          (pairs[idx-1][1] or pairs[idx+1][1])))
    else:
      output_tags.append(pair)
    idx = idx + 1
  return output_tags

def partition_pairs(pairs):
  output_pairs_list = []
  output_pairs = []
  for pair in pairs:
    if pair[1]:
      output_pairs.append(pair)
    else:
      if len(output_pairs) > 0:
        output_pairs_list.append(output_pairs)
        output_pairs = []
  # don't drop a chunk that ends the sentence
  if len(output_pairs) > 0:
    output_pairs_list.append(output_pairs)
  return output_pairs_list

def main():
  ce_words = set()
  input = open("cnet_reviews.txt", 'rb')
  for line in input:
    line = line[:-1]
    if len(line.strip()) == 0:
      continue
    sents = sent_tokenize(line)
    for sent in sents:
      tokens = word_tokenize(sent)
      iotags = map(lambda token: iotag(token), tokens)
      ce_pairs_list = partition_pairs(modify_tags(zip(tokens, iotags)))
      if len(ce_pairs_list) == 0:
        continue
      for ce_pairs in ce_pairs_list:
        print " ".join(map(lambda pair: pair[0], ce_pairs))
        for ce_pair in ce_pairs:
          ce_words.add(ce_pair[0])
  input.close()

if __name__ == "__main__":
  main()
The code is run from the command line like so:
sujit@cyclone:cener$ ./bootstrap.py | sort | uniq > ce_phrases.txt
This resulted in 625 candidate CE NE phrases. I then manually inspected the file and removed phrases that were obviously wrong, such as 0.34 Mbps, New York Times, UC Davis, PBS Kids, etc., and ended up with 570 CE NE phrases, which I used to train my classifier.
Training
For training, the input text was split into sentences, and the part of speech (POS) tags for each word in each sentence were generated using a Trigram/Bigram/Unigram backoff POS tagger trained on the Penn Treebank corpus. Then the chunks containing the CE NE phrases (identified during the tagging phase) were found by shingling the sentences against the CE NE phrases and annotated with a variation of IOB tags: True for words within a CE NE chunk and False otherwise.
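For example, after this step a sentence fragment containing "LG Spectrum 2" might come out as a list of (word, pos_tag, io_tag) triples roughly like this (the POS tags are illustrative):

[("The", "DET", False), ("LG", "NP", True), ("Spectrum", "NP", True),
 ("2", "NUM", True), ("is", "V", False), ("here", "ADV", False),
 (".", ".", False)]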
Once all the chunks are tagged, we do another pass to resolve references to CE NEs. For example, the word "S3" can be used to refer to the CE NE "Samsung Galaxy S3", so we look for words that are not currently annotated as part of a CE NE chunk but which appear in one of the CE NE phrases identified earlier, and tag them as True.
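Continuing the example above, a later sentence mentioning just "S3" would initially get a False tag, and this promotion pass flips it (a hypothetical before/after):

# before promotion: ("S3", "NP", False)  -- no full-phrase match here
# after promotion:  ("S3", "NP", True)   -- "S3" occurs in the known
#                                           phrase "Samsung Galaxy S3"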
We then do a 90/10 split of these POS- and IO-tagged sentences and generate features for each word in the training set (the 90% split). The features we chose are the word itself, its POS, the previous and next words and POS tags, and the "shape" of the word.
Classifiers were trained using the featureset described above and their accuracy measured against the evaluation set (the 10% split). The numbers are quite impressive; here they are:
Classifier    | Features                                       | Accuracy
--------------|------------------------------------------------|---------------
Naive Bayes   | Word, POS                                      | 0.93908045977
Naive Bayes   | Word, POS, Word-1, POS-1, Word+1, POS+1        | 0.95763546798
Naive Bayes   | Word, POS, Word-1, POS-1, Word+1, POS+1, Shape | 0.945812807882
Decision Tree | Word, POS, Word-1, POS-1, Word+1, POS+1, Shape | 0.983251231527
Maxent        | Word, POS, Word-1, POS-1, Word+1, POS+1, Shape | 0.98013136289
As you can see, the best accuracy is from the Decision Tree classifier, but its results against the test set were not as good, possibly due to overfitting. The Maxent classifier had the next best accuracy, so I used that to classify my final test set. In terms of training time, the Naive Bayes classifier trained the quickest, followed by the Decision Tree classifier, followed (by a very large margin) by the Maxent classifier. The different classifiers can be built by uncommenting the relevant line in cener.py (see below) and running the following command, which trains the classifier, serializes it to disk (ce_ner_classifier.pkl), then evaluates it against the evaluation set and reports the accuracy number.
sujit@cyclone:cener$ ./cener.py train
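If you would rather not edit the script for each run, a small factory function could select the classifier by name; a minimal sketch (make_classifier is my own name, not part of the original code):

# hypothetical helper: choose the classifier without editing the script
import nltk

def make_classifier(name, train_set):
  if name == "naivebayes":
    return nltk.NaiveBayesClassifier.train(train_set)
  elif name == "decisiontree":
    return nltk.DecisionTreeClassifier.train(train_set)
  elif name == "maxent":
    return nltk.MaxentClassifier.train(train_set, algorithm="GIS", trace=0)
  else:
    raise ValueError("unknown classifier: " + name)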
Classification
The final step is to use the classifier to recognize CE NEs in some new text. For this, I chose part of a recent review of the LG Spectrum 2. The command to run it is:
sujit@cyclone:cener$ ./cener.py test
Note that I was going for convenience by hardcoding the filenames inside the script; if you want to make it more general, it should be fairly easy to do. The output of the NER using the Maxent classifier is shown below.
The wait for a decent LG phone on Verizon is finally over with the Spectrum 2 .
Not only does it run the new ( ish ) Android 4.0 Ice Cream Sandwich operating system , it also has a screen that does n't require two hands and a stylus .
In addition , it 's priced right at the $ 100 mark , making it one of the more affordable Big Red handsets .
With its noticeably sectioned back plate and defined edges , the LG Spectrum 2 's design looks more thought-out and deliberate than is usual for LG 's latest run of devices , save for the high-end Nexus 4 and Optimus G .
It measures 5.31 inches tall and 2.69 inches wide .
At 0.36 inch thick and 5.16 ounces , it 's thicker and a bit heavier than most LG handsets I 've run into , and it 's a tight fit in a small jeans pocket , but it 's comfortable when held in the hand or pinned between the cheek and shoulder .
On the left there are a Micro-USB port and two separate buttons for adjusting the volume .
Up top are a 3.5mm headphone jack and a circular sleep/power button , the edges of which light up blue whenever it 's pressed .
The rear of the phone houses an 8-megapixel camera with an LED flash .
Though plastic , the black plate is coated with a textured , rubberlike material that feels almost like leather .
The cover has two small slits at the bottom for the audio speaker .
Removing the plate gives access to the 2,150mAh battery , a microSD card slot , and Verizon 's 4G LTE SIM card .
Directly on the other side of the cover are the NFC antenna and wireless charging coil .
The 4.7-inch True HD IPS screen is bright and vivid , and texts and icons rendered crisply and clearly .
It has the same screen as the unlocked LG Optimus 4X HD , with the same 1,280x720-pixel resolution .
Overall , the display is vivid and bright , not to mention responsive to the touch .
At the time of the 4X HD review , I was very impressed with the screen .
However , having now spent time with higher-tier LG devices such as the Nexus 4 and the Optimus G , I noticed that upon closer inspection , the Spectrum 2 's display is n't as crystal-clear as the two others .
Default wallpapers looked a tad noisy , and gradient patterns appeared streaky , but only by a small margin .
Above the screen is a 1.3-megapixel camera and below are four hot keys ( back , home , recent apps , and menu ) that illuminate in blue when in use .
The code for the NER is divided into two files: cener_lib.py contains functions that call various NLTK APIs, and cener.py is the user-level code to train and test the NER. I originally thought it would be a good idea to separate the two, but looking back, it seems a bit pointless. Anyway, here is the code for cener_lib.py.
# Source: src/cener/cener_lib.py
import nltk
from nltk.corpus import treebank
from nltk.tokenize import word_tokenize
import re

def train_pos_tagger():
  """
  Trains a POS tagger with sentences from Penn Treebank
  and returns it.
  """
  train_sents = treebank.tagged_sents(simplify_tags=True)
  tagger = nltk.TrigramTagger(train_sents, backoff=
    nltk.BigramTagger(train_sents, backoff=
    nltk.UnigramTagger(train_sents, backoff=
    nltk.DefaultTagger("NN"))))
  return tagger

def ce_phrases():
  """
  Returns a list of phrases found using bootstrap.py ordered
  by number of words descending (so code traversing the list
  will encounter the longest phrases first).
  """
  def by_phrase_len(x, y):
    lx = len(word_tokenize(x))
    ly = len(word_tokenize(y))
    if lx == ly:
      return 0
    elif lx < ly:
      return 1
    else:
      return -1
  ceps = []
  phrasefile = open("ce_phrases.txt", 'rb')
  for cep in phrasefile:
    ceps.append(cep[:-1])
  phrasefile.close()
  return map(lambda phrase: word_tokenize(phrase),
    sorted(ceps, cmp=by_phrase_len))

def ce_phrase_words(ce_phrases):
  """
  Returns the set of words in the ce_phrases list. This is
  used to tag words that refer to a NE but do not
  have a consistent pattern to match against.
  """
  ce_words = set()
  for ce_phrase_tokens in ce_phrases:
    for ce_word in ce_phrase_tokens:
      ce_words.add(ce_word)
  return ce_words

def slice_matches(a1, a2):
  """
  Returns True if the two arrays are content-wise identical,
  False otherwise.
  """
  if len(a1) != len(a2):
    return False
  else:
    for i in range(0, len(a1)):
      if a1[i] != a2[i]:
        return False
    return True

def slots_available(matched_slots, start, end):
  """
  Returns True if all the slots in the matched_slots array slice
  [start:end] are False, ie available, else returns False.
  """
  return len(filter(lambda slot: slot, matched_slots[start:end])) == 0

def promote_coreferences(tuple, ce_words):
  """
  Sets the io_tag to True if it is not set and if the word is
  in the set ce_words. Returns the updated tuple (word, pos, iotag).
  """
  return (tuple[0], tuple[1],
    True if tuple[2] == False and tuple[0] in ce_words else tuple[2])

def tag(sentence, pos_tagger, ce_phrases, ce_words):
  """
  Tokenizes the input sentence into words, computes the part of
  speech and the IO tag (for whether this word is "in" a CE named
  entity or not), and returns a list of (word, pos_tag, io_tag)
  tuples.
  """
  tokens = word_tokenize(sentence)
  # add POS tags using our trained POS Tagger
  pos_tagged = pos_tagger.tag(tokens)
  # add the IO(not B) tags from the phrases we discovered
  # during bootstrap.
  words = [w for (w, p) in pos_tagged]
  pos_tags = [p for (w, p) in pos_tagged]
  io_tags = map(lambda word: False, words)
  for ce_phrase in ce_phrases:
    start = 0
    while start < len(words):
      end = start + len(ce_phrase)
      if slots_available(io_tags, start, end) and \
          slice_matches(words[start:end], ce_phrase):
        for j in range(start, end):
          io_tags[j] = True
        # resume the scan right after the matched phrase
        start = end
      else:
        start = start + 1
  # zip the three lists together
  pos_io_tagged = map(lambda ((word, pos_tag), io_tag):
    (word, pos_tag, io_tag), zip(zip(words, pos_tags), io_tags))
  # "coreference" handling. If a single word is found which is
  # contained in the set of words created by our phrases, set
  # the IO(not B) tag to True if it is False
  return map(lambda tuple: promote_coreferences(tuple, ce_words),
    pos_io_tagged)

# character classes for the word "shape" feature, grouped by
# the visual silhouette of each glyph (tall, x-height, dotted,
# descender)
shape_A = re.compile("[A-Zbdfhklt0-9#$&/@|]")
shape_x = re.compile("[acemnorsuvwxz]")
shape_i = re.compile("[i]")
shape_g = re.compile("[gpqy]")
shape_j = re.compile("[j]")

def shape(word):
  wbuf = []
  for c in word:
    wbuf.append("A" if re.match(shape_A, c) != None
      else "x" if re.match(shape_x, c) != None
      else "i" if re.match(shape_i, c) != None
      else "g" if re.match(shape_g, c) != None
      else "j")
  return "".join(wbuf)

def word_features(tagged_sent, wordpos):
  return {
    "word": tagged_sent[wordpos][0],
    "pos": tagged_sent[wordpos][1],
    "prevword": "<START>" if wordpos == 0 else tagged_sent[wordpos-1][0],
    "prevpos": "<START>" if wordpos == 0 else tagged_sent[wordpos-1][1],
    "nextword": "<END>" if wordpos == len(tagged_sent)-1
      else tagged_sent[wordpos+1][0],
    "nextpos": "<END>" if wordpos == len(tagged_sent)-1
      else tagged_sent[wordpos+1][1],
    "shape": shape(tagged_sent[wordpos][0])
  }
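The shape feature maps each character to a class based on its visual silhouette, so words that "look" alike share a shape string even if their spellings differ. For example (outputs computed from the shape() function above):

>>> shape("iPhone")
'iAAxxx'
>>> shape("4G")
'AA'
>>> shape("Galaxy")
'AxAxxg'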
And the code for cener.py to train and test the classifier...
#!/usr/bin/python
# Source: src/cener/cener.py
import sys
import nltk
import cPickle as pickle
from cener_lib import *
from nltk.tokenize import sent_tokenize, word_tokenize

def train_ner(pickle_file):
  # initialize
  pos_tagger = train_pos_tagger()
  ceps = ce_phrases()
  cep_words = ce_phrase_words(ceps)
  # train classifier
  sentfile = open("cnet_reviews_sents.txt", 'rb')
  featuresets = []
  for sent in sentfile:
    tagged_sent = tag(sent, pos_tagger, ceps, cep_words)
    for idx, (word, pos_tag, io_tag) in enumerate(tagged_sent):
      featuresets.append((word_features(tagged_sent, idx), io_tag))
  sentfile.close()
  split = int(0.9 * len(featuresets))
  # random.shuffle(featuresets)
  train_set, test_set = featuresets[0:split], featuresets[split:]
  # uncomment the classifier you want to train
  # classifier = nltk.NaiveBayesClassifier.train(train_set)
  # classifier = nltk.DecisionTreeClassifier.train(train_set)
  classifier = nltk.MaxentClassifier.train(train_set, algorithm="GIS", trace=0)
  # evaluate classifier
  print "accuracy=", nltk.classify.accuracy(classifier, test_set)
  if pickle_file != None:
    # pickle classifier
    pickled_classifier = open(pickle_file, 'wb')
    pickle.dump(classifier, pickled_classifier)
    pickled_classifier.close()
  return classifier

def get_trained_ner(pickle_file):
  pickled_classifier = open(pickle_file, 'rb')
  classifier = pickle.load(pickled_classifier)
  pickled_classifier.close()
  return classifier

def test_ner(input_file, classifier):
  pos_tagger = train_pos_tagger()
  input = open(input_file, 'rb')
  for line in input:
    line = line[:-1]
    if len(line.strip()) == 0:
      continue
    for sent in sent_tokenize(line):
      tokens = word_tokenize(sent)
      pos_tagged = pos_tagger.tag(tokens)
      io_tags = []
      for idx, (word, pos) in enumerate(pos_tagged):
        io_tags.append(classifier.classify(word_features(pos_tagged, idx)))
      ner_sent = zip(tokens, io_tags)
      print_sent = []
      for token, io_tag in ner_sent:
        if io_tag == True:
          print_sent.append("<u>" + token + "</u>")
        else:
          print_sent.append(token)
      print " ".join(print_sent)
  input.close()

def main():
  if len(sys.argv) != 2:
    print "Usage: ./cener.py [train|test]"
    sys.exit(-1)
  if sys.argv[1] == "train":
    classifier = train_ner("ce_ner_classifier.pkl")
  else:
    classifier = get_trained_ner("ce_ner_classifier.pkl")
    test_ner("test.txt", classifier)

if __name__ == "__main__":
  main()
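For completeness, here is how one might tag a single ad-hoc sentence with the pickled classifier; a minimal sketch (tag_sentence is my own name, not part of the scripts above, and it assumes ce_ner_classifier.pkl already exists):

# sketch: classify one sentence against the pickled classifier
import cPickle as pickle
from nltk.tokenize import word_tokenize
from cener_lib import train_pos_tagger, word_features

def tag_sentence(sent, classifier, pos_tagger):
  pos_tagged = pos_tagger.tag(word_tokenize(sent))
  return [(word, classifier.classify(word_features(pos_tagged, idx)))
    for idx, (word, pos) in enumerate(pos_tagged)]

classifier = pickle.load(open("ce_ner_classifier.pkl", 'rb'))
print tag_sentence("The LG Spectrum 2 runs Android 4.0 .",
  classifier, train_pos_tagger())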
The code (and all the data) is also available here on my GitHub page.
Ideas for Improvement
The reported accuracy numbers are quite impressive, but the actual results against the test sentences are not quite as good. More training data would probably help, as would better-quality tagging.
Another idea is to skip reference resolution during tagging and instead postpone it to a second stage that runs after entity recognition, as sketched below. That way, references are resolved only against entities found in the text under analysis, reducing false positives.
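A minimal sketch of what that second stage might look like, operating on the classifier's per-sentence (word, io_tag) output for a single document (resolve_references is a hypothetical name, not part of the code above):

# sketch: promote isolated mentions using only NEs found in this document
def resolve_references(tagged_sents):
  # first pass: collect words from chunks the classifier marked True
  ne_words = set()
  for sent in tagged_sents:
    for word, io_tag in sent:
      if io_tag:
        ne_words.add(word)
  # second pass: flip words that appeared inside a local NE chunk
  return [[(word, io_tag or word in ne_words) for word, io_tag in sent]
    for sent in tagged_sents]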