Friday, November 30, 2012

A Consumer Electronics Named Entity Recognizer using NLTK


Some time back, I came across a question someone asked about possible approaches to building a Named Entity Recognizer (NER) for the Consumer Electronics (CE) industry on LinkedIn's Natural Language Processing People group. I had just finished reading the NLTK Book and had some ideas, but I wanted to test my understanding, so I decided to build one. This post describes this effort.

The approach is quite portable and not tied to NLTK or Python. You could, for example, build a Java/Scala based NER along the same lines using components from OpenNLP and Weka. But NLTK provides all the components you need in a single package, and I wanted to get familiar with it, so I ended up using NLTK and Python.

The idea is that you take some Consumer Electronics text, mark the chunks (words/phrases) you think should be Named Entities, then train a (binary) classifier on it. Each word in the training set, along with some features such as its Part of Speech (POS), shape, etc., is a training input to the classifier. If the word is part of a CE Named Entity (NE) chunk, its class is True; otherwise it is False. You then use this classifier to predict the class (CE NE or not) of words in previously unseen text from the Consumer Electronics domain.

Tagging


For training text, I copy-pasted bodies of text from CNET Reviews, across a variety of CE subdomains such as Cell Phones, Cameras, Laptops, TVs, etc. My first attempt at tagging the text was to do it manually, which in retrospect was mostly a waste of time (I ended up discarding it since I changed my mind several times about what constitutes a CE NE during the course of the tagging). It wasn't a complete waste because I did gain some insights about how a CE NE "looked", and I used that insight to write code that bootstrapped the tags for me. I guess this will make most NLP practitioners cringe a bit, but the NER was just a learning exercise for me, and I just couldn't face the prospect of having to spend another week re-tagging the corpus manually the "right way".

Here's the code that extracts candidate CE NEs from the corpus. It looks for contiguous runs of two or more tokens that are title cased or contain digits, and writes these candidates to STDOUT, where I use Unix tools to create a sorted unique set. The code also excludes common capitalized sentence openers and month names (the stopwords set) so as to retrieve good CE NE chunks.

#!/usr/bin/python
# Source: src/cener/bootstrap.py

from nltk.tokenize import sent_tokenize
from nltk.tokenize import word_tokenize
import re

stopwords = set(["The", "This", "Though", "While", 
  "Using", "It", "Its", "A", "An", "As", "Now",
  "At", "But", "Although", "Am", "Perhaps",
  "January", "February", "March", "April", "May", "June",
  "July", "August", "September", "October", "November", "December"])

def iotag(token):
  # remove stopwords
  if token in stopwords:
    return False
  if (re.match("^[A-Z].*", token) or
      re.match("^[a-z][A-Z].*", token) or
      re.search("[0-9]", token) or
      token == ",s"):
    return True
  else:
    return False

# If the current tag is True and at least one neighbor (previous or
# next) is also True, keep it; otherwise flip it to False. This drops
# isolated single-token candidates.
def modify_tags(pairs):
  output_tags = []
  last = len(pairs) - 1
  for idx, pair in enumerate(pairs):
    if pair[1]:
      # a missing neighbor (start or end of sentence) counts as False
      prev_tag = pairs[idx-1][1] if idx > 0 else False
      next_tag = pairs[idx+1][1] if idx < last else False
      output_tags.append((pair[0], prev_tag or next_tag))
    else:
      output_tags.append(pair)
  return output_tags

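def partition_pairs(pairs):
  # split the tagged tokens into runs of consecutive True tags
  output_pairs_list = []
  output_pairs = []
  for pair in pairs:
    if pair[1]:
      output_pairs.append(pair)
    else:
      if len(output_pairs) > 0:
        output_pairs_list.append(output_pairs)
        output_pairs = []
  # flush a run that extends to the end of the sentence
  if len(output_pairs) > 0:
    output_pairs_list.append(output_pairs)
  return output_pairs_list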

def main():
  ce_words = set()
  input = open("cnet_reviews.txt", 'rb')
  for line in input:
    line = line[:-1]
    if len(line.strip()) == 0:
      continue
    sents = sent_tokenize(line)
    for sent in sents:
      tokens = word_tokenize(sent)
      iotags = map(lambda token: iotag(token), tokens)
      ce_pairs_list = partition_pairs(modify_tags(zip(tokens, iotags)))
      if len(ce_pairs_list) == 0:
        continue
      for ce_pairs in ce_pairs_list:
        print " ".join(map(lambda pair: pair[0], ce_pairs))
        for ce_pair in ce_pairs:
          ce_words.add(ce_pair[0])
  input.close()

if __name__ == "__main__":
  main()

The code is run from the command line like so:

sujit@cyclone:cener$ ./bootstrap.py | sort | uniq > ce_phrases.txt

This resulted in 625 candidate CE NE phrases. I then manually inspected the file and removed phrases that were obviously wrong, such as 0.34 Mbps, New York Times, UC Davis, PBS Kids, etc., and ended up with 570 CE NE phrases, which I used to train my classifier.

Training


For training, the input text was split into sentences, and the part of speech (POS) tags for each word in the sentence were generated using a Trigram/Bigram/Unigram backoff POS tagger trained on the Penn Treebank Corpus. Then the chunks containing the CE NE phrases (identified during the tagging phase) were found by shingling the sentences against the CE NE phrases, and annotated with a variation of IOB tags. Our variation has only two values: True for words that are within a CE NE chunk and False otherwise.
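As a rough illustration of the shingling step, the sketch below slides each phrase (longest first) over a tokenized sentence and marks the matching positions as True. It is written for Python 3 and uses a plain whitespace split instead of NLTK's tokenizer; the function name io_tag_sentence and the sample phrases are made up for this example.

```python
def io_tag_sentence(words, phrases):
    """Return one True/False tag per word: True if the word falls
    inside a match of any phrase, False otherwise."""
    io_tags = [False] * len(words)
    # try longer phrases first so "Samsung Galaxy S3" wins over "Galaxy S3"
    for phrase in sorted(phrases, key=len, reverse=True):
        n = len(phrase)
        start = 0
        while start + n <= len(words):
            window_free = not any(io_tags[start:start + n])
            if window_free and words[start:start + n] == phrase:
                for j in range(start, start + n):
                    io_tags[j] = True
                start += n
            else:
                start += 1
    return io_tags

# hypothetical phrases and sentence, tokenized by whitespace for brevity
phrases = [["Galaxy", "S3"], ["Samsung", "Galaxy", "S3"], ["Nexus", "4"]]
words = "The Samsung Galaxy S3 is larger than the Nexus 4 .".split()
tags = io_tag_sentence(words, phrases)
print(list(zip(words, tags)))
```

Sorting the phrases by length means a shorter phrase can never claim part of a span that a longer phrase would have matched.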

Once all the chunks are tagged, we do another pass to resolve references to CE NEs. For example, the word "S3" can be used to refer to the CE NE "Samsung Galaxy S3", so we search for all words in the input that are not currently annotated as part of a CE NE chunk but which occur in one of the CE NE phrases identified earlier, and tag them as True as well.

We then do a 90/10 split of these POS and IOB tagged sentences, and generate features for each word in the training set (the 90% portion). The features we chose are the word itself, its POS, the previous and next word and POS, and the "shape" of the word.
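A minimal sketch of such a split, in Python 3. The helper name split_featuresets and the toy data are assumptions for illustration; the optional seeded shuffle is also my addition, since the split as described here is sequential.

```python
import random

def split_featuresets(featuresets, train_fraction=0.9, shuffle=False, seed=42):
    """Split a list of (features, label) pairs into train and evaluation sets."""
    items = list(featuresets)
    if shuffle:
        # seeded so that the split is reproducible between runs
        random.Random(seed).shuffle(items)
    split = int(train_fraction * len(items))
    return items[:split], items[split:]

# toy (features, label) pairs standing in for the real featuresets
toy = [({"word": "w%d" % i}, i % 5 == 0) for i in range(20)]
train_set, test_set = split_featuresets(toy)
print(len(train_set), len(test_set))
```

Without shuffling, the evaluation set is simply the tail of the corpus, so it helps if the last sentences are representative of the rest.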

Classifiers were trained using the feature set described above and their accuracy was measured against the evaluation set (the 10% portion). The numbers are quite impressive:

Classifier    | Features                                           | Accuracy
--------------|----------------------------------------------------|---------------
Naive Bayes   | Word, POS                                          | 0.93908045977
Naive Bayes   | Word, POS, Word-1, POS-1, Word+1, POS+1            | 0.95763546798
Naive Bayes   | Word, POS, Word-1, POS-1, Word+1, POS+1, Shape     | 0.945812807882
Decision Tree | Word, POS, Word-1, POS-1, Word+1, POS+1, Shape     | 0.983251231527
Maxent        | Word, POS, Word-1, POS-1, Word+1, POS+1, Shape     | 0.98013136289

As you can see, the best accuracy is from the Decision Tree classifier, but its results against the test set were not as good, possibly due to overfitting. The Maxent classifier had the next best accuracy, so I used that to classify my final test set. In terms of training time, the Naive Bayes classifier trained the quickest, followed by the Decision Tree classifier and then (by a very large margin) the Maxent classifier. The different classifiers can be built by uncommenting the relevant one in cener.py (see below) and running the following command. The command trains the classifier, evaluates it against the evaluation set and reports the accuracy, and then serializes the classifier to disk (ce_ner_classifier.pkl).

sujit@cyclone:cener$ ./cener.py train

Classification


The final step is to use the classifier to recognize CE NEs in some new text. For this, I chose part of a recent review of the LG Spectrum 2. The command to run it is:

sujit@cyclone:cener$ ./cener.py test

Note that I was going for convenience by hardcoding the filenames inside the script. If you want to make it more general, it should be fairly easy to do. The output of the NER using the Maxent classifier is shown below.


The wait for a decent LG phone on Verizon is finally over with the Spectrum 2 .

Not only does it run the new ( ish ) Android 4.0 Ice Cream Sandwich operating system , it also has a screen that does n't require two hands and a stylus .

In addition , it 's priced right at the $ 100 mark , making it one of the more affordable Big Red handsets .

With its noticeably sectioned back plate and defined edges , the LG Spectrum 2 's design looks more thought-out and deliberate than is usual for LG 's latest run of devices , save for the high-end Nexus 4 and Optimus G .

It measures 5.31 inches tall and 2.69 inches wide .

At 0.36 inch thick and 5.16 ounces , it 's thicker and a bit heavier than most LG handsets I 've run into , and it 's a tight fit in a small jeans pocket , but it 's comfortable when held in the hand or pinned between the cheek and shoulder .

On the left there are a Micro-USB port and two separate buttons for adjusting the volume .

Up top are a 3.5mm headphone jack and a circular sleep/power button , the edges of which light up blue whenever it 's pressed .

The rear of the phone houses an 8-megapixel camera with an LED flash .

Though plastic , the black plate is coated with a textured , rubberlike material that feels almost like leather .

The cover has two small slits at the bottom for the audio speaker .

Removing the plate gives access to the 2,150mAh battery , a microSD card slot , and Verizon 's 4G LTE SIM card .

Directly on the other side of the cover are the NFC antenna and wireless charging coil .

The 4.7-inch True HD IPS screen is bright and vivid , and texts and icons rendered crisply and clearly .

It has the same screen as the unlocked LG Optimus 4X HD , with the same 1,280x720-pixel resolution .

Overall , the display is vivid and bright , not to mention responsive to the touch .

At the time of the 4X HD review , I was very impressed with the screen .

However , having now spent time with higher-tier LG devices such as the Nexus 4 and the Optimus G , I noticed that upon closer inspection , the Spectrum 2 's display is n't as crystal-clear as the two others .

Default wallpapers looked a tad noisy , and gradient patterns appeared streaky , but only by a small margin .

Above the screen is a 1.3-megapixel camera and below are four hot keys ( back , home , recent apps , and menu ) that illuminate in blue when in use .


The code for the NER is divided into two files: cener_lib.py contains functions that call various NLTK APIs, and cener.py is the user-level code to train and test the NER. I originally thought it would be a good idea to separate the two, but looking back, it appears to be a bit pointless. Anyway, here is the code for cener_lib.py.

# Source: src/cener/cener_lib.py
import nltk
from nltk.corpus import treebank
from nltk.tokenize import word_tokenize
import re

def train_pos_tagger():
  """
  Trains a POS tagger with sentences from Penn Treebank
  and returns it.
  """
  train_sents = treebank.tagged_sents(simplify_tags=True)
  tagger = nltk.TrigramTagger(train_sents, backoff=
    nltk.BigramTagger(train_sents, backoff=
    nltk.UnigramTagger(train_sents, backoff=
    nltk.DefaultTagger("NN"))))
  return tagger

def ce_phrases():
  """
  Returns a list of phrases found using bootstrap.py ordered
  by number of words descending (so code traversing the list
  will encounter the longest phrases first).
  """
  def by_phrase_len(x, y):
    lx = len(word_tokenize(x))
    ly = len(word_tokenize(y))
    if lx == ly:
      return 0
    elif lx < ly:
      return 1
    else:
      return -1
  ceps = []
  phrasefile = open("ce_phrases.txt", 'rb')
  for cep in phrasefile:
    ceps.append(cep[:-1])
  phrasefile.close()
  return map(lambda phrase: word_tokenize(phrase),
    sorted(ceps, cmp=by_phrase_len))

def ce_phrase_words(ce_phrases):
  """
  Returns a set of words in the ce_phrase list. This is
  used to tag words that refer to the NE but does not
  have a consistent pattern to match against.
  """
  ce_words = set()
  for ce_phrase_tokens in ce_phrases:
    for ce_word in ce_phrase_tokens:
      ce_words.add(ce_word)
  return ce_words

def slice_matches(a1, a2):
  """
  Returns True if the two arrays are content wise identical,
  False otherwise.
  """
  if len(a1) != len(a2):
    return False
  else:
    for i in range(0, len(a1)):
      if a1[i] != a2[i]:
        return False
    return True
  
def slots_available(matched_slots, start, end):
  """
  Returns True if all the slots in the matched_slots array slice
  [start:end] are False, ie, available, else returns False.
  """
  return len(filter(lambda slot: slot, matched_slots[start:end])) == 0

def promote_coreferences(tuple, ce_words):
  """
  Sets the io_tag to True if it is not set and if the word is
  in the set ce_words. Returns the updated tuple (word, pos, iotag)
  """
  return (tuple[0], tuple[1],
    True if tuple[2] == False and tuple[0] in ce_words else tuple[2])

def tag(sentence, pos_tagger, ce_phrases, ce_words):
  """
  Tokenizes the input sentence into words, computes the part of
  speech and the IO tag (for whether this word is "in" a CE named
  entity or not), and returns a list of (word, pos_tag, io_tag)
  tuples.
  """
  tokens = word_tokenize(sentence)
  # add POS tags using our trained POS Tagger
  pos_tagged = pos_tagger.tag(tokens)
  # add the IO(not B) tags from the phrases we discovered
  # during bootstrap.
  words = [w for (w, p) in pos_tagged]
  pos_tags = [p for (w, p) in pos_tagged]
  io_tags = map(lambda word: False, words)
  for ce_phrase in ce_phrases:
    start = 0
    while start < len(words):
      end = start + len(ce_phrase)
      if slots_available(io_tags, start, end) and \
          slice_matches(words[start:end], ce_phrase):
        for j in range(start, end):
          io_tags[j] = True
        # words[start:end] are now taken; resume at the first free slot
        start = end
      else:
        start = start + 1
  # zip the three lists together
  pos_io_tagged = map(lambda ((word, pos_tag), io_tag):
    (word, pos_tag, io_tag), zip(zip(words, pos_tags), io_tags))
  # "coreference" handling. If a single word is found which is
  # contained in the set of words created by our phrases, set
  # the IO(not B) tag to True if it is False
  return map(lambda tuple: promote_coreferences(tuple, ce_words),
    pos_io_tagged)

shape_A = re.compile("[A-Zbdfhklt0-9#$&/@|]")
shape_x = re.compile("[acemnorsuvwxz]")
shape_i = re.compile("[i]")
shape_g = re.compile("[gpqy]")
shape_j = re.compile("[j]")

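def shape(word):
  """
  Maps each character of the word to a coarse class: "A" for
  capitals, digits, some symbols and the tall lowercase letters
  (b, d, f, h, k, l, t), "x" for short lowercase letters, "i" for
  i, "g" for g, p, q and y, and "j" for j and anything else.
  For example, "Galaxy" becomes "AxAxxg".
  """
  wbuf = []
  for c in word:
    wbuf.append("A" if shape_A.match(c) is not None
      else "x" if shape_x.match(c) is not None
      else "i" if shape_i.match(c) is not None
      else "g" if shape_g.match(c) is not None
      else "j")
  return "".join(wbuf)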

def word_features(tagged_sent, wordpos):
  return {
    "word": tagged_sent[wordpos][0],
    "pos": tagged_sent[wordpos][1],
    "prevword": "<START>" if wordpos == 0 else tagged_sent[wordpos-1][0],
    "prevpos": "<START>" if wordpos == 0 else tagged_sent[wordpos-1][1],
    "nextword": "<END>" if wordpos == len(tagged_sent)-1
                        else tagged_sent[wordpos+1][0],
    "nextpos": "<END>" if wordpos == len(tagged_sent)-1
                       else tagged_sent[wordpos+1][1],
    "shape": shape(tagged_sent[wordpos][0])
  }
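
To make the shape feature concrete, here is a small standalone sketch (Python 3, no NLTK needed) that reuses the same character classes and prints the shape of a few sample words. The SHAPE_CLASSES table is just a restructuring for this example and is not part of cener_lib.py.

```python
import re

# same character classes as the shape_* patterns in cener_lib.py
SHAPE_CLASSES = [
    ("A", re.compile(r"[A-Zbdfhklt0-9#$&/@|]")),
    ("x", re.compile(r"[acemnorsuvwxz]")),
    ("i", re.compile(r"[i]")),
    ("g", re.compile(r"[gpqy]")),
]

def shape(word):
    out = []
    for c in word:
        for label, pattern in SHAPE_CLASSES:
            if pattern.match(c):
                out.append(label)
                break
        else:
            # "j" itself, plus any character not covered above
            out.append("j")
    return "".join(out)

for w in ["Galaxy", "S3", "iPhone", "megapixel"]:
    print(w, shape(w))
# Galaxy AxAxxg
# S3 AA
# iPhone iAAxxx
# megapixel xxgxgixxA
```

Model names such as "S3" collapse to short runs of "A", which is the kind of regularity the classifier can pick up on.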

And the code for cener.py to train and test the classifier...

#!/usr/bin/python
# Source: src/cener/cener.py

import sys
import random  # only needed if the shuffle in train_ner is uncommented
import cPickle as pickle
import nltk
from cener_lib import *
from nltk.tokenize import sent_tokenize, word_tokenize

def train_ner(pickle_file):
  # initialize
  pos_tagger = train_pos_tagger()
  ceps = ce_phrases()
  cep_words = ce_phrase_words(ceps)
  # train classifier
  sentfile = open("cnet_reviews_sents.txt", 'rb')
  featuresets = []
  for sent in sentfile:
    tagged_sent = tag(sent, pos_tagger, ceps, cep_words)
    for idx, (word, pos_tag, io_tag) in enumerate(tagged_sent):
      featuresets.append((word_features(tagged_sent, idx), io_tag))
  sentfile.close()
  split = int(0.9 * len(featuresets))
#  random.shuffle(featuresets)
  train_set, test_set = featuresets[0:split], featuresets[split:]
#  classifier = nltk.NaiveBayesClassifier.train(train_set)
#  classifier = nltk.DecisionTreeClassifier.train(train_set)
  classifier = nltk.MaxentClassifier.train(train_set, algorithm="GIS", trace=0)
  # evaluate classifier
  print "accuracy=", nltk.classify.accuracy(classifier, test_set)
  if pickle_file != None:
    # pickle classifier
    pickled_classifier = open(pickle_file, 'wb')
    pickle.dump(classifier, pickled_classifier)
    pickled_classifier.close()
  return classifier

def get_trained_ner(pickle_file):
  pickled_classifier = open(pickle_file, 'rb')
  classifier = pickle.load(pickled_classifier)
  pickled_classifier.close()
  return classifier

def test_ner(input_file, classifier):
  pos_tagger = train_pos_tagger()
  input = open(input_file, 'rb')
  for line in input:
    line = line[:-1]
    if len(line.strip()) == 0:
      continue
    for sent in sent_tokenize(line):
      tokens = word_tokenize(sent)
      pos_tagged = pos_tagger.tag(tokens)
      io_tags = []
      for idx, (word, pos) in enumerate(pos_tagged):
        io_tags.append(classifier.classify(word_features(pos_tagged, idx)))
      ner_sent = zip(tokens, io_tags)
      print_sent = []
      for token, io_tag in ner_sent:
        if io_tag == True:
          print_sent.append("<u>" + token + "</u>")
        else:
          print_sent.append(token)
      print " ".join(print_sent)

  input.close()
      
def main():
  if len(sys.argv) != 2:
    print "Usage ./cener.py [train|test]"
    sys.exit(-1)
  if sys.argv[1] == "train":
    classifier = train_ner("ce_ner_classifier.pkl")
  else:
    classifier = get_trained_ner("ce_ner_classifier.pkl")
    test_ner("test.txt", classifier)
  
if __name__ == "__main__":
  main()

The code (and all the data) is also available here on my GitHub page.

Ideas for Improvement


The reported accuracy numbers are quite impressive, but the actual results against the test sentences are not quite so good. More training data would probably help, as would better quality tagging.

Another idea is to not do reference resolution during tagging, but instead to postpone it to a second stage following entity recognition. That way, the references will be localized to the text under analysis, thus reducing false positives.
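
A minimal sketch of what such a second stage might look like, in Python 3. I have not tried this; the function names and the rule of only trusting multi-word entities as sources of references are assumptions for illustration.

```python
def chunks(tagged):
    """Group consecutive True-tagged words into entity chunks."""
    result, current = [], []
    for word, is_ne in tagged:
        if is_ne:
            current.append(word)
        elif current:
            result.append(current)
            current = []
    if current:
        result.append(current)
    return result

def resolve_local_references(tagged):
    """Second pass: promote untagged words that occur in a multi-word
    entity recognized elsewhere in the same document."""
    local_words = set()
    for chunk in chunks(tagged):
        if len(chunk) > 1:
            local_words.update(chunk)
    return [(w, is_ne or w in local_words) for w, is_ne in tagged]

# hypothetical first-stage classifier output for a two-sentence document
doc = [("The", False), ("Samsung", True), ("Galaxy", True), ("S3", True),
       ("is", False), ("fast", False), (".", False),
       ("The", False), ("S3", False), ("has", False), ("NFC", False),
       (".", False)]
resolved = resolve_local_references(doc)
print([w for w, is_ne in resolved if is_ne])
# ['Samsung', 'Galaxy', 'S3', 'S3']
```

Because the word set is rebuilt per document, a stray "S3" in a review of some other product would not be promoted.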


27 comments (moderated to prevent spam):

Anonymous said...

Hello! I got a really big help from your post!

I am a very beginner of the Machine Learning, and I have no idea how to apply those algorithms such as HMM, Maxent as you did in your post!

Could you tell me briefly of how to apply those classifying algorithms to this?

Thanks a lot.

Sujit Pal said...

Hi, glad it helped you. The basic idea is to mark up some input with what you want to predict (training set), train a model with this data, and use the model to predict with new test data (test set). Training involves deciding what elements of the training set to use to build the model; this is where domain knowledge or experience comes in. In this case we are marking up consumer electronics entities in sentences. The features used to train the model are the word/POS before/after for a specific number of words. Generally (I have not done this here), you would split up your training set further into two parts, and for each feature set you selected, train a model with the first split and test with the second split, varying the split a fixed number of times and evaluating your model's accuracy - this process is called cross-validation. Ultimately, once you have decided the "best" feature set to use, you build the model and run the test sentences against it to predict the consumer electronics entities in the test set. The models themselves are part of NLTK (I also use scikit-learn, there is some overlap, but scikit-learn is more general purpose and has a larger selection of classification algorithms), so the idea is to select the features and feed them into the provided algorithms to train the models. Sorry about the long winded reply, but the question was pretty broad, hope it helped.

Anonymous said...

Thanks so much.... Your blog is a place where I found out that Machine Learning could be something very interesting.
:) Your answer helped me a loooot!!!!!
Thank you again :)

Unknown said...

Hi,
Thank you very much for your post. I am learning NLTK.

Currently I am evaluating with the help of your code available on GitHub.

https://github.com/sujitpal/nltk-examples/tree/master/src/medorleg2.

I have executed classify.py xval. It is showing error like
IOError: [Errno 2] No such file or directory: 'data/sentences.txt'
directory: 'data/label.txt'

Could you please share the data folder. I need to know how can we form the labels and sentences to classify.

Sujit Pal said...

Hi Sumanth, you are welcome. Unfortunately, the data is proprietary and I cannot share it. However, the sentences.txt file was a file of sentences, one per line, from a medical and legal corpus respectively, and labels.txt was the corresponding (0/1) label of where the sentence came from. So for example:

# sentences.txt
The patient suffered permanent hearing loss.
The hearing was scheduled for the end of the month

# labels.txt
0
1

Hopefully the examples above help. The idea was to get the two sets from clearly different sources specific to each industry, removing the need for manual labeling of training data. You can do the same by grabbing online magazines from different areas, say fashion and technology.

Xiangtao Wang said...

Hi Salmon, Do you have any idea about How to extract company's products from its company website . Currently I have this requirement. Different website have different structures, it is very difficult . Do you think it is possible ?

Sujit Pal said...

Hi Xiangtao, I think you are right about the difficulty :-). I haven't done this myself, but there was a group at CNET which did just this, although I don't know any specifics of their job. The traditional way would be to understand each website's structure and crawl and parse out the information you need - over time you will end up with reusable components from older parsers. It may also be possible to get feeds of product catalogs from some companies. Another way may be to use a strategy like Boilerpipe to train a parser to recognize names, prices and specifications in website text.

Xiangtao Wang said...

Thanks Sujit. I need to figure out the structure of the page. Usually the page have more than one products with regular position. I also have a another idea to use word2vec (https://code.google.com/p/word2vec/) .
I plan to cluster(k-means) the html element instead of only word and find their similarity. Hopefully I can get some
regular clusters. I will try it ,thanks

Sujit Pal said...

You are welcome and good luck.

sam said...

I am getting the following error in the file cener_lib.py

#zip the three lists together
pos_io_tagged = map(lambda ((word, pos_tag), io_tag):

word, pos_tag, io_tag are underlined in my IDE and the error says sublist parameters are not supported in 3.x

I would be glad if you replied asap. Meanwhile I am trying to get through this error.

Sujit Pal said...

Hi Sam, haven't used Python 3.x yet, but here is what this piece of code does. Your input is the 3 lists words, pos_tags and io_tags. The zip(zip(words, pos_tags), io_tags) results in a list of the form list((word, pos_tag), io_tag). The map flattens this to a form list(word, pos_tag, io_tag). In this case word, pos_tag and io_tag are temporary variables only valid within the lambda, which is why they show up as undefined in your IDE.
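
For Python 3 readers: tuple parameters in function signatures were removed in Python 3 (PEP 3113), which is what the "sublist parameters are not supported" error refers to. A minimal sketch of an equivalent that runs on Python 3, with made-up sample values:

```python
words = ["LG", "Spectrum", "2"]
pos_tags = ["NNP", "NNP", "CD"]   # sample POS tags, for illustration only
io_tags = [True, True, True]

# Python 2 only: lambda ((word, pos_tag), io_tag): (word, pos_tag, io_tag)
# Python 3: unpack inside a comprehension instead of in the signature
nested = zip(zip(words, pos_tags), io_tags)
pos_io_tagged = [(word, pos_tag, io_tag)
                 for (word, pos_tag), io_tag in nested]

# equivalent and shorter: zip the three lists directly
assert pos_io_tagged == list(zip(words, pos_tags, io_tags))
print(pos_io_tagged)
# [('LG', 'NNP', True), ('Spectrum', 'NNP', True), ('2', 'CD', True)]
```

Note that map and zip return iterators rather than lists on Python 3, so other parts of the code that index into or take len() of their results would need a list(...) wrapper as well.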

sam said...

I have already tokenized, POS tagged and chunked my data according to my own methods. I just want to your code for NER tagging and nothing else. What if I remove your POSTagger portion, will the code still work and get trained for NER properly ?
Removing this:
def train_pos_tagger():
and the tagging portion.

Also can u provide me with the data that you used for training ?? Or show me the format so I can prepare my data according to that ??

Thanks for the help.

Sujit Pal said...

Hi Sam, the code and data are available on github here. I think it should be fine to remove calls to my POS tagger since you already have this info from somewhere else.

sam said...

Thanks for your immediate replies Sujit. I have managed downloading all your data from the link you provided in your last answer and the code perfectly runs on your data sets ( yay !! ). Now I have to tune it for my data sets. For which I have to request you to please tell me how you have made your training data sets so that I can follow the same pattern and prepare my training and testing data sets on the very same format.

Thanking in anticipation.

Sujit Pal said...

Hi Sam, already answered this by mail on LI, putting it here for others in case they have similar questions.

For training data, I look in the text for certain patterns, specifically runs of title cased words, words with numbers and so on. The bootstrap.py does this. Output of this code is a set of candidate entities one per line.

I then convert my input to a set of sentences, one per line. The train_ner tokenizes each sentence into words, POS tags it, and uses the candidate entities to mark it with an IOB tag. This step now models the sentence as a list of (word, POS, IOB). This is done in the tag() function in cener_lib.py.

Now using this tagged sentence, we extract the following features per triple: the word, the POS, previous word, previous POS, next word, next POS, word "shape" - this is done in word_features() function in cener_lib.py.

So basically I build up the training data in code from the two inputs - the sentences and the candidate entities. The end result is input features which are the word, POS, prev and next word, prev and next POS and shape, and the target variable is the IOB tag. You train a classifier to predict the IOB tag given the features.

John said...

Hey Sujit!

Thanks for this great post. Can you please explain how you would calculate precision and recall?

Thanks

j

Sujit Pal said...

Hi John, you are welcome and sorry about the delay in responding. For precision and recall, you would need a gold set to measure against. You could then compare the expected outputs (from the gold set) and the actual outputs (from running the classifier model) and build a confusion matrix. The confusion matrix looks like [[TP, FP], [FN, TN]], and precision and recall can be expressed as combinations of these values, ie Precision = TP / (TP + FP) and recall = TP / (TP + FN).
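
As a small illustration of the formulas above, the following sketch (Python 3) counts TP, FP and FN from two parallel lists of True/False tags. The function name and the sample gold and predicted lists are invented for the example.

```python
def precision_recall(gold, predicted):
    """Compute precision and recall for the True class from two
    equal-length lists of boolean tags."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)
    fp = sum(1 for g, p in pairs if not g and p)
    fn = sum(1 for g, p in pairs if g and not p)
    # guard against division by zero when nothing is predicted or expected
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

gold      = [True, True, False, False, True, False]
predicted = [True, False, False, True, True, False]
precision, recall = precision_recall(gold, predicted)
print(precision, recall)
```

Here TP = 2, FP = 1 and FN = 1, so both precision and recall come out to 2/3. For a tagger like this one, where most words are False, these numbers are usually more informative than plain accuracy.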

Anonymous said...

Hi Sujit - this is interesting work you've done. I am with a company where we have information on over 1 billion products across thousands of categories. If this dataset were available to you, do you think it would be useful in creating a list of key words associated with each brand and then using that to extract/infer mention of those brand's products within review or other text?

Sujit Pal said...

Thank you for the appreciation Sri Velamoor. With the kind of data you have, you could probably start with POS tagging or simple frequency counting to identify entity names (as frequent noun phrases) in your content, then use something like this to store these entities and stream your review text against it. If your information is pre-segregated by brand or product category, you might be able to leverage that information as well.

Anonymous said...

Hey SUjit,

Would your code work in python3.x environments. Do you have any issues with syntax?

Sujit Pal said...

Just started using Python3 for a MOOC recently, really can't tell if the current code will work or not with it. If you are using Python 3 for other stuff, maybe try running it and let us know if it works there or not?

Anonymous said...

Hi Sujit Sir,

Can you please tell how does one tag a corpus manually ? So for example I were to manually label CE in the sentence "I would love to have an apple iphone 6s" then after the labelling would it look like this ===> "I would love to have an apple iphone 6s"

Sujit Pal said...

I think maybe your (XML like) tags may have gotten eaten up by the commenting software, but the tagging should be the IOB style tags, something like this -- I/O would/O love/O to/O have/O an/O apple/B-CE iphone/I-CE 6s/I-CE ./O. I built the tags using the code for bootstrap.py in the post, it's basically a bunch of rules about the names that I observed by looking through the text, you might have to make some modifications depending on your content.
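
A minimal sketch (Python 3) of how the True/False tags used in the post could be turned into IOB tags of this form. The function name io_to_iob and the CE label are assumptions for illustration.

```python
def io_to_iob(tagged, label="CE"):
    """Convert (word, True/False) pairs to (word, IOB tag) pairs."""
    iob = []
    prev_inside = False
    for word, inside in tagged:
        if not inside:
            tag = "O"
        elif prev_inside:
            tag = "I-" + label   # continues the current entity
        else:
            tag = "B-" + label   # first word of a new entity
        iob.append((word, tag))
        prev_inside = inside
    return iob

tagged = [("I", False), ("would", False), ("love", False), ("to", False),
          ("have", False), ("an", False), ("apple", True),
          ("iphone", True), ("6s", True), (".", False)]
line = " ".join("%s/%s" % pair for pair in io_to_iob(tagged))
print(line)
# I/O would/O love/O to/O have/O an/O apple/B-CE iphone/I-CE 6s/I-CE ./O
```

One limitation: with only True/False input, two entities that sit directly next to each other are merged into a single B/I run.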

Anonymous said...

Hi, I ran your code over my own corpus. But what do I do if I need to recognise two different entities from the text?

Unknown said...

Hello,

when i run the 'python bootstrap.py | sort> ce_phrases.txt' file was create but null file create and when i debugging the file i got this line
(iotags = map(lambda token: iotag(token), tokens))
null every time

Sujit Pal said...

I think you might need the input file. In retrospect I think I should have shared it, since that is the one thing that would allow you to run the code for yourself without investing in time to annotate your own dataset. But unfortunately I don't have this data anymore, and the idea when I wrote the post was to explain the methodology more than anything else.

Sujit Pal said...

@Anonymous (for comment dated 6/07/2018): sorry about the delay in responding, and am guessing you mean two entity types. In that case you could mark them up differently, for entity_type_1 and entity_type_2, you would mark it up using BIO tagging as B-entity_type_1, I-entity_type_1, B-entity_type_2, I-entity_type_2, and O. You probably want something better than my code to train with though -- take a look at my fork of the nerds project for some interesting possibilities. My fork is currently ahead of the master, but I am working with the developers of the parent project to fold in my changes.