Saturday, March 30, 2013

An HMM-based Gene Tagger using NLTK


In Prof. Michael Collins's Natural Language Processing class on Coursera, the first programming assignment consisted of building a Gene Named Entity Recognizer using a Hidden Markov Model (HMM). The training set consists of 13,795 sentences, with words representing gene entities marked with an I tag and all other words marked with an O tag. Also provided are a tagged validation set of 509 sentences to test the algorithm against, and an untagged test set of 510 sentences to tag.

While the Coursera Honor Code prevents students from sharing solutions (including programming assignments), this post does not qualify as a solution. First, the assignment expressly forbids the use of NLTK, since the objective is to implement the Viterbi algorithm from scratch and to experiment with different tweaks to make it better. Second, I wasn't able to implement the transition probabilities for trigram backoff properly, although the problem is very likely my imperfect understanding of the NLTK API. So you are better off doing the assignment from scratch as required.

However, I think that a lot of stuff in the NLP/ML domain is applied NLP/ML, so knowing how to solve something with an existing library/toolkit is as important as knowing (and being able to implement) the algorithm itself. I thought it would be interesting to use NLTK to build the tagger, hence this post.

Interestingly, the Masters in Language Technology program at the University of Gothenburg, Sweden, uses NLTK for its lab work, and one of those labs provided me with insights on how to use NLTK's HMM package. It's good to know that there are others with the same opinion as me :-).

An HMM is defined by three probability distributions: the prior probabilities, which describe the probabilities of seeing the different tags in the data; the transition probabilities, which describe the probability of seeing a tag conditioned on the previous tag; and the emission probabilities, which describe the probability of seeing a word conditioned on its tag. An example may make this clearer.
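
To make this concrete, here is a toy sketch of the three distributions for the two tags I and O, and of how they combine into the probability of a tagged sentence. The numbers and the example sentence are made up for illustration only, not estimated from the assignment data.

# toy prior, transition and emission distributions for tags I and O
# (made-up numbers, for illustration only)
priors = {"I": 0.1, "O": 0.9}
transitions = {("O", "O"): 0.85, ("O", "I"): 0.15,
               ("I", "O"): 0.4,  ("I", "I"): 0.6}
emissions = {("O", "the"): 0.05, ("I", "BRCA1"): 0.01, ("O", "gene"): 0.002}

# P(words, tags) = prior(t1) * emit(w1|t1) * product over i>1 of
#                  trans(ti|t(i-1)) * emit(wi|ti)
def joint_probability(words, tags):
  p = priors[tags[0]] * emissions[(tags[0], words[0])]
  for i in range(1, len(words)):
    p *= transitions[(tags[i-1], tags[i])] * emissions[(tags[i], words[i])]
  return p

print joint_probability(["the", "BRCA1", "gene"], ["O", "I", "O"])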

The assignment starts off by asking us to construct a simple HMM model from the training data, measure its performance on the validation data, and finally apply the model to predict the I/O tags for the test data. We then identify rare words (those occurring fewer than 5 times in the training data), replace them with the string "_RARE_", and measure the performance. The next step is to identify types of rare words, such as numerics, all caps, etc., replace them with "_RARE_xx_" strings, and measure the performance once again. Finally, the model is modified to work with transition probabilities conditioned on the two previous tags instead of one, i.e., the trigram model.
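
For instance, assuming the normalizeRareWord function from the listing further down and a few made-up tokens, the categorized replacement behaves like this:

# assumes normalizeRareWord from the listing below; the tokens are made up
rare = set(["25", "DNA", "proteinA", "foobarase"])
for w in ["25", "DNA", "proteinA", "foobarase", "common"]:
  print w, "->", normalizeRareWord(w, rare, True)
# 25 -> _RARE_NUMERIC_
# DNA -> _RARE_ALLCAPS_
# proteinA -> _RARE_LASTCAP_
# foobarase -> _RARE_
# common -> common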

The last part is where I had problems. My initial approach was to convert everything to bigrams, so that the priors were the probabilities of seeing the different tag pairs in the data, the transition probabilities were the probabilities of seeing a tag bigram (t2,t3) conditioned on the previous tag bigram (t1,t2), and the emission probabilities were the probabilities of seeing a word bigram (w1,w2) conditioned on the corresponding tag bigram (t1,t2). Mikhail Korobov, one of my classmates in Prof. Collins's course, was kind enough to point me to this paper (PDF), which uses a similar approach.
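
A minimal sketch of this encoding, using nltk.bigrams on a made-up tagged sentence, looks like this:

import nltk

# made-up tagged sentence as (word, tag) pairs
sent = [("the", "O"), ("BRCA1", "I"), ("gene", "O"), ("mutates", "O")]

# each HMM observation becomes a word bigram, each state a tag bigram
pairs = list(nltk.bigrams(sent))
word_bigrams = [(w1, w2) for ((w1, t1), (w2, t2)) in pairs]
tag_bigrams = [(t1, t2) for ((w1, t1), (w2, t2)) in pairs]
print word_bigrams  # [('the', 'BRCA1'), ('BRCA1', 'gene'), ('gene', 'mutates')]
print tag_bigrams   # [('O', 'I'), ('I', 'O'), ('O', 'O')]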

Because the transition probabilities are now sparser, the precision and recall numbers went down pretty drastically. I then attempted to build a conditional probability distribution for the transition probabilities that used a backoff model, i.e., if the MLE for the trigram probability (t3|t1,t2) was not available, fall back to the MLE for the bigram probability (t3|t2), and finally to the MLE for the unigram probability N(t3)/N. Unfortunately, this made all the predicted tag bigrams look like (I,O), so the model is now about as good as a (binary) broken clock.

Here is the code for the tagger (also available on my GitHub). I started with the baseline and kept adding the functionality required by the assignment, with each addition switched on by a new key=value command line parameter. Each successive key turns on all the previous ones (see the main method to see what I mean). The keys are also used to identify the results.

from __future__ import division

import itertools
import sys

import collections
import nltk
import nltk.probability
import numpy as np

######################## utility methods ###########################

def findRareWords(train_file):
  """
  Extra pass through the training file to identify rare
  words and deal with them in the accumulation process.
  """
  wordsFD = nltk.FreqDist()
  reader = nltk.corpus.reader.TaggedCorpusReader(".", train_file)
  for word in reader.words():
    wordsFD.inc(word.lower())
  return set(filter(lambda x: wordsFD[x] < 5, wordsFD.keys()))

def normalizeRareWord(word, rareWords, replaceRare):
  """
  Introduce a bit of variety. Even though they are all rare,
  we classify rare words into "kinds" of rare words.
  """
  if word in rareWords:
    if replaceRare:
      if word.isdigit():   # purely numeric tokens
        return "_RARE_NUMERIC_"
      elif word.upper() == word:
        return "_RARE_ALLCAPS_"
      elif word[-1:].isupper():
        return "_RARE_LASTCAP"
      else:
        return "_RARE_"
    else:
      return "_RARE_"
  else:
    return word

def pad(sent, tags=True):
  """
  Pad sentences with the start and stop tags and return padded
  sentence.
  """
  if tags:
    padded = [("<*>", "<*>"), ("<*>", "<*>")]
  else:
    padded = ["<*>", "<*>"]
  padded.extend(sent)
  if tags:
    padded.append(("<$>", "<$>"))
  else:
    padded.append("<$>")
  return padded

def calculateMetrics(actual, predicted):
  """
  Returns the number of cases where prediction and actual NER
  tags are the same, divided by the number of tags for the
  sentence.
  """
  pred_p = map(lambda x: "I" if x == "I" else "O", predicted)
  cm = nltk.metrics.confusionmatrix.ConfusionMatrix(actual, pred_p)
  keys = ["I", "O"]
  metrics = np.matrix(np.zeros((2, 2)))
  for x, y in [(x, y) for x in range(0, 2)
                      for y in range(0, 2)]:
    try:
      metrics[x, y] = cm[keys[x], keys[y]]
    except KeyError:
      pass
  tp = metrics[0, 0]
  tn = metrics[1, 1]
  fp = metrics[0, 1]
  fn = metrics[1, 0]
  precision = 0 if (tp + fp) == 0 else tp / (tp + fp)
  recall = 0 if (tp + fn) == 0 else tp / (tp + fn)
  fmeasure = (0 if (precision + recall) == 0
    else (2 * precision * recall) / (precision + recall))
  accuracy = (tp + tn) / (tp + tn + fp + fn)
  return (precision, recall, fmeasure, accuracy)

def writeResult(fout, hmm, words):
  """
  Writes out the result in the required format.
  """
  tags = hmm.best_path(words)
  for word, tag in zip(words, tags)[2:-1]:
    fout.write("%s %s\n" % (word, tag))
  fout.write("\n")

def bigramToUnigram(bigrams):
  """
  Convert a list of bigrams to the equivalent unigram list.
  """
  unigrams = [bigrams[0][0]]
  unigrams.extend([x[1] for x in bigrams])
  return unigrams

def calculateBackoffTransCPD(tagsFD, transCFD, trans2CFD):
  """
  Uses a backoff model to calculate a smoothed conditional
  probability distribution on the training data.
  """
  probDistDict = collections.defaultdict(nltk.DictionaryProbDist)
  tags = tagsFD.keys()
  conds = [x for x in itertools.permutations(tags, 2)]
  for tag in tags:
    conds.append((tag, tag))
  for (t1, t2) in conds:
    probDict = collections.defaultdict(float)
    prob = 0
    for t3 in tags:
      trigramsFD = trans2CFD[(t1, t2)]
      if trigramsFD.N() > 0 and trigramsFD.freq(t3) > 0:
        prob = trigramsFD.freq(t3) / trigramsFD.N()
      else:
        bigramsFD = transCFD[t2]
        if bigramsFD.N() > 0 and bigramsFD.freq(t3) > 0:
          prob = bigramsFD.freq(t3) / bigramsFD.N()
        else:
          prob = tagsFD[t3] / tagsFD.N()
      probDict[t3] = prob
    probDistDict[(t1, t2)] = nltk.DictionaryProbDist(probDict)
  return nltk.DictionaryConditionalProbDist(probDistDict)

class Accumulator:
  """
  Convenience class to accumulate all the frequencies
  into a set of data structures.
  """
  def __init__(self, rareWords, replaceRare, useTrigrams):
    self.rareWords = rareWords
    self.replaceRare = replaceRare
    self.useTrigrams = useTrigrams
    self.words = set()
    self.tags = set()
    self.priorsFD = nltk.FreqDist()
    self.transitionsCFD = nltk.ConditionalFreqDist()
    self.outputsCFD = nltk.ConditionalFreqDist()
    # additional data structures for trigram
    self.transitions2CFD = nltk.ConditionalFreqDist()
    self.tagsFD = nltk.FreqDist()

  def addSentence(self, sent, norm_func):
    # preprocess
    unigrams = [(norm_func(word, self.rareWords, self.replaceRare), tag)
      for (word, tag) in sent]
    prevTag = None
    prev2Tag = None
    if self.useTrigrams:
      # each state is represented by a tag bigram
      bigrams = nltk.bigrams(unigrams)
      for ((w1, t1), (w2, t2)) in bigrams:
        self.words.add((w1, w2))
        self.tags.add((t1, t2))
        self.priorsFD.inc((t1, t2))
        self.outputsCFD[(t1, t2)].inc((w1, w2))
        if prevTag is not None:
          self.transitionsCFD[prevTag].inc(t2)
        if prev2Tag is not None:
          self.transitions2CFD[prev2Tag].inc((t1, t2))
        prevTag = t2
        prev2Tag = (t1, t2)
        self.tagsFD.inc(prevTag)
    else:
      # each state is represented by a tag unigram
      for word, tag in unigrams:
        self.words.add(word)
        self.tags.add(tag)
        self.priorsFD.inc(tag)
        self.outputsCFD[tag].inc(word)
        if prevTag is not None:
          self.transitionsCFD[prevTag].inc(tag)
        prevTag = tag

####################### train, validate, test ##################

def train(train_file, 
    rareWords, replaceRare, useTrigrams, trigramBackoff):
  """
  Read the file and populate the various frequency and
  conditional frequency distributions and build the HMM
  off these data structures.
  """
  acc = Accumulator(rareWords, replaceRare, useTrigrams)
  reader = nltk.corpus.reader.TaggedCorpusReader(".", train_file)
  for sent in reader.tagged_sents():
    unigrams = pad(sent)
    acc.addSentence(unigrams, normalizeRareWord)
  if useTrigrams:
    if trigramBackoff:
      backoffCPD = calculateBackoffTransCPD(acc.tagsFD, acc.transitionsCFD,
        acc.transitions2CFD)
      return nltk.HiddenMarkovModelTagger(list(acc.words), list(acc.tags),
        backoffCPD,
        nltk.ConditionalProbDist(acc.outputsCFD, nltk.ELEProbDist),
        nltk.ELEProbDist(acc.priorsFD))
    else:
      return nltk.HiddenMarkovModelTagger(list(acc.words), list(acc.tags),
        nltk.ConditionalProbDist(acc.transitions2CFD, nltk.ELEProbDist,
        len(acc.transitions2CFD.conditions())),
        nltk.ConditionalProbDist(acc.outputsCFD, nltk.ELEProbDist),
        nltk.ELEProbDist(acc.priorsFD))
  else:
    return nltk.HiddenMarkovModelTagger(list(acc.words), list(acc.tags),
      nltk.ConditionalProbDist(acc.transitionsCFD, nltk.ELEProbDist,
      len(acc.transitionsCFD.conditions())),
      nltk.ConditionalProbDist(acc.outputsCFD, nltk.ELEProbDist),
      nltk.ELEProbDist(acc.priorsFD))

def validate(hmm, validation_file, rareWords, replaceRare, useTrigrams):
  """
  Tests the HMM against the validation file.
  """
  precision = 0
  recall = 0
  fmeasure = 0
  accuracy = 0
  nSents = 0
  reader = nltk.corpus.reader.TaggedCorpusReader(".", validation_file)
  for sent in reader.tagged_sents():
    sent = pad(sent)
    words = [word for (word, tag) in sent]
    tags = [tag for (word, tag) in sent]
    normWords = map(lambda x: normalizeRareWord(
      x, rareWords, replaceRare), words)
    if useTrigrams:
      # convert words to word bigrams
      normWords = nltk.bigrams(normWords)
    predictedTags = hmm.best_path(normWords)
    if useTrigrams:
      # convert tag bigrams back to unigrams
      predictedTags = bigramToUnigram(predictedTags)
    (p, r, f, a) = calculateMetrics(tags[2:-1], predictedTags[2:-1])
    precision += p
    recall += r
    fmeasure += f
    accuracy += a
    nSents += 1
  print("Accuracy=%f, Precision=%f, Recall=%f, F1-Measure=%f\n" %
    (accuracy/nSents, precision/nSents, recall/nSents,
    fmeasure/nSents))

def test(hmm, test_file, result_file, rareWords, replaceRare, useTrigrams):
  """
  Tests the HMM against the test file (without tags) and writes
  out the results to the result file.
  """
  fin = open(test_file, 'rb')
  fout = open(result_file, 'wb')
  for line in fin:
    line = line.strip()
    words = pad([word for word in line.split(" ")], tags=False)
    normWords = map(lambda x: normalizeRareWord(
      x, rareWords, replaceRare), words)
    if useTrigrams:
      # convert words to word bigrams
      normWords = nltk.bigrams(normWords)
    tags = hmm.best_path(normWords)
    if useTrigrams:
      # convert tag bigrams back to unigrams
      tags = bigramToUnigram(tags)
    fout.write(" ".join(["/".join([word, tag])
      for (word, tag) in (zip(words, tags))[2:-1]]) + "\n")
  fin.close()
  fout.close()

def main():
  normalizeRare = False
  replaceRare = False
  useTrigrams = False
  trigramBackoff = False
  if len(sys.argv) > 1:
    args = sys.argv[1:]
    for arg in args:
      k, v = arg.split("=")
      if k == "normalize-rare":
        normalizeRare = True if v.lower() == "true" else False
      elif k == "replace-rare":
        normalizeRare = True
        replaceRare = True if v.lower() == "true" else False
      elif k == "use-trigrams":
        normalizeRare = True
        replaceRare = True
        useTrigrams = True if v.lower() == "true" else False
      elif k == "trigram-backoff":
        normalizeRare = True
        replaceRare = True
        useTrigrams = True
        trigramBackoff = True if v.lower() == "true" else False
      else:
        continue
  rareWords = set()
  if normalizeRare:
    rareWords = findRareWords("gene.train")
  hmm = train("gene.train",
    rareWords, replaceRare, useTrigrams, trigramBackoff)
  validate(hmm, "gene.validate", rareWords, replaceRare, useTrigrams)
  test(hmm, "gene.test", "gene.test.out",
    rareWords, replaceRare, useTrigrams)
  
if __name__ == "__main__":
  main()
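
Assuming the listing above is saved as, say, hmm_gene_tagger.py (the file name is just for illustration), the configurations reported below can be run like this:

python hmm_gene_tagger.py                        # baseline
python hmm_gene_tagger.py normalize-rare=true    # rare words replaced with _RARE_
python hmm_gene_tagger.py replace-rare=true      # categorized _RARE_xx_ replacement
python hmm_gene_tagger.py use-trigrams=true      # tag bigram (trigram) states
python hmm_gene_tagger.py trigram-backoff=true   # trigram transitions with backoff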

And here are the results.

Model             Accuracy   Precision   Recall     F1-Measure
(baseline)        0.941894   0.337643    0.346306   0.324261
normalize-rare    0.940066   0.319570    0.333753   0.309895
replace-rare      0.940960   0.319570    0.335538   0.311532
use-trigrams      0.901701   0.001742    0.011788   0.002912
trigram-backoff   0.902182   0.000000    0.000000   0.000000

As you can see, the best results in this case seem to come from the baseline bigram model. Prior to this, I had built an NER by modeling it as a classification problem. However, the use of HMMs for this problem domain looks fairly common, as we can see from the papers here (PDF) and here (PDF). Another resource I came across that I want to explore further is NLTK's TnT (Trigrams'n'Tags) tagger, which seems to be a better fit for the trigram-backoff model.
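
For the record, here is a minimal, untested sketch of what using TnT on the same training data might look like (the sentence being tagged is made up):

import nltk
from nltk.tag import tnt

# train NLTK's TnT tagger on the same training file used above
reader = nltk.corpus.reader.TaggedCorpusReader(".", "gene.train")
tnt_tagger = tnt.TnT()
tnt_tagger.train(reader.tagged_sents())
# tag a made-up sentence (a plain list of words, no tags)
print tnt_tagger.tag("the BRCA1 gene is mutated".split())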

Sunday, March 24, 2013

Implementing the RAKE Algorithm with NLTK


The Rapid Automatic Keyword Extraction (RAKE) algorithm extracts keywords from text by identifying runs of non-stopwords and then scoring these phrases across the document. It requires no training: the only inputs are a list of stopwords for a given language and a tokenizer that splits the text into sentences and sentences into words.
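
As a toy illustration of the phrase-splitting step, consider the first sentence of the sample abstract used further down. The stopword list here is a tiny made-up subset, and only this one sentence is scored, so the numbers differ from the full run below.

# toy illustration of RAKE's candidate phrase extraction on one sentence;
# the stopword list is a made-up subset, for exposition only
stop = set(["of", "over", "the"])
words = ("compatibility of systems of linear constraints "
         "over the set of natural numbers").split()
phrases, phrase = [], []
for word in words:
  if word in stop:
    if len(phrase) > 0:
      phrases.append(phrase)
      phrase = []
  else:
    phrase.append(word)
if len(phrase) > 0:
  phrases.append(phrase)
print phrases
# [['compatibility'], ['systems'], ['linear', 'constraints'], ['set'],
#  ['natural', 'numbers']]
# each word is scored as deg(word) / freq(word) over all phrases, and a phrase's
# score is the sum of its word scores, e.g. here score("linear constraints") =
# 2/1 + 2/1 = 4.0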

The RAKE algorithm is described in the book Text Mining Applications and Theory by Michael W. Berry (free PDF). There is a (relatively) well-known Python implementation and a somewhat less well-known Java implementation.

I started looking for something along these lines because I needed to parse a block of text before vectorizing it and using the resulting features as input to a predictive model. Vectorizing text is quite easy with Scikit-Learn as shown in its Text Processing Tutorial. What I was trying to do was to cut down the noise by extracting keywords from the input text and passing a concatenation of the keywords into the vectorizer. It didn't improve results by much in my cross-validation tests, however, so I ended up not using it. But keyword extraction can have other uses, so I decided to explore it a bit more.
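
For what it's worth, the plumbing I had in mind was roughly along these lines. This is just a sketch: RakeKeywordExtractor is the class defined below, the texts list is dummy data, and the vectorizer is left at its defaults.

from sklearn.feature_extraction.text import CountVectorizer

# rake is an instance of the RakeKeywordExtractor class defined below;
# texts stands in for whatever documents are being vectorized
rake = RakeKeywordExtractor()
texts = ["some dummy text about linear constraints",
         "more dummy text about diophantine equations"]
keyword_texts = [" ".join(rake.extract(text)) for text in texts]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(keyword_texts)  # term-document matrix over keywords only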

I had started off using the Python implementation directly from my application code (by importing it as a module). I soon noticed that it was doing a lot of extra work because it was implemented in pure Python. Since I was using NLTK anyway for other things in this application, it made sense to convert the RAKE implementation to use NLTK as well, so I could hand off some of the work to NLTK's built-in functions. So here is another RAKE implementation, this time using Python and NLTK.

# Adapted from: github.com/aneesha/RAKE/rake.py
from __future__ import division
import operator
import nltk
import string

def isPunct(word):
  return len(word) == 1 and word in string.punctuation

def isNumeric(word):
  try:
    float(word) if '.' in word else int(word)
    return True
  except ValueError:
    return False

class RakeKeywordExtractor:

  def __init__(self):
    self.stopwords = set(nltk.corpus.stopwords.words())
    self.top_fraction = 1 # keep top 1/top_fraction of phrases by score (1 = keep all)

  def _generate_candidate_keywords(self, sentences):
    phrase_list = []
    for sentence in sentences:
      words = map(lambda x: "|" if x in self.stopwords else x,
        nltk.word_tokenize(sentence.lower()))
      phrase = []
      for word in words:
        if word == "|" or isPunct(word):
          if len(phrase) > 0:
            phrase_list.append(phrase)
            phrase = []
        else:
          phrase.append(word)
    return phrase_list

  def _calculate_word_scores(self, phrase_list):
    word_freq = nltk.FreqDist()
    word_degree = nltk.FreqDist()
    for phrase in phrase_list:
      degree = len(filter(lambda x: not isNumeric(x), phrase)) - 1
      for word in phrase:
        word_freq.inc(word)
        word_degree.inc(word, degree) # other words
    for word in word_freq.keys():
      word_degree[word] = word_degree[word] + word_freq[word] # itself
    # word score = deg(w) / freq(w)
    word_scores = {}
    for word in word_freq.keys():
      word_scores[word] = word_degree[word] / word_freq[word]
    return word_scores

  def _calculate_phrase_scores(self, phrase_list, word_scores):
    phrase_scores = {}
    for phrase in phrase_list:
      phrase_score = 0
      for word in phrase:
        phrase_score += word_scores[word]
      phrase_scores[" ".join(phrase)] = phrase_score
    return phrase_scores
    
  def extract(self, text, incl_scores=False):
    sentences = nltk.sent_tokenize(text)
    phrase_list = self._generate_candidate_keywords(sentences)
    word_scores = self._calculate_word_scores(phrase_list)
    phrase_scores = self._calculate_phrase_scores(
      phrase_list, word_scores)
    sorted_phrase_scores = sorted(phrase_scores.iteritems(),
      key=operator.itemgetter(1), reverse=True)
    n_phrases = len(sorted_phrase_scores)
    if incl_scores:
      return sorted_phrase_scores[0:int(n_phrases/self.top_fraction)]
    else:
      return map(lambda x: x[0],
        sorted_phrase_scores[0:int(n_phrases/self.top_fraction)])

def test():
  rake = RakeKeywordExtractor()
  keywords = rake.extract("""
Compatibility of systems of linear constraints over the set of natural 
numbers. Criteria of compatibility of a system of linear Diophantine 
equations, strict inequations, and nonstrict inequations are considered. 
Upper bounds for components of a minimal set of solutions and algorithms 
of construction of minimal generating sets of solutions for all types of 
systems are given. These criteria and the corresponding algorithms for 
constructing a minimal supporting set of solutions can be used in solving 
all the considered types of systems and systems of mixed types.
  """, incl_scores=True)
  print keywords
  
if __name__ == "__main__":
  test()

The results are nearly identical, and I hope you will agree that the code is much more readable (if you are familiar with the NLTK API, of course). Here are the results from the original RAKE implementation, compared to my NLTK-based implementation.

sujit@cyclone:src$ python rake.py
[('minimal generating sets', 8.6666666666666661), 
 ('linear diophantine equations', 8.5), 
 ('minimal supporting set', 7.6666666666666661), 
 ('minimal set', 4.6666666666666661), 
 ('linear constraints', 4.5), 
 ('upper bounds', 4.0), 
 ('natural numbers', 4.0), 
 ('nonstrict inequations', 4.0)]
sujit@cyclone:src$ python rake_nltk.py
[('minimal generating sets', 8.6666666666666661), 
 ('linear diophantine equations', 8.5), 
 ('minimal supporting set', 7.6666666666666661), 
 ('minimal set', 4.6666666666666661), 
 ('linear constraints', 4.5), 
 ('upper bounds', 4.0), 
 ('natural numbers', 4.0), 
 ('nonstrict inequations', 4.0), 
 ('strict inequations', 4.0)]

As you can see, the NLTK-based version returns one more candidate keyword. This is because it finds 27 candidate keywords instead of 24. I believe it's related to the way the original version handles punctuation (i.e., as part of the preceding word) compared to NLTK's approach (as a separate token). In any case, I am not convinced it's a bug, because one of the extra keywords returned was "corresponding algorithms" with score 3.5, which seems reasonable.
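
The difference is easy to see on a fragment of the test abstract. Here plain str.split stands in for tokenization that leaves punctuation attached to the preceding word; it is not the original implementation's actual tokenizer, just an illustration.

import nltk

fragment = "strict inequations, and nonstrict inequations are considered"
print fragment.split()
# ['strict', 'inequations,', 'and', 'nonstrict', 'inequations', 'are', 'considered']
print nltk.word_tokenize(fragment)
# ['strict', 'inequations', ',', 'and', 'nonstrict', 'inequations', 'are', 'considered']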

An optimization on top of this is to find candidate keywords that span a stopword. So the next step would involve storing the text in a Lucene index and hitting it with n^2 SpanQuery calls, where n is the number of extracted keywords, something along these lines.


Saturday, March 16, 2013

The Wikipedia Bob Alice HMM example using scikit-learn


Recently I needed to build a Hidden Markov Model (HMM). I have played with HMMs previously, but it was a while ago, so I needed to brush up on the underlying concepts. For that, the Wikipedia article is actually quite effective. My objective was to take an off-the-shelf HMM implementation, train it, and use it to predict (i.e., treat the HMM algorithm itself as a black box).

Scikit-Learn is an open-source Python machine-learning library that has several HMM implementations. The documentation is somewhat light, though, so I wanted to see if I could implement the Bob-Alice example from the Wikipedia article (there is a similar example in the Wikipedia article on the Viterbi algorithm), and whether the resulting HMM returned believable results.

The Bob-Alice example is described here. Here is the corresponding implementation using Python and scikit-learn.

from __future__ import division
import numpy as np
from sklearn import hmm

states = ["Rainy", "Sunny"]
n_states = len(states)

observations = ["walk", "shop", "clean"]
n_observations = len(observations)

start_probability = np.array([0.6, 0.4])

transition_probability = np.array([
  [0.7, 0.3],
  [0.4, 0.6]
])

emission_probability = np.array([
  [0.1, 0.4, 0.5],
  [0.6, 0.3, 0.1]
])

model = hmm.MultinomialHMM(n_components=n_states)
model._set_startprob(start_probability)
model._set_transmat(transition_probability)
model._set_emissionprob(emission_probability)

# predict a sequence of hidden states based on visible states
bob_says = [0, 2, 1, 1, 2, 0]
logprob, alice_hears = model.decode(bob_says, algorithm="viterbi")
print "Bob says:", ", ".join(map(lambda x: observations[x], bob_says))
print "Alice hears:", ", ".join(map(lambda x: states[x], alice_hears))

The output of this code is shown below. As you can see, it looks quite reasonable given the constraints in the example.

Bob says: walk, clean, shop, shop, clean, walk
Alice hears: Sunny, Rainy, Rainy, Rainy, Rainy, Sunny

Even though it's a silly little example, it helped me understand how to model a Named Entity Recognizer as an HMM for a Coursera class I am taking. Hopefully it helps you with something (or at least you found it interesting :-)).

Saturday, March 09, 2013

Solr: Custom Ranking with Function Queries


Solr has had support for Function Queries since version 3.1, but until sometime last week I did not have a use for them. Which is probably why, whenever I read about Function Queries, they seemed like a nice idea but not interesting enough to pursue further.

Most people get introduced to Function Queries through the bf parameter in the DisMax Query Parser or through the geodist function in Spatial Search. So far, I haven't had the opportunity to personally use either feature in a real application. My introduction to Function Queries was through a problem posed to me by one of my coworkers.

The problem was as follows. We want to be able to customize our search results based on what a (logged-in) user tells us about himself or herself via their profile. This could be gender, age, ethnicity and a variety of other things. On the content side, we can annotate each document with various features corresponding to these profile features. For example, we can assign a score to a document that indicates its appeal/information value to males versus females, which would correspond to the profile's gender.

So the idea is that if we know that the profile is male, we should boost the documents that have a high male appeal score and deboost the ones that have a high female appeal score, and vice versa if the profile is female. This idea can be easily extended for multi-category features such as ethnicity as well. In this post, I will describe a possible implementation that uses Function Queries to rerank search results using male/female appeal document scores.

For testing, I created some dummy data of 100,000 records with three fields: title, mscore and fscore. The mscore and fscore fields are random integers in the range 0-1000, and the title contains one of three strings "coffee", "cocoa" and "sugar", plus the mscore and fscore values (primarily for visual feedback). Here is some Scala/SolrJ code that will generate the data and populate it into a vanilla Solr 4.1.0 instance.

// Source: src/main/scala/com/mycompany/solr4extras/funcquery/FuncQueryDataGenerator.scala
package com.mycompany.solr4extras.funcquery

import java.util.Random

import scala.collection.JavaConversions._

import org.apache.solr.client.solrj.impl.HttpSolrServer
import org.apache.solr.common.SolrInputDocument

object FuncQueryDataGenerator extends App {
  
  generate()

  def generate(): Unit = {
    val solrServer = new HttpSolrServer("http://localhost:8983/solr/")
    solrServer.deleteByQuery("*:*")
    solrServer.commit()
    val randomGenerator = new Random()
    val titleWords = Array[String]("coffee", "cocoa", "sugar")
    for (i <- 0 until 1000) {
      val docs = (0 until 100).map(j => { 
        val ms = randomGenerator.nextFloat()
        val fs = randomGenerator.nextFloat()
        val mscore = Math.round(ms * 1000.0F)
        val fscore = Math.round(fs * 1000.0F)
        val word = titleWords(randomGenerator.nextInt(2))
        val title = word + ": M " + mscore + " F " + fscore
        println("adding title: " + title)
        val doc = new SolrInputDocument()
        doc.addField("id", ((i * 100) + j))
        doc.addField("mscore", mscore)
        doc.addField("fscore", fscore)
        doc.addField("title", title)
        doc
      })
      solrServer.add(docs)
      solrServer.commit()
    }
    solrServer.commit()
  }
}

The title field is already present in schema.xml with type="text_general", which works fine for us, since it will tokenize individual words (we want to be able to search on coffee, cocoa and sugar). We also add the mscore and fscore field definitions to the schema.xml file, in the fields block, as follows:

  <field name="mscore" type="int" indexed="true" stored="true"/>
  <field name="fscore" type="int" indexed="true" stored="true"/>

Our function query looks like this:

sum(pow(mscore, mn), pow(fscore, fn)).
  where:
    mn = 0.1 for female profiles, 10 for male profiles
    fn = 10 for female profiles, 0.1 for male profiles

The numbers mn and fn are somewhat arbitrary; the idea is that we want to influence the result scores by boosting up the high mscore documents and deboosting the high fscore documents.
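
On the application side, this just means picking the exponents based on the profile gender before building the query. Here is a minimal sketch in Python; the function name and gender values are hypothetical.

def build_boosted_query(user_query, gender):
  # the gender-appropriate score gets the large exponent, the other a small one
  mn, fn = ("10", "0.1") if gender == "male" else ("0.1", "10")
  return "{!boost b=sum(pow(mscore,%s),pow(fscore,%s))}%s" % (mn, fn, user_query)

print build_boosted_query("title:coffee", "male")
# {!boost b=sum(pow(mscore,10),pow(fscore,0.1))}title:coffee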

As a baseline, I query for "coffee" and get back the following result set. Only the top 5 titles are shown, with scores in parentheses.

http://localhost:8983/solr/collection1/select?\
    q=title:coffee&\
    wt=xml&\
    indent=true&\
    fl=*,score

(numFound=50010, maxScore=0.7406558)

coffee: M 557 F 15 (0.7406558)
coffee: M 567 F 636 (0.7406558)
coffee: M 938 F 817 (0.7406558)
coffee: M 113 F 362 (0.7406558)
coffee: M 553 F 278 (0.7406558)
...

But if I now add the function query to boost the high mscore documents, I get these results. As you can see, mscore is high for all of the top 5 results.

http://localhost:8983/solr/collection1/select?\
    q={!boost b=sum(pow(mscore,10),pow(fscore,0.1))}title:coffee&\
    wt=xml&\
    indent=true&\
    fl=*,score

(numFound=50010 maxScore=7.406558E29)

coffee: M 1000 F 27 (7.406558E29)
coffee: M 1000 F 56 (7.406558E29)
coffee: M 1000 F 766 (7.406558E29)
coffee: M 1000 F 965 (7.406558E29)
coffee: M 1000 F 796 (7.406558E29)
...

Flipping the function to boost the high fscore documents instead returns results that have the top fscore of 1000 in the top 5 results.

http://localhost:8983/solr/collection1/select?\
    q={!boost b=sum(pow(fscore,10),pow(mscore,0.1))}title:coffee&\
    wt=xml&\
    indent=true&\
    fl=*,score

(numFound=50010 maxScore=7.406558E29)

coffee: M 74 F 1000 (7.406558E29)
coffee: M 146 F 1000 (7.406558E29)
coffee: M 422 F 1000 (7.406558E29)
coffee: M 708 F 1000 (7.406558E29)
coffee: M 421 F 1000 (7.406558E29)
...

Function Queries can also be used for date boosting (similar to the example above), weighted multi-field sorting and even tapping into raw Lucene statistics using the new Relevance group of functions. Now that I have a reasonable understanding of Function Queries, I plan on experimenting with these as well.
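
For example, the standard recency boost described in the Solr FunctionQuery documentation has the same shape as the gender boost above (pubdate here is a hypothetical date field, and 3.16e-11 is roughly one over the number of milliseconds in a year):

q={!boost b=recip(ms(NOW,pubdate),3.16e-11,1,1)}title:coffee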

Two references that I found helpful (apart from the previously referenced FunctionQuery wiki page) are LucidWorks' documentation page on Function Queries and Tom Nolan's comprehensive post comparing boost methods in Solr. The solution to my problem was modelled in large part on Tom Nolan's post.