Sunday, March 24, 2013

Implementing the RAKE Algorithm with NLTK


The Rapid Automatic Keyword Extraction (RAKE) algorithm extracts keywords from text by identifying runs of non-stopwords and then scoring these phrases across the document. It requires no training; the only inputs are a list of stopwords for a given language and a tokenizer that splits the text into sentences and sentences into words.
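To make that concrete, here is a toy sketch of just the candidate-phrase step, using a made-up five-word stopword list (the real implementation below uses NLTK's stopword corpus and tokenizers instead):

stopwords = {"of", "a", "the", "and", "are"}  # toy list, for illustration only

def candidate_phrases(sentence):
  phrases, phrase = [], []
  for word in sentence.lower().split():
    if word in stopwords:
      # a stopword ends the current run of content words
      if phrase:
        phrases.append(" ".join(phrase))
        phrase = []
    else:
      phrase.append(word)
  if phrase:
    phrases.append(" ".join(phrase))
  return phrases

print(candidate_phrases(
  "Criteria of compatibility of a system of linear Diophantine equations"))
# ['criteria', 'compatibility', 'system', 'linear diophantine equations']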

The RAKE algorithm is described in the book Text Mining: Applications and Theory by Michael W. Berry (free PDF). There is a (relatively) well-known Python implementation and a somewhat less well-known Java implementation.

I started looking for something along these lines because I needed to parse a block of text before vectorizing it and using the resulting features as input to a predictive model. Vectorizing text is quite easy with Scikit-Learn as shown in its Text Processing Tutorial. What I was trying to do was to cut down the noise by extracting keywords from the input text and passing a concatenation of the keywords into the vectorizer. It didn't improve results by much in my cross-validation tests, however, so I ended up not using it. But keyword extraction can have other uses, so I decided to explore it a bit more.
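The pipeline I had in mind looked roughly like the sketch below. The RakeKeywordExtractor class is the one defined further down in this post, and keywords_only is just an illustrative helper name; the vectorizer is scikit-learn's TfidfVectorizer from the tutorial mentioned above.

from sklearn.feature_extraction.text import TfidfVectorizer

def keywords_only(text, rake):
  # replace each document with a concatenation of its extracted keywords
  return " ".join(rake.extract(text))

rake = RakeKeywordExtractor()
docs = [
  "Compatibility of systems of linear constraints over the set of natural numbers.",
  "Upper bounds for components of a minimal set of solutions are given.",
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform([keywords_only(d, rake) for d in docs])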

I had started off using the Python implementation directly from my application code (by importing it as a module). I soon noticed that it was doing a lot of extra work because it was implemented in pure Python. Since I was already using NLTK for other things in this application, it made sense to convert it to use NLTK as well and hand off some of the work to NLTK's built-in functions. So here is another RAKE implementation, this time using Python and NLTK.

# Adapted from: github.com/aneesha/RAKE/rake.py
from __future__ import division
import operator
import nltk
import string

def isPunct(word):
  return len(word) == 1 and word in string.punctuation

def isNumeric(word):
  try:
    float(word) if '.' in word else int(word)
    return True
  except ValueError:
    return False

class RakeKeywordExtractor:

  def __init__(self):
    self.stopwords = set(nltk.corpus.stopwords.words())
    self.top_fraction = 1 # return top 1/top_fraction of candidates by score (1 = all)

  def _generate_candidate_keywords(self, sentences):
    phrase_list = []
    for sentence in sentences:
      words = map(lambda x: "|" if x in self.stopwords else x,
        nltk.word_tokenize(sentence.lower()))
      phrase = []
      for word in words:
        if word == "|" or isPunct(word):
          if len(phrase) > 0:
            phrase_list.append(phrase)
            phrase = []
        else:
          phrase.append(word)
    return phrase_list

  def _calculate_word_scores(self, phrase_list):
    word_freq = nltk.FreqDist()
    word_degree = nltk.FreqDist()
    for phrase in phrase_list:
      degree = len(filter(lambda x: not isNumeric(x), phrase)) - 1
      for word in phrase:
        word_freq.inc(word)
        word_degree.inc(word, degree) # other words
    for word in word_freq.keys():
      word_degree[word] = word_degree[word] + word_freq[word] # itself
    # word score = deg(w) / freq(w)
    word_scores = {}
    for word in word_freq.keys():
      word_scores[word] = word_degree[word] / word_freq[word]
    return word_scores

  def _calculate_phrase_scores(self, phrase_list, word_scores):
    phrase_scores = {}
    for phrase in phrase_list:
      phrase_score = 0
      for word in phrase:
        phrase_score += word_scores[word]
      phrase_scores[" ".join(phrase)] = phrase_score
    return phrase_scores
    
  def extract(self, text, incl_scores=False):
    sentences = nltk.sent_tokenize(text)
    phrase_list = self._generate_candidate_keywords(sentences)
    word_scores = self._calculate_word_scores(phrase_list)
    phrase_scores = self._calculate_phrase_scores(
      phrase_list, word_scores)
    sorted_phrase_scores = sorted(phrase_scores.iteritems(),
      key=operator.itemgetter(1), reverse=True)
    n_phrases = len(sorted_phrase_scores)
    if incl_scores:
      return sorted_phrase_scores[0:int(n_phrases/self.top_fraction)]
    else:
      return map(lambda x: x[0],
        sorted_phrase_scores[0:int(n_phrases/self.top_fraction)])

def test():
  rake = RakeKeywordExtractor()
  keywords = rake.extract("""
Compatibility of systems of linear constraints over the set of natural 
numbers. Criteria of compatibility of a system of linear Diophantine 
equations, strict inequations, and nonstrict inequations are considered. 
Upper bounds for components of a minimal set of solutions and algorithms 
of construction of minimal generating sets of solutions for all types of 
systems are given. These criteria and the corresponding algorithms for 
constructing a minimal supporting set of solutions can be used in solving 
all the considered types of systems and systems of mixed types.
  """, incl_scores=True)
  print keywords
  
if __name__ == "__main__":
  test()

The results are nearly identical, and I hope you will agree that the code is much more readable (if you are familiar with the NLTK API, of course). Here are the results from the original RAKE implementation, compared to my NLTK-based implementation.

sujit@cyclone:src$ python rake.py
[('minimal generating sets', 8.6666666666666661), 
 ('linear diophantine equations', 8.5), 
 ('minimal supporting set', 7.6666666666666661), 
 ('minimal set', 4.6666666666666661), 
 ('linear constraints', 4.5), 
 ('upper bounds', 4.0), 
 ('natural numbers', 4.0), 
 ('nonstrict inequations', 4.0)]
sujit@cyclone:src$ python rake_nltk.py
[('minimal generating sets', 8.6666666666666661), 
 ('linear diophantine equations', 8.5), 
 ('minimal supporting set', 7.6666666666666661), 
 ('minimal set', 4.6666666666666661), 
 ('linear constraints', 4.5), 
 ('upper bounds', 4.0), 
 ('natural numbers', 4.0), 
 ('nonstrict inequations', 4.0), 
 ('strict inequations', 4.0)]

As you can see, the NLTK-based version returns one more candidate keyword. This is because it finds 27 candidate keywords instead of 24. I believe it's related to the way the original version handles punctuation (i.e. as part of the preceding word) compared to NLTK's approach (as a separate token). In any case, I am not convinced it's a bug, because one of the extra keywords returned was "corresponding algorithms" with a score of 3.5, which seems reasonable.
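You can see the difference in tokenization directly; NLTK's word_tokenize emits punctuation as separate tokens, which changes where phrase boundaries fall compared to a plain whitespace split:

import nltk

s = "strict inequations, and nonstrict inequations are considered."
print(s.split())
# ['strict', 'inequations,', 'and', 'nonstrict', 'inequations', 'are', 'considered.']
print(nltk.word_tokenize(s))
# ['strict', 'inequations', ',', 'and', 'nonstrict', 'inequations', 'are', 'considered', '.']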

An optimization on top of this is to find candidate keywords that span a stopword. So the next step would involve storing the text in a Lucene index and hitting it with n² SpanQuery calls, where n is the number of extracted keywords, something along these lines.
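This is not the Lucene/SpanQuery approach itself, but here is a rough pure-Python approximation of the same idea: for every ordered pair of extracted keywords, check whether they occur in the text separated by a single stopword, and if so merge them into a longer candidate. The function name and arguments are illustrative only.

import itertools
import nltk

def spanning_candidates(text, keywords, stopwords):
  # approximate "keyword1 <stopword> keyword2" matches in the tokenized text
  joined = " ".join(nltk.word_tokenize(text.lower()))
  merged = set()
  for kw1, kw2 in itertools.permutations(keywords, 2):
    for sw in stopwords:
      candidate = "%s %s %s" % (kw1, sw, kw2)
      if candidate in joined:
        merged.add(candidate)
  return merged

# e.g. spanning_candidates(text, ["minimal set", "solutions"], {"of"})
# would return set(["minimal set of solutions"]) for the abstract above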


2 comments (moderated to prevent spam):

Muhammad Mubasher said...

Very nice work. Can you please share it in a repository, or should I just copy and paste it?

Sujit Pal said...

Hi Muhammad, thanks for the kind words. It's available on GitHub here: https://github.com/sujitpal/mlia-examples/blob/master/src/salary_pred/rake_nltk.py. It's part of another mini-project which did not work out too well :-), but this part works fine. Oh, and BTW, no worries about the multiple comments, I get that a lot; I guess the message is not very visible...