Sunday, March 24, 2013

Implementing the RAKE Algorithm with NLTK


The Rapid Automatic Keyword Extraction (RAKE) algorithm extracts keywords from text by identifying runs of non-stopwords and then scoring these phrases across the document. It requires no training: the only inputs are a list of stopwords for a given language and a tokenizer that splits the text into sentences and sentences into words.
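The phrase-splitting idea is simple enough to sketch in a few lines of plain Python. This is only a toy illustration (the tiny stopword list and the regex tokenizer are simplifications), not the implementation shown further down:

```python
import re

# Toy stopword list -- a real run would use a full list for the language
stopwords = {"of", "a", "the", "and", "is", "in", "over"}

text = "Compatibility of systems of linear constraints"

# Candidate keywords are maximal runs of consecutive non-stopword tokens
tokens = re.findall(r"\w+", text.lower())
phrases, current = [], []
for tok in tokens:
    if tok in stopwords:
        if current:           # a stopword ends the current phrase
            phrases.append(current)
        current = []
    else:
        current.append(tok)
if current:                   # flush the trailing phrase
    phrases.append(current)

print(phrases)  # [['compatibility'], ['systems'], ['linear', 'constraints']]
```

Each phrase is then scored by summing per-word scores computed across the whole document, as the full implementation below shows.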

The RAKE algorithm is described in the book Text Mining: Applications and Theory by Michael W. Berry (available as a free PDF). There is a (relatively) well-known Python implementation and a somewhat less well-known Java implementation.

I started looking for something along these lines because I needed to parse a block of text before vectorizing it and using the resulting features as input to a predictive model. Vectorizing text is quite easy with Scikit-Learn as shown in its Text Processing Tutorial. What I was trying to do was to cut down the noise by extracting keywords from the input text and passing a concatenation of the keywords into the vectorizer. It didn't improve results by much in my cross-validation tests, however, so I ended up not using it. But keyword extraction can have other uses, so I decided to explore it a bit more.
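The pipeline I was trying can be sketched with a plain bag-of-words count standing in for Scikit-Learn's vectorizer; `extract_keywords` here is a hypothetical stand-in for the keyword extractor described below:

```python
from collections import Counter

def extract_keywords(text):
    # Hypothetical stand-in for a real extractor such as RAKE
    return ["linear constraints", "natural numbers"]

text = ("Compatibility of systems of linear constraints "
        "over the set of natural numbers.")

# Instead of vectorizing the raw text, vectorize only a concatenation
# of the extracted keywords, which drops noise words like "of" and "over"
reduced = " ".join(extract_keywords(text))
features = Counter(reduced.split())
```

In the real pipeline, `reduced` would be fed to something like Scikit-Learn's `CountVectorizer` rather than a `Counter`.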

I had started off using the Python implementation directly from my application code (by importing it as a module). I soon noticed that it was doing a lot of extra work because it was implemented in pure Python. Since I was already using NLTK for other things in this application, it made sense to convert the code to use NLTK as well, so I could hand off some of the work to NLTK's built-in functions. So here is another RAKE implementation, this time using Python and NLTK.

# Adapted from: github.com/aneesha/RAKE/rake.py
import operator
import string

import nltk

def isPunct(word):
  return len(word) == 1 and word in string.punctuation

def isNumeric(word):
  try:
    float(word) if '.' in word else int(word)
    return True
  except ValueError:
    return False

class RakeKeywordExtractor:

  def __init__(self):
    self.stopwords = set(nltk.corpus.stopwords.words())
    self.top_fraction = 1 # 1 = return all phrases, 3 = return the top third

  def _generate_candidate_keywords(self, sentences):
    phrase_list = []
    for sentence in sentences:
      words = ["|" if x in self.stopwords else x
               for x in nltk.word_tokenize(sentence.lower())]
      phrase = []
      for word in words:
        if word == "|" or isPunct(word):
          if len(phrase) > 0:
            phrase_list.append(phrase)
            phrase = []
        else:
          phrase.append(word)
      if len(phrase) > 0:
        # flush the last phrase, in case the sentence
        # did not end with punctuation
        phrase_list.append(phrase)
    return phrase_list

  def _calculate_word_scores(self, phrase_list):
    word_freq = nltk.FreqDist()
    word_degree = nltk.FreqDist()
    for phrase in phrase_list:
      degree = len([x for x in phrase if not isNumeric(x)]) - 1
      for word in phrase:
        word_freq[word] += 1
        word_degree[word] += degree # other words in the phrase
    for word in word_freq.keys():
      word_degree[word] += word_freq[word] # the word itself
    # word score = deg(w) / freq(w)
    word_scores = {}
    for word in word_freq.keys():
      word_scores[word] = word_degree[word] / word_freq[word]
    return word_scores

  def _calculate_phrase_scores(self, phrase_list, word_scores):
    phrase_scores = {}
    for phrase in phrase_list:
      phrase_score = sum(word_scores[word] for word in phrase)
      phrase_scores[" ".join(phrase)] = phrase_score
    return phrase_scores

  def extract(self, text, incl_scores=False):
    sentences = nltk.sent_tokenize(text)
    phrase_list = self._generate_candidate_keywords(sentences)
    word_scores = self._calculate_word_scores(phrase_list)
    phrase_scores = self._calculate_phrase_scores(
      phrase_list, word_scores)
    sorted_phrase_scores = sorted(phrase_scores.items(),
      key=operator.itemgetter(1), reverse=True)
    n_phrases = len(sorted_phrase_scores)
    if incl_scores:
      return sorted_phrase_scores[0:int(n_phrases/self.top_fraction)]
    else:
      return [x[0] for x in
        sorted_phrase_scores[0:int(n_phrases/self.top_fraction)]]

def test():
  rake = RakeKeywordExtractor()
  keywords = rake.extract("""
Compatibility of systems of linear constraints over the set of natural 
numbers. Criteria of compatibility of a system of linear Diophantine 
equations, strict inequations, and nonstrict inequations are considered. 
Upper bounds for components of a minimal set of solutions and algorithms 
of construction of minimal generating sets of solutions for all types of 
systems are given. These criteria and the corresponding algorithms for 
constructing a minimal supporting set of solutions can be used in solving 
all the considered types of systems and systems of mixed types.
  """, incl_scores=True)
  print(keywords)

if __name__ == "__main__":
  test()

The results are nearly identical, and I hope you will agree that the code is much more readable (if you are familiar with the NLTK API, of course). Here are the results from the original RAKE implementation, compared to my NLTK-based implementation.

sujit@cyclone:src$ python rake.py
[('minimal generating sets', 8.6666666666666661), 
 ('linear diophantine equations', 8.5), 
 ('minimal supporting set', 7.6666666666666661), 
 ('minimal set', 4.6666666666666661), 
 ('linear constraints', 4.5), 
 ('upper bounds', 4.0), 
 ('natural numbers', 4.0), 
 ('nonstrict inequations', 4.0)]
sujit@cyclone:src$ python rake_nltk.py
[('minimal generating sets', 8.6666666666666661), 
 ('linear diophantine equations', 8.5), 
 ('minimal supporting set', 7.6666666666666661), 
 ('minimal set', 4.6666666666666661), 
 ('linear constraints', 4.5), 
 ('upper bounds', 4.0), 
 ('natural numbers', 4.0), 
 ('nonstrict inequations', 4.0), 
 ('strict inequations', 4.0)]

As you can see, the NLTK-based version returns one more candidate keyword. This is because it finds 27 candidate keywords instead of 24. I believe it's related to the way the original version handles punctuation (i.e., as part of the preceding word) compared to NLTK's approach (as a separate token). In any case, I am not convinced it's a bug, because one of the extra keywords returned was "corresponding algorithms" with a score of 3.5, which seems reasonable.
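The tokenization difference is easy to see with a small comparison. A whitespace split and a regex tokenizer stand in here for the two implementations' actual tokenization, so this is only illustrative:

```python
import re

sentence = "strict inequations, and nonstrict inequations are considered."

# Punctuation kept attached to the preceding word (roughly what a
# whitespace-based split in the pure-Python version produces)
attached = sentence.split()

# Punctuation as separate tokens (what nltk.word_tokenize does)
separated = re.findall(r"\w+|[^\w\s]", sentence)

print(attached)   # ['strict', 'inequations,', 'and', ...]
print(separated)  # ['strict', 'inequations', ',', 'and', ...]
```

In the first case "inequations" and "inequations," count as different words, which shifts the frequency and degree counts and hence the final candidate list.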

An optimization on top of this is to find candidate keywords that span a stopword. So the next step would involve storing the text in a Lucene index and hitting it with n² SpanQuery calls, where n is the number of extracted keywords, something along these lines.
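Without bringing in Lucene, the idea can be sketched in plain Python: for each pair of extracted keywords, check whether they occur in the text separated by exactly one stopword. The keyword list and stopword set below are assumptions for illustration only:

```python
import itertools
import re

stopwords = {"of", "for", "and", "the", "a"}
keywords = ["minimal set", "solutions"]

text = "Upper bounds for components of a minimal set of solutions are given."
tokens = re.findall(r"\w+", text.lower())

# Look for keyword pairs separated by exactly one stopword,
# the poor man's equivalent of a Lucene SpanNearQuery
joined = []
for kw1, kw2 in itertools.permutations(keywords, 2):
    w1, w2 = kw1.split(), kw2.split()
    for i in range(len(tokens) - len(w1) - len(w2)):
        mid = i + len(w1)
        if (tokens[i:mid] == w1 and tokens[mid] in stopwords
            and tokens[mid+1:mid+1+len(w2)] == w2):
            joined.append(kw1 + " " + tokens[mid] + " " + kw2)

print(joined)  # ['minimal set of solutions']
```

A Lucene index with SpanNearQuery would do the same matching far more efficiently over a large corpus, which is the point of the optimization.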


8 comments (moderated to prevent spam):

Muhammad Mubasher said...

Very nice work, can you please share it in a repository? Or should I use it by copy-pasting?

Sujit Pal said...

Hi Muhammad, thanks for the kind words. It's available on GitHub here: https://github.com/sujitpal/mlia-examples/blob/master/src/salary_pred/rake_nltk.py. It's part of another mini-project which did not work out too well :-), but this part works fine. Oh and BTW, no worries about the multiple comments, I get that a lot, I guess the message is not very visible...

Vinod Nambiar said...

Dear Sujith,

I am a new entrant into the text analysis area. Due to my curiosity, I am trying to explore this. I would like to ask a question: can we have any control over engines like RAKE to make them use our own vocabulary to search for? After searching the net, I could only find that it uses its own logic to pick keywords.

Look forward to hearing from you.

Sujit Pal said...

Hi Vinod, you can't use your own dictionary with RAKE; it relies on a statistical property of keywords. But there are other extraction tools, like Kea, which make use of a controlled vocabulary.

Vinod Nambiar said...

Dear Sujith,

Much appreciate you for the speedy response. I'll explore the KEA instead of RAKE.

Thank you once again.

Thanks & Regards,
Vinod Nambiar.

Sujit Pal said...

You're welcome Vinod. Kea is on my list of things to look at but I haven't gotten around to it yet. If you write up your findings, would appreciate a pointer to the URL.

Vinod Nambiar said...

Sure. Will keep you posted.

One question: is there a way we can link a custom dictionary with RAKE? Any possibility of adding additional code or modules?

Just a thought.

Thanks & Regards,
Vinod Nambiar.

Sujit Pal said...

Thanks. To answer your question, Rake looks for frequent runs of words delimited by stopwords in sentences. I guess you could redefine the stopwords for your corpus, or filter the output through a custom dictionary. Alternatively you could cap the number of words you will allow a keyword to have (top and bottom). Can't think of any others, I guess it would depend on your use case.