Earlier this week, there was a question about finding significant phrases in text on the Natural Language Processing People (login required) group on LinkedIn. I suggested looking at this LingPipe tutorial. The idea is to find statistically significant word collocations, i.e., those that occur more frequently than we can explain away as chance. I first became aware of this approach from the LLG Book, where two approaches are described: one based on Log-Likelihood Ratios (LLR) and one based on the Chi-Squared test of independence; the latter is the one LingPipe uses.
I had originally set out to actually provide an implementation for my suggestion (to answer a followup question). However, the Scipy pydoc notes that the chi-squared test may be invalid when the number of observed or expected frequencies in each category is too small. Our algorithm compares just two observed and expected frequencies, so it probably qualifies. Hence I went with the LLR approach, even though it is slightly more involved.
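For the curious, the chi-squared route I decided against would test independence over a 2x2 contingency table of counts. Here is a rough sketch using scipy.stats.chi2_contingency; the counts are made up purely for illustration, but they show how tiny the expected frequencies can get, which is exactly what the Scipy docs warn about.

from scipy.stats import chi2_contingency

# made-up counts for a hypothetical bigram (w1, w2):
# c1 = count(w1), c2 = count(w2), c12 = count(w1 w2), n = total words
c1, c2, c12, n = 20, 25, 6, 10000

# 2x2 contingency table of observed counts:
#                 w2 follows      w2 does not follow
# w1              c12             c1 - c12
# not w1          c2 - c12        n - c1 - c2 + c12
observed = [[c12, c1 - c12],
            [c2 - c12, n - c1 - c2 + c12]]
chi2, pval, dof, expected = chi2_contingency(observed)
print chi2, pval
# the expected count for the (w1, w2) cell works out to c1*c2/n = 0.05,
# far below the usual rule-of-thumb minimum of 5
print expected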
Coming back to LLR: the idea is to find, for each bigram, the likelihood that its two components are dependent on each other versus the likelihood that they are not. For each bigram with a positive LLR, we repeat the analysis by adding its neighboring word, arriving at a list of trigrams with positive LLR, and so on, until we reach the N-gram size we think makes sense for the corpus. You can find an explanation of the math in one of my earlier posts, but you will probably find a better explanation in the LLG book.
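For reference, in the form the code further down implements (this is just my reading of it, with b(k; n, p) the binomial probability of k successes in n trials at probability p), the score works out to roughly:

\log \lambda = \left[ \log b(c_{12};\, c_1,\, p_1) + \log b(c_2 - c_{12};\, N - c_1,\, p_2) \right]
             - \left[ \log b(c_{12};\, c_1,\, p_0) + \log b(c_2 - c_{12};\, N - c_1,\, p_0) \right]

b(k;\, n,\, p) = \binom{n}{k} p^k (1 - p)^{n - k}, \quad
p_0 = \frac{c_2}{N}, \quad p_1 = \frac{c_{12}}{N}, \quad p_2 = \frac{c_2 - c_{12}}{N}

where c_1 and c_2 are the counts of w_1 and w_2, c_{12} is the count of the bigram (w_1, w_2), and N is the total number of words. Bigrams with \log\lambda > 0 are treated as significant.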
For input data, I decided to use Twitter. I'm not that familiar with the Twitter API, but I'm taking the Introduction to Data Science course on Coursera, and the first assignment provided some code to pull data from the Twitter 1% feed, so I just reused that. I preprocess the feed with the following code, which leaves me with about 65k English tweets:
from __future__ import division
import json
import re
import sys

def main():
    if len(sys.argv) != 2:
        print "Usage: %s /path/to/twitter/json/list.txt" % (sys.argv[0])
        sys.exit(-1)
    fin = open(sys.argv[1], 'rb')
    fout = open("twitter_messages.txt", 'wb')
    for line in fin:
        try:
            # each line is a tweet JSON object; keep English tweets only
            data = json.loads(line.strip())
            lang = data["lang"]
            if lang == "en":
                tweet = data["text"]
                # collapse newlines and whitespace runs, drop non-ASCII chars
                tweet = re.sub(r"\s+", " ", tweet)
                tweet = tweet.encode("ascii", "ignore")
                if len(tweet) == 0:
                    continue
                fout.write("%s\n" % (tweet))
        except KeyError:
            continue
    fin.close()
    fout.close()

if __name__ == "__main__":
    main()
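In case it helps, this is how you would invoke it (the script name filter_tweets.py is just a placeholder for whatever you save the code as); each line of the input file is expected to be a raw tweet JSON object with at least the lang and text fields:

sujit@cyclone:phrases$ python filter_tweets.py /path/to/twitter/json/list.txt

This writes the filtered tweets, one per line, to twitter_messages.txt, which the next script reads.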
The main code for extracting interesting phrases implements the math described above. The main difference is that I extend the approach to N-grams longer than 2. For trigrams, for example, I consider only those trigrams whose first two words form a bigram with LLR > 0, and so on. The code considers N-grams up to an upper limit set in the range() loop (increasing it is simply a matter of changing that limit). As it turns out, the largest significant N-gram in my data is a 4-gram.
Another important difference is that I have not used any heuristic to prune the N-gram set other than LLR > 0. The LLG book prescribes additional heuristics for this, such as keeping only N-grams made up of nouns, verbs, adjectives or prepositions, excluding all phrases that don't end with nouns, excluding phrases consisting entirely of stopwords, etc. I wasn't sure how effective they would be with tweets, so I skipped them.
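If you did want to experiment with them, a filter along those lines might look something like the sketch below. This is not something I ran against this dataset; passesHeuristics is just a name I made up here, and it needs the NLTK POS tagger models and stopwords corpus downloaded.

from nltk import pos_tag
from nltk.corpus import stopwords

ENGLISH_STOPWORDS = set(stopwords.words("english"))

def passesHeuristics(ngram):
    # keep only phrases that end in a noun and are not made up
    # entirely of stopwords, roughly as suggested in the LLG book
    tags = [tag for (word, tag) in pos_tag(list(ngram))]
    if not tags[-1].startswith("NN"):
        return False
    if all(word in ENGLISH_STOPWORDS for word in ngram):
        return False
    return True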
So anyway, here's the code. As always, all the code described in this post can be found on my GitHub page for this subproject. The data is not included because I am not sure if there are legal implications to doing so, and anyway, it's simple and painless enough to grab some data off Twitter and build up your own dataset.
from __future__ import division
import operator
import nltk
import numpy as np
from scipy.stats import binom
import string

def isValid(word):
    if word.startswith("#"):
        return False  # no hashtags
    else:
        # keep only tokens that contain no punctuation characters
        vword = word.translate(string.maketrans("", ""), string.punctuation)
        return len(vword) == len(word)

def llr(c1, c2, c12, n):
    # H0 (independence): a single probability p0 = c2/N governs w2
    # regardless of what precedes it
    p0 = c2 / n
    # H1 (dependence): separate probabilities for w2 after w1 and
    # w2 after anything else
    p10 = c12 / n
    p11 = (c2 - c12) / n
    # log binomial likelihoods of the observed counts
    # H0: b(c12; c1, p0), b(c2-c12; N-c1, p0)
    # H1: b(c12; c1, p10), b(c2-c12; N-c1, p11)
    probs = np.matrix([
        [binom(c1, p0).logpmf(c12), binom(n - c1, p0).logpmf(c2 - c12)],
        [binom(c1, p10).logpmf(c12), binom(n - c1, p11).logpmf(c2 - c12)]])
    # log LLR = log L(H1) - log L(H0)
    return np.sum(probs[1, :]) - np.sum(probs[0, :])

def isLikelyNGram(ngram, phrases):
    # an n-gram (n > 2) is a candidate only if its (n-1)-gram prefix
    # was found to be a likely phrase in the previous pass
    if len(ngram) == 2:
        return True
    prevGram = ngram[:-1]
    return phrases.has_key(prevGram)

def main():
    # accumulate words and the unigram frequency distribution
    lines = []
    unigramFD = nltk.FreqDist()
    fin = open("twitter_messages.txt", 'rb')
    i = 0
    for line in fin:
        i += 1
        words = nltk.word_tokenize(line.strip().lower())
        words = filter(lambda x: isValid(x), words)
        [unigramFD.inc(x) for x in words]
        lines.append(words)
        if i > 1000:
            break  # only use the first 1000 or so tweets
    fin.close()
    # identify likely phrases using a multi-pass algorithm based
    # on the LLR approach described in the Building Search Applications
    # Lucene, LingPipe and GATE book, except that we treat n-gram
    # collocations beyond 2 as an (n-1)-gram plus a unigram.
    phrases = nltk.defaultdict(float)
    prevGramFD = None
    for i in range(2, 5):
        ngramFD = nltk.FreqDist()
        for words in lines:
            nextGrams = nltk.ngrams(words, i)
            nextGrams = filter(lambda x: isLikelyNGram(x, phrases), nextGrams)
            [ngramFD.inc(x) for x in nextGrams]
        for k, v in ngramFD.iteritems():
            if v > 1:
                c1 = unigramFD[k[0]] if prevGramFD == None else prevGramFD[k[:-1]]
                c2 = unigramFD[k[1]] if prevGramFD == None else unigramFD[k[len(k) - 1]]
                c12 = ngramFD[k]
                n = unigramFD.N() if prevGramFD == None else prevGramFD.N()
                phrases[k] = llr(c1, c2, c12, n)
        # only consider n-grams where LLR > 0, ie P(H1) > P(H0)
        likelyPhrases = nltk.defaultdict(float)
        likelyPhrases.update([(k, v) for (k, v)
            in phrases.iteritems() if len(k) == i and v > 0])
        print "==== #-grams = %d ====" % (i)
        sortedPhrases = sorted(likelyPhrases.items(),
            key=operator.itemgetter(1), reverse=True)
        for k, v in sortedPhrases:
            print k, v
        prevGramFD = ngramFD

if __name__ == "__main__":
    main()
Here are the results of my run. As you can see, some of these, such as "solar eclipse", "the fact", etc., seem quite reasonable, and things like "one new unfollower" appear to be popular Twitterisms.
sujit@cyclone:phrases$ python interesting_phrases.py
==== #-grams = 2 ====
('today', 'stats') 4.57956434533
('http', 'ipad') 1.45257884653
('the', 'same') 1.43108593163
('i', 'thought') 1.37514113846
('new', 'followers') 0.876128729563
('ea', 'ea') 0.749974634332
('my', 'favorite') 0.723215257916
('a', 'lot') 0.716683441626
('the', 'fact') 0.705937550309
('i', 'said') 0.668642986328
('my', 'head') 0.153309860951
('i', 'guess') 0.0987318394656
('solar', 'eclipse') 0.0902467987635
('fanboys', 'de') 0.0902467987635
('anybody', 'else') 0.0901414524373
('treat', 'yourself') 0.089930759785
('last', 'word') 0.089193335502
('who', 'wants') 0.0874024479576
('just', 'found') 0.084031365521
('this', 'whole') 0.0831885949118
('this', 'suck') 0.0831885949118
('on', 'saturday') 0.0827672096072
('my', 'daily') 0.0766571226909
('my', 'mom') 0.0766571226909
('my', 'dog') 0.0766571226909
('a', 'month') 0.0733913865804
('a', 'question') 0.0733913865804
('to', 'pass') 0.0700203041438
('the', 'son') 0.068018723947
('the', 'dark') 0.068018723947
('the', 'past') 0.068018723947
==== #-grams = 3 ====
('one', 'new', 'unfollower') 2.29374909449
('http', 'ipad', 'ipadgames') 1.49704633659
('2', 'new', 'unfollowers') 0.749659068453
('very', 'last', 'word') 0.0902221271564
('i', 'just', 'found') 0.0878684937321
==== #-grams = 4 ====
('and', 'one', 'new', 'unfollower') 0.176934229693
This is all I have for this week. I started off implementing this in order to remind myself of the mechanics of the approach, but it turned out to be quite a lot of fun. Hope you enjoyed it too.
Comments:
Have you tried TwitterNLP from Carnegie Mellon? I believe you can get better entities/collocations and can get rid of Twitterisms :-)
Ravi Kiran Bhaskar
Technical Architect
The Washington Post
Thanks for the pointer, Ravi. No, I haven't; I'll take a look.
Hi Sujit Pal,
It may not be directly related to this post, but could you give me some idea of how to exclude those n-grams that appear in a lower list with one additional token at the end?
For example: "the president of" (trigram), "the president of the" (four-gram), "the president of the states" (five-gram). The five-gram should exclude all the lower n-grams. Thanks a lot :)
Surafel M.
Comp. Science Student
Hi Surafel, we can use the fact that we construct a significant N-gram by joining an (N-1)-gram with a new word to post-process the output and do what you want. For each phrase, you could look for its [:-1] prefix in the phrase list and mark that prefix for deletion. Once you are done, you just keep the phrases that are not marked for deletion. (Deleted my old comment, which involved looking through all shorter phrases; that did not exploit this structure and wasn't as efficient.)
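Something like this sketch of what I described above; removeSubPhrases is just a name for illustration, and it assumes phrases is a dict mapping the significant n-gram tuples to their LLR scores, as built in main() in the code above.

def removeSubPhrases(phrases):
    # for each phrase whose [:-1] prefix is also a phrase, mark that
    # shorter prefix for deletion, then keep only the unmarked phrases
    marked = set()
    for ngram in phrases:
        prefix = ngram[:-1]
        if prefix in phrases:
            marked.add(prefix)
    return dict((k, v) for (k, v) in phrases.iteritems()
                if k not in marked)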