## Saturday, May 25, 2013

### Feature Selection with Scikit-Learn

I am currently doing the Web Intelligence and Big Data course from Coursera, and one of the assignments was to predict a person's ethnicity from a set of about 200,000 genetic markers (provided as boolean values). As you can see, a simple classification problem.

One of the optimization suggestions for the exercise was to prune the featureset. Prior to this, I had only a vague notion that one could do this by running correlations of each feature against the outcome, and choosing the most highly correlated ones. This seemed like a good opportunity to learn a bit about this, so I did some reading and digging within Scikit-Learn to find if they had something to do this (they did). I also decided to investigate how the accuracy of a classifier varies with the feature size. This post is a result of this effort.

The IR Book has a sub-chapter on Feature Selection. Three main approaches to Feature Selection are covered - Mutual Information based, Chi-square based and Frequency based. Scikit-Learn provides several methods to select features based on Chi-Squared and ANOVA F-values for classification. I learned about this from Matt Spitz's passing reference to Chi-squared feature selection in Scikit-Learn in his Slugger ML talk at Pycon USA 2012.

In the code below, I compute the accuracies with various feature sizes for 9 different classifiers, using both the Chi-squared measure and the ANOVA F measures.

 ```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123``` ```from __future__ import division import matplotlib.pyplot as plt import numpy as np from sklearn.cross_validation import KFold from sklearn.ensemble import RandomForestClassifier from sklearn.feature_selection import SelectKBest from sklearn.feature_selection import chi2 from sklearn.feature_selection import f_classif from sklearn.linear_model import LogisticRegression from sklearn.linear_model import RidgeClassifier from sklearn.linear_model import SGDClassifier from sklearn.metrics import accuracy_score from sklearn.multiclass import OneVsRestClassifier from sklearn.naive_bayes import BernoulliNB from sklearn.naive_bayes import GaussianNB from sklearn.naive_bayes import MultinomialNB from sklearn.svm import LinearSVC from sklearn.tree import DecisionTreeClassifier N_TRAIN_ROWS = 139 N_FEATURES = 204355 N_FOLDS = 10 ethCode = { "CEU": 0, "GIH": 1, "JPT": 2, "ASW": 3, "YRI": 4 } def load(filename, numInstances, numFeatures): headerLines = 3 ftrain = open(filename, 'rb') X = np.zeros((numInstances, numFeatures)) y = np.zeros((numInstances,)) i = 0 for line in ftrain: i += 1 if i <= headerLines: continue line = line.strip() cols = line.split("\t") y[i - headerLines - 1] = ethCode[cols] for j in range(1, len(cols) - 1): if cols[j] == "1": X[i - headerLines - 1, j] = cols[j] ftrain.close() return X, y def evaluate(X, y, nfolds, clf, nfeats, clfname, scoreFunc): kfold = KFold(X.shape, n_folds=nfolds) acc = 0 i = 0 print("%s (#-features=%d)..." % (clfname, nfeats)) for train, test in kfold: i += 1 Xtrain, Xtest, ytrain, ytest = X[test], X[train], y[test], y[train] clf.fit(Xtrain, ytrain) ypred = clf.predict(Xtest) score = accuracy_score(ytest, ypred) print " Fold #%d, accuracy=%f" % (i, score) acc += score acc /= nfolds print "## %s (#-features=%d) accuracy=%f" % (clfname, nfeats, acc) return acc def plot(accuracies, xvals, legends): fig = plt.figure() ax = fig.add_subplot(111) cm = [color + marker for color in ["b", "g", "r", "c", "m", "y", "b"] for marker in ["o", "D"]] for i in range(0, accuracies.shape): ax.plot(xvals, accuracies[i, :], color=cm[i], marker=cm[i], label=legends[i]) plt.xlabel("#-Features") plt.ylabel("Accuracy") plt.title("Accuracy vs #-Features for different classifiers") ax.set_xscale("log") box = ax.get_position() ax.set_position([box.x0, box.y0 + box.height * 0.3, box.width, box.height * 0.7]) ax.legend(loc="upper center", bbox_to_anchor=(0.5, -0.15), fancybox=True, shadow=True, ncol=3) plt.show() def main(): X, y = load("genestrain.tab", N_TRAIN_ROWS, N_FEATURES) nFeatures = np.array([N_FEATURES, 50000, 5000, 500, 50, 10]) clfs = [ BernoulliNB(), MultinomialNB(), GaussianNB(), DecisionTreeClassifier(), RandomForestClassifier(n_estimators=10), OneVsRestClassifier(LinearSVC(random_state=0)), OneVsRestClassifier(LogisticRegression()), OneVsRestClassifier(SGDClassifier()), OneVsRestClassifier(RidgeClassifier()), ] clfnames = map(lambda x: type(x).__name__ if type(x).__name__ != 'OneVsRestClassifier' else type(x.estimator).__name__, clfs) scoreFuncs = [chi2, f_classif] accuracies = np.zeros((len(clfs), len(nFeatures), len(scoreFuncs))) for k in range(0, len(scoreFuncs)): Xtrunc = X.copy() for j in range(0, len(nFeatures)): if nFeatures[j] != N_FEATURES: featureSelector = SelectKBest(score_func=scoreFuncs[k], k=nFeatures[j]) Xtrunc = featureSelector.fit_transform(X, y) for i in range(0, len(clfs)): accuracies[i, j, k] = evaluate(Xtrunc, y, N_FOLDS, clfs[i], nFeatures[j], clfnames[i], scoreFuncs[k]) # print out accuracy matrix for k in range(0, len(scoreFuncs)): for i in range(0, len(clfs)): print "%22s " % clfnames[i], for j in range(0, accuracies.shape): print "%5.3f" % accuracies[i, j, k], print plot(accuracies[:, :, k], nFeatures, clfnames) if __name__ == "__main__": main() ```

The results and graph of accuracies for different feature sizes on different classifiers for the Chi-squared measure is shown below:

 ```1 2 3 4 5 6 7 8 9 10 11``` ```200k 500k 5k 500 50 10 ----------------------------------------------------------- BernoulliNB 0.332 0.386 0.454 0.525 0.585 0.233 MultinomialNB 0.383 0.488 0.505 0.519 0.489 0.158 GaussianNB 0.294 0.427 0.537 0.568 0.585 0.233 DecisionTreeClassifier 0.566 0.571 0.551 0.576 0.582 0.233 RandomForestClassifier 0.380 0.481 0.544 0.534 0.564 0.233 LinearSVC 0.406 0.554 0.570 0.585 0.585 0.233 LogisticRegression 0.391 0.512 0.582 0.585 0.585 0.233 SGDClassifier 0.401 0.557 0.559 0.585 0.585 0.239 RidgeClassifier 0.403 0.556 0.568 0.568 0.585 0.233 ``` And the same results and graph, but using the ANOVA F measure for feature selection, is shown below:

 ```1 2 3 4 5 6 7 8 9 10 11``` ```200k 500k 5k 500 50 10 ----------------------------------------------------------- BernoulliNB 0.332 0.406 0.556 0.544 0.544 0.501 MultinomialNB 0.383 0.452 0.585 0.585 0.544 0.455 GaussianNB 0.294 0.436 0.585 0.585 0.585 0.585 DecisionTreeClassifier 0.566 0.566 0.577 0.585 0.585 0.585 RandomForestClassifier 0.363 0.474 0.556 0.563 0.563 0.585 LinearSVC 0.406 0.559 0.585 0.585 0.585 0.585 LogisticRegression 0.391 0.527 0.585 0.585 0.585 0.479 SGDClassifier 0.401 0.475 0.583 0.585 0.563 0.585 RidgeClassifier 0.403 0.559 0.585 0.585 0.585 0.585 ``` As you can see (reading right to left on the graph), the accuracy seems to increase as the number of features are pruned, until a point beyond which there seems to be too few features for the classifier to make any reliable conclusions.

## Saturday, May 11, 2013

### Finding Significant Phrases in Tweets with NLTK

Earlier this week, there was a question about finding significant phrases in text on the Natural Language Processing People (login required) group on LinkedIn. I suggested looking at this LingPipe tutorial. The idea is to find statistically significant word collocations, ie, those that occur more frequently than we can explain away as due to chance. I first became aware of this approach from the LLG Book, where two approaches are described - one based on Log-Likelihood Ratios (LLR) and one based on the Chi-Squared test of independence - the latter is used by LingPipe.

I had originally set out to actually provide an implementation for my suggestion (to answer a followup question). However, the Scipy Pydoc notes that the chi-squared test may be invalid when the number of observed or expected frequencies in each category are too small. Our algorithm compares just two observed and expected frequencies, so it probably qualifies. Hence I went with the LLR approach, even though it is slightly more involved.

The idea is to find, for each bigram pair, the likelihood that the components are dependent on each other versus the likelihood that they are not. For bigrams which have a positive LLR, we repeat the analysis by adding its neighbor word, and arrive at a list of trigrams with positive LLR, and so on, until we reach the N-gram level we think makes sense for the corpus. You can find an explanation of the math in one of my earlier posts, but you will probably find a better explanation in the LLG book.

For input data, I decided to use Twitter. I'm not that familiar with the Twitter API, but I'm taking the Introduction to Data Science course on Coursera, and the first assignment provided some code to pull data from the Twitter 1% feed, so I just reused that. I preprocess the feed so I am left with about 65k English tweets using the following code:

 ```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29``` ```from __future__ import division import json import sys def main(): if len(sys.argv) != 2: print "Usage: %s /path/to/twitter/json/list.txt" sys.exit(-1) fin = open(sys.argv, 'rb') fout = open("twitter_messages.txt", 'wb') for line in fin: try: data = json.loads(line.strip()) lang = data["lang"] if lang == "en": tweet = data["text"] tweet = tweet.replace("\n", " ").replace("\\s+", " ") tweet = tweet.encode("ascii", "ignore") if len(tweet) == 0: continue fout.write("%s\n" % (tweet)) except KeyError: continue fin.close() fout.close() if __name__ == "__main__": main() ```

The main code for extracting interesting phrases implements the math described above. The main difference is that I extend the approach to N-grams higher than 2. For trigrams, for example, I consider all trigrams which contain the bigrams with LLR > 0 as the first two words, and so on. The code consider upto 5-grams (although increasing it is simply a matter of changing the upper limit in the code). As it turns out, the largest significant N-gram in my data turns out to be a 4-gram.

Another important difference is that I have not used any heuristic to prune the N-gram set, other than LLR > 0. The LLG book prescibes additional heuristics for this such as keeping ngrams of nouns, verbs, adjectives or prepositions only, excluding all phrases that don't end with nouns, excluding phrases consisting of all stopwords, etc. I wasn't sure how effective they would be with Tweets, so I skipped them.

So anyway, here's the code. As always, all the code described in this post can be found in my github page for this subproject. The data is not included because I am not sure if there are legal implications to doing so, and anyway, its simple and painless enough to grab some data off Twitter and build up your own dataset.

 ```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83``` ```from __future__ import division import operator import nltk import numpy as np from scipy.stats import binom import string def isValid(word): if word.startswith("#"): return False # no hashtag else: vword = word.translate(string.maketrans("", ""), string.punctuation) return len(vword) == len(word) def llr(c1, c2, c12, n): # H0: Independence p(w1,w2) = p(w1,~w2) = c2/N p0 = c2 / n # H1: Dependence, p(w1,w2) = c12/N p10 = c12 / n # H1: p(~w1,w2) = (c2-c12)/N p11 = (c2 - c12) / n # binomial probabilities # H0: b(c12; c1, p0), b(c2-c12; N-c1, p0) # H1: b(c12, c1, p10), b(c2-c12; N-c1, p11) probs = np.matrix([ [binom(c1, p0).logpmf(c12), binom(n - c1, p0).logpmf(c2 - c12)], [binom(c1, p10).logpmf(c12), binom(n - c1, p11).logpmf(c2 - c12)]]) # LLR = p(H1) / p(H0) return np.sum(probs[1, :]) - np.sum(probs[0, :]) def isLikelyNGram(ngram, phrases): if len(ngram) == 2: return True prevGram = ngram[:-1] return phrases.has_key(prevGram) def main(): # accumulate words and word frequency distributions lines = [] unigramFD = nltk.FreqDist() fin = open("twitter_messages.txt", 'rb') i = 0 for line in fin: i += 1 words = nltk.word_tokenize(line.strip().lower()) words = filter(lambda x: isValid(x), words) [unigramFD.inc(x) for x in words] lines.append(words) if i > 1000: break fin.close() # identify likely phrases using a multi-pass algorithm based # on the LLR approach described in the Building Search Applications # Lucene, LingPipe and GATE book, except that we treat n-gram # collocations beyond 2 as n-1 gram plus a unigram. phrases = nltk.defaultdict(float) prevGramFD = None for i in range(2, 5): ngramFD = nltk.FreqDist() for words in lines: nextGrams = nltk.ngrams(words, i) nextGrams = filter(lambda x: isLikelyNGram(x, phrases), nextGrams) [ngramFD.inc(x) for x in nextGrams] for k, v in ngramFD.iteritems(): if v > 1: c1 = unigramFD[k] if prevGramFD == None else prevGramFD[k[:-1]] c2 = unigramFD[k] if prevGramFD == None else unigramFD[k[len(k) - 1]] c12 = ngramFD[k] n = unigramFD.N() if prevGramFD == None else prevGramFD.N() phrases[k] = llr(c1, c2, c12, n) # only consider bigrams where LLR > 0, ie P(H1) > P(H0) likelyPhrases = nltk.defaultdict(float) likelyPhrases.update([(k, v) for (k, v) in phrases.iteritems() if len(k) == i and v > 0]) print "==== #-grams = %d ====" % (i) sortedPhrases = sorted(likelyPhrases.items(), key=operator.itemgetter(1), reverse=True) for k, v in sortedPhrases: print k, v prevGramFD = ngramFD if __name__ == "__main__": main() ```

Here are the results of my run. As you can see, some of these, such as "solar eclipse", "the fact", etc, seem quite reasonable, and things like "one new unfollower" appear to be popular Twitterisms.

 ```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41``` ```sujit@cyclone:phrases\$ python interesting_phrases.py ==== #-grams = 2 ==== ('today', 'stats') 4.57956434533 ('http', 'ipad') 1.45257884653 ('the', 'same') 1.43108593163 ('i', 'thought') 1.37514113846 ('new', 'followers') 0.876128729563 ('ea', 'ea') 0.749974634332 ('my', 'favorite') 0.723215257916 ('a', 'lot') 0.716683441626 ('the', 'fact') 0.705937550309 ('i', 'said') 0.668642986328 ('my', 'head') 0.153309860951 ('i', 'guess') 0.0987318394656 ('solar', 'eclipse') 0.0902467987635 ('fanboys', 'de') 0.0902467987635 ('anybody', 'else') 0.0901414524373 ('treat', 'yourself') 0.089930759785 ('last', 'word') 0.089193335502 ('who', 'wants') 0.0874024479576 ('just', 'found') 0.084031365521 ('this', 'whole') 0.0831885949118 ('this', 'suck') 0.0831885949118 ('on', 'saturday') 0.0827672096072 ('my', 'daily') 0.0766571226909 ('my', 'mom') 0.0766571226909 ('my', 'dog') 0.0766571226909 ('a', 'month') 0.0733913865804 ('a', 'question') 0.0733913865804 ('to', 'pass') 0.0700203041438 ('the', 'son') 0.068018723947 ('the', 'dark') 0.068018723947 ('the', 'past') 0.068018723947 ==== #-grams = 3 ==== ('one', 'new', 'unfollower') 2.29374909449 ('http', 'ipad', 'ipadgames') 1.49704633659 ('2', 'new', 'unfollowers') 0.749659068453 ('very', 'last', 'word') 0.0902221271564 ('i', 'just', 'found') 0.0878684937321 ==== #-grams = 4 ==== ('and', 'one', 'new', 'unfollower') 0.176934229693 ```

This is all I have for this week. I started off implementing this in order to remind myself of the mechanics of the approach, but it turned out to be quite a lot of fun. Hope you enjoyed it too.

## Friday, May 03, 2013

### Inter-Document Similarity with Scikit-Learn and NLTK

Someone recently asked me about using Python to calculate document similarity across text documents. The application had to do with cheating detection, ie, compare student transcripts and flag documents with (abnormally) high similarity for further investigation. For security reasons, I could not get access to actual student transcripts. But the basic idea was to convince ourselves that this approach is valid, and come up with a code template for doing this.

I have been playing quite a bit with NLTK lately, but for this work, I decided to use the Python ML Toolkit Scikit-Learn, which has pretty powerful text processing facilities. I did end up using NLTK for its cosine similarity function, but that was about it.

I decided to use the coffee-sugar-cocoa mini-corpus of 53 documents to test out the code - I first found this in Dr Manu Konchady's TextMine project, and I have used it off and on. For convenience I have made it available at the github location for the sub-project.

Heres the code. It first parses this data into a temporary one line per file version, then vectorizes each line and creates a term document matrix for the corpus.

 ```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80``` ```from __future__ import division from operator import itemgetter from nltk.cluster.util import cosine_distance import numpy as np import random import re from sklearn.feature_extraction.text import CountVectorizer from sklearn.feature_extraction.text import TfidfTransformer from sklearn.pipeline import Pipeline def preprocess(fnin, fnout): fin = open(fnin, 'rb') fout = open(fnout, 'wb') buf = [] id = "" category = "" for line in fin: line = line.strip() if line.find("-- Document Separator --") > -1: if len(buf) > 0: # write out body, body = re.sub("\s+", " ", " ".join(buf)) fout.write("%s\t%s\t%s\n" % (id, category, body)) # process next header and init buf id, category, rest = map(lambda x: x.strip(), line.split(": ")) buf = [] else: # process body buf.append(line) fin.close() fout.close() def train(fnin): docs = [] cats = [] fin = open(fnin, 'rb') for line in fin: id, category, body = line.strip().split("\t") docs.append(body) cats.append(category) fin.close() pipeline = Pipeline([ ("vect", CountVectorizer(min_df=0, stop_words="english")), ("tfidf", TfidfTransformer(use_idf=False))]) tdMatrix = pipeline.fit_transform(docs, cats) return tdMatrix, cats def test(tdMatrix, cats): testIds = random.sample(range(0, len(cats)), int(0.1 * len(cats))) testIdSet = set(testIds) refIds = filter(lambda x: x not in testIdSet, range(0, len(cats))) sims = np.zeros((len(testIds), len(refIds))) for i in range(0, len(testIds)): for j in range(0, len(refIds)): doc1 = np.asarray(tdMatrix[testIds[i], :].todense()).reshape(-1) doc2 = np.asarray(tdMatrix[refIds[j], :].todense()).reshape(-1) sims[i, j] = cosine_distance(doc1, doc2) for i in range(0, sims.shape): xsim = list(enumerate(sims[i, :])) sortedSims = sorted(xsim, key=itemgetter(1), reverse=True)[0:5] sourceCat = cats[testIds[i]] numMatchedCats = 0 numTestedCats = 0 for j, score in sortedSims: targetCat = cats[j] if sourceCat == targetCat: numMatchedCats += 1 numTestedCats += 1 print("Test Doc: %d, Source Category: %s, Target Matched: %d/%d times" % (i, sourceCat, numMatchedCats, numTestedCats)) def main(): preprocess("sugar-coffee-cocoa-docs.txt", "sccpp.txt") tdMatrix, cats = train("sccpp.txt") test(tdMatrix, cats) if __name__ == "__main__": main() ```

The code then pulls out five random documents and tries to find the five most similar documents to it, and counts how many are in the same category as itself. As you can see, the results don't look too bad.

 ```1 2 3 4 5 6``` ```sujit@cyclone:docsim\$ python docsim.py Test Doc: 0, Source Category: coffee, Target Matched: 3/5 times Test Doc: 1, Source Category: cocoa, Target Matched: 2/5 times Test Doc: 2, Source Category: sugar, Target Matched: 3/5 times Test Doc: 3, Source Category: cocoa, Target Matched: 1/5 times Test Doc: 4, Source Category: sugar, Target Matched: 2/5 times ```

Few things I tried was to turn IDF on (its currently off), and to use Eucledian distance instead of Cosine distance. The first actually made the results slightly worse over a set of runs, and the second did not seem to have any effect.

One thing to note is that this approach is unlikely to scale. Computing document-to-document similarity across the entire corpus (as you would need to do for a group of students) is an O(n2) operation. I have recently been reading up about Locally Sensitive Hashing that can mitigate this problem by being able to detect document clumps. Another approach may be to cluster the documents first and only consider inter-document similarities for those within each cluster.

And I guess thats it for today. The approach I took with Scikit-Learn is heavily inspired by the code in the text processing tutorial linked to above. For a slightly more math and code heavy treatment of text processing with Scikit-Learn check out Christian Perone's awesome blog posts here and here.

Update (2013-05-08) - I've been reading the Building Search Applications: Lucene, LingPipe and GATE by Dr Manu Konchady (henceforth referred to as the LLG book in this blog), and I came upon the SCAM (Standard Copy Analysis Mechanism) Algorithm for Plagiarism Detection, and it seemed better suited for my cheating domain than Cosine Similarity. Here is some Python code that implements the SCAM measure between two documents. This code and the update to the original code to use this can be found in my GitHub page for this subproject.

 ```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29``` ```from __future__ import division import numpy as np import scipy.sparse as ss def _s_pos_or_zero(x): return x if x > 0 else 0 def _s_zero_mask(x, y): return 0 if y == 0 else x def _s_safe_divide(x, y): return 0 if x == 0 or y == 0 else x / y _v_pos_or_zero = np.vectorize(_s_pos_or_zero) _v_zero_mask = np.vectorize(_s_zero_mask) _v_safe_divide = np.vectorize(_s_safe_divide) def _assymetric_subset_measure(doc1, doc2): epsilon = np.ones(doc1.shape) * 2 filtered = _v_pos_or_zero(epsilon - (_v_safe_divide(doc1, doc2) + _v_safe_divide(doc2, doc1))) zdoc1 = _v_zero_mask(doc1, filtered) zdoc2 = _v_zero_mask(doc2, filtered) return np.sum(np.dot(zdoc1, zdoc2)) / np.sum(np.dot(doc1, doc2)) def scam_distance(doc1, doc2): asm12 = _assymetric_subset_measure(doc1, doc2) asm21 = _assymetric_subset_measure(doc2, doc1) return max(asm12, asm21) ```

I decided to check this out by calculating the SCAM distance (should really be called SCAM similarity, since higher values indicate closeness) between my previous post, this post without the update and this post with the update. My expectation is that the score for the first pair should be lower than that for the second pair, and indeed it is so, giving 0.655172413793 and 0.954821894005 respectively.