## Saturday, May 25, 2013

### Feature Selection with Scikit-Learn

I am currently doing the Web Intelligence and Big Data course from Coursera, and one of the assignments was to predict a person's ethnicity from a set of about 200,000 genetic markers (provided as boolean values). In other words, a fairly straightforward classification problem.

One of the optimization suggestions for the exercise was to prune the feature set. Prior to this, I had only a vague notion that one could do this by correlating each feature against the outcome and keeping the most highly correlated ones. This seemed like a good opportunity to learn a bit more, so I did some reading and dug around in Scikit-Learn to see whether it had something for this (it does). I also decided to investigate how a classifier's accuracy varies with the number of features. This post is the result of that effort. A rough sketch of the correlation idea appears below.
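To make the intuition concrete, here is a minimal sketch of that correlation idea. This is illustrative only, not code from the assignment; `top_k_by_correlation` and the toy `X` and `y` arrays are hypothetical stand-ins for the marker matrix and ethnicity labels. It scores each feature by the absolute value of its Pearson correlation with the outcome and keeps the top k.

```
import numpy as np

def top_k_by_correlation(X, y, k):
  # score each feature column by |Pearson correlation| with the outcome;
  # a strong negative correlation is as informative as a positive one
  scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                     for j in range(X.shape[1])])
  scores = np.nan_to_num(scores)  # constant columns produce NaN; score them 0
  keep = np.argsort(scores)[::-1][:k]  # indices of the k best-scoring features
  return X[:, keep], keep

# toy stand-ins for the marker matrix and ethnicity labels
X = np.random.randint(0, 2, (139, 1000)).astype(float)
y = np.random.randint(0, 5, 139).astype(float)
Xpruned, kept = top_k_by_correlation(X, y, 50)
print(Xpruned.shape)  # (139, 50)
```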

The IR Book has a sub-chapter on Feature Selection, covering three main approaches: mutual information based, Chi-squared based, and frequency based. Scikit-Learn provides several methods to select features based on Chi-squared and ANOVA F-values for classification. I first learned about Chi-squared feature selection in Scikit-Learn from Matt Spitz's passing reference to it in his Slugger ML talk at PyCon US 2012.
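As a quick illustration of the Scikit-Learn API (a toy example; the arrays here are made up, not the assignment data), `SelectKBest` takes a scoring function and the number of features to keep:

```
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# toy data: 6 instances, 4 boolean features, 2 classes
X = np.array([[0, 1, 1, 0],
              [1, 1, 0, 0],
              [0, 1, 1, 1],
              [1, 0, 0, 1],
              [1, 0, 1, 1],
              [1, 0, 0, 0]])
y = np.array([0, 0, 0, 1, 1, 1])

# keep the 2 features with the highest chi-squared scores
selector = SelectKBest(score_func=chi2, k=2)
Xnew = selector.fit_transform(X, y)
print(Xnew.shape)        # (6, 2)
print(selector.scores_)  # one chi-squared score per original feature
```

Substituting f_classif for chi2 gives the ANOVA F-value variant used later in this post.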

In the code below, I compute the accuracies at various feature sizes for 9 different classifiers, using both the Chi-squared and ANOVA F measures.

```
from __future__ import division

import matplotlib.pyplot as plt
import numpy as np

from sklearn.cross_validation import KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.feature_selection import f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import RidgeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

N_TRAIN_ROWS = 139
N_FEATURES = 204355
N_FOLDS = 10

ethCode = {"CEU": 0, "GIH": 1, "JPT": 2, "ASW": 3, "YRI": 4}

def load(filename, numInstances, numFeatures):
  headerLines = 3
  ftrain = open(filename, 'rb')
  X = np.zeros((numInstances, numFeatures))
  y = np.zeros((numInstances,))
  i = 0
  for line in ftrain:
    i += 1
    if i <= headerLines:
      continue
    cols = line.strip().split("\t")
    # first column holds the ethnicity label, the rest are boolean markers
    y[i - headerLines - 1] = ethCode[cols[0]]
    for j in range(1, len(cols) - 1):
      if cols[j] == "1":
        X[i - headerLines - 1, j] = 1
  ftrain.close()
  return X, y

def evaluate(X, y, nfolds, clf, nfeats, clfname, scoreFunc):
  kfold = KFold(X.shape[0], n_folds=nfolds)
  acc = 0
  i = 0
  print("%s (#-features=%d)..." % (clfname, nfeats))
  for train, test in kfold:
    i += 1
    Xtrain, Xtest, ytrain, ytest = X[train], X[test], y[train], y[test]
    clf.fit(Xtrain, ytrain)
    ypred = clf.predict(Xtest)
    score = accuracy_score(ytest, ypred)
    print "  Fold #%d, accuracy=%f" % (i, score)
    acc += score
  acc /= nfolds
  print "## %s (#-features=%d) accuracy=%f" % (clfname, nfeats, acc)
  return acc

def plot(accuracies, xvals, legends):
  fig = plt.figure()
  ax = fig.add_subplot(111)
  # one color/marker combination per classifier
  cm = [color + marker
        for color in ["b", "g", "r", "c", "m", "y", "b"]
        for marker in ["o", "D"]]
  for i in range(0, accuracies.shape[0]):
    ax.plot(xvals, accuracies[i, :], color=cm[i][0], marker=cm[i][1],
            label=legends[i])
  plt.xlabel("#-Features")
  plt.ylabel("Accuracy")
  plt.title("Accuracy vs #-Features for different classifiers")
  ax.set_xscale("log")
  # shrink the plot so the legend fits below it
  box = ax.get_position()
  ax.set_position([box.x0, box.y0 + box.height * 0.3,
                   box.width, box.height * 0.7])
  ax.legend(loc="upper center", bbox_to_anchor=(0.5, -0.15),
            fancybox=True, shadow=True, ncol=3)
  plt.show()

def main():
  X, y = load("genestrain.tab", N_TRAIN_ROWS, N_FEATURES)
  nFeatures = np.array([N_FEATURES, 50000, 5000, 500, 50, 10])
  clfs = [
    BernoulliNB(),
    MultinomialNB(),
    GaussianNB(),
    DecisionTreeClassifier(),
    RandomForestClassifier(n_estimators=10),
    OneVsRestClassifier(LinearSVC(random_state=0)),
    OneVsRestClassifier(LogisticRegression()),
    OneVsRestClassifier(SGDClassifier()),
    OneVsRestClassifier(RidgeClassifier()),
  ]
  clfnames = map(lambda x: type(x).__name__
                 if type(x).__name__ != 'OneVsRestClassifier'
                 else type(x.estimator).__name__, clfs)
  scoreFuncs = [chi2, f_classif]
  accuracies = np.zeros((len(clfs), len(nFeatures), len(scoreFuncs)))
  for k in range(0, len(scoreFuncs)):
    Xtrunc = X.copy()
    for j in range(0, len(nFeatures)):
      if nFeatures[j] != N_FEATURES:
        featureSelector = SelectKBest(score_func=scoreFuncs[k],
                                      k=nFeatures[j])
        Xtrunc = featureSelector.fit_transform(X, y)
      for i in range(0, len(clfs)):
        accuracies[i, j, k] = evaluate(Xtrunc, y, N_FOLDS, clfs[i],
                                       nFeatures[j], clfnames[i],
                                       scoreFuncs[k])
  # print out the accuracy matrix and plot one graph per score function
  for k in range(0, len(scoreFuncs)):
    for i in range(0, len(clfs)):
      print "%22s " % clfnames[i],
      for j in range(0, accuracies.shape[1]):
        print "%5.3f" % accuracies[i, j, k],
      print
    plot(accuracies[:, :, k], nFeatures, clfnames)

if __name__ == "__main__":
  main()
```
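One thing the script above does not show: after fit_transform, the fitted SelectKBest can also report which columns survived, via its get_support() method, which would be handy for mapping the selected markers back to their identifiers. A minimal sketch, assuming the featureSelector fitted in main() above:

```
# assuming featureSelector has been fitted as in main() above
mask = featureSelector.get_support()  # one boolean flag per original column
keptIndices = np.where(mask)[0]       # column indices of the selected features
print("kept %d of %d features" % (len(keptIndices), X.shape[1]))
```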

The results and graph of accuracies at different feature sizes across the different classifiers, using the Chi-squared measure, are shown below:

```
                          200k    50k     5k    500     50     10
-----------------------------------------------------------------
           BernoulliNB   0.332  0.386  0.454  0.525  0.585  0.233
         MultinomialNB   0.383  0.488  0.505  0.519  0.489  0.158
            GaussianNB   0.294  0.427  0.537  0.568  0.585  0.233
DecisionTreeClassifier   0.566  0.571  0.551  0.576  0.582  0.233
RandomForestClassifier   0.380  0.481  0.544  0.534  0.564  0.233
             LinearSVC   0.406  0.554  0.570  0.585  0.585  0.233
    LogisticRegression   0.391  0.512  0.582  0.585  0.585  0.233
         SGDClassifier   0.401  0.557  0.559  0.585  0.585  0.239
       RidgeClassifier   0.403  0.556  0.568  0.568  0.585  0.233
```

And the same results and graph, but using the ANOVA F measure for feature selection, are shown below:

```
                          200k    50k     5k    500     50     10
-----------------------------------------------------------------
           BernoulliNB   0.332  0.406  0.556  0.544  0.544  0.501
         MultinomialNB   0.383  0.452  0.585  0.585  0.544  0.455
            GaussianNB   0.294  0.436  0.585  0.585  0.585  0.585
DecisionTreeClassifier   0.566  0.566  0.577  0.585  0.585  0.585
RandomForestClassifier   0.363  0.474  0.556  0.563  0.563  0.585
             LinearSVC   0.406  0.559  0.585  0.585  0.585  0.585
    LogisticRegression   0.391  0.527  0.585  0.585  0.585  0.479
         SGDClassifier   0.401  0.475  0.583  0.585  0.563  0.585
       RidgeClassifier   0.403  0.559  0.585  0.585  0.585  0.585
```

As you can see (reading the graph from right to left), accuracy seems to increase as features are pruned, up to a point beyond which there are too few features left for the classifier to draw any reliable conclusions.

#### 2 comments:

Alex said...

Informative post. Do you have the genestrain.tab file so that we can replicate this locally?

Sujit Pal said...

Thanks Alex. Unfortunately, the dataset was part of the Big Data course at Coursera, and according to Coursera rules, I can't share any artifacts. However, I recently came across something very similar being used for this Scikits-Learn talk; you may want to check it out.