I am currently doing the Web Intelligence and Big Data course from Coursera, and one of the assignments was to predict a person's ethnicity from a set of about 200,000 genetic markers (provided as boolean values). As you can see, a simple classification problem.
One of the optimization suggestions for the exercise was to prune the featureset. Prior to this, I had only a vague notion that one could do this by running correlations of each feature against the outcome, and choosing the most highly correlated ones. This seemed like a good opportunity to learn a bit about this, so I did some reading and digging within Scikit-Learn to find if they had something to do this (they did). I also decided to investigate how the accuracy of a classifier varies with the feature size. This post is a result of this effort.
The IR Book has a sub-chapter on Feature Selection. Three main approaches to Feature Selection are covered - Mutual Information based, Chi-square based and Frequency based. Scikit-Learn provides several methods to select features based on Chi-Squared and ANOVA F-values for classification. I learned about this from Matt Spitz's passing reference to Chi-squared feature selection in Scikit-Learn in his Slugger ML talk at Pycon USA 2012.
In the code below, I compute the accuracies with various feature sizes for 9 different classifiers, using both the Chi-squared measure and the ANOVA F measures.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 | from __future__ import division
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cross_validation import KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.feature_selection import f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import RidgeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
N_TRAIN_ROWS = 139
N_FEATURES = 204355
N_FOLDS = 10
ethCode = {
"CEU": 0, "GIH": 1, "JPT": 2, "ASW": 3, "YRI": 4
}
def load(filename, numInstances, numFeatures):
headerLines = 3
ftrain = open(filename, 'rb')
X = np.zeros((numInstances, numFeatures))
y = np.zeros((numInstances,))
i = 0
for line in ftrain:
i += 1
if i <= headerLines:
continue
line = line.strip()
cols = line.split("\t")
y[i - headerLines - 1] = ethCode[cols[0]]
for j in range(1, len(cols) - 1):
if cols[j] == "1":
X[i - headerLines - 1, j] = cols[j]
ftrain.close()
return X, y
def evaluate(X, y, nfolds, clf, nfeats, clfname, scoreFunc):
kfold = KFold(X.shape[0], n_folds=nfolds)
acc = 0
i = 0
print("%s (#-features=%d)..." % (clfname, nfeats))
for train, test in kfold:
i += 1
Xtrain, Xtest, ytrain, ytest = X[test], X[train], y[test], y[train]
clf.fit(Xtrain, ytrain)
ypred = clf.predict(Xtest)
score = accuracy_score(ytest, ypred)
print " Fold #%d, accuracy=%f" % (i, score)
acc += score
acc /= nfolds
print "## %s (#-features=%d) accuracy=%f" % (clfname, nfeats, acc)
return acc
def plot(accuracies, xvals, legends):
fig = plt.figure()
ax = fig.add_subplot(111)
cm = [color + marker
for color in ["b", "g", "r", "c", "m", "y", "b"]
for marker in ["o", "D"]]
for i in range(0, accuracies.shape[0]):
ax.plot(xvals, accuracies[i, :], color=cm[i][0],
marker=cm[i][1], label=legends[i])
plt.xlabel("#-Features")
plt.ylabel("Accuracy")
plt.title("Accuracy vs #-Features for different classifiers")
ax.set_xscale("log")
box = ax.get_position()
ax.set_position([box.x0, box.y0 + box.height * 0.3,
box.width, box.height * 0.7])
ax.legend(loc="upper center", bbox_to_anchor=(0.5, -0.15),
fancybox=True, shadow=True, ncol=3)
plt.show()
def main():
X, y = load("genestrain.tab", N_TRAIN_ROWS, N_FEATURES)
nFeatures = np.array([N_FEATURES, 50000, 5000, 500, 50, 10])
clfs = [
BernoulliNB(),
MultinomialNB(),
GaussianNB(),
DecisionTreeClassifier(),
RandomForestClassifier(n_estimators=10),
OneVsRestClassifier(LinearSVC(random_state=0)),
OneVsRestClassifier(LogisticRegression()),
OneVsRestClassifier(SGDClassifier()),
OneVsRestClassifier(RidgeClassifier()),
]
clfnames = map(lambda x: type(x).__name__
if type(x).__name__ != 'OneVsRestClassifier'
else type(x.estimator).__name__, clfs)
scoreFuncs = [chi2, f_classif]
accuracies = np.zeros((len(clfs), len(nFeatures), len(scoreFuncs)))
for k in range(0, len(scoreFuncs)):
Xtrunc = X.copy()
for j in range(0, len(nFeatures)):
if nFeatures[j] != N_FEATURES:
featureSelector = SelectKBest(score_func=scoreFuncs[k], k=nFeatures[j])
Xtrunc = featureSelector.fit_transform(X, y)
for i in range(0, len(clfs)):
accuracies[i, j, k] = evaluate(Xtrunc, y, N_FOLDS, clfs[i],
nFeatures[j], clfnames[i], scoreFuncs[k])
# print out accuracy matrix
for k in range(0, len(scoreFuncs)):
for i in range(0, len(clfs)):
print "%22s " % clfnames[i],
for j in range(0, accuracies.shape[1]):
print "%5.3f" % accuracies[i, j, k],
print
plot(accuracies[:, :, k], nFeatures, clfnames)
if __name__ == "__main__":
main()
|
The results and graph of accuracies for different feature sizes on different classifiers for the Chi-squared measure is shown below:
1 2 3 4 5 6 7 8 9 10 11 | 200k 500k 5k 500 50 10
-----------------------------------------------------------
BernoulliNB 0.332 0.386 0.454 0.525 0.585 0.233
MultinomialNB 0.383 0.488 0.505 0.519 0.489 0.158
GaussianNB 0.294 0.427 0.537 0.568 0.585 0.233
DecisionTreeClassifier 0.566 0.571 0.551 0.576 0.582 0.233
RandomForestClassifier 0.380 0.481 0.544 0.534 0.564 0.233
LinearSVC 0.406 0.554 0.570 0.585 0.585 0.233
LogisticRegression 0.391 0.512 0.582 0.585 0.585 0.233
SGDClassifier 0.401 0.557 0.559 0.585 0.585 0.239
RidgeClassifier 0.403 0.556 0.568 0.568 0.585 0.233
|
And the same results and graph, but using the ANOVA F measure for feature selection, is shown below:
1 2 3 4 5 6 7 8 9 10 11 | 200k 500k 5k 500 50 10
-----------------------------------------------------------
BernoulliNB 0.332 0.406 0.556 0.544 0.544 0.501
MultinomialNB 0.383 0.452 0.585 0.585 0.544 0.455
GaussianNB 0.294 0.436 0.585 0.585 0.585 0.585
DecisionTreeClassifier 0.566 0.566 0.577 0.585 0.585 0.585
RandomForestClassifier 0.363 0.474 0.556 0.563 0.563 0.585
LinearSVC 0.406 0.559 0.585 0.585 0.585 0.585
LogisticRegression 0.391 0.527 0.585 0.585 0.585 0.479
SGDClassifier 0.401 0.475 0.583 0.585 0.563 0.585
RidgeClassifier 0.403 0.559 0.585 0.585 0.585 0.585
|
As you can see (reading right to left on the graph), the accuracy seems to increase as the number of features are pruned, until a point beyond which there seems to be too few features for the classifier to make any reliable conclusions.
Informative post. Do you have the genestrain.tab file so that we can replicate this locally?
ReplyDeleteThanks Alex. Unfortunately, the dataset was part of the BigData course at Coursera, and according to Coursera rules, I can't share any artifacts. However, recently I came across something very similar being used for this Scikits-Learn talk, you may want to check it out.
ReplyDelete