Salmon Run: Feature Selection with Scikit-Learn

I am currently doing the Web Intelligence and Big Data course from Coursera, and one of the assignments was to predict a person's ethnicity from a set of about 200,000 genetic markers (provided as boolean values). As you can see, a simple classification problem.

One of the optimization suggestions for the exercise was to prune the featureset. Prior to this, I had only a vague notion that one could do this by running correlations of each feature against the outcome, and choosing the most highly correlated ones. This seemed like a good opportunity to learn a bit about this, so I did some reading and digging within Scikit-Learn to find if they had something to do this (they did). I also decided to investigate how the accuracy of a classifier varies with the feature size. This post is a result of this effort.

The IR Book has a sub-chapter on Feature Selection. Three main approaches to Feature Selection are covered - Mutual Information based, Chi-square based and Frequency based. Scikit-Learn provides several methods to select features based on Chi-Squared and ANOVA F-values for classification. I learned about this from Matt Spitz's passing reference to Chi-squared feature selection in Scikit-Learn in his Slugger ML talk at Pycon USA 2012.

In the code below, I compute the accuracies with various feature sizes for 9 different classifiers, using both the Chi-squared measure and the ANOVA F measures.

from __future__ import division

import matplotlib.pyplot as plt
import numpy as np
from sklearn.cross_validation import KFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.feature_selection import f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import RidgeClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

N_TRAIN_ROWS = 139
N_FEATURES = 204355
N_FOLDS = 10

ethCode = {
  "CEU": 0, "GIH": 1, "JPT": 2, "ASW": 3, "YRI": 4
}

def load(filename, numInstances, numFeatures):
  headerLines = 3
  ftrain = open(filename, 'rb')
  X = np.zeros((numInstances, numFeatures))
  y = np.zeros((numInstances,))
  i = 0
  for line in ftrain:
    i += 1
    if i <= headerLines:
      continue
    line = line.strip()
    cols = line.split("\t")
    y[i - headerLines - 1] = ethCode[cols[0]]
    for j in range(1, len(cols) - 1):
      if cols[j] == "1":
        X[i - headerLines - 1, j] = cols[j]
  ftrain.close()
  return X, y

def evaluate(X, y, nfolds, clf, nfeats, clfname, scoreFunc):
  kfold = KFold(X.shape[0], n_folds=nfolds)
  acc = 0
  i = 0
  print("%s (#-features=%d)..." % (clfname, nfeats))
  for train, test in kfold:
    i += 1
    Xtrain, Xtest, ytrain, ytest = X[test], X[train], y[test], y[train]
    clf.fit(Xtrain, ytrain)
    ypred = clf.predict(Xtest)
    score = accuracy_score(ytest, ypred)
    print "  Fold #%d, accuracy=%f" % (i, score)
    acc += score
  acc /= nfolds
  print "## %s (#-features=%d) accuracy=%f" % (clfname, nfeats, acc)
  return acc

def plot(accuracies, xvals, legends):
  fig = plt.figure()
  ax = fig.add_subplot(111)
  cm = [color + marker
    for color in ["b", "g", "r", "c", "m", "y", "b"]
    for marker in ["o", "D"]]
  for i in range(0, accuracies.shape[0]):
    ax.plot(xvals, accuracies[i, :], color=cm[i][0], 
      marker=cm[i][1], label=legends[i])
  plt.xlabel("#-Features")
  plt.ylabel("Accuracy")
  plt.title("Accuracy vs #-Features for different classifiers")
  ax.set_xscale("log")
  box = ax.get_position()
  ax.set_position([box.x0, box.y0 + box.height * 0.3,
    box.width, box.height * 0.7])
  ax.legend(loc="upper center", bbox_to_anchor=(0.5, -0.15),
    fancybox=True, shadow=True, ncol=3)
  plt.show()
  
def main():
  X, y = load("genestrain.tab", N_TRAIN_ROWS, N_FEATURES)
  nFeatures = np.array([N_FEATURES, 50000, 5000, 500, 50, 10])
  clfs = [
    BernoulliNB(),
    MultinomialNB(),
    GaussianNB(),
    DecisionTreeClassifier(),
    RandomForestClassifier(n_estimators=10),
    OneVsRestClassifier(LinearSVC(random_state=0)),
    OneVsRestClassifier(LogisticRegression()),
    OneVsRestClassifier(SGDClassifier()),
    OneVsRestClassifier(RidgeClassifier()),
  ]
  clfnames = map(lambda x: type(x).__name__
    if type(x).__name__ != 'OneVsRestClassifier'
    else type(x.estimator).__name__, clfs)
  scoreFuncs = [chi2, f_classif]
  accuracies = np.zeros((len(clfs), len(nFeatures), len(scoreFuncs)))
  for k in range(0, len(scoreFuncs)):
    Xtrunc = X.copy()
    for j in range(0, len(nFeatures)):
      if nFeatures[j] != N_FEATURES:
        featureSelector = SelectKBest(score_func=scoreFuncs[k], k=nFeatures[j])
        Xtrunc = featureSelector.fit_transform(X, y)
      for i in range(0, len(clfs)):
        accuracies[i, j, k] = evaluate(Xtrunc, y, N_FOLDS, clfs[i],
          nFeatures[j], clfnames[i], scoreFuncs[k])
  # print out accuracy matrix
  for k in range(0, len(scoreFuncs)):
    for i in range(0, len(clfs)):
      print "%22s " % clfnames[i],
      for j in range(0, accuracies.shape[1]):
        print "%5.3f" % accuracies[i, j, k],
      print
    plot(accuracies[:, :, k], nFeatures, clfnames)

if __name__ == "__main__":
  main()

The results and graph of accuracies for different feature sizes on different classifiers for the Chi-squared measure is shown below:

200k  500k    5k   500    50    10
-----------------------------------------------------------
           BernoulliNB  0.332 0.386 0.454 0.525 0.585 0.233
         MultinomialNB  0.383 0.488 0.505 0.519 0.489 0.158
            GaussianNB  0.294 0.427 0.537 0.568 0.585 0.233
DecisionTreeClassifier  0.566 0.571 0.551 0.576 0.582 0.233
RandomForestClassifier  0.380 0.481 0.544 0.534 0.564 0.233
             LinearSVC  0.406 0.554 0.570 0.585 0.585 0.233
    LogisticRegression  0.391 0.512 0.582 0.585 0.585 0.233
         SGDClassifier  0.401 0.557 0.559 0.585 0.585 0.239
       RidgeClassifier  0.403 0.556 0.568 0.568 0.585 0.233

And the same results and graph, but using the ANOVA F measure for feature selection, is shown below:

200k  500k    5k   500    50    10
-----------------------------------------------------------
           BernoulliNB  0.332 0.406 0.556 0.544 0.544 0.501
         MultinomialNB  0.383 0.452 0.585 0.585 0.544 0.455
            GaussianNB  0.294 0.436 0.585 0.585 0.585 0.585
DecisionTreeClassifier  0.566 0.566 0.577 0.585 0.585 0.585
RandomForestClassifier  0.363 0.474 0.556 0.563 0.563 0.585
             LinearSVC  0.406 0.559 0.585 0.585 0.585 0.585
    LogisticRegression  0.391 0.527 0.585 0.585 0.585 0.479
         SGDClassifier  0.401 0.475 0.583 0.585 0.563 0.585
       RidgeClassifier  0.403 0.559 0.585 0.585 0.585 0.585

As you can see (reading right to left on the graph), the accuracy seems to increase as the number of features are pruned, until a point beyond which there seems to be too few features for the classifier to make any reliable conclusions.

2 comments:

Alex2/06/2014 5:55 AM
Informative post. Do you have the genestrain.tab file so that we can replicate this locally?
Sujit Pal2/06/2014 11:11 AM
Thanks Alex. Unfortunately, the dataset was part of the BigData course at Coursera, and according to Coursera rules, I can't share any artifacts. However, recently I came across something very similar being used for this Scikits-Learn talk, you may want to check it out.

Comments are moderated to prevent spam.

Saturday, May 25, 2013

Feature Selection with Scikit-Learn

2 comments: