Salmon Run: Using Random Forests to find Interesting Features on Comets

Monday, January 19, 2015

Using Random Forests to find Interesting Features on Comets

This is another exercise from the Caltech/JPL Summer School on Big Data Analytics, specifically the one following the lectures on Decision Trees and Random Forests by Thomas Fuchs of JPL. The object of the exercise is to predict interesting features on comets and asteroids as space probes fly by them, so that onboard cameras can take detailed pictures of those spots autonomously. Manual control from the ground is not an option because time lags are significant at these distances. The exercise asks to build and train a Random Forest classifier and investigate its performance.

The dataset provided is of surface features from the comet Hartley. There are 2,089 structured observations provided as a R dataset (.rdata). The structure of the dataset is as follows:

struct data {
  float[11][11] gray;      // grayscale values normalized 0..1
  float[11][11] median;    // grayscale for median filtered image (0..1)
  int Y;                   // 0 = boring; 1 = interesting
  int[4] rect;             // bounding box of the observation
  char* frameName;
  int frameId;
}

Since I prefer to work in Python, I decided to build my classifier using the Scikit-Learn toolkit. I initially tried the rpy2 to read this data from within Python, but it complained that the .rdata file was corrupt. So I switched around and used a combination of my rudimentary R skills and the starter code to write the data out from R into flat files, using the code below:

# Source: rdata_reader.R
setwd("/.../mlia-examples/data/hartley")
dat <- dget("patchesHartley2.rdat")

y <- sapply(dat, function(x) x$Y)
write.table(y, file="y.csv", row.names=FALSE, col.names=FALSE)

Xgray <- matrix(nrow=length(dat), ncol=121)
Xmedian <- matrix(nrow=length(dat), ncol=121)
for (i in 1:length(dat)) {
  Xgray[i, ] = as.vector(dat[[i]]$gray)
  Xmedian[i, ] = as.vector(dat[[i]]$median)
}
write.table(Xgray, file="Xgray.csv", row.names=FALSE, col.names=FALSE)
write.table(Xmedian, file="Xmedian.csv", row.names=FALSE, col.names=FALSE)

This creates 3 files - y.csv with 2,089 target variables one per line, and Xgray.csv and Xmedian.csv, each also with 2,089 lines and 121 (11x11) grayscale values per line.

Our Python code (shown below) builds various models with different combinations of Xgray and Xmedian, and measures their OOB (Out of Bag) error estimates. As mentioned in article, in case of Random Forests, cross validation is not required to get an estimate of the test set error. The OOB is estimated internally and serves as a good estimate.

# Source: hartley_classify.py
from operator import itemgetter
from sklearn.cross_validation import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, auc
import numpy as np
import matplotlib.pyplot as plt
import os

DATA_DIR = "../../data/hartley"

def top_n_features(fimps, n):
    return map(lambda x: x[0], sorted(enumerate(fimps), key=itemgetter(1), 
                                       reverse=True)[0:n])
    
plt.figure()

y = np.loadtxt(os.path.join(DATA_DIR, "y.csv"))
for model_id in range(5):    
    Xgray = np.loadtxt(os.path.join(DATA_DIR, "Xgray.csv"), delimiter=" ")
    Xmedian = np.loadtxt(os.path.join(DATA_DIR, "Xmedian.csv"), delimiter=" ")
    if model_id == 0:
        # gray values
        X = Xgray
    elif model_id == 1:
        # gray values minus median values
        X = Xgray - Xmedian
    elif model_id == 2:        
        # gray values concatenated with median values
        X = np.concatenate((Xgray, Xmedian), axis=1)
    elif model_id == 3:
        # combination 1 and 2
        X = np.concatenate((Xgray, Xmedian, Xgray - Xmedian), axis=1)
    elif model_id == 4:
        # mean shift values and add them in
        Xgray_mu = np.mean(Xgray)
        Xgray_ms = Xgray - Xgray_mu
        Xmedian_mu = np.mean(Xmedian)
        Xmedian_ms = Xmedian - Xmedian_mu
        X = np.concatenate((Xgray, Xmedian, Xgray_ms, Xmedian_ms, 
                            Xgray - Xmedian, Xgray_ms - Xmedian_ms), axis=1)
    clf = RandomForestClassifier(n_estimators=200, max_features="auto", 
                                 oob_score=True, n_jobs=-1)
    # Split data train/test = 75/25
    Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25, 
                                                    random_state=42)
    clf.fit(Xtrain, ytrain)
    print "OOB Score:", clf.oob_score_
    print "Top 5 Features:", top_n_features(clf.feature_importances_, 5)
    # compute ROC curve data
    ypred = clf.predict_proba(Xtest)
    fpr, tpr, threshold = roc_curve(ytest, ypred[:, 1])
    roc_auc = auc(fpr, tpr)
    plt.plot(fpr, tpr, label="Model#%d (AUC=%.2f)" % (model_id + 1, roc_auc))

# baseline, axes, labels, etc
plt.plot([0, 1], [0, 1], "k--")
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend(loc="lower right")
plt.show()

The first model uses Xgray values only as input variables, the second uses the difference between Xgray and Xmedian (ie the highlights), the third uses a concatenation of Xgray and Xmedian, the fourth a concatenation of Xgray, Xmedian and the difference, and the fifth uses Xgray, Xmedian, mean shifted Xgray, mean shifted Xmedian and the difference between mean shifted Xgray and mean shifted Xmedian. OOB Estimates on the full dataset were in the 95-97% range, and 90-93% on a 25% held out test set. Here are the OOB estimates for each model.

Model 1: 0.924648786718
Model 2: 0.907407407407
Model 3: 0.92975734355
Model 4: 0.925287356322
Model 5: 0.932950191571

From the results, Model #5 performed the best. Random Forests can also provide an ordering of their input features by importance. This information can be used to retrain the classifier with only the top N features. I haven't tried this, but it may be worth trying out to see if results improve further. The top 5 features for each model was as follows.

Model 1 Top 5 Features: [60, 71, 61, 49, 50]
Model 2 Top 5 Features: [60, 57, 49, 46, 71]
Model 3 Top 5 Features: [181, 170, 60, 192, 61]
Model 4 Top 5 Features: [181, 192, 170, 60, 182]
Model 5 Top 5 Features: [181, 423, 412, 192, 60]

Finally, we plot the ROC (Reciever Operating Characteristic, basically a plot of False Positive Rate against True Positive Rate) curves for each of the 5 classifiers. The dotted line represents the baseline. Using the AUC (Area under the curve), Model 1 looks slightly better than Model 5. However, all the models look almost equally good, probably because it is a toy example.

Thats all I have for today. In the past, I hadn't looked at Random Forests too much because it is not that commonly used for text classification (although I suspect the reason for that may just be a historical preference for other methods).