Salmon Run: Predicting Citrus varieties from Reflectance Data

This is yet another exercise from the Caltech/JPL Summer School on Big Data Analytics. This one is for the lectures by David Thompson on Dimensionality Reduction. The problem is to differentiate citrus plants from non-citrus plants (and then between different varieties of citrus plants) from the light spectrum they reflect as observed by an airborne instrument. The exercise requires the student to do some basic data analysis, then perform Principal Component Analysis (PCA) to find the best features to differentiate between Citrus and Non-Citrus plants. In the next part of the exercise, the student is asked to use Linear Discriminant Analysis (LDA) to predict one of six varieties of Citrus plants.

The dataset consists of 4 comma-separated (CSV) text files. The first two are for the first (PCA) exercise, and the second two are for the second (LDA) exercise. Each pair consists of a data file with the class (citrus or non-citrus in the first pair and the citrus species in the second case), the plant number and a set of reflectance numbers, each corresponding to a different wavelength. There are 426 wavelengths in the first case and 388 in the second. Each plant has multiple records, possibly corresponding to observations at different times. The actual wavelength values are supplied in the second file (for each pair), one wavelength per line. The first dataset has 5,944 records and the second dataset has 2,109 records.

In this post I describe my attempt at solving the exercise. I used Python with Numpy and Scikit-Learn, and Matplotlib for the visualizations.

We first attempt to visualize the data and see if we can observe a difference in the reflectance properties between citrus and non-citrus plants. This is

from sklearn.cross_validation import KFold
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import random

DATA_DIR = "../../data/citrus_pca"

def plot_by_class(X, y, w):
    """Plot reflectance across wavelengths for citrus vs non-citrus"""
    viz_df = pd.DataFrame()
    viz_df["wavelength"] = w
    colnames = {1: "citrus", 2: "noncit"}
    for yval in range(1,3):
        Xc = X[np.where(y == yval)[0], :]
        xmin = np.min(Xc, axis=0)
        xavg = np.mean(Xc, axis=0)
        xmax = np.max(Xc, axis=0)
        viz_df[colnames[yval] + "_min"] = xmin.T
        viz_df[colnames[yval] + "_avg"] = xavg.T
        viz_df[colnames[yval] + "_max"] = xmax.T
    plt.subplot(211)
    viz_df.plot(x="wavelength", y=["citrus_min", "citrus_avg", "citrus_max"],
                title="Citrus Plants", style=["r--", "b", "g--"])
    plt.xticks([])
    plt.xlabel("")
    plt.ylabel("reflectance")
    plt.subplot(212)
    viz_df.plot(x="wavelength", y=["noncit_min", "noncit_avg", "noncit_max"],
                title="Non-Citrus Plants", style=["r--", "b", "g--"])
    plt.ylabel("reflectance")
    plt.show()
    
cid_df = pd.read_csv(os.path.join(DATA_DIR, "Citrus_Identification_Data.txt"))
y_class = np.asarray(cid_df["class"])
y_plant = np.asarray(cid_df["plant"])
X = np.matrix(cid_df[cid_df.columns[3:]])  # capture the bNNN cols (representing wavelengths)

wavelengths = np.loadtxt(os.path.join(DATA_DIR, "Citrus_Identification_Wavelengths.txt"))

# How many classes?
print "#-classes:", len(np.unique(y_class))
print "#-plants:", len(np.unique(y_plant))

# Plot reflectance vs wavelength for citrus vs non-citrus plants
plot_by_class(X, y_class, wavelengths)

The code above tells us that the data describes two classes (Citrus, Non-Citrus) and 160 unique plants. The plot shown below shows the variation in reflectance for each class of plant for different wavelengths. The dashed green and red lines represent the maximum and minimum reflectance for all plants in that class, and the blue line represents the mean. As you can see, the graphs look quite similar, although there are small differences in the reflectance.

Our next attempt is to see if there are differences across individual plants. The following code calculates the mean spectrum for each of 10 randomly chosen plants. Once more, differences appear to be in the amount of reflectance of each plant, the shape of the spectrum looks fairly uniform across plants.

def plot_random_plants(X, y, w, n):
    """Plot reflectance vs wavelengths for n random plants"""
    viz_df = pd.DataFrame()
    viz_df["wavelength"] = w
    # generate n random plant ids
    plant_ids = set()
    num_unique = len(np.unique(y))
    while len(plant_ids) < n:
        plant_ids.add(int(num_unique * random.random()))
    # for each plant plot mean value of wavelengths
    plant_ids_lst = list(plant_ids)
    pcols = []
    for i in range(n):
        Xp = X[np.where(y == plant_ids_lst[i])[0], :]
        xavg = np.mean(Xp, axis=0)
        pcols.append("p" + str(i))
        viz_df["p" + str(i)] = xavg.T
    viz_df.plot(x="wavelength", y=pcols, 
                title="Mean for %d random plants" % (n))
    plt.ylabel("reflectance")        
    plt.show()

## Plot Reflectace vs Wavelengths for 10 random plants on same chart
plot_random_plants(X, y_plant, wavelengths, 10)

We then attempt to calculate the correlation between features. As this blog post points out, high correlation between features may be an indication that there is one or more underlying factors that drives their similarity, and some less important factors drive their differences. When features are highly correlated, it may be possible to describe them adequately with a small number of principal components.

The code below computes an all-pairs correlations across all pairs of features, and plots the correlation matrix as a heatmap. I scaled the data first (mean of 0 and standard deviation of 1) first because PCA (next step) needs the input to be scaled. This does not have an effect on the correlation coefficients. As you can see from the heatmap, there is some correlation around the lower end of the spectrum and once again some (weaker) correlation at the higher end.

def plot_correlations_as_heatmap(X):
    """Plots heatmap from blue (low) to pink (high)"""
    corr = np.corrcoef(X)
    plt.imshow(corr, interpolation="nearest", cmap=plt.cm.cool)
    plt.xticks(())
    plt.yticks(())
    plt.title("Correlation Matrix")
    plt.colorbar()
    plt.show()
    
# plot heat map of correlations between features
scaler = StandardScaler()
Xscaled = scaler.fit_transform(X)
plot_correlations_as_heatmap(Xscaled)

We then perform PCA on the scaled dataset and see how the eigenvalues decay across the number of components. We restrict our plot to the first 50 principal components (to keep the chart readable), which explains 85.6% of the total variance. The first few components account for the bulk of the variance, so the behavior of the system can be approximated by considering only the first few components. Since the correlations are not very strong, the decay is not that dramatic.

def plot_fraction_of_variance_explained(s, n):
    """Plot the first n values of eigenvalue s as fraction of total"""
    fve = s / np.cumsum(s)[-1]
    plt.plot(np.arange(len(fve))[0:n], fve[0:n])
    plt.title("Eigenvalue Decay")
    plt.xlabel("Principal Eigenvalues")
    plt.ylabel("Fraction of Variance Explained")
    plt.show()
    return fve

# Perform a PCA on the (scaled) matrix
U, s, V = np.linalg.svd(Xscaled)

# Plot eigenvalue decay (as fraction of variance explained)
fve = plot_fraction_of_variance_explained(s, 50)

We then project our data to the first 10 components (accounting for 69.4% of the total variance) and plot the correlations between the first few components. We do this for both the original data and the projected data. As expected, the original data exhibit higher correlation than the projected data. However, even in the projected data, there is no clear separation between citrus (blue) and non-citrus (red).

def project_data(X, U, k):
    Ured = U[:, 0:k]
    Sred = np.diag(s[0:k])
    return np.dot(Ured, Sred)

def recover_data(Xred, V, k):
    Vred = V[0:k, :]
    return np.dot(Xred, Vred)
    
def plot_scatter_plots(X, y, title):
    pairs = [(0, 1), (1, 2), (0, 2)]
    Xparts = [X[np.where(y == 1)[0], :], X[np.where(y == 2)[0], :]]
    for i in range(len(pairs)):
        plt.subplot(1, 3, i + 1)
        for j in range(len(Xparts)):
            ecol = 'b' if j == 0 else 'r'
            plt.scatter(Xparts[j][:, pairs[i][0]], Xparts[j][:, pairs[i][1]], 
                        s=10, edgecolors=ecol, facecolors='none')
        if i == 1:
            plt.title(title)
        plt.xlabel("X" + str(pairs[i][0]))
        plt.ylabel("X" + str(pairs[i][1]))
        plt.xticks(())
        plt.yticks(())
        i = i + 1
    plt.show()

# Project dataset onto the first k components
k = 10
print "Variance explained with %d principal components = %0.3f" % (k, np.sum(fve[0:k]))

# Project Xscaled to first k dimensions
Xprojected = project_data(Xscaled, U, k)
plot_by_class(Xprojected, y_class, np.arange(k))

# Scatter plot between 1/2. 2/3, 1/3
plot_scatter_plots(Xscaled, y_class, "Original Data")
plot_scatter_plots(Xprojected, y_class, "Projected Data")

The final step in this part of the exercise was to build a classifier to predict between citrus and non-citrus plants. Because it appeared that the reflectance of citrus plants are spread out higher than non-citrus plants (also borne out by the original chart of citrus vs non-citrus reflectances), I decided to use a Support Vector Machine with a Radial (RBF) kernel. The results were a bit counter-intuitive - with 3-fold cross-validation over the original data, the accuracy was 85.1% but after PCA only 80.1%. I also tried using a RandomForest with 200 trees (code not shown here), and obtained an accuracy of 84.3% with the original data and 85.7% after PCA (data projected to the first 10 components).

def evaluate_classification(X, y):
    """Evaluate classifier using 10 fold cross validation"""
    kfold = KFold(X.shape[0], 3)
    scores = []
    for train, test in kfold:
        Xtrain, Xtest, ytrain, ytest = X[train], X[test], y_class[train], y_class[test]
        clf = SVC()  # rbf kernel default
        clf.fit(Xtrain, ytrain)
        ypred = clf.predict(Xtest)
        scores.append(accuracy_score(ypred, ytest))
    return 1.0 * sum(scores) / len(scores)
    
# Train classifier on original data with 10-fold cross validation
score_orig = evaluate_classification(Xscaled, y_class)
score_pca = evaluate_classification(Xprojected, y_class)
print "Classification accuracy: original: %.3f, after PCA: %.3f" % (score_orig, score_pca)

The next part of the exercise was to build an LDA classifier to differentiate between different species (varieties) of citrus plants, again based on similar reflectance data. The first step is to do PCA to project the data onto its two principal components. As you can see, there is some separability between the different varieties.

from sklearn.cross_validation import train_test_split, KFold
from sklearn.lda import LDA
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd

DATA_DIR = "../../data/citrus_pca"

cvdf = pd.read_csv(os.path.join(DATA_DIR, "Citrus_Varieties.txt"))
y_species = np.asarray(cvdf["class"])
y_plant = np.asarray(cvdf["plant"])
X = np.matrix(cvdf[cvdf.columns[3:]])

wavelengths = np.loadtxt(os.path.join(DATA_DIR, "Citrus_Varieties_Wavelengths.txt"))

# Reduce to 2 dimensions and visualize
scaler = StandardScaler()
Xscaled = scaler.fit_transform(X)

U, s, V = np.linalg.svd(Xscaled)
Ured = U[:, 0:2]
Sred = np.diag(s[0:2])
Xproj = np.dot(Ured, Sred)

species_ids = np.unique(y_species)
print "species_ids:", species_ids
colors = ['b', 'g', 'r', 'c', 'm', 'y']
i = 0
for species_id in species_ids:
    Xpart = Xproj[np.where(y_species == species_id)[0], :]
    plt.scatter(Xpart[:, 0], Xpart[:, 1], color=colors[i])
    i = i + 1
plt.title("Citrus Species (first 2 Principal Components)")
plt.xlabel("X0")
plt.ylabel("X1")
plt.show()

The next step is to train a multiclass LDA classifier and test its accuracy on a holdout test dataset. We split our dataset 75/25 train/test, train a Scikit-Learn LDA classifier on the training data, and obtain a pretty impressive score of 97.3% accuracy on the hold out test set. LDA provides the "mean" vector for each of its classes. In order to check which species are most similar, we build a correlation matrix and represent it as a heatmap. As can be easily seen from the heatmap, the most similar species (in terms of reflectance spectrum) are the Mandarin, Pummelo and the Hybrid.

# Perform multiclass LDA 
Xtrain, Xtest, ytrain, ytest = train_test_split(Xscaled, y_species, 
                                                test_size=0.25, 
                                                random_state=42)
clf = LDA(len(species_ids))
clf.fit(Xtrain, ytrain)
ypred = clf.predict(Xtest)
print "LDA Accuracy Score: %.3f" % (accuracy_score(ypred, ytest))

# What varieties are most spectrally similar?
corr = np.corrcoef(clf.means_)
plt.imshow(corr, interpolation="nearest", cmap=plt.cm.cool)
plt.xticks(np.arange(len(species_ids)), species_ids, rotation=45)
plt.yticks(np.arange(len(species_ids)), species_ids)
plt.colorbar()
plt.show()

Finally, in an effort to test if we can predict species (and then plant) effectively from the reflectance data, we build small harnesses to do 10-fold cross validation with the LDA classifier. The cross-validation accuracy score for the species predictor is a slightly less impressive 78.6% and for the plant predictor an abysmal 3.2%.

# Find LDA classifier accuracy using cross validation
kfold = KFold(Xscaled.shape[0], 10)
scores = []
for train, test in kfold:
    Xtrain, Xtest, ytrain, ytest = Xscaled[train], Xscaled[test], \
        y_species[train], y_species[test]
    clf = LDA(len(species_ids))
    clf.fit(Xtrain, ytrain)
    ypred = clf.predict(Xtest)
    scores.append(accuracy_score(ypred, ytest))
print "LDA accuracy predicting species: %.3f" % (1.0 * sum(scores) / len(scores))

# Find LDA classifier accuracy for predicting specific plants
kfold = KFold(Xscaled.shape[0], 10)
scores = []
for train, test in kfold:
    Xtrain, Xtest, ytrain, ytest = Xscaled[train], Xscaled[test], \
        y_plant[train], y_plant[test]
    clf = LDA(len(species_ids))
    clf.fit(Xtrain, ytrain)
    ypred = clf.predict(Xtest)
    scores.append(accuracy_score(ypred, ytest))
print "LDA accuracy predicting plants: %.3f" % (1.0 * sum(scores) / len(scores))

This is all I have for today. As always (at least for most of my recent posts), this code is available on github here. I had a lot of fun doing this exercise even though the results are probably not that great. I did not know about the LDA classifier before this, got to learn about this. I also learned quite a few Matplotlib tricks. I hope you enjoyed it too.