## Saturday, January 31, 2015

### Predicting Citrus varieties from Reflectance Data

This is yet another exercise from the Caltech/JPL Summer School on Big Data Analytics. This one is for the lectures by David Thompson on Dimensionality Reduction. The problem is to differentiate citrus plants from non-citrus plants (and then between different varieties of citrus plants) from the light spectrum they reflect as observed by an airborne instrument. The exercise requires the student to do some basic data analysis, then perform Principal Component Analysis (PCA) to find the best features to differentiate between Citrus and Non-Citrus plants. In the next part of the exercise, the student is asked to use Linear Discriminant Analysis (LDA) to predict one of six varieties of Citrus plants.

The dataset consists of 4 comma-separated (CSV) text files. The first two are for the first (PCA) exercise, and the second two are for the second (LDA) exercise. Each pair consists of a data file with the class (citrus or non-citrus in the first pair and the citrus species in the second case), the plant number and a set of reflectance numbers, each corresponding to a different wavelength. There are 426 wavelengths in the first case and 388 in the second. Each plant has multiple records, possibly corresponding to observations at different times. The actual wavelength values are supplied in the second file (for each pair), one wavelength per line. The first dataset has 5,944 records and the second dataset has 2,109 records.

In this post I describe my attempt at solving the exercise. I used Python with Numpy and Scikit-Learn, and Matplotlib for the visualizations.

We first attempt to visualize the data and see if we can observe a difference in the reflectance properties between citrus and non-citrus plants. This is

 ```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50``` ```from sklearn.cross_validation import KFold from sklearn.svm import SVC from sklearn.metrics import accuracy_score from sklearn.preprocessing import StandardScaler import matplotlib.pyplot as plt import numpy as np import os import pandas as pd import random DATA_DIR = "../../data/citrus_pca" def plot_by_class(X, y, w): """Plot reflectance across wavelengths for citrus vs non-citrus""" viz_df = pd.DataFrame() viz_df["wavelength"] = w colnames = {1: "citrus", 2: "noncit"} for yval in range(1,3): Xc = X[np.where(y == yval), :] xmin = np.min(Xc, axis=0) xavg = np.mean(Xc, axis=0) xmax = np.max(Xc, axis=0) viz_df[colnames[yval] + "_min"] = xmin.T viz_df[colnames[yval] + "_avg"] = xavg.T viz_df[colnames[yval] + "_max"] = xmax.T plt.subplot(211) viz_df.plot(x="wavelength", y=["citrus_min", "citrus_avg", "citrus_max"], title="Citrus Plants", style=["r--", "b", "g--"]) plt.xticks([]) plt.xlabel("") plt.ylabel("reflectance") plt.subplot(212) viz_df.plot(x="wavelength", y=["noncit_min", "noncit_avg", "noncit_max"], title="Non-Citrus Plants", style=["r--", "b", "g--"]) plt.ylabel("reflectance") plt.show() cid_df = pd.read_csv(os.path.join(DATA_DIR, "Citrus_Identification_Data.txt")) y_class = np.asarray(cid_df["class"]) y_plant = np.asarray(cid_df["plant"]) X = np.matrix(cid_df[cid_df.columns[3:]]) # capture the bNNN cols (representing wavelengths) wavelengths = np.loadtxt(os.path.join(DATA_DIR, "Citrus_Identification_Wavelengths.txt")) # How many classes? print "#-classes:", len(np.unique(y_class)) print "#-plants:", len(np.unique(y_plant)) # Plot reflectance vs wavelength for citrus vs non-citrus plants plot_by_class(X, y_class, wavelengths) ```

The code above tells us that the data describes two classes (Citrus, Non-Citrus) and 160 unique plants. The plot shown below shows the variation in reflectance for each class of plant for different wavelengths. The dashed green and red lines represent the maximum and minimum reflectance for all plants in that class, and the blue line represents the mean. As you can see, the graphs look quite similar, although there are small differences in the reflectance.

Our next attempt is to see if there are differences across individual plants. The following code calculates the mean spectrum for each of 10 randomly chosen plants. Once more, differences appear to be in the amount of reflectance of each plant, the shape of the spectrum looks fairly uniform across plants.

 ```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24``` ```def plot_random_plants(X, y, w, n): """Plot reflectance vs wavelengths for n random plants""" viz_df = pd.DataFrame() viz_df["wavelength"] = w # generate n random plant ids plant_ids = set() num_unique = len(np.unique(y)) while len(plant_ids) < n: plant_ids.add(int(num_unique * random.random())) # for each plant plot mean value of wavelengths plant_ids_lst = list(plant_ids) pcols = [] for i in range(n): Xp = X[np.where(y == plant_ids_lst[i]), :] xavg = np.mean(Xp, axis=0) pcols.append("p" + str(i)) viz_df["p" + str(i)] = xavg.T viz_df.plot(x="wavelength", y=pcols, title="Mean for %d random plants" % (n)) plt.ylabel("reflectance") plt.show() ## Plot Reflectace vs Wavelengths for 10 random plants on same chart plot_random_plants(X, y_plant, wavelengths, 10) ```

We then attempt to calculate the correlation between features. As this blog post points out, high correlation between features may be an indication that there is one or more underlying factors that drives their similarity, and some less important factors drive their differences. When features are highly correlated, it may be possible to describe them adequately with a small number of principal components.

The code below computes an all-pairs correlations across all pairs of features, and plots the correlation matrix as a heatmap. I scaled the data first (mean of 0 and standard deviation of 1) first because PCA (next step) needs the input to be scaled. This does not have an effect on the correlation coefficients. As you can see from the heatmap, there is some correlation around the lower end of the spectrum and once again some (weaker) correlation at the higher end.

 ```1 2 3 4 5 6 7 8 9 10 11 12 13 14``` ```def plot_correlations_as_heatmap(X): """Plots heatmap from blue (low) to pink (high)""" corr = np.corrcoef(X) plt.imshow(corr, interpolation="nearest", cmap=plt.cm.cool) plt.xticks(()) plt.yticks(()) plt.title("Correlation Matrix") plt.colorbar() plt.show() # plot heat map of correlations between features scaler = StandardScaler() Xscaled = scaler.fit_transform(X) plot_correlations_as_heatmap(Xscaled) ```

We then perform PCA on the scaled dataset and see how the eigenvalues decay across the number of components. We restrict our plot to the first 50 principal components (to keep the chart readable), which explains 85.6% of the total variance. The first few components account for the bulk of the variance, so the behavior of the system can be approximated by considering only the first few components. Since the correlations are not very strong, the decay is not that dramatic.

 ```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15``` ```def plot_fraction_of_variance_explained(s, n): """Plot the first n values of eigenvalue s as fraction of total""" fve = s / np.cumsum(s)[-1] plt.plot(np.arange(len(fve))[0:n], fve[0:n]) plt.title("Eigenvalue Decay") plt.xlabel("Principal Eigenvalues") plt.ylabel("Fraction of Variance Explained") plt.show() return fve # Perform a PCA on the (scaled) matrix U, s, V = np.linalg.svd(Xscaled) # Plot eigenvalue decay (as fraction of variance explained) fve = plot_fraction_of_variance_explained(s, 50) ```

We then project our data to the first 10 components (accounting for 69.4% of the total variance) and plot the correlations between the first few components. We do this for both the original data and the projected data. As expected, the original data exhibit higher correlation than the projected data. However, even in the projected data, there is no clear separation between citrus (blue) and non-citrus (red).

 ```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38``` ```def project_data(X, U, k): Ured = U[:, 0:k] Sred = np.diag(s[0:k]) return np.dot(Ured, Sred) def recover_data(Xred, V, k): Vred = V[0:k, :] return np.dot(Xred, Vred) def plot_scatter_plots(X, y, title): pairs = [(0, 1), (1, 2), (0, 2)] Xparts = [X[np.where(y == 1), :], X[np.where(y == 2), :]] for i in range(len(pairs)): plt.subplot(1, 3, i + 1) for j in range(len(Xparts)): ecol = 'b' if j == 0 else 'r' plt.scatter(Xparts[j][:, pairs[i]], Xparts[j][:, pairs[i]], s=10, edgecolors=ecol, facecolors='none') if i == 1: plt.title(title) plt.xlabel("X" + str(pairs[i])) plt.ylabel("X" + str(pairs[i])) plt.xticks(()) plt.yticks(()) i = i + 1 plt.show() # Project dataset onto the first k components k = 10 print "Variance explained with %d principal components = %0.3f" % (k, np.sum(fve[0:k])) # Project Xscaled to first k dimensions Xprojected = project_data(Xscaled, U, k) plot_by_class(Xprojected, y_class, np.arange(k)) # Scatter plot between 1/2. 2/3, 1/3 plot_scatter_plots(Xscaled, y_class, "Original Data") plot_scatter_plots(Xprojected, y_class, "Projected Data") ```  The final step in this part of the exercise was to build a classifier to predict between citrus and non-citrus plants. Because it appeared that the reflectance of citrus plants are spread out higher than non-citrus plants (also borne out by the original chart of citrus vs non-citrus reflectances), I decided to use a Support Vector Machine with a Radial (RBF) kernel. The results were a bit counter-intuitive - with 3-fold cross-validation over the original data, the accuracy was 85.1% but after PCA only 80.1%. I also tried using a RandomForest with 200 trees (code not shown here), and obtained an accuracy of 84.3% with the original data and 85.7% after PCA (data projected to the first 10 components).

 ```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16``` ```def evaluate_classification(X, y): """Evaluate classifier using 10 fold cross validation""" kfold = KFold(X.shape, 3) scores = [] for train, test in kfold: Xtrain, Xtest, ytrain, ytest = X[train], X[test], y_class[train], y_class[test] clf = SVC() # rbf kernel default clf.fit(Xtrain, ytrain) ypred = clf.predict(Xtest) scores.append(accuracy_score(ypred, ytest)) return 1.0 * sum(scores) / len(scores) # Train classifier on original data with 10-fold cross validation score_orig = evaluate_classification(Xscaled, y_class) score_pca = evaluate_classification(Xprojected, y_class) print "Classification accuracy: original: %.3f, after PCA: %.3f" % (score_orig, score_pca) ```

The next part of the exercise was to build an LDA classifier to differentiate between different species (varieties) of citrus plants, again based on similar reflectance data. The first step is to do PCA to project the data onto its two principal components. As you can see, there is some separability between the different varieties.

 ```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39``` ```from sklearn.cross_validation import train_test_split, KFold from sklearn.lda import LDA from sklearn.metrics import accuracy_score from sklearn.preprocessing import StandardScaler import matplotlib.pyplot as plt import numpy as np import os import pandas as pd DATA_DIR = "../../data/citrus_pca" cvdf = pd.read_csv(os.path.join(DATA_DIR, "Citrus_Varieties.txt")) y_species = np.asarray(cvdf["class"]) y_plant = np.asarray(cvdf["plant"]) X = np.matrix(cvdf[cvdf.columns[3:]]) wavelengths = np.loadtxt(os.path.join(DATA_DIR, "Citrus_Varieties_Wavelengths.txt")) # Reduce to 2 dimensions and visualize scaler = StandardScaler() Xscaled = scaler.fit_transform(X) U, s, V = np.linalg.svd(Xscaled) Ured = U[:, 0:2] Sred = np.diag(s[0:2]) Xproj = np.dot(Ured, Sred) species_ids = np.unique(y_species) print "species_ids:", species_ids colors = ['b', 'g', 'r', 'c', 'm', 'y'] i = 0 for species_id in species_ids: Xpart = Xproj[np.where(y_species == species_id), :] plt.scatter(Xpart[:, 0], Xpart[:, 1], color=colors[i]) i = i + 1 plt.title("Citrus Species (first 2 Principal Components)") plt.xlabel("X0") plt.ylabel("X1") plt.show() ```

The next step is to train a multiclass LDA classifier and test its accuracy on a holdout test dataset. We split our dataset 75/25 train/test, train a Scikit-Learn LDA classifier on the training data, and obtain a pretty impressive score of 97.3% accuracy on the hold out test set. LDA provides the "mean" vector for each of its classes. In order to check which species are most similar, we build a correlation matrix and represent it as a heatmap. As can be easily seen from the heatmap, the most similar species (in terms of reflectance spectrum) are the Mandarin, Pummelo and the Hybrid.

 ```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16``` ```# Perform multiclass LDA Xtrain, Xtest, ytrain, ytest = train_test_split(Xscaled, y_species, test_size=0.25, random_state=42) clf = LDA(len(species_ids)) clf.fit(Xtrain, ytrain) ypred = clf.predict(Xtest) print "LDA Accuracy Score: %.3f" % (accuracy_score(ypred, ytest)) # What varieties are most spectrally similar? corr = np.corrcoef(clf.means_) plt.imshow(corr, interpolation="nearest", cmap=plt.cm.cool) plt.xticks(np.arange(len(species_ids)), species_ids, rotation=45) plt.yticks(np.arange(len(species_ids)), species_ids) plt.colorbar() plt.show() ```

Finally, in an effort to test if we can predict species (and then plant) effectively from the reflectance data, we build small harnesses to do 10-fold cross validation with the LDA classifier. The cross-validation accuracy score for the species predictor is a slightly less impressive 78.6% and for the plant predictor an abysmal 3.2%.

 ```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23``` ```# Find LDA classifier accuracy using cross validation kfold = KFold(Xscaled.shape, 10) scores = [] for train, test in kfold: Xtrain, Xtest, ytrain, ytest = Xscaled[train], Xscaled[test], \ y_species[train], y_species[test] clf = LDA(len(species_ids)) clf.fit(Xtrain, ytrain) ypred = clf.predict(Xtest) scores.append(accuracy_score(ypred, ytest)) print "LDA accuracy predicting species: %.3f" % (1.0 * sum(scores) / len(scores)) # Find LDA classifier accuracy for predicting specific plants kfold = KFold(Xscaled.shape, 10) scores = [] for train, test in kfold: Xtrain, Xtest, ytrain, ytest = Xscaled[train], Xscaled[test], \ y_plant[train], y_plant[test] clf = LDA(len(species_ids)) clf.fit(Xtrain, ytrain) ypred = clf.predict(Xtest) scores.append(accuracy_score(ypred, ytest)) print "LDA accuracy predicting plants: %.3f" % (1.0 * sum(scores) / len(scores)) ```

This is all I have for today. As always (at least for most of my recent posts), this code is available on github here. I had a lot of fun doing this exercise even though the results are probably not that great. I did not know about the LDA classifier before this, got to learn about this. I also learned quite a few Matplotlib tricks. I hope you enjoyed it too.

## Monday, January 19, 2015

### Using Random Forests to find Interesting Features on Comets

This is another exercise from the Caltech/JPL Summer School on Big Data Analytics, specifically the one following the lectures on Decision Trees and Random Forests by Thomas Fuchs of JPL. The object of the exercise is to predict interesting features on comets and asteroids as space probes fly by them, so that onboard cameras can take detailed pictures of those spots autonomously. Manual control from the ground is not an option because time lags are significant at these distances. The exercise asks to build and train a Random Forest classifier and investigate its performance.

The dataset provided is of surface features from the comet Hartley. There are 2,089 structured observations provided as a R dataset (.rdata). The structure of the dataset is as follows:

 ```1 2 3 4 5 6 7 8``` ```struct data { float gray; // grayscale values normalized 0..1 float median; // grayscale for median filtered image (0..1) int Y; // 0 = boring; 1 = interesting int rect; // bounding box of the observation char* frameName; int frameId; } ```

Since I prefer to work in Python, I decided to build my classifier using the Scikit-Learn toolkit. I initially tried the rpy2 to read this data from within Python, but it complained that the .rdata file was corrupt. So I switched around and used a combination of my rudimentary R skills and the starter code to write the data out from R into flat files, using the code below:

 ```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15``` ```# Source: rdata_reader.R setwd("/.../mlia-examples/data/hartley") dat <- dget("patchesHartley2.rdat") y <- sapply(dat, function(x) x\$Y) write.table(y, file="y.csv", row.names=FALSE, col.names=FALSE) Xgray <- matrix(nrow=length(dat), ncol=121) Xmedian <- matrix(nrow=length(dat), ncol=121) for (i in 1:length(dat)) { Xgray[i, ] = as.vector(dat[[i]]\$gray) Xmedian[i, ] = as.vector(dat[[i]]\$median) } write.table(Xgray, file="Xgray.csv", row.names=FALSE, col.names=FALSE) write.table(Xmedian, file="Xmedian.csv", row.names=FALSE, col.names=FALSE) ```

This creates 3 files - y.csv with 2,089 target variables one per line, and Xgray.csv and Xmedian.csv, each also with 2,089 lines and 121 (11x11) grayscale values per line.

Our Python code (shown below) builds various models with different combinations of Xgray and Xmedian, and measures their OOB (Out of Bag) error estimates. As mentioned in article, in case of Random Forests, cross validation is not required to get an estimate of the test set error. The OOB is estimated internally and serves as a good estimate.

 ```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63``` ```# Source: hartley_classify.py from operator import itemgetter from sklearn.cross_validation import train_test_split from sklearn.ensemble import RandomForestClassifier from sklearn.metrics import roc_curve, auc import numpy as np import matplotlib.pyplot as plt import os DATA_DIR = "../../data/hartley" def top_n_features(fimps, n): return map(lambda x: x, sorted(enumerate(fimps), key=itemgetter(1), reverse=True)[0:n]) plt.figure() y = np.loadtxt(os.path.join(DATA_DIR, "y.csv")) for model_id in range(5): Xgray = np.loadtxt(os.path.join(DATA_DIR, "Xgray.csv"), delimiter=" ") Xmedian = np.loadtxt(os.path.join(DATA_DIR, "Xmedian.csv"), delimiter=" ") if model_id == 0: # gray values X = Xgray elif model_id == 1: # gray values minus median values X = Xgray - Xmedian elif model_id == 2: # gray values concatenated with median values X = np.concatenate((Xgray, Xmedian), axis=1) elif model_id == 3: # combination 1 and 2 X = np.concatenate((Xgray, Xmedian, Xgray - Xmedian), axis=1) elif model_id == 4: # mean shift values and add them in Xgray_mu = np.mean(Xgray) Xgray_ms = Xgray - Xgray_mu Xmedian_mu = np.mean(Xmedian) Xmedian_ms = Xmedian - Xmedian_mu X = np.concatenate((Xgray, Xmedian, Xgray_ms, Xmedian_ms, Xgray - Xmedian, Xgray_ms - Xmedian_ms), axis=1) clf = RandomForestClassifier(n_estimators=200, max_features="auto", oob_score=True, n_jobs=-1) # Split data train/test = 75/25 Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25, random_state=42) clf.fit(Xtrain, ytrain) print "OOB Score:", clf.oob_score_ print "Top 5 Features:", top_n_features(clf.feature_importances_, 5) # compute ROC curve data ypred = clf.predict_proba(Xtest) fpr, tpr, threshold = roc_curve(ytest, ypred[:, 1]) roc_auc = auc(fpr, tpr) plt.plot(fpr, tpr, label="Model#%d (AUC=%.2f)" % (model_id + 1, roc_auc)) # baseline, axes, labels, etc plt.plot([0, 1], [0, 1], "k--") plt.xlim([0.0, 1.0]) plt.ylim([0.0, 1.05]) plt.xlabel('False Positive Rate') plt.ylabel('True Positive Rate') plt.legend(loc="lower right") plt.show() ```

The first model uses Xgray values only as input variables, the second uses the difference between Xgray and Xmedian (ie the highlights), the third uses a concatenation of Xgray and Xmedian, the fourth a concatenation of Xgray, Xmedian and the difference, and the fifth uses Xgray, Xmedian, mean shifted Xgray, mean shifted Xmedian and the difference between mean shifted Xgray and mean shifted Xmedian. OOB Estimates on the full dataset were in the 95-97% range, and 90-93% on a 25% held out test set. Here are the OOB estimates for each model.

 ```1 2 3 4 5``` ```Model 1: 0.924648786718 Model 2: 0.907407407407 Model 3: 0.92975734355 Model 4: 0.925287356322 Model 5: 0.932950191571 ```

From the results, Model #5 performed the best. Random Forests can also provide an ordering of their input features by importance. This information can be used to retrain the classifier with only the top N features. I haven't tried this, but it may be worth trying out to see if results improve further. The top 5 features for each model was as follows.

 ```1 2 3 4 5``` ```Model 1 Top 5 Features: [60, 71, 61, 49, 50] Model 2 Top 5 Features: [60, 57, 49, 46, 71] Model 3 Top 5 Features: [181, 170, 60, 192, 61] Model 4 Top 5 Features: [181, 192, 170, 60, 182] Model 5 Top 5 Features: [181, 423, 412, 192, 60] ```

Finally, we plot the ROC (Reciever Operating Characteristic, basically a plot of False Positive Rate against True Positive Rate) curves for each of the 5 classifiers. The dotted line represents the baseline. Using the AUC (Area under the curve), Model 1 looks slightly better than Model 5. However, all the models look almost equally good, probably because it is a toy example.

Thats all I have for today. In the past, I hadn't looked at Random Forests too much because it is not that commonly used for text classification (although I suspect the reason for that may just be a historical preference for other methods).

## Sunday, January 11, 2015

### Coding A Better Strategy for Hangman

Happy New Year! In this post, I will describe a fun little project that came about while going through the lectures in The Caltech-JPL Summer School on Big Data Analytics course on Coursera (really a dump of the lecture notes and videos from the actual week-long sessions rather than a formal course), I came across an interesting assignment - implementing the ideas outlined in this lifehacker article - A Better Strategy for Hangman. Hangman is a simple word-guessing game played by two players, one of whom (the Proposer) thinks of a word and the other (the Solver) tries to guess it. The guessing is done one letter at a time - the strategy described in the article uses letter frequencies to inform these guesses. The article goes into much more detail, and you should read it to get a full understanding of the strategy.

In real life, the game is played between two people, each of whom have their own (probably different sized) vocabularies. In our code, we use the words from our /usr/share/dict/words file (in Ubuntu, and likely available for other Unix based systems as well). So both Proposer and Solver use the same list of words as their "vocabulary". The Proposer selects a random word from the list and the Solver uses letter frequencies to make letter guesses until he guesses the word or the allowed number of guesses (11) run out. Here is the code for the game.

 ```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165``` ```# Source: src/hangman/game.py # -*- coding: utf-8 -*- import operator from random import random MAX_GUESSES_PER_GAME = 11 MAX_WORDS_IN_SUBSET = 25 ########################### Preprocessing ############################ non_ascii_mappings = list([ ('', 'a'), ('', 'a'), ('', 'a'), ('', 'a'), ('', 'a'), ('', 'a'), ('', 'c'), ('', 'c'), ('', 'e'), ('', 'e'), ('', 'e'), ('', 'e'), ('', 'i'), ('', 'n'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'o'), ('', 'u'), ('', 'u'), ('', '\''), ('', '"') ]) def ascii_fold(s): for x in non_ascii_mappings: s = s.replace(x, x) return s def preprocess(dictfile): fwords = open(dictfile, 'rb') wset = set() for line in fwords: word = line.strip().lower() word = ascii_fold(word) if word.endswith("'s"): word = word[:-2] word = word.replace(" ", "").replace("'", "").replace("\"", "") wset.add(word) fwords.close() return list(wset) ############################# Proposer Side ############################# def select_secret_word(words): widx = int(random() * len(words)) return words[widx] def find_all_match_positions(secret_word, guess_char): positions = [] curr_pos = 0 while curr_pos < len(secret_word): curr_pos = secret_word.find(guess_char, curr_pos) if curr_pos < 0: break positions.append(curr_pos) curr_pos += 1 return positions def update_guessed_word(guessed_word, matched_positions, guessed_char): for pos in matched_positions: guessed_word[pos] = guessed_char def is_solved(guessed_word): chars_remaining = len(filter(lambda x: x == '_', guessed_word)) return chars_remaining == 0 ############################# Solver Side ############################### letters = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z"] def most_frequent_char(words, previously_guessed): cmap = {x:0 for x in letters} for word in words: cset = set() for c in word: if c not in previously_guessed: cset.add(c) for c in cset: cmap[c] += 1 return sorted(map(lambda x: (x, cmap[x]), cmap.keys()), key=operator.itemgetter(1), reverse=True) def best_guess(words, word_len, bad_guesses, good_guesses, guessed_word): temp_words = filter(lambda x: len(x) == word_len, words) if len(bad_guesses) > 0: for bad_guess in bad_guesses: temp_words = filter(lambda x: x.find(bad_guess) == -1, temp_words) if len(good_guesses) > 0: for good_guess in good_guesses: temp_words = filter(lambda x: x.find(good_guess) > -1, temp_words) previously_guessed = set(filter(lambda x: x != '_', guessed_word)) return temp_words, most_frequent_char(temp_words, previously_guessed) def init_guess(wordlen): initial_guess = [] for i in range(wordlen): initial_guess.append("_") return initial_guess def match_words_against_template(words, guessed_word): if len(words) > MAX_WORDS_IN_SUBSET: return words matched_words = [] for word in words: word_chars = [c for c in word] merged = zip(guessed_word, word_chars) diff = len(filter(lambda x: x != x, filter(lambda x: x != '_', merged))) if diff == 0: matched_words.append(word) if len(matched_words) > 1: break return matched_words def replace_guessed_word(guessed_word, matched_word): matched_chars = [c for c in matched_word] for i in range(len(matched_chars)): guessed_word[i] = matched_chars[i] ################################# Game ################################### def single_round(words, debug=False): solver_wins = False secret_word = select_secret_word(words) if debug: print "secret word:", secret_word word_len = len(secret_word) bad_guesses = set() good_guesses = set() guessed_word = init_guess(word_len) for num_guesses in range(MAX_GUESSES_PER_GAME): filtered_words, guess_char = best_guess(words, word_len, bad_guesses, good_guesses, guessed_word) if debug: print "guessed char:", guess_char matched_positions = find_all_match_positions(secret_word, guess_char) if len(matched_positions) == 0: bad_guesses.add(guess_char) else: good_guesses.add(guess_char) update_guessed_word(guessed_word, matched_positions, guess_char) matched_words = match_words_against_template(filtered_words, guessed_word) if len(matched_words) == 1: replace_guessed_word(guessed_word, matched_words) if debug: print "#", num_guesses, "guess:", " ".join(guessed_word) if is_solved(guessed_word): solver_wins = True break return len(secret_word), solver_wins, num_guesses def multiple_rounds(words, num_games, report_file): fdata = open(report_file, 'wb') for i in range(num_games): word_len, solver_wins, num_guesses = single_round(words, False) fdata.write("%d,%d,%d\n" % (word_len, 1 if solver_wins else 0, num_guesses)) fdata.close() ################################# Main ################################### words = preprocess("/usr/share/dict/words") #single_round(words, True) multiple_rounds(words, 10000, "hangman.csv") ```

The /usr/share/dict/words file has 99,171 English words, and contains several letters with diacritics for words that are adopted from other languages. We intially normalize these words so they are all in lowercase, and replace letters with diacritics with their corresponding ASCII letters from the range 'a' through 'z'. We also replace words ending with 's with their base forms, and remove duplicates. At the end of this, we are left with 72,167 words.

At the beginning of the game, the solver knows the number of letters in the word, so in order to maximize the chances of getting a letter, it makes sense to guess the most frequent letter for that size of word in the dictionary. If we get the letter, we are obviously one step closer to a win. However, even if we don't get a letter, it still gives us some information which we can use for subsequent guesses - ie, our subset of word candidates do not contain this letter.

This approach of consistently choosing the most frequent letter (minus any letters known to not exist in the word) works upto a point. Often, it is possible to use the word template generated from previous guesses to match against the word corpus. If the template matches to a single word, we can just guess that word and win. I have implemented this strategy in the match_words_against_template() function in the code above.

An optimization for cases where the template matches multiple words would be to select a letter that resolves the ambiguity (such as a guess of "s" for the pattern "_ e t t l e" where the word could be either "settle" or "kettle") rather than the most frequent character.

Here are traces from a few games. You can run a single game by calling the single_round() function with debug set to True. The word is selected randomly from the word list, but it is quite simple to make the change to take in a specific word instead if you want to do that.

 ```secret word: undergo guessed char: e # 0 guess: _ _ _ e _ _ _ guessed char: s # 1 guess: _ _ _ e _ _ _ guessed char: r # 2 guess: _ _ _ e r _ _ guessed char: a # 3 guess: _ _ _ e r _ _ guessed char: i # 4 guess: _ _ _ e r _ _ guessed char: o # 5 guess: _ _ _ e r _ o guessed char: d # 6 guess: _ _ d e r _ o guessed char: l # 7 guess: _ _ d e r _ o guessed char: n # 8 guess: _ n d e r _ o guessed char: u # 9 guess: u n d e r _ o guessed char: b # 10 guess: u n d e r g o ``` ```secret word: steakhouses guessed char: e # 0 guess: _ _ e _ _ _ _ _ _ e _ guessed char: i # 1 guess: _ _ e _ _ _ _ _ _ e _ guessed char: r # 2 guess: _ _ e _ _ _ _ _ _ e _ guessed char: s # 3 guess: s _ e _ _ _ _ _ s e s guessed char: a # 4 guess: s _ e a _ _ _ _ s e s guessed char: l # 5 guess: s _ e a _ _ _ _ s e s guessed char: t # 6 guess: s t e a _ _ _ _ s e s guessed char: n # 7 guess: s t e a _ _ _ _ s e s guessed char: o # 8 guess: s t e a k h o u s e s ``` ```secret word: coffers guessed char: e # 0 guess: _ _ _ _ e _ _ guessed char: s # 1 guess: _ _ _ _ e _ s guessed char: r # 2 guess: _ _ _ _ e r s guessed char: a # 3 guess: _ _ _ _ e r s guessed char: i # 4 guess: _ _ _ _ e r s guessed char: o # 5 guess: _ o _ _ e r s guessed char: t # 6 guess: _ o _ _ e r s guessed char: c # 7 guess: c o _ _ e r s guessed char: u # 8 guess: c o _ _ e r s guessed char: k # 9 guess: c o _ _ e r s guessed char: p # 10 guess: c o _ _ e r s ``` ```secret word: soho guessed char: a # 0 guess: _ _ _ _ guessed char: e # 1 guess: _ _ _ _ guessed char: o # 2 guess: _ o _ o guessed char: s # 3 guess: s o _ o guessed char: t # 4 guess: s o _ o guessed char: l # 5 guess: s o _ o guessed char: b # 6 guess: s o _ o guessed char: n # 7 guess: s o _ o guessed char: p # 8 guess: s o _ o guessed char: d # 9 guess: s o _ o guessed char: w # 10 guess: s o _ o ```

Now that the game had been reduced to closed-world (ie vocabulary drawn from the same list) interactions between two entities which had computers for minds, I wanted to see how successful the Solver was against the Proposer. For this, I ran a small experiment and let the code run for 10,000 games and recorded their results. The data is generated by calling the multiple_rounds() function with the number of runs desired and the output filename.

Here is some code that reads the resulting file and generates the necessary charts for visualization.

 ```1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39``` ```# Source: src/hangman/gamestats.py # -*- coding: utf-8 -*- from __future__ import division import pandas as pd import matplotlib.pyplot as plt df = pd.read_csv("hangman.csv", header=None, names=["WORD_LEN", "SOLVER_WINS", "NUM_GUESSES"]) # wins vs losses nloss = df[df["SOLVER_WINS"] == 0].count()["SOLVER_WINS"] nwins = df[df["SOLVER_WINS"] == 1].count()["SOLVER_WINS"] print "probability of winning=", nwins / (nwins + nloss) print "probability of losing=", nloss / (nwins + nloss) ## probability of winning and losing for different word lengths df2 = df.drop("NUM_GUESSES", 1) df2_wins = df2[df2["SOLVER_WINS"] == 1].groupby("WORD_LEN").count().reset_index() df2_losses = df2[df2["SOLVER_WINS"] == 0].groupby("WORD_LEN").count().reset_index() df2_losses.rename(columns={"SOLVER_WINS": "SOLVER_LOSES"}, inplace=True) df2_merged = df2_wins.merge(df2_losses, how="inner", on="WORD_LEN") df2_merged.plot(kind="bar", stacked=True, x="WORD_LEN", title="Win/Loss Counts by Word Length") plt.show() df2_merged["NUM_GAMES"] = df2_merged["SOLVER_WINS"] + df2_merged["SOLVER_LOSES"] df2_merged["SOLVER_WINS"] = df2_merged["SOLVER_WINS"] / df2_merged["NUM_GAMES"] df2_merged["SOLVER_LOSES"] = df2_merged["SOLVER_LOSES"] / df2_merged["NUM_GAMES"] df2_merged.drop("NUM_GAMES", axis=1, inplace=True) df2_merged.plot(kind="bar", stacked=True, x="WORD_LEN", title="Win/Loss Probabilities by Word Length") plt.show() # how number of guesses to win varies with word length (winning games only) df3 = df[df["SOLVER_WINS"] == 1] df3.drop("SOLVER_WINS", 1) df3_grouped = df3.drop("SOLVER_WINS", 1).groupby("WORD_LEN").mean().reset_index() df3_grouped.plot(kind="bar", x="WORD_LEN", title="Avg Guesses for Different Word Lengths") plt.show() ```

The empirical probability of winning the game, based on this data, is about 0.66. As expected, the distribution of word lengths thought up by the Proposer follows a roughly normal distribution. However, shorter words have a lower probability of getting solved than longer words. This is more easily seen in the chart showing win/loss probabilities. So treating win probability as a proxy for difficulty, winning at Hangman is actually harder for shorter words than for longer words.  I also plotted the average number of guesses the solver made to guess different sized words thought up by the Proposer. This time the distribution is more uniform, although it falls off to the right for long words.

This is all I have for today. I haven't been posting these past few weeks because I was out of town on vacation. Hopefully I should get back to a more regular schedule in the New Year.