Saturday, July 26, 2014

Handwritten Digit Recognition with PyBrain


I am taking Andrew Ng's ML class on Coursera again. The last time I took it was approximately 3 years ago, when I was just starting out learning about Machine Learning. This time round, I am not submitting any of the programming assignments because I am doing them in Python rather than in Octave.

The topic for last week (and this week) was Neural Networks. Instead of building a neural network from first principles, as the programming assignment requires, I decided to use this opportunity to explore PyBrain, a Python machine learning library for building neural networks.

The task is to classify images of handwritten digits into the numbers 0-9. The data is a subset of the MNIST database. It consists of 5,000 black and white images, each showing a single handwritten digit as 20x20 pixels flattened into a 1x400 array of grayscale values 0-127, together with the actual value of the digit. The data is provided as a MATLAB .mat file for the assignment. Here is a sample of the data (visualization code included below).
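The flattening described above can be undone with a reshape. The following is a minimal sketch in plain Python, assuming the pixels were flattened column by column (MATLAB's default order, and what the transpose in the visualization code later in the post suggests). The helper name `unflatten_column_major` and the 3x3 sample values are made up for illustration.

```python
def unflatten_column_major(flat, side):
    """Rebuild a side x side image (a list of rows) from a column-major flat list."""
    if len(flat) != side * side:
        raise ValueError("expected %d values, got %d" % (side * side, len(flat)))
    # In column-major order, the pixel at (row r, column c) sits at index c * side + r.
    return [[flat[c * side + r] for c in range(side)] for r in range(side)]


# Tiny 3x3 stand-in for a 20x20 digit; the values are illustrative only.
flat = [1, 2, 3, 4, 5, 6, 7, 8, 9]
image = unflatten_column_major(flat, 3)
for row in image:
    print(row)
```

This gives the same result as NumPy's `reshape(side, side).T` on the same flat array, so the first three flat values end up as the first column of the image.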


To do the classification, I used PyBrain to build a 3-layer feed-forward neural network. The input layer has 400 units, each corresponding to a single feature. The output layer has 10 units, each corresponding to one of the possible numeric values. The hidden layer has 25 units, based on the guidelines in the programming assignment. I then split the data 75/25 into training and test sets, used a backpropagation trainer to train the network on the training set, and computed accuracy on the test set.
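To make the 400 x 25 x 10 architecture concrete, here is a rough sketch of a single forward pass in plain Python. It is not PyBrain's implementation: it assumes a sigmoid hidden layer, a softmax output layer, and a bias on both layers (which matches the network printout further down), and the weight initialization range and all helper names are assumptions chosen for illustration.

```python
import math
import random

N_INPUT, N_HIDDEN, N_OUTPUT = 400, 25, 10


def count_parameters(n_in, n_hidden, n_out):
    """Number of weights, plus one bias per hidden unit and per output unit."""
    return (n_in + 1) * n_hidden + (n_hidden + 1) * n_out


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    # Subtract the maximum before exponentiating, for numerical stability.
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def layer(inputs, weights, biases):
    """Weighted sum per unit; weights[j] holds the incoming weights of unit j."""
    return [sum(w * x for w, x in zip(ws, inputs)) + b
            for ws, b in zip(weights, biases)]


def forward(x, w_hidden, b_hidden, w_out, b_out):
    hidden = [sigmoid(z) for z in layer(x, w_hidden, b_hidden)]
    return softmax(layer(hidden, w_out, b_out))


rng = random.Random(0)  # fixed seed so the sketch is repeatable


def random_matrix(rows, cols):
    # Small random weights; the range is an arbitrary choice for this sketch.
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


w_hidden = random_matrix(N_HIDDEN, N_INPUT)
b_hidden = [0.0] * N_HIDDEN
w_out = random_matrix(N_OUTPUT, N_HIDDEN)
b_out = [0.0] * N_OUTPUT

x = [rng.random() for _ in range(N_INPUT)]  # stand-in for one flattened image
probs = forward(x, w_hidden, b_hidden, w_out, b_out)

print(count_parameters(N_INPUT, N_HIDDEN, N_OUTPUT))  # (400+1)*25 + (25+1)*10 = 10285
print(len(probs))
```

Training adjusts those 10,285 parameters so that the largest of the 10 output values lines up with the correct digit.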

PyBrain has its own routines for splitting a dataset into training and test sets, computing accuracy, etc., but since I am more familiar with the utility classes in Scikit-Learn, I used these instead where possible. Here is the code. It's heavily documented, so a narrative is probably unnecessary.

# Source: src/digit_recognition/neural_network.py
from __future__ import division

import numpy as np
import matplotlib.pyplot as plt
import scipy.io
import math

# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score

from pybrain.datasets import ClassificationDataSet
from pybrain.tools.shortcuts import buildNetwork
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.structure.modules import SoftmaxLayer

def load_dataset(dataset, X, y):
    """Add each row of X to dataset, with its label one-hot encoded (10 classes)."""
    enc = OneHotEncoder(n_values=10)
    yenc = enc.fit_transform(np.matrix(y)).todense()
    for i in range(y.shape[0]):
        dataset.addSample(X[i, :], yenc[i][0])

NUM_EPOCHS = 50
NUM_HIDDEN_UNITS = 25

print "Loading MATLAB data..."    
data = scipy.io.loadmat("../../data/digit_recognition/ex3data1.mat")
X = data["X"]
y = data["y"]
y[y == 10] = 0 # '0' is encoded as '10' in data, fix it
n_features = X.shape[1]
n_classes = len(np.unique(y))

# visualize data
# get 100 rows of the input at random
print "Visualize data..."
idxs = np.random.randint(X.shape[0], size=100)
fig, ax = plt.subplots(10, 10)
img_size = int(math.sqrt(n_features))  # reshape() needs integer dimensions
for i in range(10):
    for j in range(10):
        Xi = X[idxs[i * 10 + j], :].reshape(img_size, img_size).T
        ax[i, j].set_axis_off()
        ax[i, j].imshow(Xi, aspect="auto", cmap="gray")
plt.show()

# hold out 25% of the data as a test set
print "Split data into training and test sets..."
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25, 
                                                random_state=42)
ds_train = ClassificationDataSet(X.shape[1], 10)
load_dataset(ds_train, Xtrain, ytrain)

# build a 400 x 25 x 10 Neural Network
print "Building %d x %d x %d neural network..." % (n_features, 
                                                   NUM_HIDDEN_UNITS, n_classes)
fnn = buildNetwork(n_features, NUM_HIDDEN_UNITS, n_classes, bias=True, 
                   outclass=SoftmaxLayer)
print fnn

# train network
print "Training network..."
trainer = BackpropTrainer(fnn, ds_train)
for i in range(NUM_EPOCHS):
    error = trainer.train()
    print "Epoch: %d, Error: %7.4f" % (i, error)
    
# predict using test data
print "Making predictions..."
ypreds = []
ytrues = []
for i in range(Xtest.shape[0]):
    pred = fnn.activate(Xtest[i, :])
    ypreds.append(pred.argmax())
    ytrues.append(ytest[i])
print "Accuracy on test set: %7.4f" % accuracy_score(ytrues, ypreds, 
                                                     normalize=True)
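
The encoding and scoring steps in the code above are handled by library calls, so here is a small plain-Python sketch of what they amount to: one-hot encoding a label, picking the predicted class as the index of the largest output, and computing accuracy as the fraction of matches. The function names and the 4-class sample outputs are invented for illustration; this is not how Scikit-Learn or PyBrain implement these steps internally.

```python
def one_hot(label, n_classes=10):
    """Return a list of n_classes zeros with a 1 at position label."""
    vec = [0] * n_classes
    vec[label] = 1
    return vec


def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return float(correct) / len(y_true)


# Made-up network outputs for four samples of a 4-class problem.
outputs = [
    [0.05, 0.80, 0.05, 0.10],
    [0.60, 0.10, 0.20, 0.10],
    [0.10, 0.20, 0.30, 0.40],
    [0.25, 0.25, 0.40, 0.10],
]
y_true = [1, 0, 3, 0]
y_pred = [argmax(o) for o in outputs]

print(one_hot(3))                  # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
print(y_pred)                      # [1, 0, 3, 2]
print(accuracy(y_true, y_pred))    # 0.75
```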

The highest test accuracy, approximately 91.5%, was achieved with a neural network trained for 50 epochs. The corresponding accuracy on the training set was 99.8% (error of 0.0023). The learning curve for the neural network is shown below.


The corresponding (truncated for readability) output for the code above is shown below:

Loading MATLAB data...
Visualize data...
Split data into training and test sets...
Building 400 x 25 x 10 neural network...
FeedForwardNetwork-8
   Modules:
    [<BiasUnit 'bias'>, <LinearLayer 'in'>, <SigmoidLayer 'hidden0'>, 
     <SoftmaxLayer 'out'>]
   Connections:
    [<FullConnection 'FullConnection-4': 'in' -> 'hidden0'>, 
     <FullConnection 'FullConnection-5': 'bias' -> 'out'>, 
     <FullConnection 'FullConnection-6': 'bias' -> 'hidden0'>, 
     <FullConnection 'FullConnection-7': 'hidden0' -> 'out'>]

Training network...
Epoch: 0, Error:  0.0394
Epoch: 1, Error:  0.0241
Epoch: 2, Error:  0.0191
Epoch: 3, Error:  0.0163
Epoch: 4, Error:  0.0143
Epoch: 5, Error:  0.0129
...
Epoch: 45, Error:  0.0025
Epoch: 46, Error:  0.0025
Epoch: 47, Error:  0.0024
Epoch: 48, Error:  0.0024
Epoch: 49, Error:  0.0023
Making predictions...
Accuracy on test set:  0.9148

Kaggle has a Digit Recognizer competition (for knowledge) which has a larger dataset of 42,000 training rows and 28,000 unlabelled rows. The digits in this dataset are represented as 28x28 pixel images (flattened to 1x784 arrays of numbers in the range 0-255). I ran the code above (with some obvious changes to read CSV files instead of MATLAB files, additional code to predict values for the submission set, etc.). With 250 epochs and 100 hidden units, the accuracy on the held-out data was 69.86%, and the accuracy on the submission set was only slightly higher at 69.87%.
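For the CSV change, a minimal sketch of the reading step might look like the following. This is not the code actually used for the submission; it assumes the training file has a header row with a `label` column first, followed by one column per pixel, and the function name `read_labelled_rows` and the tiny 4-pixel sample are hypothetical.

```python
import csv
import io


def read_labelled_rows(handle):
    """Parse rows of 'label,pixel0,pixel1,...' into (labels, pixel rows)."""
    reader = csv.reader(handle)
    header = next(reader)
    if header[0] != "label":
        raise ValueError("expected the first column to be 'label'")
    labels, pixels = [], []
    for row in reader:
        labels.append(int(row[0]))
        pixels.append([int(v) for v in row[1:]])
    return labels, pixels


# In-memory stand-in for the real file: 4 pixels per row instead of 784.
sample = io.StringIO(
    "label,pixel0,pixel1,pixel2,pixel3\n"
    "7,0,128,255,0\n"
    "2,0,0,64,12\n"
)
labels, pixels = read_labelled_rows(sample)
print(labels)
print(pixels)
```

With a real file, the same function can be given an open file handle in place of the `io.StringIO` object, and the resulting lists can be converted to NumPy arrays to stand in for `X` and `y`.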