Saturday, July 26, 2014

Handwritten Digit Recognition with PyBrain


I am taking Andrew Ng's ML class on Coursera again. The last time I took it was approximately 3 years ago, when I was just starting out learning about Machine Learning. This time round, I am not submitting any of the programming assignments because I am doing them in Python rather than in Octave.

Last week's (and this week's) topic was Neural Networks. Instead of building a Neural Network from first principles as required by the Programming Assignment, I decided to use the opportunity to explore PyBrain, a Python machine learning library for building Neural Networks.

The task is to classify images of handwritten digits into the numbers 0-9. The data is a subset of the MNIST Database. It consists of 5,000 grayscale images of single handwritten digits, each 20x20 pixels flattened into a 1x400 array of grayscale values 0-127, along with the actual value of the digit. The data is provided as a MATLAB .mat file for the assignment. Here is a sample of the data (visualization code included below).


To do the classification, I used PyBrain to build a 3-layer FeedForward Neural Network. The input layer has 400 units, each corresponding to a single feature. The output layer has 10 units, each corresponding to one of the possible digit values. The hidden layer has 25 units, based on the guidelines in the programming assignment. I then split the data 75/25 into training and test sets, trained the network on the training set with a backpropagation trainer, and computed accuracy on the test set.

PyBrain has its own routines for splitting a dataset into training and test sets, computing accuracy, etc., but since I am more familiar with the utility classes in Scikit-Learn, I used those instead where possible. Here is the code; it's heavily documented, so a narrative is probably unnecessary.

# Source: src/digit_recognition/neural_network.py
from __future__ import division

import numpy as np
import matplotlib.pyplot as plt
import scipy.io
import math

from sklearn.cross_validation import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics import accuracy_score

from pybrain.datasets import ClassificationDataSet
from pybrain.tools.shortcuts import buildNetwork
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.structure.modules import SoftmaxLayer

def load_dataset(dataset, X, y):
    # convert labels to one-hot vectors and add each (features,
    # one-hot label) pair to the supplied PyBrain dataset
    enc = OneHotEncoder(n_values=10)
    yenc = enc.fit_transform(np.matrix(y)).todense()
    for i in range(y.shape[0]):
        dataset.addSample(X[i, :], np.asarray(yenc[i]).ravel())

NUM_EPOCHS = 50
NUM_HIDDEN_UNITS = 25

print "Loading MATLAB data..."    
data = scipy.io.loadmat("../../data/digit_recognition/ex3data1.mat")
X = data["X"]
y = data["y"]
y[y == 10] = 0 # '0' is encoded as '10' in data, fix it
n_features = X.shape[1]
n_classes = len(np.unique(y))

# visualize data
# get 100 rows of the input at random
print "Visualize data..."
idxs = np.random.randint(X.shape[0], size=100)
fig, ax = plt.subplots(10, 10)
img_size = int(math.sqrt(n_features)) # 20x20 images; reshape needs ints
for i in range(10):
    for j in range(10):
        Xi = X[idxs[i * 10 + j], :].reshape(img_size, img_size).T
        ax[i, j].set_axis_off()
        ax[i, j].imshow(Xi, aspect="auto", cmap="gray")
plt.show()

# split up training data for cross validation
print "Split data into training and test sets..."
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.25, 
                                                random_state=42)
ds_train = ClassificationDataSet(n_features, n_classes)
load_dataset(ds_train, Xtrain, ytrain)

# build a 400 x 25 x 10 Neural Network
print "Building %d x %d x %d neural network..." % (n_features, 
                                                   NUM_HIDDEN_UNITS, n_classes)
fnn = buildNetwork(n_features, NUM_HIDDEN_UNITS, n_classes, bias=True, 
                   outclass=SoftmaxLayer)
print fnn

# train network
print "Training network..."
trainer = BackpropTrainer(fnn, ds_train)
for i in range(NUM_EPOCHS):
    error = trainer.train()
    print "Epoch: %d, Error: %7.4f" % (i, error)
    
# predict using test data
print "Making predictions..."
ypreds = []
ytrues = []
for i in range(Xtest.shape[0]):
    pred = fnn.activate(Xtest[i, :])
    ypreds.append(pred.argmax())
    ytrues.append(ytest[i])
print "Accuracy on test set: %7.4f" % accuracy_score(ytrues, ypreds, 
                                                     normalize=True)

The highest test accuracy, approximately 91.5%, was achieved with a network trained for 50 epochs. The corresponding accuracy on the training set was 99.8% (an error of 0.0023). The learning curve for the neural network is shown below.
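The curve plots the per-epoch training error printed by the training loop; a minimal sketch of how it could be recorded alongside test accuracy and plotted (my reconstruction, not the original plotting code, reusing fnn, trainer, Xtest and ytest from the listing above):

# Sketch: replaces the plain training loop above; records training error
# and test accuracy at each epoch, then plots the learning curve.
train_errors, test_accs = [], []
for epoch in range(NUM_EPOCHS):
    train_errors.append(trainer.train())
    preds = [fnn.activate(Xtest[i, :]).argmax()
             for i in range(Xtest.shape[0])]
    test_accs.append(accuracy_score(ytest.ravel(), preds))
plt.plot(train_errors, label="training error")
plt.plot(test_accs, label="test accuracy")
plt.xlabel("epoch")
plt.legend()
plt.show()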


The corresponding output of the code above (truncated for readability) is shown below:

Loading MATLAB data...
Visualize data...
Split data into training and test sets...
Building 400 x 25 x 10 neural network...
FeedForwardNetwork-8
   Modules:
    [<BiasUnit 'bias'>, <LinearLayer 'in'>, <SigmoidLayer 'hidden0'>, 
     <SoftmaxLayer 'out'>]
   Connections:
    [<FullConnection 'FullConnection-4': 'in' -> 'hidden0'>, 
     <FullConnection 'FullConnection-5': 'bias' -> 'out'>, 
     <FullConnection 'FullConnection-6': 'bias' -> 'hidden0'>, 
     <FullConnection 'FullConnection-7': 'hidden0' -> 'out'>]

Training network...
Epoch: 0, Error:  0.0394
Epoch: 1, Error:  0.0241
Epoch: 2, Error:  0.0191
Epoch: 3, Error:  0.0163
Epoch: 4, Error:  0.0143
Epoch: 5, Error:  0.0129
...
Epoch: 45, Error:  0.0025
Epoch: 46, Error:  0.0025
Epoch: 47, Error:  0.0024
Epoch: 48, Error:  0.0024
Epoch: 49, Error:  0.0023
Making predictions...
Accuracy on test set:  0.9148

Kaggle has a Digit Recognizer competition (a "for knowledge" competition) with a larger dataset of 42,000 labeled training rows and 28,000 unlabeled rows. The digits in this dataset are represented as 28x28 pixel images (flattened to 1x784 arrays of integers in the range 0-255). I ran the code above against it (with some obvious changes to read CSV files instead of MATLAB files, additional code to predict values for the submission set, etc.). With 250 epochs and 100 hidden units, the accuracy on the held-out data was 69.86%, and the accuracy on the submission set was only slightly higher at 69.87%.
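For reference, the changes for reading the Kaggle data boil down to loading the CSV files into the same X and y arrays; something like the sketch below (the column layout is Kaggle's documented format, the file paths are assumptions):

import numpy as np

# Sketch: load the Kaggle Digit Recognizer data. train.csv has a header
# row, then a label column followed by 784 pixel columns; test.csv has
# only the 784 pixel columns.
train = np.loadtxt("train.csv", delimiter=",", skiprows=1)
y = train[:, 0].astype(int)   # digit labels 0-9
X = train[:, 1:]              # flattened 28x28 pixel rows
Xsubmit = np.loadtxt("test.csv", delimiter=",", skiprows=1)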

10 comments (moderated to prevent spam):

Diego said...

Could you compare the speed with what's obtained in Octave? Great work.

Sujit Pal said...

Thanks. Regarding the speed comparison, I didn't do any of the Octave exercises this time, as I mentioned, and unfortunately my submissions from my previous iteration of the class perished in a disk crash about 1.5 years ago.

Unknown said...

Please help me by posting code for breast cancer diagnosis using the backpropagation algorithm... please.
Thank you.

Sujit Pal said...

Hi Rahul, if you mean the paper "Breast cancer diagnosis using back-propagation algorithm" by Mishra et al., your best bet may be to contact the authors directly and ask, explaining why you need it. Also, if you (or your institution) happen to have a subscription to the ACM Digital Library, you can just get it from the site.

Anonymous said...

Hi Sujit

I'm working on a project of my own in which I diagnose a specific form of a disease after performing several image analyses on an image. I have thousands of spreadsheets worth of data, and I cannot afford to manually feed in everything. Can you advise me on how to automate the process? I'm using PyBrain too.

Sujit Pal said...

PyBrain takes its input as a matrix of features (one row per record, one column per feature) plus a vector of labels for classification and training. You can dump the spreadsheets into CSV files (the part where you open each spreadsheet and click File > Save As probably needs to stay manual), then use Python to convert them into numpy matrices and vectors.
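Something along these lines should do the conversion step (the file names and column layout here are made up; adjust to your data):

import glob
import numpy as np

# Sketch: stack all exported CSVs into one feature matrix and one label
# vector, assuming the last column of each file is the label.
parts = [np.loadtxt(f, delimiter=",", skiprows=1, ndmin=2)
         for f in glob.glob("sheets/*.csv")]
data = np.vstack(parts)
X = data[:, :-1]             # one row per record
y = data[:, -1].astype(int)  # label vector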

Unknown said...

In Line 6 of the code, I am getting the error "ImportError: cannot import name train_test_split".
What could be the reason?

Sujit Pal said...

Perhaps your sklearn is a bit old (or at least different from what I have)? My version of sklearn (using import sklearn; sklearn.__version__) is 0.16.1. I also see, from the sklearn documentation page about cross-validation, that the train_test_split function does exist in the sklearn.cross_validation package.
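For readers on newer versions: train_test_split later moved to the sklearn.model_selection package (in 0.18, when sklearn.cross_validation was deprecated), so a version-tolerant import looks like this:

# Handle both old and new scikit-learn package layouts.
try:
    from sklearn.model_selection import train_test_split   # sklearn >= 0.18
except ImportError:
    from sklearn.cross_validation import train_test_split  # older versions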

Unknown said...

Hi Sujit, I am reaching out to you based on your PyBrain expertise. I am using PyBrain for an NLP application. My dataset involves 0.6M data points, each with 0.16M dimensions.

To prepare the feature vectors, I built a bag-of-words model, then set each position to 0/1 depending on whether the word is present in the tweet text or not. My feature vectors are very sparse (12-15 ones in a vector of length 0.16M), so for efficiency I am storing only the nonzero index values. Is there a way to directly feed this compact representation in for training and testing a neural net in PyBrain? (Something like scipy.sparse?)

Sujit Pal said...

Hi Unknown, I would not call my experience with PyBrain expertise :-). To answer your question, I don't know for sure whether PyBrain accepts sparse matrices. But scipy.sparse matrices interoperate with numpy matrices, and PyBrain takes numpy matrices, so hopefully it should work. If your code is all set up, it may be worth just trying it out (or copy-paste my code and make the Xtrain, ytrain, Xtest and ytest in line#50 of neural_network.py scipy.sparse matrices instead, then run it). If you do this, please share your findings.
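Untested, but the kind of thing I mean is sketched below (Xtrain as a scipy.sparse CSR matrix, yenc the one-hot label matrix built as in load_dataset above):

import numpy as np

# Sketch: add rows of a scipy.sparse matrix to the PyBrain dataset one
# at a time, so the full dense matrix is never materialized.
for i in range(Xtrain.shape[0]):
    row = np.asarray(Xtrain.getrow(i).todense()).ravel()
    ds_train.addSample(row, np.asarray(yenc[i]).ravel())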