Saturday, April 06, 2019

Matrix Factorization as Gradient Descent using Tensorflow 2.x


Last month, at the Tensorflow Dev Summit 2019, Google announced the release of Tensorflow 2.0-alpha, which I thought was something of a quantum jump in terms of its evolution. The biggest change, in my opinion, is the switch to eager execution as the default mode. The next big thing is the adoption of Keras as the primary high level API (tf.keras). The low level session based API still remains, and can be used to build components that interoperate with components built using the tf.keras API.

Other big changes include the introduction of the new tf.data package for building input pipelines, new distribution strategies that allow your code to run unchanged on your laptop as well as on multi-GPU or TPU environments, better interop between tf.Estimator and your tf.keras models, and a single SavedModel format for saving models that works across the Tensorflow ecosystem (Tensorflow Hub, Tensorflow Serving, etc).

None of these features are brand new of course. But together, they make Tensorflow much more attractive to me. As someone who started with Keras and gradually moved to Tensorflow 1.x because it's part of the same ecosystem, I had a love-hate relationship with it. I couldn't afford not to use it because of its reach and power, but at the same time, I found the API complex and hard to debug, and I didn't particularly enjoy working with it. In comparison, I found Pytorch's API much more intuitive and fun to work with. TF 2.0 code looks a lot like Pytorch to me, and as a user, I like this a lot.

At work, we have formed an interest group to teach each other Deep Learning, somewhat similar to the AI & Deep Learning Enthusiasts meetup group I was part of a few years ago. At the meetup, we would come together one day a week to watch videos of various DL lectures and discuss them afterwards. The interest group functions similarly so far, except that we are geographically distributed, so it's not possible to watch the videos together; instead we watch them in advance and then get together weekly to discuss what we learned. Currently we are watching Practical Deep Learning for Coders, v3, taught by Jeremy Howard and Rachel Thomas of fast.ai.

Lecture 4 of this course was about Recommender Systems, and one of the examples was how to use Pytorch's optimizers to do Matrix Factorization using Gradient Descent. I had learned a similar technique in the Matrix Factorization and Advanced Techniques mini-course at Coursera, taught by Profs Michael Ekstrand and Joseph Konstan, which is part of the Recommender Systems specialization for which they are better known. At the time I had started to implement some of these techniques and even blogged about it, but ended up never implementing this particular technique, so I figured that this might be a good way to get my hands dirty with a little TF 2.x programming. So that's what I did, and this is what I am going to talk about today.

Matrix Factorization is the process of decomposing a matrix into (in this case) two matrices, which when multiplied back yields an approximation of the original matrix. In the context of a movie Recommendation Systems, the input X is the ratings matrix, a (num_users x num_movies) sparse matrix. Sparse because most users haven't rated most movies. Matrix Factorization would split them into a pair of matrices M and U of shapes (num_movies x k) and (num_users x k) respectively, representing movies and users respectively. Here k represents a latent variable that encodes a movie or user -- thus a movie or user can now be represented by a vector of size k.
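Written out, the two formulations look roughly like this -- the first is the plain factorization, and the second adds the bias terms described next:

X̂ ≈ U · Mᵀ

X̂ = σ(U · Mᵀ + bU + bM + bG)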


As a practical matter, we also want to factor in that people are different and might rate a movie differently, even if they feel the same way about it. Similarly, movies are different, and the same rating given to two different movies doesn't necessarily mean the same thing. So we factor out these biases and call them the user bias bU, the movie (or item) bias bM, and the global bias bG. This is shown in the second equation above, and this is the formulation we will implement below.

In order to model this problem as a Gradient Descent problem, we can start with random matrices U and M, random vectors bU and bM, and a random scalar bG. We then attempt to compute an approximation X̂ of the ratings matrix X by composing these random tensors as shown in the second equation, and passing the result through a non-linear activation (sigmoid in our case). We then compute the loss as the mean square error between X and X̂ and then update the random tensors by the gradients of the loss with respect to each tensor. We continue this process until the loss is below some acceptable threshold.

Here is the partial code for doing the matrix factorization. I have created two classes - one for the MatrixFactorization layer and another for the MatrixFactorizer network, consisting of the custom MatrixFactorization layer followed by a Sigmoid activation layer. Notice the similarity between the MatrixFactorizer's call() method and Pytorch's forward(). The loss function is the MeanSquaredError, and the optimizer used is RMSprop. The training loop runs for 5000 steps; at each step, the network recomputes X̂ and the loss between it and the original X tensor. The GradientTape automatically computes the gradient of the loss w.r.t. each variable, and the optimizer updates the variables.

One thing that differs between the implementation suggested in the Matrix Factorization course and the one in the fast.ai course is the Sigmoid activation layer. Since the Sigmoid squeezes its input into the range [0, 1], and our ratings are in the range [0, 5], we need to scale our input data accordingly. This is done in the load_data() function, which is not shown here in the interest of space; see the sketch below for the scaling step. You can find the full code for the matrix factorization and prediction modules on GitHub.
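As a rough sketch of that scaling step (this is not the actual load_data(), which also builds the dense ratings matrix X and the user and movie id lookup arrays; the column names assume the standard MovieLens CSV layout):

import pandas as pd

RATINGS_FILE = "../../data/ratings.csv"

# MovieLens ratings.csv has columns userId, movieId, rating, timestamp;
# dividing by the maximum rating (5.0) squeezes ratings into the [0, 1]
# range produced by the sigmoid output of the network
ratings_df = pd.read_csv(RATINGS_FILE)
ratings_df["rating"] = ratings_df["rating"] / 5.0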

import matplotlib.pyplot as plt
import tensorflow as tf

from pathlib import Path


class MatrixFactorization(tf.keras.layers.Layer):
    def __init__(self, emb_sz, **kwargs):
        super(MatrixFactorization, self).__init__(**kwargs)
        self.emb_sz = emb_sz

    def build(self, input_shape):
        num_users, num_movies = input_shape
        self.U = self.add_variable("U", 
            shape=[num_users, self.emb_sz], 
            dtype=tf.float32,
            initializer=tf.initializers.GlorotUniform)
        self.M = self.add_variable("M", 
            shape=[num_movies, self.emb_sz],
            dtype=tf.float32, 
            initializer=tf.initializers.GlorotUniform)
        self.bu = self.add_variable("bu",
            shape=[num_users],
            dtype=tf.float32, 
            initializer=tf.initializers.Zeros)
        self.bm = self.add_variable("bm",
            shape=[num_movies],
            dtype=tf.float32, 
            initializer=tf.initializers.Zeros)
        self.bg = self.add_variable("bg", 
            shape=[],
            dtype=tf.float32,
            initializer=tf.initializers.Zeros)

    def call(self, input):
        # Xhat = U . M^T, plus the user bias (broadcast across columns),
        # the movie bias (broadcast across rows), and the scalar global bias
        return (tf.add(
            tf.add(
                tf.matmul(self.U, tf.transpose(self.M)),
                tf.expand_dims(self.bu, axis=1)),
            tf.expand_dims(self.bm, axis=0)) +
            self.bg)


class MatrixFactorizer(tf.keras.Model):
    def __init__(self, embedding_size):
        super(MatrixFactorizer, self).__init__()
        self.matrixFactorization = MatrixFactorization(embedding_size)
        self.sigmoid = tf.keras.layers.Activation("sigmoid")

    def call(self, input):
        output = self.matrixFactorization(input)
        output = self.sigmoid(output)
        return output


def loss_fn(source, target):
    mse = tf.keras.losses.MeanSquaredError()
    loss = mse(source, target)
    return loss


DATA_DIR = Path("../../data")
MOVIES_FILE = DATA_DIR / "movies.csv"
RATINGS_FILE = DATA_DIR / "ratings.csv"
WEIGHTS_FILE = DATA_DIR / "mf-weights.h5"

EMBEDDING_SIZE = 15

X, user_idx2id, movie_idx2id = load_data()

model = MatrixFactorizer(EMBEDDING_SIZE)
model.build(input_shape=X.shape)
model.summary()

optimizer = tf.optimizers.RMSprop(learning_rate=1e-3, momentum=0.9)

losses, steps = [], []
for i in range(5000):
    with tf.GradientTape() as tape:
        Xhat = model(X)
        loss = loss_fn(X, Xhat)
        if i % 100 == 0:
            loss_value = loss.numpy()
            losses.append(loss_value)
            steps.append(i)
            print("step: {:d}, loss: {:.3f}".format(i, loss_value))
    variables = model.trainable_variables
    gradients = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(gradients, variables))

# plot training loss
plt.plot(steps, losses, marker="o")
plt.xlabel("steps")
plt.ylabel("loss")
plt.show()

The chart below shows the loss plotted across 5000 training steps. As can be seen, the loss falls quickly and then flattens out.


On the prediction side, we can now use the factorized matrices M and U as embeddings for movies and users respectively. We have chosen k=15, so effectively, we can now describe a movie or a user in terms of a vector of 15 latent features. So it is now possible to find movies similar to a given movie by simply doing a dot product of its vector with all the other vectors, and reporting the top N movies whose vectors yield the highest dot product.
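The prediction code below works with plain numpy copies of the learned tensors. One way to pull them out of the trained model -- a sketch that assumes the weights come back in the same order as the add_variable() calls in build() -- is:

# extract U, M, bu, bm, bg from the custom layer as numpy arrays
U, M, bu, bm, bg = [w.numpy() for w in model.matrixFactorization.weights]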

import numpy as np

# content based: find movies similar to given movie
# Batman & Robin (1997) -- movie_id = 1562
print("*** movies similar to given movie ***")
TOP_N = 10
movie_idx = np.argwhere(movie_idx2id == 1562)[0][0]
source_vec = np.expand_dims(M[movie_idx], axis=1)
movie_sims = np.matmul(M, source_vec)
similar_movie_ids = np.argsort(-movie_sims.reshape(-1,))[0:TOP_N]
baseline_movie_sim = None
for smid in similar_movie_ids:
    movie_id = movie_idx2id[smid]
    title, genres = movie_id2title[movie_id]
    genres = genres.replace('|', ', ')
    movie_sim = movie_sims[smid][0]
    if baseline_movie_sim is None:
        baseline_movie_sim = movie_sim
    movie_sim /= baseline_movie_sim
    print("{:.5f} {:s} ({:s})".format(movie_sim, title, genres))

This gives us the following results for the top 10 movies our model thinks are similar to Batman & Robin.

*** movies similar to given movie ***
1.00000 Batman & Robin (1997) (Action, Adventure, Fantasy, Thriller)
0.83674 Apollo 13 (1995) (Adventure, Drama, IMAX)
0.81504 Island, The (2005) (Action, Sci-Fi, Thriller)
0.72535 Rescuers Down Under, The (1990) (Adventure, Animation, Children)
0.69084 Crow: City of Angels, The (1996) (Action, Thriller)
0.68452 Lord of the Rings: The Two Towers, The (2002) (Adventure, Fantasy)
0.65504 Saving Private Ryan (1998) (Action, Drama, War)
0.64059 Cast Away (2000) (Drama)
0.63650 Fugitive, The (1993) (Thriller)
0.61579 My Neighbor Totoro (Tonari no Totoro) (1988) (Animation, Children, Drama, Fantasy)

Similarly, we can recommend new movies for a given user, or predict the user's ratings for them, by scanning across that user's row of the approximated matrix X̂ and finding the entries with the highest values. Note that we need to remove the biases from our computed X̂ matrix and rescale it, so the predicted ratings for our recommendations are comparable to the original ratings.

import numpy as np
import operator

from sklearn.preprocessing import MinMaxScaler

# collaborative filtering based: find movies for user
# user: 121403 has rated 29 movies, we will identify movie
# recommendations for this user that they haven't rated
print("*** top movie recommendations for user ***")
USER_ID = 121403
user_idx = np.argwhere(user_idx2id == USER_ID)
Xhat = (
    np.add(
        np.add(
            np.matmul(U, M.T), 
            np.expand_dims(-bu, axis=1)
        ),
        np.expand_dims(-bm, axis=0)
    ) - bg)
scaler = MinMaxScaler()
Xhat = scaler.fit_transform(Xhat)
Xhat *= 5

user_preds = Xhat[user_idx].reshape(-1)
pred_movie_idxs = np.argsort(-user_preds)

print("**** already rated (top {:d}) ****".format(TOP_N))
mids_already_rated = set([mid for (mid, rating) in uid2mids[USER_ID]])
ordered_mrs = sorted(uid2mids[USER_ID], key=operator.itemgetter(1), reverse=True)
for mid, rating in ordered_mrs[0:TOP_N]:
    title, genres = movie_id2title[mid]
    genres = genres.replace('|', ', ')
    pred_rating = user_preds[np.argwhere(movie_idx2id == mid)[0][0]]
    print("{:.1f} ({:.1f}) {:s} ({:s})".format(rating, pred_rating, title, genres))
print("...")
print("**** movie recommendations ****")
top_recommendations = []
for movie_idx in pred_movie_idxs:
    movie_id = movie_idx2id[movie_idx]
    if movie_id in mids_already_rated:
        continue
    pred_rating = user_preds[movie_idx]
    top_recommendations.append((movie_id, pred_rating))
    if len(top_recommendations) > TOP_N:
        break
for rec_movie_id, pred_rating in top_recommendations:
    title, genres = movie_id2title[rec_movie_id]
    genres = genres.replace('|', ', ')
    print("{:.1f} {:s} ({:s})".format(pred_rating, title, genres))

This gives us the following top recommendations for this user, along with the predicted rating that the user might give each movie.

*** top movie recommendations for user ***
4.8 Beverly Hills Cop (1984) (Action, Comedy, Crime, Drama)
4.8 Finding Neverland (2004) (Drama)
4.8 Nightmare Before Christmas, The (1993) (Animation, Children, Fantasy, Musical)
4.8 Conan the Barbarian (1982) (Action, Adventure, Fantasy)
4.8 Meet the Parents (2000) (Comedy)
4.8 Stranger than Fiction (2006) (Comedy, Drama, Fantasy, Romance)
4.8 Halloween H20: 20 Years Later (Halloween 7: The Revenge of Laurie Strode) (1998) (Horror, Thriller)
4.8 Forever Young (1992) (Drama, Romance, Sci-Fi)
4.8 Escape to Witch Mountain (1975) (Adventure, Children, Fantasy)
4.8 Blue Lagoon, The (1980) (Adventure, Drama, Romance)
4.8 Needful Things (1993) (Drama, Horror)

As a test, for a subset of movies the user has already rated, we compute the predicted ratings using our X̂ matrix. Here the first column is the actual rating, and the number in parentheses is the predicted rating.

**** already rated (top 10) ****
5.0 (4.8) Matrix, The (1999) (Action, Sci-Fi, Thriller)
5.0 (4.8) Kill Bill: Vol. 1 (2003) (Action, Crime, Thriller)
5.0 (4.8) V for Vendetta (2006) (Action, Sci-Fi, Thriller, IMAX)
5.0 (4.8) Planet Terror (2007) (Action, Horror, Sci-Fi)
4.5 (4.8) Pulp Fiction (1994) (Comedy, Crime, Drama, Thriller)
4.5 (4.8) Schindler's List (1993) (Drama, War)
4.5 (4.8) Silence of the Lambs, The (1991) (Crime, Horror, Thriller)
4.5 (4.8) Reservoir Dogs (1992) (Crime, Mystery, Thriller)
4.5 (4.8) Ghostbusters (a.k.a. Ghost Busters) (1984) (Action, Comedy, Sci-Fi)
4.5 (4.8) Snatch (2000) (Comedy, Crime, Thriller)
...

This concludes my example of using TF 2.0 to implement Matrix Factorization using Gradient Descent. I also wanted to give a quick shout-out to Tony Holdroyd for writing the Tensorflow 2.0 Quick Start Guide, and to Packt for publishing it at the opportune time. I had the good fortune of reviewing the book at around the same time as I was looking at the new features of TF 2.0, and that reduced my learning curve to a great extent.

Friday, March 29, 2019

The Amazing Composability of Convolutional Networks for Computer Vision Tasks


I last watched the videos for Stanford's CS231n: Convolutional Neural Networks for Visual Recognition almost two years ago, along with a lot of other videos, trying to scramble up the Deep Learning learning curve before I got too far behind. Recently, however, I was looking for some specific information around object detection, so I rewatched the Lecture 11: Detection and Segmentation video on Youtube. This lecture was from their 2017 class, and was taught by Justin Johnson. The lecture covers various popular approaches to object detection and segmentation, and can be summarized by the figure (taken from Darian Frajberg's presentation on Introduction to the Artificial Intelligence and Computer Vision Revolution at Politecnico di Milano) below.


What struck me again in the lecture was the incremental nature of the approaches, with each approach building on the one before it. But underneath it all, it's the same convolutional and pooling layers with classification and regression heads, just reconfigured in different ways. That is what I will write about in this post, and I hope that by the end of it, you will agree with me as to how super cool this is. Of necessity, I am going to describe the different networks at a somewhat high level, so if you are looking for more information, I would advise you to watch the video yourself. Alternatively, Leonardo Araujo dos Santos has an excellent tutorial on Object Localization and Detection that also goes into more detail than I am going to. But on to the post.

The first computer vision task is Image Classification. Here you train a network with images which can belong to one of a fixed number of classes, and use the trained model to predict the class. The Convolutional Neural Network (CNN) is the tool of choice when it comes to image classification. CNNs have consistently blown away classification benchmarks against the ImageNet dataset starting with AlexNet in 2012. Over the years, networks have grown deeper and some novel extensions to the basic CNN architecture have been developed, and today, image classification against the ImageNet dataset is considered a mostly solved problem. CNN models pre-trained against ImageNet are available for many frameworks, and people often use them to build their own classifiers using Transfer Learning.

Semantic Segmentation


The goal of semantic segmentation is to label each pixel of the image with its corresponding class label. Naively, this could be done by taking small crops, sliding a window across and down the input image, and predicting the class of the central pixel in each crop. This is expensive, since an image H pixels high and W pixels wide requires H x W forward passes for a single image.

A less expensive approach might be to run the image through a CNN that increases the number of channels without changing the width and height. At the end, each pixel is represented by a vector of size equal to the number of output channels, and each of these vectors is then used to predict the class of the corresponding pixel. While better than the previous approach, this is still prohibitively expensive and is not used in practice.

A third way is to use a CNN encoder-decoder pair, where the encoder will decrease the size of the image but increase its depth using Convolution and Pooling operations, and the decoder will use Transposed Convolutions to increase its size and decrease its depth. The input to this network is the image, and the output is the segmentation map. The U-Net, originally developed for biomedical image segmentation, contains additional skip-connections between corresponding layers of the encoder and decoder.
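As a minimal sketch of the encoder-decoder idea (my own toy example in tf.keras, not the U-Net and not code from the lecture; the input size and NUM_CLASSES are assumptions):

import tensorflow as tf

NUM_CLASSES = 21  # e.g., 20 object classes + background

# encoder: downsample with Conv2D + MaxPool, increasing depth
inputs = tf.keras.Input(shape=(128, 128, 3))
x = tf.keras.layers.Conv2D(32, (3, 3), padding="same", activation="relu")(inputs)
x = tf.keras.layers.MaxPool2D((2, 2))(x)                           # (64, 64, 32)
x = tf.keras.layers.Conv2D(64, (3, 3), padding="same", activation="relu")(x)
x = tf.keras.layers.MaxPool2D((2, 2))(x)                           # (32, 32, 64)

# decoder: upsample back to the input resolution with transposed convolutions
x = tf.keras.layers.Conv2DTranspose(64, (3, 3), strides=2, padding="same",
                                    activation="relu")(x)          # (64, 64, 64)
x = tf.keras.layers.Conv2DTranspose(32, (3, 3), strides=2, padding="same",
                                    activation="relu")(x)          # (128, 128, 32)

# per-pixel class probabilities
outputs = tf.keras.layers.Conv2D(NUM_CLASSES, (1, 1), activation="softmax")(x)
segmenter = tf.keras.Model(inputs, outputs)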

Classification + Localization


In the Classification + Localization problem, not only do you have to report the class of the object in the image, you have to report the location. The assumption in a localization problem is that you have a single object in the image.

The only addition to a standard CNN pipeline is that this network has two heads, a classification head and a regression head that reports the bounding box coordinates (x, y, w, h). The convolutional part of the network outputs a vector for the input image that is fed to a pair of dense networks jointly -- the classification head uses some sort of categorical loss function and the regression head uses a continuous loss function, and the two losses are combined into a single loss using a scalar weighting hyperparameter. You can train the heads separately and fine tune them jointly. The naming depends on where you place the regression head -- if placed before the fully connected layers, it is called the Overfeat or VGG style, and if placed after, it is called the R-CNN style.
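To make the idea concrete, here is a minimal sketch of such a two-headed network in tf.keras (my own illustration, not code from the lecture; the backbone, NUM_CLASSES and the loss weight are all placeholder choices):

import tensorflow as tf

NUM_CLASSES = 20

# shared convolutional backbone
inputs = tf.keras.Input(shape=(224, 224, 3))
x = tf.keras.layers.Conv2D(32, (3, 3), activation="relu")(inputs)
x = tf.keras.layers.MaxPool2D((2, 2))(x)
x = tf.keras.layers.Conv2D(64, (3, 3), activation="relu")(x)
x = tf.keras.layers.GlobalAveragePooling2D()(x)

# classification head (categorical loss) and regression head (continuous loss)
class_head = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax", name="cls")(x)
bbox_head = tf.keras.layers.Dense(4, name="bbox")(x)  # (x, y, w, h)

model = tf.keras.Model(inputs, [class_head, bbox_head])
model.compile(optimizer="adam",
              loss={"cls": "categorical_crossentropy", "bbox": "mse"},
              loss_weights={"cls": 1.0, "bbox": 0.5})  # scalar weighting hyperparameter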

Object Detection


The Object Detection task is the same as the Classification + Localization problem, except we now have multiple objects in the image, for each of which we have to find the class and bounding box. We don't know the number or the sizes of the objects in the image. This is a hard and almost intractable problem if we have to compute random crops.

The first solution is to use an external tool from computer vision, such as Selective Search, to compute "blobby" areas of the image as possible areas where objects might be found. These areas are called Region Proposals. The proposals are resized and used to fine-tune an image classification network, which is then used to extract a feature vector for each proposed region. Multiple binary SVMs (one per class) were used to classify each region as containing that class of object or background, and a linear regression network was used to correct the bounding boxes proposed in the region proposals. This was the Region-based Convolutional Neural Network, or R-CNN.

The first improvement, called Fast R-CNN, still gets the region proposals from an external mechanism, but projects these proposals onto the output of the CNN. An ROI Pooling layer resizes all the regions to a fixed size and passes the vector for each region to a classification head and a regression head, which predict the class and bounding box coordinates respectively.

The next improvement, called Faster R-CNN, gets rid of the external Region Proposal mechanism and replaces it with a Region Proposal Network (RPN) between the deep CNN and the ROI Pooling layer. The output of the RPN and the output of the CNN are fed into the ROI Pooling layer, whose output is fed into the fully connected part of the CNN, with a classification head and a regression head predicting the class and bounding box for each proposal respectively.

The speedup of a Fast R-CNN over the R-CNN is about 25x, and the speedup of the Faster R-CNN over a Fast R-CNN is about 10x, so a Faster R-CNN is about 250x faster than an R-CNN.

A slightly different approach to the Object Detection task is taken by the class of networks called Single Shot Detectors (SSD), one of which is the YOLO (You Only Look Once) network. The YOLO network breaks up the image into a 7x7 grid, then applies a set of B pre-determined bounding boxes with different aspect ratios to each grid cell. For each bounding box it predicts the coordinates (x, y, w, h) and a confidence score, and for each grid cell it predicts class probabilities for each of the C classes. Thus the image can be represented by a tensor of size 7 x 7 x (5B + C). The YOLO network is a CNN which converts the image to this tensor.
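To make the arithmetic concrete: with B = 2 boxes per grid cell and C = 20 classes (the Pascal VOC setting used in the original YOLO paper), each cell predicts 2 x 5 + 20 = 30 numbers, so the output tensor has shape 7 x 7 x 30.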

Instance Segmentation


The Instance Segmentation task is similar to Semantic Segmentation, with the difference that it distinguishes between individual objects of the same class, and it does not aim to label every pixel in the image. In some ways Instance Segmentation could also be considered similar to Object Detection, but instead of a bounding box, we want to find a binary mask that contains each object. The network that is used for this task is the Mask R-CNN.

Mask R-CNN adds an additional mask predicting branch (another CNN) alongside the classification and regression heads of a Faster R-CNN. This branch is applied to every region proposed by the RPN, and predicts a binary mask for the object within each proposal.

This concludes my post on Object Detection and Segmentation networks. As you can see, while the networks have grown progressively more complex as the tasks got harder, they are composed of (mostly) the same basic building blocks. One thing to note is the gradual move towards end-to-end learning, with as much of the pipeline as possible moved from external tools into the network itself.

I hope you enjoyed my high level overview of the composability of CNNs for different computer vision tasks. I have deliberately tried to keep the descriptions at a fairly high level. For more depth, consider watching the lecture this post is based on, or consult one of the many tutorials on the Internet that go into each network in greater detail, including the tutorial I mentioned earlier.


Monday, February 25, 2019

Autoencoder as a Feature Detector for Complex Classifiers


Using Autoencoders as anomaly detectors is not a new idea. For examples, see here, here and here (results of a google search "autoencoder anomaly detection"). However, one other possibility could be to use an autoencoder as a feature detector for specific features, for example, to look for specific kinds of objects in images. So assuming you wanted to detect domestic animals in images, you could train a set of autoencoders, each trained on a single animal, then run the image through the set. The intuition is that, assuming the image has a cat in it, an autoencoder trained on cat images should be able to reconstruct the image with lower reconstruction error than it would for other animals. In this post, I describe an experiment where I train an autoencoder on cat images from the Kaggle Cats and Dogs Dataset, and compare reconstruction errors on a held-out set of dog and cat images.

The Kaggle Cats and Dogs dataset contains images of 12,500 cats and 12,500 dogs, both color and monochrome. I picked a random sample of 2,500 images each of cats and dogs as my held-out test set, and trained my cat autoencoder on the remaining 10,000 cat images. The autoencoder was trained on monochrome images for 50 epochs. I then compared a few random images with their reconstructions (input and output of the autoencoder), and it looks like the autoencoder has learned to reconstruct the images quite nicely, as you can see below.






As expected, the autoencoder seems somewhat better at reconstructing cat images compared to dog images (Root Mean Square Error (RMSE) between original and reconstructed image for the cat and the dog image above is 0.04304 and 0.07650 respectively). I then computed reconstruction errors (RMSE again) for the held out set of 2,500 cat and dog images and plotted the distribution of reconstruction errors, which is shown below.


As you can see, while there is quite a bit of overlap between the two distributions, the distribution of reconstruction errors for cat images peaks earlier and is skewed left a little more than the distribution of reconstruction errors for dog images. It also looks like the distributions are approximately normal, so we can use the t-test to check whether there is a statistically significant difference between them, as explained on this page. It turns out that the t-value is 16.94 ≫ 3.291, so we can conclude that the distributions are different at the 99% confidence level.

Since the distributions of reconstruction errors are approximately normal, and since we can represent a normal distribution by its mean and standard deviation, we can now compute the probability of an image being that of a cat based on its reconstruction error. For example, our cat and dog examples have an 80.3% and 0.15% chance of being a cat respectively (code for calculating these probabilities is provided later in the post). This gives us a way to use the autoencoder as a binary cat classifier.

So that was a brief overview of the idea and the results from my experiments. Now I will describe the experiment in more detail, along with the code, so you can replicate the results for yourself, should you find it interesting enough to do so.

Data cleaning


The images are provided in a directory structure that looks like this. The Cat and Dog directories each contain 12,500 images of varying sizes, both monochrome and color. We randomly sample 2,500 images from each collection as our held-out test set and write the filenames to the flat files test-cats.txt and test-dogs.txt. The filenames of the remaining 10,000 cat images are written into train-cats.txt. We use these files as inputs to a custom image generator.

PetImages/
|
+-- Cat/
|   |
|   +-- cat images
+-- Dog/
|   |
|   +-- dog images

We standardize our input images to all be monochrome and 128 x 128 pixels in size. Most images have 3 channels (Red, Green, Blue), which we coalesce to a single channel using the Luminosity formula defined here. Some images had 4 channels (RGB + Alpha), for which we first discarded the Alpha channel and applied the same Luminosity formula to the remaining channels. A few files could not be read and were discarded. A few others had issues with their EXIF headers, which, according to this issue, appears to be a spurious problem without a solution in the Pillow package; I ended up discarding those as well. Because we reference the image files via text files, "discarding" an image just means commenting out its file name. In all, 24 filenames were discarded in this manner: 13 from the training set, and 5 and 6 from the cat and dog test sets respectively.

The code to pre-process incoming images into (128, 128) monochrome images is shown below. One thing to note is that the skimage.transform.resize() method not only resizes the image, but also silently rescales the pixel intensities from the [0, 255] range to the [0, 1] range.

from skimage.transform import resize

import os
import matplotlib.pyplot as plt
import numpy as np

IMAGE_HEIGHT, IMAGE_WIDTH, NUM_CHANNELS = 128, 128, 1

def get_processed_image(image):
    # resize to 128x128; skimage.transform.resize also rescales pixel
    # intensities to the [0, 1] range
    image_r = resize(image, (IMAGE_HEIGHT, IMAGE_WIDTH))
    if len(image_r.shape) == 2:  # 2D image
        image_r = np.expand_dims(image_r, axis=2)
    else:
        if image_r.shape[2] == 3:
            image_r = (image_r[:, :, 0] * 0.21 + 
                       image_r[:, :, 1] * 0.72 + 
                       image_r[:, :, 2] * 0.07)
        elif image_r.shape[2] == 4:
            image_r = image_r[:, :, 0:3]
            image_r = (image_r[:, :, 0] * 0.21 + 
                       image_r[:, :, 1] * 0.72 + 
                       image_r[:, :, 2] * 0.07)
        else:
            pass
        image_r = np.expand_dims(image_r, axis=2)
    assert(image_r.shape == (IMAGE_HEIGHT, IMAGE_WIDTH, NUM_CHANNELS))
    return image_r

Discarding the problem images was semi-manual. I would repeatedly loop through the three lists of files looking for exceptions, using the following code. Once I found one, I would comment it out so it would get ignored in the next run. To save time, I used a little trick with global variables so I could start from the file where I left off; that bit is commented out. The code below should run without errors once all the offending files are commented out of the lists.

import warnings

warnings.filterwarnings("ignore", ".*will be changed.*")

DATA_DIR = "../data"

CAT_IMAGE_DIR = os.path.join(DATA_DIR, "PetImages", "Cat")
DOG_IMAGE_DIR = os.path.join(DATA_DIR, "PetImages", "Dog")

CAT_TRAIN_FILE = os.path.join(DATA_DIR, "train-cats.txt")
CAT_TEST_FILE = os.path.join(DATA_DIR, "test-cats.txt")
DOG_TEST_FILE = os.path.join(DATA_DIR, "test-dogs.txt")

#LAST_FOUND_JPG = "573.jpg"
#start_processing = False

def get_source_image(imagedir, imagefile):
    image = plt.imread(os.path.join(imagedir, imagefile))
    return image

def check_images(imagedir, imagelist):
#    global start_processing
    num_processed = 0
    legend = os.path.basename(imagelist)
    flist = open(imagelist, "r")
    for line in flist:
        if line.startswith("#"):
#            if line.strip() == "#" + LAST_FOUND_JPG:
#                start_processing = True
            continue
        if num_processed % 100 == 0:
            print("{:s}: {:d} lines loaded...".format(legend, num_processed))
#        if not start_processing:
#            num_processed += 1
#            continue
        imagefile = line.strip()
        try:
            image = get_source_image(imagedir, imagefile)
            image_p = get_processed_image(image)
            if len(image_p.shape) != 3:
                print(imagefile, image.shape, image_p.shape)
        except IndexError as e1:
            print(imagefile, image.shape)
            raise e1 
        except OSError as e2:
            print(imagefile)
            raise e2
        except UserWarning as e3:
            print(imagefile)
            raise e3
        except AssertionError as e4:
            print(imagefile)
            raise e4
        num_processed += 1
    
    flist.close()
    print("{:s}: {:d} lines loaded, COMPLETE".format(legend, num_processed))

check_images(CAT_IMAGE_DIR, CAT_TRAIN_FILE)
check_images(CAT_IMAGE_DIR, CAT_TEST_FILE)
check_images(DOG_IMAGE_DIR, DOG_TEST_FILE)

Model Training


The autoencoder design is based on the one Aditya Sharma describes in his blog post on Reconstructing Brain MRI Images using Deep Learning (Convolutional Autoencoder). The network consists of 2 blocks of convolution and maxpooling layers for the encoder, followed by a final convolutional block, followed by 2 blocks of convolution and upsampling for the decoder. The input to the network is mini-batches of 128 images each, with each image having the shape (128, 128, 1). Here is the code to build the autoencoder. The loss function is Mean Square Error, since we want to minimize pixel-wise reconstruction error, and the optimizer we use is RMSprop with a learning rate of 0.0003.

import tensorflow as tf

def build_model_2():
    input_img = tf.keras.Input(shape=(128, 128, 1))
    # encoder
    conv_1 = tf.keras.layers.Conv2D(32, (3, 3), activation="relu",
                                    padding="same")(input_img)
    conv_1 = tf.keras.layers.BatchNormalization()(conv_1)
    conv_1 = tf.keras.layers.Conv2D(32, (3, 3), activation="relu",
                                    padding="same")(conv_1)
    conv_1 = tf.keras.layers.BatchNormalization()(conv_1)
    pool_1 = tf.keras.layers.MaxPool2D((2, 2))(conv_1) # (64, 64, 32)
    
    conv_2 = tf.keras.layers.Conv2D(64, (3, 3), activation="relu",
                                    padding="same")(pool_1)
    conv_2 = tf.keras.layers.BatchNormalization()(conv_2)
    conv_2 = tf.keras.layers.Conv2D(64, (3, 3), activation="relu",
                                    padding="same")(conv_2)
    conv_2 = tf.keras.layers.BatchNormalization()(conv_2)
    pool_2 = tf.keras.layers.MaxPool2D((2, 2))(conv_2) # (32, 32, 64)
    
    conv_3 = tf.keras.layers.Conv2D(128, (3, 3), activation="relu",
                                    padding="same")(pool_2)
    conv_3 = tf.keras.layers.BatchNormalization()(conv_3)
    conv_3 = tf.keras.layers.Conv2D(128, (3, 3), activation="relu",
                                    padding="same")(conv_3)
    conv_3 = tf.keras.layers.BatchNormalization()(conv_3) # (32, 32, 128)
    
    # decoder
    conv_4 = tf.keras.layers.Conv2D(64, (3, 3), activation="relu",
                                    padding="same")(conv_3)
    conv_4 = tf.keras.layers.BatchNormalization()(conv_4)
    conv_4 = tf.keras.layers.Conv2D(64, (3, 3), activation="relu",
                                    padding="same")(conv_4)
    conv_4 = tf.keras.layers.BatchNormalization()(conv_4)
    up_1 = tf.keras.layers.UpSampling2D((2, 2))(conv_4) # (64, 64, 64)
    
    conv_5 = tf.keras.layers.Conv2D(32, (3, 3), activation="relu",
                                    padding="same")(up_1)
    conv_5 = tf.keras.layers.BatchNormalization()(conv_5)
    conv_5 = tf.keras.layers.Conv2D(32, (3, 3), activation="relu",
                                    padding="same")(conv_5)
    conv_5 = tf.keras.layers.BatchNormalization()(conv_5)
    up_2 = tf.keras.layers.UpSampling2D((2, 2))(conv_5) # (128, 128, 32)
    
    decoded = tf.keras.layers.Conv2D(1, (3, 3), activation="sigmoid",
                                     padding="same")(up_2) # (128, 128, 1)
    
    autoencoder = tf.keras.Model(inputs=[input_img], outputs=[decoded])
    rmsprop = tf.keras.optimizers.RMSprop(lr=3e-4)
    autoencoder.compile(optimizer=rmsprop, loss="mse")
    return autoencoder

Or in tabular form.

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 128, 128, 1)       0         
_________________________________________________________________
conv2d (Conv2D)              (None, 128, 128, 32)      320       
_________________________________________________________________
batch_normalization (BatchNo (None, 128, 128, 32)      128       
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 128, 128, 32)      9248      
_________________________________________________________________
batch_normalization_1 (Batch (None, 128, 128, 32)      128       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 64, 64, 32)        0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 64, 64, 64)        18496     
_________________________________________________________________
batch_normalization_2 (Batch (None, 64, 64, 64)        256       
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 64, 64, 64)        36928     
_________________________________________________________________
batch_normalization_3 (Batch (None, 64, 64, 64)        256       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 32, 32, 64)        0         
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 32, 32, 128)       73856     
_________________________________________________________________
batch_normalization_4 (Batch (None, 32, 32, 128)       512       
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 32, 32, 128)       147584    
_________________________________________________________________
batch_normalization_5 (Batch (None, 32, 32, 128)       512       
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 32, 32, 64)        73792     
_________________________________________________________________
batch_normalization_6 (Batch (None, 32, 32, 64)        256       
_________________________________________________________________
conv2d_7 (Conv2D)            (None, 32, 32, 64)        36928     
_________________________________________________________________
batch_normalization_7 (Batch (None, 32, 32, 64)        256       
_________________________________________________________________
up_sampling2d (UpSampling2D) (None, 64, 64, 64)        0         
_________________________________________________________________
conv2d_8 (Conv2D)            (None, 64, 64, 32)        18464     
_________________________________________________________________
batch_normalization_8 (Batch (None, 64, 64, 32)        128       
_________________________________________________________________
conv2d_9 (Conv2D)            (None, 64, 64, 32)        9248      
_________________________________________________________________
batch_normalization_9 (Batch (None, 64, 64, 32)        128       
_________________________________________________________________
up_sampling2d_1 (UpSampling2 (None, 128, 128, 32)      0         
_________________________________________________________________
conv2d_10 (Conv2D)           (None, 128, 128, 1)       289       
=================================================================
Total params: 427,713
Trainable params: 426,433
Non-trainable params: 1,280

As I mentioned earlier, I use a custom image file generator based on my 3 text files, which point to the actual image files. To build it, I combined ideas from Arindam Baidya's blog post on How to increase your small image dataset using Keras ImageDataGenerator and Nilesh's post on Writing Custom Keras Generators. Essentially, the idea is to build a generator function that reads the flat files and yields batches of images, and a wrapper that passes each batch to the Keras ImageDataGenerator via its flow() method. To be clear, I didn't end up using any image augmentation (which is probably the biggest reason to use the ImageDataGenerator). I did try augmenting initially, but my training was not converging, which I later found to be because I was mistakenly augmenting my validation set as well. In any case, I turned off augmentation for both, since I had 10,000 images, which I figured would be more than enough to train the autoencoder. Here is the code for the generator function images_generator() and the wrapper function custom_generator().

def images_generator(image_list):
    while True:
        indices = np.random.permutation(len(image_list))
        num_batches = len(image_list) // BATCH_SIZE
        for batch in range(num_batches):
            curr_batch = indices[batch*BATCH_SIZE:(batch + 1)*BATCH_SIZE]
            xtrain = np.empty([0, IMAGE_HEIGHT, IMAGE_WIDTH, NUM_CHANNELS],
                             dtype=np.float32)
            ytrain = np.empty([0], dtype=np.int32)
            for id_batch in curr_batch:
                image = get_source_image(IMAGE_DIR, image_list[id_batch])
                image = get_processed_image(image)
                xtrain = np.append(xtrain, [image], axis=0)
                ytrain = np.append(ytrain, [0])
            yield (xtrain, ytrain)

def custom_generator(gen, augment_images):
    if augment_images:
        keras_image_generator = tf.keras.preprocessing.image.ImageDataGenerator(
            rotation_range=15,
            width_shift_range=0.1,
            height_shift_range=0.1,
            shear_range=0.01,
            zoom_range=[0.9, 1.25],
            horizontal_flip=True,
            vertical_flip=False,
            fill_mode="reflect",
            brightness_range=[0.5, 1.5]
        )
    else:
        keras_image_generator = tf.keras.preprocessing.image.ImageDataGenerator()
    for images, labels in gen:
        kigen_batch = keras_image_generator.flow(images, labels,
                batch_size=images.shape[0])
        aug_images, labels = next(kigen_batch)
        yield (aug_images, aug_images)

In the code snippet below, the training set of 10,000 cat images is split 90/10 into a training set and a validation set. We then declare custom generators on top of these image lists, and call the autoencoder's fit_generator() method, training it for 50 epochs on an AWS p2.xlarge EC2 instance.

from sklearn.model_selection import train_test_split

train_set_images = get_train_images(TRAIN_CATS)
train_images, val_images = train_test_split(train_set_images,
                                            train_size=0.9,
                                            test_size=0.1)

train_gen = images_generator(train_images)
cust_gen = custom_generator(train_gen, False)

val_gen = images_generator(val_images)
cust_val_gen = custom_generator(val_gen, False)

autoencoder = build_model_2()
print(autoencoder.summary())

checkpoint = tf.keras.callbacks.ModelCheckpoint(MODEL_FILE,
                                                save_best_only=True)
tensorboard = tf.keras.callbacks.TensorBoard(log_dir=LOG_DIR, 
                                             batch_size=BATCH_SIZE)
history = autoencoder.fit_generator(cust_gen, 
    steps_per_epoch=len(train_images) // BATCH_SIZE,
    epochs=NUM_EPOCHS,
    callbacks=[checkpoint, tensorboard],
    validation_data=cust_val_gen,
    validation_steps=len(val_images) // BATCH_SIZE)

The learning curve for the autoencoder training shown below shows that both the training and validation losses are decreasing and converging nicely.


Model Evaluation


The ModelCheckpoint callback ensures that the best model (based on the validation loss) is saved, so we can use this saved model to do our evaluations. The first evaluation I did was a few spot checks, where I rendered the input and reconstructed image (by running the input image through the trained autoencoder) side by side and computed the reconstruction error. Code for that is shown below. This code was used to produce the first two images of the cat and dog earlier in this post.

def compute_reconstruction_error(image, image_r):
    assert(image.shape == (IMAGE_WIDTH, IMAGE_HEIGHT))
    assert(image_r.shape == (IMAGE_WIDTH, IMAGE_HEIGHT))
    return np.sqrt(np.mean(np.power(image - image_r, 2)))

test_image = os.path.join(DOG_IMAGE_DIR, "4905.jpg")
    
image_p = get_processed_image(plt.imread(test_image))

model = tf.keras.models.load_model(MODEL_FILE)

image_r = model.predict(np.expand_dims(image_p, axis=0))

image_r = np.squeeze(image_r, axis=(0, 3))
image_p = np.squeeze(image_p, axis=2)
recon_error = compute_reconstruction_error(image_p, image_r)

plt.subplot(121)
image_p = (image_p * 255).astype(np.int32)
plt.xticks([])
plt.yticks([])
plt.title("original")
plt.imshow(image_p, cmap="gray")

plt.subplot(122)
image_r = (image_r * 255).astype(np.int32)
plt.xticks([])
plt.yticks([])
plt.title("reconstructed")
plt.imshow(image_r, cmap="gray")

plt.tight_layout()
plt.show()

print("reconstruction error: {:.7f}".format(recon_error))

I followed this up by computing the reconstruction error across 2,500 cats and 2,500 dogs in the held-out test set, and plotting the results in a pair of histograms. I then computed the t-value to determine if the two distributions were different with some reasonable confidence level. The code for this is shown below. This code was used to produce the overlapping histograms earlier in the post.

import math

def compute_recon_errs(model, testfile, image_dir):
    num_images = 0
    images = []
    ftest = open(testfile, "r")
    for line in ftest:
        if line.startswith("#"):
            continue
        if num_images % 1000 == 0:
            print("{:d} images read".format(num_images))
        image_file = os.path.join(image_dir, line.strip())
        image = plt.imread(image_file)
        images.append(get_processed_image(image))
        num_images += 1
    
    print("{:d} images read, COMPLETE".format(num_images))
    ftest.close()
    
    print("Running model...")
    images = np.array(images)
    images_r = model.predict(images, batch_size=BATCH_SIZE)
    
    print("Computing errors...")
    errors = np.zeros((num_images,), dtype=np.float32)
    for i in range(num_images):
        image = np.squeeze(images[i], axis=2)
        image_r = np.squeeze(images_r[i], axis=2)
        errors[i] = compute_reconstruction_error(image, image_r)
    return errors

def compute_stats(errors):
    mu = np.mean(errors)
    sd = np.std(errors)
    return mu, sd, errors.shape[0]

def compute_t(cat_stats, dog_stats):
    cat_mu, cat_sig, cat_n = cat_stats
    dog_mu, dog_sig, dog_n = dog_stats
    mu_diff = dog_mu - cat_mu
    sd_sum = math.sqrt(((dog_sig ** 2) / dog_n) + ((cat_sig ** 2) / cat_n))
    return mu_diff / sd_sum


cat_recon_errs = compute_recon_errs(model, TEST_CATS_FILE, CAT_IMAGE_DIR)
dog_recon_errs = compute_recon_errs(model, TEST_DOGS_FILE, DOG_IMAGE_DIR)

plt.hist(cat_recon_errs, bins=50, color="b", alpha=0.3, label="cats")
plt.hist(dog_recon_errs, bins=50, color="r", alpha=0.3, label="dogs")
plt.legend(loc="best")
plt.xlabel("reconstruction error")
plt.ylabel("# of images")
plt.show()

cat_stats = compute_stats(cat_recon_errs)
dog_stats = compute_stats(dog_recon_errs)
print("cat stats:", cat_stats)
print("dog stats:", dog_stats)

t_value = compute_t(cat_stats, dog_stats)
print("t_value:", t_value)
print("lookup pvalue from: https://www.biologyforlife.com/t-test.html")

And finally, here is the snippet for computing the probability of an image being that of a cat, given the reconstruction error from the cat encoder. Essentially, I compute the value of the normal distribution at the reconstruction error point, normalized by the (maximum) value at the mean.

import scipy.stats

cat_mean, cat_sd, cat_n = (0.035493888, 0.011421803, 2495)
dog_mean, dog_sd, dog_n = (0.042051364, 0.0155974245, 2494)

max_cat = scipy.stats.norm(cat_mean, cat_sd).pdf(cat_mean)

cat_recon_err = 0.04304386727035
dog_recon_err = 0.07650210412358

cat_score = scipy.stats.norm(cat_mean, cat_sd).pdf(cat_recon_err) / max_cat
dog_score = scipy.stats.norm(cat_mean, cat_sd).pdf(dog_recon_err) / max_cat

To conclude, in addition to being good at feature compression and anomaly detection, I believe autoencoders can also be useful for building complex classifiers (or feature detectors for more complex classifiers) by building on existing collections of pre-classified data. I did this experiment as a proof of concept to convince myself that this was indeed a viable approach; hopefully I have managed to convince you too.