I am participating in a weekly meetup with a TWIML (This Week in Machine Learning) group where we go through video lectures of the NYU (New York University) course Deep Learning (with Pytorch). Each week we cover one of the lectures in an "inverted classroom" manner -- we watch the video ourselves before attending, and one person leads the discussion, covering the main points of the lecture and moderating the discussion. Even though it starts from the basics, I have found the discussions to be very insightful so far. Next week's lecture is about Stochastic Gradient Descent and Backpropagation (Lecture 3), delivered by Yann LeCun. Towards the end of the lecture, he lists out some tricks for training neural networks efficiently using backpropagation.
To be fair, none of these tricks should be new information for folks who have been training neural networks. Indeed, in Keras, most if not all of these tricks can be activated by setting a parameter somewhere in your pipeline. However, this was the first time I had seen them listed down in one place, and I figured that it would be interesting to put them to the test on a simple network. That way, I could compare the effects of each of these tricks and, more importantly for me, learn how to implement them in Pytorch.
The network I chose to do this with is a CIFAR-10 classifier, implemented as a 3-layer CNN (Convolutional Neural Network), identical in structure to the one described in the Tensorflow CNN Tutorial. The CIFAR-10 dataset consists of 60,000 low resolution (32x32) RGB images in 10 classes, split into 50,000 training and 10,000 test images. The nice thing about CIFAR-10 is that it is available as a canned dataset via the torchvision package. We explore the scenarios listed below. In all cases, we train the network on the training images and validate at the end of each epoch using the test images. Finally, we evaluate each trained network using the test images, and compare the trained networks using their micro F1-scores (same as accuracy) on the test set. All models were trained with the Adam optimizer; the first two used a fixed learning rate of 1e-3, while the rest used an initial learning rate of 2e-3 with an exponential decay of about 20% per epoch. All models were trained for 10 epochs with a batch size of 64.
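For reference, loading CIFAR-10 and wrapping it in shuffled minibatches with torchvision and Pytorch's DataLoader looks roughly like the sketch below. The root directory and the bare ToTensor transform are illustrative choices, not necessarily what the notebook uses.

```python
import torch
import torchvision
import torchvision.transforms as transforms

# Load CIFAR-10 as a canned dataset and wrap it in DataLoaders.
# Batch size 64 and shuffled training batches match the setup described above.
transform = transforms.ToTensor()

train_ds = torchvision.datasets.CIFAR10(root="./data", train=True,
                                        download=True, transform=transform)
test_ds = torchvision.datasets.CIFAR10(root="./data", train=False,
                                       download=True, transform=transform)

train_dl = torch.utils.data.DataLoader(train_ds, batch_size=64, shuffle=True)
test_dl = torch.utils.data.DataLoader(test_ds, batch_size=64, shuffle=False)
```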
- baseline -- the baseline already incorporates some of the suggestions from the slide, such as using the ReLU activation function over tanh and logistic, using the Cross Entropy loss function (coupled with Log Softmax as the final activation function), doing Stochastic Gradient Descent on minibatches, and shuffling the training examples, since these are pretty basic and their usefulness is not really in question. We also use the Adam optimizer, based on a comment by LeCun during the lecture that adaptive optimizers are preferable to the original SGD optimizer. A minimal training loop in this spirit is sketched after this list.
- norm_inputs -- here we find the mean and standard deviation of the training set images, then scale the images in both the training and test sets by subtracting the mean and dividing by the standard deviation (see the norm_inputs sketch after this list).
- lr_schedule -- in the previous two cases, we used a fixed learning rate of 1e-3. While we are already using the Adam optimizer, which adapts the effective step size of each weight based on its gradient history, here we also create an Exponential Learning Rate scheduler that decays the learning rate at the end of each epoch. This is one of several built-in schedulers provided by Pytorch (see the lr_schedule sketch after this list).
- weight_decay -- weight decay is better known as L2 regularization. The idea is to add a fraction of the sum of the squared weights to the loss, so that the network minimizes that penalty along with the loss. The net effect is to keep the weights small and avoid exploding gradients. L2 regularization can be set directly via the weight_decay parameter of the optimizer. A related regularization strategy is L1 regularization, which uses the absolute values of the weights instead of their squares. L1 regularization can also be implemented in code, but it is not directly supported in the form of an optimizer parameter the way L2 regularization is (see the weight_decay sketch after this list).
- init_weights -- this does not appear in the list on the slides, but is referenced in LeCun's Efficient Backprop paper (which is listed). While module weights are initialized to random values by default, some random initializations are better than others for convergence. For ReLU activations, Kaiming (or He) initialization is preferable, which is what we used (Kaiming Uniform) in our experiment (see the init_weights sketch after this list).
- dropout_dense -- dropouts can be placed after activation functions; in our network, that means after the activation function following a Linear (or Dense) module, or after a convolutional module. Our first experiment places a Dropout module with dropout probability 0.2 after the first Linear module (see the dropout sketch after this list).
- dropout_conv -- dropout modules with dropout probability 0.2 are placed after each convolution module in this experiment.
- dropout_both -- dropout modules with dropout probability 0.2 are placed after both convolution and the first linear module in this experiment.
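To make the experiments concrete, here is a minimal sketch of the baseline training pass, reusing the train_dl built in the data loading sketch above. The placeholder model stands in for the actual CNN, and the loop is simplified to a single pass over the training set.

```python
import torch.nn as nn
import torch.optim as optim

# Baseline ingredients: cross entropy loss, Adam with a fixed learning rate of
# 1e-3, and minibatch SGD over shuffled training batches (train_dl above).
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # placeholder model

loss_fn = nn.CrossEntropyLoss()                 # LogSoftmax + NLLLoss combined
optimizer = optim.Adam(model.parameters(), lr=1e-3)

for images, labels in train_dl:
    optimizer.zero_grad()
    loss = loss_fn(model(images), labels)       # forward pass and loss
    loss.backward()                             # backpropagation
    optimizer.step()                            # weight update
```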
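The norm_inputs experiment can be wired up roughly as below. The per-channel statistics and the use of transforms.Normalize are my assumptions about the implementation; the notebook may compute the statistics differently.

```python
import torch
import torchvision
import torchvision.transforms as transforms

# Compute per-channel mean and standard deviation over the training images,
# then bake them into the transform applied to both training and test sets.
raw_train = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True,
                                         transform=transforms.ToTensor())
imgs = torch.stack([img for img, _ in raw_train])   # shape (50000, 3, 32, 32)
mean = imgs.mean(dim=(0, 2, 3))                     # per-channel means
std = imgs.std(dim=(0, 2, 3))                       # per-channel std devs

norm_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean.tolist(), std.tolist()),  # (x - mean) / std
])
```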
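The lr_schedule experiment uses Pytorch's built-in ExponentialLR scheduler. In the sketch below, gamma=0.8 is my reading of "about 20% decay per epoch", and the model is again just a placeholder.

```python
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import ExponentialLR

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # placeholder model

optimizer = optim.Adam(model.parameters(), lr=2e-3)   # initial learning rate
scheduler = ExponentialLR(optimizer, gamma=0.8)       # multiply lr by 0.8 per step

for epoch in range(10):
    # ... train on the training set and validate on the test set here ...
    scheduler.step()   # decay the learning rate at the end of each epoch
```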
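For the weight_decay experiment, L2 regularization is just an optimizer argument, while an L1 penalty has to be added to the loss by hand. The regularization strengths below are illustrative.

```python
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # placeholder model

# L2 regularization (weight decay): supported directly by the optimizer.
optimizer = optim.Adam(model.parameters(), lr=2e-3, weight_decay=1e-4)

# L1 regularization: add a fraction of the sum of absolute weights to the loss.
def loss_with_l1(base_loss, model, l1_lambda=1e-5):
    l1_penalty = sum(p.abs().sum() for p in model.parameters())
    return base_loss + l1_lambda * l1_penalty
```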
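The init_weights experiment can be implemented with the initializers in torch.nn.init. The sketch below applies Kaiming uniform initialization to the convolutional and linear layers; zeroing the biases is my own choice, not something stated above.

```python
import torch.nn as nn

def init_weights(module):
    # Kaiming (He) uniform initialization for layers followed by ReLU.
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.kaiming_uniform_(module.weight, nonlinearity="relu")
        if module.bias is not None:
            nn.init.zeros_(module.bias)

# model.apply(init_weights) walks the module tree and re-initializes each layer.
```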
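Finally, the dropout experiments amount to inserting nn.Dropout modules after the relevant activations. The network below is my guess at a CNN in the spirit of the Tensorflow CNN Tutorial and shows the dropout_both placement; dropout_dense and dropout_conv keep only the corresponding subset of the Dropout modules.

```python
import torch.nn as nn

# Dropout with p=0.2 after each convolution block and after the first Linear layer.
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2), nn.Dropout(0.2),
    nn.Conv2d(32, 64, kernel_size=3), nn.ReLU(), nn.MaxPool2d(2), nn.Dropout(0.2),
    nn.Conv2d(64, 64, kernel_size=3), nn.ReLU(), nn.Dropout(0.2),
    nn.Flatten(),
    nn.Linear(64 * 4 * 4, 64), nn.ReLU(), nn.Dropout(0.2),
    nn.Linear(64, 10),
)
```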
The code for this exercise is accessible at the link below. It was run on Colab (Google Colaboratory) on a (free) GPU instance. The Open in Colab button at the top of the notebook allows you to run it yourself if you would like to explore the code.
The notebook evaluates and reports the accuracy, confusion matrix, and classification report (with per-class precision, recall, and F1-scores) for each model listed above. In addition, the bar chart below compares the micro F1-scores across the different models. As you can see, normalizing (scaling) the inputs does result in better performance, and the best results are achieved with the Learning Rate Schedule, Weight Initialization, and Dropout after the Convolutional Layers.
That's basically all I had for today. The main benefit of the exercise for me was finding out how to implement these tricks in Pytorch. I hope you find this useful as well.