Some time back I wrote a post about Tricks to improve performance of CIFAR-10 classifier, based on things I learned from New York University's Deep Learning with Pytorch course taught by Yann Le Cun and Alfredo Canziani. The tricks I covered were conveniently located on a single slide in one of the lectures. Shortly thereafter, I learned of a few more tricks that were mentioned in passing, so I figured it might be interesting to try these out as well to see how well they worked. This is the subject of this blog post.
As before, the tricks themselves are not radically new or anything; my interest in implementing them was driven as much by wanting to learn how to do it in Pytorch as by curiosity about their effectiveness on the classification task. The task is relatively simple -- the CIFAR-10 dataset contains 60,000 (50,000 training and 10,000 test) low resolution 32x32 RGB images, and the task is to classify them into one of 10 distinct classes. The network we use is adapted from the CNN described in the Tensorflow CNN tutorial.
We start with a baseline network that is identical to that described in the Tensorflow CNN tutorial. We train the network using the training set and evaluate the trained network using classification accuracy (micro-F1 score) on the test set. All models were trained for 10 epochs using the Adam optimizer. Here are the different scenarios I tried.
- Baseline -- This is a CNN with three blocks of convolution and max-pooling, followed by a two layer classification head. It uses the Cross Entropy loss function and the Adam optimizer with a fixed learning rate of 1e-3. The input has 3 channels (RGB images), and the convolution layers produce 32, 64, and 64 channels respectively. The resulting tensor is then flattened and passed through two linear layers to predict softmax probabilities for each of the 10 classes (a Pytorch sketch of this baseline, along with the wider and deeper variants below, appears after this list). The number of trainable parameters in this network is 122,570 and it achieves an accuracy score of 0.705.
- Wider Network -- The size of the penultimate layer in the feedforward (dense) part of the network was widened from 64 to 512, increasing the number of trainable parameters to 586,250 and yielding a score of 0.742.
- Deeper Network -- Similar to the previous approach, the dense part of the network was deepened from a single layer of size 64 to two layers of sizes 512 and 256. As with the previous approach, this increased the number of trainable parameters, this time to 715,018, and yielded a score of 0.732.
- Batch Normalization (before ReLU) -- This trick adds a Batch Normalization layer after each convolution layer. There is some debate about whether to put the BatchNorm before or after the ReLU activation, so I tried both ways (both orderings are sketched after this list). In this configuration, the BatchNorm layer is placed before the ReLU activation, i.e., each convolution block looks like (Conv2d → BatchNorm2d → ReLU → MaxPool2d). The BatchNorm layer functions as a regularizer and increases the number of trainable parameters slightly to 122,890, giving a score of 0.752. Between the two setups (this and the one below), this seems to be the better one to use based on my results.
- Batch Normalization (after ReLU) -- This setup is identical to the previous one, except that the BatchNorm layer is placed after the ReLU, i.e., each convolution block now looks like (Conv2d → ReLU → BatchNorm2d → MaxPool2d). This configuration gives a score of 0.745, slightly lower than the previous setup.
- Residual Connection -- This approach replaces each convolution block (Conv2d → ReLU → MaxPool2d) with a basic ResNet block composed of two convolution layers and a shortcut residual connection, followed by ReLU and MaxPool (a sketch of such a block appears after this list). This increases the number of trainable parameters to 212,714, a much more modest increase than the Wider and Deeper Network approaches, but it gives the largest score boost of all the approaches tried, reaching 0.810.
- Gradient Clipping -- Gradient Clipping is more often used with recurrent networks, but it serves a purpose similar to BatchNorm in that it keeps the gradients from exploding. It is applied as an adjustment inside the training loop (see the training loop sketch after this list) and does not create any new trainable parameters. It gave a more modest gain, with a score of 0.728.
- Increase Batch Size -- Increasing the batch size from 64 to 128 did not result in a significant change in score; it went up from 0.705 to 0.707.
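
To make the baseline concrete, here is a minimal Pytorch sketch of the network described above. The channel and layer sizes (32/64/64 convolution channels, a 64-unit dense layer, 10 outputs) come from the description; the 3x3 kernels and "same" padding are my assumptions, chosen so that the flattened size and parameter count work out to the 122,570 quoted above. The comments also note how the Wider and Deeper Network variants change the dense head.

```python
import torch
import torch.nn as nn

class BaselineCNN(nn.Module):
    """Baseline: three (Conv2d -> ReLU -> MaxPool2d) blocks plus a two layer dense head."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1),   # 3x32x32  -> 32x32x32
            nn.ReLU(),
            nn.MaxPool2d(2),                              #          -> 32x16x16
            nn.Conv2d(32, 64, kernel_size=3, padding=1),  #          -> 64x16x16
            nn.ReLU(),
            nn.MaxPool2d(2),                              #          -> 64x8x8
            nn.Conv2d(64, 64, kernel_size=3, padding=1),  #          -> 64x8x8
            nn.ReLU(),
            nn.MaxPool2d(2),                              #          -> 64x4x4
        )
        self.head = nn.Sequential(
            nn.Flatten(),                   # -> 1024
            nn.Linear(64 * 4 * 4, 64),      # Wider: Linear(1024, 512) -> Linear(512, 10)
            nn.ReLU(),                      # Deeper: Linear(1024, 512) -> Linear(512, 256) -> Linear(256, 10)
            nn.Linear(64, num_classes),     # logits; CrossEntropyLoss applies the softmax internally
        )

    def forward(self, x):
        return self.head(self.features(x))

model = BaselineCNN()
print(sum(p.numel() for p in model.parameters()))  # 122570, matching the count quoted above
```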
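
The two BatchNorm variants differ only in where the BatchNorm2d layer sits inside each convolution block. Using the first block's channel sizes as an example, the two orderings look like this:

```python
import torch.nn as nn

# BatchNorm before ReLU: Conv2d -> BatchNorm2d -> ReLU -> MaxPool2d
block_bn_before_relu = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),
    nn.ReLU(),
    nn.MaxPool2d(2),
)

# BatchNorm after ReLU: Conv2d -> ReLU -> BatchNorm2d -> MaxPool2d
block_bn_after_relu = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.BatchNorm2d(32),
    nn.MaxPool2d(2),
)
```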
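
For the Residual Connection approach, one plausible way to write the basic block described above is sketched below. The two 3x3 convolutions, the shortcut, and the trailing ReLU and MaxPool follow the description; using a 1x1 convolution on the shortcut when the channel count changes is an assumption on my part.

```python
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """Two 3x3 convolutions with a shortcut connection, followed by ReLU and MaxPool."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)
        # project the input to the right number of channels when they differ (assumption)
        self.shortcut = (nn.Conv2d(in_channels, out_channels, kernel_size=1)
                         if in_channels != out_channels else nn.Identity())
        self.relu = nn.ReLU()
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        out = self.relu(out + self.shortcut(x))  # the residual (shortcut) connection
        return self.pool(out)
```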
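
Gradient clipping slots into the training loop between the backward pass and the optimizer step. The sketch below uses the training setup described above (Adam at 1e-3, Cross Entropy loss, 10 epochs, batch size 64) and clips by gradient norm; whether the notebook clips by norm or by value, and the threshold it uses, are not stated here, so treat max_norm=1.0 as illustrative.

```python
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T

# CIFAR-10 training data, batch size 64 as in the baseline setup
train_ds = torchvision.datasets.CIFAR10(root="./data", train=True, download=True,
                                        transform=T.ToTensor())
train_loader = torch.utils.data.DataLoader(train_ds, batch_size=64, shuffle=True)

model = BaselineCNN()    # from the baseline sketch above
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

model.train()
for epoch in range(10):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        # clip gradients just before the optimizer step
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
```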
The code for these experiments is available in the notebook at the link below. It was run on Colab (Google Colaboratory) on a (free) GPU instance. You can rerun the code yourself on Colab using the Open in Colab button at the top of the notebook.
The results of the evaluation for each of the different tricks are summarized in the bar chart and table below. All the tricks outperformed the baseline, but the best performer was the one using residual connections, which outperformed the baseline by around 10 percentage points. Another notable performer was BatchNorm, where putting it before the ReLU activation worked better than putting it after. Making the dense head wider or deeper also increased performance.
One other thing I looked at was parameter efficiency. Widening and deepening the dense head caused the largest increases in the number of trainable parameters, but did not lead to a corresponding increase in performance. On the other hand, adding BatchNorm gave a performance boost with only a small increase in the number of parameters. The residual connection approach increased the number of parameters somewhat, but gave a much larger boost in performance.
And that's all I had for today. It was fun to leverage the dynamic nature of Pytorch to build relatively complex models without too many more lines of code. I hope you found it useful.
Edit 2021-03-28: I had a bug in my notebook where I was creating an additional layer in the FCN head that I didn't intend to have, so I fixed that and re-ran the experiments, which gave different absolute numbers but largely the same rankings. The updated notebook is available on Github via the provided link, and the numbers in this post have been updated accordingly.