Salmon Run: March 2019

I last watched the videos for Stanford's CS231n: Convolutional Neural Networks for Visual Recognition almost two years ago, along with a lot of other videos, trying to scramble up the Deep Learning learning curve before I got too far behind. Recently, however, I was looking for some specific information around object detection, so I rewatched the Lecture 11: Detection and Segmentation video on YouTube. This lecture was from their 2017 class, and was taught by Justin Johnson. The lecture covers various popular approaches to object detection and segmentation, and can be summarized by the figure (taken from Darian Frajberg's presentation on Introduction to the Artificial Intelligence and Computer Vision Revolution at Politecnico di Milano) below.

What struck me again in the lecture was the incremental nature of the approaches, with each approach building on the one before it. But underneath it all, it's the same convolutional and pooling layers with classification and regression heads, just reconfigured in different ways. That is what I will write about in this post today, and I hope by the end of the post, you will agree with me as to how super cool this is. Of necessity, I am going to describe the different networks at a somewhat high level, so if you are looking for more information, I would advise you to watch the video yourself. Alternatively, Leonardo Araujo dos Santos has an excellent tutorial on Object Localization and Detection that also goes into more detail than I am going to. But on to the post.

The first computer vision task is Image Classification. Here you train a network with images which can belong to one of a fixed number of classes, and use the trained model to predict the class. The Convolutional Neural Network (CNN) is the tool of choice when it comes to image classification. CNNs have consistently blown away classification benchmarks against the ImageNet dataset starting with AlexNet in 2012. Over the years, networks have grown deeper and some novel extensions to the basic CNN architecture have been developed, and today, image classification against the ImageNet dataset is considered a mostly solved problem. CNN models pre-trained against ImageNet are available for many frameworks, and people often use them to build their own classifiers using Transfer Learning.

Semantic Segmentation

The goal of semantic segmentation is to label each pixel of the image with its corresponding class label. Naively, this could be done by sliding a window across and down the input image, taking a small crop at each position, and predicting the class of the central pixel of each crop. This is expensive, since an image H pixels high and W pixels wide requires H x W classifier passes for a single image.
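To make the cost concrete, here is a minimal sketch of the naive approach in plain Python. The function name `sliding_window_crops`, the zero padding at the border, and the toy 3 x 4 image are my own assumptions for illustration; the point is only that there is one crop, and hence one classifier call, per pixel.

```python
def sliding_window_crops(image, crop_size):
    """Yield (row, col, crop) for every pixel; crop_size is assumed to be odd.

    Positions that fall outside the image are filled with zeros (an assumption).
    """
    height, width = len(image), len(image[0])
    half = crop_size // 2
    for row in range(height):
        for col in range(width):
            crop = [
                [
                    image[r][c] if 0 <= r < height and 0 <= c < width else 0
                    for c in range(col - half, col + half + 1)
                ]
                for r in range(row - half, row + half + 1)
            ]
            yield row, col, crop


# Toy 3 x 4 single-channel "image" whose pixel values are just their index.
image = [[r * 4 + c for c in range(4)] for r in range(3)]
crops = list(sliding_window_crops(image, crop_size=3))
num_crops = len(crops)  # H x W crops, each needing its own classifier pass
```

For a 224 x 224 image the same loop would produce 50,176 crops, which is why this approach is not practical.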

A less expensive approach might be to run your image through a CNN which will increase the number of channels without changing the width and height. At the end, each pixel is represented by a vector of size equal to the number of target classes. Each of these vectors is then used to predict the class of the pixel. While better than the previous approach, this approach is still very expensive, since every convolution runs at the full image resolution, and it is not used in practice.
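The last step, going from a per-pixel vector to a per-pixel class, is just an argmax over the channel dimension. Below is a small sketch under the assumption that the network output is laid out as `scores[class][row][col]`; the helper name `pixelwise_argmax` and the toy scores are hypothetical.

```python
def pixelwise_argmax(scores):
    """Convert scores[c][y][x] into labels[y][x], the best-scoring class per pixel."""
    num_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    return [
        [max(range(num_classes), key=lambda c: scores[c][y][x]) for x in range(width)]
        for y in range(height)
    ]


# Toy output for a 2 x 2 image and 3 classes (values are made up).
scores = [
    [[0.90, 0.20], [0.10, 0.30]],  # class 0
    [[0.05, 0.70], [0.20, 0.30]],  # class 1
    [[0.05, 0.10], [0.70, 0.40]],  # class 2
]
labels = pixelwise_argmax(scores)  # the segmentation map, one class id per pixel
```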

A third way is to use a CNN encoder-decoder pair, where the encoder will decrease the size of the image but increase its depth using Convolution and Pooling operations, and the decoder will use Transposed Convolutions to increase its size and decrease its depth. The input to this network is the image, and the output is the segmentation map. The U-Net, originally developed for biomedical image segmentation, contains additional skip-connections between corresponding layers of the encoder and decoder.
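The shrinking and growing of the image can be followed with the standard output-size formulas for convolution, pooling and transposed convolution. The sketch below is only shape arithmetic, not a network; the three-stage layout, the 224 pixel input, and the kernel and stride values are assumptions chosen for illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output size of a convolution or pooling layer along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def transposed_conv_out(size, kernel, stride=1, padding=0):
    """Output size of a transposed convolution along one dimension."""
    return (size - 1) * stride - 2 * padding + kernel


size = 224
encoder_sizes = [size]
for _ in range(3):
    size = conv_out(size, kernel=3, stride=1, padding=1)  # 3x3 conv keeps the size
    size = conv_out(size, kernel=2, stride=2)  # 2x2 pooling halves it
    encoder_sizes.append(size)

decoder_sizes = [size]
for _ in range(3):
    size = transposed_conv_out(size, kernel=2, stride=2)  # doubles the size
    decoder_sizes.append(size)
```

With these settings the encoder goes 224, 112, 56, 28 and the decoder mirrors it back to 224, so the segmentation map ends up the same size as the input image.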

Classification + Localization

In the Classification + Localization problem, not only do you have to report the class of the object in the image, you also have to report its location. The assumption in a localization problem is that you have a single object in the image.

The only addition to a standard CNN pipeline is that this network has two heads: a classification head, and a regression head that reports the bounding box coordinates (x, y, w, h). The convolutional part of the network outputs a vector for the input image, which is fed to a pair of dense networks. The classification head uses some sort of categorical loss function and the regression head uses a continuous loss function, and the two losses are combined into a single training loss using an additional scalar hyperparameter as the weight. You can train the two heads separately and fine-tune them jointly. The style is named after where you place the regression head: if it is placed before the fully connected layers, it is called the Overfeat or VGG style, and if placed after, it is called the R-CNN style.
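Here is a minimal sketch of how the two losses might be combined, assuming softmax cross-entropy for the classification head and a squared (L2) error for the regression head. The lecture does not prescribe these exact choices, and the names `multitask_loss` and `box_weight` are hypothetical.

```python
import math


def cross_entropy(logits, target_class):
    """Softmax cross-entropy for one example, computed in a numerically stable way."""
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[target_class]


def l2_loss(predicted_box, target_box):
    """Sum of squared differences between two (x, y, w, h) boxes."""
    return sum((p - t) ** 2 for p, t in zip(predicted_box, target_box))


def multitask_loss(logits, target_class, predicted_box, target_box, box_weight=1.0):
    """Classification loss plus box_weight times the regression loss."""
    return cross_entropy(logits, target_class) + box_weight * l2_loss(
        predicted_box, target_box
    )


loss = multitask_loss(
    logits=[0.0, 0.0],
    target_class=0,
    predicted_box=(0.0, 0.0, 1.0, 1.0),
    target_box=(0.0, 0.0, 1.0, 2.0),
    box_weight=0.5,
)
```

The weight matters because the two losses live on different scales, so it usually has to be tuned like any other hyperparameter.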

Object Detection

The Object Detection task is the same as the Classification + Localization problem, except we now have multiple objects in the image, for each of which we have to find the class and bounding box. We don't know the number or the sizes of the objects in the image. This is a hard and almost intractable problem if we have to compute random crops.

The first solution is to use an external tool from classical computer vision, such as Selective Search, to compute "blobby" areas of the image where objects might be found. These areas are called Region Proposals. Proposals are resized and used to fine-tune an image classification network, which is then used to vectorize the proposals. Multiple binary SVMs (one per class) were used to classify each proposal as object or background, and a linear regression model was used to correct the bounding boxes proposed in the region proposals. This was the Region-based CNN, or R-CNN.
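The bounding box correction step can be sketched as follows. I am assuming the parameterization I recall from the R-CNN paper, where boxes are (center x, center y, width, height), the center offsets are scaled by the proposal size, and the size corrections are applied in log space. Treat the function name `apply_box_deltas` and the sample numbers as illustrative only.

```python
import math


def apply_box_deltas(proposal, deltas):
    """Correct a proposal (cx, cy, w, h) using predicted offsets (dx, dy, dw, dh)."""
    px, py, pw, ph = proposal
    dx, dy, dw, dh = deltas
    return (
        px + pw * dx,  # shift the center by a fraction of the width
        py + ph * dy,  # shift the center by a fraction of the height
        pw * math.exp(dw),  # scale the width
        ph * math.exp(dh),  # scale the height
    )


# A made-up proposal and a made-up output from the regression model.
proposal = (50.0, 50.0, 20.0, 40.0)
deltas = (0.1, -0.05, 0.0, math.log(1.5))
corrected = apply_box_deltas(proposal, deltas)
```

Here the box center moves from (50, 50) to roughly (52, 48), the width stays at 20, and the height grows from 40 to about 60.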

The first improvement, called the Fast R-CNN, still gets the region proposals from an external tool, but projects these proposals onto the output of the CNN. An ROI Pooling layer resizes all the regions to a fixed size and passes the vector for each region to a classification head and a regression head, which predict the class and bounding box coordinates respectively.
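A rough sketch of ROI pooling on a single-channel feature map is shown below. It assumes the proposal is given as (x0, y0, x1, y1) in image pixels, that the CNN downsamples by a fixed `stride` (16 here is just an assumption), and that each output cell takes the max over its bin. Real implementations differ in how they round the bin edges.

```python
def ceil_div(numerator, denominator):
    """Integer ceiling division for non-negative numerators."""
    return -(-numerator // denominator)


def roi_max_pool(feature_map, roi, output_size, stride):
    """Max-pool the part of feature_map covered by roi into output_size x output_size."""
    # Project the proposal from image pixels onto the coarser feature map.
    x0, y0, x1, y1 = (int(v) // stride for v in roi)
    roi_w = max(x1 - x0 + 1, 1)
    roi_h = max(y1 - y0 + 1, 1)
    pooled = []
    for i in range(output_size):
        top = y0 + (i * roi_h) // output_size
        bottom = y0 + ceil_div((i + 1) * roi_h, output_size)
        pooled_row = []
        for j in range(output_size):
            left = x0 + (j * roi_w) // output_size
            right = x0 + ceil_div((j + 1) * roi_w, output_size)
            pooled_row.append(
                max(
                    feature_map[y][x]
                    for y in range(top, bottom)
                    for x in range(left, right)
                )
            )
        pooled.append(pooled_row)
    return pooled


# Toy 4 x 4 feature map for a 64 x 64 image, so the stride is 16.
feature_map = [[r * 4 + c for c in range(4)] for r in range(4)]
whole_image = roi_max_pool(feature_map, (0, 0, 63, 63), output_size=2, stride=16)
smaller_roi = roi_max_pool(feature_map, (16, 16, 63, 47), output_size=2, stride=16)
```

Both calls return a 2 x 2 grid even though the two regions have different sizes and shapes, which is what lets the fully connected heads downstream work with a fixed input size.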

The next improvement, called the Faster R-CNN, gets rid of the external Region Proposal mechanism and replaces it with a Region Proposal Network (RPN) between the Deep CNN and the ROI Pooling layer. The output of the RPN and the output of the CNN are fed into the ROI Pooling layer, whose output is fed into the fully connected part of the CNN with a classification and regression head to predict the class and bounding box for each proposal respectively.

The speedup of a Fast R-CNN over the R-CNN is about 25x, and the speedup of the Faster R-CNN over a Fast R-CNN is about 10x, thus a Faster R-CNN is about 250x faster than an R-CNN.

A slightly different approach to the Object Detection task was taken by the class of networks called single-shot detectors, which includes SSD (Single Shot MultiBox Detector) and the YOLO (You Only Look Once) network. The YOLO network breaks up the image into a 7x7 grid, then applies a set of B pre-determined base boxes with different aspect ratios to each grid cell. For each base box, it predicts the coordinates (x, y, w, h) of the object bounding box and a confidence score, which makes 5 numbers per box. For each grid cell, it also predicts the class probabilities for each of C classes. Thus the image can be represented by a tensor of size 7 x 7 x (5B + C). The YOLO network is a CNN which converts the image to this tensor.
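The size of the YOLO output is easy to work out in code. The sketch below assumes B = 2 boxes per cell and C = 20 classes, which I believe are the values used in the original YOLO paper for PASCAL VOC. The layout inside each cell (all boxes first, then the class scores) and the helper names are my own assumptions.

```python
def yolo_output_shape(grid_size=7, boxes_per_cell=2, num_classes=20):
    """Shape of the output tensor: grid x grid x (5B + C)."""
    return (grid_size, grid_size, 5 * boxes_per_cell + num_classes)


def split_cell(cell_vector, boxes_per_cell, num_classes):
    """Split one grid cell's vector into B (x, y, w, h, confidence) tuples and C scores."""
    boxes = [tuple(cell_vector[5 * b : 5 * b + 5]) for b in range(boxes_per_cell)]
    class_scores = list(cell_vector[5 * boxes_per_cell :])
    assert len(class_scores) == num_classes
    return boxes, class_scores


shape = yolo_output_shape()
total_outputs = shape[0] * shape[1] * shape[2]

# A dummy cell vector whose values are just their positions, to show the layout.
boxes, class_scores = split_cell(list(range(30)), boxes_per_cell=2, num_classes=20)
```

With these values each cell holds 30 numbers, and the whole image is described by 7 x 7 x 30 = 1470 numbers produced in a single forward pass.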

Instance Segmentation

Instance Segmentation tasks are similar to Semantic Segmentation, but the difference is that Instance Segmentation distinguishes between individual objects of the same class, and it does not aim to label every pixel in the image. In some ways Instance Segmentation could also be considered similar to Object Detection, but instead of a bounding box, we want to find a binary mask that contains each object. The network that is used for this task is the Mask R-CNN.

Mask R-CNN adds a mask-predicting branch (another small CNN) to a Faster R-CNN, alongside its classification and regression heads. This branch is applied to every region proposed by the RPN in the Faster R-CNN, and it predicts a binary mask for the object inside each proposal.
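The last step of turning the mask branch output into a mask over the full image can be sketched as below. I am assuming the branch outputs per-pixel probabilities that have already been resized to the size of the box, and that a threshold of 0.5 is used. The helper name `paste_mask` and the toy numbers are hypothetical.

```python
def paste_mask(mask_probs, box_origin, image_size, threshold=0.5):
    """Threshold box-sized mask probabilities and place them in a full-image mask."""
    x0, y0 = box_origin
    width, height = image_size
    full_mask = [[0] * width for _ in range(height)]
    for dy, row in enumerate(mask_probs):
        for dx, prob in enumerate(row):
            y, x = y0 + dy, x0 + dx
            # Ignore anything that falls outside the image.
            if 0 <= y < height and 0 <= x < width and prob >= threshold:
                full_mask[y][x] = 1
    return full_mask


# Made-up 2 x 2 mask probabilities for a box whose top-left corner is at (1, 1).
mask_probs = [[0.9, 0.2], [0.6, 0.4]]
full_mask = paste_mask(mask_probs, box_origin=(1, 1), image_size=(4, 3))
```

Each detected object gets its own mask like this, which is how two objects of the same class end up as separate instances.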

This concludes my post on Object Detection and Segmentation networks. As you can see, while the networks have grown progressively more complex as the tasks got harder, they are composed of (mostly) the same basic building blocks. One thing to note is the gradual move towards networks that learn end-to-end, by moving as much of the learning as possible from outside to inside the network.

I hope you enjoyed my high level overview of the composability of CNNs for different computer vision tasks. I have deliberately tried to keep the descriptions at a fairly high level. For more depth, consider watching the lecture on which this post is based, or consult one of the many tutorials on the Internet that go into each network in greater detail, including the tutorial I mentioned earlier.