Salmon Run: Trying out various Deep Learning frameworks

The Deep Learning toolkit I am most familiar with is Keras, having used it to build some models around text classification, question answering and image similarity/classification in the past, as well as the examples for our book Deep Learning with Keras that I co-authored with Antonio Gulli. Before that, I have worked with Caffe to evaluate its pre-trained image classification models and to use one of them as a feature extractor for one of my own image classification pipelines. I have also worked with Tensorflow, learning it first from the awesome Udacity course taught by Vincent Vanhoucke of Google, and then using it to replicate the Grammar as a Foreign Language paper using our own data.

Lately, I have been curious about some of the other DL frameworks that are available, and whether it might make sense to explore them as well. So I decided to build a fully connected (FCN) and a convolutional (CNN) model to classify handwritten digits from the MNIST dataset, for each of Keras, Tensorflow, PyTorch, MXNet and Theano. Unlike the MNIST examples that are available for some of these frameworks, I read the data from CSV files and try to follow a similar coding style (the one I use for Keras) across all the different frameworks, so they are easy to compare. Both networks are also quite simple and training them is quick, so it is easy to run. All examples are provided as Jupyter notebooks, so you can just read them like you would one of my more code-heavy blog posts. The code is on my sujitpal/polydlot repository on GitHub.
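To make the CSV-based input step concrete, here is a minimal sketch using only the Python standard library. It assumes each row holds the digit label followed by the pixel intensities (0 to 255); that layout and the function name parse_mnist_csv are illustrative assumptions and may not match the files or code in the notebooks.

```python
import csv
import io

def parse_mnist_csv(lines, num_classes=10):
    """Parse rows of 'label,p0,...,pN' into scaled pixels and one-hot labels.

    Assumed (hypothetical) layout: first column is the digit label,
    remaining columns are pixel intensities in the range 0..255.
    """
    features, labels = [], []
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        label = int(row[0])
        # Scale pixel intensities to the range [0, 1].
        pixels = [int(v) / 255.0 for v in row[1:]]
        # One-hot encode the label for a softmax output layer.
        one_hot = [1.0 if i == label else 0.0 for i in range(num_classes)]
        features.append(pixels)
        labels.append(one_hot)
    return features, labels

# Tiny stand-in for a real file: two "images" of 4 pixels each.
sample = io.StringIO("3,0,255,51,0\n7,255,0,0,102\n")
X, Y = parse_mnist_csv(sample)
```

With a real file, you would pass an open file handle instead of the StringIO object. The resulting lists can then be converted to whatever array or tensor type each framework expects.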

My inspiration for the work was this chart posted on Twitter in May 2016 by Francois Chollet, creator of Keras. The first 3 charts show the top DL frameworks on GitHub ranked by number of forks, number of contributors and number of open issues. The fourth one weights these three features and produces an overall ranking that shows Keras at #3. I don't know the reasoning for the weights chosen in the fourth chart, although the rankings do line up with my own experience, and I would intuitively place similar importance on these three features as well. However, more importantly, even though it's somewhat dated, the chart gives an idea of the DL frameworks people are looking at, as well as a rough indication of their popularity.

In this post, I explain why I chose the DL frameworks that I did and share what I learned about each of these frameworks from the exercise. For those of you who know a subset of these frameworks, hopefully this will give you a glimpse of what it is like to work in the others. To those who are just starting out, I hope this comparison gives you some idea of where to start.

I chose Keras because I am comfortable with it. The very first DL framework I learned was Tensorflow. Soon after, I came across Keras when trying to read some Lasagne code (another library to build networks in Theano). While it didn't help with the Lasagne work, I got very excited about Keras, and set about building Keras implementations of the Tensorflow models I had built so far, and really got to appreciate how its object-oriented API made it easy to build useful models with very few lines of code. So anyway, I did the Keras examples mainly to figure out a base configuration and how many epochs to train each network to get reasonable results.

For those of you who are reading this to decide whether to learn Keras - learning Keras has one other advantage. In addition to the two backends (Theano and Tensorflow) it already supports, the Microsoft Cognitive Toolkit (CNTK) project and the MXNet project (supported by Amazon) are also considering Keras APIs. So once these APIs are in place, knowing Keras automatically gives you the skills to work with these frameworks as well.

My next candidate was Tensorflow. While not as fluent with Tensorflow as with Keras, I have written code using it in the past. I haven't kept up with the high level libraries that are tightly integrated with Tensorflow such as skflow and tensorflow-slim, since they looked like they were still evolving when I saw them.

Tensorflow (like Theano) programs require you to define your sequence of operations (i.e., the computation graph), "compile" it, and then run it with your variables. During the definition, the operands in the computation graph are represented using container objects called Tensors. At run-time, you pass in actual values to these container objects from your application. This is done mainly for performance: the framework can optimize the graph when it knows the sequence of operations up-front, and it is easier to distribute computations across different machines in a distributed environment. The process is called "Define and Run". Tensorflow is also a fairly low level library. Its abstraction is at the operation level, whereas Keras works at the layer level. Consequently, Tensorflow code tends to be more verbose than comparable Keras code, and it often helps to modularize Tensorflow code for readability.
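The "Define and Run" idea can be illustrated with a toy graph in plain Python. This is only a sketch of the concept and not the Tensorflow or Theano API; the names Node, Placeholder, Op and run are made up for the illustration.

```python
class Node:
    """A symbolic operand or operation in a toy computation graph."""
    def __add__(self, other):
        return Op(lambda a, b: a + b, self, other)

    def __mul__(self, other):
        return Op(lambda a, b: a * b, self, other)

class Placeholder(Node):
    """A named container that has no value until run time."""
    def __init__(self, name):
        self.name = name

class Op(Node):
    """An operation node that remembers its function and its inputs."""
    def __init__(self, fn, *inputs):
        self.fn = fn
        self.inputs = inputs

def run(node, feed):
    """Evaluate the graph rooted at node, reading placeholder values from feed."""
    if isinstance(node, Placeholder):
        return feed[node.name]
    return node.fn(*(run(i, feed) for i in node.inputs))

# Define phase: nothing is computed here, only the graph is built.
x = Placeholder("x")
w = Placeholder("w")
b = Placeholder("b")
y = x * w + b

# Run phase: the same graph is evaluated with different inputs.
first = run(y, {"x": 2.0, "w": 3.0, "b": 1.0})
second = run(y, {"x": 4.0, "w": 0.5, "b": -1.0})
```

Because the whole expression exists as a data structure before any values arrive, a real framework has the opportunity to optimize or distribute it. The toy version above skips all of that and simply walks the graph.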

Keras, like the good high-level library that it is, tries to hide the separation implied by the "Define and Run" approach. However, there are times when it becomes necessary to extend Keras to do things it wasn't designed to do. Keras offers a backend API that exposes operations on the backend, with which you can build some extensions, such as a new loss function or a new layer, and still remain within Keras. More complex extensions, such as adding an attention mechanism, can require setups where Keras and Theano or Tensorflow code must co-exist in the same code base, and figuring out how to make them interoperate can be a challenge. For this reason, I was quite excited to learn from Francois's talk on Integrating Keras and Tensorflow at Tensorflow Dev Summit 2017 that Keras will become the official API for Tensorflow starting with version 1.2. This will allow cleaner interoperability between Tensorflow and the Keras API, while at the same time allowing you to write code that is less verbose than pure Tensorflow and more flexible than pure Keras.

For completeness, I also looked at Theano. Theano seems to be even more low level than Tensorflow, and lacks many of the convenience functions that Tensorflow provides. However, its computation graph definition is simpler and more intuitive (at least to me) compared to Tensorflow - you define Variables and functions, which you then populate with values from your application and run the function. I didn't do too much here as I don't expect to do too much work with Theano at this time.

One other framework I looked at was MXNet. Recently I attended a webinar organized by Amazon Web Services where they demonstrated the distributed training capabilities of MXNet on an AWS cluster, which I thought was quite cool, and which prompted me to look at MXNet further. Unlike Keras, MXNet is built on a C/C++ shared library and exposes a Python API. It also exposes APIs in various other languages, including Scala and R. In that respect, it is similar to Caffe. The Python API is similar to Keras, at least in its level of abstraction, although there are some undocumented features that are set up by convention. I think this may be a good fit for shops that prefer Scala over Python, although Python seems to be quite ubiquitous in the DL space.

Finally I looked at PyTorch, initially at the advice of a friend who works for Salesforce Research. PyTorch is the Python version of Torch, a DL framework written in Lua and used at Facebook Research. PyTorch seems to have been adopted as the standard at Salesforce Research. The abstraction and code look similar to Keras, but there is one important difference.

Unlike "Define and Run" frameworks such as Theano and Tensorflow (and by extension Keras), PyTorch (and Torch) is "Define by Run". The graph is built as the code that defines it executes, so there is no separate step to define the network and then run it. Because of that, the code is also more readable, and resembles Keras as well. This allows you to do certain things that cannot be done with "Define and Run" frameworks, especially with certain use cases in NLP. Like MXNet and Caffe, PyTorch is backed by a C/C++ shared library, and the Python and Lua front ends both use the same shared library. So in the long run, PyTorch seems to be worth learning as well.
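A toy sketch in plain Python may help show what "Define by Run" buys you. This is not the PyTorch API; the Value class and forward function are hypothetical names for the illustration. The point is that ordinary Python control flow decides which operations get recorded, so the graph can differ from one input to the next.

```python
class Value:
    """A number that records the operations applied to it as they execute."""
    def __init__(self, data, tape):
        self.data = data
        self.tape = tape  # shared list acting as the recorded graph

    def __mul__(self, factor):
        out = Value(self.data * factor, self.tape)
        self.tape.append(("mul", factor))
        return out

    def __add__(self, term):
        out = Value(self.data + term, self.tape)
        self.tape.append(("add", term))
        return out

def forward(x_data):
    """Run the computation; the recorded graph depends on the input value."""
    tape = []
    h = Value(x_data, tape)
    while h.data < 10:  # number of iterations depends on the data itself
        h = h * 2
    h = h + 1
    return h.data, tape

out_a, tape_a = forward(3)  # doubles twice, then adds 1
out_b, tape_b = forward(9)  # doubles once, then adds 1
```

Here the two calls record graphs of different lengths. Inputs whose structure varies from example to example, such as sentences of different lengths, are plausibly the kind of case where this flexibility matters, although that reading is an assumption rather than something spelled out above.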

Overall, I think the two advantages that this work has given me are an appreciation of how different DL frameworks work, and the ability to decide the next steps in my learning. Another benefit has been polyglot-ism, after which the project is named. Just as knowing the language of a country enables you to appreciate its culture better, knowing another DL framework allows you to understand the examples provided by each of these frameworks, some of which are quite interesting. It also allows you to read code written by others using these frameworks.

Well, that's all I have for today, hope you enjoyed it. I have tried to share what I learned from this brief exercise in comparing how to build fully connected and convolutional networks to classify MNIST digits. I found that reading the data from CSV files is more representative of real world situations and forces you to think about the input, something you wouldn't normally do if the data came from some built-in function. Also, while almost every DL framework comes with its own MNIST examples, their coding styles are very different and it is hard to compare implementations across frameworks. So I feel that the work I did might be helpful to you as well.