Saturday, May 20, 2017

Evaluating a Simple but Tough to Beat Embedding via Text Classification



Recently, a colleague and a reader of this blog independently sent me a link to the Simple but Tough-to-Beat Baseline for Sentence Embeddings (PDF) paper by Sanjeev Arora, Yingyu Liang, and Tengyu Ma. My reader also mentioned that the paper was selected for a mini-review in Lecture 2 of the Natural Language Processing and Deep Learning (CS 224N) course taught at Stanford University by Prof Chris Manning and Richard Socher. For those of you who have taken Stanford's earlier Deep Learning and NLP (CS 224d) course taught by Socher, or the very first Coursera course on Natural Language Processing by Profs Dan Jurafsky and Chris Manning, you will find elements from both in here. There are also some things I think are new or that I might have missed earlier.

The paper introduces an unsupervised scheme for generating sentence embeddings that has been shown to consistently outperform a simple Bag of Words (BoW) approach in a number of evaluation scenarios. The evaluation scenarios considered are both intrinsic (correlating computed similarities of sentence embeddings with human estimates of similarity) and extrinsic (using the embeddings for a downstream classification task). I thought the idea was very exciting, since all the techniques I have used to convert word embeddings to sentence embeddings have given results consistent with the complexity used to produce them. At the very low end is the BoW approach, which adds up the embedding vectors for the individual words and averages them over the sentence length. At the other end of the scale, we can generate sentence vectors from a sequence of word vectors by training an LSTM and then using it, or look up sentence vectors using a trained skip-thoughts encoder.

The Smooth Inverse Frequency (SIF) embedding approach suggested by the paper is only slightly more complicated than the BoW approach, and promises consistently better results than BoW. For those of us who used BoW as a baseline, this suggests that we should now use SIF embeddings instead. BoW simply averages the component word vectors v_w over the |s| words of sentence s:

v_s = \frac{1}{|s|} \sum_{w \in s} v_w

For SIF, we generate the sentence vector v_s by multiplying each component vector v_w by a weight that shrinks as its probability of occurrence p(w) grows. Here α is a smoothing constant; its default value as suggested in the paper is 0.001. We then sum these weighted word vectors and divide by the number of words:

v_s = \frac{1}{|s|} \sum_{w \in s} \frac{\alpha}{\alpha + p(w)} v_w

Since we do this for all the sentences in our corpus, we now have a matrix where the number of rows is the number of sentences and the number of columns is the embedding size (typically 300). Removing the first principal component from this matrix gives us our sentence embedding. There is also an implementation of this embedding scheme in the YingyuLiang/SIF GitHub repository.

For my experiment, I decided to compare BoW and SIF vectors by how effective they are when used for text classification. My task is to classify images as compound (i.e., composed of multiple sub-images) versus non-compound (single image, no sub-images) using only the captions. The data comes from the ImageCLEF 2016 (Medical) competition, where Compound Figure Detection is the very first task in the task pipeline. The provided dataset has 21,000 training captions, each about 92 words long on average, and split roughly equally between the two classes. The dataset also contains 3,456 test captions (labels provided for validation purposes).

The labels and captions are provided as two separate files, for both training and test datasets. Here is an example of what the labels file looks like:

11373_2007_9226_Fig1_HTML,COMP
12178_2007_9002_Fig3_HTML,COMP
12178_2007_9003_Fig1_HTML,COMP
12178_2007_9003_Fig3_HTML,COMP
rr188-1,NOCOMP
rr199-1,NOCOMP
rr36-5,NOCOMP
scrt3-1,NOCOMP

and the captions files look like this:

12178_2007_9003_Fig1_HTML       An 64-year-old female with symptoms of bilateral lower limb neurogenic claudication with symptomatic improvement with a caudal epidural steroid injection. An interlaminar approach could have been considered appropriate, as well. ( a ) Sagittal view of a T2-weighted MRI of the lumbar spine. Note the grade I spondylolisthesis of L4 on L5 with severe central canal stenosis. ( b ) and ( c ) Axial views of a T2-weighted MRI through L4 – 5. Note the diffuse disc bulge in ( b ) and the marked ligamentum flavum hypertophy in ( c ), both contributing to the severe central stenosis. ( d ) The L5-S1 level showing no evidence of stenosis
12178_2007_9003_Fig3_HTML       Fluoroscopic images of an L3-4 interlaminar approach. ( a ) AP view, pre-contrast, ( b ) Lateral view, pre-contrast, and ( c ) Lateral view, post-contrast
12178_2007_9003_Fig5_HTML       Fluoroscopic images of a right L5-S1 transforaminal approach targeting the right L5 nerve root. ( a ) AP view, pre-contrast and ( b ) AP view, post-contrast

I built BoW and SIF vectors for the entire dataset, using GloVe word vectors. I then used these vectors as inputs to stock Scikit-Learn Naive Bayes and Support Vector Machine classifiers, and measured the test accuracy for various vocabulary sizes. For the word probabilities, I used both native probabilities (i.e., computed from the combined caption dataset) and outside probabilities (computed from Wikipedia, and available in the YingyuLiang/SIF GitHub repository). I then built vocabularies out of the most common N words, computed BoW sentence embeddings, SIF sentence embeddings with native word frequencies, and SIF sentence embeddings with external probabilities (SIF+EP), and recorded the accuracy reported for the two-class classification task from the Naive Bayes and Support Vector Machine (SVM) classifiers. Below I provide a breakdown of the steps with code.

The first step is to parse the files and generate a list of training and test captions with their labels.

def parse_caption_and_label(caption_file, label_file, sep=" "):
    # map each image filename to its numeric class label
    filename2label = {}
    with open(label_file, "r", encoding="utf-8") as flabel:
        for line in flabel:
            filename, label = line.strip().split(sep)
            filename2label[filename] = LABEL2ID[label]
    # read captions in file order and look up the matching label
    captions, labels = [], []
    with open(caption_file, "r", encoding="utf-8") as fcaption:
        for line in fcaption:
            # split on the first tab only, in case a caption contains tabs
            filename, caption = line.strip().split("\t", 1)
            captions.append(caption)
            labels.append(filename2label[filename])
    return captions, labels

TRAIN_CAPTIONS = "/path/to/training-captions.tsv"
TRAIN_LABELS = "/path/to/training-labels.csv"
TEST_CAPTIONS = "/path/to/test-captions.tsv"
TEST_LABELS = "/path/to/test-labels.csv"
LABEL2ID = {"COMP": 0, "NOCOMP": 1}

captions_train, labels_train = parse_caption_and_label(
    TRAIN_CAPTIONS, TRAIN_LABELS, ",")
captions_test, labels_test = parse_caption_and_label(
    TEST_CAPTIONS, TEST_LABELS, " ")

Next I build the word count matrix using the captions. For this we use the Scikit-Learn CountVectorizer to do the heavy lifting. We have removed stopwords from the counting using the stop_words parameter. At this point Xc is a matrix of word counts of shape (number of training records + number of test records, VOCAB_SIZE). The VOCAB_SIZE is a hyperparameter which we will vary during our experiments.

from sklearn.feature_extraction.text import CountVectorizer

VOCAB_SIZE = 10000
# strip_accents takes the string "unicode", not the Python 2 unicode type
counter = CountVectorizer(strip_accents="unicode",
                          stop_words="english",
                          max_features=VOCAB_SIZE)
caption_texts = captions_train + captions_test
# toarray() returns a plain ndarray, which current Scikit-Learn estimators expect
Xc = counter.fit_transform(caption_texts).toarray().astype("float")

At this point, we can capture the sentence length vector S (see the formulae for v_s) as the sum across the columns of this matrix, i.e., one total per row.

import numpy as np

# keep sent_lens as a column vector so it broadcasts against (rows, 300) matrices
sent_lens = np.asarray(Xc.sum(axis=1)).reshape(-1, 1)
sent_lens[sent_lens == 0] = 1e-14  # prevent divide by zero
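To make the shapes concrete, here is a small sketch on two made-up captions (they are not from the ImageCLEF data). The toy_ names are hypothetical and exist only for this illustration, and the snippet assumes a recent Scikit-Learn.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# two made-up captions, only for illustration
toy_captions = [
    "Sagittal view of the lumbar spine",
    "Axial view, lateral view",
]
toy_counter = CountVectorizer(strip_accents="unicode",
                              stop_words="english",
                              max_features=10)
toy_Xc = toy_counter.fit_transform(toy_captions).toarray().astype("float")

# one row per caption, one column per vocabulary word
print(toy_Xc.shape)
# vocabulary that survived stopword removal
print(sorted(toy_counter.vocabulary_))
# counted (non-stopword, in-vocabulary) tokens per caption, as a column vector
toy_lens = toy_Xc.sum(axis=1, keepdims=True)
print(toy_lens.ravel())
```

Note that the "sentence length" here only counts tokens that survive stopword removal and the vocabulary cut-off, so it is usually smaller than the raw word count of the caption.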

Next we read the pretrained word vectors from the provided GloVe embedding file. We use the version built with Wikipedia 2014 + Gigaword 5 (6B tokens, 400K words and dimensionality 300). The following snippet extracts the vectors for the words in our vocabulary and collects them into the embedding matrix E, one row per vocabulary word.

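GLOVE_EMBEDDINGS = "/path/to/glove.6B.300d.txt"  # placeholder path

# row i of E holds the GloVe vector for the word with vocabulary index i;
# vocabulary words that GloVe does not cover keep an all-zero row
E = np.zeros((VOCAB_SIZE, 300))
with open(GLOVE_EMBEDDINGS, "r", encoding="utf-8") as fglove:
    for line in fglove:
        cols = line.strip().split(" ")
        word = cols[0]
        try:
            i = counter.vocabulary_[word]
            E[i] = np.array([float(x) for x in cols[1:]])
        except KeyError:
            pass  # GloVe word is not in our vocabulary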
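The lookup logic can be tried out without the real GloVe file. The sketch below uses an in-memory stand-in with made-up 3-dimensional vectors and a hypothetical three-word vocabulary; only the "word v1 v2 ..." line layout is taken from GloVe.

```python
import io
import numpy as np

# made-up 3-dimensional vectors in GloVe's "word v1 v2 ..." text layout
fake_glove = io.StringIO(
    "spine 0.1 0.2 0.3\n"
    "view 0.4 0.5 0.6\n"
    "banana 0.7 0.8 0.9\n"
)
# hypothetical vocabulary, standing in for counter.vocabulary_
toy_vocab = {"lumbar": 0, "spine": 1, "view": 2}

toy_E = np.zeros((len(toy_vocab), 3))
for line in fake_glove:
    cols = line.strip().split(" ")
    idx = toy_vocab.get(cols[0])
    if idx is not None:  # skip GloVe words that never occur in our vocabulary
        toy_E[idx] = np.array([float(x) for x in cols[1:]])

# vocabulary words missing from GloVe keep an all-zero row
missing = [w for w, i in toy_vocab.items() if not toy_E[i].any()]
print(missing)
```

Words that end up with an all-zero row still count towards the sentence length, so a caption with many out-of-vocabulary medical terms gets a shrunken embedding. That may or may not matter for this dataset.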

We are now ready to build our BoW vectors. Replacing the term counts with the appropriate vector is just a matrix multiplication, and averaging over the sentence length means an element-wise divide by the S vector. Finally we split our BoW sentence embeddings into training and test splits.

Xb = np.divide(np.dot(Xc, E), sent_lens)

Xtrain, Xtest = Xb[0:len(captions_train)], Xb[-len(captions_test):]
ytrain, ytest = np.array(labels_train), np.array(labels_test)
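As a sanity check on the "counts times embeddings" trick, here is a tiny worked example with made-up numbers (two sentences, a three-word vocabulary, two-dimensional vectors). All names are hypothetical.

```python
import numpy as np

# toy count matrix: 2 sentences over a 3-word vocabulary
toy_Xc = np.array([[2.0, 0.0, 1.0],
                   [0.0, 1.0, 1.0]])
# toy embedding matrix: one 2-dimensional vector per vocabulary word
toy_E = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])

# number of counted words per sentence, kept as a column for broadcasting
toy_lens = toy_Xc.sum(axis=1, keepdims=True)
toy_Xb = np.dot(toy_Xc, toy_E) / toy_lens

# the same thing for sentence 0, written as an explicit average:
# word 0 appears twice and word 2 once, so three words in total
by_hand = (2 * toy_E[0] + 1 * toy_E[2]) / 3
print(toy_Xb[0], by_hand)
```

The matrix product sums each word vector once per occurrence, which is why dividing by the row sum of the counts gives the plain average.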

The regularity of the Scikit-Learn API means that we can build some functions that can be used to cross-validate our classifier during training and evaluate it with the test data.

from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

def cross_val(Xtrain, ytrain, clf):
    best_clf = None
    best_score = 0.0
    num_folds = 0
    cv_scores = []
    kfold = KFold(n_splits=10)
    for train, val in kfold.split(Xtrain):
        Xctrain, Xctest, yctrain, yctest = Xtrain[train], Xtrain[val], ytrain[train], ytrain[val]
        # fit a fresh copy per fold; refitting clf itself would make
        # best_clf always point at the model from the last fold
        fold_clf = clone(clf)
        fold_clf.fit(Xctrain, yctrain)
        score = fold_clf.score(Xctest, yctest)
        if score > best_score:
            best_score = score
            best_clf = fold_clf
        print("fold {:d}, score: {:.3f}".format(num_folds, score))
        cv_scores.append(score)
        num_folds += 1
    return best_clf, cv_scores

def test_eval(Xtest, ytest, clf):
    print("===")
    print("Test set results")
    ytest_ = clf.predict(Xtest)
    accuracy = accuracy_score(ytest, ytest_)
    print("Accuracy: {:.3f}".format(accuracy))

We now invoke these functions to instantiate a Naive Bayes and SVM classifier, train it with 10-fold cross validation on the training split, and evaluate it with the test data to produce, among other things, a test accuracy. The following code shows the call for doing this with a Naive Bayes classifier. The code for doing this with an SVM classifier is similar.

from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()
best_clf, scores_nb = cross_val(Xtrain, ytrain, clf)
test_eval(Xtest, ytest, best_clf)
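For the SVM side, a rough self-contained sketch could look like the following. Several things here are assumptions: the post only says "stock" SVM, so the choice of SVC with default settings is a guess, the data is synthetic rather than real sentence embeddings, and Scikit-Learn's cross_val_score is swapped in for the helper function defined earlier.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# synthetic stand-in for sentence embeddings: two noisy, well separated classes
rng = np.random.RandomState(0)
X_demo = np.vstack([rng.normal(-1.0, 1.0, size=(100, 5)),
                    rng.normal(+1.0, 1.0, size=(100, 5))])
y_demo = np.array([0] * 100 + [1] * 100)

# SVC with default settings is an assumption, not necessarily what the post used
svm_clf = SVC()
# cv=10 uses stratified 10-fold splits for classifiers
scores_svm = cross_val_score(svm_clf, X_demo, y_demo, cv=10)
print("mean CV accuracy: {:.3f}".format(scores_svm.mean()))
```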

The SIF sentence embeddings also start with the count matrix generated by the CountVectorizer. In addition, we need to compute the word probabilities. If we want to use the word probabilities from the dataset, we can do so by summing the count matrix over its rows (giving one total per vocabulary word) and normalizing, as follows:

# compute word probabilities from corpus
freqs = np.sum(Xc, axis=0).astype("float")
probs = freqs / np.sum(freqs)

We could also get these word probabilities from some external source such as a file. So given the probs vector, we can create a vector representing the coefficient for each word. Something like this:

ALPHA = 1e-3
coeff = ALPHA / (ALPHA + probs)
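To get a feel for what this coefficient does, here is the same formula applied to three made-up probabilities (a very common word, a mid-frequency word, and a rare word). The numbers are illustrative only.

```python
import numpy as np

ALPHA = 1e-3
# made-up probabilities: very common, mid-frequency, rare
toy_probs = np.array([5e-2, 1e-3, 1e-5])
toy_coeff = ALPHA / (ALPHA + toy_probs)
print(np.round(toy_coeff, 4))
```

With these numbers the common word keeps roughly 2% of its vector, the word whose probability equals ALPHA keeps exactly half, and the rare word keeps about 99%. So ALPHA effectively sets the frequency at which words start being discounted.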

We can then compute the raw sentence embedding matrix in a manner similar to the BoW matrix.

Xw = np.multiply(Xc, coeff)
Xs = np.divide(np.dot(Xw, E), sent_lens)

In order to remove the first principal component, we first compute it using the TruncatedSVD class from Scikit-Learn, and subtract it from the raw SIF embedding Xs.

from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=1, n_iter=20, random_state=0)
svd.fit(Xs)
pc = svd.components_
Xr = Xs - Xs.dot(pc.T).dot(pc)
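The projection step can be checked on toy data. The sketch below builds random "embeddings" that share one strong common direction, removes the first component, and verifies that nothing is left along it. It uses np.linalg.svd instead of TruncatedSVD to stay dependency-free; both work on the uncentered matrix, but this is a stand-in rather than the exact call used above, and all names are hypothetical.

```python
import numpy as np

rng = np.random.RandomState(0)
# toy "embeddings": small random noise plus a strong shared direction
common = np.array([[1.0, 1.0, 0.0, 0.0]]) / np.sqrt(2.0)
toy_Xs = 0.1 * rng.normal(size=(50, 4)) + 3.0 * common

# first right singular vector of the (uncentered) matrix, via plain SVD
_, _, vt = np.linalg.svd(toy_Xs, full_matrices=False)
toy_pc = vt[:1]  # shape (1, 4), unit length

# subtract each row's projection onto that direction
toy_Xr = toy_Xs - toy_Xs.dot(toy_pc.T).dot(toy_pc)

# after removal, every row is orthogonal to the removed direction
print(np.abs(toy_Xr.dot(toy_pc.T)).max())
```

In this toy case the recovered direction is almost exactly the shared one, which matches the intuition in the paper that the first component mostly captures what all sentences have in common.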

As with the BoW sentence embeddings, we split it back to a training and test set, and train the two classifiers and evaluate them.

The full code used for this post is available in this GitHub gist as a Jupyter notebook. The results of running the three sentence embeddings - BoW, SIF and SIF with External Word Probabilities (SIF+EP) - through the two stock Scikit-Learn classifiers for different vocabulary sizes are shown below.



As you can see, I get conflicting results for the two classifiers. For the Naive Bayes classifier, SIF sentence embeddings with native word probabilities narrowly beat the BoW embeddings, whereas in the case of SVM, the SIF embeddings with external word probabilities are slightly better than the BoW results for some vocabulary sizes. Also, accuracies from the other SIF embedding trail the ones from BoW in both cases. Finally, the differences are really minor - if you look at the y-axis on the charts, you will see that the difference is in the third decimal place. So at least based on my experiment, there does not seem to be a significant benefit to using the SIF embeddings over BoW.

My use case does differ from the ideal case in that my captions can be long (longer than a typical sentence) and/or multi-sentence. I don't believe that should make a difference, but I may be wrong. I have tried to follow the paper recommendations as closely as possible when replicating this experiment, but it is possible I may have made a mistake somewhere - in case you spot it please let me know. The code is included, both on this post as well as in the GitHub gist if you want to verify that it works like I described. As a user of word and sentence embeddings, my primary use case is to use them to encode text input to classifiers. If you have gotten results that indicate SIF sentence embeddings are significantly better than BoW sentence embeddings for this or a similar use case, please let me know.



Saturday, May 13, 2017

Trying out various Deep Learning frameworks


The Deep Learning toolkit I am most familiar with is Keras, having used it to build some models around text classification, question answering and image similarity/classification in the past, as well as the examples for our book Deep Learning with Keras that I co-authored with Antonio Gulli. Before that, I have worked with Caffe to evaluate its pre-trained image classification models and to use one of them as a feature extractor for one of my own image classification pipelines. I have also worked with Tensorflow, learning it first from the awesome Udacity course taught by Vincent Vanhoucke of Google, and then using it to replicate the Grammar as a Foreign Language paper using our own data.

Lately, I have been curious about some of the other DL frameworks that are available, and whether it might make sense to explore them as well. So I decided to build a fully connected (FCN) and a convolutional (CNN) model to classify handwritten digits from the MNIST dataset, for each of Keras, Tensorflow, PyTorch, MXNet and Theano. Unlike the MNIST examples that are available for some of these frameworks, I read the data from CSV files and try to follow a similar coding style (the one I use for Keras) across all the different frameworks, so they are easy to compare. Both networks are also quite simple and training them is quick, so it is easy to run. All examples are provided as Jupyter notebooks, so you can just read them like you would one of my more code-heavy blog posts. The code is on my sujitpal/polydlot repository on GitHub.
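As a rough idea of what reading MNIST from CSV involves, here is a minimal sketch. The file layout (label in the first column, pixel values after it) is an assumption, since the layout of the actual files is not described here, and the fake rows have 4 pixels instead of 784 to keep the example short.

```python
import io
import numpy as np

# two fake rows in an assumed "label,pixel0,pixel1,..." layout (4 pixels, not 784)
fake_csv = io.StringIO("7,0,128,255,0\n3,255,0,0,64\n")

data = np.loadtxt(fake_csv, delimiter=",")
y = data[:, 0].astype("int")   # first column: digit label
X = data[:, 1:] / 255.0        # remaining columns: pixels scaled to [0, 1]
# a CNN wants images, so flat rows get reshaped (28 x 28 for real MNIST)
X_img = X.reshape(-1, 2, 2)
print(y, X.shape, X_img.shape)
```

The flat X would feed the fully connected network, while the reshaped X_img (plus a channel axis, depending on the framework) would feed the convolutional one.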

My inspiration for the work was this chart posted on Twitter in May 2016 by Francois Chollet, creator of Keras. The first 3 charts show the top DL frameworks on GitHub ranked by number of forks, number of contributors and number of open issues. The fourth one weights these three features and produces an overall ranking that shows Keras at #3. I don't know the reasoning for the weights chosen in the fourth chart, although the rankings do line up with my own experience, and I would intuitively place similar importance on these three features as well. However, more importantly, even though it's somewhat dated, the chart gives an idea of the DL frameworks people are looking at, as well as a rough indication of their popularity.



In this post, I explain why I chose the DL frameworks that I did and share what I learned about each of these frameworks from the exercise. For those of you who know a subset of these frameworks, hopefully this will give you a glimpse of what it is like in the other frameworks. To those who are just starting out, I hope this comparison gives you some idea of where to start.

I chose Keras because I am comfortable with it. The very first DL framework I learned was Tensorflow. Soon after, I came across Keras when trying to read some Lasagne code (another library to build networks in Theano). While it didn't help with the Lasagne work, I got very excited about Keras, and set about building Keras implementations of the Tensorflow models I had built so far, and really got to appreciate how its object-oriented API made it easy to build useful models with very few lines of code. So anyway, I did the Keras examples mainly to figure out a base configuration and how many epochs to train each network to get reasonable results.

For those of you who are reading this to decide whether to learn Keras - learning Keras has one other advantage. In addition to the two backends (Theano and Tensorflow) it already supports, the Microsoft Cognitive Toolkit (CNTK) project and the MXNet project (supported by Amazon) are also considering Keras APIs. So once these APIs are in place, knowing Keras automatically gives you the skills to work with these frameworks as well.

My next candidate was Tensorflow. While not as fluent with Tensorflow as with Keras, I have written code using it in the past. I haven't kept up with the high level libraries that are tightly integrated with Tensorflow such as skflow and tensorflow-slim, since they looked like they were still evolving when I saw them.

Tensorflow (like Theano) programs require you to define your sequence of operations (i.e., the computation graph), "compile" it, and then run it with your variables. During the definition, the operands in the computation graph are represented using container objects called Tensors. At run-time, you pass in actual values to these container objects from your application. This is done mainly for performance: the network can optimize itself when it knows the sequence of operations up-front, and it is easier to distribute computations across different machines in a distributed environment. The process is called "Define and Run". Tensorflow is also a fairly low level library; its abstraction is at the operation level, compared to Keras, which is at the layer level. Consequently, Tensorflow code tends to be more verbose than comparable Keras code, and it often helps to modularize Tensorflow code for readability.
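The "Define and Run" idea can be illustrated without any framework. The following is a toy sketch in plain Python, not Tensorflow or Theano code; the Placeholder and Op classes are hypothetical names invented for this illustration.

```python
# A toy "Define and Run" graph in plain Python (no framework involved).

class Placeholder:
    """Named slot whose value is supplied only at run time."""
    def __init__(self, name):
        self.name = name

    def run(self, feed):
        return feed[self.name]


class Op:
    """Graph node that applies a function to the results of its inputs."""
    def __init__(self, fn, *inputs):
        self.fn = fn
        self.inputs = inputs

    def run(self, feed):
        return self.fn(*(node.run(feed) for node in self.inputs))


# 1. define: nothing is computed here, we only wire up the graph y = w * x + b
x = Placeholder("x")
w = Placeholder("w")
b = Placeholder("b")
y = Op(lambda p, q: p + q, Op(lambda p, q: p * q, w, x), b)

# 2. run: the same fixed graph is evaluated with different fed-in values
first = y.run({"x": 2.0, "w": 3.0, "b": 1.0})
second = y.run({"x": -1.0, "w": 3.0, "b": 1.0})
print(first, second)
```

Because the whole graph exists before any value flows through it, a real framework gets the chance to optimize or distribute it, which is the performance argument made above.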

Keras, like the good high-level library that it is, tries to hide the separation implied by the "Define and Run" approach. However, there are times when it becomes necessary to extend Keras to do things it wasn't designed to do. Keras offers a backend API where it exposes operations on the backend, with which you can do some extensions such as a new loss function or new layer, and still remain within Keras. More complex extensions, such as adding an attention mechanism, can require setups where Keras and Theano or Tensorflow code must co-exist in the same code base, and figuring out how to make them interoperate can be a challenge. For this reason, I was quite excited to learn from Francois's talk on Integrating Keras and Tensorflow at Tensorflow Dev Summit 2017, that Keras will become the official API for Tensorflow starting with version 1.2. This will allow cleaner interoperability between Tensorflow and the Keras API, while at the same time allow you to write code that is less verbose than pure Tensorflow and more flexible than pure Keras.

For completeness, I also looked at Theano. Theano seems to be even more low level than Tensorflow, and lacks many of the convenience functions that Tensorflow provides. However, its computation graph definition is simpler and more intuitive (at least to me) compared to Tensorflow - you define Variables and functions, which you then populate with values from your application and run the function. I didn't do too much here as I don't expect to do too much work with Theano at this time.

One other framework I looked at was MXNet. Recently I attended a webinar organized by Amazon Web Services where they demonstrated the distributed training capabilities of MXNet on an AWS cluster, which I thought was quite cool, and which prompted me to look at MXNet further. Unlike Keras, MXNet is built on a C/C++ shared library and exposes a Python API. It also exposes APIs in various other languages, including Scala and R. In that respect, it is similar to Caffe. The Python API is similar to Keras, at least in the level of abstraction, although there are some undocumented features that are set up by convention. I think this may be a good fit for shops that prefer Scala over Python, although Python seems to be quite ubiquitous in the DL space.

Finally I looked at PyTorch, initially on the advice of a friend who works for Salesforce Research. PyTorch is the Python version of Torch, a DL framework written in Lua and used at Facebook Research. PyTorch seems to have been adopted as the standard at Salesforce Research. The abstraction and code look similar to Keras, but there is one important difference.

Unlike "Define and Run" frameworks such as Theano and Tensorflow (and by extension Keras), PyTorch (and Torch) is "Define by Run". So there is no additional code required to define the network and then run it. Because of that, the code is also more readable, and resembles Keras as well. The graph is built as you define it. This allows you to do certain things that cannot be done with "Define and Run" frameworks, especially with certain use cases in NLP. Like MXNet and Caffe, PyTorch is backed by a C/C++ shared library, and the Python and Lua front ends both use the same shared library. So in the long run, PyTorch seems to be worth learning as well.
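To show what "the graph is built as you define it" means in practice, here is a toy tracer in plain Python. It is not PyTorch code; the Value class and the encode function are hypothetical names invented for this illustration.

```python
# A toy "Define by Run" tracer in plain Python (no framework involved).

class Value:
    """Number that computes eagerly and remembers which op produced it."""
    def __init__(self, data, op="input", parents=()):
        self.data = data
        self.op = op
        self.parents = parents

    def __add__(self, other):
        return Value(self.data + other.data, "add", (self, other))

    def __mul__(self, other):
        return Value(self.data * other.data, "mul", (self, other))


def count_ops(node):
    """Count the non-input nodes in the graph that produced `node`."""
    if node.op == "input":
        return 0
    return 1 + sum(count_ops(p) for p in node.parents)


def encode(tokens, weight):
    # ordinary Python loop: the graph grows with the length of the input
    state = Value(0.0)
    for tok in tokens:
        state = state * weight + Value(tok)
    return state


w = Value(0.5)
short = encode([1.0, 2.0], w)
longer = encode([1.0, 2.0, 3.0, 4.0], w)
print(short.data, count_ops(short))
print(longer.data, count_ops(longer))
```

Each call to encode produces a differently sized graph simply because the loop ran a different number of times. This is the kind of flexibility that helps with variable-length inputs such as sentences.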

Overall, I think the two advantages that this work has given me are an appreciation of how different DL frameworks work, and the ability to decide the next steps in my learning. Another advantage has been that of polyglot-ism, after which the project is named. Just like knowing the language of a country enables you to appreciate its culture better, knowing another DL framework allows you to understand the examples provided by each of these frameworks, some of which are quite interesting. It also allows you to read code written by others using these frameworks.

Well, that's all I have for today, hope you enjoyed it. I have tried to share what I learned from this brief exercise in comparing how to build fully connected and convolutional networks to classify MNIST digits. I found that reading the data from CSV files is more representative of real world situations and forces you to think about the input, something you wouldn't normally do if the data came from some built-in function. Also, while almost every DL framework comes with its own MNIST examples, their coding styles are very different and it is hard to compare implementations across frameworks. So I feel that the work I did might be helpful to you as well.


Saturday, May 06, 2017

Deep Learning with Keras published!


Just wanted to let you all know that Deep Learning with Keras, a book I co-authored with Antonio Gulli, was published by PackT on April 26, 2017. For those of you who follow me on social media such as LinkedIn and Twitter, and for family and friends on Facebook, this is old news, but to others I apologize for the delay. Although if you're still reading my blog after all these years, I guess you accept (and forgive, thank you) that delays and apologies are somewhat par for the course here.



The book is targeted at the Data Scientist / Engineer starting out with Neural Networks. It contains a mix of theory and examples, but the focus is on the code, since we believe that the best way to learn something in this field is through looking at examples. All examples are in Keras, our favorite Deep Learning toolkit. By the time you are finished with the book, you should be comfortable building your own networks in Keras.

This book is also available on Amazon. If you end up reading it, do leave us a review and tell us what you liked and how we could have done better.

Yesterday, Antonio posted an image that showed our book at #5 on Amazon. We thought initially that it was ranked by sales and were thrilled that people liked our book so much, until someone pointed out that the ranking is most likely by query relevance. Oh, well! Good feeling while it lasted though.



Today, I thought it might be interesting to share the story behind the book, and thank the people who made it possible. For those of you looking for technical content, fair warning - this post has none.

While I read a lot of books, I have never considered writing one. Like many other people in software engineering, I have switched fields multiple times, and books have been the way to gain (almost) instant expertise to help make the transition. But the authors I read were all quite accomplished, almost experts in their fields. I was neither, just a programmer who caught (and took advantage of) a few lucky breaks in his career, so end of story.

When Antonio asked me if I was interested in co-authoring a book on Deep Learning using Keras with him, I was undecided for a while. I felt that if I accepted, I was implicitly claiming expertise in subjects in which I wasn't an expert. On the flip side, I had been working with Deep Learning models with Caffe, Tensorflow and Keras for a while, so while I was definitely not an expert, I did have knowledge that could benefit people who were not as far in their journey as I was. That last bit convinced me that I did have some value to add to a book, so I accepted.

Once I overcame my initial hesitation about being an author, I began to see it as a new experience, one that I enjoyed thoroughly during the process of writing the chapters. Antonio wrote the first half of the book (Chapters 1-4) and I wrote the second half (Chapters 5-8), but we reviewed each other's work as well before it went out for review by others. Since Antonio works for Google, he had Googlers internally review his chapters as part of their official process, and I was fortunate to have some of them review my work as well and provide valuable feedback. In addition, our technical reviewer from PackT, Nick McClure, also provided valuable suggestions. The book has benefited a great deal from the thoroughness of these reviews.

The speed at which our industry moves means that people in it have to adapt quickly as well, and I am no exception. Often, when I pick up a new technology, I spend just enough time on the theory so I can build something that works. If I don't fully understand something that isn't central to what I am building, I just "accept" it and move on. Unfortunately, this doesn't work when you are writing a book - while I have tried to limit the theory to be just enough to explain the model that I build in code, the explanation needed to be accurate and complete. For that I had to revisit some basic concepts in order to clarify them for myself, something I had neglected to do while learning about them the first time. So in a sense, writing this book actually forced me to fill gaps in my own knowledge, so I am really grateful I did it.

From an engineering standpoint, I thought PackT's publication pipeline was quite cool. I had imagined that we would provide the manuscripts electronically over email and it would go back and forth, using the built-in comment mechanism supported by Microsoft Word or similar. At least that has been my experience with PackT as a reviewer in the past. Instead, they now have a Content Development Platform (CDP), a CMS (similar to Joomla or Drupal) customized to the publishing task. Authors enter their chapters into an online editor that supports code blocks, quotations, images, info boxes, etc, as well as version control. Reviewers make comments using the same interface, and the EBook and print copies are generated automatically off the updated content.

Our own process was somewhat hybrid, since we started writing before we learned about the CDP, so we started off using Google Docs, which turned out to be a good choice since it could be shared easily with Google reviewers. We ended up building all our chapters on Google docs, and then copying them over to the CDP after the Google reviews, at which point all comments and changes happened only on the CDP.

The editors from PackT were awesome to work with as well - many thanks to Divya Poojari (Acquisition editor), Cheryl Dsa (Content editor) and Dinesh Thakur (Publishing editor) for all their help guiding us through the various steps of the publishing process.

One thing that hit us towards the end, I think about a week before our originally scheduled release date, was the Keras2 upgrade. Because it was so late in the process, we debated a bit about launching as-is and providing an upgrade guide to help readers upgrade the provided code to Keras2, but in the end we decided that the right thing to do was to upgrade our code before release. This did push back the schedule a bit, but the upgrade process went relatively smoothly, thanks in large part to the very informative deprecation warnings that Keras provides.

Looking back, I am really grateful to Antonio for having confidence in my skills and offering me the opportunity to co-author the book with him. Writing the book was an extremely valuable experience for me. Quick shout-out also to two of my colleagues here at Elsevier, Ron Daniel and Bradley P Allen, both of whom have been working on Deep Learning for longer than I have, and whose experiences led me to investigate the subject further in the first place. Also, the last four months were pretty hectic, trying to balance work, the book and home, and I am grateful to my family for their patience.

Antonio and I have put in a lot of thought and effort into this book. For the explanations, we have tried to strike a balance, trying to present just enough detail to be complete yet not inundating you with math. For the code, we have tried to keep it simple enough to understand but not so simple that it ends up implementing something trivial. But all things considered, the true litmus test for the book is whether you the reader find it useful. We look forward to hearing back from you.