Saturday, July 22, 2017

The Benefits of Attention for Document Classification

A couple of weeks ago, I presented Embed, Encode, Attend, Predict - applying the 4 step NLP recipe for text classification and similarity at PyData Seattle 2017. The talk itself was inspired by the Embed, encode, attend, predict: The new deep learning formula for state-of-the-art NLP models blog post by Matthew Honnibal, creator of the spaCy Natural Language Processing (NLP) Python toolkit. In it, he posits that any NLP pipeline can be constructed from these 4 basic operations and provides examples from two of his use cases. In my presentation, I use his recipe to construct deep learning pipelines for two other processes - document classification and text similarity.

Now I realize that it might seem a bit pathetic to write a blog post about a presentation about someone else's blog post. But the reason I even came up with the idea for the presentation was because Honnibal's idea of using these higher level building blocks struck me as being so insightful and generalizable that I figured that it would be interesting to use it on my own use cases. And I decided to do the blog post because I thought that the general idea of abstracting a pipeline using these 4 steps would be useful to people beyond those who attended my talk. I also hope to provide a more in-depth look at the Attend step here than I could during the talk due to time constraints.

Today, I cover only my first use case of document classification. As those of you who attended my talk would recall, I did not get very good results for the second and third use cases around document and text similarity. I have a few ideas that I am exploring at the moment. If they are successful, I will talk about them in a future post.

The 4 step recipe

For those of you who are not aware of the 4-step recipe, I refer you to Honnibal's original blog post for the details. But if you would rather just get a quick refresher, the 4 steps are as follows:

  • Embed - converts an integer into a vector. For example, a sequence of words can be transformed through vocabulary lookup to a sequence of integers, each of which could be transformed into a fixed size vector represented by the word embedding looked up from third party embeddings such as word2vec or GloVe.
  • Encode - converts a sequence of vectors into a matrix. For example, a sequence of vectors representing some sequence of words such as a sentence, could be sent through a bi-directional LSTM to produce a sentence matrix.
  • Attend - reduces the matrix to a vector. This can be done by passing the matrix into an Attention mechanism that captures the most salient features of the matrix, thus minimizing the information loss during reduction.
  • Predict - reduces a vector to an integer label. This would correspond to a fully connected prediction layer that takes a vector as input and returns a single classification label.
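To make the shape changes in the four steps concrete, here is a minimal NumPy sketch. It is only an illustration under stated assumptions: the embedding table is random rather than GloVe or word2vec, the Encode step uses running averages from both directions as a crude stand-in for a bidirectional LSTM, the Attend step is a plain average, and all names and sizes are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE, EMBED_SIZE, NUM_CLASSES = 10, 4, 3  # hypothetical sizes

# Embed: integer word ids -> one vector per word (table lookup).
embedding_table = rng.normal(size=(VOCAB_SIZE, EMBED_SIZE))  # random stand-in for GloVe
word_ids = np.array([1, 5, 2, 7, 7])                         # a 5-word "sentence"
embedded = embedding_table[word_ids]                         # shape (5, EMBED_SIZE)

# Encode: sequence of vectors -> matrix of context-aware vectors.
# Assumption: running means from the left and from the right replace the BiLSTM.
counts = np.arange(1, len(word_ids) + 1)[:, None]
fwd = np.cumsum(embedded, axis=0) / counts
bwd = (np.cumsum(embedded[::-1], axis=0) / counts)[::-1]
encoded = np.concatenate([fwd, bwd], axis=1)                 # shape (5, 2 * EMBED_SIZE)

# Attend: matrix -> vector (naive version: average over the time axis).
attended = encoded.mean(axis=0)                              # shape (2 * EMBED_SIZE,)

# Predict: vector -> integer label (untrained linear layer plus argmax).
W_out = rng.normal(size=(2 * EMBED_SIZE, NUM_CLASSES))
label = int(np.argmax(attended @ W_out))
```

Each step only changes the shape of the data flowing through it, which is why the steps can be swapped for smarter or more naive versions independently.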

Of these steps, all but the Attend step are adequately implemented by most Deep Learning toolkits. My examples use Keras, a Python deep learning library. In Keras, the Embed step is represented by the Embedding layer where you initialize the weights from an external embedding; the Encode step can be implemented using an LSTM layer wrapped in a Bidirectional wrapper; and the Predict step is implemented with a Dense layer.

Experiment: Document Classification

These steps can be thought of as large logical building blocks for our NLP pipeline. A pipeline can be composed of zero or more of these steps. It is also important to realize that each of these steps has a naive, non-deep-learning equivalent. For example, the Embed step can be done using one-hot vectors instead of third party word embeddings; the Encode step can be done by just concatenating the vectors along their short axis; the Attend step can be done by averaging the component word vectors; and the Predict step can use an algorithm other than deep learning. Since I wanted to see the effect of each of these steps separately, I conducted the following set of experiments - the links lead out to Jupyter notebooks on GitHub.

The data for this experiment comes from the 20 newsgroups dataset, which is available through scikit-learn's datasets package. It is a collection of roughly 18,000 newsgroup postings, each pre-categorized into one of 20 newsgroups. Our objective is to build a classifier (or classifiers) that can predict the document's newsgroup category from its text.

  • Embed and Predict (EP) - Here I treat a sentence as a bag of words and a document as a bag of sentences. So a word vector is created by looking it up against a GloVe embedding, a sentence vector is created by averaging its word vectors, and a document vector is created by averaging its sentence vectors. The resulting document vector is fed into a 2 layer Dense network to produce a prediction of one of 20 classes.
  • Embed, Encode and Predict (EEP) - We use a document classification hierarchy as described in this paper by Yang, et al.[1]. Specifically, a sentence encoder is created that transforms integer sequences (from words in sentences) into a sequence of word vectors by looking up GloVe embeddings, then converts the sequence of word vectors to a sentence vector by passing it through a Bidirectional LSTM and capturing the context vector. This sentence encoder is embedded into the document network, which takes in a sequence of sequences of integers (representing a sequence of sentences or a document). The sentence vectors are passed into a Bidirectional LSTM encoder that outputs a document vector, again by returning only the context vector. This document vector is fed into a 2 layer Dense network to produce a category prediction.
  • Embed, Encode, Attend and Predict #1 (EEAP#1) - In this network, we add an Attention layer in the sentence encoder as well as in the Document classification network. Unlike the previous network, the Bidirectional LSTM in either network returns the full sequences, which are then reduced by the Attention layer. This layer is of the first type as described below. Output of the document encoding is a document vector as before, so as before it is fed into a 2 layer Dense network to produce a category prediction.
  • Embed, Encode, Attend and Predict #2 (EEAP#2) - The only difference between this network and the previous one is the use of the second type of Attention mechanism as described in more detail below.
  • Embed, Encode, Attend and Predict #3 (EEAP#3) - The only difference between this network and the previous one is the use of the third type of Attention mechanism. Here the Attention layer is fed with the output of the Bidirectional LSTM as well as the output of a max pool operation on the sequence to capture the most important parts of the encoding output.
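The bag-of-words averaging used by the EP baseline in the list above can be sketched in a few lines of NumPy. This is a simplified illustration, not the notebook code: the tiny 2-dimensional embedding dictionary and the function name document_vector are made up for the example, and words missing from the embedding are simply skipped (an assumption).

```python
import numpy as np

def document_vector(doc, embeddings, embed_size):
    """Average word vectors into sentence vectors, then sentence vectors
    into a single document vector. `doc` is a list of tokenized sentences."""
    sent_vecs = []
    for sentence in doc:
        # Assumption: out-of-vocabulary words are ignored.
        word_vecs = [embeddings[w] for w in sentence if w in embeddings]
        if word_vecs:
            sent_vecs.append(np.mean(word_vecs, axis=0))
    if not sent_vecs:
        return np.zeros(embed_size)
    return np.mean(sent_vecs, axis=0)

# Toy embeddings standing in for GloVe vectors.
toy_embeddings = {
    "good": np.array([1.0, 0.0]),
    "bad": np.array([0.0, 1.0]),
    "movie": np.array([1.0, 1.0]),
}
doc = [["good", "movie"], ["bad", "unknown"]]
doc_vec = document_vector(doc, toy_embeddings, embed_size=2)
```

The resulting vector has the same size as a single word embedding regardless of document length, which is what lets it feed directly into a Dense classifier.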

The results of the experiment are as follows. The interesting values are the blue bars, which represent the accuracy reported by each trained model on the 30% held-out test set. As you would expect, the Bag of Words (EP) approach yields the worst results, around 71.4%, which goes up to 77% once we replace the naive encoding with a Bidirectional LSTM (EEP). All the models with Attention outperform these two models, and the best result is around 82.4% accuracy with the first Attend layer (EEAP#1).

Attention Mechanisms

I think one reason Keras doesn't provide an implementation of Attention is that different researchers have proposed slightly different variations. For example, the only toolkit I know that offers Attention implementations is TensorFlow (LuongAttention and BahdanauAttention), but both are in the narrower context of seq2seq models. Perhaps a generalized Attention layer is just not worth the trouble given all the variations and maybe it is preferable to build custom one-offs yourself. In any case, I ended up spending quite a bit of time understanding how Attention worked and how to implement it myself, which I hope to also share with you in this post.

Honnibal's blog post also offers a taxonomy of different kinds of attention. Recall that the Attend step is a reduce operation, converting a matrix to a vector, so the following configurations are possible.

  • Matrix to Vector - proposed by Raffel, et al.[2]
  • Matrix to Vector (with implicit context) - proposed by Lin, et al.[3]
  • Matrix + Vector to Vector - proposed by Cho, et al.[4]
  • Matrix + Matrix to Vector - proposed by Parikh, et al.[5]

Of these, I will cover the first three here since they were used for the document classification example. References to the papers where these were proposed are provided at the end of the post. I have tried to normalize the notation across these papers so it is easier to talk about them in relation to each other.

I ended up implementing them as custom layers, although in hindsight, I could probably have used Keras layers to compose them as well. However, that approach can be complex if your attention mechanism is complicated. If you want an example of how to do that, take a look at spaCy's implementation of decomposable attention used for sentence entailment.

There are many blog posts and articles that talk about how Attention works. By far the best one I have seen is this one from Heuritech. Essentially, the Attention process involves combining the input signal (a matrix) with some other signal (a vector) to find an alignment that tells us which parts of the input signal we should pay attention to. The alignment is then combined with the input signal to produce the attended output. Personally, I have found that it helps to look at a flow diagram to see how the signals are combined, and the equations to figure out how to implement the layer.

Matrix to Vector (Raffel)

This mechanism is a pure reduction operation. The input signal is passed through a tanh and a softmax to produce an alignment vector with one weight per timestep. The dot product of the alignment and the input signal is the attended output.

One thing to note here is the presence of the learnable weights W and b. The idea is that the component will learn these values so as to align the input based on the task it is being trained for.

The code for this layer can be found in class AttentionM in the custom layer code.
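As a rough illustration of the computation described in this section, here is a NumPy sketch for a single (unbatched) input. It is written from the description above rather than copied from the AttentionM class; the function name attention_m and the weight shapes (one weight per feature in W, one bias per timestep in b) are assumptions made for the example.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_m(h, W, b):
    """Matrix-to-vector attention sketch.
    h: (T, D) encoder outputs, W: (D,) weights, b: (T,) biases (assumed shapes).
    Returns the attended vector (D,) and the alignment weights (T,)."""
    e = np.tanh(h @ W + b)   # one score per timestep, shape (T,)
    alpha = softmax(e)       # alignment weights, sum to 1
    context = alpha @ h      # weighted sum of the timesteps, shape (D,)
    return context, alpha

rng = np.random.default_rng(42)
h = rng.normal(size=(6, 4))              # 6 timesteps, 4 features
context, alpha = attention_m(h, rng.normal(size=4), np.zeros(6))
```

With W and b set to zero every timestep gets the same weight, so the layer starts out equivalent to plain averaging and training moves it away from that baseline.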

Matrix to Vector (Lin)

This mechanism is also a pure reduction operation, since the input to the layer is a matrix and the output is a vector. However, unlike the previous mechanism, it learns an implicit context vector u, in addition to W and b, as part of the training process. You can see this by the presence of a u vector entering the softmax and in the formula for αt.

Code for this Attention class can be found in the AttentionMC class in the custom layer code.
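The difference from the previous mechanism can be seen in a small NumPy sketch for a single input. Again this is an illustration written from the description, not the AttentionMC code: the name attention_mc and the shapes (a square W, a per-feature b, and a learned context vector u of the same size as a timestep) are assumptions.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_mc(h, W, b, u):
    """Matrix-to-vector attention with a learned (implicit) context vector.
    h: (T, D), W: (D, D), b: (D,), u: (D,) (assumed shapes).
    Returns the attended vector (D,) and the alignment weights (T,)."""
    hidden = np.tanh(h @ W + b)   # (T, D) hidden representation per timestep
    alpha = softmax(hidden @ u)   # similarity of each timestep to u, shape (T,)
    context = alpha @ h           # weighted sum of the timesteps, shape (D,)
    return context, alpha

rng = np.random.default_rng(7)
h = rng.normal(size=(6, 4))
context, alpha = attention_mc(
    h, rng.normal(size=(4, 4)), np.zeros(4), rng.normal(size=4)
)
```

Here u plays the role of a query that is learned during training rather than supplied from outside, which is why the layer still takes only the matrix as input.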

Matrix + Vector to Vector (Cho)

Unlike the previous two mechanisms, this takes an additional context vector that is explicitly provided along with the input signal matrix from the Encode step. This can be a vector that is generated by some external means that is somehow representative of the input. In our case, I just took the max pool of the input matrix along the time dimension. The process of creating the alignment vector is the same as the first mechanism. However, there is now an additional weight that learns how much weight to give to the provided context vector, in addition to the weights W and b.

Code for this Attention class can be found in the AttentionMV class in the code for the custom layers.
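A NumPy sketch of this variant for a single input is shown below. It follows the description in this section, with the context vector taken as the max pool of the input along the time dimension. The function name attention_mv and the weight shapes, in particular a U matrix that maps the context vector to one extra score term per timestep, are assumptions for illustration and may differ from the AttentionMV class.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention_mv(h, c, W, U, b):
    """Matrix + vector to vector attention sketch.
    h: (T, D) encoder outputs, c: (D,) externally supplied context vector,
    W: (D,), U: (D, T), b: (T,) (assumed shapes).
    Returns the attended vector (D,) and the alignment weights (T,)."""
    e = np.tanh(h @ W + c @ U + b)   # input term + context term, shape (T,)
    alpha = softmax(e)               # alignment weights, sum to 1
    return alpha @ h, alpha

rng = np.random.default_rng(3)
h = rng.normal(size=(6, 4))
c = h.max(axis=0)                    # max pool along the time dimension
context, alpha = attention_mv(
    h, c, rng.normal(size=4), rng.normal(size=(4, 6)), np.zeros(6)
)
```

The extra weight U is what lets the layer learn how strongly the supplied context should influence the alignment.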

As you may have noticed, the code for the various custom layers is fairly repetitive. We declare the weights in the build() method and the computations with the weights and signals in the call() method. In addition, we support input masking via the presence of the compute_mask() method. The get_config() method is needed when trying to save and load the model. Keras provides some guidance on building custom layers, but a lot of the information is scattered around in Keras issues and various blog posts. The Keras website is notable, among other things, for the quality of its documentation, but somehow custom layers haven't received the same kind of love and attention. I am guessing that perhaps it is because this is closer to the internals and hence more changeable, so harder to maintain, and also once you are doing custom layers, you are expected to be able to read the code yourself.
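To illustrate what input masking means for an attention layer, here is a small NumPy sketch of one common approach: padded timesteps are excluded before the softmax so that the remaining weights still sum to 1. This is not necessarily how the custom layers in this post apply the mask; the function name masked_softmax is hypothetical and the sketch assumes at least one unmasked timestep.

```python
import numpy as np

def masked_softmax(scores, mask):
    """scores: (T,) attention scores, mask: (T,) booleans, True for real
    timesteps and False for padding. Assumes at least one True entry."""
    scores = np.where(mask, scores, -np.inf)  # padding can never win
    e = np.exp(scores - np.max(scores))       # exp(-inf) evaluates to 0.0
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.5, 0.0, 0.0])
mask = np.array([True, True, False, False, False])  # last three are padding
weights = masked_softmax(scores, mask)
```

Padded positions receive a weight of exactly zero and the real timesteps share the full probability mass among themselves.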

So there you have it. This is Honnibal's 4-step recipe for deep learning NLP pipelines, and how I used it for one of the use cases I talked about at PyData. I hope you found the information about Attention and how to create your own Attention implementations useful.


  1. Yang, Z., et al. (2016). Hierarchical attention networks for document classification. In Proceedings of NAACL-HLT (pp. 1480-1489).
  2. Raffel, C., & Ellis, D. P. (2015). Feed-forward networks with attention can solve some long-term memory problems. arXiv preprint arXiv:1512.08756.
  3. Lin, Z., et al. (2017). A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130.
  4. Cho, K., et al. (2015). Describing multimedia content using attention-based encoder-decoder networks. IEEE Transactions on Multimedia, 17(11), 1875-1886.
  5. Parikh, A. P., et al. (2016). A decomposable attention model for natural language inference. arXiv preprint arXiv:1606.01933.

22 comments (moderated to prevent spam):

Abebawu Eshetu said...

Dear Sujit, how are you? Thank you for giving insight on EEAP model. I am student and it helps me on my work. Inspired by your previous post on SIF sentence embedding, I want to use SIF based attention at attention layer. My aim is to pass two tensor then return matrix with shape (n_sample,embedding_size). Could you help me how to do that?

With Regards,

Sujit Pal said...

Hi Abebawu, I am fine, hope you are too, and you are welcome, glad my post helped you. My understanding is that SIF is a weighted embedding scheme for word collections (sentences/documents) where you weight 3rd party word embeddings for each word by the inverse of its probability of occurrence. I am not sure what you mean when you say you want SIF based attention - do you want to generate a custom reference vector using SIF embeddings. Or are you looking for a custom layer that will use a supplied word frequency table to generate the initial embeddings?

Abebawu Eshetu said...

Thank you for the help, Sujit. What I need is to provide the SIF vector of the input as a custom reference vector for the attention layer, which uses the matrix for the provided sentence. I tried it, but it returns the error "AttributeError: 'Tensor' object has no attribute '_keras_history'" and I removed it. It works. Is there any possible way to incorporate such information as a weight?

Thank you in advance for your favorable reply.

Kind regards,

Sujit Pal said...

The error message is happening (I think) because you are trying to insert a numpy vector where it needs a Theano/Tensorflow (depending on backend) tensor. You may want to pregenerate SIF vectors for the sentences, and declare it as an input to the network, then reference it by variable name later when you want it as input to the attention layer.

Abebawu Eshetu said...

Let I have pregenerated SIF_vector sif, how can I declare this to keras input with place holder. The vector I generated has shape (batch_size, emb_size). Could you suggest sample code please. I am not experienced with deep learning. Thank you for any time you gave for me.

Sujit Pal said...

Since the SIF vector is meant to codify sentences, you would probably only use it in the sentence network, for example the 04c-clf-eeap notebook, as follows. For each of the sentences represented by Xtrain and Xtest, generate a corresponding set of SIF vectors, called sif_train and sif_test say. Then in cell 10, you could declare a sif_input = Input(shape=(MAX_WORDS,)), and use that instead of sent_vec below.

Abebawu Eshetu said...

Hi sujit how are you? I believe that you are doing good. Today also I am going to ask you some help. Inspired by your hierarchical document similarity task I was trying to implement hierarchical approach for short text based problem that is sentence to paragraph level. In my case word embedding is concatenation of word embedding and char level CNN embedding. then sequence encoder encodes embedding output at sentence level and I want to apply paragraph level model based on output of sentence level model output. Every thing works good till sentence model, but when I apply the following fragment it raises the following error.
model_sen_model=Model(inputs=inputs_model, outputs=sen_seq_rep)
model_sen_model.summary() # here it works well

model_paragraph=Input(shape=(max_sentences, seq_length), dtype='int32',name='model_paragraph_input')

now raises
assert str(id(x)) in tensor_map, 'Could not compute output ' + str(x)

AssertionError: Could not compute output Tensor("time_distributed_379/Reshape_1:0", shape=(?, 65, 200), dtype=float32)

#shape of last output is (None,seq_length,emb_dim)
I tried with return_sequences=False for BiLSTM encoder and in this case in same line it raise
Assertion problem i.e may be TimeDistributed layer expectes 3D and return_sequences=False result is 2D.

I guess the problem is related to previous multiple outputs (CNN char and RNN combined).
Is there any possible suggestion?

Thank you heartedly for all thing your help. I know you are busy, but I appreciate any time you gave to me. I am running in last week of my work submission.

Sujit Pal said...

Hi Abebawu, sorry about the delay in responding, hope your problem got fixed and you were not waiting on me. You would typically apply time distributed to each step of the model, so input to it should be (?, embedding size). Maybe check to see if the model_sen_model is sending out a single context vector with return_sequences=False?

Xyore said...

Very amazing and interesting post

Sujit Pal said...

Thank you Xyore.

Abebawu Eshetu said...

Hi sujit how are you? I believe that you are doing good. As usual I am going to ask again. First, I really want to thank you for all the constructive comments and suggestions. Today my question is how can I save and load a hierarchical model with keras? I have tried and googled many times, but I did not get any answer. I believe that you can suggest a possible way to go. When I tried it, it raised 'ValueError: Missing layer: input_12'. Is there any possible solution to recover from such an error? I know sujit you are busy, but I appreciate any time you gave for me.

Kind regards,

Sujit Pal said...

Hi Abebawu, usually if I do a model.save and models.load_model things just work. The only thing I can think of is that if you are using custom layers, then you have to pass that information to load_model as a dictionary, see Keras Issue 4871 for a discussion. For example, if your network includes the attention layer given by class AttentionMM, then you would need to specify the following additional argument in load_model: custom_objects={"AttentionMM": custom_attn.AttentionMM}. You can see an example in this notebook.

Abiodun Modupe said...

Hi Sujit,

Thank you so much for your contribution and self-motivated tutor. I want to create a representation for an informal sentence such as Tweet from a structured document such as the newswire. I have embedded layer for the two documents using GloVec. I decided to find the average of the two embedding then fed into CNN with flatting and batch normalization. The output of the convolution operation is then fed into Bidirectional LSTM to learn long-dependencies in the input sequent before softmax function to obtain the predicted label. The idea is to use structured data to find the representation of informal text, so the question is how can I find the average of two embedding layer?

I also want to use the model to find the overlapping between structured and unstructured document. Will this be possible? Looking forward to your suggestion and advice.


Sujit Pal said...

Hi Abbey, I don't have concrete advice unfortunately, but I do have a few questions that may or may not help me provide advice. Tweets are usually 140 (now sometimes 280) chars long, and newswire articles are much larger. Is the CNN an approach to reducing them to a "standard" size? I feel that an LSTM might be better at this stage, at least for tweets, since LSTMs are better able to capture long term dependencies, and CNNs work kind of like a bag-of-words on local dependencies. Also, not quite sure what your prediction label is - you said you are trying to find a representation of twitter (or more generally informal) text using newswire (or more generally structured) text, but I am not understanding how that would be done here. Maybe you could provide an example? If you are just trying to compute a similarity/difference metric, you could try element-wise multiplication, absolute difference or even concatenation.

Abiodun Modupe said...

Hi Sujit,

Thank you so much for your response. For example:

doc1: 'i will see you tomorrow at 10pm'
doc2: 'c u 2mor'

The doc1 is a structured text, while doc2 is an informal text. The idea is to learn features(n-gram) representation which generalizes well across structure texts and different informal text. The goal is to learn features (embeddings) for each word similar words in each structure text to learn generalized embedding representation for informal text.

My idea of a solution is to use pre-trained word embedding (e.g., Glove), then find the average of the two embedding (rather than concatenate or multiplication). Thereafter, I can be fed the output unto CNN then LSTM which can be trained on a linear classifier to predict class label in structured text, where we have labelled data and then transfer the idea to informal text for which we do not have the label.

1. How did you see this idea?
2. How can one optimize the two embedding layer e.g., its to sum, absolutely different, concatenate or average? I will prefer average but I have no concept of how to do it.

Code sample:
model = Sequential()
e = Embedding(len(word_index)+1, 100, weights=[embedding_matrix], input_length=4, trainable=True)
e1 = Embedding(len(word_index)+1, 100, weights=[embedding_matrix1], input_length=4, trainable=True)
The average code here!!!!
model.add(Dense(1, activation='sigmoid'))

Sujit Pal said...

#2: Let me answer this first. Since your model is not really sequential, you should look at using the functional API. You will declare Inputs for each of the two data streams (one for formal sentences and one for tweets). The Embeddings will take these as inputs and will convert them to a 3D tensor (batch_size, sentence_length_in_words, embedding_size). You can compute the average using keras.layers.average([e, e1]). Look at the Keras documentation on Merge Layers for the operations available.

#1: I am not sure, I don't quite understand the intuition behind the sequence of operations, you may want to try and see if it works. I would think just building word embeddings on twitter data may give you what you are looking for, since it will contain a combination of formal and informal ways of saying the same thing.

In the structure you described, the output of the Merge layer will go to a 1D CNN. Output of the CNN will be (batch_size, smaller_sequence_length, longer_embedding_size). You now feed this unchanged into an LSTM and get the context vector out the other end (return_sequences=False) of size (batch_size, some_other_embedding_size), this will then go into a FCN (Dense layers) to get some sort of prediction, not sure what it is from the details provided. I am sure you have reasons for doing these steps, but I guess I am not seeing it.

Cynthia Wu said...

Thanks for the informative blog post! In the spirit of the paper giving rise to the Hierarchical Attention Network, I would like to know the importance of every word in a sentence and every sentence in a document

How could we go about grabbing the attention weights ( I am assuming the atx's in for every test sample on both the word level and sentence level?

Sujit Pal said...

Not 100% sure if this will work, but one possibility could be to train the network on the classification task and then get the output of the attention layer. For words in sentence you could use the sentence model and for sentences in document you could use the document model.

Cynthia Wu said...

Thank you for responding. What about something like this? I am only looking at the sentence level weights for now:

Assuming the model is trained, here is a function to get the output of the 2nd layer of the network:

from keras import backend as bk
get_2nd_layer_output = bk.function([model.layers[0].input, bk.learning_phase()], [model.layers[2].output])

then for every document in the TEST set:

reshaped_doc = document.reshape(1, max_sentences, max_tokens)
enc = get_2nd_layer_output([reshaped_doc, 0])[0]
et = np.tanh(np.dot(enc, att_weights["W"]) + att_weights["b"])
at = softmax(et)

Then at[0][j] will give me the weight of the jth sentence of that document.

I am basing a lot of this off of your code on:

Your code defines softmax and att_weights. max_sentences is the maximum number of sentences in a document I allow. max_tokens is the maximum number of words in a sentence I allow.

Sujit Pal said...

Yes, I think this should work. Are you trying to do this for an extractive summarization task?

Cynthia Wu said...

Nope, this is actually for escalation determination between a user and a chatbot. As in: when should we escalate the user to a real human customer representative because the robot is not helping? It is important to know why the NN is thinking a conversation should escalate or not, so I am hoping to make figures similar to Figures 5 and 6 in the Hierarchical attention networks paper.

Assuming I got the weights correctly (and I am thankful you confirmed this as I am no Keras and Tensorflow expert!), the next issue is that conversations vary widely in length which results in lots of padding. Interestingly, PAD does not get a weight of 0. It is close, but not 0. I imagine it is because, in a sense, padding is somewhat informative in itself so it gets a small weight. But then you might end up with a document with only two sentences and those two sentences only having 12 percent weight each given that immense amount of padding. I am covering 100% of my data now, but I might try what you have done and get 95% of data or the upper fence.

Sujit Pal said...

Very cool, good luck with the project.