Sunday, October 09, 2016

Deep Learning Models for Question Answering with Keras


Last week, I was at a (company internal) workshop on Question Answering (Q+A), organized by our Search Guild, of which I am a member. The word "guild" sounds vaguely medieval, but it's basically a group of employees who share a common interest in Search technologies. As so often happens in large companies, groups tend to be somewhat silo-ized, and one group might not know much about what another one is doing, so the objective of the Search Guild is to bring groups together and promote knowledge sharing. To that end, the Search Guild organizes monthly presentations (with internal speakers as well as industry experts from outside the company) delivered via Webex (we are a distributed company with offices on at least 5 continents). It also provides forums for members to share information via blog posts, mailing lists, etc. As part of this effort, and given the importance of Q+A to Search, this year we organized our very first workshop on Q+A in Philadelphia on October 5 and 6.

What was unique about this workshop for me was that I was an organizer, speaker and attendee all at once. As a speaker, there is obviously significant additional work involved in building your presentation and delivering it. As an organizer, however, you truly get an appreciation of how much work goes into making an event successful. Many thanks to my fellow organizers for all the work they did, and apologies to the participants (if any of them are reading this) for any mistakes we made (we made quite a few; next time we should definitely use more checklists. Also, remote two-way participation is very hard).

The talks at the Workshop were organized into 4 primary themes. The first group of 3 talks (one of which was mine) dealt with approaches designed against external benchmarks, and were a bit more "researchy" than the others. The second group of 3 talks dealt with Question Complexity and how people are tackling it in their various projects. The third group of 4 talks looked at strategies used by engines that were already in production or QA, and the fourth group had 3 talks around different approaches to introducing Q+A into our Clinical search engine. In addition, there were several short talks and demos, mostly around Clinical. The most interest and activity in Q+A is around our Legal and Clinical search engines, followed by search engine products built around Life Sciences, Material Science and Chemistry. Attendance-wise, we had around 25 in-person participants and 15 remote. 3 of the 13 talks were delivered remotely from our Amsterdam and Frankfurt offices.

My own experience with Question Answering is fairly minimal, mainly attempts to build functionality over search without trying too hard to understand the question implicit in the query. So it was definitely a great learning experience for me, to hear from people who had thought about their respective domains at length and come up with some pretty innovative solutions. As expected, some of the approaches described were similar to what I had used before, but they were used as part of a broader array of techniques, so there was something to learn for me there as well.

In this post, I will briefly describe my presentation and point you to the slides and code. My talk was about a hobby project that my co-presenter Abhishek Sharma and I started a couple of months ago, hoping to deepen our understanding of how Deep Learning could be applied to Question Answering. We are both part of the Deep Learning Enthusiasts Meetup (he is the organizer), and he came up with the idea while we were watching Richard Socher's Deep Learning for Natural Language Processing (CS224d) lectures. The project involves implementing a bunch of Deep Learning models to predict the correct choice for multiple choice 8th grade Science questions. The data came from the Allen AI Science Challenge on Kaggle.

You can find the slides for the talk here. All the code can be found in this GitHub repository. The code is written in Python using the awesome Keras library. I also used gensim to generate and load external embeddings, and NLTK and SpaCy for some simple NLP functionality. The README.md is fairly detailed (with many illustrations originally built for the slides), so I am not going to repeat the stuff here.

I looked at the "question with four candidate answers, one of which is correct" setup as a classification problem with 1 positive and 3 negative examples per question. All my models produce a binary (correct/incorrect) response given a question and answer pair. Once the best model (in terms of accuracy of correct/incorrect predictions) is identified, I then run it on all four (question, answer) pairs and select the one with the best score. To do this, I needed to be able to serialize each model after training and deserialize it in the final prediction script. This is where I ran into the problems I described in Keras Issue 3927.
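As a minimal sketch of this framing, the snippet below expands one multiple-choice question into four labeled (question, answer) rows. The function name `expand_question` and the example question are made up for illustration; the actual data preparation lives in the repository's helper module and may differ in detail.

```python
def expand_question(question, choices, correct_letter):
    """Turn one multiple-choice question into four (question, answer, label) rows.

    The label is 1 for the correct choice and 0 for the three distractors,
    giving 1 positive and 3 negative examples per question.
    """
    correct_idx = ord(correct_letter) - ord("A")
    return [(question, choice, int(i == correct_idx))
            for i, choice in enumerate(choices)]


# Hypothetical example question, not taken from the competition data.
rows = expand_question(
    "Which gas do plants absorb from the air?",
    ["oxygen", "carbon dioxide", "nitrogen", "helium"],
    "B")
labels = [label for _, _, label in rows]
print(labels)
```

Each row can then be vectorized and fed to a binary classifier independently of the other three.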

To make a long story short, if you re-use an input with the Sequential model, the weights get mis-aligned somehow and cannot be loaded back into the model. I noticed it after I upgraded to the latest version of Keras from a much older version because of some extra layer types I wanted to use. The workaround for the newer version seems to be to use the Functional API. Unfortunately I wasn't able to do the code rewrite and rerun by my presentation deadline, although luckily for me, I did have a usable model for one of my earlier (weaker) classifiers that I saved using the earlier version.

So in the rest of this post, I will describe the architecture and code for my strongest model, an LSTM-QA model with Attention (inspired by the paper LSTM-based Deep Learning Models for Non-factoid Answer Selection by Tan, dos Santos, Xiang and Zhou), and using a custom embedding generated from approximately 500k Studystack Flashcards, followed by the code for finding the best answer. In other words, the last mile of my solution.

This is what the network looks like:


And here is the code for the network.

# Source: qa-lstm-fem-attn.py
# -*- coding: utf-8 -*-
from __future__ import division, print_function
from keras.callbacks import ModelCheckpoint
from keras.layers import Input, Dense, Dropout, Reshape, Flatten, merge
from keras.layers.embeddings import Embedding
from keras.layers.recurrent import LSTM
from keras.models import Model
from sklearn.cross_validation import train_test_split
import os
import sys

import kaggle

DATA_DIR = "../data/comp_data"
MODEL_DIR = "../data/models"
WORD2VEC_BIN = "studystack.bin"
WORD2VEC_EMBED_SIZE = 300

QA_TRAIN_FILE = "8thGr-NDMC-Train.csv"
QA_TEST_FILE = "8thGr-NDMC-Test.csv"

QA_EMBED_SIZE = 64
BATCH_SIZE = 128
NBR_EPOCHS = 20

## extract data
print("Loading and formatting data...")
qapairs = kaggle.get_question_answer_pairs(
    os.path.join(DATA_DIR, QA_TRAIN_FILE))
question_maxlen = max([len(qapair[0]) for qapair in qapairs])
answer_maxlen = max([len(qapair[1]) for qapair in qapairs])

# Even though we don't use the test set for classification, we still need
# to consider any additional vocabulary words from it for when we use the
# model for prediction (against the test set).
tqapairs = kaggle.get_question_answer_pairs(
    os.path.join(DATA_DIR, QA_TEST_FILE), is_test=True)    
tq_maxlen = max([len(qapair[0]) for qapair in tqapairs])
ta_maxlen = max([len(qapair[1]) for qapair in tqapairs])

seq_maxlen = max([question_maxlen, answer_maxlen, tq_maxlen, ta_maxlen])

word2idx = kaggle.build_vocab([], qapairs, tqapairs)
vocab_size = len(word2idx) + 1 # include mask character 0

Xq, Xa, Y = kaggle.vectorize_qapairs(qapairs, word2idx, seq_maxlen)
Xqtrain, Xqtest, Xatrain, Xatest, Ytrain, Ytest = \
    train_test_split(Xq, Xa, Y, test_size=0.3, random_state=42)
print(Xqtrain.shape, Xqtest.shape, Xatrain.shape, Xatest.shape, 
      Ytrain.shape, Ytest.shape)

# get embeddings from word2vec
print("Loading Word2Vec model and generating embedding matrix...")
embedding_weights = kaggle.get_weights_word2vec(word2idx,
    os.path.join(DATA_DIR, WORD2VEC_BIN), is_custom=True)
        
print("Building model...")

# output: (None, seq_maxlen, QA_EMBED_SIZE) since return_sequences=True
qin = Input(shape=(seq_maxlen,), dtype="int32")
qenc = Embedding(input_dim=vocab_size,
                 output_dim=WORD2VEC_EMBED_SIZE,
                 input_length=seq_maxlen,
                 weights=[embedding_weights])(qin)
qenc = LSTM(QA_EMBED_SIZE, return_sequences=True)(qenc)
qenc = Dropout(0.3)(qenc)

# output: (None, seq_maxlen, QA_EMBED_SIZE) since return_sequences=True
ain = Input(shape=(seq_maxlen,), dtype="int32")
aenc = Embedding(input_dim=vocab_size,
                 output_dim=WORD2VEC_EMBED_SIZE,
                 input_length=seq_maxlen,
                 weights=[embedding_weights])(ain)
aenc = LSTM(QA_EMBED_SIZE, return_sequences=True)(aenc)
aenc = Dropout(0.3)(aenc)

# attention model
attn = merge([qenc, aenc], mode="dot", dot_axes=[1, 1])
attn = Flatten()(attn)
attn = Dense(seq_maxlen * QA_EMBED_SIZE)(attn)
attn = Reshape((seq_maxlen, QA_EMBED_SIZE))(attn)

qenc_attn = merge([qenc, attn], mode="sum")
qenc_attn = Flatten()(qenc_attn)

output = Dense(2, activation="softmax")(qenc_attn)

model = Model(input=[qin, ain], output=[output])

print("Compiling model...")
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

print("Training...")
best_model_filename = os.path.join(MODEL_DIR, 
    kaggle.get_model_filename(sys.argv[0], "best"))
checkpoint = ModelCheckpoint(filepath=best_model_filename,
                             verbose=1, save_best_only=True)
model.fit([Xqtrain, Xatrain], [Ytrain], batch_size=BATCH_SIZE,
          nb_epoch=NBR_EPOCHS, validation_split=0.1,
          callbacks=[checkpoint])

print("Evaluation...")
loss, acc = model.evaluate([Xqtest, Xatest], [Ytest], batch_size=BATCH_SIZE)
print("Test loss/accuracy final model = %.4f, %.4f" % (loss, acc))

final_model_filename = os.path.join(MODEL_DIR, 
    kaggle.get_model_filename(sys.argv[0], "final"))
json_model_filename = os.path.join(MODEL_DIR,
    kaggle.get_model_filename(sys.argv[0], "json"))
kaggle.save_model(model, json_model_filename, final_model_filename)

best_model = kaggle.load_model(json_model_filename, best_model_filename)
best_model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
loss, acc = best_model.evaluate([Xqtest, Xatest], [Ytest], batch_size=BATCH_SIZE)
print("Test loss/accuracy best model = %.4f, %.4f" % (loss, acc))

The code above represents each question and answer as an array of indexes into the word dictionary built from the words in the questions and answers. The embedding weights are initialized from a word2vec model trained on our corpus of StudyStack flashcards. Attention is modeled as a dot product of the question and answer sequences that come out of the LSTMs, which is then passed through a Dense layer and reshaped to match the question encoding. Finally, the attention output and the question encoding are summed element-wise (the merge uses mode="sum"), flattened, and sent into a Dense layer with a two-way softmax, one output per class (correct/incorrect).
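To make the tensor shapes in the attention block easier to follow, here is a rough NumPy trace of the same sequence of operations. It is only a sketch: the sizes are toy values, the weights are random rather than trained, and the stand-in arrays `qenc` and `aenc` replace the real LSTM outputs. The `einsum` call reflects my reading of what the dot merge with dot_axes=[1, 1] computes, namely a contraction over the time axis.

```python
import numpy as np

rng = np.random.RandomState(42)
batch, seq_maxlen, embed = 2, 5, 4   # toy sizes, not the real ones

# Stand-ins for the two LSTM outputs (return_sequences=True).
qenc = rng.randn(batch, seq_maxlen, embed)
aenc = rng.randn(batch, seq_maxlen, embed)

# Dot merge over axis 1 of both inputs: contract the time dimension.
attn_scores = np.einsum("bte,btf->bef", qenc, aenc)   # (batch, embed, embed)

# Flatten -> Dense -> Reshape back to the shape of the question encoding.
flat = attn_scores.reshape(batch, -1)                 # (batch, embed * embed)
W = rng.randn(embed * embed, seq_maxlen * embed) * 0.1
attn = flat.dot(W).reshape(batch, seq_maxlen, embed)  # (batch, seq_maxlen, embed)

# Sum merge with the question encoding, then Flatten.
qenc_attn = (qenc + attn).reshape(batch, -1)          # (batch, seq_maxlen * embed)

# Dense(2) with softmax: one probability per class.
W_out = rng.randn(seq_maxlen * embed, 2) * 0.1
logits = qenc_attn.dot(W_out)
logits -= logits.max(axis=1, keepdims=True)           # numerical stability
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

print(attn_scores.shape, attn.shape, qenc_attn.shape, probs.shape)
```

Note that the dot product collapses the time axis, which is why the Dense and Reshape layers are needed to get back to a tensor that can be summed with the question encoding.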

The next step takes the saved model (final one) and runs each question in the test set and its four choices as a single batch, and predicts the correct answer as the one which has the highest score. The output is written to a CSV file in the format required for submission to Kaggle.

# src/predict_testfile.py
# -*- coding: utf-8 -*-
from __future__ import division, print_function
from keras.preprocessing.sequence import pad_sequences
import nltk
import numpy as np
import os

import kaggle

DATA_DIR = "../data/comp_data"
TRAIN_FILE = "8thGr-NDMC-Train.csv"
TEST_FILE = "8thGr-NDMC-Test.csv"
SUBMIT_FILE = "submission.csv"

MODEL_DIR = "../data/models"
MODEL_JSON = "qa-lstm-fem-attn.json"
MODEL_WEIGHTS = "qa-lstm-fem-attn-final.h5"
LSTM_SEQLEN = 196 # seq_maxlen from original model

print("Loading model..")
model = kaggle.load_model(os.path.join(MODEL_DIR, MODEL_JSON),
                          os.path.join(MODEL_DIR, MODEL_WEIGHTS))
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])

print("Loading vocabulary...")
qapairs = kaggle.get_question_answer_pairs(os.path.join(DATA_DIR, TRAIN_FILE))
tqapairs = kaggle.get_question_answer_pairs(os.path.join(DATA_DIR, TEST_FILE), 
                                            is_test=True)
word2idx = kaggle.build_vocab([], qapairs, tqapairs)
vocab_size = len(word2idx) + 1 # include mask character 0

ftest = open(os.path.join(DATA_DIR, TEST_FILE), "rb")
fsub = open(os.path.join(DATA_DIR, SUBMIT_FILE), "wb")
fsub.write("id,correctAnswer\n")
line_nbr = 0
for line in ftest:
    line = line.strip().decode("utf8").encode("ascii", "ignore")
    if line.startswith("#"):
        continue
    if line_nbr % 10 == 0:
        print("Processed %d questions..." % (line_nbr))
    cols = line.split("\t")
    qid = cols[0]
    question = cols[1]
    answers = cols[2:]
    # create batch of question
    qword_ids = [word2idx[qword] for qword in nltk.word_tokenize(question)]
    Xq, Xa = [], []
    for answer in answers:
        Xq.append(qword_ids)
        Xa.append([word2idx[aword] for aword in nltk.word_tokenize(answer)])
    Xq = pad_sequences(Xq, maxlen=LSTM_SEQLEN)
    Xa = pad_sequences(Xa, maxlen=LSTM_SEQLEN)
    Y = model.predict([Xq, Xa])
    probs = np.exp(1.0 - (Y[:, 1] - Y[:, 0]))
    correct_answer = chr(ord('A') + np.argmax(probs))
    fsub.write("%s,%s\n" % (qid, correct_answer))
    line_nbr += 1
print("Processed %d questions..." % (line_nbr))
fsub.close()
ftest.close()
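The answer selection step can also be illustrated in isolation. The sketch below is a simplified variant, not the exact scoring formula used in the script: it takes the probability of the "correct" class for each of the four candidates, normalizes those scores so they add up to 1, and maps the best index to a letter. The function name `choose_answer` is hypothetical, and which output column corresponds to "correct" depends on how the labels were one-hot encoded during training, so `correct_col` is an assumption that must match that encoding.

```python
import numpy as np

def choose_answer(Y, correct_col=0):
    """Pick the best of four candidate answers.

    Y is a (4, 2) array of softmax outputs, one row per candidate.
    correct_col is the column holding the "correct" probability
    (an assumption; it depends on the label encoding used in training).
    Returns the answer letter and the scores normalized to sum to 1.
    """
    scores = Y[:, correct_col]
    dist = scores / scores.sum()
    letter = chr(ord("A") + int(np.argmax(dist)))
    return letter, dist


# Made-up predictions for one question with four choices.
Y = np.array([[0.2, 0.8],
              [0.6, 0.4],
              [0.1, 0.9],
              [0.3, 0.7]])
letter, dist = choose_answer(Y)
print(letter, dist)
```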

Here is the output for a single question which I had referenced in the presentation slides. The model shows the distribution of scores between the answers (normalized to add up to 1).


I did try to run my classifier on the entire test set and produce a submission file for Kaggle, just to see where I stand. Since the classification accuracy for the winner was approximately 59%, it is unlikely that my 70%+ accuracy numbers for my classifiers will carry over into the final task. I had signed up for the competition with the intention of participating but got sidetracked, so I had the original datasets of approximately 8000 training and 8000 test questions, but unfortunately, the final rankings were computed off another test set of approximately 200k questions that were supplied later in the competition, so I didn't have them.

That's all I have for today. As someone mentioned to me after the workshop, these sorts of things are very energizing. Certainly I learned a lot from it. The deadline also pushed me to complete my hobby project, so I got to learn quite a bit about more complex Keras models. Hopefully, this will enable me to build more complex models going forward.

8 comments (moderated to prevent spam):

Anonymous said...

Interesting post! Thank you for sharing your work.

Sujit Pal said...

Thank you for the kind words.

Elias Abou Haydar said...

I've reading your blog a couple of years now and I wanted to thank you for all the time and quality that you try to put into each post. I really enjoy reading them. Can't wait for the next post !

Sujit Pal said...

Thanks for the kind words, Elias.

Abebawu Eshetu said...

Dear Sujit Pal,

I have reading your blog and I am one your folowers. I always enjoy reading your posts. I always appreciate you. Realy I am proud of you.

Next, I want to ask is please help me on MaxEnt(Maximum Entropy) java implementation. I am working on evaluating open questions answer using ontology (domain knowledge) and MaxEnt as evaluation techniques. I am challenged to understand how extratcted concept is evaluated with MaxEnt. Please suggest me some thing to do. I now you are busy, but I appreciate any time you give for me.

Malaikannan Sankarasubbu said...

Pretty cool blog post on Q&A. Actively working on this field, i understand the complexity involved in solving this problem.

Sujit Pal said...

Thanks Malaikannan, glad you liked it. Was good meeting at the Demystifying AI and DL event.

Sujit Pal said...

Hi Abewabu, thanks for the kind words. OpenNLP has a MaxEnt classifier (http://maxent.sourceforge.net/howto.html). I am not sure about this, but I am guessing you will extract concept features and match the concepts found in the question and the answer. The best answer will be the one with the highest probability.