Recently, a colleague and a reader of this blog independently sent me a link to the Simple but Tough-to-Beat Baseline for Sentence Embeddings (PDF) paper by Sanjeev Arora, Yingyu Liang, and Tengyu Ma. My reader also mentioned that the paper was selected for a mini-review in Lecture 2 of the Natural Language Processing and Deep Learning (CS 224N) course taught at Stanford University by Prof Chris Manning and Richard Socher. For those of you who have taken Stanford's earlier Deep Learning and NLP (CS 224d) course taught by Socher, or the very first Coursera course on Natural Language Processing by Profs Dan Jurafsky and Chris Manning, you will find elements from both in here. There are also some things that I think are new, or that I might have missed earlier.
The paper introduces an unsupervised scheme for generating sentence embeddings that has been shown to consistently outperform a simple Bag of Words (BoW) approach in a number of evaluation scenarios. The evaluation scenarios considered are both intrinsic (correlating computed similarities of sentence embeddings with human estimates of similarity) as well as extrinsic (using the embeddings for a downstream classification task). I thought the idea was very exciting, since all the techniques I have used to convert word embeddings to sentence embeddings have given results roughly commensurate with the complexity of the technique. At the very low end is the BoW approach, which adds up the embedding vectors for the individual words and averages them over the sentence length. At the other end of the scale is to generate sentence vectors from a sequence of word vectors by training an LSTM and then using it, or by looking up sentence vectors using a trained skip-thoughts encoder.
The Smooth Inverse Frequency (SIF) embedding approach suggested by the paper is only slightly more complicated than the BoW approach, and promises consistently better results than BoW. So for those of us who have been using BoW as a baseline, the suggestion is to use the SIF embedding instead. Instead of just averaging the component word vectors, as in this equation for BoW:
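$$ v_s = \frac{1}{|s|} \sum_{w \in s} v_w $$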
we generate the sentence vector vs by weighting each component word vector vw by the factor α / (α + p(w)), where p(w) is the word's probability of occurrence and α is a smoothing constant whose default value, as suggested in the paper, is 0.001. We then sum these smoothed word vectors and divide by the number of words:
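$$ v_s = \frac{1}{|s|} \sum_{w \in s} \frac{\alpha}{\alpha + p(w)}\, v_w $$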
Since we do this for all the sentences in our corpus, we end up with a matrix whose rows correspond to sentences and whose columns correspond to the embedding dimensions (typically 300). The final step is to compute the first principal component of this matrix and subtract from each sentence vector its projection onto that component; the result is the SIF sentence embeddings. There is also an implementation of this embedding scheme in the YingyuLiang/SIF GitHub repository.
For my experiment, I decided to compare BoW and SIF vectors by how effective they are when used for text classification. My task is to classify images as compound (i.e., composed of multiple sub-images) versus non-compound (a single image with no sub-images) using only the captions. The data comes from the ImageCLEF 2016 (Medical) competition, where Compound Figure Detection is the very first task in the task pipeline. The provided dataset has 21,000 training captions, each about 92 words long on average, split roughly equally between the two classes. The dataset also contains 3,456 test captions (with labels provided for validation purposes).
The labels and captions are provided as two separate files, for both the training and test datasets. Here is an example of what the labels file looks like:
11373_2007_9226_Fig1_HTML,COMP
12178_2007_9002_Fig3_HTML,COMP
12178_2007_9003_Fig1_HTML,COMP
12178_2007_9003_Fig3_HTML,COMP
rr188-1,NOCOMP
rr199-1,NOCOMP
rr36-5,NOCOMP
scrt3-1,NOCOMP
and the captions files look like this:
12178_2007_9003_Fig1_HTML An 64-year-old female with symptoms of bilateral lower limb neurogenic claudication with symptomatic improvement with a caudal epidural steroid injection. An interlaminar approach could have been considered appropriate, as well. ( a ) Sagittal view of a T2-weighted MRI of the lumbar spine. Note the grade I spondylolisthesis of L4 on L5 with severe central canal stenosis. ( b ) and ( c ) Axial views of a T2-weighted MRI through L4 – 5. Note the diffuse disc bulge in ( b ) and the marked ligamentum flavum hypertophy in ( c ), both contributing to the severe central stenosis. ( d ) The L5-S1 level showing no evidence of stenosis
12178_2007_9003_Fig3_HTML Fluoroscopic images of an L3-4 interlaminar approach. ( a ) AP view, pre-contrast, ( b ) Lateral view, pre-contrast, and ( c ) Lateral view, post-contrast
12178_2007_9003_Fig5_HTML Fluoroscopic images of a right L5-S1 transforaminal approach targeting the right L5 nerve root. ( a ) AP view, pre-contrast and ( b ) AP view, post-contrast
I built BoW and SIF vectors for the entire dataset using GloVe word vectors, then used these vectors as inputs to stock Scikit-Learn Naive Bayes and Support Vector Machine (SVM) classifiers, and measured the test accuracy for various vocabulary sizes. For the word probabilities, I used both native probabilities (i.e., computed from the combined caption dataset) and external probabilities (computed from Wikipedia, and available in the YingyuLiang/SIF GitHub repository). For each vocabulary of the N most common words, I computed BoW sentence embeddings, SIF sentence embeddings with native word frequencies, and SIF sentence embeddings with external probabilities (SIF+EP), and recorded the accuracy reported on the two-class classification task by the Naive Bayes and SVM classifiers. Below I provide a breakdown of the steps with code.
The first step is to parse the files and generate a list of training and test captions with their labels.
def parse_caption_and_label(caption_file, label_file, sep=" "):
    # build a lookup from filename to numeric label
    filename2label = {}
    flabel = open(label_file, "r")
    for line in flabel:
        filename, label = line.strip().split(sep)
        filename2label[filename] = LABEL2ID[label]
    flabel.close()
    # read each caption and attach the label for its filename
    fcaption = open(caption_file, "r")
    captions, labels = [], []
    for line in fcaption:
        filename, caption = line.strip().split("\t")
        captions.append(caption)
        labels.append(filename2label[filename])
    fcaption.close()
    return captions, labels

TRAIN_CAPTIONS = "/path/to/training-captions.tsv"
TRAIN_LABELS = "/path/to/training-labels.csv"
TEST_CAPTIONS = "/path/to/test-captions.tsv"
TEST_LABELS = "/path/to/test-labels.csv"

LABEL2ID = {"COMP": 0, "NOCOMP": 1}

captions_train, labels_train = parse_caption_and_label(
    TRAIN_CAPTIONS, TRAIN_LABELS, ",")
captions_test, labels_test = parse_caption_and_label(
    TEST_CAPTIONS, TEST_LABELS, " ")
Next I build the word count matrix from the captions. For this we use the Scikit-Learn CountVectorizer to do the heavy lifting, removing stopwords from the counting via the stop_words parameter. At this point Xc is a matrix of word counts of shape (number of training records + number of test records, VOCAB_SIZE). The VOCAB_SIZE is a hyperparameter which we will vary during our experiments.
from sklearn.feature_extraction.text import CountVectorizer

VOCAB_SIZE = 10000
counter = CountVectorizer(strip_accents="unicode",
                          stop_words="english",
                          max_features=VOCAB_SIZE)
caption_texts = captions_train + captions_test
Xc = counter.fit_transform(caption_texts).todense().astype("float")
At this point, we can also capture the sentence lengths (the denominator in the formula for vs above) as the row sums of this matrix, i.e., by summing across the columns.
import numpy as np

sent_lens = np.sum(Xc, axis=1).astype("float")
sent_lens[sent_lens == 0] = 1e-14  # prevent divide by zero
Next we read the pretrained word vectors from the provided GloVe embedding file. We use the version built with Wikipedia 2014 + Gigaword 5 (6B tokens, 400K words, dimensionality 300). The following snippet extracts the vectors for the words in our vocabulary and collects them into the embedding matrix E, with one row per vocabulary word.
GLOVE_EMBEDDINGS = "/path/to/glove.6B.300d.txt"

E = np.zeros((VOCAB_SIZE, 300))
fglove = open(GLOVE_EMBEDDINGS, "r")
for line in fglove:
    cols = line.strip().split(" ")
    word = cols[0]
    try:
        # copy the GloVe vector into the row for this vocabulary word
        i = counter.vocabulary_[word]
        E[i] = np.array([float(x) for x in cols[1:]])
    except KeyError:
        pass
fglove.close()
We are now ready to build our BoW vectors. Replacing the term counts with the appropriate word vectors is just a matrix multiplication, and averaging over the sentence length is an element-wise division by the sentence length vector computed earlier. Finally we split our BoW sentence embeddings into training and test splits.
Xb = np.divide(np.dot(Xc, E), sent_lens)
Xtrain, Xtest = Xb[0:len(captions_train)], Xb[-len(captions_test):]
ytrain, ytest = np.array(labels_train), np.array(labels_test)
The regularity of the Scikit-Learn API means that we can build some functions that can be used to cross-validate our classifier during training and evaluate it with the test data.
from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

def cross_val(Xtrain, ytrain, clf):
    best_clf = None
    best_score = 0.0
    num_folds = 0
    cv_scores = []
    kfold = KFold(n_splits=10)
    for train, val in kfold.split(Xtrain):
        Xctrain, Xctest = Xtrain[train], Xtrain[val]
        yctrain, yctest = ytrain[train], ytrain[val]
        # clone so each fold trains a fresh classifier, and best_clf
        # keeps the model from the best-scoring fold
        fold_clf = clone(clf)
        fold_clf.fit(Xctrain, yctrain)
        score = fold_clf.score(Xctest, yctest)
        if score > best_score:
            best_score = score
            best_clf = fold_clf
        print("fold {:d}, score: {:.3f}".format(num_folds, score))
        cv_scores.append(score)
        num_folds += 1
    return best_clf, cv_scores

def test_eval(Xtest, ytest, clf):
    print("===")
    print("Test set results")
    ytest_ = clf.predict(Xtest)
    accuracy = accuracy_score(ytest, ytest_)
    print("Accuracy: {:.3f}".format(accuracy))
We now invoke these functions to instantiate Naive Bayes and SVM classifiers, train them with 10-fold cross validation on the training split, and evaluate them on the test data to produce, among other things, a test accuracy. The following code shows the call for the Naive Bayes classifier; the code for the SVM classifier is similar.
from sklearn.naive_bayes import GaussianNB

clf = GaussianNB()
best_clf, scores_nb = cross_val(Xtrain, ytrain, clf)
test_eval(Xtest, ytest, best_clf)
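For completeness, here is a minimal sketch of the corresponding SVM call. It assumes Scikit-Learn's LinearSVC; my original run may have used a different SVM variant, but any classifier exposing the fit/score/predict API plugs into the two helper functions above in the same way.

from sklearn.svm import LinearSVC

clf = LinearSVC()
best_clf, scores_svm = cross_val(Xtrain, ytrain, clf)
test_eval(Xtest, ytest, best_clf)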
The SIF sentence embeddings also start with the count matrix generated by the CountVectorizer. In addition, we need the word probabilities. If we want to use word probabilities from the dataset itself, we can compute them by summing the count matrix over its rows (giving a total count per word) and normalizing, as follows:
# compute word probabilities from the caption corpus
freqs = np.sum(Xc, axis=0).astype("float")
probs = freqs / np.sum(freqs)
We could also get these word probabilities from some external source such as a file. So given the probs vector, we can create a vector representing the coefficient for each word. Something like this:
ALPHA = 1e-3
coeff = ALPHA / (ALPHA + probs)
We can then compute the raw sentence embedding matrix in a manner similar to the BoW matrix.
Xw = np.multiply(Xc, coeff)
Xs = np.divide(np.dot(Xw, E), sent_lens)
In order to remove the first principal component, we first compute it using the TruncatedSVD class from Scikit-Learn, and then subtract from the raw SIF embedding Xs its projection onto that component.
from sklearn.decomposition import TruncatedSVD

svd = TruncatedSVD(n_components=1, n_iter=20, random_state=0)
svd.fit(Xs)
pc = svd.components_
Xr = Xs - Xs.dot(pc.T).dot(pc)
As with the BoW sentence embeddings, we split the result back into training and test sets, train the two classifiers, and evaluate them. A minimal sketch of this step using the Naive Bayes classifier is shown below (the variable names Xtrain_sif, Xtest_sif and scores_sif are just for illustration); the SVM run is analogous.
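# split the PC-removed SIF embeddings back into train and test sets
Xtrain_sif = Xr[0:len(captions_train)]
Xtest_sif = Xr[-len(captions_test):]

clf = GaussianNB()
best_clf, scores_sif = cross_val(Xtrain_sif, ytrain, clf)
test_eval(Xtest_sif, ytest, best_clf)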
The full code used for this post is available in this GitHub gist as a Jupyter notebook. The results of running the three sentence embeddings - BoW, SIF and SIF with External Word Probabilities (SIF+EP) - through the two stock Scikit-Learn classifiers for different vocabulary sizes are shown below.
As you can see, I get conflicting results for the two classifiers. For the Naive Bayes classifier, SIF sentence embeddings with native word probabilities narrowly beat out the BoW embeddings, whereas in the case of the SVM, the SIF embeddings with external word probabilities are slightly better than the BoW results for some vocabulary sizes. In both cases, the accuracies from the other SIF variant trail those from BoW. Finally, the differences are really minor - if you look at the y-axis on the charts, you will see that they show up only in the third decimal place. So at least based on my experiment, there does not seem to be a significant benefit to using the SIF embeddings over BoW.
My use case does differ from the ideal case in that my captions can be long (longer than a typical sentence) and/or multi-sentence. Also, for the embeddings I used GloVe vectors computed against the 6B corpus, while the YingyuLiang/SIF implementation uses vectors generated from the 840B corpus. I don't believe these should make too much difference, but I may be wrong. I have tried to follow the paper's recommendations as closely as possible when replicating this experiment, but it is possible I made a mistake somewhere - if you spot it, please let me know. The code is included, both in this post and in the GitHub gist, if you want to verify that it works as I described. As a user of word and sentence embeddings, my primary use case is to use them to encode text input to classifiers. If you have gotten results that indicate SIF sentence embeddings are significantly better than BoW sentence embeddings for this or a similar use case, please let me know.