I was at NAACL 2015 during the first week of this month. This was my very first time attending an academic conference, and I found it incredibly interesting. New ideas were unveiled and old benchmarks crushed on a regular basis over the course of the week. Some of the ideas were familiar, others completely new. It was a chance to listen in person to people whose blog posts and textbooks I have read in the past, and to talk to incredibly smart people who are doing extremely cool things with text (as well as speech and video). I came away with several new ideas, techniques and technologies that I plan to explore in the coming months. As an ex-Denverite, it was also a chance to meet and catch up with old friends.
As part of the conference, I attended a tutorial on Learning to Search (L2S) for Structured Prediction, which included an introduction to Vowpal Wabbit (VW) basics, since VW is used in the implementation. I had read many good things about VW on the Kaggle message boards but hadn't used it so far, so for me this was also a chance to get familiar with VW.
The keynote for the conference was delivered by Lillian Lee, who talked about investigating the effect of language on people. Among the examples she provided, she mentioned her work on memorable movie quotes (in itself an excellent primer on experiment design, by the way). The gist of this work is to take memorable movie quotes, pair each with a similar-length, non-memorable utterance spoken by the same character at around the same point in the movie, and use the pairs as positive and negative examples to train a classifier that predicts whether an utterance is memorable. More details can be found in the paper You had me at hello: How phrasing affects memorability.
The dataset seemed fairly well-defined and simple (deceptively simple, as I would find out later), so I decided to try to build a classifier with VW (and get familiar with VW in the process). This post describes that effort.
VW provides a rich command-line interface (CLI). It also recently gained a Python-based programmatic API, but there were some problems getting it to work on Mac OS X (my current platform) during the tutorial. The Python API is useful when you are using VW as a platform rather than a tool (for example, when doing L2S), but since I only wanted to do simple classification, I focused on the CLI.
The dataset of memorable movie quotes can be found here. It consists of 2,197 memorable movie quotes paired with 2,197 non-memorable ones. Each pair of quotes is organized into 4 lines as shown below (in the file moviequotes.memorable_nonmemorable_pairs.txt):
wild things
My mother would kill me if she knew I took the Rover!
875814 My mom would kill me if she knew I took the Rover.
875877 I think it was something else... Suzie says the bust was bullshit...
...
The first line is the name of the movie, the second line is the memorable quote, the third line is the memorable quote as it is in the script (not necessarily exactly the same as the quote that became memorable), and the fourth line is a non-memorable utterance by the same character at around the same time in the movie. We first transform this to a format that is easier to work with downstream.
# Source: src/moviequotes/preprocess_data.py
# -*- coding: utf-8 -*-

def main():
    fin = open("../../data/moviequotes/moviequotes.memorable_nonmemorable_pairs.txt", 'rb')
    fout = open("../../data/moviequotes/moviequotes.txt", 'wb')
    line_num = 0
    mem_line = None
    nonmem_line = None
    for line in fin:
        line = line.strip()
        if line_num == 1:
            # second line of the block: the memorable quote
            mem_line = line
        if line_num == 3:
            # fourth line of the block: the non-memorable quote, minus the leading number
            nonmem_line = line[line.index(" ") + 1:]
        if len(line) == 0:
            # blank line ends the block; write one positive and one negative example
            fout.write("%d\t%s\n" % (1, mem_line))
            fout.write("%d\t%s\n" % (0, nonmem_line))
            line_num = 0
            mem_line = None
            nonmem_line = None
        else:
            line_num += 1
    fin.close()
    fout.flush()
    fout.close()

if __name__ == "__main__":
    """ Convert Cornell data into easy to read tabular format """
    main()
This transforms the data into the following flat format.
1   My mother would kill me if she knew I took the Rover!
0 I think it was something else... Suzie says the bust was bullshit...
1 You've been a bad girl, Grace!
0 What the hell you doin', Grace?
1 And a waitress named Flo.
0 You son of a bitch!
1 I bet right now, you don't know if you want to kill me... or fuck me.
0 Women say they don't want to be taken like, really taken -- that's bullshit -- they do.
1 Is everyone in this town on drugs?
0 Well, he offered, and I just couldn't refuse.
...
The next step is to split this up into a training set and a test set in the format VW expects. We use 70% of the data as our training set and the remaining 30% as our test set. Our first attempt is to just treat all the text as features (a minimal sketch of the conversion follows the sample), so our data looks like this:
1 | My mother would kill me if she knew I took the Rover!
-1 | I think it was something else... Suzie says the bust was bullshit...
1 | You've been a bad girl, Grace!
-1 | What the hell you doin', Grace?
1 | And a waitress named Flo.
-1 | Well, he offered, and I just couldn't refuse.
1 | And here I made you all hot and sweaty...
-1 | And I don't know if you're really giving.
1 | Look, defenseless babies!
-1 | I hate this.
...
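Producing this plain version from the flat file is straightforward. Below is a minimal sketch; the file names and the 70/30 random split mirror the fuller vwformat_split.py listing that appears later in this post, which also adds extra features.

# Minimal sketch (not the full script): convert the flat moviequotes.txt file
# into plain VW format and split it 70/30 into train and test files.
# File names are assumptions mirroring the fuller script later in the post.
import random

INPUT_FILE = "../../data/moviequotes/moviequotes.txt"
VW_TRAIN_FILE = "../../data/moviequotes/train_moviequotes.txt"
VW_TEST_FILE = "../../data/moviequotes/test_moviequotes.txt"
TRAIN_TEST_SPLIT = 0.7

def main():
    fin = open(INPUT_FILE, 'rb')
    ftrain = open(VW_TRAIN_FILE, 'wb')
    ftest = open(VW_TEST_FILE, 'wb')
    for line in fin:
        label, text = line.strip().split("\t")
        # VW can't handle ":" since it's a special character
        text = text.replace(":", " ")
        # map the 0/1 labels to VW's -1/+1 convention
        vw_label = "1" if int(label) == 1 else "-1"
        fout = ftrain if random.random() < TRAIN_TEST_SPLIT else ftest
        fout.write("%s | %s\n" % (vw_label, text))
    ftrain.close()
    ftest.close()
    fin.close()

if __name__ == "__main__":
    main()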
We train a VW model (model.vw) using logistic regression. VW offers other loss functions (such as hinge loss, which corresponds to a linear SVM), but logistic loss gave me the best results.
$ vw -d train_moviequotes.txt -f model.vw \
--loss_function logistic --passes 10000 --cache_file cache
final_regressor = model.vw
Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
decay_learning_rate = 1
using cache_file = cache
ignoring text input in favor of cache input
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.693147 0.693147 1 1.0 1.0000 0.0000 13
0.768432 0.843716 2 2.0 -1.0000 0.2814 13
0.707602 0.646772 4 4.0 1.0000 0.0773 7
0.709794 0.711985 8 8.0 1.0000 0.1979 6
0.727054 0.744315 16 16.0 -1.0000 -0.0520 5
0.711858 0.696662 32 32.0 -1.0000 -0.0961 16
0.705705 0.699551 64 64.0 -1.0000 -0.2107 6
0.727612 0.749519 128 128.0 -1.0000 0.3081 17
0.705219 0.682826 256 256.0 1.0000 -0.2075 9
0.689915 0.674610 512 512.0 1.0000 0.3932 7
0.682021 0.674128 1024 1024.0 -1.0000 -0.4391 14
0.675095 0.668168 2048 2048.0 -1.0000 -0.4202 7
0.668805 0.668805 4096 4096.0 -1.0000 0.0315 4 h
0.664234 0.659672 8192 8192.0 1.0000 -0.1119 7 h
0.657264 0.650294 16384 16384.0 -1.0000 -2.8443 12 h
finished run
number of examples per pass = 2784
passes used = 7
weighted example sum = 19488.000000
weighted label sum = -126.000000
average loss = 0.651617 h
best constant = -0.012931
best constant's loss = 0.693126
total feature number = 207620
$ vw test_moviequotes.txt -t -i model.vw -p test_pred.txt
only testing
predictions = test_pred.txt
Num weight bits = 18
learning rate = 10
initial_t = 1
power_t = 0.5
using no cache
Reading datafile = test_moviequotes.txt
num sources = 1
average since example example current current current
loss last counter weight label predict features
0.175001 0.175001 1 1.0 -1.0000 -0.5817 9
11.97682 23.77864 2 2.0 -1.0000 -5.8763 18
6.771072 1.565320 4 4.0 1.0000 -0.4749 13
5.300591 3.830111 8 8.0 1.0000 -1.7914 28
3.082561 0.864532 16 16.0 1.0000 0.8826 12
2.398779 1.714996 32 32.0 -1.0000 -1.4483 18
1.842740 1.286701 64 64.0 -1.0000 0.6553 7
1.602232 1.361725 128 128.0 -1.0000 -0.0691 28
1.632950 1.663667 256 256.0 1.0000 0.6150 5
1.548076 1.463202 512 512.0 1.0000 -0.3059 9
1.552964 1.557851 1024 1024.0 -1.0000 1.4476 11
finished run
number of examples per pass = 1301
passes used = 1
weighted example sum = 1301.000000
weighted label sum = 11.000000
average loss = 1.522137
best constant = 0.008455
best constant's loss = 0.999928
total feature number = 14037
The second call creates a test_pred.txt file containing raw predictions that look something like this:
1.773244
-2.195071
1.417488
1.006819
-0.361311
0.838351
-1.418754
0.207618
0.272800
1.483417
...
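The accuracy script below computes accuracy for both the training and the test sets, so it also expects a train_pred.txt file; a similar call against the training file produces it:

$ vw train_moviequotes.txt -t -i model.vw -p train_pred.txt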
To calculate the accuracy, we compare the predicted values (passed through a step function, so any positive prediction becomes 1 and any negative prediction becomes -1) with the actual labels in our train_moviequotes.txt and test_moviequotes.txt files.
# Source: src/moviequotes/accuracy_calc.py
# -*- coding: utf-8 -*-
from __future__ import division

DATA_FILE = "../../data/moviequotes/%s_moviequotes.txt"
PRED_FILE = "../../data/moviequotes/%s_pred.txt"

def step(x):
    return -1 if x < 0 else 1

def main():
    for m in ["train", "test"]:
        fdata = open(DATA_FILE % m, 'rb')
        label_acts = []
        for line in fdata:
            label_acts.append(int(line.split("|")[0].strip()))
        fdata.close()
        label_preds = []
        fpred = open(PRED_FILE % m, 'rb')
        for line in fpred:
            label_preds.append(step(float(line.strip())))
        fpred.close()
        num_recs = 0
        num_correct = 0
        for act, pred in zip(label_acts, label_preds):
            if act == pred:
                num_correct += 1
            num_recs += 1
        print("Accuracy for %s: %.3f %%" % (m, 100.0 * num_correct / num_recs))

if __name__ == "__main__":
    main()
This gives us a test set accuracy of 58.647%, slightly lower than the result the authors of the paper got with their bag-of-words model (59.67%). The paper goes on to describe some additional features that bring their accuracy up to 64.46%, so I tried to replicate similar features for my model to see if I could also get these higher numbers.
The three things the authors noticed about memorable quotes were that (a) the use of language in memorable quotes was to some extent unusual (Distinctiveness), (b) they were easier to use outside the specific context in which they were originally uttered (Generality), and (c) they used specific language that is in itself "memorable". The first is modeled as the likelihood of a quote against unigram, bigram and trigram language models built from the news sentences of the Brown Corpus. The second is modeled by the use of fewer personal pronouns and more indefinite articles and past tense verbs in memorable than in non-memorable quotes. The third is modeled by constructing separate language models from the memorable and non-memorable quotes and comparing each quote's likelihood against them.
I implemented the first two extensions in the distinctiveness() and generality() functions in the code below. The code splits the data 70/30 into training and test sets, computes the extra features, and writes them out to the VW data files along with the text.
# Source: src/moviequotes/vwformat_split.py
# -*- coding: utf-8 -*-
from __future__ import division
from nltk.corpus import brown
import math
import nltk
import random

INPUT_FILE = "../../data/moviequotes/moviequotes.txt"
VW_TRAIN_FILE = "../../data/moviequotes/train_moviequotes.txt"
VW_TEST_FILE = "../../data/moviequotes/test_moviequotes.txt"
TRAIN_TEST_SPLIT = 0.7

def generality(s):
    """
    Counts the number of personal pronouns (PRP), indefinite articles (a, an)
    and past tense verbs (VBD) per word in the sentence.
    """
    words = nltk.word_tokenize(s)
    num_words = len(words)
    postags = [x[1] for x in nltk.pos_tag(words)]
    num_prps = len([x for x in postags if x == "PRP"]) / num_words
    num_indef_articles = len([x for x in words
                              if x.lower() == "a" or x.lower() == "an"]) / num_words
    num_vbds = len([x for x in postags if x == "VBD"]) / num_words
    return num_prps, num_indef_articles, num_vbds

def pad_words(words):
    words.insert(0, "_START_")
    words.append("_END_")
    return words

def browncorpus_gram_freqs(gram_size):
    """
    Construct an n-gram frequency distribution out of the news sentences
    in the Brown Corpus available in NLTK.
    """
    gram_freqs = nltk.FreqDist()
    num_elems = 0
    for sent in brown.sents(categories=["news"]):
        sent = [x.lower() for x in sent]
        sent = pad_words(sent)
        # construct n-grams of the requested size
        for ngram in nltk.ngrams(sent, gram_size):
            num_elems += 1
            gram_freqs[ngram] += 1
    return gram_freqs, num_elems

def distinctiveness(s, gram_size, gram_freqs, num_grams):
    """
    Returns the (add-one smoothed) log probability of the sentence under
    the n-gram language model built from the Brown Corpus (news) sentences.
    """
    words = nltk.word_tokenize(s)
    words = pad_words(words)
    log_prob = 0.0
    for wgram in nltk.ngrams(words, gram_size):
        p_wgram = (gram_freqs[wgram] + 1) / (num_grams + len(gram_freqs))
        log_prob += math.log(p_wgram)
    return log_prob

def vw_format(label, text, gram1_freqs, num_grams1, gram2_freqs,
              num_grams2, gram3_freqs, num_grams3):
    vw_label = "1" if int(label) == 1 else "-1"
    prp, ia, pt = generality(text)
    d1 = distinctiveness(text, 1, gram1_freqs, num_grams1)
    d2 = distinctiveness(text, 2, gram2_freqs, num_grams2)
    d3 = distinctiveness(text, 3, gram3_freqs, num_grams3)
    return "%s |s %s|g prp:%.3f ia:%.3f pt:%.3f|d d1:%.3f d2:%.3f d3:%.3f\n" % \
        (vw_label, text, prp, ia, pt, d1, d2, d3)

def main():
    # load up frequency distributions for calculating distinctiveness
    gram1_freqs, num_grams1 = browncorpus_gram_freqs(1)
    gram2_freqs, num_grams2 = browncorpus_gram_freqs(2)
    gram3_freqs, num_grams3 = browncorpus_gram_freqs(3)
    fin = open(INPUT_FILE, 'rb')
    ftrain = open(VW_TRAIN_FILE, 'wb')
    ftest = open(VW_TEST_FILE, 'wb')
    for line in fin:
        label, text = line.strip().split("\t")
        # VW can't handle ":" since it's a special character
        text = text.replace(":", " ")
        dice = random.random()
        if dice < TRAIN_TEST_SPLIT:
            ftrain.write(vw_format(label, text, gram1_freqs, num_grams1,
                                   gram2_freqs, num_grams2, gram3_freqs,
                                   num_grams3))
        else:
            ftest.write(vw_format(label, text, gram1_freqs, num_grams1,
                                  gram2_freqs, num_grams2, gram3_freqs,
                                  num_grams3))
    ftest.flush()
    ftrain.flush()
    ftest.close()
    ftrain.close()
    fin.close()

if __name__ == "__main__":
    main()
This produces a VW datafile that looks like this:
1 |s My mother would kill me if she knew I took the Rover!|g prp:0.231 ia:0.000 pt:0.154|d d1:-162.664 d2:-162.664 d3:-162.664
-1 |s I think it was something else... Suzie says the bust was bullshit...|g prp:0.143 ia:0.000 pt:0.143|d d1:-172.825 d2:-172.825 d3:-172.825
1 |s You've been a bad girl, Grace!|g prp:0.111 ia:0.111 pt:0.000|d d1:-114.040 d2:-114.040 d3:-114.040
1 |s Your lies are old but you tell them pretty good.|g prp:0.182 ia:0.000 pt:0.000|d d1:-132.371 d2:-132.371 d3:-132.371
1 |s And a waitress named Flo.|g prp:0.000 ia:0.167 pt:0.000|d d1:-75.848 d2:-75.848 d3:-75.848
1 |s Is everyone in this town on drugs?|g prp:0.000 ia:0.000 pt:0.000|d d1:-100.962 d2:-100.962 d3:-100.962
-1 |s Well, he offered, and I just couldn't refuse.|g prp:0.167 ia:0.000 pt:0.083|d d1:-137.264 d2:-137.264 d3:-137.264
1 |s Don't admire people too much, they might disappoint you.|g prp:0.167 ia:0.000 pt:0.000|d d1:-140.809 d2:-140.809 d3:-140.809
-1 |s I mean there's someone besides your mother you've got to forgive.|g prp:0.143 ia:0.000 pt:0.000|d d1:-169.716 d2:-169.716 d3:-169.716
...
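One nice side effect of grouping features into namespaces is that feature groups can be switched on and off without regenerating the data files: VW's --ignore flag drops all features in the namespace that starts with the given character. For example, a model without the distinctiveness features could be trained with something like this (the model and cache file names below are just placeholders):

$ vw -d train_moviequotes.txt -f model_no_d.vw \
  --loss_function logistic --passes 10000 --cache_file cache_no_d --ignore d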
With the addition of the generality features (namespace g above), the accuracy on the test set jumped to 61.864%, and with the further addition of the distinctiveness features (namespace d above) it rose to 62.296%. In comparison, the result reported in the paper after the addition of all three feature sets was 64.46%. I decided to stop at this point, since my objective was to learn how to do classification with VW, and I had already explored that.
VW offers some interesting new features and algorithms, some of which I don't fully understand yet, but it looks like it may be a useful tool to have in one's Machine Learning toolbox. All the code in this post can be found on GitHub here.