The work in this blog post is prompted by a problem I am facing at work, so this is my attempt to figure out if Doc2Vec might be a feasible solution. For background, Doc2Vec represents a block of text of any size as a fixed-length vector, that is, as a point in a latent topic space, as described in the paper "Distributed Representations of Sentences and Documents" by Quoc Le and Tomas Mikolov. As Radim Rehurek (creator of Gensim) explains on his blog, Doc2Vec (also known as paragraph2vec or sentence embeddings) extends the word2vec algorithm to unsupervised learning of continuous representations for larger blocks of text, such as sentences, paragraphs or entire documents.
The task is to predict new tags for movies, given a synopsis of each movie's plotline and its human-assigned tags (short phrases encapsulating the viewer's impression of the movie, such as "dark humor" or "great performance"). My data consists of 6,044 human-assigned tags for 1,085 movies from the ml-latest-small dataset from the GroupLens Repository. The associated movie plotlines come from the Open Movie Database (OMDB) API.
The idea is to train a Doc2Vec model using the text from the plotlines and the human-assigned tags, then infer new tags for existing plotlines as well as for unseen plotlines. Gensim provides functionality to build Doc2Vec models, so I used that here.
The first step is to set up the data so it can be consumed by Doc2Vec. Doc2Vec expects its input as an iterable of LabeledSentence objects, each of which is basically a list of words from the text and a list of labels. The code below downloads the movie plotlines from the OMDB API, ties them together with the assigned tags, and writes the result out to a file. This file will be used to train the Doc2Vec model later.
# Source: src/build_dataset.py
# -*- coding: utf-8 -*-
import json
import requests

OMDB_URL = "http://www.omdbapi.com/?i=tt%s&plot=full&r=json"

# collect the set of human-assigned tags for each movie ID
movie_tags = {}
ftag = open("../data/tags.csv", 'rb')
for line in ftag:
    if line.startswith("userId"):
        continue
    _, mid, tag, _ = line.strip().split(",")
    if mid in movie_tags:
        movie_tags[mid].add(tag)
    else:
        movie_tags[mid] = set([tag])
ftag.close()

# look up the plot for each tagged movie and write one
# movie_id <TAB> plot <TAB> tag1::tag2::... line per movie
fdata = open("../data/tagged_plots.csv", 'wb')
flink = open("../data/links.csv", 'rb')
for line in flink:
    if line.startswith("movieId"):
        continue
    mid, imdb_id, _ = line.strip().split(",")
    if mid not in movie_tags:
        continue
    resp = requests.get(OMDB_URL % (imdb_id))
    resp_json = json.loads(resp.text)
    plot = resp_json["Plot"].encode("ascii", "ignore")
    fdata.write("%s\t%s\t%s\n" % (mid, plot, "::".join(list(movie_tags[mid]))))
flink.close()
fdata.close()
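To make the file format concrete, here is a minimal sketch (Python 3, standard library only) of how one line of tagged_plots.csv maps to the words-and-tags pair that Doc2Vec consumes. The `TaggedPlot` namedtuple and the `parse_record` helper are hypothetical names I made up for illustration, standing in for Gensim's LabeledSentence, and the whitespace split is a simplification of the NLTK tokenization used in the training script.

```python
from collections import namedtuple

# Hypothetical stand-in for Gensim's LabeledSentence (a words/tags pair).
TaggedPlot = namedtuple("TaggedPlot", ["words", "tags"])


def parse_record(line):
    """Parse one 'movie_id<TAB>plot<TAB>tag1::tag2' line."""
    mid, plot, label = line.rstrip("\n").split("\t")
    words = plot.lower().split()   # simplification: whitespace tokens only
    tags = label.split("::")       # tags are joined with "::" in the file
    return mid, TaggedPlot(words=words, tags=tags)


# Made-up record for illustration, not taken from the dataset.
sample = "42\tA detective hunts a killer\tdark humor::plot twist\n"
mid, doc = parse_record(sample)
print(mid, doc.words, doc.tags)
```

Keeping the tags in a single "::"-separated column means a multi-word tag such as "dark humor" survives intact, which is why the training script later replaces spaces inside tags with underscores.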
Using this combined dataset, we can now train a Doc2Vec model. Similar to word2vec, Doc2Vec comes in different flavors: PV-DM learns to predict a word given the paragraph vector of its containing paragraph and its context words, while PV-DBOW learns to predict the words in the paragraph given only the paragraph vector. PV-DM has two sub-flavors depending on how the vectors from its components are combined: averaging (PV-DM/M) or concatenation (PV-DM/C). In my experiment, I built all 3 flavors (it's a one-line call with Gensim). You can also build stacked Doc2Vec models as described in this notebook, but you can't infer vectors from them, so I haven't used them here.
The code below constructs a list of LabeledSentence objects from the generated data, constructs a 90/10 training/test split, trains each of the 3 Doc2Vec models described above for 20 epochs, and evaluates them on the test split using the Jaccard similarity between the actual and predicted tags. We also write out the plotline, the already assigned tags, and the predicted tags along with their similarity scores for 5 random movies for each model.
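The difference between the two PV-DM sub-flavors is easiest to see on the input side. The sketch below is a toy illustration in plain Python 3, not Gensim's implementation: the helper names `combine_mean` and `combine_concat` and the 2-dimensional vectors are made up, and only the step that merges the paragraph vector with the context word vectors is shown.

```python
def combine_mean(doc_vec, context_vecs):
    """PV-DM/M style input: element-wise average of all input vectors."""
    vectors = [doc_vec] + context_vecs
    dim = len(doc_vec)
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def combine_concat(doc_vec, context_vecs):
    """PV-DM/C style input: input vectors laid end to end."""
    combined = list(doc_vec)
    for v in context_vecs:
        combined.extend(v)
    return combined


# Toy 2-dimensional vectors with made-up values (nothing here is learned).
doc_vec = [1.0, 3.0]
context_vecs = [[3.0, 1.0], [2.0, 2.0]]

mean_input = combine_mean(doc_vec, context_vecs)
concat_input = combine_concat(doc_vec, context_vecs)
print(mean_input)    # same length as a single vector
print(concat_input)  # length grows with the number of context words
```

Averaging keeps the input the same size regardless of the window, while concatenation preserves word order at the cost of a larger input layer. PV-DBOW has no such step, since it uses the paragraph vector alone.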
# Source: src/doc2vec.py
# -*- coding: utf-8 -*-
from __future__ import division
from gensim.models.doc2vec import LabeledSentence
from gensim.models import Doc2Vec
from random import shuffle
from sklearn.cross_validation import train_test_split
import nltk
import numpy as np

def tokenize_text(text):
    # lowercased word tokens, skipping single-character tokens
    tokens = []
    for sent in nltk.sent_tokenize(text):
        for word in nltk.word_tokenize(sent):
            if len(word) < 2:
                continue
            tokens.append(word.lower())
    return tokens

def tokenize_tags(label):
    tags = label.split("::")
    tags = map(lambda tok: mark_tag(tok), tags)
    return tags

def jaccard_similarity(labels, preds):
    lset = set(labels)
    pset = set(preds)
    return len(lset.intersection(pset)) / len(lset.union(pset))

def mark_tag(s):
    # prefix with "_" and join multi-word tags with "_"
    return "_" + s.replace(" ", "_")

def unmark_tag(s):
    return s[1:].replace("_", " ")

# read input data
orig_sents = []
sentences = []
fdata = open("../data/tagged_plots.csv", 'rb')
for line in fdata:
    mid, text, label = line.strip().split("\t")
    orig_sents.append(text)
    tokens = tokenize_text(text)
    tags = tokenize_tags(label)
    sentences.append(LabeledSentence(words=tokens, tags=tags))
fdata.close()

# Split data into 90/10 training and test
train_sents, test_sents = train_test_split(sentences, test_size=0.1,
                                           random_state=42)

## Build and train model
## PV-DM w/concatenation
#model = Doc2Vec(dm=1, dm_concat=1, size=100, window=5, negative=5,
#                hs=0, min_count=2)
## PV-DM w/averaging
#model = Doc2Vec(dm=1, dm_mean=1, size=100, window=5, negative=5,
#                hs=0, min_count=2)
# PV-DBOW
model = Doc2Vec(dm=0, size=100, negative=5, hs=0, min_count=2)
# note: the vocabulary and the training loop below use all sentences,
# not only train_sents
model.build_vocab(sentences)

# decay the learning rate linearly, holding it fixed within each epoch
alpha = 0.025
min_alpha = 0.001
num_epochs = 20
alpha_delta = (alpha - min_alpha) / num_epochs

for epoch in range(num_epochs):
    shuffle(sentences)
    model.alpha = alpha
    model.min_alpha = alpha
    model.train(sentences)
    alpha -= alpha_delta

# evaluate the model
tot_sim = 0.0
for test_sent in test_sents:
    pred_vec = model.infer_vector(test_sent.words)
    actual_tags = map(lambda x: unmark_tag(x), test_sent.tags)
    pred_tags = model.docvecs.most_similar([pred_vec], topn=5)
    pred_tags = filter(lambda x: x[0].find("_") > -1, pred_tags)
    pred_tags = map(lambda x: (unmark_tag(x[0]), x[1]), pred_tags)
    sim = jaccard_similarity(actual_tags, [x[0] for x in pred_tags])
    tot_sim += sim
print "Average Similarity on Test Set: %.3f" % (tot_sim / len(test_sents))

# print out 5 random results (drawn from all sentences)
for i in range(5):
    docid = np.random.randint(len(sentences))
    pred_vec = model.infer_vector(sentences[docid].words)
    actual_tags = map(lambda x: unmark_tag(x), sentences[docid].tags)
    pred_tags = model.docvecs.most_similar([pred_vec], topn=5)
    print "Text: %s" % (orig_sents[docid])
    print "... Actual tags: %s" % (", ".join(actual_tags))
    print "... Predicted tags:", map(lambda x: (unmark_tag(
        x[0]), x[1]), pred_tags)
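The learning-rate handling in the training loop above can be looked at in isolation. The following is a small sketch (Python 3, no Gensim needed) that reproduces just the arithmetic of the loop, using the same starting values; `linear_alpha_schedule` is a hypothetical helper name, not something from the script.

```python
def linear_alpha_schedule(alpha=0.025, min_alpha=0.001, num_epochs=20):
    """Return the learning rate used in each epoch of the training loop."""
    alpha_delta = (alpha - min_alpha) / num_epochs
    schedule = []
    for _ in range(num_epochs):
        schedule.append(alpha)   # value assigned to model.alpha this epoch
        alpha -= alpha_delta     # decrement is applied after training
    return schedule


schedule = linear_alpha_schedule()
print(["%.4f" % a for a in schedule[:3]], "...", "%.4f" % schedule[-1])
```

Because the decrement happens after each epoch, the first epoch runs at 0.025 and the last at roughly 0.0022. The value min_alpha = 0.001 only determines the step size and is never itself used for training.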
The average Jaccard similarities between the actual and predicted tags for the 3 models are shown below. PV-DBOW seems to work best for this task.
- PV-DM/C: 0.033
- PV-DM/M: 0.038
- PV-DBOW: 0.465
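To give a feel for what these numbers mean, here is a small worked example of the Jaccard similarity (Python 3). The tags are taken from one of the PV-DBOW sample rows in this post; the function mirrors the one in the script but is rewritten here so the snippet is self-contained.

```python
def jaccard_similarity(labels, preds):
    """Size of the intersection divided by size of the union."""
    lset, pset = set(labels), set(preds)
    return len(lset & pset) / len(lset | pset)


actual = ["likeable lead", "atheist"]
predicted = ["atheist", "likeable lead", "double frame rate",
             "sweden", "scout"]

score = jaccard_similarity(actual, predicted)
print(score)  # 2 shared tags out of 5 distinct tags
```

Both actual tags are recovered here, yet the score is only 0.4, because the three extra predictions enlarge the union. With five predictions per movie, a movie with two actual tags cannot score higher than that.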
Here is a sample of random entries (the script draws them from the full dataset), with actual and predicted tags for each of the 3 Doc2Vec models.
Model-Type | Plot | Actual Tags | Predicted Tags (and similarity scores) |
PV-DM/C | Rose Hathaway is a dhampir, half-vampire and half-human, who is training to be a guardian at St Vladimir's Academy along with many others like her. There are good and bad vampires in their world: Moroi, who co-exist peacefully among the humans and only take blood from donors, and also possess the ability to control one of the four elements - water, earth, fire or air; and Strigoi, blood-sucking, evil vampires who drink to kill. Rose and other dhampir guardians are trained to protect Moroi and kill Strigoi throughout their education. Along with her best friend, Princess Vasilisa Dragomir, a Moroi and the last of her line, with whom she has a nigh unbreakable bond, Rose must run away from St Vladimir's, in order to protect Lissa from those who wish to harm the princess and use her for their own means. | patriotism | (based on true story, 0.668), (patriotism, 0.631), (unnecessary, 0.542), (best ending ever, 0.461), (predictable, 0.450) |
PV-DM/C | Charles is the owner of a photo-shop. He is not too friendly and spends his evenings alone, and one day he finally decides to get a social life. He meets elderly Florence, who is tormented by her gambling husband Lester and longs for the son Willie she hasn't seen or heard from in 20 years. | homicide, regret, elevator, religion, plot twist, guilt, supernatural | (short, 0.729), (bank robbery, 0.701), (Exceptional Acting, 0.564), (Vulgar, 0.528), (violent, 0.493) |
PV-DM/M | Mankind discover the existence of the Vampire and Lycan species and they begin a war to annihilate the races. When Selene meets with Michael in the harbor, they are hit by a grenade and Selene passes out. Twelve years later, Selene awakes from a cryogenic sleep in the Antigen laboratory and meets the Vampire David. She learns that she had been the subject of the scientist Dr. Jacob Lane and the Vampire and Lycan species have been practically eradicated from Earth. But Selene is still connected to Michael and has visions that she believes that belongs to Michael's sight. However she has a surprise and finds that she has a powerful daughter named Eve that has been raised in the laboratory. Now Selene and David have to protect Eve against the Lycans that intend to use her to inoculate their species against silver. | die hard 4.0 | (slow build, 0.737), (riveting, 0.735), (Oliver Stone, 0.715), (unmemorable, 0.706), (die hard 4.0, 0.702) |
PV-DBOW | Lawrence Talbot's childhood ended the night his mother died. His father sent him from the sleepy Victorian hamlet of Blackmoor to an insane asylum, then he goes to America. When his brother's fiance, Gwen Conliffe, tracks him down to help find her missing love, Talbot returns to his father's estate to learn that his brother's mauled body has been found. Reunited with his estranged father, Lawrence sets out to find his brother's killer... and discovers a horrifying destiny for himself. Someone or something with brute strength and insatiable blood lust has been killing the villagers, and a suspicious Scotland Yard inspector named Aberline comes to investigate. | torture porn | (torture porn, 0.940), (no payoff, 0.486), (female heroine, 0.469), (cliched plot, 0.469), (masterpiece, 0.464) |
PV-DBOW | Lester and Carolyn Burnham are, on the outside, a perfect husband and wife in a perfect house in a perfect neighborhood. But inside, Lester is slipping deeper and deeper into a hopeless depression. He finally snaps when he becomes infatuated with one of his daughter's friends. Meanwhile, his daughter Jane is developing a happy friendship with a shy boy-next-door named Ricky, who lives with an abusive father. | likeable lead, atheist | (atheist, 0.897), (likeable lead, 0.896), (double frame rate, 0.591), (sweden, 0.554), (scout, 0.549) |
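The scores in the last column come from ranking tag vectors by their similarity to the vector inferred for the plot, so they are better read as similarity scores than as calibrated probabilities. Below is a minimal sketch of that kind of ranking in plain Python 3, assuming cosine similarity as the measure. The `most_similar_tags` helper and the 2-dimensional tag vectors are made up for illustration and are not Gensim's code or learned values.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar_tags(query_vec, tag_vecs, topn=5):
    """Rank tags by cosine similarity to the query vector, best first."""
    scored = [(tag, cosine(query_vec, vec)) for tag, vec in tag_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up 2-dimensional tag vectors (hypothetical, not learned).
tag_vecs = {
    "dark humor": [1.0, 0.0],
    "plot twist": [0.0, 1.0],
    "predictable": [-1.0, 0.0],
}
query = [3.0, 4.0]  # stands in for a vector inferred from a plot

ranked = most_similar_tags(query, tag_vecs, topn=2)
print(ranked)
```

Cosine similarity ranges from -1 to 1 and the scores for different tags do not sum to 1, which is another reason not to interpret them as probabilities.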
As you can see, while there is scope for improvement, the models do get some of the tags right, and some of the newly predicted tags are also quite insightful. This is quite encouraging given the small size of my training set. I had deliberately kept my dataset small, because I wanted to focus on figuring out how to use Doc2Vec in my situation rather than mostly waiting for training to complete. But it looks like this might be a good avenue to explore further.
Anyway, that's all I have for today, hope you found it interesting.
6 comments (moderated to prevent spam):
Hi
Interesting. I followed a different approach.
I label each document individually, then run a multilabel (OneVSRest) classifier using the document vectors.
Results are somewhat similar to what you have here.
I'm going to try your approach instead.
Thanks. So you are labeling the documents manually rather than depending on the user-generated tags? I guess you could do that, but one of my objectives was to use available data (which was only the text and the tags) - that was my real life constraint which I tried to mimic here.
Apologies, I didn't explain myself properly. Gensim uses the word label for whatever you tag the documents with, irrespective of whether that tag is really just a unique doc_id.
So I create one vector per document. No labels involved at that stage.
Then I run an SVM classifier training on my doc_id_vector -> (labels) corpus.
Results are better compared to the approach you run here. But I need to run more tests.
Do you know of a framework that allows you to optimise the hyperparameters "holistically" i.e. including the params that are used when preprocessing, vectorising, and training the classifier?
Sorry about the delay in replying, meant to answer but got pulled off elsewhere and forgot about it. Also thanks for the clarification, that looks like a good approach. Regarding hyperparameter optimization, the most common approaches are grid search and random search, but you could also use Bayesian approaches where you focus on areas of high yield in the hyperparameter space; check out hyperopt. Also, I built a homegrown hyperparameter optimizer along these lines, not a generic one, but enough to understand the principle behind it.
Hi Sujit,
Would you mind explaining why you reduce alpha in each consecutive epoch? I've looked at a couple doc2vec tutorials and this appears to be unique to your approach :).
Thanks a million,
Moritz
Hi Moritz, alpha is the learning rate and I am decaying it linearly as the number of epochs progresses. That way, as the network (hopefully) converges, it oscillates less and less around the (again hopefully) correct solution. There are other examples of this - see, for example, gensim/docs/notebooks/doc2vec-IMDB.ipynb.