I was watching Richard Socher's lectures on CS224d: Deep Learning for Natural Language Processing at the Deep Learning Enthusiasts meetup at San Francisco couple of weeks ago. In one of the lectures, he describes this really cool idea for paraphrase detection using Dynamic Pooling and Unfolding Recursive Autoencoders. His use case was detecting mutual entailment in sentence pairs, but I figured that it might also be useful for computing sentence similarity, so I decided to try that out.
My data is a tiny corpus of 16 pairs of short sentences. The dataset came from a paper that described a rule-based algorithm that uses Wordnet and the Brown Corpus for its semantic smarts. Here is a link to my implementation of this algorithm. I later used the same dataset to re-implement another algorithm that uses word2vec and Word Mover's Distance. Neither algorithm gives importance to the structure of the sentence, although the second one does bring in some information about the context via word2vec. Socher's algorithm uses the structure of the sentence via its parse tree, and uses word embeddings, so it considers both the sentence structure and word context. So I thought it might be a good idea to see how it stacks up against the previous two.
Socher's algorithm goes something like this. Given a sentence pair, we first construct parse trees for them. The figure below shows the parse trees for a pair of sentences. Embeddings are constructed for each node in each tree using an Unfolding Recursive Autoencoder. The resulting embeddings are compared pairwise to generate a similarity matrix. Since the sentence pairs can be of arbitary length, the similarity matrix is passed through a Dynamic Pooling filter to reduce it to a fixed size square matrix. This matrix is then used as input to a classifier network to predict whether the sentences mutually entail one another.
Since my sentence collection was too small to generate a decent embedding out of, I decided to use the GoogleNews model (word2vec embeddings trained on about 100B words of Google News) to look up the words instead. For non-leaf nodes, I compute the vector as the sum of the vectors for the words in the phrase. I then compute the similarity matrix between the nodes for the two sentences. The similarity matrix for the sentences above looks like this.
I then apply a Dynamic Pooling filter on this matrix to reduce it to a 4x4 pooled similarity matrix. This involves splitting the input matrix into 16 blocks to which the pooling function can be applied. The paper deals with (the very common) situation where the input matrix is not perfecly divisible into 4x4 blocks by assigning the extra blocks to the last block. My approach is slightly different, I pad out the input matrix with the edge values in order to make it perfectly divisible - this allows me to use library functions that assume perfect divisibility. The corresponding pooled similarity matrix looks like this:
Instead of computing a binary label (entailed vs not entailed), I compute the similarity between the sentences as the sum of the elements of the pooled matrix divided by the number of elements (i.e, average pool). The similarity for the example sentence pair works out to 0.5079262884003104.
I now apply this modified algorithm to my small corpus of short sentence pairs. My input data has only 16 sentence pairs, with similarity scores that were provided as part of the original paper.
lsent | rsent | sim | |
---|---|---|---|
6 | A glass of cider. | A full cup of apple juice. | 0.678 |
11 | Canis familiaris are animals. | Dogs are common pets. | 0.362 |
10 | Dogs are animals. | They are common pets. | 0.738 |
14 | I have a hammer. | Take some nails. | 0.508 |
15 | I have a hammer. | Take some apples. | 0.121 |
13 | I have a pen. | Where is ink? | 0.129 |
12 | I have a pen. | Where do you live? | 0.000 |
0 | I like that bachelor. | I like that unmarried man. | 0.561 |
9 | It is a dog. | It is a pig. | 0.790 |
7 | It is a dog. | That must be your dog. | 0.739 |
8 | It is a dog. | It is a log. | 0.623 |
1 | John is very nice. | Is John very nice? | 0.977 |
3 | Red alcoholic drink. | Fresh orange juice. | 0.611 |
2 | Red alcoholic drink. | A bottle of wine. | 0.585 |
5 | Red alcoholic drink. | Fresh apple juice. | 0.420 |
4 | Red alcoholic drink. | An English dictionary. | 0.000 |
I used the Stanford CoreNLP to generate the nodes in the trees using the following Scala code.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 | package parsing
import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation
import edu.stanford.nlp.pipeline.StanfordCoreNLP
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.trees.LabeledScoredTreeNode
import edu.stanford.nlp.trees.TreeCoreAnnotations.TreeAnnotation
import scala.collection.JavaConversions._
import scala.collection.mutable.ArrayBuffer
import scala.collection.mutable.LinkedHashSet
import scala.io.Source
import java.io.FileWriter
import java.io.PrintWriter
import java.util.Properties
object SentenceNodes extends App {
val props = new Properties
props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse")
val pipeline = new StanfordCoreNLP(props)
val idx2sent = scala.collection.mutable.Map[String,String]()
val texts = Source.fromFile("/path/to/orig_phrases.tsv")
.getLines
.zipWithIndex
.map(indexedLine => {
val Array(lsent, rsent, score) = indexedLine._1.split("\\|")
idx2sent(indexedLine._2 + "L") = lsent
idx2sent(indexedLine._2 + "R") = rsent
List(lsent, rsent).mkString(" ")
})
val sentPhrasesOut = new PrintWriter(new FileWriter(
"/path/to/sent_phrases.txt"))
texts.zipWithIndex.foreach(textIndexed => {
val doc = new Annotation(textIndexed._1)
pipeline.annotate(doc)
val sentences = doc.get(classOf[SentencesAnnotation])
sentences.zipWithIndex.foreach(sentenceIndexed => {
val tree = sentenceIndexed._1.get(classOf[TreeAnnotation])
// leaves first, then build up nodes
val nodes = tree.postOrderNodeList()
val phrases = nodes.map(node => {
node.getLeaves(
new java.util.ArrayList[LabeledScoredTreeNode])
.map(leaf => leaf.label().value()).mkString(" ")
})
// remove terminating period
.filter(node => !node.equals("."))
.map(node => node.replace(".", ""))
.toList
val dedupedPhrases = new LinkedHashSet[String]()
dedupedPhrases.addAll(phrases)
val sentId = textIndexed._2.toString +
(if (sentenceIndexed._2 == 0) "L" else "R")
sentPhrasesOut.println("%s\t%s".format(
sentId, dedupedPhrases.mkString("|")))
})
})
sentPhrasesOut.flush
sentPhrasesOut.close
}
|
This writes out the phrases in the following format. Each sentence is identified by the sentence number and L or R depending on whether it is the first or second sentence in the pair. The second field is a pipe separated list of nodes in the tree. I do a post-order walk of the tree and discard duplicate nodes.
1 2 3 4 5 | 0L I|like|that|bachelor|that bachelor|like that bachelor|I like that bachelor
0R I|like|that|unmarried|man|unmarried man|that unmarried man|like that unmarried man|I like that unmarried man
1L John|is|very|nice|very nice|is very nice|John is very nice
1R Is|John|very|nice|very nice|?|Is John very nice ?
...
|
I then use gensim to load up its word2vec model using the GoogleNews data, and generate the word2vec embeddings for each of the nodes. Vectors for non-leaf nodes (containing multiple words) are the sum of the vectors for the constituent words.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 | from gensim.models import Word2Vec
model = Word2Vec.load_word2vec_format(
"/path/to/GoogleNews-vectors-negative300.bin.gz", binary=True)
sent_phrases = open("/path/to/sent_phrases.txt", "rb")
phrase_vecs = open("/path/to/phrase_vecs.txt", "wb")
for line in sent_phrases:
sent_id, phrase_str = line.strip().split("\t")
phrases = phrase_str.split("|")
for phrase in phrases:
words = phrase.lower().split(" ")
phrase_vec = np.zeros((1, 300))
for word in words:
try:
phrase_vec += model[word]
except KeyError:
continue
phrase_vec_str = ",".join(["%.5f" % (float(x)) for x in phrase_vec[0]])
phrase_vecs.write("%s\t%s\t%s\n" % (sent_id, phrase, phrase_vec_str))
sent_phrases.close()
phrase_vecs.close()
|
This produces another file that looks like this. The fields are the sentence ID, the phrase string and the corresponding 300 dimensional word2vec vector as a string.
1 2 3 4 5 6 | 0L I -0.08574,-0.00742,0.03452,0.09019,...
0L like 0.05687,0.07565,-0.00163,0.09980,...
0L that -0.01236,-0.02223,0.06554,0.03948,...
0L bachelor 0.02234,-0.05067,-0.11133,0.02895,...
0L that bachelor 0.00998,-0.07290,-0.04579,0.06843,...
...
|
Finally, I compute the similarity matrix for each sentence pair and run it through the Dynamic Pooling filter to produce the pooled similarity matrix, from which I compute the similarity score. I also generated heatmaps of the pooled similarity matrix for visualization.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 | from skimage.measure import block_reduce
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# Dynamic pooling
def dynamic_pool(X, out_dim, func):
output_dim = np.ones(2) * out_dim
# make sure input dimensions are splittable into output_dim pools
xpad = X.shape[0] % out_dim
ypad = X.shape[1] % out_dim
# if not, pad with edge weights to make it splittable
Xpadded = np.pad(X, ((0, out_dim - xpad), (0, out_dim - ypad)), "edge")
# recalculate pooling block size
xpool = Xpadded.shape[0] // out_dim
ypool = Xpadded.shape[1] // out_dim
# compute pooled matrix
Xpooled = block_reduce(Xpadded, block_size=(xpool, ypool), func=func)
return Xpooled
# Create a Pandas dataframe for the phrase vectors
# This is so we can filter on it for individual sentence pairs
phrase_vecs_df = pd.read_csv("/path/to/phrase_vecs.txt",
delimiter="\t", index_col=False,
header=None,
names=["sent_id", "phrase", "phrase_vec"])
results = []
for i in range(len(orig_df)):
phrases_left = (phrase_vecs_df[phrase_vecs_df["sent_id"] == str(i) + "L"]
["phrase_vec"].values)
Xvecs = np.zeros((phrases_left.shape[0], 300))
for j in range(phrases_left.shape[0]):
Xvecs[j] = np.fromstring(phrases_left[j], dtype=float, sep=",")
phrases_right = (phrase_vecs_df[phrase_vecs_df["sent_id"] == str(i) + "R"]
["phrase_vec"].values)
Yvecs = np.zeros((phrases_right.shape[0], 300))
for j in range(phrases_right.shape[0]):
Yvecs[j] = np.fromstring(phrases_right[j], dtype=float, sep=",")
# compute similarity between elements from each sentence
sims = np.zeros((Xvecs.shape[0], Yvecs.shape[0]))
for j in range(Xvecs.shape[0]):
for k in range(Yvecs.shape[0]):
denom = np.linalg.norm(Xvecs[j], 2) * np.linalg.norm(Yvecs[k], 2)
if denom == 0:
sims[j, k] = 0
else:
sims[j, k] = np.dot(Xvecs[j], Yvecs[k]) / denom
pooled_sims = dynamic_pool(sims, 4, np.max)
# retrieve original sentence
lsent = id2sent[str(i) + "L"]
rsent = id2sent[str(i) + "R"]
similarity = np.sum(pooled_sims) / 16
results.append({"sent_id": i, "lsent": lsent, "rsent": rsent,
"sim": similarity})
print(i, "//", lsent, "//", rsent, "//", "%.5f" % (similarity))
plt.figure(figsize=(1, 1))
plt.xticks([])
plt.yticks([])
plt.pcolor(pooled_sims)
plt.show()
# populate a Pandas dataframe from the results and display
# :NOTE: we manually modified the HTML to add in the heatmaps
results_df = pd.DataFrame(results)
results_df = results_df.sort_values(["lsent", "sim"], ascending=[1, 0])
results_df.to_html()
|
The results are shown below. As with the input, I have sorted the resulting sentence pairs by left sentence and score in order to visually verify that the scores match up with intuition.
From the results, it looks like the similarity scores do capture the semantics of the sentences. For example, a dog seems semantically closer to a pig than a log, a hammer closer to nails than apples, and red alcoholic drink closer to wine than apple juice.
That's all I have for today. Hope you enjoyed reading this as much as I enjoyed building it.
13 comments (moderated to prevent spam):
This is really cool. Thanks for the precise description. Let me try it and then I will get back to you on this.
This past week attended a workshop the Evolution and variation of classification systems organized by the Knowescape EU project.
Just want I need. Amazing explanation Sujit. Although understood most of it, reading the heat map went over my head. On a side note there is an interesting project funded by European Union on textual entailment, very similar to what you did. check it out http://hltfbk.github.io/Excitement-Open-Platform/
Ravi Kiran Bhaskar
Thanks Ravi, glad you liked it, the credit for the idea goes to Richard Socher, I just figured that using it to test textual similarity between short sentences might be an interesting application. And thanks for the link, EOP looks very interesting.
Hi Cam, sorry for the late reply and thanks for the pointer to Knowescape project, can you say more about it and how it might relate to this post?
Is there a possibility to rewrite the java scalascript into python code? I'm using the following library https://github.com/smilli/py-corenlp but i dont manage to translate the java scala code. Thank you!
Nice post. Also, using Word Movers Distance on vectors formed by embeddings also result in descent results.
Nice post. But a quick question, You have written that Vectors for non-leaf nodes (containing multiple words) are the sum of the vectors for the constituent words.
Is there any other way of representing multi-word phrases, like avg. , etc ?
@Valeriu: I doubt it, biggest problem for me is time. Although looks like (from a quick glance at the github page) py-corenlp appears to be a HTTP client where you pass in the tokenizer list and get back a giant JSON containing everything, which you then parse. Check out the json module. From the github example, I think it might look a bit like this:
from pycorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP('http://localhost:9000')
fphrases = open("/path/to/phrases.txt", "rb")
for phrase in fphrases:
....output = nlp.annotate(text, properties={
........'annotators': 'tokenize,ssplit,pos,depparse,parse',
........'outputFormat': 'json'
....})
....# parse relevant data out of the output (print it first to see what you want)
fphrases.close()
@Prakhar: Thanks, good suggestion with the WMD. To answer your question, sum and avg is similar, avg might be better since it is normalized by word length. You could also use an autoencoder to try to generate some hidden representation and use that, but I feel that might be overkill for short phrases.
Thank you for your nice post. I have a problem of compiling the scala code and linking it to the CoreNLP. Could you please give a brief introduction? Thank you!
Thanks 钻地的虫, its been a while but I here is another of my projects from which I think this code was adapted -- sujitpal/scalcium. The code from which this was adapted is StanfordNameFinder.scala. This project is (or was) setup to work with Stanford CoreNLP (and other libraries) in Scala, so maybe if you clone this project, you should be able to compile and link.
Hello Sujit
I am trying to implement this can you provide me the required links to install packages which has been used here.
Hi Bhagwat, the data is copied from the paper (link in the first paragraph), the other two software is gensim and Stanford CoreNLP, both have links, if you follow them you will find installation instructions.
Post a Comment