Saturday, September 10, 2016

Sentence Similarity using Recursive Embeddings and Dynamic Pooling


I was watching Richard Socher's lectures on CS224d: Deep Learning for Natural Language Processing at the Deep Learning Enthusiasts meetup at San Francisco couple of weeks ago. In one of the lectures, he describes this really cool idea for paraphrase detection using Dynamic Pooling and Unfolding Recursive Autoencoders. His use case was detecting mutual entailment in sentence pairs, but I figured that it might also be useful for computing sentence similarity, so I decided to try that out.

My data is a tiny corpus of 16 pairs of short sentences. The dataset came from a paper that described a rule-based algorithm that uses Wordnet and the Brown Corpus for its semantic smarts. Here is a link to my implementation of this algorithm. I later used the same dataset to re-implement another algorithm that uses word2vec and Word Mover's Distance. Neither algorithm gives importance to the structure of the sentence, although the second one does bring in some information about the context via word2vec. Socher's algorithm uses the structure of the sentence via its parse tree, and uses word embeddings, so it considers both the sentence structure and word context. So I thought it might be a good idea to see how it stacks up against the previous two.

Socher's algorithm goes something like this. Given a sentence pair, we first construct parse trees for them. The figure below shows the parse trees for a pair of sentences. Embeddings are constructed for each node in each tree using an Unfolding Recursive Autoencoder. The resulting embeddings are compared pairwise to generate a similarity matrix. Since the sentence pairs can be of arbitary length, the similarity matrix is passed through a Dynamic Pooling filter to reduce it to a fixed size square matrix. This matrix is then used as input to a classifier network to predict whether the sentences mutually entail one another.


Since my sentence collection was too small to generate a decent embedding out of, I decided to use the GoogleNews model (word2vec embeddings trained on about 100B words of Google News) to look up the words instead. For non-leaf nodes, I compute the vector as the sum of the vectors for the words in the phrase. I then compute the similarity matrix between the nodes for the two sentences. The similarity matrix for the sentences above looks like this.


I then apply a Dynamic Pooling filter on this matrix to reduce it to a 4x4 pooled similarity matrix. This involves splitting the input matrix into 16 blocks to which the pooling function can be applied. The paper deals with (the very common) situation where the input matrix is not perfecly divisible into 4x4 blocks by assigning the extra blocks to the last block. My approach is slightly different, I pad out the input matrix with the edge values in order to make it perfectly divisible - this allows me to use library functions that assume perfect divisibility. The corresponding pooled similarity matrix looks like this:


Instead of computing a binary label (entailed vs not entailed), I compute the similarity between the sentences as the sum of the elements of the pooled matrix divided by the number of elements (i.e, average pool). The similarity for the example sentence pair works out to 0.5079262884003104.

I now apply this modified algorithm to my small corpus of short sentence pairs. My input data has only 16 sentence pairs, with similarity scores that were provided as part of the original paper.

lsent rsent sim
6 A glass of cider. A full cup of apple juice. 0.678
11 Canis familiaris are animals. Dogs are common pets. 0.362
10 Dogs are animals. They are common pets. 0.738
14 I have a hammer. Take some nails. 0.508
15 I have a hammer. Take some apples. 0.121
13 I have a pen. Where is ink? 0.129
12 I have a pen. Where do you live? 0.000
0 I like that bachelor. I like that unmarried man. 0.561
9 It is a dog. It is a pig. 0.790
7 It is a dog. That must be your dog. 0.739
8 It is a dog. It is a log. 0.623
1 John is very nice. Is John very nice? 0.977
3 Red alcoholic drink. Fresh orange juice. 0.611
2 Red alcoholic drink. A bottle of wine. 0.585
5 Red alcoholic drink. Fresh apple juice. 0.420
4 Red alcoholic drink. An English dictionary. 0.000

I used the Stanford CoreNLP to generate the nodes in the trees using the following Scala code.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
package parsing

import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation
import edu.stanford.nlp.pipeline.StanfordCoreNLP
import edu.stanford.nlp.pipeline.Annotation
import edu.stanford.nlp.trees.LabeledScoredTreeNode
import edu.stanford.nlp.trees.TreeCoreAnnotations.TreeAnnotation

import scala.collection.JavaConversions._
import scala.collection.mutable.ArrayBuffer
import scala.collection.mutable.LinkedHashSet
import scala.io.Source

import java.io.FileWriter
import java.io.PrintWriter
import java.util.Properties

object SentenceNodes extends App {

    val props = new Properties
    props.put("annotators", "tokenize, ssplit, pos, lemma, ner, parse")
    val pipeline = new StanfordCoreNLP(props)

    val idx2sent = scala.collection.mutable.Map[String,String]()
    val texts = Source.fromFile("/path/to/orig_phrases.tsv")
        .getLines
        .zipWithIndex
        .map(indexedLine => {
            val Array(lsent, rsent, score) = indexedLine._1.split("\\|")
            idx2sent(indexedLine._2 + "L") = lsent
            idx2sent(indexedLine._2 + "R") = rsent
            List(lsent, rsent).mkString(" ")
        })

    val sentPhrasesOut = new PrintWriter(new FileWriter(
        "/path/to/sent_phrases.txt"))
    texts.zipWithIndex.foreach(textIndexed => {
        val doc = new Annotation(textIndexed._1)
        pipeline.annotate(doc)
        val sentences = doc.get(classOf[SentencesAnnotation])
        sentences.zipWithIndex.foreach(sentenceIndexed => {
            val tree = sentenceIndexed._1.get(classOf[TreeAnnotation])
            // leaves first, then build up nodes
            val nodes = tree.postOrderNodeList()
            val phrases = nodes.map(node => {
                node.getLeaves(
                        new java.util.ArrayList[LabeledScoredTreeNode])
                    .map(leaf => leaf.label().value()).mkString(" ")
                })
                // remove terminating period
                .filter(node => !node.equals("."))
                .map(node => node.replace(".", ""))
                .toList
            val dedupedPhrases = new LinkedHashSet[String]()
            dedupedPhrases.addAll(phrases)
            val sentId = textIndexed._2.toString + 
                (if (sentenceIndexed._2 == 0) "L" else "R")
            sentPhrasesOut.println("%s\t%s".format(
                sentId, dedupedPhrases.mkString("|")))
        })
    })
    sentPhrasesOut.flush
    sentPhrasesOut.close
}

This writes out the phrases in the following format. Each sentence is identified by the sentence number and L or R depending on whether it is the first or second sentence in the pair. The second field is a pipe separated list of nodes in the tree. I do a post-order walk of the tree and discard duplicate nodes.

1
2
3
4
5
0L    I|like|that|bachelor|that bachelor|like that bachelor|I like that bachelor 
0R    I|like|that|unmarried|man|unmarried man|that unmarried man|like that unmarried man|I like that unmarried man 
1L    John|is|very|nice|very nice|is very nice|John is very nice 
1R    Is|John|very|nice|very nice|?|Is John very nice ?
...

I then use gensim to load up its word2vec model using the GoogleNews data, and generate the word2vec embeddings for each of the nodes. Vectors for non-leaf nodes (containing multiple words) are the sum of the vectors for the constituent words.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
from gensim.models import Word2Vec

model = Word2Vec.load_word2vec_format(
    "/path/to/GoogleNews-vectors-negative300.bin.gz", binary=True)

sent_phrases = open("/path/to/sent_phrases.txt", "rb")
phrase_vecs = open("/path/to/phrase_vecs.txt", "wb")
for line in sent_phrases:
    sent_id, phrase_str = line.strip().split("\t")
    phrases = phrase_str.split("|")
    for phrase in phrases:
        words = phrase.lower().split(" ")
        phrase_vec = np.zeros((1, 300))
        for word in words:
            try:
                phrase_vec += model[word]
            except KeyError:
                continue
        phrase_vec_str = ",".join(["%.5f" % (float(x)) for x in phrase_vec[0]])
        phrase_vecs.write("%s\t%s\t%s\n" % (sent_id, phrase, phrase_vec_str))
sent_phrases.close()
phrase_vecs.close()

This produces another file that looks like this. The fields are the sentence ID, the phrase string and the corresponding 300 dimensional word2vec vector as a string.

1
2
3
4
5
6
0L    I               -0.08574,-0.00742,0.03452,0.09019,...
0L    like             0.05687,0.07565,-0.00163,0.09980,...
0L    that             -0.01236,-0.02223,0.06554,0.03948,...
0L    bachelor         0.02234,-0.05067,-0.11133,0.02895,...
0L    that bachelor    0.00998,-0.07290,-0.04579,0.06843,...
...

Finally, I compute the similarity matrix for each sentence pair and run it through the Dynamic Pooling filter to produce the pooled similarity matrix, from which I compute the similarity score. I also generated heatmaps of the pooled similarity matrix for visualization.

1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
from skimage.measure import block_reduce
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Dynamic pooling
def dynamic_pool(X, out_dim, func):
    output_dim = np.ones(2) * out_dim
    # make sure input dimensions are splittable into output_dim pools
    xpad = X.shape[0] % out_dim
    ypad = X.shape[1] % out_dim
    # if not, pad with edge weights to make it splittable
    Xpadded = np.pad(X, ((0, out_dim - xpad), (0, out_dim - ypad)), "edge")
    # recalculate pooling block size
    xpool = Xpadded.shape[0] // out_dim
    ypool = Xpadded.shape[1] // out_dim
    # compute pooled matrix
    Xpooled = block_reduce(Xpadded, block_size=(xpool, ypool), func=func)
    return Xpooled

# Create a Pandas dataframe for the phrase vectors
# This is so we can filter on it for individual sentence pairs
phrase_vecs_df = pd.read_csv("/path/to/phrase_vecs.txt",
                            delimiter="\t", index_col=False,
                            header=None, 
                            names=["sent_id", "phrase", "phrase_vec"])

results = []
for i in range(len(orig_df)):
    phrases_left = (phrase_vecs_df[phrase_vecs_df["sent_id"] == str(i) + "L"]
                                  ["phrase_vec"].values)
    Xvecs = np.zeros((phrases_left.shape[0], 300))
    for j in range(phrases_left.shape[0]):
        Xvecs[j] = np.fromstring(phrases_left[j], dtype=float, sep=",")
    phrases_right = (phrase_vecs_df[phrase_vecs_df["sent_id"] == str(i) + "R"]
                                   ["phrase_vec"].values)
    Yvecs = np.zeros((phrases_right.shape[0], 300))
    for j in range(phrases_right.shape[0]):
        Yvecs[j] = np.fromstring(phrases_right[j], dtype=float, sep=",")
    # compute similarity between elements from each sentence
    sims = np.zeros((Xvecs.shape[0], Yvecs.shape[0]))
    for j in range(Xvecs.shape[0]):
        for k in range(Yvecs.shape[0]):
            denom = np.linalg.norm(Xvecs[j], 2) * np.linalg.norm(Yvecs[k], 2)
            if denom == 0:
                sims[j, k] = 0
            else:
                sims[j, k] = np.dot(Xvecs[j], Yvecs[k]) / denom
    pooled_sims = dynamic_pool(sims, 4, np.max)
    # retrieve original sentence
    lsent = id2sent[str(i) + "L"]
    rsent = id2sent[str(i) + "R"]
    similarity = np.sum(pooled_sims) / 16
    results.append({"sent_id": i, "lsent": lsent, "rsent": rsent, 
                    "sim": similarity})
    print(i, "//", lsent, "//", rsent, "//", "%.5f" % (similarity))
    plt.figure(figsize=(1, 1))
    plt.xticks([])
    plt.yticks([])
    plt.pcolor(pooled_sims)
    plt.show()

# populate a Pandas dataframe from the results and display
# :NOTE: we manually modified the HTML to add in the heatmaps
results_df = pd.DataFrame(results)
results_df = results_df.sort_values(["lsent", "sim"], ascending=[1, 0])
results_df.to_html()

The results are shown below. As with the input, I have sorted the resulting sentence pairs by left sentence and score in order to visually verify that the scores match up with intuition.

lsent rsent sent_id sim heatmap
6 A glass of cider. A full cup of apple juice. 6 0.392548

11 Canis familiaris are animals. Dogs are common pets. 11 0.606134

10 Dogs are animals. They are common pets. 10 0.734140

14 I have a hammer. Take some nails. 14 0.409951

15 I have a hammer. Take some apples. 15 0.347448

13 I have a pen. Where is ink? 13 0.329194

12 I have a pen. Where do you live? 12 0.296247

0 I like that bachelor. I like that unmarried man. 0 0.613622

7 It is a dog. That must be your dog. 7 0.611811

9 It is a dog. It is a pig. 9 0.581342

8 It is a dog. It is a log. 8 0.432356

1 John is very nice. Is John very nice? 1 0.723129

2 Red alcoholic drink. A bottle of wine. 2 0.454120

3 Red alcoholic drink. Fresh orange juice. 3 0.451136

5 Red alcoholic drink. Fresh apple juice. 5 0.346925

4 Red alcoholic drink. An English dictionary. 4 0.095906


From the results, it looks like the similarity scores do capture the semantics of the sentences. For example, a dog seems semantically closer to a pig than a log, a hammer closer to nails than apples, and red alcoholic drink closer to wine than apple juice.

That's all I have for today. Hope you enjoyed reading this as much as I enjoyed building it.

13 comments (moderated to prevent spam):

amit said...

This is really cool. Thanks for the precise description. Let me try it and then I will get back to you on this.

Can Crusher said...

This past week attended a workshop the Evolution and variation of classification systems organized by the Knowescape EU project.

Anonymous said...

Just want I need. Amazing explanation Sujit. Although understood most of it, reading the heat map went over my head. On a side note there is an interesting project funded by European Union on textual entailment, very similar to what you did. check it out http://hltfbk.github.io/Excitement-Open-Platform/

Ravi Kiran Bhaskar

Sujit Pal said...

Thanks Ravi, glad you liked it, the credit for the idea goes to Richard Socher, I just figured that using it to test textual similarity between short sentences might be an interesting application. And thanks for the link, EOP looks very interesting.

Sujit Pal said...

Hi Cam, sorry for the late reply and thanks for the pointer to Knowescape project, can you say more about it and how it might relate to this post?

Unknown said...

Is there a possibility to rewrite the java scalascript into python code? I'm using the following library https://github.com/smilli/py-corenlp but i dont manage to translate the java scala code. Thank you!

Anonymous said...

Nice post. Also, using Word Movers Distance on vectors formed by embeddings also result in descent results.

Prakhar Mishra said...

Nice post. But a quick question, You have written that Vectors for non-leaf nodes (containing multiple words) are the sum of the vectors for the constituent words.

Is there any other way of representing multi-word phrases, like avg. , etc ?

Sujit Pal said...

@Valeriu: I doubt it, biggest problem for me is time. Although looks like (from a quick glance at the github page) py-corenlp appears to be a HTTP client where you pass in the tokenizer list and get back a giant JSON containing everything, which you then parse. Check out the json module. From the github example, I think it might look a bit like this:

from pycorenlp import StanfordCoreNLP
nlp = StanfordCoreNLP('http://localhost:9000')
fphrases = open("/path/to/phrases.txt", "rb")
for phrase in fphrases:
....output = nlp.annotate(text, properties={
........'annotators': 'tokenize,ssplit,pos,depparse,parse',
........'outputFormat': 'json'
....})
....# parse relevant data out of the output (print it first to see what you want)
fphrases.close()

@Prakhar: Thanks, good suggestion with the WMD. To answer your question, sum and avg is similar, avg might be better since it is normalized by word length. You could also use an autoencoder to try to generate some hidden representation and use that, but I feel that might be overkill for short phrases.

钻地的虫 said...

Thank you for your nice post. I have a problem of compiling the scala code and linking it to the CoreNLP. Could you please give a brief introduction? Thank you!

Sujit Pal said...

Thanks 钻地的虫, its been a while but I here is another of my projects from which I think this code was adapted -- sujitpal/scalcium. The code from which this was adapted is StanfordNameFinder.scala. This project is (or was) setup to work with Stanford CoreNLP (and other libraries) in Scala, so maybe if you clone this project, you should be able to compile and link.

Unknown said...

Hello Sujit
I am trying to implement this can you provide me the required links to install packages which has been used here.

Sujit Pal said...

Hi Bhagwat, the data is copied from the paper (link in the first paragraph), the other two software is gensim and Stanford CoreNLP, both have links, if you follow them you will find installation instructions.