Thursday, December 04, 2014

Computing Semantic Similarity for Short Sentences


A reader recently recommended a paper for me to read - Sentence Similarity Based on Semantic Nets and Corpus Statistics. I found the algorithm quite interesting and ended up implementing it. While I was not able to replicate the reported results exactly, my results did agree with what you would intuitively expect. I describe the algorithm and my implementation in this post.

My implementation is built with Python and the Natural Language Toolkit (NLTK). The Semantic Net referred to in the paper is Wordnet and the Corpus Statistics come from the Brown Corpus, both of which are available through NLTK's corpus API. The complete code is shown below, followed by an explanation.
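If you have not used these resources before, the Wordnet and Brown corpora (along with the Punkt tokenizer models that nltk.word_tokenize relies on) need to be downloaded once; a one-time setup along these lines should do it:

import nltk

# one-time downloads of the resources used below: Wordnet, the Brown corpus,
# and the Punkt tokenizer models backing nltk.word_tokenize
nltk.download("wordnet")
nltk.download("brown")
nltk.download("punkt")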

from __future__ import division
import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus import brown
import math
import numpy as np
import sys

# Parameters to the algorithm. Currently set to values that were reported
# in the paper to produce "best" results.
ALPHA = 0.2
BETA = 0.45
ETA = 0.4
PHI = 0.2
DELTA = 0.85

brown_freqs = dict()
N = 0

######################### word similarity ##########################

def get_best_synset_pair(word_1, word_2):
    """ 
    Choose the pair with highest path similarity among all pairs. 
    Mimics pattern-seeking behavior of humans.
    """
    max_sim = -1.0
    synsets_1 = wn.synsets(word_1)
    synsets_2 = wn.synsets(word_2)
    if len(synsets_1) == 0 or len(synsets_2) == 0:
        return None, None
    else:
        max_sim = -1.0
        best_pair = None, None
        for synset_1 in synsets_1:
            for synset_2 in synsets_2:
                # path_similarity can return None for some synset pairs,
                # so guard against comparing None with a float
                sim = wn.path_similarity(synset_1, synset_2)
                if sim is not None and sim > max_sim:
                    max_sim = sim
                    best_pair = synset_1, synset_2
        return best_pair

def length_dist(synset_1, synset_2):
    """
    Return a measure of the length of the shortest path in the semantic 
    ontology (Wordnet in our case as well as the paper's) between two 
    synsets.
    """
    l_dist = sys.maxint
    if synset_1 is None or synset_2 is None: 
        return 0.0
    if synset_1 == synset_2:
        # if synset_1 and synset_2 are the same synset return 0
        l_dist = 0.0
    else:
        wset_1 = set([str(x.name()) for x in synset_1.lemmas()])        
        wset_2 = set([str(x.name()) for x in synset_2.lemmas()])
        if len(wset_1.intersection(wset_2)) > 0:
            # if synset_1 != synset_2 but there is word overlap, return 1.0
            l_dist = 1.0
        else:
            # just compute the shortest path between the two
            l_dist = synset_1.shortest_path_distance(synset_2)
            if l_dist is None:
                l_dist = 0.0
    # normalize path length to the range [0,1]
    return math.exp(-ALPHA * l_dist)

def hierarchy_dist(synset_1, synset_2):
    """
    Return a measure of depth in the ontology to model the fact that 
    nodes closer to the root are broader and have less semantic similarity
    than nodes further away from the root.
    """
    h_dist = sys.maxint
    if synset_1 is None or synset_2 is None: 
        return h_dist
    if synset_1 == synset_2:
        # return the depth of one of synset_1 or synset_2
        h_dist = max([x[1] for x in synset_1.hypernym_distances()])
    else:
        # find the max depth of least common subsumer
        hypernyms_1 = {x[0]:x[1] for x in synset_1.hypernym_distances()}
        hypernyms_2 = {x[0]:x[1] for x in synset_2.hypernym_distances()}
        lcs_candidates = set(hypernyms_1.keys()).intersection(
            set(hypernyms_2.keys()))
        if len(lcs_candidates) > 0:
            lcs_dists = []
            for lcs_candidate in lcs_candidates:
                lcs_d1 = 0
                if hypernyms_1.has_key(lcs_candidate):
                    lcs_d1 = hypernyms_1[lcs_candidate]
                lcs_d2 = 0
                if hypernyms_2.has_key(lcs_candidate):
                    lcs_d2 = hypernyms_2[lcs_candidate]
                lcs_dists.append(max([lcs_d1, lcs_d2]))
            h_dist = max(lcs_dists)
        else:
            h_dist = 0
    return ((math.exp(BETA * h_dist) - math.exp(-BETA * h_dist)) / 
        (math.exp(BETA * h_dist) + math.exp(-BETA * h_dist)))
    
def word_similarity(word_1, word_2):
    synset_pair = get_best_synset_pair(word_1, word_2)
    return (length_dist(synset_pair[0], synset_pair[1]) * 
        hierarchy_dist(synset_pair[0], synset_pair[1]))

######################### sentence similarity ##########################

def most_similar_word(word, word_set):
    """
    Find the word in the joint word set that is most similar to the word
    passed in. We use the algorithm above to compute word similarity between
    the word and each word in the joint word set, and return the most similar
    word and the actual similarity value.
    """
    max_sim = -1.0
    sim_word = ""
    for ref_word in word_set:
        sim = word_similarity(word, ref_word)
        if sim > max_sim:
            max_sim = sim
            sim_word = ref_word
    return sim_word, max_sim
    
def info_content(lookup_word):
    """
    Uses the Brown corpus available in NLTK to calculate a Laplace
    smoothed frequency distribution of words, then uses this information
    to compute the information content of the lookup_word.
    """
    global N
    if N == 0:
        # poor man's lazy evaluation
        for sent in brown.sents():
            for word in sent:
                word = word.lower()
                if not brown_freqs.has_key(word):
                    brown_freqs[word] = 0
                brown_freqs[word] = brown_freqs[word] + 1
                N = N + 1
    lookup_word = lookup_word.lower()
    n = 0 if not brown_freqs.has_key(lookup_word) else brown_freqs[lookup_word]
    return 1.0 - (math.log(n + 1) / math.log(N + 1))
    
def semantic_vector(words, joint_words, info_content_norm):
    """
    Computes the semantic vector of a sentence. The sentence is passed in as
    a collection of words. The size of the semantic vector is the same as the
    size of the joint word set. An element is 1 if the corresponding joint
    word exists in the sentence, or the (thresholded) similarity between the
    joint word and the most similar word in the sentence if it doesn't. Both
    values are further scaled by the word's (and the similar word's)
    information content if info_content_norm is True.
    """
    sent_set = set(words)
    semvec = np.zeros(len(joint_words))
    i = 0
    for joint_word in joint_words:
        if joint_word in sent_set:
            # if word in union exists in the sentence, s(i) = 1 (unnormalized)
            semvec[i] = 1.0
            if info_content_norm:
                semvec[i] = semvec[i] * math.pow(info_content(joint_word), 2)
        else:
            # find the most similar word in the joint set and set the sim value
            sim_word, max_sim = most_similar_word(joint_word, sent_set)
            semvec[i] = max_sim if max_sim > PHI else 0.0
            if info_content_norm:
                semvec[i] = semvec[i] * info_content(joint_word) * info_content(sim_word)
        i = i + 1
    return semvec                
            
def semantic_similarity(sentence_1, sentence_2, info_content_norm):
    """
    Computes the semantic similarity between two sentences as the cosine
    similarity between the semantic vectors computed for each sentence.
    """
    words_1 = nltk.word_tokenize(sentence_1)
    words_2 = nltk.word_tokenize(sentence_2)
    joint_words = set(words_1).union(set(words_2))
    vec_1 = semantic_vector(words_1, joint_words, info_content_norm)
    vec_2 = semantic_vector(words_2, joint_words, info_content_norm)
    return np.dot(vec_1, vec_2.T) / (np.linalg.norm(vec_1) * np.linalg.norm(vec_2))

######################### word order similarity ##########################

def word_order_vector(words, joint_words, windex):
    """
    Computes the word order vector for a sentence. The sentence is passed
    in as a collection of words. The size of the word order vector is the
    same as the size of the joint word set. The elements of the word order
    vector are the position mapping (from the windex dictionary) of the 
    word in the joint set if the word exists in the sentence. If the word
    does not exist in the sentence, then the value of the element is the 
    position of the most similar word in the sentence as long as the similarity
    is above the threshold ETA.
    """
    wovec = np.zeros(len(joint_words))
    i = 0
    wordset = set(words)
    for joint_word in joint_words:
        if joint_word in wordset:
            # word in joint_words found in sentence, just populate the index
            wovec[i] = windex[joint_word]
        else:
            # word not in joint_words, find most similar word and populate
            # word_vector with the thresholded similarity
            sim_word, max_sim = most_similar_word(joint_word, wordset)
            if max_sim > ETA:
                wovec[i] = windex[sim_word]
            else:
                wovec[i] = 0
        i = i + 1
    return wovec

def word_order_similarity(sentence_1, sentence_2):
    """
    Computes the word-order similarity between two sentences as the normalized
    difference of word order between the two sentences.
    """
    words_1 = nltk.word_tokenize(sentence_1)
    words_2 = nltk.word_tokenize(sentence_2)
    joint_words = list(set(words_1).union(set(words_2)))
    windex = {x[1]: x[0] for x in enumerate(joint_words)}
    r1 = word_order_vector(words_1, joint_words, windex)
    r2 = word_order_vector(words_2, joint_words, windex)
    return 1.0 - (np.linalg.norm(r1 - r2) / np.linalg.norm(r1 + r2))

######################### overall similarity ##########################

def similarity(sentence_1, sentence_2, info_content_norm):
    """
    Calculate the semantic similarity between two sentences. The last 
    parameter is True or False depending on whether information content
    normalization is desired or not.
    """
    return DELTA * semantic_similarity(sentence_1, sentence_2, info_content_norm) + \
        (1.0 - DELTA) * word_order_similarity(sentence_1, sentence_2)
        
######################### main / test ##########################

# the results of the algorithm are largely dependent on the results of 
# the word similarities, so we should test this first...
word_pairs = [
  ["asylum", "fruit", 0.21],
  ["autograph", "shore", 0.29],
  ["autograph", "signature", 0.55],
  ["automobile", "car", 0.64],
  ["bird", "woodland", 0.33],
  ["boy", "rooster", 0.53],
  ["boy", "lad", 0.66],
  ["boy", "sage", 0.51],
  ["cemetery", "graveyard", 0.73],
  ["coast", "forest", 0.36],
  ["coast", "shore", 0.76],
  ["cock", "rooster", 1.00],
  ["cord", "smile", 0.33],
  ["cord", "string", 0.68],
  ["cushion", "pillow", 0.66],
  ["forest", "graveyard", 0.55],
  ["forest", "woodland", 0.70],
  ["furnace", "stove", 0.72],
  ["glass", "tumbler", 0.65],
  ["grin", "smile", 0.49],
  ["gem", "jewel", 0.83],
  ["hill", "woodland", 0.59],
  ["hill", "mound", 0.74],
  ["implement", "tool", 0.75],
  ["journey", "voyage", 0.52],
  ["magician", "oracle", 0.44],
  ["magician", "wizard", 0.65],
  ["midday", "noon", 1.0],
  ["oracle", "sage", 0.43],
  ["serf", "slave", 0.39]
]
for word_pair in word_pairs:
    print "%s\t%s\t%.2f\t%.2f" % (word_pair[0], word_pair[1], word_pair[2], 
                                  word_similarity(word_pair[0], word_pair[1]))

sentence_pairs = [
    ["I like that bachelor.", "I like that unmarried man.", 0.561],
    ["John is very nice.", "Is John very nice?", 0.977],
    ["Red alcoholic drink.", "A bottle of wine.", 0.585],
    ["Red alcoholic drink.", "Fresh orange juice.", 0.611],
    ["Red alcoholic drink.", "An English dictionary.", 0.0],
    ["Red alcoholic drink.", "Fresh apple juice.", 0.420],
    ["A glass of cider.", "A full cup of apple juice.", 0.678],
    ["It is a dog.", "That must be your dog.", 0.739],
    ["It is a dog.", "It is a log.", 0.623],
    ["It is a dog.", "It is a pig.", 0.790],
    ["Dogs are animals.", "They are common pets.", 0.738],
    ["Canis familiaris are animals.", "Dogs are common pets.", 0.362],
    ["I have a pen.", "Where do you live?", 0.0],
    ["I have a pen.", "Where is ink?", 0.129],
    ["I have a hammer.", "Take some nails.", 0.508],
    ["I have a hammer.", "Take some apples.", 0.121]
]
for sent_pair in sentence_pairs:
    print "%s\t%s\t%.3f\t%.3f\t%.3f" % (sent_pair[0], sent_pair[1], sent_pair[2], 
        similarity(sent_pair[0], sent_pair[1], False),
        similarity(sent_pair[0], sent_pair[1], True))

Proceeding from the top down (with respect to the code, or bottom up with respect to the algorithm), the lowest-level unit of the algorithm is the semantic similarity between a pair of words. The word similarity is a combination of two functions f(l) and f(h), where l is the length of the shortest path between the two words in Wordnet (our Semantic Network) and h is the depth of their Lowest Common Subsumer (LCS) below the root of the Semantic Network. The intuition is that l is a proxy for how similar the words are, and h is a proxy for the specificity of the LCS, i.e., LCS nodes closer to the root indicate broader, more abstract concepts and hence less similarity. The functions f(l) and f(h) normalize these values to the range [0,1]. In formulas:

sim(w1, w2) = f(l) · f(h)

    where:

    f(l) = exp(-α·l)

    f(h) = (exp(β·h) - exp(-β·h)) / (exp(β·h) + exp(-β·h))

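As a quick sanity check on the word-level functions (assuming the code above is loaded), the relative ordering of a few pairs can be eyeballed directly; the absolute values will depend on your Wordnet version:

# relative ordering is what matters here; absolute values vary with the
# installed Wordnet data
for w1, w2 in [("autograph", "signature"), ("autograph", "shore"),
               ("magician", "wizard"), ("magician", "oracle")]:
    print "%s\t%s\t%.2f" % (w1, w2, word_similarity(w1, w2))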
Word similarities for a set of word pairs were reported in the paper. As a test, I computed the similarities for the same word pairs with my code above. The similarity values reported in the paper are shown under Exp. Sim and the ones returned by my code are shown under Act. Sim. As you can see, they are close but not identical - however, note that the computed similarities do line up with intuition. For example, sim(autograph, signature) is higher than sim(autograph, shore), sim(magician, wizard) is higher than sim(magician, oracle), etc.

Word #1       Word #2       Exp. Sim   Act. Sim
asylum        fruit           0.21       0.30
autograph     shore           0.29       0.16
autograph     signature       0.55       0.82
automobile    car             0.64       1.00
bird          woodland        0.33       0.20
boy           rooster         0.53       0.11
boy           lad             0.66       0.82
boy           sage            0.51       0.37
cemetery      graveyard       0.73       1.00
coast         forest          0.36       0.36
coast         shore           0.76       0.80
cock          rooster         1.00       1.00
cord          smile           0.33       0.13
cord          string          0.68       0.82
cushion       pillow          0.66       0.82
forest        graveyard       0.55       0.20
forest        woodland        0.70       0.98
furnace       stove           0.72       0.17
glass         tumbler         0.65       0.82
grin          smile           0.49       0.99
gem           jewel           0.83       1.00
hill          woodland        0.59       0.36
hill          mound           0.74       0.99
implement     tool            0.75       0.82
journey       voyage          0.52       0.82
magician      oracle          0.44       0.30
magician      wizard          0.65       1.00
midday        noon            1.00       1.00
oracle        sage            0.43       0.37
serf          slave           0.39       0.55

One thing I did differently from the paper is to select the most similar pair of synsets instead of just picking the first noun synset for each word (see the function get_best_synset_pair). This is because a word can map to multiple synsets, and finding the most similar pair mimics the human tendency to maximize pattern seeking (i.e., to see patterns even where there are none).

Sentence similarity is computed as a linear combination of semantic similarity and word order similarity. Semantic similarity is computed as the cosine similarity between the semantic vectors of the two sentences. To build a semantic vector, the union of the words in the two sentences is treated as the vocabulary. If a word in this joint set occurs in the sentence, its value at that position is 1. If it doesn't, the similarity of the joint word is computed against every word in the sentence and the highest value is taken; if that value is above the threshold φ, the element is set to that similarity, else it is set to 0. When information content normalization is enabled, each value is further attenuated by the information content of the word (and of its most similar word) as estimated from the Brown corpus. In equations:

    Ss = (s1 · s2) / (||s1|| * ||s2||)

    where:

    si = s * I(wi) * I(wj)

    I(w) = 1 - log(n + 1) / log(N + 1)

    where:
      n = number of times word w occurs in the corpus
      N = total number of words in the corpus
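To get a feel for the information content term (again assuming the code above is loaded), compare a very frequent word against a rare one; the frequent word should come back with a noticeably lower I(w):

# frequent words carry little information, rare words carry more; the first
# call also triggers the one-time pass over the Brown corpus, so it is slow
print "%.3f" % info_content("the")      # frequent in Brown, lower value
print "%.3f" % info_content("tumbler")  # rare in Brown, value close to 1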

The word order similarity attempts to correct for the fact that sentences with the same words can have radically different meanings. This is done by computing a word order vector for each sentence and then a normalized similarity measure between them. The word order vector, like the semantic vector, is based on the joint word set. If a joint word occurs in the sentence, its position in the joint word set is recorded. If not, the position (again in the joint word set) of the most similar word in the sentence is recorded, provided the similarity crosses the threshold η; otherwise the element is 0. In equations:

    Sr = 1 - ||r1 - r2|| / ||r1 + r2||

    where:
      r1 = word position vector for sentence 1
      r2 = word position vector for sentence 2

The similarity between two sentences is then modeled as a linear combination of their semantic similarity and word order similarity, i.e.:

S = δSs + (1 - δ)Sr
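Putting the pieces together, a single pair can be scored end to end as follows (a usage sketch assuming the code above is loaded; the third argument toggles information content normalization):

s1 = "It is a dog."
s2 = "It is a pig."
# individual components and the combined score, with IC normalization on
print "semantic   : %.3f" % semantic_similarity(s1, s2, True)
print "word order : %.3f" % word_order_similarity(s1, s2)
print "combined   : %.3f" % similarity(s1, s2, True)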

As with the word similarities, the paper also lists sentence similarities computed with its algorithm. I ran the same sentence pairs through my code and, as expected, got slightly different results (since sentence similarity depends on the word similarities). Here they are: Exp.Sim are the values reported in the paper, Act.Sim (w/o IC) are the computed similarities without information content normalization, and Act.Sim (w/IC) are the computed similarities with information content normalization. As you can see, the Exp.Sim and Act.Sim (w/IC) values are quite consistent.

Sentence #1                       Sentence #2                     Exp.Sim   Act.Sim (w/o IC)   Act.Sim (w/IC)
I like that bachelor.             I like that unmarried man.      0.561     0.801              0.345
John is very nice.                Is John very nice?              0.977     0.592              0.881
Red alcoholic drink.              A bottle of wine.               0.585     0.477              0.307
Red alcoholic drink.              Fresh orange juice.             0.611     0.467              0.274
Red alcoholic drink.              An English dictionary.          0.000     0.237              0.028
Red alcoholic drink.              Fresh apple juice.              0.420     0.389              0.215
A glass of cider.                 A full cup of apple juice.      0.678     0.659              0.347
It is a dog.                      That must be your dog.          0.739     0.452              0.709
It is a dog.                      It is a log.                    0.623     0.858              0.497
It is a dog.                      It is a pig.                    0.790     0.863              0.500
Dogs are animals.                 They are common pets.           0.738     0.550              0.377
Canis familiaris are animals.     Dogs are common pets.           0.362     0.458              0.151
I have a pen.                     Where do you live?              0.000     0.134              0.158
I have a pen.                     Where is ink?                   0.129     0.112              0.077
I have a hammer.                  Take some nails.                0.508     0.431              0.288
I have a hammer.                  Take some apples.               0.121     0.344              0.147

Once more, while the numbers don't match exactly, the results seem intuitively correct. For example, "Red alcoholic drink" is more similar to "A bottle of wine" than to "Fresh apple juice", which in turn is more similar than "An English dictionary", and so on.

That's all I have for today. I hope you found this paper (and my implementation) interesting. The (latest) code for this post is available on GitHub here.

Update 2017-03-13: Many thanks to Mathieu Chrétien for updating the code to Python 3 and contributing it back; you can find it in this GitHub gist.

105 comments (moderated to prevent spam):

Anonymous said...

Interesting...I was trying to deal with it from the other end using text entailment. In all fairness was trying to compare 2 or more entire documents together to see if they are similar. Although entailment was promising at a sentence level, I felt it wasn't great for larger texts. I was also trying to view it from Natural Language Understanding perspective, but in vain. Kindly let me know if you find alternatives :-)

Thanks for the great write up

Ravi Kiran Bhaskar

Sujit Pal said...

Thanks Ravi, every time we speak I end up learning something new :-). I didn't know about text entailment; from the little I understand now (from this paper), it seems to me that entailment at a reasonably high probability may be a stronger guarantee than what this is computing.

One possibility to compute semantic similarity between documents could be something similar to what we do at work - we reduce both documents to a bag of concepts by annotating phrases in it to an ontology, then computing the similarity between their concept vectors. There is some upfront effort to create an ontology but once you have it, the process is reasonably accurate and has good performance.

Anonymous said...

Sujit,
I love your blog man, I can only dream to understand math as clearly as you do and apply them as algorithms :-)

I do understand and know that we can calculate document similarity based on bag of words/term vectors/features. However, I am still not convinced its a true representation of similarity. For example, consider the overly simplistic view of 2 docs

1. "Ravi went to USA. He loves NLP and is working in a company using the awesome technologies."

2. "Ravi is a probably a good boy. The jobless rate depicted in USA news is misleading, it depends from comapany to company and tech involved".

Both have the same set terms "Ravi", "USA", "Company", although the context/meaning of both these docs are totally different.

From the limited knowledge I have, BOW/Term Vectors can only decipher "relatedness" NOT "similarity" ...there seems to be a fine distinction which eludes most NLPers...that's the real head scratcher I am after :-)

Thanks,

Ravi Kiran Bhaskar

Anonymous said...

BTW if you want to look at Entailment look at European Union Funded Project http://hltfbk.github.io/Excitement-Open-Platform/

I felt this was better than the rest and others felt rudimentary

Ravi Kiran Bhaskar

Sujit Pal said...

Thanks Ravi, both for the kind words and the link. Actually I mean Bag of Concepts rather than Bag of Words, it gives you a little more in terms of synonymy and relationships. Perhaps if in the example, we recognized more entities such as NLP and jobless, we could not only use the matched terms/concepts but also the mismatched ones to form a more well rounded version of similarity that would be closer to what you are looking for?

Mahesh said...

Hi Sujit Pal,

Really, you did a great job. From last few days I am searching for sentence matching. The code which you have written is very very useful to me. Simply you did an awesome job.

But I have one problem, when I integrate your code and checking with 1000 sentences its getting too slower and taking at most 8 mins time.

Is there any way to reduce the processing time. Could you please help me to out of this issue.

Thanks in advance.

Sujit Pal said...

Thanks for the kind words Mahesh, glad you found it useful. Regarding your question about processing time, the algorithm is doing quite a lot of work so I am not surprised that you found it slow. I didn't investigate the slowness, but you may want to do some profiling to see which part of the algorithm is slow, then focus on that. So for example, Wordnet lookups can perhaps be cached so lookup of the same word across different sentences can be speeded up.

Mahesh said...

Thank you very much Sujit Pal. As you said I try with caching the lookups and the performance was improved and execution time reduced half of the time of previous.

Once again thank you very much Sujit.

Sujit Pal said...

Cool! Glad my suggestion helped.

Mahesh said...

I am very new to NLTK and a little bit of experience in python. The code which you have posted make me more confident on NLTK and python.

Really, I am so happy with your suggestions, especially with your code.

I need your valuable suggestions or code snippets for some critical scenario's.

Is it possible to add own synonyms to nltk - wordnet? If yes, please suggest me how?
Is it possible to use dictionary in nltk? If yes, please suggest me how?


Your suggestions are valuable and great guidance to me.

Thanks in advance.

Sujit Pal said...

Hi Mahesh, thanks for the kind words, happy to help. With regard to your question, no, it is not possible to add your own synonyms to Wordnet as far as I know. However, you can have a dictionary (Map in Java) for {word => synonym, synonym => word} lookups that you consult first and only go to Wordnet if lookup fails. Regarding dictionaries, I am guessing you mean a data structure that supports O(1) lookups by key, right? In that case, yes, and the data structure is called a dictionary in Python :-).
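A minimal sketch of that dictionary-first lookup, with purely illustrative contents in custom_synonyms, might look something like this:

# illustrative only: a hand-built synonym map consulted before Wordnet
custom_synonyms = {
    "nlp": ["natural_language_processing"],
    "natural_language_processing": ["nlp"],
}

def synonyms_for(word):
    # fall back to Wordnet lemma names only if the custom map has no entry
    if word in custom_synonyms:
        return custom_synonyms[word]
    return [lemma.name() for synset in wn.synsets(word)
                         for lemma in synset.lemmas()]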

Mahesh said...

Thanks for giving your valuable time and suggestions.

I have two problems,
1.) How to handle digits (See Ex.1)
2.) Can we check the negative scenarios with this code. (See Ex.2)
Please check the expected and actual ratios

Ex1:-
Sent1 = "score 4/10"
Sent2 = "score 5/10"

Output put from console:
==============================================================
Sent1 Sent2 Expected Act True Act False
--------------------------------------------------------------
score 4/10 score 5/10 0.650 0.425 0.164


Ex2:-
Sent1 = "Headache gradual in onset"
Sent2 = "Headache sudden onset"

Output put from console:
=====================================================================================
Sent1 Sent2 Expected Act True Act False
-------------------------------------------------------------------------------------
Headache gradual in onset Headache sudden onset 0.335 0.577 0.692

Sujit Pal said...

Hi Mahesh, good call on the numbers. If the numbers are equal, then obviously they are similar, so you could have a check in both the word_similarity and hierarchical_similarity methods to check if the input words are numbers and return 1 if they are equal. Also you could modify the tokenization so it splits numbers on punctuation so 4/10 becomes ["4", "/", "10"]. For the "gradual" vs "sudden" case, they are both found in Wordnet and path_similarity will/may give some indication that they have opposite meanings (coverage is better for nouns than other parts of speech and these are adjectives). BTW I wasn't sure what the numbers are in your output - I am guessing that the "Expected" is a number that human judges have come up with, but not sure what "Act True" and "Act False" are.
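A rough sketch of that numeric check, written as a wrapper around word_similarity (the helper names here are made up for illustration):

def is_number(token):
    # treat a token as numeric if it parses as a float
    try:
        float(token)
        return True
    except ValueError:
        return False

def word_similarity_with_numbers(word_1, word_2):
    # identical numbers are maximally similar, different numbers are not
    if is_number(word_1) and is_number(word_2):
        return 1.0 if word_1 == word_2 else 0.0
    return word_similarity(word_1, word_2)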

Mahesh said...

Thank you for giving reply with max clarity and most patience.

Sujit Pal said...

You are welcome Mahesh, glad it helped.

Ahmed said...

Dear sir,
I want to implement Computing Semantic Similarity between two documents.Can you give me some brief detail on it.Any tutorial to help or any help will be appreciated.Thanks sir for such a good tutorial.

Sujit Pal said...

You're welcome, Ahmed. Regarding semantic similarity between two documents, this approach probably won't scale very well. One way to compute semantic similarity between two documents may be to use word2vec word vectors to produce document vectors by summing up the word vectors and comparing their similarity using standard measures like cosine similarity. This is on my to-do list, not sure how effective this would be though. Other approaches could use an ontology, and group related words into a coarser entity and use that for similarity calculations. With word2vec you could probably simulate this by clustering where you choose a number of clusters so there are approximately N (you control N) elements in each cluster, and then replace all the words in the cluster with the synthetic word represented by the cluster center, then use that for your similarity calculations. Anyway, the idea is to come up with some way to replace groups of words into a single representative word, then do similarity calculations against the representative words.
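A rough sketch of the summing idea, assuming word_vectors is a mapping from words to numpy arrays loaded from some pretrained word2vec model (the loading step is omitted here):

import numpy as np

def document_vector(words, word_vectors):
    # sum the vectors of the words we have embeddings for
    dim = len(next(iter(word_vectors.values())))
    vec = np.zeros(dim)
    for word in words:
        if word in word_vectors:
            vec += word_vectors[word]
    return vec

def document_similarity(words_1, words_2, word_vectors):
    # cosine similarity between the two summed document vectors
    v1 = document_vector(words_1, word_vectors)
    v2 = document_vector(words_2, word_vectors)
    return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))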

Ahmed said...

Dear sir,
I am using your code to get better understanding.but "wn.synsets(word_1)" is not working.Is there any change of API or what.I don't that .try to reply as soons as possible.

Sujit Pal said...

Hi Ahmed, not sure why its not working. For reference, my nltk version (using nltk.__version__) is 3.0.2 running against Python 2.7.10. It is possible it may have changed, although I doubt it if you are on a 3.0 version as well. in any case it returns the synsets associated with the word, so you might want to google for something equivalent. The output of wn.synsets("cat") on my machine looks like this:

>>> wn.synsets("cat")
[Synset('cat.n.01'), Synset('guy.n.01'), Synset('cat.n.03'), Synset('kat.n.01'), Synset('cat-o'-nine-tails.n.01'),
Synset('caterpillar.n.02'), Synset('big_cat.n.01'), Synset('computerized_tomography.n.01'), Synset('cat.v.01'),
Synset('vomit.v.01')]

The other possibility is that perhaps you have not downloaded Wordnet? To verify, check to see if some of the other Wordnet commands work. If they don't, then you should do nltk.download as explained here.

Ahmed said...

Dear Sir
Can you tell me for example Synset('kat.n.01')......what is n and 01 ?

Sujit Pal said...

Hi Ahmed, "n" is noun and 01 is a index into the sequence of noun synsets found for "kat" - run wn.synsets("kat", "n") or better wn.synsets("cat", "n") and you will see what I mean.

Anonymous said...

Hi, I have a question about negation. Consider: "This is good" and "This is not good".
Semantically, these two are opposite. Will the above algorithm handle such a pair? In general, what is the most effective way to handle negation? For example, "It is not the case that it is OK" vs. "It is acceptable and it is not quite unreasonable". In the first case, the whole clause is modified in meaning; in the second case, the negation is contained only in one of the clauses of the compound sentence. I have an impression that this issue is taken for granted, but I am looking for an effective way to address negation in deriving the meaning in a universally applicable way in all situations.

Sujit Pal said...

This algorithm will not handle this case, although (at least in theory, depending on Wordnet coverage) it will handle antonyms such as "gradual" vs "sudden" (from an example in an earlier comment). In the past I have handled negation using a rule based algorithm called negex (I have a Scala implementation here if you are interested). However, a nuance of negex is that it computes negation status of specific phrases in a sentence rather than the sentence itself, so you will need some way to find the phrases (maybe using an NER to find noun phrases and compute negation status for each). A simpler (but cruder) way is to only consider words in the sentence before a negator word (from a list) is seen. Recently, I also read (in the context of keyword based sentiment analysis) about changing the sign of the sentence vector for the sentence upon encountering negator words, so that could be another option.

Anonymous said...

Hi, Thank you very much for your quick and detailed reply. Also, I truly appreciate your sharing of the above code with excellent comments; it is indeed very nice of you. I am learning about Python as well as WordNet. One doubt: The cited paper seems to look for the distance from the root of the LCS, where as the code seems to calculate the distance between the LCS and the given synsets. I could be wrong, but could it be reason for the differences in the two sets of answers?

Sujit Pal said...

I believe so. Its been a while since I looked at the paper, but from what I remember I couldn't figure out a way to calculate the hierarchical distance the way it was computed in the paper so I decided to do it in an equivalent way.

Unknown said...

Thanks so much for this. Very useful!

I believe that the word_order_vector function should be as follows:

def word_order_vector(words, joint_words):
"""
Computes the word order vector for a sentence. The sentence is passed
in as a collection of words. The size of the word order vector is the
same as the size of the joint word set. The elements of the word order
vector are the position mapping (from the windex dictionary) of the
word in the joint set if the word exists in the sentence. If the word
does not exist in the sentence, then the value of the element is the
position of the most similar word in the sentence as long as the similarity
is above the threshold ETA.
"""
wovec = np.zeros(len(joint_words))
i = 0
# wordset = set(words) in original but set changes element order
wordDict = {x[1]: x[0] for x in enumerate(words)}
for joint_word in joint_words:
if joint_word in wordDict:
# word in joint_words found in sentence, just populate the index
wovec[i] = wordDict[joint_word]
else:
# word not in joint_words, find most similar word and populate
# word_vector with the thresholded similarity
wordSet = set(words)
sim_word, max_sim = most_similar_word(joint_word, wordSet)
if max_sim > ETA:
wovec[i] = wordDict[sim_word]
else:
wovec[i] = 0
i = i + 1
return wovec

Sujit Pal said...

You are welcome and thanks for the code change. Regarding the code change, I think my original code was passing in the word order via the windex argument, whereas your modification figures it out by enumerating words and creating a dict. Did the original code not work or is this an improvement? In any case, since Blogger comments tend to destroy formatting and formatting is so important for Python code, I am going to repost it with ".." replacing one indent, that way someone can copy-paste your version into the code easily.

def word_order_vector(words, joint_words):
.."""
..Computes the word order vector for a sentence. The sentence is passed
..in as a collection of words. The size of the word order vector is the
..same as the size of the joint word set. The elements of the word order
..vector are the position mapping (from the windex dictionary) of the
..word in the joint set if the word exists in the sentence. If the word
..does not exist in the sentence, then the value of the element is the
..position of the most similar word in the sentence as long as the similarity
..is above the threshold ETA.
.."""
..wovec = np.zeros(len(joint_words))
..i = 0
..# wordset = set(words) in original but set changes element order
..wordDict = {x[1]: x[0] for x in enumerate(words)}
..for joint_word in joint_words:
....if joint_word in wordDict:
......# word in joint_words found in sentence, just populate the index
......wovec[i] = wordDict[joint_word]
....else:
......# word not in joint_words, find most similar word and populate
......# word_vector with the thresholded similarity
......wordSet = set(words)
......sim_word, max_sim = most_similar_word(joint_word, wordSet)
......if max_sim > ETA:
........wovec[i] = wordDict[sim_word]
......else:
........wovec[i] = 0
....i = i + 1
..return wovec

Anonymous said...

Hi Sujit,

Excellent paper.

Just wanted to ask why aren't we using wup-similarity for calculating sentence similarity. How is it different from the function that you have written.

Sujit Pal said...

Thanks Ekta, although I didn't write the paper, just implemented it as best as I could. You are right, Wu-Palmer Similarity could be used also to find the difference between individual words since it is so similar (and uses similar metrics to derive the similarity as the one described). In my case I was trying to follow the paper as closely as possible. Once you have the inter-word similarities you could use the algorithm in the paper to find the distance between the sentences.

colla said...

Dear Sujit, thanks for your codes. Really helpful. Can I please have your email address to discuss off here?
thanks

Sujit Pal said...

Hi Colla, I prefer to not share my email on a public page. If you send me your email address as a comment, I can contact you on it and delete your comment so your email address is not public either.

Anonymous said...

Hello, I like your job in implement of algorithm, I can use it in my dissertation?

Sujit Pal said...

The implementation is based on an algorithm put forward in an existing paper (referenced in the first sentence of this blog post). I am not sure what you mean by "using", so I guess the answer would depend on that.

Unknown said...

Hi Sujit,

Great post. My question regarding this implementation is that this code will never assign a score of 0 (zero) to completely dissimilar sentences. Do you have any suggestion that how to modify this code that it gives a score of zero to completely dissimilar sentences.
As an example from your sentence pairs:
Red alcoholic drink. An English dictionary. 0.000 0.237
Here two sentences are totally dissimilar and actual score is 0 but the code results in a score of 0.237
Really appreciate your guidance to address this shortcoming.
Thanks

Sujit Pal said...

Thanks Naveed. I guess one simple way could be to consider everything below a certain cutoff as dissimilar, as long as it was universally applicable, which it doesn't seem to be (at least w/o IC). I think its a good way to compare two pairs of strings rather than an absolute indicator of similarity/dissimilarity.

zahra said...

Hi Sujit,thanks for your codes. do you have java implemention of this code?!
Thanks in advance.

Sujit Pal said...

Hi Zahra, no I don't have a Java implementation. But you should be able to build it fairly easily from the Scala version.

Anonymous said...

hi sujit,
thanks alot sujit for great article , it helped alot though, could you please give reference of how to extract corpus reuters.
i am trying to cluster Reuters, NEWS20 and OSHUMD based on content semantic using Wordnet ontology, could you please help me out with this problem, i will be agglomatitve hierachal based clustring.

Sujit Pal said...

You are welcome, glad it helped you. For Reuters, NLTK has a reader as described here. If you want to extract the text from the source, there are various parsers available such as this one and this one.

Anonymous said...

Thanks alot Sujit, again, but could u please tell me how to apply clustring decision based on semantics of corpus using wordNet or anything u want to suggest where i will be making clustere decision based on semantics

Anonymous said...

Hi Sujit,

This is really a great post. Enjoying it totally. Wanted to discuss with you on importance of word order vector.

Taking word order into account is surely a better approach than bag of words. But, a lot of times, two sentences may have same meaning but completely different word orders. Is word order actually a measure of similarity between sentences?

Also, how easy/difficult it would be to take care of n-grams with this algorithm?

Thanks a lot again !

Anonymous said...

Hi Sujit,

What are your observations about coverage of WordNet.

Lot of researchers seem to mention that the coverage of WordNet is low and can be an issue but still its one of the most used resource. Have you come across any better alternatives?

Thanks

Sujit Pal said...

Thank you anonymous, and you bring up good points. Regarding word order, it seems to work well to predict similarity for the dataset. I did not actually think too much about the negative cases, but I am guessing the author of the paper on which this is based (referenced at the top of the post) must have, and concluded that the word order helps more than it harms, ie, maybe there are fewer cases where word order is different for two sentences with similar meaning, so overall its a win. I think it should be possible to use n-grams for the length similarity but probably less so for the others (hierarchical and semantic). Regarding wordnet coverage, it /is/ quite low, but if you have enough cases, you can sort of "smooth" over that, so it works out. I guess you could consider things like Yago similar, but it is not manually curated like Wordnet is and its coverage is broader.

Sujit Pal said...

Replying to this comment dated Jan 18 2016, must have missed it, sorry about that, found it in my queue when I looked today.

>> Thanks alot Sujit, again, but could u please tell me how to apply clustring decision based on semantics of corpus using wordNet or anything u want to suggest where i will be making clustere decision based on semantics

You are welcome, but wouldn't the choice of the ontology (in this case I am thinking of Wordnet as an ontology of words by grammar sense) to lookup be dependent on the type of content you are looking to cluster? Also I think we might have already discussed this question in more detail later (if you are the same Anonymous).

Anonymous said...

Hi Sujit,

Thanks for the excellent blog !

I want to understand the intuition behind using "path_similarity" to find the best matching synsets and then using a different measure "shortest_path_distance" to find the actual distance. Why are we using two different algorithm to find similarity between two synsets ?

Unknown said...

i need the algorithm of wu plamer similarity algorithm

Sujit Pal said...

@Vikram: you are welcome, glad you found it useful. I am using the two because there can be multiple synset pairs to compare because the word can correspond to multiple synsets. So I am using the pair that is the closest. I could also have done the shortest path distance across all the synset pairs and chosen the minimum distance.

@Nadhiya: From the Wordnet Similarity page, here is the definition: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node).

Ramanujan said...

Hi Sujit,

I am working in a part of project for sentence similarity, we are using unsupervised learning. I need your help in adding extra feature(or papers) to add as part of our model. It would be very helpful for me, Can you please provide if you have it.

Thanks & Regards

Sujit Pal said...

Hi Ramanujan, this is the only paper on sentence similarity for short sentences that I know about. But I looked up some results on Scopus, maybe these are helpful?

Anonymous said...

hello sujit,
i m using reuters-21578 for clustering based on semantics, which i will take from wordNET, approach i m trying to follow is to extract topics using LDA then i want to map those topics to wordNET hypernym, as GRAIN topic will be concept food,instead of making term frequency vector i will make concept vector , which will reduce high demensionality of large data set because it will be based on concept vector instead of term vector. Now the problem is i am unable to map wordNET hypernym with Topics i extracted using LDA. i m using Pythoin for that
.
regards

Sujit Pal said...

Are you unable to map because the correct hypernym does not exist? I tried with your example, I see that the 2nd synset maps approximately to what you want.

>>> from nltk.corpus import wordnet as wn
>>> wn.synsets("grain")
[Synset('grain.n.01'), Synset('grain.n.02'), Synset('grain.n.03'), Synset('grain.n.04'), Synset('grain.n.05'), Synset('grain.n.06'), Synset('grain.n.07'), Synset('grain.n.08'), Synset('grain.n.09'), Synset('grain.n.10'), Synset('texture.n.05'), Synset('ingrain.v.01'), Synset('grain.v.02'), Synset('granulate.v.01'), Synset('granulate.v.02')]
>>> wn.synset("grain.n.01").hypernyms()
[Synset('atom.n.02')]
>>> wn.synset("grain.n.02").hypernyms()
[Synset('foodstuff.n.02')]

Cant think of a simple way to do this automatically though... there is no indication in the word about its hypernym. One possibility could be to look at different words in the topic and see if they map to locations that are closer to one hypernym than another, and choose the closest hypernym to all words in the topic, but then thats kind of circular...

Anonymous said...

hello Sujit,thanks for reply,
i would simply do it prototype level, whatever will be availble i will only use those concepts if any topics concept is not avaible thn wouldnot take it ..
so could you refer in this scenario now.

Sujit Pal said...

Then it might be a workable idea, although there is still the problem of choosing the right synset to collect hypernyms from. Since LDA yields a distribution of words for each topic, you could choose maybe the top N or some threshold to consider a subset of most probable words for each topic, and use the synset which is closest to the subset of these words.

Anonymous said...

Hello Sujit, I would like to ask permission to use their implements as a reference for testing of Sentence Similarity Based on Semantic Nets and Corpus Statistics, this was the only implementation I found that article.

Sujit Pal said...

Hi, if you are looking for my permission to use the code in this blog post, you have my permission to use it. If you are looking for permission to use the code in the paper (I don't think there was any, but from your question it appears that there might be), then you have to ask the authors of the paper.

Anonymous said...

Hi Sujit,
Great blog post. I was wondering if you were aware of any open source projects which do Sentence Semantic Similarity Analysis. I am making a QnA system and I need to match User Input with the most similar question of my Database. I have tried your above algorithm plus a few of my own, but didnt reach too much accuracy since the whole idea is built on matching words rather than matching sentences.

Also, should we not give higher weightages e.g when nouns match instead of adjectives? Are there any libraries that do that?

Sujit Pal said...

Thank you, and sorry, I don't know of a project that does sentence semantic similarity analysis like you described. Perhaps you could do something with word2vec? You could look up words in your query and get their word vectors, then sum them up to form your query vector. On the database side, you could do the same thing for words in the candidate questions, then sum them up to form question vectors, then find the one closest by some metric such as cosine or euclidean distance. You could also include weights to prefer nouns over adjectives in such a pipeline - there are libraries such as NLTK (Python) and OpenNLP (Java/Scala) to get POS tags for words in sentences.

Ganesh said...

Hi,

Your post is really useful who wants to learn NLP things and implement them.
I want to achieve similar thing i.e. I want to check whether two statements are similar or opposite to each other.
Can your implementation achieve this? I know your implementation is for similarity but is h there any way to check opposite-ness?

Thanks,
Ganesh

Sujit Pal said...

Thanks for the kind words Ganesh. Unfortunately I can't think of a way to measure out oppositeness like you are looking for. One obvious way, especially if the range of the metric value is 0-1 would be to think of it as a probability and compute oppositeness as 1 - similarity, but I don't remember if the similarity metric has that range or not.

sherlockatszx said...

I tested "Apples eat me" and "I eat apples". Sadly, this sentence pair socres 0.59,which is not reasonable. It seems that need to add semantci dependency parsing into.Do you got any improved method to deal with this kind semantic tricks

Sujit Pal said...

Hi sherlockatszx, Here is something more recent that uses intermediate nodes in the parse tree for the sentences, although you might not find much difference with that particular sentence.

Anonymous said...

Hi Sujit,

First of all, excellent code.

However, I wanted to reproduce the original values obtained in the paper by modifying your code, which I am unable to. Can you tell me how to modify the get_best_pair function so that the code produces the same values as in the paper?

Thanks.

Shaown S.

Sujit Pal said...

Hi Shaown, thanks for the kind words. As I mention at the beginning of the post, I wasn't able to reproduce the values reported in the paper either, however, the rankings seem to be correct and largely match the rankings reported in the paper. Its been a while since I wrote the post, but IIRC there were some places where I improvised on the original paper because (a) the paper was underspecified or (b) I thought I could do better. One such case is mentioned (search for "did differently") - you can try computing the difference of synsets_1[0] and synsets_2[0] instead.

Unknown said...

Dear Salmon Run

I am one of your followers. I always think that you are lucky guy, because I believe that the most interesting task in this world is sharing what you have to others. Thank you really sir.
I was looking for your early post on "Computing Semantic Similarity for Short Sentences" to compare two essay text semantically. While I was looking for it I had some confusion
1. Does it work for two essay text?
2. When computing order similarity of two sentence what if they are semantically same while their word order in active and passive sentence form.
3. What is necessity of large text corpus used "brown"

Hopefully you will help me because this time the answer is compulsory for me. I am graduate student and this task is part of my work. I now you are too busy, but I appreciate any time give for me.

Sujit Pal said...

You are welcome and thank you for the kind words, Abebawu. To answer your questions:
1) This is for short sentences, it is likely to become computationally too expensive for long texts. For long text, it might be better to reduce the document to a single meaning vector. Easiest is adding up the sparse one-hot vectors for the words for a bag-of-words model, more complex if you want to use word embeddings and combine them to form meaning vectors.
2) The semantic similarity component of the score should pick that up.
3) Its for computing the information content of a word in context of some "standard" text. This is used to discount the similarity component.

Unknown said...

hi sir , thanks for the implementation which is very useful for my work but i have a query regarding the word_pairs u have mentioned the values along with word-pairs , may i know on what basis we are getting those values.
sir plz clarify my doubt i stopped implementing for work.

Sujit Pal said...

Hi Sravanthi, glad you found the post helpful to your work. The expected similarities came from the paper referenced towards the beginning of the post, and the actual similarities come from my code shown in the post. The code attempts to follow the algorithm laid down in the paper, but is not identical, I made some changes based on what was convenient and available to me. I have done another post with a different algorithm on the same data, maybe you might find that interesting also.

frolickbbc said...

Hi Sujit,
I found this post really helpful and thanks for making out this post. I have close to 1 million support ticket data and its a short text. I need to cluster the sentence based on semantic and word order similarity. How I can go about this..

Thanks
Bharath

Sujit Pal said...

Hi Bharath, one possibility might be to package up the code in the blog into a custom distance function and then call it in a K nearest neighbors classifier. Look at the accepted answer in this thread on Stack Overflow, it has links to the relevant documentation for scikit-learn components.
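A sketch of that packaging (the ticket texts are made up, and the code assumes the similarity() function from the post is in scope); the resulting matrix can be handed to any scikit-learn estimator that accepts precomputed distances:

import numpy as np

# hypothetical short support-ticket texts
tickets = ["cannot login to portal",
           "password reset link not working",
           "invoice missing from last month"]

# pairwise distance matrix, with distance defined as 1 - similarity
n = len(tickets)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        d = 1.0 - similarity(tickets[i], tickets[j], True)
        dist[i, j] = dist[j, i] = d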

Neha Kaushik said...

Dear Sir,

We are trying to match terms in an ontology. Many of the concepts in source ontologies are composite terms. The problem is that WordNet does not work for composite terms.How to find similarity between two composite terms?
For example,
Cereal crop and cereals
Legumionous crop and grain legumes

Sujit Pal said...

Hi Neha, if I understand correctly, you are trying to use this algorithm to compute similarity between ontology terms, and ontology terms can be multi-word. So in this scenario, the ontology terms are the "short sentences". If so, this should not be a problem since by the time you look up Wordnet, you are looking at individual words. Or maybe I am not understanding the problem?

Unknown said...

Dear Sir
   I have a project of end of theme study on the clasification of tweets, I wonder if I can use it in my project, and if I can change cosine by jaccard by ex, and cant you help me .


thanks.

Sujit Pal said...

Hi Faycal, I just implemented (with some changes) the paper referenced at the beginning of the post. The author of the paper is more likely to have a better theoretical understanding of the algorithm, so may be better to ask there. I think it might be fine to use it for tweet classification since both are short sentences. I don't have a preference about cosine vs jaccard, if Jaccard works better for your data then I think you should absolutely use it.

Unknown said...

Dear Sujit,

Thanks a lot for your implementation of the algorithm. It worked like a charm after some minor adaptation for python 3 ;)

Cheers,

Mathieu

Sujit Pal said...

Cool, this is good news for people (most people nowadays I think) who use Python3. If you have the updated version as a gist or on a public repository somewhere, please let me know and I can post it as an update to the post.

Unknown said...

Dear Sujit,

here is a gist of the updated version ;)
https://gist.github.com/chretm/fdcefce520ddfa1b66af3c730d4928c0

Cheers,

Mathieu

Sujit Pal said...

Thanks Mathieu, I have updated the post with the link to your gist.

Unknown said...

Hello sir , myself Sriharsha I didn't understand, why you were using two if statements in the lines 91 and 94 as the in the line 85 you are doing an intersection of the hypernyms that are there from the two synsets , isn't it kind of redundant?

Sujit Pal said...

Good catch, Sriharsha! Yes it does appear that way. Since the words in lcs_candidates are part of the intersection, the if conditions are redundant.

Unknown said...

Another doubt sir , in the function hierarchy_dist why cant we do like this :

Get the lowest common hypernym for both synset by using( an example -> wn.synset('dog.n.01').lowest_common_hypernyms(wn.synset('cat.n.01')), and then apply the function hypernym_distances to the result of the function (lowest_common_hypernyms) and extract the value for the synset-> 'entity.n.01'??which will give the distance(as the index 1 of the function hypernym_distances will give the distance). If this approach is not possible could you please explain why and could you explain how your algo works?

thank you.

Unknown said...

Hi sir, your program really helps me, and I need a dataset of tweets to test my KNN algorithm as I find on NET just feeling tweets.
thank you very much.

Sujit Pal said...

@Sriharsha - I think that should work, thanks for suggesting. It's possible I didn't know of the lowest common hypernym function so implemented something similar. But using the function is definitely more elegant.

@faycal - glad to know it helped. Tweets would be a good use case I think, good luck with the project.

Anonymous said...

Hi Sujit, it is really a great script and in most of the industries in today's world it can be a use case as they want to explore the customer data.

I just have a quiet basic question :) (pls don't laugh out :)).

what can be the input sentence meaning as the string? and where to define the input?

Sujit Pal said...

Hi Anonymous, I don't understand your first question. For the second question, there is a sentence_pairs data structure at line 281, thats where you would put in the short sentence pairs you want to match.

Anonymous said...

how i can use this code two measure similarity between two file
and i can use it to measure similarity between two Arabic file

hanan said...

i want know how write code two measure similarity between two English file TXT by this code

Sujit Pal said...

@hanan: do the two files contain parallel sentences? In that case you can read them both into arrays, then zip them together to form a list of sentence pairs like the sentence_pairs block in the code (the final entry in each triple is the expected score from the paper; you can replace that with 0.0 if you want to reuse the code as-is, or tweak the code to not read that value in the block below it). A sketch of this is shown after this comment.

@anonymous: this is meant to do similarity between short sentences, so I believe using it on two files will be very resource intensive and not very practical. For Arabic, make sure you have some kind of wordnet like resource available, since one part of the similarity measure involves a wordnet lookup.
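A sketch of the zipping approach for the parallel-file case, with made-up file names (one sentence per line in each file):

# file names are illustrative; each file holds one sentence per line
with open("sents_1.txt") as f1, open("sents_2.txt") as f2:
    sentence_pairs = [[s1.strip(), s2.strip(), 0.0]
                      for s1, s2 in zip(f1, f2)]

for sent_1, sent_2, _ in sentence_pairs:
    print "%s\t%s\t%.3f" % (sent_1, sent_2, similarity(sent_1, sent_2, True))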

Unknown said...

Hai sujit
Thanks for the code.
I am trying to use this code for the sentences of 200-350 words. It is taking long time to execute , can you please provide me the system upgrade required (either gpu or cpu)RAM needed to make execution faster
Or can you please provide code for the semantic similarity of long sentences

Thank you.

Sujit Pal said...

Hi Naga, I don't think the algorithm is suitable for long sentences, its O(n^2) complexity because you are doing an all-pairs between words in the two sentences. For long sentences, aggressive lemmatization (and possibly word pruning) followed by a high level similarity metric like Jaccard may be more suitable. I also talked about a Deep Learning based approach for text classification and similarity at PyData Seattle last week, one of the examples I covered was sentence similarity. Unfortunately the results for sentence similarity was not that great although its quite general and can be applied to sentences of any length. I am working on it and if i get better results I will blog about it.

Unknown said...

Hi, I use python3 to execute the code, but i got the following error:

[Synset('refuge.n.03'), Synset('mental_hospital.n.01')]
Traceback (most recent call last):
File "sentence_distance.py", line 280, in
word_similarity(word_pair[0], word_pair[1])))
File "sentence_distance.py", line 105, in word_similarity
synset_pair = get_best_synset_pair(word_1, word_2)
File "sentence_distance.py", line 39, in get_best_synset_pair
if sim > max_sim:
TypeError: unorderable types: NoneType() > float()

Is there the problem of version?

Thanks for your works!

Unknown said...

Hi Sujit, I executed the code successfully!

Now,I have a question,the text word or sentence is English,so it can use nltk, if I want to text Chinese ,how I should do?

Thank you !

Sujit Pal said...

Hi Chen, good to know you fixed it, I am guessing max_sim was not initialized maybe? This code presupposes English, not sure if it can be transformed for Chinese. I see at least 3 problems maybe more since I don't know much about Chinese. First, word tokenization is probably much simpler in English than Chinese. Second, not sure if similarity using things like edit distance make much sense for Chinese ideograms. Third, it uses Wordnet which is a ontology of English words, not sure if something similar exists for Chinese?

hanan said...

can you help me in implementation this paper
PDLK: Plagiarism detection using linguistic knowledge

Sujit Pal said...

Hi Hanan, paper looks interesting, thanks for the pointer. Happy to help in case you are stuck. Also given recent advances in DL methods in NLP, once you are done implementing the paper, you might also want to look at using prebuilt word vectors instead of wordnet, and bi-LSTMs to generate sentence vectors and compare them, and see if your results are better than reported by PDLK.

Anonymous said...

Hi Sujit, I found it very useful to see how a Semantic Similarity algorithm is actually implemented in code, thank you.
One thing: on line 168, should it be
semvec[i] = max_sim if max_sim > PHI else 0.0
rather than
semvec[i] = PHI if max_sim > PHI else 0.0
so that the value is not limited to PHI?
- Carl

Sujit Pal said...

I think you are right, with the current setting anything greater than PHI gets truncated to PHI, but the change you suggested just gets rid of the effect of very dissimilar words. I am going to make the change in the code above, thanks for catching this.

Unknown said...

What I find interesting is that if you run just the semantic_similarity function as opposed to similarity (taking into account word order), the similarity score is higher. Though the differences are pretty minute. It seems including the calculation of word order does not alter the scores significantly.

Thanks for this article by the way! Writing my thesis on detecting malicious email campaigns via semantic similarity.

Sujit Pal said...

You are welcome and good luck on your thesis. That is an interesting observation, and I think perhaps expected to some extent, since semantic similarity is a higher bar than structural similarity. Although the minuteness of the difference probably indicates that there is not much semantic content to short sentences in general (or at least the ones we are testing with).

Unknown said...

Hey Sujit its wonderful to look at you code and I was wondering that how cool stuffs can be done with writing scripts, I have started out recently new with python I have a question for you can you pls tell mehow to find similar sentences from a excel file of 15000 sentences which are in hindi text ?

Unknown said...

Chen Yuan, how did you fix you Nonetype > issue?? It would be helpful to post the fixes to your bugs for others :)

Anonymous said...

So there's a huge discrepancy when I capitalise the a word and then match it with a sentence. Is there any way to remove the discrepancy?

Sujit Pal said...

This is quite interesting. Do you mean if you capitalize the word "a" to "A" then results change? I don't have the code handy anymore but if you can comment with a test case demonstrating this, that would probably be very useful for others here.

Also sorry about the delay in responding, I wasn't being notified about the comments.

Sujit Pal said...

Reply to comment by Unknown dated 9/12/2018: I am not familiar with NLP tools for Hindi, but I expect that in the absence of everything else, you would still be able to use things like Jaccard and Levenshtein's distances. Also check out multi-language embeddings or embeddings specifically built with Hindi. The general idea would be to somehow generate features for your sentences and then use either brute force or approximate nearest neighbor between all pairs of sentences.

Anonymous said...

Is there any way I can get the overall accuracy of the model? Maybe a pearson's coefficient value? If so, Sir can you guide me on how I could alter the above code to do so?

Sujit Pal said...

The model itself is unsupervised, so there are no gold-standard scores to compare with. The numbers I show are the ones reported in the paper vs the ones I calculated, based on my implementation of the approach described in the paper. Ideally, our numbers should have matched, but it is possible that I may have misunderstood some aspects of the algorithm described or made some other mistakes. I did note, however, that my implementation reported differences between words and sentences that were in line with the results in the paper.

I think a better measure of the model's skill might be to quantify the ordering of word and sentence similarities for pairs of inputs, and count how many predictions from the model match up to what you would expect as a human.