Friday, May 03, 2013

Inter-Document Similarity with Scikit-Learn and NLTK


Someone recently asked me about using Python to calculate document similarity across text documents. The application had to do with cheating detection, i.e., comparing student transcripts and flagging documents with abnormally high similarity for further investigation. For security reasons, I could not get access to actual student transcripts, so the goal here was simply to convince ourselves that the approach is valid and to come up with a code template for doing this.

I have been playing quite a bit with NLTK lately, but for this work I decided to use the Python machine learning toolkit Scikit-Learn, which has pretty powerful text processing facilities. I did end up using NLTK for its cosine_distance function, but that was about it.

I decided to test the code on the coffee-sugar-cocoa mini-corpus of 53 documents. I first found this corpus in Dr Manu Konchady's TextMine project, and I have used it off and on since. For convenience I have made it available at the GitHub location for the sub-project.

Here's the code. It first parses the data into a temporary one-document-per-line version, then vectorizes each line and creates a term-document matrix for the corpus.

from __future__ import division

from operator import itemgetter

from nltk.cluster.util import cosine_distance
import numpy as np
import random
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline

def preprocess(fnin, fnout):
  fin = open(fnin, 'rb')
  fout = open(fnout, 'wb')
  buf = []
  id = ""
  category = ""
  for line in fin:
    line = line.strip()
    if line.find("-- Document Separator --") > -1:
      if len(buf) > 0:
        # write out body,
        body = re.sub(r"\s+", " ", " ".join(buf))
        fout.write("%s\t%s\t%s\n" % (id, category, body))
      # process next header and initialize buf
      id, category, rest = map(lambda x: x.strip(), line.split(": "))
      buf = []
    else:
      # accumulate body lines for the current document
      buf.append(line)
  # flush the last document, which is not followed by a separator line
  if len(buf) > 0:
    body = re.sub(r"\s+", " ", " ".join(buf))
    fout.write("%s\t%s\t%s\n" % (id, category, body))
  fin.close()
  fout.close()

def train(fnin):
  docs = []
  cats = []
  fin = open(fnin, 'rb')
  for line in fin:
    id, category, body = line.strip().split("\t")
    docs.append(body)
    cats.append(category)
  fin.close()
  pipeline = Pipeline([
    ("vect", CountVectorizer(min_df=0, stop_words="english")),
    ("tfidf", TfidfTransformer(use_idf=False))])
  tdMatrix = pipeline.fit_transform(docs, cats)
  return tdMatrix, cats

def test(tdMatrix, cats):
  testIds = random.sample(range(0, len(cats)), int(0.1 * len(cats)))
  testIdSet = set(testIds)
  refIds = filter(lambda x: x not in testIdSet, range(0, len(cats)))
  sims = np.zeros((len(testIds), len(refIds)))
  for i in range(0, len(testIds)):
    for j in range(0, len(refIds)):
      doc1 = np.asarray(tdMatrix[testIds[i], :].todense()).reshape(-1)
      doc2 = np.asarray(tdMatrix[refIds[j], :].todense()).reshape(-1)
      sims[i, j] = cosine_distance(doc1, doc2)
  for i in range(0, sims.shape[0]):
    xsim = list(enumerate(sims[i, :]))
    # cosine_distance returns a distance, so the most similar reference
    # documents are the ones with the smallest values
    sortedSims = sorted(xsim, key=itemgetter(1))[0:5]
    sourceCat = cats[testIds[i]]
    numMatchedCats = 0
    numTestedCats = 0
    for j, score in sortedSims:
      # j indexes a column of sims, so map it back to a document id via refIds
      targetCat = cats[refIds[j]]
      if sourceCat == targetCat:
        numMatchedCats += 1
      numTestedCats += 1
    print("Test Doc: %d, Source Category: %s, Target Matched: %d/%d times" %
      (i, sourceCat, numMatchedCats, numTestedCats))
      
def main():
  preprocess("sugar-coffee-cocoa-docs.txt", "sccpp.txt")
  tdMatrix, cats = train("sccpp.txt")
  test(tdMatrix, cats)
  
if __name__ == "__main__":
  main()

The code then pulls out five random documents, finds the five most similar reference documents for each, and counts how many of those are in the same category as the test document. As you can see, the results don't look too bad.

sujit@cyclone:docsim$ python docsim.py 
Test Doc: 0, Source Category: coffee, Target Matched: 3/5 times
Test Doc: 1, Source Category: cocoa, Target Matched: 2/5 times
Test Doc: 2, Source Category: sugar, Target Matched: 3/5 times
Test Doc: 3, Source Category: cocoa, Target Matched: 1/5 times
Test Doc: 4, Source Category: sugar, Target Matched: 2/5 times

A few things I tried were turning IDF on (it is currently off) and using Euclidean distance instead of cosine distance. The first actually made the results slightly worse over a set of runs, and the second did not seem to have any effect.
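For reference, both variations are small changes to the code above; here is a minimal sketch (not the exact runs behind the numbers reported here):

# Variation 1: turn IDF weighting on in the pipeline used by train().
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline

pipelineWithIdf = Pipeline([
  ("vect", CountVectorizer(min_df=0, stop_words="english")),
  ("tfidf", TfidfTransformer(use_idf=True))])

# Variation 2: swap in NLTK's euclidean_distance, which has the same
# (u, v) signature as cosine_distance, inside the test() loop:
#   from nltk.cluster.util import euclidean_distance
#   sims[i, j] = euclidean_distance(doc1, doc2)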

One thing to note is that this approach is unlikely to scale. Computing document-to-document similarity across the entire corpus (as you would need to do for a group of students) is an O(n^2) operation. I have recently been reading up on Locality Sensitive Hashing, which can mitigate this problem by detecting document clumps without doing all the pairwise comparisons. Another approach may be to cluster the documents first and only consider inter-document similarities for documents within each cluster.
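To illustrate the clustering idea, here is a hedged sketch (KMeans with k=3 is just an assumption based on the three categories in this corpus, and I have not benchmarked it) that restricts the pairwise comparisons to documents that land in the same cluster:

# Hedged sketch: cluster the documents first, then compute similarities only
# within each cluster. Assumes tdMatrix is the matrix returned by train()
# above; k=3 is an assumption based on this corpus's three categories.
import numpy as np
from nltk.cluster.util import cosine_distance
from sklearn.cluster import KMeans

def within_cluster_sims(tdMatrix, k=3):
  km = KMeans(n_clusters=k)
  km.fit(tdMatrix)
  for label in range(k):
    members = np.where(km.labels_ == label)[0]
    for i in range(len(members)):
      for j in range(i + 1, len(members)):
        doc1 = np.asarray(tdMatrix[members[i], :].todense()).reshape(-1)
        doc2 = np.asarray(tdMatrix[members[j], :].todense()).reshape(-1)
        yield members[i], members[j], cosine_distance(doc1, doc2)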

And I guess that's it for today. The approach I took with Scikit-Learn is heavily inspired by the code in the text processing tutorial linked to above. For a slightly more math- and code-heavy treatment of text processing with Scikit-Learn, check out Christian Perone's awesome blog posts here and here.

Update (2013-05-08) - I've been reading Building Search Applications: Lucene, LingPipe and GATE by Dr Manu Konchady (henceforth referred to as the LLG book on this blog), and I came upon the SCAM (Stanford Copy Analysis Mechanism) algorithm for plagiarism detection, which seemed better suited to my cheating-detection use case than cosine similarity. Here is some Python code that implements the SCAM measure between two documents. This code, and the update to the original code to use it, can be found on my GitHub page for this subproject.

from __future__ import division
import numpy as np
import scipy.sparse as ss

def _s_pos_or_zero(x):
  return x if x > 0 else 0

def _s_zero_mask(x, y):
  return 0 if y == 0 else x

def _s_safe_divide(x, y):
  return 0 if x == 0 or y == 0 else x / y

_v_pos_or_zero = np.vectorize(_s_pos_or_zero)
_v_zero_mask = np.vectorize(_s_zero_mask)
_v_safe_divide = np.vectorize(_s_safe_divide)

def _assymetric_subset_measure(doc1, doc2):
  # zero out terms that fail the epsilon-based frequency ratio filter, then
  # compare the filtered dot product against the unfiltered one
  epsilon = np.ones(doc1.shape) * 2
  filtered = _v_pos_or_zero(epsilon - (_v_safe_divide(doc1, doc2) +
    _v_safe_divide(doc2, doc1)))
  zdoc1 = _v_zero_mask(doc1, filtered)
  zdoc2 = _v_zero_mask(doc2, filtered)
  return np.sum(np.dot(zdoc1, zdoc2)) / np.sum(np.dot(doc1, doc2))

def scam_distance(doc1, doc2):
  # the SCAM measure is the larger of the two asymmetric subset measures
  asm12 = _assymetric_subset_measure(doc1, doc2)
  asm21 = _assymetric_subset_measure(doc2, doc1)
  return max(asm12, asm21)

I decided to check this out by calculating the SCAM distance (which should really be called SCAM similarity, since higher values indicate closeness) between my previous post and this post without the update, and between this post without the update and this post with the update. My expectation was that the score for the first pair should be lower than that for the second pair, and indeed it is, giving 0.655172413793 and 0.954821894005 respectively.
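For completeness, here is a hedged sketch of how scam_distance can be invoked on a pair of raw texts (docA and docB are placeholder strings, not the actual posts used for the numbers above):

# Hedged usage sketch for scam_distance; docA and docB are placeholders.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docA = "..."
docB = "..."
vectorizer = CountVectorizer(min_df=1, stop_words="english")
X = vectorizer.fit_transform([docA, docB])
vecA = np.asarray(X[0, :].todense()).reshape(-1)
vecB = np.asarray(X[1, :].todense()).reshape(-1)
print(scam_distance(vecA, vecB))  # higher values indicate more overlap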

17 comments (moderated to prevent spam):

Saugata Bose said...

Dear Sir,
Thank you for such a resourceful writing. I am following your code and I am obtaining an error:

CountVectorizer(min_df=1,stop_words="english")
TypeError: __init__() got an unexpected keyword argument 'min_df'

Can you please advise me on how to resolve this problem? Thank you in advance.

Sujit Pal said...

Hi Saugata, the only thing I can think of is that perhaps the sklearn version you are using differs from mine and the API is a little different. In my case:

>>> import sklearn
>>> sklearn.__version__

returns 0.14.1

I ran this set of commands in my Python shell to see what min_df means; maybe you can do the same on your version and figure out the analogous parameter (or see if the code works acceptably for you without min_df):

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> help(CountVectorizer)

gives me this:
min_df : float in range [0.0, 1.0] or int, optional, 1 by default
When building the vocabulary ignore terms that have a term frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents, integer absolute counts. This parameter is ignored if vocabulary is not None.
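To make the int vs float distinction concrete, here is a small sketch on a toy corpus (the values are picked purely for illustration):

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> docs = ["the cat sat", "the dog sat", "the bird flew"]
>>> # int: keep terms appearing in at least 2 documents
>>> CountVectorizer(min_df=2).fit(docs).get_feature_names()
['sat', 'the']
>>> # float: keep terms appearing in at least 50% of the documents
>>> CountVectorizer(min_df=0.5).fit(docs).get_feature_names()
['sat', 'the']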

Saugata Bose said...

Dear Sir,

Thank you for your kind reply. You were right. I installed updated version of sklearn and it was solved.

At present, I am trying to find the cosine distance similarity between 2 files following this tutorial. You have used a single file of documents, and I am trying to modify the code to use 2 files without considering 'category' and 'id'.

But it has shown an error that "list object is not callable".

I am a bit confused about my modified test method.

My code is following:

#! /usr/bin/python -tt

>>from __future__ import division
>>from operator import itemgetter
>>from sklearn.feature_extraction.text import CountVectorizer
>>from sklearn.feature_extraction.text import TfidfTransformer
>>import nltk.cluster.util as nltkutil
>>import numpy as np
>>import re

>>def preprocess(fnin, fnout):
>> fin = open(fnin, 'rb')
>> print fin
>> fout = open(fnout, 'wb')
>> buf = []

>> for line in fin:

>> line = line.strip()
>> if line.find("-- Document Separator --") > -1:
>> if len(buf) > 0:

>>body = re.sub("\s+", " ", " ".join(buf))
>>fout.write("%s\n" % (body))

>> rest = map(lambda x: x.strip(), line.split(": "))

>> buf = []
>> else:
>>buf.append(line)

>> fin.close()
>> fout.close()

>>def test(tdMatrix,count,fsim):

>> sims=[]
>> sims = np.zeros((len(tdMatrix), count))

>> for i in range(len(tdMatrix)):
>> for j in range(count):
>> doc1 = np.asarray(tdMatrix[tdMatrix[i], :].todense()).reshape(-1)
>> doc2 = np.asarray(tdMatrix[tdMatrix[j], :].todense()).reshape(-1)

>> sims[i, j] = fsim(doc1, doc2)

>> print sims
>>def main():

>> file_set=["corpusA.txt","corpusB.txt"]
>> train=[]
>> test=[]

>> for file1 in file_set:
>> s="x"+file1
>> preprocess(file1,s)

>> count_vectorizer = CountVectorizer()
>> m=open("xcorpusA.txt",'r')
>> for i in m:
>>train.append(i.strip())
>> #print doc
>>count_vectorizer.fit_transform(train)
>> m1=open("xcorpusB.txt",'r')
>> for i in m1:
>>test.append(i.strip())

>> freq_term_matrix = count_vectorizer.transform(test)

>> tfidf = TfidfTransformer(norm="l2")
>> tfidf.fit(freq_term_matrix)


>> tf_idf_matrix = tfidf.transform(freq_term_matrix)

>> count=0
>> s=""
>> for i in tf_idf_matrix.toarray():
>> for j in i:
>>count+=1
>> break

>> print type(tf_idf_matrix)
>> print "Results with Cosine Distance Similarity Measure"
>>test(tf_idf_matrix,count,nltkutil.cosine_distance)

>>if __name__ == "__main__":
main()

I am looking for your kind advice.
Thank You,
Saugata

Sujit Pal said...

Hi Saugata, it's a bit painful reading Python code in comments since Blogger so gratuitously removes whitespace. Can you please replace leading spaces with "." when posting (real easy with an editor)? It makes the code much easier to read. Something like this:

for f in ["foo", "bar"]:
..print f

Also the stack trace does not match up with the source code provided (doc2 has j-1 in the trace, j in the code).

One thing to look out for is that your dictionary must contain words from both documents - I see you run both files through preprocess, but without indents I can't tell whether both are being fed into CountVectorizer. An IndexError is likely in such a case.

Saugata Bose said...

Dear Sir,
It's all my pleasure to have your reply.

I am running preprocess() on the 2 files to remove extra spaces, the category, the id, newlines and the Document Separator.

corpusA is the corpus provided by you and corpusB is a bit similar to corpusA. I want to find the cosine similarity without considering the id and category fields.

For finding the tf-idf matrix I followed Christian S. Perone's blog Pyevolve, which you recommended in your post.
I am confused about whether I have constructed test() properly or not.

My code is following:

#! /usr/bin/python -tt

from __future__ import division
from operator import itemgetter

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
import nltk.cluster.util as nltkutil
import numpy as np
import re

def preprocess(fnin, fnout):
....... fin = open(fnin, 'rb')
....... print fin
....... fout = open(fnout, 'wb')
....... buf = []

....... for line in fin:
....line = line.strip()

....if line.find("-- Document Separator --") > -1:
....if len(buf) > 0:
....... body = re.sub("\s+", " ", " ".join(buf))
....... fout.write("%s\n" % (body))
....... rest = map(lambda x: x.strip(), line.split(": "))
....... buf = []
....else:
....buf.append(line)

........fin.close()
........fout.close()

def test(tdMatrix,count,fsim):

........sims=[]
........sims = np.zeros((len(tdMatrix.todense()), count))
........l=len(tdMatrix.todense())

........for i in range(0, l):
........for j in range(0, count):
........doc1 = np.asarray(tdMatrix[i].todense()).reshape(-1)
........doc2 = np.asarray(tdMatrix[j].todense()).reshape(-1)
........sims[i, j] = fsim(doc1, doc2)

........print sims

def main():

........file_set=["corpusA.txt","corpusB.txt"]
........train=[]
........test1=[]
........count_vectorizer = CountVectorizer()

........for file1 in file_set: #both the files are renamed as xcorpusA.txt and xcorpusB.txt
....s="x"+file1
....preprocess(file1,s)

........m=open("xcorpusA.txt",'r')
........for i in m:
....train.append(i.strip())

........count_vectorizer.fit_transform(train)

........m1=open("xcorpusB.txt",'r')
........for i in m1:
....test1.append(i.strip())

........freq_term_matrix = count_vectorizer.transform(test1)

........tfidf = TfidfTransformer(norm="l2")
........tfidf.fit(freq_term_matrix)

........tf_idf_matrix = tfidf.transform(freq_term_matrix)
........print (tf_idf_matrix.toarray())

........count=0 # count will give the length of the tf-idf sublist within list

........for i in tf_idf_matrix.toarray():
........for j in i:
........count+=1
........break

........print "Results with Cosine Distance Similarity Measure"
........test(tf_idf_matrix,count,nltkutil.cosine_distance)

And the error shows here is:

Traceback (most recent call last):
File "3.py", line 103, in
main()
File "3.py", line 99, in main
test(tf_idf_matrix,count,nltkutil.cosine_distance)
File "3.py", line 46, in test
doc2 = np.asarray(tdMatrix[j].todense()).reshape(-1)
File "/usr/lib/python2.7/dist-packages/scipy/sparse/csr.py", line 281, in __getitem__
return self[key,:] #[i] or [1:2]
File "/usr/lib/python2.7/dist-packages/scipy/sparse/csr.py", line 233, in __getitem__
return self._get_row_slice(row, col) #[i,1:2]
File "/usr/lib/python2.7/dist-packages/scipy/sparse/csr.py", line 320, in _get_row_slice
raise IndexError('index (%d) out of range' % i )
IndexError: index (4) out of range

Sujit Pal said...

It's still a bit difficult to read because the indents are incorrect, but I got a bit further this time :-). My understanding is that you are pulling in two files of data with similar structure - running them through preprocess gives you a list of docs, one doc per line, something like this:

docs = preprocess(corpusA)
docs.extend(preprocess(corpusB))

You can then pass this through CountVectorizer to create a matrix.

cv = CountVectorizer()
X = cv.fit_transform(docs)

Here each row of X represents a document. So if you want to compute similarity between a document from corpusA and another document from corpusB, find their row numbers (say 0 and 50), and similarity would be calculated as follows:

doc0 = np.asarray(X[0, :].todense()).reshape(-1)
doc50 = np.asarray(X[50, :].todense()).reshape(-1)
sim = cosine_distance(doc0, doc50)

Since you are getting an error at the np.asarray() call, I would suggest looking at the shape of tdMatrix; perhaps it is not what you imagine, i.e., it's not (l, count). You can verify with:

print np.shape(tdMatrix)

Also, if the indentation you show is accurate, then it is logically incorrect, it should be:

for i in range(0, l):
..for j in range(0, count):
....doc1 = np.asarray(tdMatrix[i].todense()).reshape(-1)
....doc2 = np.asarray(tdMatrix[j].todense()).reshape(-1)
....sims[i, j] = fsim(doc1, doc2)

Hope this helps.

Saugata Bose said...

Dear Sir,

Thank you for your kind response.

Can I ask you this: could I combine an n-gram approach with cosine similarity?

Sujit Pal said...

You are welcome Saugata, glad I could help. And yes, you can combine ngrams and cosine similarity - instead of considering a document as a vector of individual terms, you can consider it to be a vector of ngrams.
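For example, using scikit-learn's ngram_range parameter (the (1, 2) setting and the docA/docB placeholders below are just illustrative):

>>> import numpy as np
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from nltk.cluster.util import cosine_distance
>>> docA = "..."
>>> docB = "..."
>>> # unigrams and bigrams instead of single terms
>>> vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=1)
>>> X = vectorizer.fit_transform([docA, docB])
>>> vecA = np.asarray(X[0, :].todense()).reshape(-1)
>>> vecB = np.asarray(X[1, :].todense()).reshape(-1)
>>> cosine_distance(vecA, vecB)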

Anonymous said...

Dear sir, I have a problem with cosine similarity. I would like to calculate the cosine similarity between a tweet and a document formed by many tweets. How can I do this? Thank you in advance.

Sujit Pal said...

You have to vectorize both documents. There are various vectorizers available in Scikit-Learn that you could use for this. Then you could compute the cosine distance with the built-in function in NLTK. So assuming docA and docB are the two documents available to you as strings, you could use something like the following code snippet to calculate the cosine distance between them.

>>> import numpy as np
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from nltk.cluster.util import cosine_distance
>>> docA = "..."
>>> docB = "..."
>>> vectorizer = CountVectorizer(min_df=1)
>>> X = vectorizer.fit_transform([docA, docB])
>>> vecA = np.array(X[0].todense())[0]
>>> vecB = np.array(X[1].todense())[0]
>>> cosine_distance(vecA, vecB)

hanan said...

Can we use WordNet in this code?

Sujit Pal said...

Yes, absolutely, wordnet could be used to bring in hypernyms or other kinds of related words on both sides to increase recall.

hanan said...

Thanks sir, but I want to know how to write the code to use WordNet in this code.

Sujit Pal said...

I don't have this code, but I think it's a good idea to maybe preprocess the data to expand synonyms and possibly hypernyms in the source and target documents. So assuming you had a sentence like "The boy plays with the car", maybe WordNet can find synonyms for boy and car such as "male human child" and "automobile" respectively. The idea would be to simply add these words, so the document now becomes "The boy male human child plays with the car automobile". To figure out how to get synonyms and hypernyms from WordNet using NLTK, check out the NLTK Wordnet Interface.
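Here is a rough sketch of that kind of expansion using the NLTK WordNet interface (no word sense disambiguation, purely illustrative, so expect some noisy synonyms):

# Rough sketch: expand each token with its WordNet synonym lemmas before
# vectorizing; there is no word sense disambiguation here.
# Note: synset.lemma_names() is the NLTK 3.x API; older NLTK exposes it
# as an attribute instead of a method.
from nltk.corpus import wordnet as wn

def expand_with_synonyms(text):
  expanded = []
  for token in text.lower().split():
    expanded.append(token)
    for synset in wn.synsets(token):
      for lemma_name in synset.lemma_names():
        expanded.append(lemma_name.replace("_", " ").lower())
  return " ".join(expanded)

print(expand_with_synonyms("The boy plays with the car"))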

teddy said...

Can I apply this technique to identify similar documents in a large corpus? Can you please give me some hints?

Sujit Pal said...

This was originally built as a proof of concept to answer a question, but it should be possible to use it on large corpora. Another possibly more practical but less tunable approach that may be more suitable for detecting document similarity is to use a search engine -- search for "more like this" and you will find the theory behind the idea, as well as implementations in Lucene, Solr and ElasticSearch.