Someone recently asked me about using Python to calculate document similarity across text documents. The application had to do with cheating detection, i.e., comparing student transcripts and flagging documents with abnormally high similarity for further investigation. For security reasons, I could not get access to actual student transcripts. But the basic idea was to convince ourselves that this approach is valid, and to come up with a code template for doing this.
I have been playing quite a bit with NLTK lately, but for this work, I decided to use the Python ML Toolkit Scikit-Learn, which has pretty powerful text processing facilities. I did end up using NLTK for its cosine similarity function, but that was about it.
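As an aside, NLTK's cosine_distance(u, v) simply returns 1 minus the cosine of the angle between the two vectors. A minimal NumPy equivalent (my own sketch, not NLTK's implementation) looks like this:

```python
import numpy as np

def cosine_distance(u, v):
    # 1 - (u . v) / (|u| * |v|), the quantity that
    # nltk.cluster.util.cosine_distance returns
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# identical direction -> distance ~0, orthogonal -> distance 1
print(cosine_distance(np.array([1.0, 1.0]), np.array([2.0, 2.0])))  # ~0.0
print(cosine_distance(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 1.0
```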
I decided to use the 53-document coffee-sugar-cocoa mini-corpus to test out the code. I first found this corpus in Dr Manu Konchady's TextMine project, and I have used it off and on since. For convenience, I have made it available at the GitHub location for the sub-project.
Here's the code. It first parses the data into a temporary one-line-per-file version, then vectorizes each line and creates a term-document matrix for the corpus.
from __future__ import division

from operator import itemgetter
from nltk.cluster.util import cosine_distance
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline

import numpy as np
import random
import re

def preprocess(fnin, fnout):
    fin = open(fnin, 'rb')
    fout = open(fnout, 'wb')
    buf = []
    id = ""
    category = ""
    for line in fin:
        line = line.strip()
        if line.find("-- Document Separator --") > -1:
            if len(buf) > 0:
                # write out body
                body = re.sub("\s+", " ", " ".join(buf))
                fout.write("%s\t%s\t%s\n" % (id, category, body))
            # process next header and init buf
            id, category, rest = map(lambda x: x.strip(), line.split(": "))
            buf = []
        else:
            # process body
            buf.append(line)
    fin.close()
    fout.close()

def train(fnin):
    docs = []
    cats = []
    fin = open(fnin, 'rb')
    for line in fin:
        id, category, body = line.strip().split("\t")
        docs.append(body)
        cats.append(category)
    fin.close()
    pipeline = Pipeline([
        ("vect", CountVectorizer(min_df=0, stop_words="english")),
        ("tfidf", TfidfTransformer(use_idf=False))])
    tdMatrix = pipeline.fit_transform(docs, cats)
    return tdMatrix, cats

def test(tdMatrix, cats):
    testIds = random.sample(range(0, len(cats)), int(0.1 * len(cats)))
    testIdSet = set(testIds)
    refIds = filter(lambda x: x not in testIdSet, range(0, len(cats)))
    sims = np.zeros((len(testIds), len(refIds)))
    for i in range(0, len(testIds)):
        for j in range(0, len(refIds)):
            doc1 = np.asarray(tdMatrix[testIds[i], :].todense()).reshape(-1)
            doc2 = np.asarray(tdMatrix[refIds[j], :].todense()).reshape(-1)
            sims[i, j] = cosine_distance(doc1, doc2)
    for i in range(0, sims.shape[0]):
        xsim = list(enumerate(sims[i, :]))
        sortedSims = sorted(xsim, key=itemgetter(1), reverse=True)[0:5]
        sourceCat = cats[testIds[i]]
        numMatchedCats = 0
        numTestedCats = 0
        for j, score in sortedSims:
            targetCat = cats[j]
            if sourceCat == targetCat:
                numMatchedCats += 1
            numTestedCats += 1
        print("Test Doc: %d, Source Category: %s, Target Matched: %d/%d times" %
            (i, sourceCat, numMatchedCats, numTestedCats))

def main():
    preprocess("sugar-coffee-cocoa-docs.txt", "sccpp.txt")
    tdMatrix, cats = train("sccpp.txt")
    test(tdMatrix, cats)

if __name__ == "__main__":
    main()
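To see what the vectorization stage produces, here is a standalone toy sketch (my own, re-implementing the idea in plain NumPy rather than calling Scikit-Learn): with use_idf=False and its default norm="l2", the TfidfTransformer simply L2-normalizes the raw term counts coming out of CountVectorizer.

```python
import numpy as np

docs = ["sugar price rises", "cocoa price falls", "sugar and cocoa exports"]

# build the vocabulary, as CountVectorizer would
# (no stopword removal in this toy version)
vocab = sorted(set(w for d in docs for w in d.split()))
counts = np.array([[d.split().count(w) for w in vocab] for d in docs],
                  dtype=float)

# TfidfTransformer(use_idf=False) with its default norm="l2" just
# L2-normalizes each row of raw term counts
tdMatrix = counts / np.linalg.norm(counts, axis=1, keepdims=True)

print(vocab)
print(tdMatrix)  # each row now has unit length
```

Feeding the real corpus through the actual Pipeline yields the same kind of matrix, just sparse and with stopwords removed.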
The code then pulls out five random documents, finds the five most similar documents to each, and counts how many of those are in the same category as the source document. As you can see, the results don't look too bad.
sujit@cyclone:docsim$ python docsim.py
Test Doc: 0, Source Category: coffee, Target Matched: 3/5 times
Test Doc: 1, Source Category: cocoa, Target Matched: 2/5 times
Test Doc: 2, Source Category: sugar, Target Matched: 3/5 times
Test Doc: 3, Source Category: cocoa, Target Matched: 1/5 times
Test Doc: 4, Source Category: sugar, Target Matched: 2/5 times
A few things I tried were turning IDF on (it is currently off) and using Euclidean distance instead of cosine distance. The first actually made the results slightly worse over a set of runs, and the second did not seem to have any effect.
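For anyone repeating these experiments: IDF is a one-flag change, and the null result for Euclidean distance has a simple explanation. The TfidfTransformer L2-normalizes the document vectors, and for unit vectors the squared Euclidean distance is exactly twice the cosine distance, so both rank documents identically. A quick sketch with toy unit vectors of my own:

```python
import numpy as np

# In the pipeline above, IDF weighting is toggled with the constructor flag:
#   TfidfTransformer(use_idf=True)

def euclidean_distance(u, v):
    return np.sqrt(np.sum((u - v) ** 2))

# for unit-length vectors, cos(u, v) is just the dot product, and
# ||u - v||^2 == 2 * (1 - cos(u, v)), so Euclidean distance is a
# monotone function of cosine distance
u = np.array([0.6, 0.8])
v = np.array([0.8, 0.6])
print(euclidean_distance(u, v) ** 2)  # ~0.08
print(2 * (1 - np.dot(u, v)))         # ~0.08, the same quantity
```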
One thing to note is that this approach is unlikely to scale. Computing document-to-document similarity across the entire corpus (as you would need to do for a group of students) is an O(n²) operation. I have recently been reading up on Locality Sensitive Hashing, which can mitigate this problem by detecting document clumps directly. Another approach may be to cluster the documents first and only compute inter-document similarities within each cluster.
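To make the LSH idea concrete, here is a toy MinHash sketch (my own illustration, not part of the code above): each document gets a small fixed-size signature, the fraction of agreeing signature positions estimates Jaccard similarity, and near-duplicate candidates can then be bucketed by signature bands instead of comparing every pair.

```python
import random

def minhash_signature(tokens, num_hashes=64, seed=42):
    # one signature per document: for each of num_hashes salted hash
    # functions, keep the minimum hash value over the document's tokens
    rnd = random.Random(seed)
    salts = [rnd.getrandbits(32) for _ in range(num_hashes)]
    return [min(hash((salt, t)) for t in tokens) for salt in salts]

def estimated_jaccard(sig1, sig2):
    # fraction of agreeing positions estimates Jaccard similarity
    return sum(1 for a, b in zip(sig1, sig2) if a == b) / len(sig1)

d1 = "sugar prices rose sharply in march".split()
d2 = "sugar prices rose sharply in april".split()
d3 = "cocoa output fell after the drought".split()
s1, s2, s3 = (minhash_signature(set(d)) for d in (d1, d2, d3))
print(estimated_jaccard(s1, s2))  # near-duplicates score high
print(estimated_jaccard(s1, s3))  # unrelated documents score near 0
```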
And I guess that's it for today. The approach I took with Scikit-Learn is heavily inspired by the code in the text processing tutorial linked to above. For a slightly more math- and code-heavy treatment of text processing with Scikit-Learn, check out Christian Perone's awesome blog posts here and here.
Update (2013-05-08) - I've been reading Building Search Applications: Lucene, LingPipe and Gate by Dr Manu Konchady (henceforth referred to as the LLG book in this blog), and I came upon the SCAM (Stanford Copy Analysis Mechanism) algorithm for plagiarism detection, which seemed better suited to my cheating-detection domain than cosine similarity. Here is some Python code that implements the SCAM measure between two documents. This code, and the update to the original code to use it, can be found on my GitHub page for this subproject.
from __future__ import division

import numpy as np
import scipy.sparse as ss

def _s_pos_or_zero(x):
    return x if x > 0 else 0

def _s_zero_mask(x, y):
    return 0 if y == 0 else x

def _s_safe_divide(x, y):
    return 0 if x == 0 or y == 0 else x / y

_v_pos_or_zero = np.vectorize(_s_pos_or_zero)
_v_zero_mask = np.vectorize(_s_zero_mask)
_v_safe_divide = np.vectorize(_s_safe_divide)

def _assymetric_subset_measure(doc1, doc2):
    epsilon = np.ones(doc1.shape) * 2
    filtered = _v_pos_or_zero(epsilon -
        (_v_safe_divide(doc1, doc2) + _v_safe_divide(doc2, doc1)))
    zdoc1 = _v_zero_mask(doc1, filtered)
    zdoc2 = _v_zero_mask(doc2, filtered)
    return np.sum(np.dot(zdoc1, zdoc2)) / np.sum(np.dot(doc1, doc2))

def scam_distance(doc1, doc2):
    asm12 = _assymetric_subset_measure(doc1, doc2)
    asm21 = _assymetric_subset_measure(doc2, doc1)
    return max(asm12, asm21)
I decided to check this out by calculating the SCAM distance (which should really be called SCAM similarity, since higher values indicate closeness) between my previous post and this post without the update, and between this post without the update and with it. My expectation was that the score for the first pair should be lower than that for the second pair, and indeed it is, giving 0.655172413793 and 0.954821894005 respectively.