Friday, October 21, 2011

Computing Document Similarity using Lucene Term Vectors

Someone asked me a question recently about implementing document similarity, and since he was using Lucene, I pointed him to the Lucene Term Vector API. I hadn't used the API myself, but I knew in general how it worked, so when he came back and told me that he could not make it work, I wanted to try it out for myself, to give myself a basis for asking further questions.

I already had a Lucene index (built by SOLR) of about 3,000 medical articles whose content field had term vectors enabled (something I had turned on while experimenting with highlighting), so I decided to use that. If you want to follow along and have to build your index from scratch, you can either use a field definition in your SOLR schema.xml file similar to this:

   <field name="content" type="text" indexed="true" stored="true" 
       termVectors="true" termPositions="true" termOffsets="true"/>

or enable term vectors in your Lucene indexing code like so:

  doc.add(new Field("content", content, Store.YES, Index.ANALYZED,
    TermVector.WITH_POSITIONS_OFFSETS));

Note that you don't need positions and offsets enabled on the term vectors for what I am doing here, but since I already had them on in my schema.xml file, I decided (to avoid confusion) to show the equivalent Lucene call in the snippet above.
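
If all you need are the term frequencies (which is all the code below uses), plain term vectors should be enough; something like this minimal variant would presumably do:

  doc.add(new Field("content", content, Store.YES, Index.ANALYZED,
    TermVector.YES));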

Once you have the index, find the list of all the terms in the "content" field across the entire index. These terms represent the entries (dimensions) of the document vector. Then, for each document you want to include in the similarity computation, extract its term vector. The term vector gives you two parallel arrays: the terms that occur in the document, and the corresponding frequency of each term in the document. Using these three data structures, it is easy to construct a (sparse) document vector for each document.

Once you have two document vectors, the similarity between them can be computed using Cosine Similarity (or another measure - most of them work with document vectors).
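
For reference, the cosine similarity between two document vectors d1 and d2 is their dot product divided by the product of their magnitudes:

  cosim(d1, d2) = (d1 . d2) / (|d1| * |d2|)

which is exactly what the getCosineSimilarity() method in the code below computes.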

For my test, I chose 4 documents out of my corpus, the titles of which (and their associated Lucene document IDs) are shown below:

Document#0: Mitral valve surgery - minimally invasive (31825)
Document#1: Mitral valve surgery - open (31835)
Document#2: Laryngectomy (31706)
Document#3: Shaken baby syndrome (30)

Notice that the first two are both about surgery and about the same body part, so intuitively I should expect a very high similarity between the documents 0 and 1. Document 2 is still about surgery, so I should expect a good, but smaller similarity value between documents 0 and 2. Document 3 is completely different, so I should expect an even smaller similarity value between documents 0 and 3.

The code for this is shown below (as a JUnit test).

package com.mycompany.termpos;

import java.io.File;
import java.util.HashMap;
import java.util.Map;

import org.apache.commons.math.linear.OpenMapRealVector;
import org.apache.commons.math.linear.RealVectorFormat;
import org.apache.commons.math.linear.SparseRealVector;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.store.FSDirectory;
import org.junit.Assert;
import org.junit.Test;

public class TermVectorBasedSimilarityTest {

  @Test
  public void testSimilarity() throws Exception {
    IndexReader reader = IndexReader.open(
      FSDirectory.open(new File("/path/to/my/index")));
    // first find all terms in the index
    Map<String,Integer> terms = new HashMap<String,Integer>();
    TermEnum termEnum = reader.terms(new Term("content"));
    int pos = 0;
    // reader.terms() positions the enum at the first matching term,
    // so examine the current term before advancing with next(),
    // otherwise the first term in the field is skipped
    do {
      Term term = termEnum.term();
      if (term == null || ! "content".equals(term.field()))
        break;
      terms.put(term.text(), pos++);
    } while (termEnum.next());
    termEnum.close();
    int[] docIds = new int[] {31825, 31835, 31706, 30};
    DocVector[] docs = new DocVector[docIds.length];
    int i = 0;
    for (int docId : docIds) {
      TermFreqVector[] tfvs = reader.getTermFreqVectors(docId);
      Assert.assertTrue(tfvs.length == 1);
      docs[i] = new DocVector(terms); 
      for (TermFreqVector tfv : tfvs) {
        String[] termTexts = tfv.getTerms();
        int[] termFreqs = tfv.getTermFrequencies();
        Assert.assertEquals(termTexts.length, termFreqs.length);
        for (int j = 0; j < termTexts.length; j++) {
          docs[i].setEntry(termTexts[j], termFreqs[j]);
        }
      }
      docs[i].normalize();
      i++;
    }
    // now get similarity between doc[0] and doc[1]
    double cosim01 = getCosineSimilarity(docs[0], docs[1]);
    System.out.println("cosim(0,1)=" + cosim01);
    // between doc[0] and doc[2]
    double cosim02 = getCosineSimilarity(docs[0], docs[2]);
    System.out.println("cosim(0,2)=" + cosim02);
    // between doc[0] and doc[3]
    double cosim03 = getCosineSimilarity(docs[0], docs[3]);
    System.out.println("cosim(0,3)=" + cosim03);
    reader.close();
  }
  
  private double getCosineSimilarity(DocVector d1, DocVector d2) {
    return (d1.vector.dotProduct(d2.vector)) /
      (d1.vector.getNorm() * d2.vector.getNorm());
  }

  private class DocVector {
    public Map<String,Integer> terms;
    public SparseRealVector vector;
    
    public DocVector(Map<String,Integer> terms) {
      this.terms = terms;
      this.vector = new OpenMapRealVector(terms.size());
    }
    
    public void setEntry(String term, int freq) {
      if (terms.containsKey(term)) {
        int pos = terms.get(term);
        vector.setEntry(pos, (double) freq);
      }
    }
    
    public void normalize() {
      double sum = vector.getL1Norm();
      vector = (SparseRealVector) vector.mapDivide(sum);
    }
    
    public String toString() {
      RealVectorFormat formatter = new RealVectorFormat();
      return formatter.format(vector);
    }
  }
}

The results of running this test are shown below, and they match up with my expectations:

  cosim(0,1)=0.9672563304581603
  cosim(0,2)=0.7496880467405691
  cosim(0,3)=0.23968406019250035

Using Lucene's term vectors to generate document vectors can be useful not only for similarity calculations, but for other tasks where you need document vectors, such as clustering and classification. Computing document vectors from raw data is typically more involved (requires development time and can have scalability issues) than loading data into a Lucene index.

The other scalability issue is the size of the document vectors. My index is relatively tiny (about 3,000 documents), but we are still talking about 10,000 unique terms after stopwording and stemming. However, document vectors are typically quite sparse (my test document vectors have an average density of 0.01, ie, only about 100 of the 10,000 entries are nonzero), so an OpenMapRealVector is a good choice of data structure, and should scale to much larger sizes as well.

38 comments (moderated to prevent spam):

Anonymous said...

In this example there are only 4 documents being tested. But when I tried to compute the cosine similarity of about 100 documents from my Lucene index, I got an error:


"Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at org.apache.lucene.index.TermVectorsReader.readTermVector(TermVectorsReader.java:499)
at org.apache.lucene.index.TermVectorsReader.readTermVectors(TermVectorsReader.java:383)
at org.apache.lucene.index.TermVectorsReader.get(TermVectorsReader.java:345)
at org.apache.lucene.index.SegmentReader.getTermFreqVectors(SegmentReader.java:813)
at org.apache.lucene.index.DirectoryReader.getTermFreqVectors(DirectoryReader.java:502)"

I don't know what is happening. Is there any solution to this?
I wanted to find the cosine similarity between all the documents in my index.

Sujit Pal said...

The DocVector class in the example is inefficient for a large number of documents. Each instance creates its own terms map, so try pulling this out into a static shared map, or operate on the SparseRealVector directly.
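
For example, something along these lines (an untested sketch; buildDocVector() is a hypothetical helper extracted from the loop in the post):

  // untested sketch: build one vector at a time against a single shared
  // terms map, so only two vectors are alive per similarity computation
  private double pairSimilarity(IndexReader reader,
      Map<String,Integer> terms, int docId1, int docId2) throws Exception {
    DocVector d1 = buildDocVector(reader, terms, docId1);
    DocVector d2 = buildDocVector(reader, terms, docId2);
    return getCosineSimilarity(d1, d2); // vectors become garbage after this
  }

  private DocVector buildDocVector(IndexReader reader,
      Map<String,Integer> terms, int docId) throws Exception {
    DocVector doc = new DocVector(terms);
    TermFreqVector tfv = reader.getTermFreqVector(docId, "content");
    String[] termTexts = tfv.getTerms();
    int[] termFreqs = tfv.getTermFrequencies();
    for (int j = 0; j < termTexts.length; j++) {
      doc.setEntry(termTexts[j], termFreqs[j]);
    }
    doc.normalize();
    return doc;
  }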

Pawan Kumar said...

Thanks for sharing this great piece of information. It is very helpful for me.

Sujit Pal said...

You are welcome, Pawan.

Nesli said...

Hello,

I have to get the similarities between documents. Particularly between movie-descriptions, to generate recommendations.

I have created an index with Solr. (With Lucene it was too difficult for me; I'm a novice.)

I have tried to get the cosine similarity with this code, but there are some failures.

For the cosine test I have indexed 3 example documents (movie contents) with only two fields: id and content.

First, what is the difference between the types "text" and "text_general"? When I tried to set the type to "text", Solr didn't start, so I changed it to text_general.

In FSDirectory.open(new File("/path/to/my/index")) I've set the path to the Solr index folder "apache-solr-3.5.0\\example\\solr\\data\\index".

Is that where the mistake is?

Do you have any tips?
Please help.

Sujit Pal said...

Hi Nesli, in the SOLR schema example, the text_general field type is missing the termVectors setting, which is required by the code (since it reads term vectors, which must be present).

But since you are using SOLR, and all you care about is showing "similar" or "more-like-this" docs, you may want to consider just using SOLR's built-in More Like This (MLT) feature. That may be easier (it also requires term vectors, but no coding, only configuration).
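
For reference, a basic MLT request against Solr 3.x might look something like this (the id value and the parameter choices here are just placeholders):

  http://localhost:8983/solr/select?q=id:123&mlt=true&mlt.fl=content&mlt.mintf=1&mlt.mindf=1&mlt.count=5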

Anonymous said...

Hi Sujit,
many thanks for your help. I have tried it with MLT. It was helpful; I received similar movies.
But for my thesis I have to select the tf-idf values of the movie descriptions to compute the cosine similarity myself.
I must write a cosine similarity calculator in Java anyway.
My current status is that I get the tf-idf values from the Solr XML response and put them in hash maps.
Now my problem is: every movie has a genre (often more than one).
It's like the "cat" field (category) in the exampledocs with ipod/monitor etc., and it's an important point.
How can I integrate this factor?
If I don't involve the genre of a movie, I will compute the similarity based only on its description terms.
Do you have an approach? Or should I just enable the "termVector" attribute as with the film description,
and add the "genre" term values to my computation?
Regards

Sujit Pal said...

Hi, thanks, and you are welcome. Regarding the genre, it really depends on the application, but one approach could be to compute an additional measure of similarity based on genre overlap (something like the Jaccard coefficient), and then make your similarity = a * cosine_similarity + genre_similarity, where a is a weighting factor (maybe 1).
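
A quick (untested) sketch of what the genre part might look like, assuming each movie's genres are available as a java.util.Set of strings:

  // Jaccard coefficient over two genre sets: |intersection| / |union|
  private double genreSimilarity(Set<String> g1, Set<String> g2) {
    Set<String> intersection = new HashSet<String>(g1);
    intersection.retainAll(g2);
    Set<String> union = new HashSet<String>(g1);
    union.addAll(g2);
    return union.isEmpty() ? 0.0D :
      (double) intersection.size() / (double) union.size();
  }

  // combined score, with a as the application-specific weighting factor
  double similarity = a * cosineSimilarity + genreSimilarity(g1, g2);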

Unknown said...

Hi Sujit Pal,

Right now I am working with Nutch 1.4. How can I achieve something similar in Nutch? I want access to the term frequency vector corresponding to each document while crawling (not indexing).

I believe the Nutch indexer must be using term vectors for finding the scores. As indexing has now moved to SOLR, where can I find the code for this - Nutch or SOLR?
Please help...

Sujit Pal said...

Hi Unknown, you can do the same thing with Nutch's Solr. You need to enable termVectors on the "content" field, as noted towards the beginning of the post. Underlying Solr is a Lucene index, so you can run the code against the underlying index to find inter-document similarity. But this is /after/ indexing, rather than during crawling or indexing as you are looking for.

I don't think there is a way to do this while crawling in Nutch (or any crawler), as far as I know. As for scoring, it's all governed by Lucene, and you can modify it to some extent by overriding Lucene's Similarity, but the default implementation is based on cosine similarity and does a pretty good job for most use cases.

vijith said...

Thanks for the reply. Pardon me if this is out of context.

Since I am doing focused crawling with Nutch, I need to find the similarity during the crawl to update the score (set by the OPIC plugin).

So I thought of implementing the similarity calculation in a scoring plugin, and I want to see the code in Solr before I do that.

Sujit Pal said...

Hi Vijith, already replied on the discussion thread on the gora mailing list.

Aida said...

Oh man, whenever I needed to get something done with Lucene, sooner or later I ended up on your blog. From now on, I'll not Google any more, but come straight here :)
Thanks for all the articles, I learned a lot from them!

Sujit Pal said...

Thanks Aida, glad it helped.

Anonymous said...

Hi Sujit,

Thank you so much for posting this article, just what I was looking for... I love your blog posts - they make the concepts easy to understand.

I do have a question - I have been trying MoreLikeThis (MLT) within Solr. In MLT, how can the similarity scores used for the calculation be retrieved?

Say, if the 4 articles were added to Solr, how can the same end-result you have achieved be achieved within Solr? I.e., how can I use MLT to get
Article1 is related to Article2 by <<score>>
Article1 is related to Article3 by <<score>>
etc.

Thanks in advance for your help.

Sujit Pal said...

Thanks Anonymous. For the MLT score shouldn't this be possible by explicitly declaring the required fields from the similar docs (using fl=score,...)?

I just found a very interesting writeup on how MLT works; it has some more configurable parameters than my approach, such as the ability to drop words below a certain length, etc. So MLT is probably preferable to my approach, unless you are just trying to figure out what's going on.

astro said...

Is the normalization necessary?
What happens if we do not call
docs[i].normalize()?

Sujit Pal said...

I think it is, and here's why: docs[i].normalize() scales all documents down to a unit size, so the effect of differences in document length is removed. Usually the similarity is used to answer the question "how different thematically is this document from this other document", and document size is not important in the answer. (Strictly speaking, since getCosineSimilarity() already divides by the norms of both vectors, the cosine values come out the same with or without the normalize() call; the explicit normalization matters when the vectors are fed to measures that are not scale-invariant.)

Raje said...

Thanks for your article!

I'm doing a project based on XML!

I want to calculate the weight of the XML document!

Is it possible by means of this cosine similarity?

Sujit Pal said...

Hi Raje, you are welcome. Cosine similarity measures the similarity between two documents, or between a query and a document. To calculate the absolute weight of a document, you could probably just use the square root of the sum of the squares of its individual term dimensions (think Pythagoras' theorem in n dimensions). Since you mention XML, I am guessing there is something specific to XML, perhaps tag boosts (ie, weight words occurring within <title> tags higher, etc), that you will also want to consider.
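
In symbols, the weight of a document vector d with term weights w_1 through w_n would be:

  |d| = sqrt(w_1^2 + w_2^2 + ... + w_n^2)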

Unknown said...

Hi Sujit,

Can you suggest how I would do the same thing but use terms from two fields in each document?

Peter

Sujit Pal said...

Already answered via email, but putting it in here also, because the question was interesting and for others who may have similar questions. One way could be to union the {term, freq} maps for "content" and "otherfield", and then use the resulting map for the DocVector constructor. Then change the DocVector.setEntry() method to be additive, ie, if a term already exists, its frequency is added to rather than reset.
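
The additive version of setEntry() could look something like this (untested sketch, using the DocVector class from the post):

  public void setEntry(String term, int freq) {
    if (terms.containsKey(term)) {
      int pos = terms.get(term);
      // add to any existing entry rather than overwriting it, so that
      // frequencies from multiple fields accumulate
      vector.setEntry(pos, vector.getEntry(pos) + (double) freq);
    }
  }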

Yolanda Sjafrizal said...

Thank you very much for your post. It is very helpful for my final project.

I am very new to using Lucene, so my question may be a little silly.
I am using Lucene 3.6.1, and when I use this program, it says it cannot find the class SparseRealVector.

Is it because I am using this version of Lucene?

Praveen said...

Hi,

I have a query somewhat similar to this post. I would like to interact with the Lucene APIs using term vectors for both indexing and querying.

Let me clarify my problem. Suppose I have a term vector representation for each document in my corpus:

doc_1 = {(vocab_1, floatVal_1); (vocab_2, floatVal_2); ... (vocab_k, floatVal_k)}
.
.
doc_n = {(vocab_1, floatVal_1); (vocab_2, floatVal_2); ... (vocab_k, floatVal_k)}

Here "vocab_1" is a term from the vocabulary, "floatVal_1" is the custom normalized score calculated for the corresponding term, and "k" is the size of the vocabulary.

Once I have created the index, I would like to query it in a similar fashion, where I produce a term vector representation for the query and retrieve the results.

I searched the internet for help on this topic and got to know about the use of payloads in Lucene, especially for indexing, thinking that payloads could be a substitute for the floatVals that I computed. But it seems that payloads are not supported for querying purposes. Altogether, I am not really convinced I should use payloads, as they are quite memory intensive.

Thanks in advance.

Sujit Pal said...

Hi Praveen, if I understand correctly, the functionality you are looking for is natively supported by Lucene - it will allow you to search for /any/ word in your documents and retrieve documents by score. If you would like to restrict this to only the words in your controlled vocabulary, you could build a custom TokenFilter that only passes these words (or their stems) through, and include it in your analyzer chain. The scoring in this case would be Lucene's default. If you want custom scores for your CV terms, take a look at function queries; they let you modify the ranking based on field contents. Payloads would only be needed if you wanted to score metadata (such as annotations) differently from the main text.
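
A minimal (untested) sketch of such a TokenFilter against the Lucene 3.x API, where vocab is your set of controlled vocabulary terms:

import java.io.IOException;
import java.util.Set;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class VocabFilter extends TokenFilter {

  private final Set<String> vocab;
  private final CharTermAttribute termAttr =
    addAttribute(CharTermAttribute.class);

  public VocabFilter(TokenStream input, Set<String> vocab) {
    super(input);
    this.vocab = vocab;
  }

  @Override
  public boolean incrementToken() throws IOException {
    // pass through only those tokens that appear in the vocabulary
    while (input.incrementToken()) {
      if (vocab.contains(termAttr.toString())) {
        return true;
      }
    }
    return false;
  }
}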

eli said...

Hello all,
I have built an index in Lucene.Net, and I want to get the cosine similarity between two documents. Unfortunately, I cannot find any example code on the internet, and I need to do it by next week. Please help me - even a simple example would help.

Sujit Pal said...

Hi Eli, take a look at this StackOverflow page. It has references to this page, as well as a version written against the Lucene 4.x API, one that just uses the Lucene stemmer to create tokens for vectorization, and a few other similar implementations. All the code is in Java; you would have to translate it to the .NET language of your choice.

Unknown said...

Hello, thanks for your example. Your code calculates the similarity score. If I want to extract similar phrases from a document, how can I do that? Please help me with this; I am making use of Lucene in my Java code.

Sujit Pal said...

Hi Veena, you are welcome. In order to extract similar phrases, you will first have to find phrases. The easiest way I can think of is to run it through some phrase chunker such as OpenNLP's or LingPipe's. Once you have these, you could compare the two sets of phrases (this is O(m*n) where there are m phrases in the first document and n phrases in the second). If you want to do this against the full corpus, then you can find all the (unique) phrases across all the documents, then find the ones that are most similar (O(n**2) for n phrases). If you want a Lucene approach, then maybe try n-grams (for n=2..5 or so) and find the most statistically significant phrases as your input instead.

Unknown said...

Thank you for your response, sir. I don't have much knowledge of language
processing, so I am finding it very tricky. I am doing autocompletion
using Lucene. I search for queries from a document, and whatever I
searched is stored in a DB. Now, based on whatever search results are
in the DB, I want to extract similar terms from the document, or rank them according to the cosine similarity score.

Sujit Pal said...

Can you elaborate a bit on your setup? I don't think I understand it enough to provide an answer. From what I understand, you have a list of queries that you are passing to Lucene's autocomplete, and you are capturing the results (doc_ids?) to a database. What fields are being captured? Then you want to find similar terms among these doc_ids? By query, perhaps? To do that you would need to find the terms in each of these documents - perhaps consider using Lucene's IndexReader.getTermFreqVector(docId, fieldName)?
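
For reference, the single-document, single-field version of the term vector API used in the post looks like this:

  TermFreqVector tfv = reader.getTermFreqVector(docId, "content");
  String[] termTexts = tfv.getTerms();        // terms in this document
  int[] termFreqs = tfv.getTermFrequencies(); // corresponding frequencies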

Ritesh said...

Hello Sir,

I have experience in data analysis but am new to the concept of NLP.

I am doing a project in NLP; the details are below.

I have a set of Wikipedia databases, and the articles in them are in different languages. The task is to test how neutral Wikipedia is.

For example, an article about the debt crisis in Europe on the English Wikipedia might have different views from an article on the same topic on the German Wikipedia.

My task is to find how similar two documents are. First I have to do this for two articles, and then replicate the same over the whole of Wikipedia.

Can you please guide how should i approach the problem?

Thanks,
Ritesh

Sujit Pal said...

At a very high level, it looks like the objective is to test the sentiment of similar (or parallel) articles across different communities (modeled by native language). Your individual circumstances will probably provide better guidance (for example, if this is a student project, your advisor may be a better resource to consult). However, from my POV, there are two things you will need to do here: first, find groups of articles (pairs in your example, but maybe larger sets if you are planning to consider multiple languages) across languages which are similar, and second, calculate sentiment across each of these articles. Here are some really simple (and probably naive) approaches to both. For the first, if you have interlanguage dictionaries (English-to-German or vice versa in your example), you could consider the articles as bags of words and "reduce" the German one to an equivalent English one by dictionary lookups (drop words not found and don't worry about alignments), then do a cosine similarity and consider them the same article if the similarity is very high. For sentiment, the simplest approach is to use sentiment word lists in your given language and compute the positivity/negativity as a rollup of the sentiments of the words found in the article.

Unknown said...

Hi Sujit, I have a Hive table with a column containing large strings; each string is a crawled web page. I have to compare all the strings in the column and find the strings which match (like comparing web pages and finding pages with the same data written differently).
I am thinking of using Solr with MLT, or Lucene with cosine similarity. As I am new to NLP, could you suggest a way?

Thanks in advance.

Sujit Pal said...

Hi Subash, MLT should work. You will have to import the strings into a Solr/Lucene index and run the MLT query against one of them, you should get a ranked list of "similar" docs. You will have to tune how far you want to go down the list depending on your application. You could use a more expensive (and accurate) operation to do this since you are working with a smaller set of docs.

Anonymous said...

How can I get term-term similarity?

Sujit Pal said...

It's been a while since I wrote this, and looking today I see that the API seems to have changed quite a bit. However, AFAIK, I don't think there is any specific API you could call for this. If you are looking at term-term similarity as, say, the similarity over the docID vectors in which the terms are found, you could do two searches, get back the top N for each, then union the docIDs. The union is your vector space; you can build vectors for the two terms in this space and then apply some similarity metric like cosine or Jaccard.
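
An (untested) sketch of that idea, where topDocIds() is a hypothetical helper that runs a search and collects the docIDs of the top N hits:

  // represent each term by the set of docIDs among its top N hits,
  // then measure the overlap with the Jaccard coefficient
  Set<Integer> docs1 = topDocIds(searcher, "term1", 100); // hypothetical helper
  Set<Integer> docs2 = topDocIds(searcher, "term2", 100);
  Set<Integer> intersection = new HashSet<Integer>(docs1);
  intersection.retainAll(docs2);
  Set<Integer> union = new HashSet<Integer>(docs1);
  union.addAll(docs2);
  double jaccard = (double) intersection.size() / (double) union.size();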

shanshan said...

Hi Praveen, how did you finally solve this problem?
Thanks in advance.