Saturday, September 27, 2008

IR Math with Java : Similarity Measures

Last week, I wrote about building term document matrices based on Dr. Manu Konchady's Text Mining Application Programming book. This week, I continue working on computing some similarity metrics from the same set of data. Most of the material is covered in Chapter 8 of the book.

As before, our document collection consists of 7 document titles as shown below:

D1      Human machine interface for computer applications
D2      A survey of user opinion of computer system response time
D3      The EPS user interface management system
D4      System and human system engineering testing of EPS
D5      The generation of random, binary and ordered trees
D6      The intersection graph of paths in trees
D7      Graph minors: A survey

The document collection is first converted to a raw term frequency matrix. This is further normalized using either TF/IDF or LSI. The matrix thus normalized is the input to our similarity computations.

A document is represented by a column of data in our (normalized) term document matrix. To compute the similarity between two documents, we perform computations between two columns of the matrix. Conceptually, each document is a point in an n-dimensional term space, where each term corresponds to a dimension. So in our example we have 23 terms, and therefore our term space is 23-dimensional. In case this is hard to visualize (it was for me), it helps to think of documents with 2 terms (and hence a two-dimensional term space), and extrapolate from there.

Jaccard Similarity

Jaccard Similarity is a simple and very intuitive measure of document to document similarity. It is defined as follows:

  similarity(A,B) = n(A ∩ B) / n(A ∪ B)

Using the Inclusion-Exclusion Principle, this reduces to:

  similarity(A,B) = n(A ∩ B) / (n(A) + n(B) - n(A ∩ B))
  where:
    n(A ∩ B) = Σ min(Ai, Bi)
    n(A) = Σ Ai
    n(B) = Σ Bi
  for i = [0..n-1], where n = number of terms in our term-document matrix.
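
To make the formula concrete, here is a small worked example (my own numbers, not from the book's data) over a three-term vocabulary, with term frequency vectors A = (1, 2, 0) and B = (0, 1, 1):

  n(A ∩ B) = min(1,0) + min(2,1) + min(0,1) = 1
  n(A) = 1 + 2 + 0 = 3, n(B) = 0 + 1 + 1 = 2
  similarity(A,B) = 1 / (3 + 2 - 1) = 0.25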

AbstractSimilarity.java

The AbstractSimilarity class shown below contains the code to compare all document pairs in our collection. This has a complexity of O(n²), which we could roughly halve by exploiting the facts that similarity(A,B) == similarity(B,A) and similarity(A,A) == 1.0, but which is still too high for large numbers of documents. I haven't added these optimizations here, since this is just toy data and the performance is tolerable, but a sketch of the symmetric variant appears after the class.

// Source: src/main/java/com/mycompany/myapp/similarity/AbstractSimilarity.java
package com.mycompany.myapp.similarity;

import org.apache.commons.collections15.Transformer;

import Jama.Matrix;

public abstract class AbstractSimilarity 
    implements Transformer<Matrix, Matrix> {

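  // Compares every pair of document columns in the term-document matrix,
  // returning a numDocs x numDocs matrix of pairwise similarity scores.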
  public Matrix transform(Matrix termDocumentMatrix) {
    int numDocs = termDocumentMatrix.getColumnDimension();
    Matrix similarityMatrix = new Matrix(numDocs, numDocs);
    for (int i = 0; i < numDocs; i++) {
      Matrix sourceDocMatrix = termDocumentMatrix.getMatrix(
        0, termDocumentMatrix.getRowDimension() - 1, i, i); 
      for (int j = 0; j < numDocs; j++) {
        Matrix targetDocMatrix = termDocumentMatrix.getMatrix(
          0, termDocumentMatrix.getRowDimension() - 1, j, j);
        similarityMatrix.set(i, j, 
          computeSimilarity(sourceDocMatrix, targetDocMatrix));
      }
    }
    return similarityMatrix;
  }

  protected abstract double computeSimilarity(
      Matrix sourceDoc, Matrix targetDoc);
}
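
If the quadratic number of comparisons ever becomes a problem, the symmetry noted above can be exploited by computing only the upper triangle and mirroring it. Below is a minimal sketch of such a variant (the method name transformSymmetric is mine, not part of the original code; it also assumes the measure is normalized so that similarity(A,A) == 1.0, which holds for the Jaccard and Cosine measures used in this post):

  // Computes each document pair only once, mirroring the result across
  // the diagonal, and fills the diagonal with 1.0 directly.
  public Matrix transformSymmetric(Matrix termDocumentMatrix) {
    int numDocs = termDocumentMatrix.getColumnDimension();
    int lastRow = termDocumentMatrix.getRowDimension() - 1;
    Matrix similarityMatrix = new Matrix(numDocs, numDocs);
    for (int i = 0; i < numDocs; i++) {
      similarityMatrix.set(i, i, 1.0D);
      Matrix sourceDocMatrix = termDocumentMatrix.getMatrix(0, lastRow, i, i);
      for (int j = i + 1; j < numDocs; j++) {
        Matrix targetDocMatrix = termDocumentMatrix.getMatrix(0, lastRow, j, j);
        double sim = computeSimilarity(sourceDocMatrix, targetDocMatrix);
        similarityMatrix.set(i, j, sim);
        similarityMatrix.set(j, i, sim);
      }
    }
    return similarityMatrix;
  }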

JaccardSimilarity.java

For the JaccardSimilarity class, we define the abstract method computeSimilarity() as shown below (based on our formula above). I have cheated here slightly: in Jama, norm1() is defined as the maximum column sum, but since our document matrices are column matrices (ie, have a single column) and our term weights are never negative, norm1() returns the same value as the sum of all elements in the document matrix. This does make the code less readable, in the sense that the reader/maintainer has to go through an extra mental hoop, but it keeps the code concise.

// Source: src/main/java/com/mycompany/myapp/similarity/JaccardSimilarity.java
package com.mycompany.myapp.similarity;

import Jama.Matrix;

public class JaccardSimilarity extends AbstractSimilarity {

  @Override
  protected double computeSimilarity(Matrix source, Matrix target) {
    // n(A ∩ B) = Σ min(Ai, Bi)
    double intersection = 0.0D;
    for (int i = 0; i < source.getRowDimension(); i++) {
      intersection += Math.min(source.get(i, 0), target.get(i, 0));
    }
    if (intersection > 0.0D) {
      // n(A ∪ B) = n(A) + n(B) - n(A ∩ B); norm1() is the column sum
      // of our single-column document matrices
      double union = source.norm1() + target.norm1() - intersection;
      return intersection / union;
    } else {
      return 0.0D;
    }
  }
}

The result of our Jaccard similarity computation is shown below. The JUnit test that produces this output is SimilarityTest.java, shown later in this post. The two matrices are derived from our raw term document matrix normalized using TF/IDF and LSI respectively. As you can see, the similarity matrix computed from the LSI-normalized matrix shows more inter-document similarity than the one computed from the TF/IDF-normalized matrix.

=== Jaccard Similarity (TF/IDF) ===
            D1      D2      D3      D4      D5      D6      D7
    D1  1.0000  0.0000  0.0000  0.0818  0.0000  0.0000  0.0000
    D2  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000  0.0710
    D3  0.0000  0.0000  1.0000  0.2254  0.0000  0.0000  0.0000
    D4  0.0818  0.0000  0.2254  1.0000  0.0000  0.0000  0.0000
    D5  0.0000  0.0000  0.0000  0.0000  1.0000  0.0000  0.0000
    D6  0.0000  0.0000  0.0000  0.0000  0.0000  1.0000  0.1781
    D7  0.0000  0.0710  0.0000  0.0000  0.0000  0.1781  1.0000

=== Jaccard Similarity (LSI) ===
            D1      D2      D3      D4      D5      D6      D7
    D1  1.0000  0.0000  1.0000  1.0000  0.0000  0.0000  0.0000
    D2  0.0000  1.0000  0.0000  0.0000  1.0000  1.0000  1.0000
    D3  1.0000  0.0000  1.0000  1.0000  0.0000  0.0000  0.0000
    D4  1.0000  0.0000  1.0000  1.0000  0.0000  0.0000  0.0000
    D5  0.0000  1.0000  0.0000  0.0000  1.0000  1.0000  1.0000
    D6  0.0000  1.0000  0.0000  0.0000  1.0000  1.0000  1.0000
    D7  0.0000  1.0000  0.0000  0.0000  1.0000  1.0000  1.0000

Cosine Similarity

Cosine Similarity is the more popular but also a slightly more complex measure of similarity. It is defined as:

  similarity(A,B) = cos θ = (A ⋅ B) / (|A| * |B|)
  where:
    A ⋅ B = Σ Ai * Bi
    |A| = sqrt(Σ Ai²)
    |B| = sqrt(Σ Bi²)
  for i = [0..n-1], where n = number of terms in our term-document matrix.

Here, θ is the angle between the vectors representing documents A and B in the n-dimensional term space. A and B are very similar as θ approaches 0° (cos 0° = 1), and very dissimilar as θ approaches 90° (cos 90° = 0).
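
Using the same toy vectors as in the Jaccard example above, A = (1, 2, 0) and B = (0, 1, 1):

  A ⋅ B = 1*0 + 2*1 + 0*1 = 2
  |A| = sqrt(1 + 4 + 0) = sqrt(5), |B| = sqrt(0 + 1 + 1) = sqrt(2)
  similarity(A,B) = 2 / (sqrt(5) * sqrt(2)) ≈ 0.6325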

Dr. E. Garcia has a very informative tutorial on Cosine Similarity that explains the formula above in much more detail (and with more humor).

CosineSimilarity.java

The code below computes the cosine similarity based on the formula above. Here, normF() is the Frobenius norm, the square root of the sum of the squares of all elements, and norm1() is the maximum column sum, which works for me because we are computing it off a single-column matrix with non-negative entries.

// Source: src/main/java/com/mycompany/myapp/similarity/CosineSimilarity.java
package com.mycompany.myapp.similarity;

import Jama.Matrix;

public class CosineSimilarity extends AbstractSimilarity {

  @Override
  protected double computeSimilarity(Matrix sourceDoc, Matrix targetDoc) {
    // A ⋅ B: element-wise product, summed up via norm1() (the column sum)
    double dotProduct = sourceDoc.arrayTimes(targetDoc).norm1();
    // |A| * |B|: product of the Euclidean (Frobenius) norms
    double euclideanDist = sourceDoc.normF() * targetDoc.normF();
    return dotProduct / euclideanDist;
  }
}
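
One caveat (which also comes up in the comments below): if either document vector is all zeros, normF() returns 0 and this method returns NaN. A defensive variant, assuming you want to treat such pairs as completely dissimilar, might guard the division like this:

    double denom = sourceDoc.normF() * targetDoc.normF();
    return denom == 0.0D ? 0.0D : dotProduct / denom;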

The results of this computation for the term-document matrix normalized with TF/IDF and LSI respectively are shown below. As before, the inter-document similarity is higher for the LSI-normalized matrix.

=== Cosine Similarity (TF/IDF) ===
            D1      D2      D3      D4      D5      D6      D7
    D1  1.0000  0.0000  0.0000  0.1316  0.0000  0.0000  0.0000
    D2  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000  0.1680
    D3  0.0000  0.0000  1.0000  0.4198  0.0000  0.0000  0.0000
    D4  0.1316  0.0000  0.4198  1.0000  0.0000  0.0000  0.0000
    D5  0.0000  0.0000  0.0000  0.0000  1.0000  0.0000  0.0000
    D6  0.0000  0.0000  0.0000  0.0000  0.0000  1.0000  0.3154
    D7  0.0000  0.1680  0.0000  0.0000  0.0000  0.3154  1.0000

=== Cosine Similarity (LSI) ===
            D1      D2      D3      D4      D5      D6      D7
    D1  1.0000  0.0000  1.0000  1.0000  0.0000  0.0000  0.0000
    D2  0.0000  1.0000  0.0000  0.0000  1.0000  1.0000  1.0000
    D3  1.0000  0.0000  1.0000  1.0000  0.0000  0.0000  0.0000
    D4  1.0000  0.0000  1.0000  1.0000  0.0000  0.0000  0.0000
    D5  0.0000  1.0000  0.0000  0.0000  1.0000  1.0000  1.0000
    D6  0.0000  1.0000  0.0000  0.0000  1.0000  1.0000  1.0000
    D7  0.0000  1.0000  0.0000  0.0000  1.0000  1.0000  1.0000

Searching - Similarity against a Query Vector

Using the similarity implementations above, it is now possible to build a Searcher. The objective is to find the documents most similar to our query. For this, we build a query vector consisting of the terms in the query, and then compute the similarity of the documents against the query vector. The computation is similar to what we have already done above if we treat the query vector as a specialized instance of a document vector. We use Cosine Similarity for our computations here.
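
For example, for the query "human computer interface", the query vector has 1.0 in the rows for "human", "computer" and "interface" (assuming all three occur in our 23-term vocabulary) and 0.0 everywhere else; it is then scaled by 1/3 so its entries sum to 1, matching the normalization done in getQueryMatrix() below.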

Searcher.java

// Source: src/main/java/com/mycompany/myapp/similarity/Searcher.java
package com.mycompany.myapp.similarity;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import Jama.Matrix;

public class Searcher {

  public class SearchResult {
    public String title;
    public double score;
    public SearchResult(String title, double score) {
      this.title = title;
      this.score = score;
    }
  };
  
  private Matrix termDocumentMatrix;
  private List<String> documents;
  private List<String> terms;
  private AbstractSimilarity similarity;
  
  public void setTermDocumentMatrix(Matrix termDocumentMatrix) {
    this.termDocumentMatrix = termDocumentMatrix;
  }
  
  public void setDocuments(String[] documents) {
    this.documents = Arrays.asList(documents);
  }
  
  public void setTerms(String[] terms) {
    this.terms = Arrays.asList(terms);
  }
  
  public void setSimilarity(AbstractSimilarity similarity) {
    this.similarity = similarity;
  }
  
  public List<SearchResult> search(String query) {
    // build up query matrix
    Matrix queryMatrix = getQueryMatrix(query);
    final Map<String,Double> similarityMap = 
      new HashMap<String,Double>();
    for (int i = 0; i < termDocumentMatrix.getColumnDimension(); i++) {
      double sim = similarity.computeSimilarity(queryMatrix, 
        termDocumentMatrix.getMatrix(
          0, termDocumentMatrix.getRowDimension() - 1, i, i));
      if (sim > 0.0D) {
        similarityMap.put(documents.get(i), sim);
      }
    }
    return sortByScore(similarityMap);
  }
  
  private Matrix getQueryMatrix(String query) {
    Matrix queryMatrix = new Matrix(terms.size(), 1, 0.0D);
    String[] queryTerms = query.split("\\s+");
    for (String queryTerm : queryTerms) {
      int termIdx = 0;
      for (String term : terms) {
        if (queryTerm.equalsIgnoreCase(term)) {
          queryMatrix.set(termIdx, 0, 1.0D);
        }
        termIdx++;
      }
    }
    // normalize so the entries sum to 1; guard against division by
    // zero when none of the query terms occur in the vocabulary
    double norm = queryMatrix.norm1();
    if (norm > 0.0D) {
      queryMatrix = queryMatrix.times(1 / norm);
    }
    return queryMatrix;
  }
  
  private List<SearchResult> sortByScore(
      final Map<String,Double> similarityMap) {
    List<SearchResult> results = new ArrayList<SearchResult>();
    List<String> docNames = new ArrayList<String>();
    docNames.addAll(similarityMap.keySet());
    Collections.sort(docNames, new Comparator<String>() {
      public int compare(String s1, String s2) {
        return similarityMap.get(s2).compareTo(similarityMap.get(s1));
      }
    });
    for (String docName : docNames) {
      double score = similarityMap.get(docName);
      if (score < 0.00001D) {
        continue;
      }
      results.add(new SearchResult(docName, score));
    }
    return results;
  }
}

The results of searching with the query "human computer interface" against term-document matrices built with TF/IDF and LSI normalization are shown below. Notice that we get a few more results from the LSI-normalized matrix. I have manually copied the titles over to make the results more readable. I suppose I could make the code do it, and I probably would have if the code were going to be used by anybody other than myself.

Results for query: [human computer interface] using TF/IDF
D1 (score =   0.8431) Human machine interface for computer applications
D4 (score =   0.1881) System and human system engineering testing of EPS

Results for query: [human computer interface] using LSI
D4 (score =   0.2467) System and human system engineering testing of EPS
D3 (score =   0.2467) The EPS user interface management system
D1 (score =   0.2467) Human machine interface for computer applications

Calling the code - the JUnit test

For completeness, and to show how the components above should be called, here is the JUnit test that was used to run these components and generate the results for this post.

// Source: src/test/java/com/mycompany/myapp/similarity/SimilarityTest.java
package com.mycompany.myapp.similarity;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.apache.commons.lang.StringUtils;
import org.junit.Before;
import org.junit.Test;
import org.springframework.jdbc.datasource.DriverManagerDataSource;

import Jama.Matrix;

import com.mycompany.myapp.indexers.IdfIndexer;
import com.mycompany.myapp.indexers.LsiIndexer;
import com.mycompany.myapp.indexers.VectorGenerator;
import com.mycompany.myapp.similarity.Searcher.SearchResult;

public class SimilarityTest {

  private VectorGenerator vectorGenerator;
  
  @Before
  public void setUp() throws Exception {
    vectorGenerator = new VectorGenerator();
    vectorGenerator.setDataSource(new DriverManagerDataSource(
      "com.mysql.jdbc.Driver", "jdbc:mysql://localhost:3306/tmdb", 
      "tmdb", "irstuff"));
    Map<String,Reader> documents = 
      new LinkedHashMap<String,Reader>();
    BufferedReader reader = new BufferedReader(
      new FileReader("src/test/resources/data/indexing_sample_data.txt"));
    String line = null;
    while ((line = reader.readLine()) != null) {
      String[] docTitleParts = StringUtils.split(line, ";");
      documents.put(docTitleParts[0], new StringReader(docTitleParts[1]));
    }
    vectorGenerator.generateVector(documents);
  }

  @Test
  public void testJaccardSimilarityWithTfIdfVector() throws Exception {
    IdfIndexer indexer = new IdfIndexer();
    Matrix termDocMatrix = indexer.transform(vectorGenerator.getMatrix());
    JaccardSimilarity jaccardSimilarity = new JaccardSimilarity();
    Matrix similarity = jaccardSimilarity.transform(termDocMatrix);
    prettyPrintMatrix("Jaccard Similarity (TF/IDF)", similarity, 
      vectorGenerator.getDocumentNames(), new PrintWriter(System.out, true));
  }
  
  @Test
  public void testJaccardSimilarityWithLsiVector() throws Exception {
    LsiIndexer indexer = new LsiIndexer();
    Matrix termDocMatrix = indexer.transform(vectorGenerator.getMatrix());
    JaccardSimilarity jaccardSimilarity = new JaccardSimilarity();
    Matrix similarity = jaccardSimilarity.transform(termDocMatrix);
    prettyPrintMatrix("Jaccard Similarity (LSI)", similarity, 
      vectorGenerator.getDocumentNames(), new PrintWriter(System.out, true));
  }

  @Test
  public void testCosineSimilarityWithTfIdfVector() throws Exception {
    IdfIndexer indexer = new IdfIndexer();
    Matrix termDocMatrix = indexer.transform(vectorGenerator.getMatrix());
    CosineSimilarity cosineSimilarity = new CosineSimilarity();
    Matrix similarity = cosineSimilarity.transform(termDocMatrix);
    prettyPrintMatrix("Cosine Similarity (TF/IDF)", similarity, 
      vectorGenerator.getDocumentNames(), new PrintWriter(System.out, true));
  }
  
  @Test
  public void testCosineSimilarityWithLsiVector() throws Exception {
    LsiIndexer indexer = new LsiIndexer();
    Matrix termDocMatrix = indexer.transform(vectorGenerator.getMatrix());
    CosineSimilarity cosineSimilarity = new CosineSimilarity();
    Matrix similarity = cosineSimilarity.transform(termDocMatrix);
    prettyPrintMatrix("Cosine Similarity (LSI)", similarity, 
      vectorGenerator.getDocumentNames(), new PrintWriter(System.out, true));
  }
  
  @Test
  public void testSearchWithTfIdfVector() throws Exception {
    // generate the term document matrix via the appropriate indexer
    IdfIndexer indexer = new IdfIndexer();
    Matrix termDocMatrix = indexer.transform(vectorGenerator.getMatrix());
    // set up the query
    Searcher searcher = new Searcher();
    searcher.setDocuments(vectorGenerator.getDocumentNames());
    searcher.setTerms(vectorGenerator.getWords());
    searcher.setSimilarity(new CosineSimilarity());
    searcher.setTermDocumentMatrix(termDocMatrix);
    // run the query
    List<SearchResult> results = 
      searcher.search("human computer interface");
    prettyPrintResults("human computer interface", results);
  }

  @Test
  public void testSearchWithLsiVector() throws Exception {
    // generate the term document matrix via the appropriate indexer
    LsiIndexer indexer = new LsiIndexer();
    Matrix termDocMatrix = indexer.transform(vectorGenerator.getMatrix());
    // set up the query
    Searcher searcher = new Searcher();
    searcher.setDocuments(vectorGenerator.getDocumentNames());
    searcher.setTerms(vectorGenerator.getWords());
    searcher.setSimilarity(new CosineSimilarity());
    searcher.setTermDocumentMatrix(termDocMatrix);
    // run the query
    List<SearchResult> results = 
      searcher.search("human computer interface");
    prettyPrintResults("human computer interface", results);
  }

  private void prettyPrintMatrix(String legend, Matrix matrix, 
      String[] documentNames, PrintWriter writer) {
    writer.printf("=== %s ===%n", legend);
    writer.printf("%6s", " ");
    for (int i = 0; i < documentNames.length; i++) {
      writer.printf("%8s", documentNames[i]);
    }
    writer.println();
    for (int i = 0; i < documentNames.length; i++) {
      writer.printf("%6s", documentNames[i]);
      for (int j = 0; j < documentNames.length; j++) {
        writer.printf("%8.4f", matrix.get(i, j));
      }
      writer.println();
    }
    writer.flush();
  }
  
  private void prettyPrintResults(String query, 
      List<SearchResult> results) {
    System.out.printf("Results for query: [%s]%n", query);
    for (SearchResult result : results) {
      System.out.printf("%s (score = %8.4f)%n", result.title, result.score);
    }
  }
}

Conclusion

I hope this post was helpful. I had looked at Term Vectors in Lucene in the past, but the idea of an n-dimensional term space never quite sank in until I started working through the material I describe above. I know that the data I am using is toy data, and that the computations described here are too expensive to be practical at scale, but I believe that knowing this stuff helps in building models that are useful and practical.

Update 2009-04-26: In recent posts, I have been building on code written and described in previous posts, so there were (and rightly so) quite a few requests for the code. So I've created a project on Sourceforge to host the code. You will find the complete source code built so far in the project's SVN repository.

84 comments (moderated to prevent spam):

Anonymous said...

for a better collection of similarity metrics see SimMetrics on sourceforge or http://www.dcs.shef.ac.uk/~sam/simmetrics.html

Sujit Pal said...

Thank you for the comment, Rev. Chapman, and thank you for compiling the list. It was very informative.

KaTat said...

Thanks, that's very useful for my current experimental project.

Sujit Pal said...

Hi KaTat, glad the post helped. We should both really be thanking Dr Konchady :-), on whose book this stuff is based.

Amit said...

The post is very useful for me; I am working on sentiment analysis,
but I am not able to understand how to create the indexing_sample_data.txt file from the sample data given.
Help me find this text file for the data set.
Thanks

Sujit Pal said...

Hi Bablu, thanks for the feedback and glad it helped you. The file is in src/test/resources/data in the SVN repository for the JTMT project on sourceforge.

Amit said...

Thanks for the reply, it really worked.
But I am getting a java.lang.NoClassDefFoundError
on the line "this.jdbcTemplate = new JdbcTemplate(dataSource);" in the phrase and abbreviation recognizer. I have included all the required imports. Maybe this error is because of an incorrect datasource path; if so, please tell me how to set the datasource path and help me remove this error.
Thanks

Sujit Pal said...

Hi Amit, NoClassDefFoundError means that the class could not be found. Since this occurs on this line, it could be either JdbcTemplate or DataSource. The first is possible if you don't have the Spring jars in your classpath, and the second should not happen, since DataSource is part of the javax.sql package built into the JDK (unless you are using an ancient version). In any case, your error message should tell you more - I suspect that you are missing the Spring jars.

Amit said...

Hi sujit, thanks for reply!!!
I found that the error was because of the datasource. I tried to make a connection to MySQL through JDBC but I got the exception
"com.mysql.jdbc.CommunicationsException: Communications link failure due to underlying exception:

** BEGIN NESTED EXCEPTION **

java.net.SocketException
MESSAGE: Permission denied: connect"
help me to remove this error.
Thanks...

Sujit Pal said...

Hi Amit, this is probably because either (a) you need to expose a socket for JDBC to talk to MySql and/or (b) your username/password is incorrect or (c) some other reason I can't predict remotely. You may want to RTFM about Java/JDBC/MySQL connectivity, and if you think that is not necessary, try posting the problem on a MySQL forum - this looks like this is related to your setup and not with the code.

Amit said...

Hello sujit,
Your suggestions are really working.
Now I am not able to create the database using the textmine project. I have downloaded the textmine project. Can you please tell me the process to create the "tmdb" database from the textmine project? It would be very helpful.
Thanks for help......

Sujit Pal said...

Hi Amit, good to know. As you can see, tmdb is simply a MySQL database loaded with data from the textmine project. You will find the data in the utils/data subdirectory of the textmine project. I had to massage it a bit, not sure if the code for that exists, if not, it was probably just a bunch of sed commands.

Anonymous said...

hi this is tehseen .i need code in java for finding semantic similarity between sentences or similarity between sentences.This will help in my project a lot

Sujit Pal said...

Hi Tehseen, calculating similarity between sentences is the same as calculating between documents, at least using the vector approach I describe here. Basically create a term/sentence matrix instead of a term/document matrix.

Ambika said...

Hi Sujit, I am using LsiIndexer.java for my project, but while executing I am getting an error: "can not access Jama.Matrix; bad class file:\java.Matrix; please remove or make sure it appears in the correct subdirectory of the class path". Please help me with this. Thanks in advance.

Sujit Pal said...

Hi Ambika, you should download the entire project (contains updated LsiIndexer.java) from the SVN repository at jtmt.sf.net. I switched from Jama to commons-math-2.0 once they put the LSI Indexing stuff in there.

Sujit Fan said...

I see your LSI similarity values, and I looked at the documents too. The matrix says 1.0 while I can see the documents are not even similar in meaning. For example, doc2 and doc6 are not similar. Same for other records. How come your program's calculations say this? If I missed anything, please let me know.

Sujit Pal said...

Hi Sujit Fan, not sure why LSI says they are similar when they are not. I believe I investigated it once but can't be sure. If you dump out the components of the decomposition, it would probably give you a better idea - you would see the terms that are the principal components - perhaps these words co-occur in the documents, even though they don't contribute much meaning? Then they may be good candidates for stopwords.

Sujit Fan said...

Hi Sujit,
Thanks for your quick reply. You wrote that JAMA is not good for sparse matrices, but I think SVD is used to handle sparse matrices, isn't it? If I am wrong, can you please explain what problems we can face if we use a big sparse matrix in Jama? Speed? Accuracy? Also, can you please tell me if there is any way to make JAMA work where rows are fewer than columns? If NO, then can you point out any other LSI Java package that I can use? I am looking for a small package like Jama, not a BIG ONE :)

Sujit Pal said...

Hi, AFAIK, Jama has no support for sparse matrices - internally matrices are treated as a double[][] - when I say sparse matrix support, I mean that the data structure will only store the non-zero elements of the matrix. Obviously for large sparse matrices such as TD matrices, this can result in significant memory savings. Also AFAIK, SVD does not care whether the matrix is sparse or dense. Jama is no longer under active development, so I would recommend commons-math (which is a bit larger than Jama) - they have been looking at SVD improvements lately. Actually, I would suggest that you get on the apache commons list and ask your questions there; you will probably get better answers there than from me.

Anjali said...

Hi Sujit,
Thank you so much for such a brilliant post. I am using your LSI and cosine similarity concepts in my project. I downloaded your project, but while executing SimilarityTest.java I am getting the error "Exception in thread "main" java.lang.NoSuchMethodError: main". I couldn't find a "main" method in your code. Please help me solve this.
One more doubt: I don't quite understand why we use a database here.
vectorGenerator.setDataSource(new DriverManagerDataSource( "com.mysql.jdbc.Driver", "jdbc:mysql://localhost:3306/tmdb", "root", "orange"));
Are there other settings I should configure for this, like a jdbc-odbc connection?
Can I use Oracle 10g for this?
Is it compulsory to use a database?
Sorry if my questions are too lame..
Please help me on this. I will be grateful to you.

Thanks in advance

Regards,
Anjali

Sujit Pal said...

Hi Anjali, thanks for the kind words.

To answer your first question, I am guessing that you may be running the SimilarityTest as a Java application (from your IDE perhaps?), in which case it looks for the main() method - you will need to run this as a JUnit application - you can do this from your IDE or using Ant or Maven from the command line - see the documentation for whatever tool you are using.

For the second question, the database is used for recognizing abbreviations and multi-word phrases. The first is a list of known abbreviations, which you can also put into a text file. The second is a bit more complex, since phrases are stored as collocated words in the database. In the jtmt project there are schema and data files that contain the table def and data for the table. Using a database seemed to be the most obvious way to do this, but of course it may be possible (although probably not very easy) to store them in some other form.

helpNeeded said...

Hi Sujit,

I just went through your post and guess it will be useful in my project. But I am not so good with my programming.

Could you assist on how this all works?

Sujit Pal said...

Hi helpNeeded, the post is just some Java implementations of various similarity measures I found in the TMAP book and elsewhere. When you say "how it all works", are you looking for the underlying mathematical theory behind these? If so, you should probably look elsewhere (do a search). If you are asking about the code itself, it should be fairly simple if you know Java.

helpNeeded said...

Thanks for your reply.

I am trying out the code but keep getting errors with the collections15 import, so I am not able to continue with the CosineSimilarity.java class.

Sujit Pal said...

You can get the generic collections library here.

Sreejith said...

Hi Sujit..
How can I identify whether a document makes an efficient search query? During my testing I found that if the search vector is good, it gives accurate results, but if the query is bad then we can't expect the right output...
Thank u...

Sujit Pal said...

Hi Srijith, isn't this what you would expect on any search engine? When you say "bad" search vector, perhaps it contains too few terms or contains terms that are very common in the corpus, right? So I guess there would be two ways to go about "improving" your search vector - either insist on a minimum number of terms or post process the search vector to remove common terms? Or maybe I am not understanding what you are looking for?

Sreejith said...

Thank u sujit
I have created an LSI index of 10 near-duplicate documents and am passing each document as a search vector. But even though all the documents are near duplicates, for some search queries LSI gives a very low score; some docs are identified as near duplicates and others are not...

I put 0.8 as a threshold to filter the near duplicates, but some near duplicates give a score less than 0.8 and some completely different documents give a score > 0.8 ..

So i can't conclude....This is the scenario..I hope u could now able to figure out my issue..
Thank u....

Sujit Pal said...

Hi Srijith, my understanding of LSI is that you achieve "semantic" indexing by removing the noisy dimensions out, leaving only the dimensions which truly represent the document. So documents that otherwise wouldn't look similar because of the presence of noisy dimensions now do. I am guessing that you did the same thing on the search vector side, ie removing the noisy dimensions from the search vector. Could the documents that don't seem to match have some outlier terms after cleanup that are causing the results to be skewed?

Ajaz said...

Hi Sujit, great library for text mining. Can you provide the schema for tmdb? There is no utils/data subdirectory in the textmine project.

Sujit Pal said...

Thanks Ajaz. About the schema in the textmine project, I just downloaded the textmine-0.21 tarball (from textmine.sf.net and click on Download) and exploded it locally, I see a whole bunch of .dat files under utils/data, and a tables.sql with the schema in utils. I used that to do the initial population, then added a few tables of my own, whose schemas are in the jtmt project.

Anonymous said...

Hi Pal,
I found a collection of .sql files in the jtmt project. In what order should they be executed to create the database and populate it from the .dat files? Please provide the steps. Also, for Windows only WordNet 2.1 is available - will it suffice?

Ajaz said...

Thanks for the info Sujit. will try it out.

Unknown said...

Hi Sujit,
I am trying to get jtmt working on my PC (Windows), but am facing some trouble with the SQL database. Which scripts need to be run, or tables created? I found a few '.sql' files at jtmt\src\main\sql - do all these files need to be used?
I am interested in getting the similarity metrics alone working.

Sujit Pal said...

@Anonymous:
The .sql files can be applied in any order after the textmine data is loaded (see my reply to Ajaz's comment above for details about that). If multiple tables are to be created in a certain order (because of fk references, etc), they are contained in a single .sql file so they are created in the correct order.

To access Wordnet, I use jwi, which is advertised to work with Wordnet versions 1.6-3.0, so I guess you should be fine with Wordnet 2.1.

@lmthyaz:
For .sql order, please see my response to anonymous above.

For restricting to tables for similarity, unfortunately I don't have a good answer for you. I suggest only loading up the textmine tables first, then running the JUnit test case for similarity, and see if it complains about missing tables, then add them in.

imthyaz said...

Hi Sujit,
I am using the Idf Indexer and cosine similarity from jtmt. When trying to find the similarity between two documents (the corpus has only 2 files), I am getting a similarity of 'NaN'. What does it mean? Where should I make a change if I want to modify this?

Sujit Pal said...

Hi Imthyaz, hard to say without a bit more information; offhand I would say it's probably an underflow or overflow (the resulting double is too small or too large to fit in the space provided). If you trace through the code and add some bounds checking, something like this:

if (result >= Double.MAX_VALUE) {
  result = Double.MAX_VALUE;
}
if (result <= Double.MIN_VALUE) {
  result = Double.MIN_VALUE;
}

I think the NaNs would disappear.

Anonymous said...

Can you please tell me if it's possible to find similarity between phrases using WordNet in Java?

Sujit Pal said...

Not sure about phrases, but I recently came across a Perl module called Wordnet::Similarity, which provides implementations of various functions for using Wordnet to find similarities between words. You may want to check it out.

Anonymous said...

Thanks for providing the code for TF-IDF. I am new to Java programming. I tried running the first class, the stop word recognizer, but it gave me errors like
unable to find class: IRecognizer
unable to find class: Token
Where do I get these files?

Sujit Pal said...

Hi, you can find the IRecognizer class in the JTMT project (jtmt.sf.net).

Anonymous said...

I implemented the following class:

public class IdfIndexer implements Transformer

But the problem is that setEntry is unknown.

Can you help me

Sujit Pal said...

Transformer does not need setEntry(); all it needs is a transform() method. Can you please provide more information?

Anonymous said...

Once we have the inter-document similarity matrix, how do we use it for clustering?

Thanks

Sujit Pal said...

It probably won't be such a great idea to do this. The similarity matrix would be useful for finding "related documents" given a selected one, but it can't tell you why they are similar. Clustering groups docs by features, so documents in a cluster are similar along those features.

David said...

Hi Pal,
Thank you for your reply.
I calculate the similarity between documents by myself using my own algorithm, so that's why I generate a doc to doc matrix, but do not know how to cluster them.

thanks,

Sujit Pal said...

Hi David, thought a bit more about this, and I take my comment about it being a bad idea back... I think you could just use a standard algo such as Knn and replace the distance metric with your doc-doc similarity number. The k (size of each cluster) would be determined by the number of docs in your corpus / number of clusters you want.

Rachit Maheshwari said...

Hi sujit

Can you please tell me how to run the textmine-0.21 tarball?

Sujit Pal said...

Hi Rachit, it's been a while since I pulled the data from Dr Konchady's textmine project, but basically I used the instructions on the project page. The TMAP book has some pointers as well, but I think the installation instructions on the project page should provide enough detail.

arivu said...

How do I download the package com.mycompany.myapp.similarity, sir?
Thank you.

Sujit Pal said...

Hi arivu, you can find it in the SVN repository for my JTMT project.

Unknown said...

Hi Sujit
Firstly, thanks for the great post that really helps us a lot. I have a few queries regarding similarity in documents.
1. In this post you have used a database to detect abbreviations and multi-word phrases. Can we exclude this part? I mean simply create an index of the words, removing stop words and functional words. In all cases (LSI, TF/IDF), termDocMatrix needs vectorGenerator.getMatrix() as a parameter. What would be an appropriate parameter for termDocMatrix if we do not wish to use the database? Simply put, I want to preprocess each file (remove stop words and stem), index it, and then create the termDocMatrix.
2. If we are going to use the database, I am confused about which .sql file to run. I found four .sql files at jtmt\src\main\sql, viz. tmdb_load, tmdb_my_colloc, tmdb_postagger, tmdb_schema.
3. But for the data, I could not find utils/data in the project.
4. Is it possible to run the JUnit test as a Java application by adding a main function?
Regards,
Niraj

Sujit Pal said...

Hi Niraj, sorry about the delay in responding. Yes, I think you can safely remove the database-based abbreviation/collocation detection stuff while computing document similarity. I believe the sequence in which you would run the SQL files is: tmdb_schema, tmdb_load, tmdb_my_colloc and tmdb_postagger. The data comes from the TextMine (textmine.sf.net) project. And yes, you can take the contents of the function annotated by @Test and put it into a main() function; then you can run the JUnit code as a Java application.

Adriana said...

The post is helping me a lot, but I don't understand how to create the documents. Can you help me, sir?

Sujit Pal said...

Hi Adriana, the documents in the example are taken from Dr Konchady's TMAP book. They are just titles.

Sara said...

Hi,
How can I calculate the similarity of two vectors with different sizes?
Thanks

Sujit Pal said...

Hi Sara, if you are using cosine similarity then you cannot. In case of document vectors, they should be the same size, as each element of the vector will correspond to one word/term in the vocabulary (which is a superset of all terms found in the corpus). If they do differ in size for some reason, then you can pad the shorter one with zeros (in case of text you are saying that the extra terms occur zero times in the shorter document vector).

Sara said...

Hi,

I did the same thing and padded with zeros. But when I use Gaussian mixtures for clustering, it does not work because I have many zeros in my matrix and all of my parameters become NaN.
Thanks.

Sujit Pal said...

I am guessing you are trying to find similarities between "average" documents across genres, yes? If so, very cool idea. The NaNs are most likely because of underflow problems. One way to work around that is to apply a Laplace correction (add 1 to the numerator and v to the denominator, where v is the vocabulary size) to the TD vectors, although if you do that it will change the data from sparse to dense (although you can just bake this behavior into a subclass of Sparse Matrix to keep the memory properties the same). But it may be good to find out where the NaN is actually occurring before taking such drastic action.

Sara said...

Actually, I am trying to cluster the documents using Gaussian mixtures. The problem is that, since the sparse matrix has many zeros, the variance will be zero and the program does not work. It happens when I am using Dirichlet mixtures too. I was wondering if you could explain more how I can change the data to a dense matrix? What is the denominator?
Many thanks for your time.
Sara

Sujit Pal said...

Well, the idea is to add 1 to the numerator of every element in the TD matrix and v to its denominator where v is the vocabulary size. So assume a term t has occurred in document d n times, and N is the total number of words in the corpus, then the frequency of t is n/N, with the Laplace correction it becomes (n+1)/(N+v). This now changes your sparse matrix into a dense one.

sara said...

Thanks Sujit,
Unfortunately, using a dense matrix didn't help me and I still have the problem.

Sujit Pal said...

You're welcome, hopefully you will be able to fix the problem...

Anonymous said...

Hello,

Thank you for your post. On the cosine similarity part, you used norm1 (double dotProduct = sourceDoc.arrayTimes(targetDoc).norm1()) to compute the dot product. From what I have read, norm1 always gives positive results, which makes the cosine similarity result always between 0 and 1 (while cosine similarity can also be between -1 and 0). Am I missing something, or is it a bug?

Thanks

Sujit Pal said...

Thanks, glad it was useful. The use of norm1 is not a bug, you can never have negative cosine in case of text since term frequencies are always zero or positive.

Anonymous said...

Hello,

Thanks for your reply. Yes, you are right, in the example you gave it is always positive. I tried to use your cosine similarity method on vectors taken from the V-matrix generated after SVD and it had negative values. Now, I have written mine. Anyway thank you for your post and prompt reply.

Best

Sujit Pal said...

Oh, okay, I see why you could get negative values now...thanks for sharing, I will look out for this in the future.

Anonymous said...

Hi Sujit,
I am trying to understand the similarity matrix output you have printed. Say we have five docs, d1, d2, ..., d5. How is the similarity represented here? If we find the similarity between d1 and d2, how is it represented in the matrix?

Please explain.
Thanks,
Rob.

Sujit Pal said...

Hi Rob, d1 and d2 point to row and column indexes 0 and 1 respectively. So similarity(d1,d2) would be at either [0,1] or [1,0] of the matrix (both values are equal). Also notice the diagonal is all 1, indicating that any document is perfectly similar to itself.

Anonymous said...

Hi Sujit,
I am a newbie to vector space modelling. I recently had to create a tool which could take in a query and give back a ranked list of files from a source tree using TF-IDF. Can you suggest how to go about it?

VSM

Sujit Pal said...

Hi, if you already have an (n x m) TD matrix for your documents (call it D, with n terms and m documents), you could use the same vectorizer to create an (n x 1) vector for your query (call it Q), then generate an (m x 1) similarity vector S = D' * Q (D transposed times Q). Depending on the size of your document set, you can do the computation with dense or sparse matrices in memory, or you could use Hadoop if the matrices don't fit into memory.

Anonymous said...

Hello, what I have just read is very interesting. I have to work with similarity measures and I have read about cosine similarity; your post seems very helpful to me.

Sujit Pal said...

Thank you very much for your kind words. My Spanish is not good, but Google Translate is my friend :-).

Anonymous said...

I understand - I do the same with English. I have a question: I have downloaded the 4 classes but I cannot run them because none of them has a main method - I mean the classes CosineSimilarity, JaccardSimilarity, AbstractSimilarity and Searcher. Please help me resolve this. THANKS

Sujit Pal said...

You're welcome. They are not meant to be run by themselves, but as part of an indexing process. You can see how that is set up in SimilarityTest.java (the complete source code is available at jtmt.sf.net).

sandi said...

Hi Sujit, This indeed is a great post which helps to understand a lot of concepts.
On the cosine similarity part: I have a structured data-set of addresses (name, father's name, city, state, country, pincode) and I would like to associate weights with each of the entities when calculating the cosine similarity between the records.

I found an approach using Spark MLLIB in the below link
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala

But in this case we have a generic weight for the whole sentence.

Thanks
Sandip

Sujit Pal said...

Thanks for the kind words, Sandi. Also thanks for the pointer to the Spark cosine similarity code. I had heard of the DIMSUM algorithm, realized after seeing the link that its already implemented in Spark. The code you pointed to computes cosine similarity (exact and approximate) between all pairs of sentences in the matrix. Each sentence is represented by a column in the matrix. I guess you could calculate cosine similarity in the same way for your structured address records, as long as you separate out the features, ie, (name1, name2), (address_line1_1, address_line1_2), ..., (pincode1, pincode2) must be padded to the same length. That way when you compute cosine similarity, it will be composed of each of these components separately. This also allows you to give different weights to the different components.

What2Watch said...

Hello Mr. Pal, thank you for such a great post. I would like to ask something related to my project. I will just use the similarity methods and not all of your work. I have documents like yours, but as one text file that contains many lines with sentences, like:

D1: word1,word2,word3,.....
D2: word1,word2,word3,.....
D3:..

To use your similarity functions, how can I convert my txt contents to a raw term-document matrix?
I couldn't find your email address to ask you these questions :) Thank you very much for your interest.

Sujit Pal said...

Hi What2Watch, I wrote this code a long time ago when I was working almost exclusively in Java and trying to explore text mining. To answer your question, the general steps to go from text files to document matrices are outlined very nicely on this Working with Text Data page on the Scikit-Learn documentation website. It's in Python (since Scikit-Learn is a Python library), but Python is more comfortable to work with than Java for this sort of stuff (I mostly do my programming in Python nowadays), mainly because there are a lot of high level modules that allow you to focus more on the application than the infrastructure. Plus Python is a more terse language so you have to write less boilerplate.

Anyway, to summarize, you tokenize the documents, build a vocabulary with the words, optionally clip the size of your vocabulary to the top most frequent and treat all other words as out-of-vocabulary, convert each document to a bag of words, optionally compute TF-IDF. At this point you have a term document matrix. To compute similarity, you need a document-document matrix, so you multiply the document-term matrix by its transpose. Each entry in this matrix is the unnormalized similarity between the 2 documents represented by the row and column. If you need to normalize, divide by the L2 norm of the two input matrices.

Thanks said...

Thank you for your effort.
Can you please tell me how to convert my dataset, which is a .csv with a lot of words? I want to convert each word to a vector to do similarity and text mining. My big problem is converting my dataset - can you help me with how to start? I have read about bag of words, cosine similarity and the vector space model, but don't know how to start.
Thank you for your time.

Sujit Pal said...

Yes, all of these need your input as a matrix. The easiest way to convert your text to a matrix (where each row represents a record and each column represents a word) is to use one of Scikit-learn's vectorizers -- in general the frequency distribution of words in text is very long tailed, so it is customary to only look at the N most important words. There are two approaches: the first is to consider only the top N words and treat any other word as unknown (so essentially a vocabulary of N+1 words), and the second is to map the full vocabulary into a smaller space of N, so some words may collide and occupy the same slot in the reduced space. At this point you are looking at simple term frequency, ie, the entry in the matrix at position (i, j) is the number of times the j-th word in the vocabulary occurs in the i-th document. If you use the first approach, you would use the CountVectorizer, and if you use the second approach you would use the HashingVectorizer. In case you want to do some normalization beyond term frequency, such as damping the effect of very frequent words, you could use the TfidfTransformer/TfidfVectorizer. The full set of scikit-learn vectorizers is listed here. You could also choose to use word vectors, which represent a semantic embedding in some low dimensional space (usually 100 or 300 dimensions) trained against some large external (not your own) dataset. For these, you would simply look up every word in your text against this external vector dataset (in the form word => vector), replace all your words with their corresponding vectors, and arrange them in matrix form (each row represents a row in the matrix). That's basically all there is to the preprocessing.

Unknown said...

Hi Mr. Pal,
I need more clarity on creating the source matrix and the query matrix. How can I do that in Python? Would you please be kind enough to show how to use them?
I have read your previous comment reply, but that confused me more. You said that position (i,j) gives the frequency of a word in the relevant document, but how can we represent a whole document in a row? Please be kind enough to explain with pseudocode or something similar so I can understand more. Thank you for your effort in helping the community with this kind of work. It's very rare to find this type of example.

Sujit Pal said...

Apologies for the late reply. If you are still interested in the answer, following on from my earlier comment: position (i, j) of the TF matrix gives the frequency of word j in document i. So the i-th row represents document i, and the j entries of row i are the term frequencies of the vocabulary words. Not all the words in the vocabulary will be in document i, so those slots will have 0s in them. So consider a corpus with 100 documents and 50,000 words in the vocabulary: the TF matrix will be of shape (100, 50000). A document at i=20, say, may have only 100 words, so 100 of the 50,000 slots in the i-th row will have values filled in, and the rest will be 0. Hope this helps!