Saturday, September 27, 2008

IR Math with Java : Similarity Measures

Last week, I wrote about building term document matrices based on Dr. Manu Konchady's Text Mining Application Programming book. This week, I continue with the same set of data, computing some similarity measures between documents. Most of the material is covered in Chapter 8 of the book.

As before, our document collection consists of 7 document titles as shown below:

D1      Human machine interface for computer applications
D2      A survey of user opinion of computer system response time
D3      The EPS user interface management system
D4      System and human system engineering testing of EPS
D5      The generation of random, binary and ordered trees
D6      The intersection graph of paths in trees
D7      Graph minors: A survey

The document collection is first converted to a raw term frequency matrix. This is further normalized using either TF/IDF or LSI. The matrix thus normalized is the input to our similarity computations.

A document is represented by a column of data in our (normalized) term document matrix. To compute the similarity between two documents, we perform computations between two columns of the matrix. Conceptually, each document is a point in an n-dimensional term space, where each term corresponds to a dimension. So in our example, we have 23 terms, and therefore our term space is a 23-dimensional space. In case this is hard to visualize (it was for me), it helps to think of documents with 2 terms (and hence a two-dimensional term space), and extrapolate from there.
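For example (with made-up counts), suppose our vocabulary had only the two terms "graph" and "tree". A document containing "graph" twice and "tree" once would be the point (2, 1) in this two-dimensional space, and a document containing only a single "tree" would be the point (0, 1). Every similarity measure below is just a different way of comparing such points.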

Jaccard Similarity

Jaccard Similarity is a simple and very intuitive measure of document-to-document similarity. It is defined as follows:

  similarity(A,B) = n(A ∩ B) / n(A ∪ B)

Using the Inclusion-Exclusion Principle, this reduces to:

  similarity(A,B) = n(A ∩ B) / (n(A) + n(B) - n(A ∩ B))
  where:
    n(A ∩ B) = Σ min(Ai, Bi)
    n(A) = Σ Ai
    n(B) = Σ Bi
  for i = [0..n-1], where n = number of terms in our term-document matrix.
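
As a quick worked example with made-up term counts, take A = (1, 2, 0) and B = (0, 1, 1) in a three-term space:

  n(A ∩ B) = min(1,0) + min(2,1) + min(0,1) = 1
  n(A) = 1 + 2 + 0 = 3
  n(B) = 0 + 1 + 1 = 2
  similarity(A,B) = 1 / (3 + 2 - 1) = 0.25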

AbstractSimilarity.java

The AbstractSimilarity class shown below contains the code to compare all document pairs from our collection. This has a complexity of O(n²), which we could roughly halve by exploiting the facts that similarity(A,B) == similarity(B,A) and similarity(A,A) == 1.0, but which is still too high for a large number of documents. I haven't added these optimizations here, though, since this is just toy data and the performance is tolerable.
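
For the curious, here is a minimal sketch of what that optimization could look like (this is hypothetical and not part of the actual class below, and it assumes no document vector is all zeros, so that similarity(A,A) == 1.0 holds):

// Hypothetical symmetric variant of transform(); not in the repository.
public Matrix transformSymmetric(Matrix termDocumentMatrix) {
  int numDocs = termDocumentMatrix.getColumnDimension();
  int numTerms = termDocumentMatrix.getRowDimension();
  Matrix similarityMatrix = new Matrix(numDocs, numDocs);
  for (int i = 0; i < numDocs; i++) {
    similarityMatrix.set(i, i, 1.0D); // similarity(A,A) == 1.0
    Matrix sourceDoc = termDocumentMatrix.getMatrix(0, numTerms - 1, i, i);
    for (int j = i + 1; j < numDocs; j++) {
      Matrix targetDoc = termDocumentMatrix.getMatrix(0, numTerms - 1, j, j);
      double sim = computeSimilarity(sourceDoc, targetDoc);
      similarityMatrix.set(i, j, sim); // similarity(A,B) ==
      similarityMatrix.set(j, i, sim); //   similarity(B,A)
    }
  }
  return similarityMatrix;
}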

// Source: src/main/java/com/mycompany/myapp/similarity/AbstractSimilarity.java
package com.mycompany.myapp.similarity;

import org.apache.commons.collections15.Transformer;

import Jama.Matrix;

public abstract class AbstractSimilarity 
    implements Transformer<Matrix, Matrix> {

  /**
   * Builds an n x n document-document similarity matrix from the
   * term-document matrix, where n is the number of documents (columns).
   */
  public Matrix transform(Matrix termDocumentMatrix) {
    int numDocs = termDocumentMatrix.getColumnDimension();
    Matrix similarityMatrix = new Matrix(numDocs, numDocs);
    for (int i = 0; i < numDocs; i++) {
      // extract the i-th document as a single-column matrix
      Matrix sourceDocMatrix = termDocumentMatrix.getMatrix(
        0, termDocumentMatrix.getRowDimension() - 1, i, i); 
      for (int j = 0; j < numDocs; j++) {
        // compare it against every document (including itself)
        Matrix targetDocMatrix = termDocumentMatrix.getMatrix(
          0, termDocumentMatrix.getRowDimension() - 1, j, j);
        similarityMatrix.set(i, j, 
          computeSimilarity(sourceDocMatrix, targetDocMatrix));
      }
    }
    return similarityMatrix;
  }

  protected abstract double computeSimilarity(
      Matrix sourceDoc, Matrix targetDoc);
}

JaccardSimilarity.java

For the JaccardSimilarity class, we define the abstract method computeSimilarity() as shown below (based on our formula above). I have cheated here slightly: in Jama, norm1() is defined as the maximum column sum, but since our document matrices are column matrices (i.e., they have a single column), norm1() returns the same value as the sum of all elements in the document matrix. It does make the code less readable, in the sense that the reader/maintainer has to go through an extra mental hoop, but it keeps the code concise.
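
To convince yourself of the norm1() trick, here is a throwaway check (not part of the project; the numbers are arbitrary):

// For a one-column matrix, norm1() (the maximum column sum) is just
// the sum of all elements, since there is only one column to sum.
import Jama.Matrix;

public class Norm1Check {
  public static void main(String[] args) {
    Matrix doc = new Matrix(new double[][] {{1.0}, {0.0}, {2.5}});
    System.out.println(doc.norm1()); // prints 3.5 = 1.0 + 0.0 + 2.5
  }
}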

// Source: src/main/java/com/mycompany/myapp/similarity/JaccardSimilarity.java
package com.mycompany.myapp.similarity;

import Jama.Matrix;

public class JaccardSimilarity extends AbstractSimilarity {

  @Override
  protected double computeSimilarity(Matrix source, Matrix target) {
    // n(A ∩ B) = sum of min(Ai, Bi) over all terms
    double intersection = 0.0D;
    for (int i = 0; i < source.getRowDimension(); i++) {
      intersection += Math.min(source.get(i, 0), target.get(i, 0));
    }
    if (intersection > 0.0D) {
      // norm1() on a column matrix is the sum of its elements,
      // so this is n(A) + n(B) - n(A ∩ B)
      double union = source.norm1() + target.norm1() - intersection;
      return intersection / union;
    } else {
      // no overlap (also avoids dividing by zero for two empty docs)
      return 0.0D;
    }
  }
}

The result of our Jaccard similarity computation is shown below. The JUnit test that produces this output is SimilarityTest.java, shown later in this post. The two matrices are derived from our raw term document matrix normalized using TF/IDF and LSI respectively. As you can see, the similarity matrix created from the LSI-normalized matrix shows more inter-document similarity than the one created from the TF/IDF-normalized matrix.

=== Jaccard Similarity (TF/IDF) ===
            D1      D2      D3      D4      D5      D6      D7
    D1  1.0000  0.0000  0.0000  0.0818  0.0000  0.0000  0.0000
    D2  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000  0.0710
    D3  0.0000  0.0000  1.0000  0.2254  0.0000  0.0000  0.0000
    D4  0.0818  0.0000  0.2254  1.0000  0.0000  0.0000  0.0000
    D5  0.0000  0.0000  0.0000  0.0000  1.0000  0.0000  0.0000
    D6  0.0000  0.0000  0.0000  0.0000  0.0000  1.0000  0.1781
    D7  0.0000  0.0710  0.0000  0.0000  0.0000  0.1781  1.0000

=== Jaccard Similarity (LSI) ===
            D1      D2      D3      D4      D5      D6      D7
    D1  1.0000  0.0000  1.0000  1.0000  0.0000  0.0000  0.0000
    D2  0.0000  1.0000  0.0000  0.0000  1.0000  1.0000  1.0000
    D3  1.0000  0.0000  1.0000  1.0000  0.0000  0.0000  0.0000
    D4  1.0000  0.0000  1.0000  1.0000  0.0000  0.0000  0.0000
    D5  0.0000  1.0000  0.0000  0.0000  1.0000  1.0000  1.0000
    D6  0.0000  1.0000  0.0000  0.0000  1.0000  1.0000  1.0000
    D7  0.0000  1.0000  0.0000  0.0000  1.0000  1.0000  1.0000

Cosine Similarity

Cosine Similarity is a more popular, though slightly more complex, measure of similarity. It is defined as:

  similarity(A,B) = cos θ = (A ⋅ B) / (|A| * |B|)
  where:
    A ⋅ B = Σ Ai * Bi
    |A| = sqrt(Σ Ai²)
    |B| = sqrt(Σ Bi²)
  for i = [0..n-1], where n = number of terms in our term-document matrix.

Here, θ is the angle between the vectors representing documents A and B in the n-dimensional term space. A and B are very similar as θ approaches 0° (cos 0° = 1), and very dissimilar as θ approaches 90° (cos 90° = 0).
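
Continuing the made-up two-term example from earlier, with A = (2, 1) and B = (0, 1):

  A ⋅ B = 2*0 + 1*1 = 1
  |A| = sqrt(2² + 1²) = sqrt(5) ≈ 2.236
  |B| = sqrt(0² + 1²) = 1
  similarity(A,B) = 1 / (2.236 * 1) ≈ 0.447

so the two documents come out moderately similar, thanks to the "tree" dimension they share.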

Dr. E. Garcia has a very informative tutorial on Cosine Similarity that explains the formula above in much more detail (and with more humor).

CosineSimilarity.java

The code below computes the cosine similarity based on the formula above. Here, normF() is the Frobenius norm, the square root of the sum of the squares of all elements, which for a single-column matrix is just the Euclidean length |A|; norm1() is the maximum column sum, which again works here because we are computing it on a single-column matrix.

// Source: src/main/java/com/mycompany/myapp/similarity/CosineSimilarity.java
package com.mycompany.myapp.similarity;

import Jama.Matrix;

public class CosineSimilarity extends AbstractSimilarity {

  @Override
  protected double computeSimilarity(Matrix sourceDoc, Matrix targetDoc) {
    // element-wise product followed by norm1() (the column sum of a
    // one-column matrix) gives the dot product A . B
    double dotProduct = sourceDoc.arrayTimes(targetDoc).norm1();
    // normF() on a column matrix is its Euclidean (L2) length
    double euclideanDist = sourceDoc.normF() * targetDoc.normF();
    return dotProduct / euclideanDist;
  }
}
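
If the norm1()/normF() shortcuts feel too clever, the same computation can be written with explicit loops. The class below is a hypothetical drop-in alternative (not in the repository), shown only to make the formula visible in the code:

// Hypothetical alternative; same package as the other similarity classes.
package com.mycompany.myapp.similarity;

import Jama.Matrix;

public class ExplicitCosineSimilarity extends AbstractSimilarity {

  @Override
  protected double computeSimilarity(Matrix sourceDoc, Matrix targetDoc) {
    double dotProduct = 0.0D;
    double sourceNormSq = 0.0D;
    double targetNormSq = 0.0D;
    for (int i = 0; i < sourceDoc.getRowDimension(); i++) {
      double a = sourceDoc.get(i, 0);  // term weight in document A
      double b = targetDoc.get(i, 0);  // term weight in document B
      dotProduct += a * b;             // A . B
      sourceNormSq += a * a;           // |A|^2
      targetNormSq += b * b;           // |B|^2
    }
    return dotProduct / (Math.sqrt(sourceNormSq) * Math.sqrt(targetNormSq));
  }
}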

The results of this computation for term-document matrices normalized with TF/IDF and LSI respectively are shown below. As before, the inter-document similarity is higher for the LSI-normalized matrix.

=== Cosine Similarity (TF/IDF) ===
            D1      D2      D3      D4      D5      D6      D7
    D1  1.0000  0.0000  0.0000  0.1316  0.0000  0.0000  0.0000
    D2  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000  0.1680
    D3  0.0000  0.0000  1.0000  0.4198  0.0000  0.0000  0.0000
    D4  0.1316  0.0000  0.4198  1.0000  0.0000  0.0000  0.0000
    D5  0.0000  0.0000  0.0000  0.0000  1.0000  0.0000  0.0000
    D6  0.0000  0.0000  0.0000  0.0000  0.0000  1.0000  0.3154
    D7  0.0000  0.1680  0.0000  0.0000  0.0000  0.3154  1.0000

=== Cosine Similarity (LSI) ===
            D1      D2      D3      D4      D5      D6      D7
    D1  1.0000  0.0000  1.0000  1.0000  0.0000  0.0000  0.0000
    D2  0.0000  1.0000  0.0000  0.0000  1.0000  1.0000  1.0000
    D3  1.0000  0.0000  1.0000  1.0000  0.0000  0.0000  0.0000
    D4  1.0000  0.0000  1.0000  1.0000  0.0000  0.0000  0.0000
    D5  0.0000  1.0000  0.0000  0.0000  1.0000  1.0000  1.0000
    D6  0.0000  1.0000  0.0000  0.0000  1.0000  1.0000  1.0000
    D7  0.0000  1.0000  0.0000  0.0000  1.0000  1.0000  1.0000

Searching - Similarity against a Query Vector

Using the similarity implementations above, it is now possible to build a Searcher. The objective is to find the documents most similar to our query. For this, we build a query vector over the terms in the query, and then compute the similarity of each document against the query vector. The computation is the same as what we have already done above if we treat the query vector as a specialized instance of a document vector. We use Cosine Similarity for our computations here.
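For example, for the query "human computer interface" (assuming all three terms are in our vocabulary), the query vector built by getQueryMatrix() below has a 1 in each of the three rows corresponding to those terms and 0 everywhere else, and is then scaled by its norm1() (here 3), leaving 1/3 in each matched row.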

Searcher.java

// Source: src/main/java/com/mycompany/myapp/similarity/Searcher.java
package com.mycompany.myapp.similarity;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import Jama.Matrix;

public class Searcher {

  public class SearchResult {
    public String title;
    public double score;
    public SearchResult(String title, double score) {
      this.title = title;
      this.score = score;
    }
  };
  
  private Matrix termDocumentMatrix;
  private List<String> documents;
  private List<String> terms;
  private AbstractSimilarity similarity;
  
  public void setTermDocumentMatrix(Matrix termDocumentMatrix) {
    this.termDocumentMatrix = termDocumentMatrix;
  }
  
  public void setDocuments(String[] documents) {
    this.documents = Arrays.asList(documents);
  }
  
  public void setTerms(String[] terms) {
    this.terms = Arrays.asList(terms);
  }
  
  public void setSimilarity(AbstractSimilarity similarity) {
    this.similarity = similarity;
  }
  
  public List<SearchResult> search(String query) {
    // build up query matrix
    Matrix queryMatrix = getQueryMatrix(query);
    final Map<String,Double> similarityMap = 
      new HashMap<String,Double>();
    for (int i = 0; i < termDocumentMatrix.getColumnDimension(); i++) {
      double sim = similarity.computeSimilarity(queryMatrix, 
        termDocumentMatrix.getMatrix(
          0, termDocumentMatrix.getRowDimension() - 1, i, i));
      if (sim > 0.0D) {
        similarityMap.put(documents.get(i), sim);
      }
    }
    return sortByScore(similarityMap);
  }
  
  private Matrix getQueryMatrix(String query) {
    Matrix queryMatrix = new Matrix(terms.size(), 1, 0.0D);
    String[] queryTerms = query.split("\\s+");
    for (String queryTerm : queryTerms) {
      int termIdx = 0;
      for (String term : terms) {
        if (queryTerm.equalsIgnoreCase(term)) {
          queryMatrix.set(termIdx, 0, 1.0D);
        }
        termIdx++;
      }
    }
    // normalize by the number of matched terms; guard against a query
    // with no terms from our vocabulary, which would divide by zero
    double norm = queryMatrix.norm1();
    if (norm > 0.0D) {
      queryMatrix = queryMatrix.times(1 / norm);
    }
    return queryMatrix;
  }
  
  private List<SearchResult> sortByScore(
      final Map<String,Double> similarityMap) {
    List<SearchResult> results = new ArrayList<SearchResult>();
    List<String> docNames = new ArrayList<String>();
    docNames.addAll(similarityMap.keySet());
    Collections.sort(docNames, new Comparator<String>() {
      public int compare(String s1, String s2) {
        return similarityMap.get(s2).compareTo(similarityMap.get(s1));
      }
    });
    for (String docName : docNames) {
      double score = similarityMap.get(docName);
      if (score < 0.00001D) {
        continue;
      }
      results.add(new SearchResult(docName, score));
    }
    return results;
  }
}

The results of searching with the query "human computer interface" against term-document matrices normalized with TF/IDF and LSI respectively are shown below. Notice that we get a few more results from the LSI-normalized matrix. I have manually copied the titles over to make the result more readable. I suppose I could make the code do it, and I probably would have if the code were going to be used by anybody other than myself.

Results for query: [human computer interface] using TF/IDF
D1 (score =   0.8431) Human machine interface for computer applications
D4 (score =   0.1881) System and human system engineering testing of EPS

Results for query: [human computer interface] using LSI
D4 (score =   0.2467) System and human system engineering testing of EPS
D3 (score =   0.2467) The EPS user interface management system
D1 (score =   0.2467) Human machine interface for computer applications
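
If you did want the code to do that lookup, a minimal sketch would be to keep a map from document name to title and consult it while printing. The docTitles map here is hypothetical, populated in setUp() from the same semicolon-delimited lines of indexing_sample_data.txt:

// Hypothetical tweak to prettyPrintResults() in the JUnit test below.
private Map<String,String> docTitles = new LinkedHashMap<String,String>();

private void prettyPrintResults(String query, List<SearchResult> results) {
  System.out.printf("Results for query: [%s]%n", query);
  for (SearchResult result : results) {
    // result.title actually holds the document name, e.g. "D1"
    System.out.printf("%s (score = %8.4f) %s%n",
      result.title, result.score, docTitles.get(result.title));
  }
}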

Calling the code - the JUnit test

For completeness, as well as to show how the components above should be called, here is the JUnit test that was used to run these components and derive the results for this post.

// Source: src/test/java/com/mycompany/myapp/similarity/SimilarityTest.java
package com.mycompany.myapp.similarity;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.apache.commons.lang.StringUtils;
import org.junit.Before;
import org.junit.Test;
import org.springframework.jdbc.datasource.DriverManagerDataSource;

import Jama.Matrix;

import com.mycompany.myapp.indexers.IdfIndexer;
import com.mycompany.myapp.indexers.LsiIndexer;
import com.mycompany.myapp.indexers.VectorGenerator;
import com.mycompany.myapp.similarity.Searcher.SearchResult;

public class SimilarityTest {

  private VectorGenerator vectorGenerator;
  
  @Before
  public void setUp() throws Exception {
    vectorGenerator = new VectorGenerator();
    vectorGenerator.setDataSource(new DriverManagerDataSource(
      "com.mysql.jdbc.Driver", "jdbc:mysql://localhost:3306/tmdb", 
      "tmdb", "irstuff"));
    Map<String,Reader> documents = 
      new LinkedHashMap<String,Reader>();
    BufferedReader reader = new BufferedReader(
      new FileReader("src/test/resources/data/indexing_sample_data.txt"));
    String line = null;
    while ((line = reader.readLine()) != null) {
      String[] docTitleParts = StringUtils.split(line, ";");
      documents.put(docTitleParts[0], new StringReader(docTitleParts[1]));
    }
    vectorGenerator.generateVector(documents);
  }

  @Test
  public void testJaccardSimilarityWithTfIdfVector() throws Exception {
    IdfIndexer indexer = new IdfIndexer();
    Matrix termDocMatrix = indexer.transform(vectorGenerator.getMatrix());
    JaccardSimilarity jaccardSimilarity = new JaccardSimilarity();
    Matrix similarity = jaccardSimilarity.transform(termDocMatrix);
    prettyPrintMatrix("Jaccard Similarity (TF/IDF)", similarity, 
      vectorGenerator.getDocumentNames(), new PrintWriter(System.out, true));
  }
  
  @Test
  public void testJaccardSimilarityWithLsiVector() throws Exception {
    LsiIndexer indexer = new LsiIndexer();
    Matrix termDocMatrix = indexer.transform(vectorGenerator.getMatrix());
    JaccardSimilarity jaccardSimilarity = new JaccardSimilarity();
    Matrix similarity = jaccardSimilarity.transform(termDocMatrix);
    prettyPrintMatrix("Jaccard Similarity (LSI)", similarity, 
      vectorGenerator.getDocumentNames(), new PrintWriter(System.out, true));
  }

  @Test
  public void testCosineSimilarityWithTfIdfVector() throws Exception {
    IdfIndexer indexer = new IdfIndexer();
    Matrix termDocMatrix = indexer.transform(vectorGenerator.getMatrix());
    CosineSimilarity cosineSimilarity = new CosineSimilarity();
    Matrix similarity = cosineSimilarity.transform(termDocMatrix);
    prettyPrintMatrix("Cosine Similarity (TF/IDF)", similarity, 
      vectorGenerator.getDocumentNames(), new PrintWriter(System.out, true));
  }
  
  @Test
  public void testCosineSimilarityWithLsiVector() throws Exception {
    LsiIndexer indexer = new LsiIndexer();
    Matrix termDocMatrix = indexer.transform(vectorGenerator.getMatrix());
    CosineSimilarity cosineSimilarity = new CosineSimilarity();
    Matrix similarity = cosineSimilarity.transform(termDocMatrix);
    prettyPrintMatrix("Cosine Similarity (LSI)", similarity, 
      vectorGenerator.getDocumentNames(), new PrintWriter(System.out, true));
  }
  
  @Test
  public void testSearchWithTfIdfVector() throws Exception {
    // generate the term document matrix via the appropriate indexer
    IdfIndexer indexer = new IdfIndexer();
    Matrix termDocMatrix = indexer.transform(vectorGenerator.getMatrix());
    // set up the query
    Searcher searcher = new Searcher();
    searcher.setDocuments(vectorGenerator.getDocumentNames());
    searcher.setTerms(vectorGenerator.getWords());
    searcher.setSimilarity(new CosineSimilarity());
    searcher.setTermDocumentMatrix(termDocMatrix);
    // run the query
    List<SearchResult> results = 
      searcher.search("human computer interface");
    prettyPrintResults("human computer interface", results);
  }

  @Test
  public void testSearchWithLsiVector() throws Exception {
    // generate the term document matrix via the appropriate indexer
    LsiIndexer indexer = new LsiIndexer();
    Matrix termDocMatrix = indexer.transform(vectorGenerator.getMatrix());
    // set up the query
    Searcher searcher = new Searcher();
    searcher.setDocuments(vectorGenerator.getDocumentNames());
    searcher.setTerms(vectorGenerator.getWords());
    searcher.setSimilarity(new CosineSimilarity());
    searcher.setTermDocumentMatrix(termDocMatrix);
    // run the query
    List<SearchResult> results = 
      searcher.search("human computer interface");
    prettyPrintResults("human computer interface", results);
  }

  private void prettyPrintMatrix(String legend, Matrix matrix, 
      String[] documentNames, PrintWriter writer) {
    writer.printf("=== %s ===%n", legend);
    writer.printf("%6s", " ");
    for (int i = 0; i < documentNames.length; i++) {
      writer.printf("%8s", documentNames[i]);
    }
    writer.println();
    for (int i = 0; i < documentNames.length; i++) {
      writer.printf("%6s", documentNames[i]);
      for (int j = 0; j < documentNames.length; j++) {
        writer.printf("%8.4f", matrix.get(i, j));
      }
      writer.println();
    }
    writer.flush();
  }
  
  private void prettyPrintResults(String query, 
      List<SearchResult> results) {
    System.out.printf("Results for query: [%s]%n", query);
    for (SearchResult result : results) {
      System.out.printf("%s (score = %8.4f)%n", result.title, result.score);
    }
  }
}

Conclusion

I hope this post was helpful. I had looked at Term Vectors in Lucene in the past, but the idea of an n-dimensional term space never quite sank in until I started working through the stuff I describe above. I know that the data I am using is toy data, and that the computations described in this post are too expensive to be practical on real-life collections, but I believe that knowing this stuff helps in building models that would be useful and practical.

Update 2009-04-26: In recent posts, I have been building on code written and described in previous posts, so there were (and rightly so) quite a few requests for the code. So I've created a project on Sourceforge to host the code. You will find the complete source code built so far in the project's SVN repository.

84 comments:

  1. for a better collection of similarity metrics see SimMetrics on sourceforge or http://www.dcs.shef.ac.uk/~sam/simmetrics.html

  2. Thank you for the comment, Rev. Chapman, and thank you for compiling the list. It was very informative.

  3. Thanks, that's very useful for my current experimental project.

  4. Hi KaTat, glad the post helped. We should both really be thanking Dr Konchady :-), on whose book this stuff is based.

  5. Post is very useful for me working on sentimental analysis.
    but i am not able to understand how to create indexing_sample_data.txt file from sample data given.
    Help me to find this text file from data set.
    Thanks

  6. Hi Bablu, thanks for the feedback and glad it helped you. The file is in src/test/resources/data in the SVN repository for the JTMT project on sourceforge.

  7. Thanks for reply. it really worked.
    but i am getting java.lang.NoClassDefFoundError
    in the "this.jdbcTemplate = new JdbcTemplate(dataSource);" in phrase and abbreviation recognizer. I have included all header files. May be this error can because of incorrect path of datasource. If that so then please specify way of path of datasource. help me to remove this error.
    Thanks

  8. Hi Amit, NoClassDef means that the class could not be found, since this occurs in this line, it could either be JdbcTemplate or DataSource. The first is possible if you don't have Spring jars in your classpath, and the second should not happen, since DataSource is part of the javax.sql package built into the JDK (unless you are using an ancient version). In any case, your error message should tell you more - I suspect that you are missing spring jars.

  9. Hi sujit, thanks for reply!!!
    I got that error was because of datasource. I tried to make connection to mysql through jdbc but i got exception
    "com.mysql.jdbc.CommunicationsException: Communications link failure due to underlying exception:

    ** BEGIN NESTED EXCEPTION **

    java.net.SocketException
    MESSAGE: Permission denied: connect"
    help me to remove this error.
    Thanks...

  10. Hi Amit, this is probably because either (a) you need to expose a socket for JDBC to talk to MySql and/or (b) your username/password is incorrect or (c) some other reason I can't predict remotely. You may want to RTFM about Java/JDBC/MySQL connectivity, and if you think that is not necessary, try posting the problem on a MySQL forum - this looks like this is related to your setup and not with the code.

  11. Hello sujit,
    Your suggestions are really working.
    Now i am not able to create database using textmine project.I have downloaded textmine project. Can you please tell me process to create "tmdb" database using textmine project. it will be very helpful.
    Thanks for help......

  12. Hi Amit, good to know. As you can see, tmdb is simply a MySQL database loaded with data from the textmine project. You will find the data in the utils/data subdirectory of the textmine project. I had to massage it a bit, not sure if the code for that exists, if not, it was probably just a bunch of sed commands.

  13. hi this is tehseen .i need code in java for finding semantic similarity between sentences or similarity between sentences.This will help in my project a lot

  14. Hi Tehseen, calculating similarity between sentences is the same as calculating between documents, at least using the vector approach I describe here. Basically create a term/sentence matrix instead of a term/document matrix.

  15. Hi Sujit, am using LsiIndexer.java for my project. but while executing am getting error such as " can not access Jama.Matrix; bad class file:\java.Matrix ; please remove or make sure it appears in the correct subdirectory of the class path ". Please help me on this. Thanks in advance.

  16. Hi Ambika, you should download the entire project (contains updated LsiIndexer.java) from the SVN repository at jtmt.sf.net. I switched from Jama to commons-math-2.0 once they put the LSI Indexing stuff in there.

  17. I see your LSI similarity values. I saw the documents too. It is saying 1.0 while I can see the documents are not even the same in meaning. For example doc2 and doc6 are not similar. Same for other records. How come your program's calculations are saying this? If I missed anything, please let me know.

  18. Hi Sujit Fan, not sure why LSI says they are similar when they are not actually. I believe I investigated it once but can't be sure. If you dump out the components of the decomposition, it would probably give you a better idea - you would see the terms that are the principal components - perhaps these words co-occur in the documents, even though they don't contribute much meaning? Then they may be good candidates for stopwords.

  19. Hi Sujit,
    Thanks for your quick reply. You wrote that JAMA is not good for sparse matrix. But I think SVD is used to handle sparse matrix. Isn't it? If I am wrong can you please explain what problem we can face if we use a big sparse matrix in jama? speed? accuracy? etc. Also can you please tell me if there is any way to make JAMA work where rows are less than columns??? If NO! then can you point out any other LSI java package that I can use? I am looking for a small package like jama, not a BIG ONE :)

  20. Hi, AFAIK, Jama has no support for sparse matrices - internally matrices are treated as a double[][] - when I say sparse matrix support, I mean that the data structure will only store the non-zero elements of the matrix. Obviously for large sparse matrices such as those for TD matrices, this can result in significant memory savings. Also AFAIK, SVD does not care about whether the matrix is sparse or dense. Also Jama is no longer under active development, so I would recommend commons-math (which is a bit larger than Jama) - they have recently been looking at SVD improvements lately. Actually, I would suggest that you get on the apache commons list and ask your questions there, you will probably get better answers there than from me.

  21. Hi Sujit,
    Thank you so much for such a brilliant post. Am using your LSI and cosine similarity concept in my project. I downloaded your project, but While executing SimilarityTest.java , am getting error “ Exception in thread “main” java.lang.NoSuchMethoderror: main “ . I couldn’t find “main” method in your code. Please help me to solve this.
    One more doubt is, I dont quite understand it..why do we use a database here.
    vectorGenerator.setDataSource(new DriverManagerDataSource( "com.mysql.jdbc.Driver", "jdbc:mysql://localhost:3306/tmdb", "root", "orange"));
    Is any other settings I should do for this like jdbcodbcconnection,,??
    Can I use oracle10g for this?
    is it compulsory to use database??
    Sorry, if my questions are too lame..
    Please help me on this. I will be grateful to you.

    Thanks in advance

    Regards,
    Anjali

  22. Hi Anjali, thanks for the kind words.

    To answer your first question, I am guessing that you may be running the SimilarityTest as a Java application (from your IDE perhaps?), in which case it looks for the main() method - you will need to run this as a JUnit application - you can do this from your IDE or using Ant or Maven from the command line - see the documentation for whatever tool you are using.

    For the second question, the database is used for recognizing abbreviations and multi-word phrases. The first is a list of known abbreviations, which you can also put into a text file. The second is a bit more complex, since phrases are stored as collocated words in the database. In the jtmt project there are schema and data files that contain the table def and data for the table. Using a database seemed to be the most obvious way to do this, but of course it may be possible (although probably not very easy) to store them in some other form.

  23. Hi Sujit,

    I just went through your post and guess it will be useful in my project. But I am not so good with my programming.

    Could you assist on how this all works.

  24. Hi helpNeeded, the post is just some Java implementations of various similarity measures I found in the TMAP book and elsewhere. When you say "how it all works", are you looking for the underlying mathematical theory behind these? If so, you should probably look elsewhere (do a search). If you are asking about the code itself, it should be fairly simple if you know Java.

  25. Thanks for your reply.

    Trying out the code but I keep getting errors with the collections15 import. And i'll not be able to continue with the CosineSimilarity.java class.

  26. You can get the generic collections library here.

  27. Hi Sujit..
    How can I identify an efficient search query? During my testing I found that if the search vector is good it will give accurate results. But if the query is worse then we can't expect the right output...
    Thank u...

  28. Hi Srijith, isn't this what you would expect on any search engine? When you say "bad" search vector, perhaps it contains too few terms or contains terms that are very common in the corpus, right? So I guess there would be two ways to go about "improving" your search vector - either insist on a minimum number of terms or post process the search vector to remove common terms? Or maybe I am not understanding what you are looking for?

  29. Thank u sujit
    I have created LSI of 10 near duplicate documents and passing each document as a search vector.But even if all the documents are near duplicates,for some search query LSI give very less score,some docs are identified as near duplicates and others not...

    I put 0.8 as a threshold to filter the near duplicates,but some near duplicates give a score less than 0.8 and some exact different document give a score > 0.8 ..

    So i can't conclude....This is the scenario..I hope u could now able to figure out my issue..
    Thank u....

  30. Hi Srijith, my understanding of LSI is that you achieve "semantic" indexing by removing the noisy dimensions out, leaving only the dimensions which truly represent the document. So documents that otherwise wouldn't look similar because of the presence of noisy dimensions now do. I am guessing that you did the same thing on the search vector side, ie removing the noisy dimensions from the search vector. Could the documents that don't seem to match have some outlier terms after cleanup that are causing the results to be skewed?

  31. Hi Sujit, Gr8 library for text mining. Can you provide the schema for tmdb. There is no utils/data subdirectory in the textmine project.

  32. Thanks Ajaz. About the schema in the textmine project, I just downloaded the textmine-0.21 tarball (from textmine.sf.net and click on Download) and exploded it locally, I see a whole bunch of .dat files under utils/data, and a tables.sql with the schema in utils. I used that to do the initial population, then added a few tables of my own, whose schemas are in the jtmt project.

  33. Hi Pal,
    I find a collection of .sql files in the jtmt project. What is the order in which they should be executed for creating the database and populating it from the dat files? Please provide the steps. Also, for Windows only wordnet 2.1 is available - will it suffice?

  34. Thanks for the info Sujit. will try it out.

  35. Hi Sujit,
    I am trying to get jtmt working on my PC (Windows), but am facing some trouble with the SQL database. What all scripts need to be run or tables created? I found a few '.sql' files at jtmt\src\main\sql - do all these files need to be used?
    I am interested in getting similarity metrics alone working.

  36. @Anonymous:
    The .sql files can be applied in any order after the textmine data is loaded (see my reply to Ajaz's comment above for details about that). If multiple tables are to be created in a certain order (because of fk references, etc), they are contained in a single .sql file so they are created in the correct order.

    To access wordnet, I use jwi, which are advertised to work with Wordnet versions 1.6-3.0, so I guess you should be fine with Wordnet 2.1.

    @lmthyaz:
    For .sql order, please see my response to anonymous above.

    For restricting to tables for similarity, unfortunately I don't have a good answer for you. I suggest only loading up the textmine tables first, then running the JUnit test case for similarity, and see if it complains about missing tables, then add them in.

  37. Hi Sujit,
    I am using the IdfIndexer and cosine similarity from jtmt. When trying to find the similarity between two documents (corpus has only 2 files) I am getting a similarity of 'NaN'. What does it mean? Where do I make a change if I want to modify it?

  38. Hi Imthyaz, hard to say without a bit more information, offhand I would say its probably an underflow or overflow (the resulting double is too small or too large to fit in the space provided), if you trace through the code and add some bounds checking, something like this:

    if (result >= Double.MAX_VALUE) {
    ..result = Double.MAX_VALUE;
    }
    if (result <= Double.MIN_VALUE) {
    ..result = Double.MIN_VALUE;
    }

    I think the NaNs would disappear.

  39. can you please tell me if its possible to find similarity between phrases using wordnet in java

  40. Not sure about phrases, but I recently came across a Perl module called Wordnet::Similarity, which provides implementations of various functions for using Wordnet to find similarities between words. You may want to check it out.

  41. thanks for providing the code for tf-idf.I am new in Java programming. I tried running the first code stopwordrecognizer but it gave me errors like
    unable to find class: Irecognizer
    unable to find class: Token
    from where do i get these files?

  42. Hi, you can find the IRecognizer class in the JTMT project (jtmt.sf.net).

  43. I implement the follwoing class:

    public class IdfIndexer implements Transformer

    But the problem setEntry is unknown.

    Can you help me

  44. Transformer does not need setEntry(), all it needs is a transform() method, can you please provide more information?

  45. Once we have the inner-document similarity matrix, How to use it for clustering?

    Thanks

  46. It probably won't be such a great idea to do this. The similarity matrix would be useful to find "related documents" from the one selected, but can't tell why its similar. Clustering clusters docs by features, so documents in a cluster are similar along that feature.

  47. Hi Pal,
    Thank you for your reply.
    I calculate the similarity between documents by myself using my own algorithm, so that's why I generate a doc to doc matrix, but do not know how to cluster them.

    thanks,

  48. Hi David, thought a bit more about this, and I take my comment about it being a bad idea back... I think you could just use a standard algo such as Knn and replace the distance metric with your doc-doc similarity number. The k (size of each cluster) would be determined by the number of docs in your corpus / number of clusters you want.

  49. Hi sujit

    Can u please tell me how to run tarball textmine.21

  50. Hi Rachit, its been a while since I pulled the data from Dr Konchady's textmine project, but basically I used the instruction in the project page. The TMAP book has some pointers as well, but I think the installation instructions on the project page should provide enough detail.

  51. package com.mycompany.myapp.similarity; how to down load sir..
    thank you

  52. Hi arivu, you can find it in the SVN repository for my JTMT project.

  53. Hi Sujit
    Firstly thanks for the great post that really helps a lot for us. I have some few queries regarding similarity in documents.
    1. in this post you have used database to detect abbreviations and multi-word phrases. Can we exclude this part? I mean simple create indexing of the words removing stop words and functional words. in all case (lsi, if/idf), termDocMatrix needs vectorGenerator.getMatrix() as parameter. what will be appropriate parameter for termDocMatrix if we do not wish to use database? Simply, want to use each file preprocessing (remove stop words and stemming) and indexing then create termDocMatrix.
    2. If we are going to use the database, I get confuse which my sql file is to run. I found four .sql file at jtmt\src\main\sql viz. tmdb_load, tmdb_my_colloc, tmdb_postagger, tmdb_schema
    3. But for the data, I could not find utils/data in the project.
    4. Is it possible to run the JUnit as Java application adding main function?
    Regars,
    Niraj

  54. Hi Niraj, sorry about the delay in responding. Yes, I think you can safely remove the database based abbreviation/collocation detection stuff while computing document similarity. I believe the sequence in you would run the SQL files is: tmdb_schema, tmdb_load, tmdb_my_colloc and tmdb_postagger. The data comes from the TextMine (textmine.sf.net) project. And yes, you can take the contents of the function annotated by @Test and put it into the main() function then you can run the JUnit code as a Java application.

  55. Post is very helping me,
    but i don't understand how to create the documents,
    can you help me sir?

  56. Hi Adriana, the documents in the example are taken from Dr Konchady's TMAP book. They are just titles.

  57. Hi,
    How I can calculate the similarity for two vectors with different sizes?
    Thanks

  58. Hi Sara, if you are using cosine similarity then you cannot. In case of document vectors, they should be the same size, as each element of the vector will correspond to one word/term in the vocabulary (which is a superset of all terms found in the corpus). If they do differ in size for some reason, then you can pad the shorter one with zeros (in case of text you are saying that the extra terms occur zero times in the shorter document vector).

  59. Hi ,

    I did the same thing and made all the zeros live. But when I use Gaussian mixtures for clustering, it does not work because I have many zeros in my matrix and all of my parameters will be NaN.
    Thanks.

  60. I am guessing you are trying to find similarities between "average" documents across genres, yes? If so, very cool idea. The NaNs are most likely because of underflow problems. One way to work around that is to apply a Laplace correction (add 1 to the numerator and v to the denominator, where v is the vocabulary size) to the TD vectors, although if you do that it will change the data from sparse to dense (although you can just bake this behavior into a subclass of Sparse Matrix to keep the memory properties the same). But it may be good to find out actually where the NaN is occurring before taking such drastic action.

  61. Actually, I am trying to cluster the documents using Gaussian mixtures. The problem is that since the sparse matrix has many zeros, the variance will be zero and the program does not work. It happens when I am using Dirichlet mixtures too. I was wondering if you could explain more how I can change the data to a dense matrix? What is the denominator?
    Many thanks for your time.
    Sara

  62. Well, the idea is to add 1 to the numerator of every element in the TD matrix and v to its denominator where v is the vocabulary size. So assume a term t has occurred in document d n times, and N is the total number of words in the corpus, then the frequency of t is n/N, with the Laplace correction it becomes (n+1)/(N+v). This now changes your sparse matrix into a dense one.

  63. Thanks Sujit,
    Unfortunately, using dense matrix didn't help me and I still have problem.

  64. You're welcome, hopefully you will be able to fix the problem...

  65. Hello,

    Thank you for your post. On the cosine similarity part, you used norm1 ( double dotProduct = sourceDoc.arrayTimes(targetDoc).norm1() ) to compute the dotProduct. From what I referred norm1 gives always positive results which makes the cosine similarity result always between 0 and 1 (while cosine similarity can also be between -1 and 0). Am I missing something or its a bug?

    Thanks

  66. Thanks, glad it was useful. The use of norm1 is not a bug, you can never have negative cosine in case of text since term frequencies are always zero or positive.

  67. Hello,

    Thanks for your reply. Yes, you are right, in the example you gave it is always positive. I tried to use your cosine similarity method on vectors taken from the V-matrix generated after SVD and it had negative values. Now, I have written mine. Anyway thank you for your post and prompt reply.

    Best

  68. Oh, okay, I see why you could get negative values now...thanks for sharing, I will look out for this in the future.

  69. Hi Sujit,
    I am trying to understand the similarity matrix output you have printed. You have five docs say d1,d2,..d5. How is the similarity represented here? Lets say we find similarity between d1 and d2, how do we represent in matrix form?

    Please explain.
    Thanks,
    Rob.

  70. Hi Rob, d1 and d2 point to row and column indexes 0 and 1 respectively. So similarity(d1,d2) would be at either [0,1] or [1,0] of the matrix (both values are equal). Also notice the diagonal is all 1, indicating that any document is perfectly similar to itself.

  71. Hi Sujit,
    I am a newbie to vector space modelling. I recently had to create a tool which could take in a query and give back a list of files from a source codebase using tf-idf. Can you suggest how to go about it?

    VSM

  72. Hi, if you already have an (nxm) TD matrix of your documents (call it D), you could use the same vectorizer to create an (nx1) vector for your query (call it Q), then you can generate an (mx1) similarity vector S = D' * Q (D transposed times Q). Depending on the size of your document set, you can do the computation with dense or sparse matrices in memory, or you could use Hadoop if the matrices don't fit into memory.

  73. Hello, what I just read is very interesting. I have to work with similarity measures and have read about cosine similarity; I find your post a great help.

  74. Thank you very much for your kind words. My Spanish is not good, but Google Translate is my friend :-).

  75. Yes, I understand; I do the same with English. I have a question: I downloaded the 4 classes but I cannot run them because none of them has a main method - I mean the classes CosineSimilarity, JaccardSimilarity, AbstractSimilarity, and Searcher. Please help me resolve this. THANKS

  76. You're welcome. They are not meant to be run by themselves, but as part of an indexing process. You can see how that is set up in SimilarityTest.java (full source code is available at jtmt.sf.net).

  77. Hi Sujit, This indeed is a great post which helps to understand a lot of concepts.
    On the cosine similarity part if I have a structured data-set of addresses(Name, fathers name, city, state, country, pincode) and I would like to associate weights to each of the entity for calculating the cosine similarity between the sentences.

    I found an approach using Spark MLLIB in the below link
    https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/CosineSimilarity.scala

    But in this case we have a generic weight for the whole sentence.

    Thanks
    Sandip

  78. Thanks for the kind words, Sandip. Also thanks for the pointer to the Spark cosine similarity code. I had heard of the DIMSUM algorithm, and realized after seeing the link that it's already implemented in Spark. The code you pointed to computes cosine similarity (exact and approximate) between all pairs of sentences in the matrix. Each sentence is represented by a column in the matrix. I guess you could calculate cosine similarity in the same way for your structured address records, as long as you separate out the features, ie, (name1, name2), (address_line1_1, address_line1_2), ..., (pincode1, pincode2) must be padded to the same length. That way when you compute cosine similarity, it will be composed of each of these components separately. This also allows you to give different weights to the different components.

  79. Hello Mr. Pal, thank you for such a great post. I would like to ask something related to my project. I will just use the similarity methods and not all of your work. I have documents like yours, but as one text file containing many lines with sentences like;

    D1: word1,word2,word3,.....
    D2: word1,word2,word3,.....
    D3:..

    To use your similarity functions, how can I convert my txt contents to a raw document matrix?
    I couldn't find your email address to ask you directly :) thank you very much for your interest.

  80. Hi What2Watch, I wrote this code a long time ago when I was working almost exclusively in Java and trying to explore text mining. To answer your question, the general steps to go from text files to document matrices are outlined very nicely on this Working with Text Data page on the Scikit-Learn documentation website. It's in Python (since Scikit-Learn is a Python library), but Python is more comfortable to work with than Java for this sort of stuff (I mostly do my programming in Python nowadays), mainly because there are a lot of high level modules that allow you to focus more on the application than the infrastructure. Plus Python is a more terse language so you have to write less boilerplate.

    Anyway, to summarize, you tokenize the documents, build a vocabulary with the words, optionally clip the size of your vocabulary to the top most frequent and treat all other words as out-of-vocabulary, convert each document to a bag of words, optionally compute TF-IDF. At this point you have a term document matrix. To compute similarity, you need a document-document matrix, so you multiply the document-term matrix by its transpose. Each entry in this matrix is the unnormalized similarity between the 2 documents represented by the row and column. If you need to normalize, divide by the L2 norm of the two input matrices.

  81. Thank you for your effort. Please can you tell me how to convert my dataset? It is a .csv with a lot of words, and I want to convert each word to a vector to do similarity and text mining. My big problem is converting my dataset; can you help me with how to start? I have read about bag of words, cosine similarity, and the vector space model, but I don't know how to start. Thank you for your time.

  82. Yes, all of these need your input as a matrix. Easiest way to convert your text to a matrix (where each row represents a record and each column represents a word) is to use one of Scikit-learn's vectorizers -- in general the frequency distribution of words in text is very long tailed, so it is customary to only look at the N most important words. There are two approaches, first is to consider only the top N words and any other word as unknown (so essentially a vocabulary of N+1 words), or to map the full vocabulary into a smaller space of N, so some words may collide and occupy the same space in the reduced space. At this point you are looking at simple term frequency, ie, the entry in the matrix at position (i, j) is the number of times the j-th word in the vocabulary occurs in the i-th document. If you use the first approach, you would use the CountVectorizer, and if you use the second approach you would use the HashingVectorizer. In case you want to do some normalizations beyond term frequency, such as damp the effect of very frequent words, you could use the TfidfTransformer/TfidfVectorizer. Full set of scikit-learn vectorizers are listed here. You could also choose to use word vectors, that represent a semantic embedding in some low dimensional space (usually 100 or 300) against some large (not yours) datasets. For these, you would simply look up every word in your text against this external vector dataset (in the form word => vector) and replace all your words with their corresponding vectors, and arrange that in a matrix form (each row represents a row in the matrix). Thats basically all there is to the preprocessing.

  83. Hi Mr.Pal,
    I need more clarification on creating the source matrix and the query matrix.
    How can I do that in Python? Would you please be kind enough to show how to use them? I have read your previous comment reply, but that confused me more. You said that position (i,j) gives the frequency of a word in the relevant document, but how can we represent a whole document in a row? Please be kind enough to explain with pseudocode or something like that so I can understand more. Thank you for your effort in helping the community with this kind of work. It's very rare to find this type of example.

  84. Apologies for the late reply. If you are still interested in the answer, following on from the comment that position (i, j) of the TF matrix gives the frequency of the word j in document i: the i-th row represents document i, and the j entries of row i hold the term frequencies of the vocabulary words. Not all the words in the vocabulary will be in document i, so those slots will have 0s in them. So consider a corpus with 100 documents and 50,000 words in the vocabulary; the TF matrix will be of shape (100, 50000). A document at i=20, say, may have only 100 distinct words, so 100 of the 50,000 slots in the i-th row will have values filled in, and the rest will be 0. Hope this helps!

