Saturday, September 20, 2008

IR Math with Java : TF, IDF and LSI

Recently, I started working on a fairly large text-mining project. During that time I have seen several common sense heuristics being designed and applied with very good results (some of them from me :-)), so I think a big part of being an IR (Information Retrieval) programmer is the ability to think quantitatively and be able to model problems in simple mathematical or statistical terms. Unless you are some kind of math genius (which I am not) or already have a background in applied math, it helps to know something about the models that are being used or proposed to solve various classes of problems, in order to have a starting point.

Text Mining Application Programming by Dr. Manu Konchady, provides a lot of the math background I am looking for. The book targets programmers, not mathematicians or scientists, so it's easy to read (for me). It provides lucid explanations (with pseudo-code in the book and Perl code in the author's TextMine project) for basic algorithms used to solve some IR problems. The book doesn't cover advanced approaches, as one of my colleagues pointed out to me, but it provides a good base which one can use to research more advanced approaches.

I've been working through this book, off and on, since I bought it. I learn better by doing, so I try to build the components that are described in that chapter. I have written about these efforts earlier here and here. In this post, I describe my code for generating various types of "indexes" (really term/document matrices based off a toy collection of documents) based on the algorithms discussed in Chapter 3 of the book.

The book describes three types of indexing approaches - term frequency (TF), inverse document frequency (IDF) and latent semantic indexing (LSI). To compute the frequency matrix, it takes a collection of 7 titles and creates a term document vector by tokenizing the titles. The list of 7 document titles are shown below:

1
2
3
4
5
6
7
D1      Human machine interface for computer applications
D2      A survey of user opinion of computer system response time
D3      The EPS user interface management system
D4      System and human system engineering testing of EPS
D5      The generation of random, binary and ordered trees
D6      The intersection graph of paths in trees
D7      Graph minors: A survey

Raw Frequency Extraction

To extract the frequencies, we must first extract content words and phrases from the text. I have described tokenizers and token recognizers in earlier posts. For this work, we create two additional recognizers, a stop word recognizer and a content word recognizer.

New recognizer: StopwordRecognizer.java

This recognizer has its own list of stop words if called with a default (no-args) constructor. It can also be instantiated with a List of custom stopwords (from a custom document collection using Zipf's Law) in case that is desired. It checks for TokenType.WORD (so if a word is already classified as abbreviation or phrase, it will not be touched), and if it's value is in it's stop set, then its marked as TokenType.STOP_WORD.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
// Source: src/main/java/com/mycompany/myapp/recognizers/StopwordRecognizer.java
package com.mycompany.myapp.recognizers;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.commons.lang.StringUtils;

import com.mycompany.myapp.tokenizers.Token;
import com.mycompany.myapp.tokenizers.TokenType;

/**
 * A recognizer that recognizes common stop words. Special stopwords may
 * be passed in through the non-default constructor.
 */
public class StopwordRecognizer implements IRecognizer {

  // this list is taken from the TextMine project
  private static final String DEFAULT_STOPWORDS = 
    "a about add ago after all also an and another any are as at be " +
    "because been before being between big both but by came can come " +
    "could did do does due each else end far few for from get got had " +
    "has have he her here him himself his how if in into is it its " +
    "just let lie like low make many me might more most much must " +
    "my never no nor not now of off old on only or other our out over " +
    "per pre put re said same see she should since so some still such " +
    "take than that the their them then there these they this those " +
    "through to too under up use very via want was way we well were " +
    "what when where which while who will with would yes yet you your";

  private Set<String> stopwords = new HashSet<String>();
  
  public StopwordRecognizer() {
    super();
  }
  
  public StopwordRecognizer(String[] stopwords) {
    this.stopwords.addAll(Arrays.asList(stopwords));
  }
  
  public void init() throws Exception {
    if (stopwords.size() == 0) {
      String[] stopwordArray = StringUtils.split(DEFAULT_STOPWORDS, " ");
      stopwords.addAll(Arrays.asList(stopwordArray));
    }
  }

  public List<Token> recognize(List<Token> tokens) {
    List<Token> recognizedTokens = new ArrayList<Token>();
    for (Token token : tokens) {
      if (token.getType() == TokenType.WORD) {
        if (stopwords.contains(StringUtils.lowerCase(token.getValue()))) {
          token.setType(TokenType.STOP_WORD);
        }
      }
      recognizedTokens.add(token);
    }
    return recognizedTokens;
  }
}

New recognizer: ContentWordRecognizer.java

This recognizer filters out nouns, verbs, adjectives and adverbs and marks them as TokenType.CONTENT_WORDS. As in the previous recognizer, only words which are TokenType.WORD are acted on. The part-of-speech recognition is done using the WordNet dictionary, and the API to it is the MIT Java WordNet Interface.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
// Source: src/main/java/com/mycompany/myapp/recognizers/ContentWordRecognizer.java
package com.mycompany.myapp.recognizers;

import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import com.mycompany.myapp.tokenizers.Token;
import com.mycompany.myapp.tokenizers.TokenType;

import edu.mit.jwi.Dictionary;
import edu.mit.jwi.IDictionary;
import edu.mit.jwi.item.IIndexWord;
import edu.mit.jwi.item.POS;

/**
 * Recognizes content words (noun, verb, adjective, and adverb) from a
 * List of Token objects. Only TokenType.WORD tokens are considered in
 * this recognizer, and are converted to TokenType.CONTENT_WORD. Words
 * are looked up against the WordNet dictionary.
 */
public class ContentWordRecognizer implements IRecognizer {

  private IDictionary dictionary;
  private List<POS> allowablePartsOfSpeech = Arrays.asList(new POS[] {
    POS.NOUN, POS.VERB, POS.ADJECTIVE, POS.ADVERB});
  
  public void init() throws Exception {
    this.dictionary = new Dictionary(
      new URL("file", null, "/opt/wordnet-3.0/dict"));
    dictionary.open();
  }

  public List<Token> recognize(List<Token> tokens) {
    List<Token> outputTokens = new ArrayList<Token>();
    for (Token token : tokens) {
      Token outputToken = new Token(token.getValue(), token.getType());
      if (token.getType() == TokenType.WORD) {
        String word = token.getValue();
        for (POS allowablePartOfSpeech : allowablePartsOfSpeech) {
          IIndexWord indexWord = 
            dictionary.getIndexWord(word, allowablePartOfSpeech);
          if (indexWord != null) {
            outputToken.setType(TokenType.CONTENT_WORD);
            break;
          }
        }
      }
      outputTokens.add(outputToken);
    }
    return outputTokens;
  }
}

Generating the initial vector: VectorGenerator.java

The initial vector is created from a Map of document names to Readers pointing at the titles. This may seem overly complex for our particular situation, where we could have done with a Map<String,String>, but I was going for a more general solution with Map<String,Reader> where the Reader reads files of content. The code for the VectorGenerator is shown below. Its fairly simple, it tokenizes the titles into words, then for each word, passes it through a chain of recognizers. At the end of it, it only extracts the content words and creates a term-document vector as shown below:

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
// Source: src/main/java/com/mycompany/myapp/indexers/VectorGenerator.java
package com.mycompany.myapp.indexers;

import java.io.PrintWriter;
import java.io.Reader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

import javax.sql.DataSource;

import org.apache.commons.collections15.Bag;
import org.apache.commons.collections15.bag.HashBag;
import org.apache.commons.lang.ArrayUtils;
import org.apache.commons.lang.StringUtils;
import org.springframework.beans.factory.annotation.Required;

import Jama.Matrix;

import com.mycompany.myapp.recognizers.AbbreviationRecognizer;
import com.mycompany.myapp.recognizers.BoundaryRecognizer;
import com.mycompany.myapp.recognizers.ContentWordRecognizer;
import com.mycompany.myapp.recognizers.IRecognizer;
import com.mycompany.myapp.recognizers.PhraseRecognizer;
import com.mycompany.myapp.recognizers.RecognizerChain;
import com.mycompany.myapp.recognizers.StopwordRecognizer;
import com.mycompany.myapp.tokenizers.Token;
import com.mycompany.myapp.tokenizers.TokenType;
import com.mycompany.myapp.tokenizers.WordTokenizer;

/**
 * Generate the word occurence vector for a document collection.
 */
public class VectorGenerator {

  private DataSource dataSource;
  
  private Map<Integer,String> wordIdValueMap = 
    new HashMap<Integer,String>();
  private Map<Integer,String> documentIdNameMap = 
    new HashMap<Integer,String>();
  private Matrix matrix;

  @Required
  public void setDataSource(DataSource dataSource) {
    this.dataSource = dataSource;
  }

  public void generateVector(Map<String,Reader> documents) 
      throws Exception {
    Map<String,Bag<String>> documentWordFrequencyMap = 
      new HashMap<String,Bag<String>>();
    SortedSet<String> wordSet = new TreeSet<String>();
    Integer docId = 0;
    for (String key : documents.keySet()) {
      String text = getText(documents.get(key));
      Bag<String> wordFrequencies = getWordFrequencies(text);
      wordSet.addAll(wordFrequencies.uniqueSet());
      documentWordFrequencyMap.put(key, wordFrequencies);
      documentIdNameMap.put(docId, key);
      docId++;
    }
    // create a Map of ids to words from the wordSet
    int wordId = 0;
    for (String word : wordSet) {
      wordIdValueMap.put(wordId, word);
      wordId++;
    }
    // we need a documents.keySet().size() x wordSet.size() matrix to hold
    // this info
    int numDocs = documents.keySet().size();
    int numWords = wordSet.size();
    double[][] data = new double[numWords][numDocs];
    for (int i = 0; i < numWords; i++) {
      for (int j = 0; j < numDocs; j++) {
        String docName = documentIdNameMap.get(j);
        Bag<String> wordFrequencies = 
          documentWordFrequencyMap.get(docName);
        String word = wordIdValueMap.get(i);
        int count = wordFrequencies.getCount(word);
        data[i][j] = count;
      }
    }
    matrix = new Matrix(data);
  }

  public Matrix getMatrix() {
    return matrix;
  }
  
  public String[] getDocumentNames() {
    String[] documentNames = new String[documentIdNameMap.keySet().size()];
    for (int i = 0; i < documentNames.length; i++) {
      documentNames[i] = documentIdNameMap.get(i);
    }
    return documentNames;
  }
  
  public String[] getWords() {
    String[] words = new String[wordIdValueMap.keySet().size()];
    for (int i = 0; i < words.length; i++) {
      String word = wordIdValueMap.get(i);
      if (word.contains("|||")) {
        // phrases are stored with length for other purposes, strip it off
        // for this report.
        word = word.substring(0, word.indexOf("|||"));
      }
      words[i] = word;
    }
    return words;
  }

  private Bag<String> getWordFrequencies(String text) 
      throws Exception {
    Bag<String> wordBag = new HashBag<String>();
    WordTokenizer wordTokenizer = new WordTokenizer();
    wordTokenizer.setText(text);
    List<Token> tokens = new ArrayList<Token>();
    Token token = null;
    while ((token = wordTokenizer.nextToken()) != null) {
      tokens.add(token);
    }
    RecognizerChain recognizerChain = new RecognizerChain(
        Arrays.asList(new IRecognizer[] {
        new BoundaryRecognizer(),
        new AbbreviationRecognizer(dataSource),
        new PhraseRecognizer(dataSource),
        new StopwordRecognizer(),
        new ContentWordRecognizer()
    }));
    recognizerChain.init();
    List<Token> recognizedTokens = recognizerChain.recognize(tokens);
    for (Token recognizedToken : recognizedTokens) {
      if (recognizedToken.getType() == TokenType.ABBREVIATION ||
          recognizedToken.getType() == TokenType.PHRASE ||
          recognizedToken.getType() == TokenType.CONTENT_WORD) {
        // lowercase words to treat Human and human as the same word
        wordBag.add(StringUtils.lowerCase(recognizedToken.getValue()));
      }
    }
    return wordBag;
  }

  private String getText(Reader reader) throws Exception {
    StringBuilder textBuilder = new StringBuilder();
    char[] cbuf = new char[1024];
    int len = 0;
    while ((len = reader.read(cbuf, 0, 1024)) != -1) {
      textBuilder.append(ArrayUtils.subarray(cbuf, 0, len));
    }
    reader.close();
    return textBuilder.toString();
  }
}

The test case (to run each example) consists of a single JUnit test (see below), which simply instantiates and runs each "indexer" implementation.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
// Source: src/test/java/com/mycompany/myapp/indexers/IndexersTest.java
package com.mycompany.myapp.indexers;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.commons.lang.StringUtils;
import org.junit.Before;
import org.junit.Test;
import org.springframework.jdbc.datasource.DriverManagerDataSource;

import Jama.Matrix;

public class IndexersTest {

  private VectorGenerator vectorGenerator;
  private Map<String,Reader> documents;
  
  @Before
  public void setUp() throws Exception {
    vectorGenerator = new VectorGenerator();
    vectorGenerator.setDataSource(new DriverManagerDataSource(
      "com.mysql.jdbc.Driver", "jdbc:mysql://localhost:3306/tmdb", 
      "tmdb", "irstuff"));
    documents = new LinkedHashMap<String,Reader>();
    BufferedReader reader = new BufferedReader(
      new FileReader("src/test/resources/data/indexing_sample_data.txt"));
    String line = null;
    while ((line = reader.readLine()) != null) {
      String[] docTitleParts = StringUtils.split(line, ";");
      documents.put(docTitleParts[0], new StringReader(docTitleParts[1]));
    }
  }
  
  @Test
  public void testVectorGeneration() throws Exception {
    vectorGenerator.generateVector(documents);
    prettyPrintMatrix("Raw Term Frequencies", vectorGenerator.getMatrix(), 
      vectorGenerator.getDocumentNames(), vectorGenerator.getWords(), 
      new PrintWriter(System.out, true));
  }
  
  @Test
  public void testTfIndexer() throws Exception {
    vectorGenerator.generateVector(documents);
    TfIndexer indexer = new TfIndexer();

    Matrix tfMatrix = indexer.transform(vectorGenerator.getMatrix());
    prettyPrintMatrix("Term Frequency", tfMatrix, 
      vectorGenerator.getDocumentNames(), vectorGenerator.getWords(), 
      new PrintWriter(System.out, true));
  }
  
  @Test
  public void testIdfIndexer() throws Exception {
    vectorGenerator.generateVector(documents);
    IdfIndexer indexer = new IdfIndexer();
    Matrix idfMatrix = indexer.transform(vectorGenerator.getMatrix());
    prettyPrintMatrix("Inverse Document Frequency", idfMatrix,
      vectorGenerator.getDocumentNames(), vectorGenerator.getWords(),
      new PrintWriter(System.out, true));
  }
  
  @Test
  public void testLsiIndexer() throws Exception {
    vectorGenerator.generateVector(documents);
    LsiIndexer indexer = new LsiIndexer();
    Matrix lsiMatrix = indexer.transform(vectorGenerator.getMatrix());
    prettyPrintMatrix("Latent Semantic (LSI)", lsiMatrix,
      vectorGenerator.getDocumentNames(), vectorGenerator.getWords(),
      new PrintWriter(System.out, true));
  }
  
  private void prettyPrintMatrix(String legend, Matrix matrix, 
      String[] documentNames, String[] words, PrintWriter writer) {
    writer.printf("=== %s ===%n", legend);
    writer.printf("%15s", " ");
    for (int i = 0; i < documentNames.length; i++) {
      writer.printf("%8s", documentNames[i]);
    }
    writer.println();
    for (int i = 0; i < words.length; i++) {
      writer.printf("%15s", words[i]);
      for (int j = 0; j < documentNames.length; j++) {
        writer.printf("%8.4f", matrix.get(i, j));
      }
      writer.println();
    }
    writer.flush();
  }
}

The first test in the JUnit test outputs our initial raw matrix. Later tests will create it again (see @Before) and operate on it in different ways. Here is what the raw matrix will look like:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
=== Raw Term Frequencies ===
                     D1      D2      D3      D4      D5      D6      D7
         binary  0.0000  0.0000  0.0000  0.0000  1.0000  0.0000  0.0000
       computer  1.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
computer system  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000  0.0000
    engineering  0.0000  0.0000  0.0000  1.0000  0.0000  0.0000  0.0000
            eps  0.0000  0.0000  1.0000  1.0000  0.0000  0.0000  0.0000
     generation  0.0000  0.0000  0.0000  0.0000  1.0000  0.0000  0.0000
          graph  0.0000  0.0000  0.0000  0.0000  0.0000  1.0000  1.0000
          human  1.0000  0.0000  0.0000  1.0000  0.0000  0.0000  0.0000
      interface  1.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
   intersection  0.0000  0.0000  0.0000  0.0000  0.0000  1.0000  0.0000
        machine  1.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
     management  0.0000  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000
         minors  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  1.0000
        opinion  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000  0.0000
        ordered  0.0000  0.0000  0.0000  0.0000  1.0000  0.0000  0.0000
         random  0.0000  0.0000  0.0000  0.0000  1.0000  0.0000  0.0000
       response  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000  0.0000
         survey  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000  1.0000
         system  0.0000  0.0000  1.0000  2.0000  0.0000  0.0000  0.0000
        testing  0.0000  0.0000  0.0000  1.0000  0.0000  0.0000  0.0000
           time  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000  0.0000
           user  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000  0.0000
 user interface  0.0000  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000

Term Frequency Indexing

The term frequency indexing method is the most simplistic, all it does is normalize the raw frequencies across a single document. Thus, if a document had two words, one occuring twice and the other occuring thrice, the first word would be normalized to 2/5 (0.4) and the other to 3/5 (0.6). We have used Jama, a Java library algebra library, because of its ability to do SVD (but more on that later). Here is the code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
// Source: src/main/java/com/mycompany/myapp/indexers/TfIndexer.java
package com.mycompany.myapp.indexers;

import org.apache.commons.collections15.Transformer;

import Jama.Matrix;

/**
 * Normalizes the occurence matrix per document. Divides each entry by the
 * sum of occurence values for that column. That way a longer document which
 * has 2 occurences of a certain word will not be ranked higher than a
 * shorter document with 1 occurence of the word for that word. At the 
 * end of this transformation, the values are the frequency of the word 
 * in the document.
 */
public class TfIndexer implements Transformer<Matrix,Matrix> {

  public Matrix transform(Matrix matrix) {
    for (int j = 0; j < matrix.getColumnDimension(); j++) {
      double sum = sum(matrix.getMatrix(
        0, matrix.getRowDimension() -1, j, j));
      for (int i = 0; i < matrix.getRowDimension(); i++) {
        matrix.set(i, j, (matrix.get(i, j) / sum));
      }
    }
    return matrix;
  }

  private double sum(Matrix colMatrix) {
    double sum = 0.0D;
    for (int i = 0; i < colMatrix.getRowDimension(); i++) {
      sum += colMatrix.get(i, 0);
    }
    return sum;
  }
}

The results of this computation is shown below. Notice that all the columns now add up to 1, meaning that all documents are being treated the same, regardless of their length (and consequently their number of content words).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
=== Term Frequency ===
                     D1      D2      D3      D4      D5      D6      D7
         binary  0.0000  0.0000  0.0000  0.0000  0.2500  0.0000  0.0000
       computer  0.2500  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
computer system  0.0000  0.1667  0.0000  0.0000  0.0000  0.0000  0.0000
    engineering  0.0000  0.0000  0.0000  0.1667  0.0000  0.0000  0.0000
            eps  0.0000  0.0000  0.2500  0.1667  0.0000  0.0000  0.0000
     generation  0.0000  0.0000  0.0000  0.0000  0.2500  0.0000  0.0000
          graph  0.0000  0.0000  0.0000  0.0000  0.0000  0.5000  0.3333
          human  0.2500  0.0000  0.0000  0.1667  0.0000  0.0000  0.0000
      interface  0.2500  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
   intersection  0.0000  0.0000  0.0000  0.0000  0.0000  0.5000  0.0000
        machine  0.2500  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
     management  0.0000  0.0000  0.2500  0.0000  0.0000  0.0000  0.0000
         minors  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.3333
        opinion  0.0000  0.1667  0.0000  0.0000  0.0000  0.0000  0.0000
        ordered  0.0000  0.0000  0.0000  0.0000  0.2500  0.0000  0.0000
         random  0.0000  0.0000  0.0000  0.0000  0.2500  0.0000  0.0000
       response  0.0000  0.1667  0.0000  0.0000  0.0000  0.0000  0.0000
         survey  0.0000  0.1667  0.0000  0.0000  0.0000  0.0000  0.3333
         system  0.0000  0.0000  0.2500  0.3333  0.0000  0.0000  0.0000
        testing  0.0000  0.0000  0.0000  0.1667  0.0000  0.0000  0.0000
           time  0.0000  0.1667  0.0000  0.0000  0.0000  0.0000  0.0000
           user  0.0000  0.1667  0.0000  0.0000  0.0000  0.0000  0.0000
 user interface  0.0000  0.0000  0.2500  0.0000  0.0000  0.0000  0.0000

Inverse Document Frequency Indexing

Inverse Document Frequency attempts to smooth out the frequency of a word across documents. If a word occurs in more than one document, that means that it is less "precise" and hence its value should go down. The code below will take the raw vector and apply IDF to it in the form of a logarithmic smoothing operator, then normalize the results. So this is a combination of TF and IDF. The smoothing operator is:

  weighti,j = term_frequencyi,j * (1 + log(N) - log(di)
  where:
    weighti,j = value of the IDF matrix for documenti, wordj.
    term_frequencyi,j = raw frequency of the word at position (i,j).
    N = number of documents
    di = number of documents containing word i.

The indexer code is shown below:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
// Source: src/main/java/net/sf/jtmt/indexers/IdfIndexer.java
package net.sf.jtmt.indexers;

import org.apache.commons.collections15.Transformer;
import org.apache.commons.math.linear.RealMatrix;

/**
 * Reduces the weight of words which are commonly found (ie in more
 * documents). The factor by which it is reduced is chosen from the book
 * as:
 * f(m) = 1 + log(N/d(m))
 * where N = total number of docs in collection
 *       d(m) = number of docs containing word m
 * so where a word is more frequent (ie d(m) is high, f(m) would be low.
 */
public class IdfIndexer implements Transformer<RealMatrix,RealMatrix> {

  public RealMatrix transform(RealMatrix matrix) {
    // Phase 1: apply IDF weight to the raw word frequencies
    int n = matrix.getColumnDimension();
    for (int j = 0; j < matrix.getColumnDimension(); j++) {
      for (int i = 0; i < matrix.getRowDimension(); i++) {
        double matrixElement = matrix.getEntry(i, j);
        if (matrixElement > 0.0D) {
          double dm = countDocsWithWord(
            matrix.getSubMatrix(i, i, 0, matrix.getColumnDimension() - 1));
          matrix.setEntry(i, j, matrix.getEntry(i,j) * (1 + Math.log(n) - Math.log(dm)));
        }
      }
    }
    // Phase 2: normalize the word scores for a single document
    for (int j = 0; j < matrix.getColumnDimension(); j++) {
      double sum = sum(matrix.getSubMatrix(0, matrix.getRowDimension() -1, j, j));
      for (int i = 0; i < matrix.getRowDimension(); i++) {
        matrix.setEntry(i, j, (matrix.getEntry(i, j) / sum));
      }
    }
    return matrix;
  }

  private double sum(RealMatrix colMatrix) {
    double sum = 0.0D;
    for (int i = 0; i < colMatrix.getRowDimension(); i++) {
      sum += colMatrix.getEntry(i, 0);
    }
    return sum;
  }

  private double countDocsWithWord(RealMatrix rowMatrix) {
    double numDocs = 0.0D;
    for (int j = 0; j < rowMatrix.getColumnDimension(); j++) {
      if (rowMatrix.getEntry(0, j) > 0.0D) {
        numDocs++;
      }
    }
    return numDocs;
  }
}

The resulting vector after IDF and normalization is applied is shown below. Notice that scores for words (such as human) which occur in more than one document has decreased.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
=== Inverse Document Frequency ===
                     D1      D2      D3      D4      D5      D6      D7
         binary  0.0000  0.0000  0.0000  0.0000  0.2500  0.0000  0.0000
       computer  0.2656  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
computer system  0.0000  0.1735  0.0000  0.0000  0.0000  0.0000  0.0000
    engineering  0.0000  0.0000  0.0000  0.1977  0.0000  0.0000  0.0000
            eps  0.0000  0.0000  0.2167  0.1512  0.0000  0.0000  0.0000
     generation  0.0000  0.0000  0.0000  0.0000  0.2500  0.0000  0.0000
          graph  0.0000  0.0000  0.0000  0.0000  0.0000  0.4333  0.3023
          human  0.2031  0.0000  0.0000  0.1512  0.0000  0.0000  0.0000
      interface  0.2656  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
   intersection  0.0000  0.0000  0.0000  0.0000  0.0000  0.5667  0.0000
        machine  0.2656  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
     management  0.0000  0.0000  0.2833  0.0000  0.0000  0.0000  0.0000
         minors  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.3953
        opinion  0.0000  0.1735  0.0000  0.0000  0.0000  0.0000  0.0000
        ordered  0.0000  0.0000  0.0000  0.0000  0.2500  0.0000  0.0000
         random  0.0000  0.0000  0.0000  0.0000  0.2500  0.0000  0.0000
       response  0.0000  0.1735  0.0000  0.0000  0.0000  0.0000  0.0000
         survey  0.0000  0.1327  0.0000  0.0000  0.0000  0.0000  0.3023
         system  0.0000  0.0000  0.2167  0.3023  0.0000  0.0000  0.0000
        testing  0.0000  0.0000  0.0000  0.1977  0.0000  0.0000  0.0000
           time  0.0000  0.1735  0.0000  0.0000  0.0000  0.0000  0.0000
           user  0.0000  0.1735  0.0000  0.0000  0.0000  0.0000  0.0000
 user interface  0.0000  0.0000  0.2833  0.0000  0.0000  0.0000  0.0000

Latent Semantic Indexing (LSI)

Latent Semantic Indexing attempts to uncover latent relationships among documents based on word co-occurence. So if document A contains (w1,w2) and document B contains (w2,w3), we can conclude that there is something common between documents A and B. LSI does this by decomposing the input raw term frequency matrix (A, see below) into three different matrices (U, S and V) using Singular Value Decomposition (SVD). Once that is done, the three vectors are "reduced" and the original vector rebuilt from the reduced vectors. Because of the reduction, noisy relationships are suppressed and relations become very clearly visible. In pseudo-code:

  A = U * S * VT
  Ak = Uk * Sk * VkT
  where:
    A = the original matrix
    U = the word vector
    S = the sigma vector
    V = the document vector
    Uk = the reduced word submatrix consisting of 0..k-1 cols
    Sk = the reduced sigma submatrix consisting of 0..k-1 cols, 0..k-1 rows
    Vk = the reduced document submatrix consisting of 0..k-1 cols.
  Note:
    Jama will give you back V, so you need to reduce and transpose it
    before you compute Ak.

Dr E Garcia has an excellent LSI/SVD tutorial explaining this stuff in much more detail, which I found very useful.

As mentioned before, Jama was chosen because it was the only free Java library package I knew of that could do SVD. The code for the LSI Indexer is here:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
// Source: src/main/java/com/mycompany/myapp/indexers/LsiIndexer.java
package com.mycompany.myapp.indexers;

import org.apache.commons.collections15.Transformer;

import Jama.Matrix;
import Jama.SingularValueDecomposition;

/**
 * Uses Latent Semantic Indexing to find word associations between docs. 
 * Idea is to find the intersections of the words found in each document
 * and score them accordingly. We use SVD to accomplish this. We first
 * decompose the word frequency vector into the three parts, then multiply
 * the three components back to get our transformed matrix.
 */
public class LsiIndexer implements Transformer<Matrix,Matrix> {

  public Matrix transform(Matrix matrix) {
    // phase 1: Singular value decomposition
    SingularValueDecomposition svd = new SingularValueDecomposition(matrix);
    Matrix wordVector = svd.getU();
    Matrix sigma = svd.getS();
    Matrix documentVector = svd.getV();
    // compute the value of k (ie where to truncate)
    int k = (int) Math.floor(Math.sqrt(matrix.getColumnDimension()));
    Matrix reducedWordVector = wordVector.getMatrix(
      0, wordVector.getRowDimension() - 1, 0, k - 1);
    Matrix reducedSigma = sigma.getMatrix(0, k - 1, 0, k - 1);
    Matrix reducedDocumentVector = documentVector.getMatrix(
      0, documentVector.getRowDimension() - 1, 0, k - 1);
    Matrix weights = reducedWordVector.times(
      reducedSigma).times(reducedDocumentVector.transpose());
    // Phase 2: normalize the word scrores for a single document
    for (int j = 0; j < weights.getColumnDimension(); j++) {
      double sum = sum(weights.getMatrix(
        0, weights.getRowDimension() - 1, j, j));
      for (int i = 0; i < weights.getRowDimension(); i++) {
        weights.set(i, j, Math.abs((weights.get(i, j)) / sum));
      }
    }
    return weights;
  }

  private double sum(Matrix colMatrix) {
    double sum = 0.0D;
    for (int i = 0; i < colMatrix.getRowDimension(); i++) {
      sum += colMatrix.get(i, 0);
    }
    return sum;
  }
}

And here is the output from the indexer. First the raw frequencies go through the singular value decomposition, reduction and recomposition process, then they are normalized for each document. Notice that there are more non-zero elements representing latent "relationship" weights.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
=== Latent Semantic (LSI) ===
                     D1      D2      D3      D4      D5      D6      D7
         binary  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
       computer  0.0198  0.0000  0.0198  0.0198  0.0000  0.0000  0.0000
computer system  0.0000  0.1405  0.0000  0.0000  0.1405  0.1405  0.1405
    engineering  0.1138  0.0000  0.1138  0.1138  0.0000  0.0000  0.0000
            eps  0.1733  0.0000  0.1733  0.1733  0.0000  0.0000  0.0000
     generation  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
          graph  0.0000  0.0559  0.0000  0.0000  0.0559  0.0559  0.0559
          human  0.1336  0.0000  0.1336  0.1336  0.0000  0.0000  0.0000
      interface  0.0198  0.0000  0.0198  0.0198  0.0000  0.0000  0.0000
   intersection  0.0000  0.0105  0.0000  0.0000  0.0105  0.0105  0.0105
        machine  0.0198  0.0000  0.0198  0.0198  0.0000  0.0000  0.0000
     management  0.0595  0.0000  0.0595  0.0595  0.0000  0.0000  0.0000
         minors  0.0000  0.0454  0.0000  0.0000  0.0454  0.0454  0.0454
        opinion  0.0000  0.1405  0.0000  0.0000  0.1405  0.1405  0.1405
        ordered  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
         random  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
       response  0.0000  0.1405  0.0000  0.0000  0.1405  0.1405  0.1405
         survey  0.0000  0.1859  0.0000  0.0000  0.1859  0.1859  0.1859
         system  0.2871  0.0000  0.2871  0.2871  0.0000  0.0000  0.0000
        testing  0.1138  0.0000  0.1138  0.1138  0.0000  0.0000  0.0000
           time  0.0000  0.1405  0.0000  0.0000  0.1405  0.1405  0.1405
           user  0.0000  0.1405  0.0000  0.0000  0.1405  0.1405  0.1405
 user interface  0.0595  0.0000  0.0595  0.0595  0.0000  0.0000  0.0000
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 19.458 sec

Conclusion

The first two indexing methods is probably familiar to a lot of people, and it is very likely that the ones in use (indirectly from common IR libraries such as Lucene) in most shops are quite a bit more advanced than the ones shown. LSI using SVD was a new approach to me, it became intuitively obvious once I understood the process. Hopefully, this article was able to share some of my new-found insight with you. The Java code for each of these processes illustrates how easy it is to actually do these transforms, especially using libraries that do most of the heavy lifting.

Update 2009-04-26: In recent posts, I have been building on code written and described in previous posts, so there were (and rightly so) quite a few requests for the code. So I've created a project on Sourceforge to host the code. You will find the complete source code built so far in the project's SVN repository.

39 comments (moderated to prevent spam):

Sujit Pal said...

A poster who would like to remain anonymous pointed out (via email) the Semantic Vectors package which does semantic analysis (similar to LSI) using Random Projection but is computationally less intensive than doing SVD.

shiva!!! said...

hi..am working on finding the conceptual cohesion of classes using the LSI method.It takes comments lines between methods and comapres them using lSI to see if they are coherent.can you throw some light on how i could find if any two vectors created for documents are textually coherent.i read that if we apply cosine btwn the vectors it is possible.Is it possible to take cosine between two vectors.? what would the result be?

Sujit Pal said...

Sounds like a neat idea, Shiva. And yes, you can compute the cosine similarity between two term vectors - see my post on Similarity Measures, there should be an example there. The result of the cosine similarity is the cosine of the angle between the two document term-vectors in n-dimensional space, where n is the number of terms which occur in the two documents (union). You may also want to take a look at Dr Garcia's post (referenced in the article above).

shiva!!! said...

hi sujit..
Is it possbile for me to apply tf/idf and lsi..the former followed by the latter.That is i apply the lsi to the matrix genertated by the tf/idf measure instead of the raw matrix..what would this produce..

Sujit Pal said...

Hi Shiva, I'm no expert, but IDF and LSI are both normalization methods to reduce noise, so you could either do LSI first followed by TF (as in my example) or go the other way as you suggest. I suspect that your results may end up too normalized to be usable, but YMMV.

shiva!!! said...

Thanks a lot sujit..:)

shiva!!! said...

hi sujit..am quite new to java and am awed by your use of packages and libraries.I am actually working on the same text mining prof in my UG.Am currently using the netbeans IDE.i do not know how to run your prog in mine..can you pl help me with that?!

Also this piece of code is from the indexers test class..I dont quite understand it..why do we use a database.

vectorGenerator = new VectorGenerator();
vectorGenerator.setDataSource(new DriverManagerDataSource(
"com.mysql.jdbc.Driver",
.
.
.
.

/data/indexing_sample_data.txt"));


I actually have seven text files containing the titles how should i give it as input to the program? hope my doubts are not too lame..

shiva!!! said...

Hi sujit..did you implement the IR concepts in the post in an IDE.if so can you please mail me your project..my id is s.sivasubramanian.87@gmail.com
expecting a reply..
thanks..

Sujit Pal said...

Hi Shiva, sorry about the delay in replying - I was swamped for most of last week. Unfortunately, I cannot share the project, as it is in various pieces and contains some information that is not in the public domain. But if you are looking to find which jar files to include, take a look at jarfinder, I recently built a Webservice client and it was very helpful.

To answer your question about why there is a database connection for the VectorGenerator, some recognizers used for tokenizing depend on a database (they suck up the data at startup so everything is in-memory during processing).

To answer your other question about how to pass the data files, the VectorGenerator takes a Map of document names and Reader objects (take a look at the IndexersTest#testVectorGeneration() for an example, I've done this with Strings, but your requirement is to use the file itself.

I use Eclipse myself, but I use Netbeans for other languages, and I like it too, although not so much for Java programming.

David said...

how to use the TF-IDF for document clustering using Ant Colony Optimization in java?
Thx

shiva!!! said...

hi sujit..thanks for your comments..can you provide me with the word break rule.txt file used in the word tokenizer class..what exactly does this class do?

Sujit Pal said...

Its in here. It breaks up a text into word tokens.

Sujit Pal said...

@David: I just checked out this page, thanks for the pointer to this, I did not know of this class of algorithms. Since I don't know much about Ant colony optimization algorithms in general, my reply is probably not going to be hugely helpful, but it looks like a variation of standard shortest path algorithms. The output of TF-IDF is a normalized term vector, which can be represented as a point in n-dimensional space, where n is the number of terms in your vector. It would now be possible to calculate the distance between two documents in this space (basically cosine similarity). However, since you are using a swarm technique, I am guessing you will randomly compute the distance between pairs in parallel until you find some sort of convergence, then re-iterate, something like genetic algorithms, and ultimately you will come up with a set of cluster centroids, and then attach other documents to these centroids. I have some information here that may be helpful.

shiva!!! said...

thanks man!

Sujit Pal said...

You are welcome, Shiva, and best of luck with your project.

shiva!!! said...

hi sujit..i have a few doubts..AGAIN..

one..if i try to find the cosine similarity of a matrix obtained using lsi all the fields of the resulting matrix becomes 1.unable to figure that out

two.how should i change your code so that i give the output of the idf-tf phase as input to the lsi instead of raw term frequency matrix

thanks for all your support..really appreciate it.

regards

Sujit Pal said...

Hi Shiva, to answer your questions...
1) I suspect that because LSI reduces your document to its principal components (depending on the value of k for your S matrix, you are now looking only at the top k dimensions), the resulting documents are similar along their top k dimensions (cos 0 = 1).
2) Each of these components are designed as a Transformer that takes a Matrix and returns another Matrix. So the raw matrix is returned by VectorGenerator. If you wrap the output of VectorGenerator with the TfIdf transformer, then that will return you the matrix that has been converted using TF/IDF.

shiva!!! said...

thanks

DavidLiauw said...

I read the following article:
http://www.ims.sakarya.edu.tr/ims2006papers/IMS2006-107.doc

but I encountered problems at the time document clustering. Can you help me?

kalai said...

how to get the package src/main/java/com/mycompany/myapp/recognizers/StopwordRecognizer

Sujit Pal said...

@kalai: there are some recognizers described here. The stop word recognizer simply checks to see if word tokens are in a given list, and if so, skips adding it to the list of recognized tokens.

@DavidLiauw: thanks for the link to the article, I am still not that familiar with ant-colony optimization, but I will look at the article and post here if I can make some sense out of it.

betty said...

hi sujit..
Recently I learned from you.Your articles give me a lot of help.Now
I have a question about the class IdfIndexer.In this part,normalize the word scores for a single document, why do you do?Another,you difinite "double sum" ,what it means?

Thanks
Betty

Sujit Pal said...

Hi Betty, thanks for your comment and I am glad that the post was helpful. IDF indexing is done to reduce the influence of words that occur too many times in the corpus - these words have low discriminatory value when it comes to searching through the documents. The "double sum" just assigns storage for the variable sum, not sure I understand the question?

Sujit Pal said...

@David, I posted an answer to your question above in the comments section of this page, as I think its more relevant there.

DavidLiauw said...

On your blog, you form a TF-idf matrix of all documents. how to form a matrix-based TF-idf query from some of the documents in the index.

for example, i need matrix fro query "computer".

here my code:

public Matrix getTFIDFMatrix(File indexDir) throws IOException, ParseException {
Directory fsDir = FSDirectory.getDirectory(indexDir, false);
IndexReader reader = IndexReader.open(fsDir);

int numberOfTerms = 0;
int numberOfDocuments = reader.numDocs();
TermEnum allTerms = reader.terms();

for (; allTerms.next();) {
allTerms.term();

numberOfTerms++;
}
System.out.println("Total number of terms in index is " + numberOfTerms);
System.out.println("Total number of documents in index is " + numberOfDocuments);

double[][] TFIDFMatrix = new double[numberOfTerms][numberOfDocuments];

for (int i = 0; i < numberOfTerms; i++) {
for (int j = 0; j < numberOfDocuments; j++) {
TFIDFMatrix[i][j] = 0.0;

}
}

allTerms = reader.terms();
for (int i = 0; allTerms.next(); i++) {

Term term = allTerms.term();
TermDocs td = reader.termDocs(term);
for (; td.next();) {
TFIDFMatrix[i][td.doc()] = td.freq();
}

}

allTerms = reader.terms();
for (int i = 0; allTerms.next(); i++) {
for (int j = 0; j < numberOfDocuments; j++) {

double tf = TFIDFMatrix[i][j];
double docFreq = (double) allTerms.docFreq();
double idf = (Math.log((double) numberOfDocuments / docFreq)) / 2.30258509299405;
// System.out.println( "Term: " + i + " Document " + j + " TF/DocFreq/IDF: " + tf + " " + docFreq + " " + idf);
TFIDFMatrix[i][j] = tf * idf;

}
}

reader.close();

return new Matrix(TFIDFMatrix);
}

Sujit Pal said...

Hi David, since you are using lucene indexes, perhaps you could first find the docIds that come back from a query "body:computer" or similar, then run through your code filtering on docId?

Paul T said...

In the IdfIndexer class the method CountDocsWithWord is called to produce a Double assigned to the variable dm. This method is called for every element in the matrix, but the dm variable is only used when the matrix element is non-zero. For text processing a significant amount of elements have a value of zero (obviously), yet this computationally intensive method runs for every single one. By moving the call to CountDocsWithWord inside the brackets after the

if (matrixElement > 0.0D)

check the code runs more efficiently by several orders of magnitude. While testing I reduced the processing of a large set of documents from 45 hours to 13 minutes by moving this method call. I'd suggest changing the code both on the blog and in the SourceForge project. As it is the code is almost unusable for any non-trivial sized documents.

Sujit Pal said...

Thanks Paul, I've updated your fix in both the code on the page and in the svn repository.

Musi Ali said...

Hi Sujit Pal,

I am a PhD student, n my area is recommender system. Currently I m working on extracting keywords (after stop word removal, stemming, and using only those words having more than some predefined TF-IDF threshold) from documents. The dimensions of my matrix can be very large.

I first looked at weka, it was not very good for this purpose, then at Lucene, it is also not providing good help in TF/IDF. I was going to build my own fraemwork, but now I found yours which is looking quite good.

I just wanted to ask, what is its performance in terms of speed/scalibility?

I am trying to reuse your code, which may arise some more questions to ask...;)

Thanks,

Rgds,

Sujit Pal said...

Hi Musi, thanks for your appreciation, but I would hardly call what I have here a framework - this was simply a way to teach myself a bit of IR from Manu Konchady's book. To answer your question on scalability, I haven't tried it on large documents - you probably want to use the SparseMatrix coming soon in commons-math 2.0 in order to scale memory-wise. I am no expert, but if you have questions, let me know, and I will try to answer them to the best of my knowledge.

From the description of your work, you are probably attempting to extract attributes for sentiment analysis. If so, you are probably looking at comments and/or reviews, which are typically not that large, so perhaps the stuff I have here will probably suit your needs.

Anonymous said...

hi sujit, thanks for your nice blog. it was really helpful to me.

i was testing your SVD code on some sample matrix. unfortunately it did not work. i was using an idf matrix and trying to perform svd on it.

heres the idf matrix:

Idf Matrix:

0.0 0.35774232409318646
0.3860085623910699 0.0
0.0 0.5366134861397797
0.3860085623910699 0.0
0.22798287521786023 0.10564418976703387

heres the svd matrix i got:

Svd Matrix:
0.2780378031824575 0.2780378031824575
0.08600220175459367 0.08600220175459365
0.4170567047736861 0.4170567047736861
0.08600220175459371 0.08600220175459371
0.13290108853466892 0.13290108853466895

you can see both of the cols are similar. which should not be the case. right?

H said...

hey sujit,
I very much appreciate your work, though I'm not really sure if (and why and how ;) ) the LSI part works correctly. how comes that the matrix you present afterwards still shows a term x document matrix containing the entire vocabulary, shouldn't it be reduced to k x documents matrix?

I experimented for a link recovery project with these values and it seemed to me that taking the squareroot, it results in inadequate values. how are your experiences?

cheers for your work and congrats,
henning

Sujit Pal said...

Thanks Henning and Anonymous, there seems to be a bug in the LsiIndexer. As H pointed out, the reduced matrix should have smaller dimensions after reconstruction - I will have to look at the code and get back to you on this. I will also run the example you gave me, Anonymous, and see if I can replicate this.

H said...

well, i have to correct my previous comment. LSI does not reduce the real dimension of the matrix, it stays the same size according to deerwester in his paper.
but in my calculations the resulting matrix seems to be way more "shuffled". anyway, thanks again for the work and examples,
henning

Sujit Pal said...

Thanks H, and yes, LSI is basically decomposing your (txd) matrix to a product of (txd) matrix U, (dxd) diagonal matrix S, and a (dxd) matrix V', then recomposing the original matrix with the first k values, ie (txk) matrix u, (kxk) matrix s and (kxd) matrix v', and the end result is the reduced matrix of size (t,d), which is the same as before. I wasn't thinking too clearly when I responded, sorry about that.

In any case, I think I know what the problem is now. My original code was written using the Jama library, for which the results are shown in the post. Now Jama offers only a dense matrix, and since the matrices one normally works with in a text processing are large and sparse, I went looking for a library that offered sparse matrices. The closest I could find was commons-math, which had a RealMatrix interface, although the only implementation was a dense RealMatrixImpl. I build a SparseRealMatrixImpl and contributed it to commons-math, which they accepted, so I took the snapshot and removed all the Jama references in favor of equivalent commons-math ones. I soon realized that certain parts of the existing commons-math package, such as SingularValueDecomposition, was built to work with dense matrices (RealMatrixImpl) only, although they exposed a RealMatrix reference. The devs were actively working on this, though, and I figured that I would just wait till they were done, and then I forgot all about it until you and Anonymous pointed it out recently.

I have replaced the commons-math-2.0-SNAPSHOT.jar with commons-math-2.0.jar in the pom.xml and the lib directory, so hopefully, and verified that my LsiIndexer call in IndexersTest return the exact same data that is posted on the blog. Hopefully this will work for you too.

Anonymous, can you please try out your example with the new libs and see if this fixes the problem too. One thing I would like to point out though - usually, LSI works for (txd) matrices where t >> d - in your case (and possibly to a lesser extent in my example), that does not seem to be true.

To answer your square root question, H, I haven't used it in LSI, since I don't have the kind of storage to work with large dense matrices. I have used it as an estimate of the number of clusters I should split a document set into and that worked fairly well.

Anonymous said...

hi sujit, im just curious the way you are normalising by dividing with a method "sum", is it a part of the procedure for tf, idf or lsi? if one does not go through this dividing procedure, is his way correct as well?

Sujit Pal said...

Hi, the reason for the TF normalization is that you may have a short document with say 10 occurrences of a term T, and a longer document with 50 - whether this means that the longer doc is a "better" result for the term T kind of depends. The premise behind TF normalization is that term density is a better indicator than term count - you have to decide whether that is true for you.

I think it makes sense to normalize if you want to do cosine similarity or clustering type of calculations on it, because you want the coordinate values of each document to be comparable.

sita said...

how to find the package for cosine similarity
// Source: src/main/java/src/com/mycompany/myapp/similarity/CosineSimilarity.jav
a
package com.healthline.jrocker.similarity;
please help me to execute this code

Sujit Pal said...

Hi Sita, you can find it in the similarity package of the jtmt project.