Saturday, September 20, 2008

IR Math with Java : TF, IDF and LSI

Recently, I started working on a fairly large text-mining project. During that time I have seen several common-sense heuristics being designed and applied with very good results (some of them from me :-)), so I think a big part of being an IR (Information Retrieval) programmer is the ability to think quantitatively and to model problems in simple mathematical or statistical terms. Unless you are some kind of math genius (which I am not) or already have a background in applied math, it helps to know something about the models that are used or proposed to solve various classes of problems, so that you have a starting point.

Text Mining Application Programming, by Dr. Manu Konchady, provides a lot of the math background I am looking for. The book targets programmers, not mathematicians or scientists, so it's easy to read (for me). It provides lucid explanations (with pseudo-code in the book and Perl code in the author's TextMine project) for basic algorithms used to solve some IR problems. The book doesn't cover advanced approaches, as one of my colleagues pointed out to me, but it provides a good base from which to research them.

I've been working through this book, off and on, since I bought it. I learn better by doing, so I try to build the components described in each chapter as I work through it. I have written about these efforts earlier here and here. In this post, I describe my code for generating various types of "indexes" (really term-document matrices built from a toy collection of documents) using the algorithms discussed in Chapter 3 of the book.

The book describes three indexing approaches - term frequency (TF), inverse document frequency (IDF) and latent semantic indexing (LSI). To compute the frequency matrix, it takes a collection of 7 titles and creates a term-document matrix by tokenizing the titles. The 7 document titles are shown below:

D1      Human machine interface for computer applications
D2      A survey of user opinion of computer system response time
D3      The EPS user interface management system
D4      System and human system engineering testing of EPS
D5      The generation of random, binary and ordered trees
D6      The intersection graph of paths in trees
D7      Graph minors: A survey

Raw Frequency Extraction

To extract the frequencies, we must first extract content words and phrases from the text. I have described tokenizers and token recognizers in earlier posts. For this work, we create two additional recognizers, a stop word recognizer and a content word recognizer.
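
Both recognizers implement the project's IRecognizer interface, which I have not reproduced in this post. A minimal sketch of what it looks like, inferred from the two implementations below (the actual interface in the project may differ slightly):

// Sketch only, inferred from the implementations below:
// src/main/java/com/mycompany/myapp/recognizers/IRecognizer.java
package com.mycompany.myapp.recognizers;

import java.util.List;

import com.mycompany.myapp.tokenizers.Token;

public interface IRecognizer {

  /** Set up any resources (stop word lists, dictionaries, etc). */
  public void init() throws Exception;

  /** Examine the input tokens and return tokens with (re)assigned types. */
  public List<Token> recognize(List<Token> tokens);
}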

New recognizer: StopwordRecognizer.java

This recognizer has its own list of stop words if created with the default (no-args) constructor. It can also be instantiated with an array of custom stopwords (for example, one derived from a custom document collection using Zipf's Law) in case that is desired. It only considers tokens of type TokenType.WORD (so a word that has already been classified as an abbreviation or phrase will not be touched), and if the token's value is in the stop set, it is marked as TokenType.STOP_WORD.

// Source: src/main/java/com/mycompany/myapp/recognizers/StopwordRecognizer.java
package com.mycompany.myapp.recognizers;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.commons.lang.StringUtils;

import com.mycompany.myapp.tokenizers.Token;
import com.mycompany.myapp.tokenizers.TokenType;

/**
 * A recognizer that recognizes common stop words. Special stopwords may
 * be passed in through the non-default constructor.
 */
public class StopwordRecognizer implements IRecognizer {

  // this list is taken from the TextMine project
  private static final String DEFAULT_STOPWORDS = 
    "a about add ago after all also an and another any are as at be " +
    "because been before being between big both but by came can come " +
    "could did do does due each else end far few for from get got had " +
    "has have he her here him himself his how if in into is it its " +
    "just let lie like low make many me might more most much must " +
    "my never no nor not now of off old on only or other our out over " +
    "per pre put re said same see she should since so some still such " +
    "take than that the their them then there these they this those " +
    "through to too under up use very via want was way we well were " +
    "what when where which while who will with would yes yet you your";

  private Set<String> stopwords = new HashSet<String>();
  
  public StopwordRecognizer() {
    super();
  }
  
  public StopwordRecognizer(String[] stopwords) {
    this.stopwords.addAll(Arrays.asList(stopwords));
  }
  
  public void init() throws Exception {
    if (stopwords.size() == 0) {
      String[] stopwordArray = StringUtils.split(DEFAULT_STOPWORDS, " ");
      stopwords.addAll(Arrays.asList(stopwordArray));
    }
  }

  public List<Token> recognize(List<Token> tokens) {
    List<Token> recognizedTokens = new ArrayList<Token>();
    for (Token token : tokens) {
      if (token.getType() == TokenType.WORD) {
        if (stopwords.contains(StringUtils.lowerCase(token.getValue()))) {
          token.setType(TokenType.STOP_WORD);
        }
      }
      recognizedTokens.add(token);
    }
    return recognizedTokens;
  }
}
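
Here is a quick, hypothetical usage example (not part of the original project) showing what the recognizer does to a small token list:

// A hypothetical usage example for StopwordRecognizer (sketch only).
package com.mycompany.myapp.recognizers;

import java.util.Arrays;
import java.util.List;

import com.mycompany.myapp.tokenizers.Token;
import com.mycompany.myapp.tokenizers.TokenType;

public class StopwordRecognizerExample {
  public static void main(String[] args) throws Exception {
    StopwordRecognizer recognizer = new StopwordRecognizer();
    recognizer.init();
    List<Token> tokens = Arrays.asList(
      new Token("human", TokenType.WORD),
      new Token("of", TokenType.WORD));
    for (Token token : recognizer.recognize(tokens)) {
      // prints human/WORD and of/STOP_WORD ("of" is in the default list)
      System.out.println(token.getValue() + "/" + token.getType());
    }
  }
}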

New recognizer: ContentWordRecognizer.java

This recognizer picks out nouns, verbs, adjectives and adverbs and marks them as TokenType.CONTENT_WORD. As in the previous recognizer, only tokens of type TokenType.WORD are acted on. The part-of-speech lookup is done against the WordNet dictionary, accessed through the MIT Java WordNet Interface (JWI).

// Source: src/main/java/com/mycompany/myapp/recognizers/ContentWordRecognizer.java
package com.mycompany.myapp.recognizers;

import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import com.mycompany.myapp.tokenizers.Token;
import com.mycompany.myapp.tokenizers.TokenType;

import edu.mit.jwi.Dictionary;
import edu.mit.jwi.IDictionary;
import edu.mit.jwi.item.IIndexWord;
import edu.mit.jwi.item.POS;

/**
 * Recognizes content words (noun, verb, adjective, and adverb) from a
 * List of Token objects. Only TokenType.WORD tokens are considered in
 * this recognizer, and are converted to TokenType.CONTENT_WORD. Words
 * are looked up against the WordNet dictionary.
 */
public class ContentWordRecognizer implements IRecognizer {

  private IDictionary dictionary;
  private List<POS> allowablePartsOfSpeech = Arrays.asList(new POS[] {
    POS.NOUN, POS.VERB, POS.ADJECTIVE, POS.ADVERB});
  
  public void init() throws Exception {
    this.dictionary = new Dictionary(
      new URL("file", null, "/opt/wordnet-3.0/dict"));
    dictionary.open();
  }

  public List<Token> recognize(List<Token> tokens) {
    List<Token> outputTokens = new ArrayList<Token>();
    for (Token token : tokens) {
      Token outputToken = new Token(token.getValue(), token.getType());
      if (token.getType() == TokenType.WORD) {
        String word = token.getValue();
        for (POS allowablePartOfSpeech : allowablePartsOfSpeech) {
          IIndexWord indexWord = 
            dictionary.getIndexWord(word, allowablePartOfSpeech);
          if (indexWord != null) {
            outputToken.setType(TokenType.CONTENT_WORD);
            break;
          }
        }
      }
      outputTokens.add(outputToken);
    }
    return outputTokens;
  }
}
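
Again, a hypothetical usage example (not part of the original project); it assumes a WordNet 3.0 installation at the path hard-coded in init() above:

// A hypothetical usage example for ContentWordRecognizer (sketch only).
package com.mycompany.myapp.recognizers;

import java.util.Arrays;
import java.util.List;

import com.mycompany.myapp.tokenizers.Token;
import com.mycompany.myapp.tokenizers.TokenType;

public class ContentWordRecognizerExample {
  public static void main(String[] args) throws Exception {
    ContentWordRecognizer recognizer = new ContentWordRecognizer();
    recognizer.init();
    List<Token> tokens = Arrays.asList(
      new Token("survey", TokenType.WORD),
      new Token("of", TokenType.WORD));
    for (Token token : recognizer.recognize(tokens)) {
      // "survey" is in WordNet, so it becomes CONTENT_WORD; "of" is a
      // closed-class word, not in WordNet, so it stays WORD
      System.out.println(token.getValue() + "/" + token.getType());
    }
  }
}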

Generating the initial vector: VectorGenerator.java

The initial vector is created from a Map of document names to Readers pointing at the titles. This may seem overly complex for our particular situation, where we could have done with a Map<String,String>, but I was going for a more general solution with Map<String,Reader>, where the Reader reads files of content. The code for the VectorGenerator is shown below. It's fairly simple: it tokenizes each title into words, passes the tokens through a chain of recognizers, and then keeps only the abbreviations, phrases and content words to build the term-document matrix shown further down.

// Source: src/main/java/com/mycompany/myapp/indexers/VectorGenerator.java
package com.mycompany.myapp.indexers;

import java.io.PrintWriter;
import java.io.Reader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

import javax.sql.DataSource;

import org.apache.commons.collections15.Bag;
import org.apache.commons.collections15.bag.HashBag;
import org.apache.commons.lang.ArrayUtils;
import org.apache.commons.lang.StringUtils;
import org.springframework.beans.factory.annotation.Required;

import Jama.Matrix;

import com.mycompany.myapp.recognizers.AbbreviationRecognizer;
import com.mycompany.myapp.recognizers.BoundaryRecognizer;
import com.mycompany.myapp.recognizers.ContentWordRecognizer;
import com.mycompany.myapp.recognizers.IRecognizer;
import com.mycompany.myapp.recognizers.PhraseRecognizer;
import com.mycompany.myapp.recognizers.RecognizerChain;
import com.mycompany.myapp.recognizers.StopwordRecognizer;
import com.mycompany.myapp.tokenizers.Token;
import com.mycompany.myapp.tokenizers.TokenType;
import com.mycompany.myapp.tokenizers.WordTokenizer;

/**
 * Generate the word occurence vector for a document collection.
 */
public class VectorGenerator {

  private DataSource dataSource;
  
  private Map<Integer,String> wordIdValueMap = 
    new HashMap<Integer,String>();
  private Map<Integer,String> documentIdNameMap = 
    new HashMap<Integer,String>();
  private Matrix matrix;

  @Required
  public void setDataSource(DataSource dataSource) {
    this.dataSource = dataSource;
  }

  public void generateVector(Map<String,Reader> documents) 
      throws Exception {
    Map<String,Bag<String>> documentWordFrequencyMap = 
      new HashMap<String,Bag<String>>();
    SortedSet<String> wordSet = new TreeSet<String>();
    Integer docId = 0;
    for (String key : documents.keySet()) {
      String text = getText(documents.get(key));
      Bag<String> wordFrequencies = getWordFrequencies(text);
      wordSet.addAll(wordFrequencies.uniqueSet());
      documentWordFrequencyMap.put(key, wordFrequencies);
      documentIdNameMap.put(docId, key);
      docId++;
    }
    // create a Map of ids to words from the wordSet
    int wordId = 0;
    for (String word : wordSet) {
      wordIdValueMap.put(wordId, word);
      wordId++;
    }
    // we need a wordSet.size() x documents.keySet().size() (word x document)
    // matrix to hold this info
    int numDocs = documents.keySet().size();
    int numWords = wordSet.size();
    double[][] data = new double[numWords][numDocs];
    for (int i = 0; i < numWords; i++) {
      for (int j = 0; j < numDocs; j++) {
        String docName = documentIdNameMap.get(j);
        Bag<String> wordFrequencies = 
          documentWordFrequencyMap.get(docName);
        String word = wordIdValueMap.get(i);
        int count = wordFrequencies.getCount(word);
        data[i][j] = count;
      }
    }
    matrix = new Matrix(data);
  }

  public Matrix getMatrix() {
    return matrix;
  }
  
  public String[] getDocumentNames() {
    String[] documentNames = new String[documentIdNameMap.keySet().size()];
    for (int i = 0; i < documentNames.length; i++) {
      documentNames[i] = documentIdNameMap.get(i);
    }
    return documentNames;
  }
  
  public String[] getWords() {
    String[] words = new String[wordIdValueMap.keySet().size()];
    for (int i = 0; i < words.length; i++) {
      String word = wordIdValueMap.get(i);
      if (word.contains("|||")) {
        // phrases are stored with length for other purposes, strip it off
        // for this report.
        word = word.substring(0, word.indexOf("|||"));
      }
      words[i] = word;
    }
    return words;
  }

  private Bag<String> getWordFrequencies(String text) 
      throws Exception {
    Bag<String> wordBag = new HashBag<String>();
    WordTokenizer wordTokenizer = new WordTokenizer();
    wordTokenizer.setText(text);
    List<Token> tokens = new ArrayList<Token>();
    Token token = null;
    while ((token = wordTokenizer.nextToken()) != null) {
      tokens.add(token);
    }
    RecognizerChain recognizerChain = new RecognizerChain(
        Arrays.asList(new IRecognizer[] {
        new BoundaryRecognizer(),
        new AbbreviationRecognizer(dataSource),
        new PhraseRecognizer(dataSource),
        new StopwordRecognizer(),
        new ContentWordRecognizer()
    }));
    recognizerChain.init();
    List<Token> recognizedTokens = recognizerChain.recognize(tokens);
    for (Token recognizedToken : recognizedTokens) {
      if (recognizedToken.getType() == TokenType.ABBREVIATION ||
          recognizedToken.getType() == TokenType.PHRASE ||
          recognizedToken.getType() == TokenType.CONTENT_WORD) {
        // lowercase words to treat Human and human as the same word
        wordBag.add(StringUtils.lowerCase(recognizedToken.getValue()));
      }
    }
    return wordBag;
  }

  private String getText(Reader reader) throws Exception {
    StringBuilder textBuilder = new StringBuilder();
    char[] cbuf = new char[1024];
    int len = 0;
    while ((len = reader.read(cbuf, 0, 1024)) != -1) {
      textBuilder.append(ArrayUtils.subarray(cbuf, 0, len));
    }
    reader.close();
    return textBuilder.toString();
  }
}

The test case (used to run each example) is a single JUnit test class (see below), with one test method that prints the raw matrix and one test method per "indexer" implementation.

// Source: src/test/java/com/mycompany/myapp/indexers/IndexersTest.java
package com.mycompany.myapp.indexers;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.commons.lang.StringUtils;
import org.junit.Before;
import org.junit.Test;
import org.springframework.jdbc.datasource.DriverManagerDataSource;

import Jama.Matrix;

public class IndexersTest {

  private VectorGenerator vectorGenerator;
  private Map<String,Reader> documents;
  
  @Before
  public void setUp() throws Exception {
    vectorGenerator = new VectorGenerator();
    vectorGenerator.setDataSource(new DriverManagerDataSource(
      "com.mysql.jdbc.Driver", "jdbc:mysql://localhost:3306/tmdb", 
      "tmdb", "irstuff"));
    documents = new LinkedHashMap<String,Reader>();
    BufferedReader reader = new BufferedReader(
      new FileReader("src/test/resources/data/indexing_sample_data.txt"));
    String line = null;
    while ((line = reader.readLine()) != null) {
      String[] docTitleParts = StringUtils.split(line, ";");
      documents.put(docTitleParts[0], new StringReader(docTitleParts[1]));
    }
  }
  
  @Test
  public void testVectorGeneration() throws Exception {
    vectorGenerator.generateVector(documents);
    prettyPrintMatrix("Raw Term Frequencies", vectorGenerator.getMatrix(), 
      vectorGenerator.getDocumentNames(), vectorGenerator.getWords(), 
      new PrintWriter(System.out, true));
  }
  
  @Test
  public void testTfIndexer() throws Exception {
    vectorGenerator.generateVector(documents);
    TfIndexer indexer = new TfIndexer();

    Matrix tfMatrix = indexer.transform(vectorGenerator.getMatrix());
    prettyPrintMatrix("Term Frequency", tfMatrix, 
      vectorGenerator.getDocumentNames(), vectorGenerator.getWords(), 
      new PrintWriter(System.out, true));
  }
  
  @Test
  public void testIdfIndexer() throws Exception {
    vectorGenerator.generateVector(documents);
    IdfIndexer indexer = new IdfIndexer();
    Matrix idfMatrix = indexer.transform(vectorGenerator.getMatrix());
    prettyPrintMatrix("Inverse Document Frequency", idfMatrix,
      vectorGenerator.getDocumentNames(), vectorGenerator.getWords(),
      new PrintWriter(System.out, true));
  }
  
  @Test
  public void testLsiIndexer() throws Exception {
    vectorGenerator.generateVector(documents);
    LsiIndexer indexer = new LsiIndexer();
    Matrix lsiMatrix = indexer.transform(vectorGenerator.getMatrix());
    prettyPrintMatrix("Latent Semantic (LSI)", lsiMatrix,
      vectorGenerator.getDocumentNames(), vectorGenerator.getWords(),
      new PrintWriter(System.out, true));
  }
  
  private void prettyPrintMatrix(String legend, Matrix matrix, 
      String[] documentNames, String[] words, PrintWriter writer) {
    writer.printf("=== %s ===%n", legend);
    writer.printf("%15s", " ");
    for (int i = 0; i < documentNames.length; i++) {
      writer.printf("%8s", documentNames[i]);
    }
    writer.println();
    for (int i = 0; i < words.length; i++) {
      writer.printf("%15s", words[i]);
      for (int j = 0; j < documentNames.length; j++) {
        writer.printf("%8.4f", matrix.get(i, j));
      }
      writer.println();
    }
    writer.flush();
  }
}
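
The sample data file read in setUp() is not reproduced here, but since each line is split on ";" into a document name and a title, a plausible layout (my assumption) for src/test/resources/data/indexing_sample_data.txt would be one name;title pair per line:

D1;Human machine interface for computer applications
D2;A survey of user opinion of computer system response time
D3;The EPS user interface management system
D4;System and human system engineering testing of EPS
D5;The generation of random, binary and ordered trees
D6;The intersection graph of paths in trees
D7;Graph minors: A survey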

The first test method outputs our initial raw matrix. The later tests regenerate it (see the @Before setUp() method and the generateVector() call in each test) and operate on it in different ways. Here is what the raw matrix looks like:

=== Raw Term Frequencies ===
                     D1      D2      D3      D4      D5      D6      D7
         binary  0.0000  0.0000  0.0000  0.0000  1.0000  0.0000  0.0000
       computer  1.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
computer system  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000  0.0000
    engineering  0.0000  0.0000  0.0000  1.0000  0.0000  0.0000  0.0000
            eps  0.0000  0.0000  1.0000  1.0000  0.0000  0.0000  0.0000
     generation  0.0000  0.0000  0.0000  0.0000  1.0000  0.0000  0.0000
          graph  0.0000  0.0000  0.0000  0.0000  0.0000  1.0000  1.0000
          human  1.0000  0.0000  0.0000  1.0000  0.0000  0.0000  0.0000
      interface  1.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
   intersection  0.0000  0.0000  0.0000  0.0000  0.0000  1.0000  0.0000
        machine  1.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
     management  0.0000  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000
         minors  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  1.0000
        opinion  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000  0.0000
        ordered  0.0000  0.0000  0.0000  0.0000  1.0000  0.0000  0.0000
         random  0.0000  0.0000  0.0000  0.0000  1.0000  0.0000  0.0000
       response  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000  0.0000
         survey  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000  1.0000
         system  0.0000  0.0000  1.0000  2.0000  0.0000  0.0000  0.0000
        testing  0.0000  0.0000  0.0000  1.0000  0.0000  0.0000  0.0000
           time  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000  0.0000
           user  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000  0.0000
 user interface  0.0000  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000

Term Frequency Indexing

The term frequency indexing method is the simplest: all it does is normalize the raw frequencies within a single document. Thus, if a document had two words, one occurring twice and the other occurring three times, the first word would be normalized to 2/5 (0.4) and the other to 3/5 (0.6). We use Jama, a Java linear algebra library, because of its ability to do SVD (but more on that later). Here is the code:

// Source: src/main/java/com/mycompany/myapp/indexers/TfIndexer.java
package com.mycompany.myapp.indexers;

import org.apache.commons.collections15.Transformer;

import Jama.Matrix;

/**
 * Normalizes the occurrence matrix per document. Divides each entry by the
 * sum of occurrence counts for that column, so that a longer document with
 * 2 occurrences of a certain word is not ranked higher (for that word) than
 * a shorter document with 1 occurrence of it. At the end of this
 * transformation, the values are the relative frequencies of the words
 * within each document.
 */
public class TfIndexer implements Transformer<Matrix,Matrix> {

  public Matrix transform(Matrix matrix) {
    for (int j = 0; j < matrix.getColumnDimension(); j++) {
      double sum = sum(matrix.getMatrix(
        0, matrix.getRowDimension() -1, j, j));
      for (int i = 0; i < matrix.getRowDimension(); i++) {
        matrix.set(i, j, (matrix.get(i, j) / sum));
      }
    }
    return matrix;
  }

  private double sum(Matrix colMatrix) {
    double sum = 0.0D;
    for (int i = 0; i < colMatrix.getRowDimension(); i++) {
      sum += colMatrix.get(i, 0);
    }
    return sum;
  }
}

The result of this computation is shown below. Notice that each column now adds up to 1, meaning that all documents are treated the same, regardless of their length (and consequently their number of content words).
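
As a quick sanity check against the table: D1 has four content terms (computer, human, interface, machine), each with a raw frequency of 1, so each becomes 1/4 = 0.25; D2 has six terms, so each of them becomes 1/6 = 0.1667.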

=== Term Frequency ===
                     D1      D2      D3      D4      D5      D6      D7
         binary  0.0000  0.0000  0.0000  0.0000  0.2500  0.0000  0.0000
       computer  0.2500  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
computer system  0.0000  0.1667  0.0000  0.0000  0.0000  0.0000  0.0000
    engineering  0.0000  0.0000  0.0000  0.1667  0.0000  0.0000  0.0000
            eps  0.0000  0.0000  0.2500  0.1667  0.0000  0.0000  0.0000
     generation  0.0000  0.0000  0.0000  0.0000  0.2500  0.0000  0.0000
          graph  0.0000  0.0000  0.0000  0.0000  0.0000  0.5000  0.3333
          human  0.2500  0.0000  0.0000  0.1667  0.0000  0.0000  0.0000
      interface  0.2500  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
   intersection  0.0000  0.0000  0.0000  0.0000  0.0000  0.5000  0.0000
        machine  0.2500  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
     management  0.0000  0.0000  0.2500  0.0000  0.0000  0.0000  0.0000
         minors  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.3333
        opinion  0.0000  0.1667  0.0000  0.0000  0.0000  0.0000  0.0000
        ordered  0.0000  0.0000  0.0000  0.0000  0.2500  0.0000  0.0000
         random  0.0000  0.0000  0.0000  0.0000  0.2500  0.0000  0.0000
       response  0.0000  0.1667  0.0000  0.0000  0.0000  0.0000  0.0000
         survey  0.0000  0.1667  0.0000  0.0000  0.0000  0.0000  0.3333
         system  0.0000  0.0000  0.2500  0.3333  0.0000  0.0000  0.0000
        testing  0.0000  0.0000  0.0000  0.1667  0.0000  0.0000  0.0000
           time  0.0000  0.1667  0.0000  0.0000  0.0000  0.0000  0.0000
           user  0.0000  0.1667  0.0000  0.0000  0.0000  0.0000  0.0000
 user interface  0.0000  0.0000  0.2500  0.0000  0.0000  0.0000  0.0000

Inverse Document Frequency Indexing

Inverse Document Frequency attempts to smooth out the weight of a word across documents. If a word occurs in more than one document, it is less "precise" as a discriminator, and hence its value should go down. The code below takes the raw matrix, applies the IDF weight to it in the form of a logarithmic smoothing operator, and then normalizes the result per document - so this is really a combination of TF and IDF. The smoothing operator is:

weight(i,j) = term_frequency(i,j) * (1 + log(N) - log(d(i)))
  where:
    weight(i,j)         = value of the IDF matrix for word i, document j
    term_frequency(i,j) = raw frequency of word i in document j
    N                   = total number of documents
    d(i)                = number of documents containing word i
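
As a worked example (using natural logs, which is what Math.log() computes): the word human occurs once in D1 and appears in 2 of the 7 documents, while computer, interface and machine each occur once in D1 and appear in only 1 document. The D1 column therefore works out to:

  weight(human, D1)    = 1 * (1 + ln(7) - ln(2)) = 2.2528
  weight(computer, D1) = 1 * (1 + ln(7) - ln(1)) = 2.9459  (same for interface and machine)
  column sum for D1    = 3 * 2.9459 + 2.2528 = 11.0905
  normalized(human, D1)    = 2.2528 / 11.0905 = 0.2031
  normalized(computer, D1) = 2.9459 / 11.0905 = 0.2656

which matches the D1 column in the output further down.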

The indexer code is shown below:

// Source: src/main/java/com/mycompany/myapp/indexers/IdfIndexer.java
package com.mycompany.myapp.indexers;

import org.apache.commons.collections15.Transformer;

import Jama.Matrix;

/**
 * Reduces the weight of words which are commonly found (ie in more
 * documents). The factor by which the raw frequency is scaled is chosen
 * from the book as:
 * f(m) = 1 + log(N/d(m))
 * where N = total number of docs in collection
 *       d(m) = number of docs containing word m
 * so where a word is more frequent (ie d(m) is high), f(m) will be low.
 */
public class IdfIndexer implements Transformer<Matrix,Matrix> {

  public Matrix transform(Matrix matrix) {
    // Phase 1: apply IDF weight to the raw word frequencies
    int n = matrix.getColumnDimension();
    for (int j = 0; j < matrix.getColumnDimension(); j++) {
      for (int i = 0; i < matrix.getRowDimension(); i++) {
        double matrixElement = matrix.get(i, j);
        if (matrixElement > 0.0D) {
          double dm = countDocsWithWord(
            matrix.getMatrix(i, i, 0, matrix.getColumnDimension() - 1));
          matrix.set(i, j, 
            matrix.get(i, j) * (1 + Math.log(n) - Math.log(dm)));
        }
      }
    }
    // Phase 2: normalize the word scores for a single document
    for (int j = 0; j < matrix.getColumnDimension(); j++) {
      double sum = sum(matrix.getMatrix(
        0, matrix.getRowDimension() - 1, j, j));
      for (int i = 0; i < matrix.getRowDimension(); i++) {
        matrix.set(i, j, (matrix.get(i, j) / sum));
      }
    }
    return matrix;
  }

  private double sum(Matrix colMatrix) {
    double sum = 0.0D;
    for (int i = 0; i < colMatrix.getRowDimension(); i++) {
      sum += colMatrix.get(i, 0);
    }
    return sum;
  }

  private double countDocsWithWord(Matrix rowMatrix) {
    double numDocs = 0.0D;
    for (int j = 0; j < rowMatrix.getColumnDimension(); j++) {
      if (rowMatrix.get(0, j) > 0.0D) {
        numDocs++;
      }
    }
    return numDocs;
  }
}

The resulting matrix after IDF weighting and normalization is shown below. Notice that the scores for words (such as human) which occur in more than one document have decreased.

=== Inverse Document Frequency ===
                     D1      D2      D3      D4      D5      D6      D7
         binary  0.0000  0.0000  0.0000  0.0000  0.2500  0.0000  0.0000
       computer  0.2656  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
computer system  0.0000  0.1735  0.0000  0.0000  0.0000  0.0000  0.0000
    engineering  0.0000  0.0000  0.0000  0.1977  0.0000  0.0000  0.0000
            eps  0.0000  0.0000  0.2167  0.1512  0.0000  0.0000  0.0000
     generation  0.0000  0.0000  0.0000  0.0000  0.2500  0.0000  0.0000
          graph  0.0000  0.0000  0.0000  0.0000  0.0000  0.4333  0.3023
          human  0.2031  0.0000  0.0000  0.1512  0.0000  0.0000  0.0000
      interface  0.2656  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
   intersection  0.0000  0.0000  0.0000  0.0000  0.0000  0.5667  0.0000
        machine  0.2656  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
     management  0.0000  0.0000  0.2833  0.0000  0.0000  0.0000  0.0000
         minors  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.3953
        opinion  0.0000  0.1735  0.0000  0.0000  0.0000  0.0000  0.0000
        ordered  0.0000  0.0000  0.0000  0.0000  0.2500  0.0000  0.0000
         random  0.0000  0.0000  0.0000  0.0000  0.2500  0.0000  0.0000
       response  0.0000  0.1735  0.0000  0.0000  0.0000  0.0000  0.0000
         survey  0.0000  0.1327  0.0000  0.0000  0.0000  0.0000  0.3023
         system  0.0000  0.0000  0.2167  0.3023  0.0000  0.0000  0.0000
        testing  0.0000  0.0000  0.0000  0.1977  0.0000  0.0000  0.0000
           time  0.0000  0.1735  0.0000  0.0000  0.0000  0.0000  0.0000
           user  0.0000  0.1735  0.0000  0.0000  0.0000  0.0000  0.0000
 user interface  0.0000  0.0000  0.2833  0.0000  0.0000  0.0000  0.0000

Latent Semantic Indexing (LSI)

Latent Semantic Indexing attempts to uncover latent relationships among documents based on word co-occurrence. So if document A contains (w1,w2) and document B contains (w2,w3), we can conclude that there is something common between documents A and B. LSI does this by decomposing the input raw term frequency matrix (A, see below) into three different matrices (U, S and V) using Singular Value Decomposition (SVD). Once that is done, the three matrices are "reduced" and the original matrix is rebuilt from the reduced matrices. Because of the reduction, noisy relationships are suppressed and the stronger latent relationships become more clearly visible. In pseudo-code:

A = U * S * V^T
  Ak = Uk * Sk * Vk^T
  where:
    A  = the original term-document matrix
    U  = the word matrix
    S  = the sigma (singular value) matrix
    V  = the document matrix
    Uk = the reduced word submatrix, consisting of columns 0..k-1 of U
    Sk = the reduced sigma submatrix, consisting of rows and columns 0..k-1 of S
    Vk = the reduced document submatrix, consisting of columns 0..k-1 of V
  Note:
    Jama's SVD gives you back V (not V^T), so you need to reduce and
    transpose it before you compute Ak.
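
In the run below, the number of documents N = 7 and k is chosen as floor(sqrt(7)) = 2 (see the LsiIndexer code), so only the two largest singular values, and the corresponding two columns of U and V, are kept when Ak is rebuilt.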

Dr E Garcia used to have a really good tutorial on LSI/SVD which is sadly no longer available. However, the IR Book has a chapter dedicated to this. Thanks to ndk for suggesting this link.

As mentioned before, Jama was chosen because it was the only free Java linear algebra library I knew of that could do SVD. The code for the LsiIndexer is shown below:

// Source: src/main/java/com/mycompany/myapp/indexers/LsiIndexer.java
package com.mycompany.myapp.indexers;

import org.apache.commons.collections15.Transformer;

import Jama.Matrix;
import Jama.SingularValueDecomposition;

/**
 * Uses Latent Semantic Indexing to find word associations between docs. 
 * Idea is to find the intersections of the words found in each document
 * and score them accordingly. We use SVD to accomplish this. We first
 * decompose the word frequency vector into the three parts, then multiply
 * the three components back to get our transformed matrix.
 */
public class LsiIndexer implements Transformer<Matrix,Matrix> {

  public Matrix transform(Matrix matrix) {
    // phase 1: Singular value decomposition
    SingularValueDecomposition svd = new SingularValueDecomposition(matrix);
    Matrix wordVector = svd.getU();
    Matrix sigma = svd.getS();
    Matrix documentVector = svd.getV();
    // compute the value of k (ie where to truncate)
    int k = (int) Math.floor(Math.sqrt(matrix.getColumnDimension()));
    Matrix reducedWordVector = wordVector.getMatrix(
      0, wordVector.getRowDimension() - 1, 0, k - 1);
    Matrix reducedSigma = sigma.getMatrix(0, k - 1, 0, k - 1);
    Matrix reducedDocumentVector = documentVector.getMatrix(
      0, documentVector.getRowDimension() - 1, 0, k - 1);
    Matrix weights = reducedWordVector.times(
      reducedSigma).times(reducedDocumentVector.transpose());
    // Phase 2: normalize the word scores for a single document
    for (int j = 0; j < weights.getColumnDimension(); j++) {
      double sum = sum(weights.getMatrix(
        0, weights.getRowDimension() - 1, j, j));
      for (int i = 0; i < weights.getRowDimension(); i++) {
        weights.set(i, j, Math.abs((weights.get(i, j)) / sum));
      }
    }
    return weights;
  }

  private double sum(Matrix colMatrix) {
    double sum = 0.0D;
    for (int i = 0; i < colMatrix.getRowDimension(); i++) {
      sum += colMatrix.get(i, 0);
    }
    return sum;
  }
}

And here is the output from the indexer. First the raw frequencies go through the singular value decomposition, reduction and recomposition process, then they are normalized for each document. Notice that there are more non-zero elements representing latent "relationship" weights.

=== Latent Semantic (LSI) ===
                     D1      D2      D3      D4      D5      D6      D7
         binary  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
       computer  0.0198  0.0000  0.0198  0.0198  0.0000  0.0000  0.0000
computer system  0.0000  0.1405  0.0000  0.0000  0.1405  0.1405  0.1405
    engineering  0.1138  0.0000  0.1138  0.1138  0.0000  0.0000  0.0000
            eps  0.1733  0.0000  0.1733  0.1733  0.0000  0.0000  0.0000
     generation  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
          graph  0.0000  0.0559  0.0000  0.0000  0.0559  0.0559  0.0559
          human  0.1336  0.0000  0.1336  0.1336  0.0000  0.0000  0.0000
      interface  0.0198  0.0000  0.0198  0.0198  0.0000  0.0000  0.0000
   intersection  0.0000  0.0105  0.0000  0.0000  0.0105  0.0105  0.0105
        machine  0.0198  0.0000  0.0198  0.0198  0.0000  0.0000  0.0000
     management  0.0595  0.0000  0.0595  0.0595  0.0000  0.0000  0.0000
         minors  0.0000  0.0454  0.0000  0.0000  0.0454  0.0454  0.0454
        opinion  0.0000  0.1405  0.0000  0.0000  0.1405  0.1405  0.1405
        ordered  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
         random  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
       response  0.0000  0.1405  0.0000  0.0000  0.1405  0.1405  0.1405
         survey  0.0000  0.1859  0.0000  0.0000  0.1859  0.1859  0.1859
         system  0.2871  0.0000  0.2871  0.2871  0.0000  0.0000  0.0000
        testing  0.1138  0.0000  0.1138  0.1138  0.0000  0.0000  0.0000
           time  0.0000  0.1405  0.0000  0.0000  0.1405  0.1405  0.1405
           user  0.0000  0.1405  0.0000  0.0000  0.1405  0.1405  0.1405
 user interface  0.0595  0.0000  0.0595  0.0595  0.0000  0.0000  0.0000
Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 19.458 sec

Conclusion

The first two indexing methods are probably familiar to a lot of people, and it is very likely that the ones in use in most shops (indirectly, through common IR libraries such as Lucene) are quite a bit more advanced than the ones shown here. LSI using SVD was a new approach to me, but it became intuitively obvious once I understood the process. Hopefully, this article was able to share some of my new-found insight with you. The Java code for each of these processes illustrates how easy it is to actually do these transforms, especially using libraries that do most of the heavy lifting.

Update 2009-04-26: In recent posts, I have been building on code written and described in previous posts, so there were (and rightly so) quite a few requests for the code. So I've created a project on Sourceforge to host the code. You will find the complete source code built so far in the project's SVN repository.

148 comments (moderated to prevent spam):

Sujit Pal said...

A poster who would like to remain anonymous pointed out (via email) the Semantic Vectors package which does semantic analysis (similar to LSI) using Random Projection but is computationally less intensive than doing SVD.

Unknown said...

hi..am working on finding the conceptual cohesion of classes using the LSI method.It takes comments lines between methods and comapres them using lSI to see if they are coherent.can you throw some light on how i could find if any two vectors created for documents are textually coherent.i read that if we apply cosine btwn the vectors it is possible.Is it possible to take cosine between two vectors.? what would the result be?

Sujit Pal said...

Sounds like a neat idea, Shiva. And yes, you can compute the cosine similarity between two term vectors - see my post on Similarity Measures, there should be an example there. The result of the cosine similarity is the cosine of the angle between the two document term-vectors in n-dimensional space, where n is the number of terms which occur in the two documents (union). You may also want to take a look at Dr Garcia's post (referenced in the article above).
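
The gist of it, as a minimal sketch against the Jama term-document Matrix used in this post (the project's actual CosineSimilarity class may look different):

// Sketch only: cosine of the angle between two document columns of a
// Jama term-document Matrix.
import Jama.Matrix;

public class CosineSimilaritySketch {

  public static double cosineSimilarity(Matrix termDoc, int doc1, int doc2) {
    int rows = termDoc.getRowDimension();
    Matrix a = termDoc.getMatrix(0, rows - 1, doc1, doc1);
    Matrix b = termDoc.getMatrix(0, rows - 1, doc2, doc2);
    double dotProduct = a.transpose().times(b).get(0, 0);
    // normF() of a column vector is its Euclidean length
    return dotProduct / (a.normF() * b.normF());
  }
}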

Unknown said...

hi sujit..
Is it possbile for me to apply tf/idf and lsi..the former followed by the latter.That is i apply the lsi to the matrix genertated by the tf/idf measure instead of the raw matrix..what would this produce..

Sujit Pal said...

Hi Shiva, I'm no expert, but IDF and LSI are both normalization methods to reduce noise, so you could either do LSI first followed by TF (as in my example) or go the other way as you suggest. I suspect that your results may end up too normalized to be usable, but YMMV.

Unknown said...

Thanks a lot sujit..:)

Unknown said...

hi sujit..am quite new to java and am awed by your use of packages and libraries.I am actually working on the same text mining prof in my UG.Am currently using the netbeans IDE.i do not know how to run your prog in mine..can you pl help me with that?!

Also this piece of code is from the indexers test class..I dont quite understand it..why do we use a database.

vectorGenerator = new VectorGenerator();
vectorGenerator.setDataSource(new DriverManagerDataSource(
"com.mysql.jdbc.Driver",
.
.
.
.

/data/indexing_sample_data.txt"));


I actually have seven text files containing the titles how should i give it as input to the program? hope my doubts are not too lame..

Unknown said...

Hi sujit..did you implement the IR concepts in the post in an IDE.if so can you please mail me your project..my id is s.sivasubramanian.87@gmail.com
expecting a reply..
thanks..

Sujit Pal said...

Hi Shiva, sorry about the delay in replying - I was swamped for most of last week. Unfortunately, I cannot share the project, as it is in various pieces and contains some information that is not in the public domain. But if you are looking to find which jar files to include, take a look at jarfinder, I recently built a Webservice client and it was very helpful.

To answer your question about why there is a database connection for the VectorGenerator, some recognizers used for tokenizing depend on a database (they suck up the data at startup so everything is in-memory during processing).

To answer your other question about how to pass the data files, the VectorGenerator takes a Map of document names and Reader objects (take a look at the IndexersTest#testVectorGeneration() for an example, I've done this with Strings, but your requirement is to use the file itself.
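
Something along these lines should work for a directory of title files (a sketch only; the directory path is made up, and the usual java.io/java.util imports are assumed):

// Sketch: build the Map<String,Reader> input from a directory of files.
Map<String,Reader> documents = new LinkedHashMap<String,Reader>();
for (File file : new File("/path/to/title/files").listFiles()) {
  documents.put(file.getName(), new FileReader(file));
}
vectorGenerator.generateVector(documents);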

I use Eclipse myself, but I use Netbeans for other languages, and I like it too, although not so much for Java programming.

Anonymous said...

how to use the TF-IDF for document clustering using Ant Colony Optimization in java?
Thx

Unknown said...

hi sujit..thanks for your comments..can you provide me with the word break rule.txt file used in the word tokenizer class..what exactly does this class do?

Sujit Pal said...

Its in here. It breaks up a text into word tokens.

Sujit Pal said...

@David: I just checked out this page, thanks for the pointer to this, I did not know of this class of algorithms. Since I don't know much about Ant colony optimization algorithms in general, my reply is probably not going to be hugely helpful, but it looks like a variation of standard shortest path algorithms. The output of TF-IDF is a normalized term vector, which can be represented as a point in n-dimensional space, where n is the number of terms in your vector. It would now be possible to calculate the distance between two documents in this space (basically cosine similarity). However, since you are using a swarm technique, I am guessing you will randomly compute the distance between pairs in parallel until you find some sort of convergence, then re-iterate, something like genetic algorithms, and ultimately you will come up with a set of cluster centroids, and then attach other documents to these centroids. I have some information here that may be helpful.

Unknown said...

thanks man!

Sujit Pal said...

You are welcome, Shiva, and best of luck with your project.

Unknown said...

hi sujit..i have a few doubts..AGAIN..

one..if i try to find the cosine similarity of a matrix obtained using lsi all the fields of the resulting matrix becomes 1.unable to figure that out

two.how should i change your code so that i give the output of the idf-tf phase as input to the lsi instead of raw term frequency matrix

thanks for all your support..really appreciate it.

regards

Sujit Pal said...

Hi Shiva, to answer your questions...
1) I suspect that because LSI reduces your document to its principal components (depending on the value of k for your S matrix, you are now looking only at the top k dimensions), the resulting documents are similar along their top k dimensions (cos 0 = 1).
2) Each of these components are designed as a Transformer that takes a Matrix and returns another Matrix. So the raw matrix is returned by VectorGenerator. If you wrap the output of VectorGenerator with the TfIdf transformer, then that will return you the matrix that has been converted using TF/IDF.
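
In code, the chaining looks something like this (a fragment, assuming the setup from IndexersTest above):

// Sketch: feed the TF/IDF weighted matrix into LSI instead of the raw one.
vectorGenerator.generateVector(documents);
Matrix rawMatrix = vectorGenerator.getMatrix();
Matrix tfidfMatrix = new IdfIndexer().transform(rawMatrix);
Matrix lsiMatrix = new LsiIndexer().transform(tfidfMatrix);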

DavidLiauw said...

I read the following article:
http://www.ims.sakarya.edu.tr/ims2006papers/IMS2006-107.doc

but I encountered problems at the time document clustering. Can you help me?

Unknown said...

how to get the package src/main/java/com/mycompany/myapp/recognizers/StopwordRecognizer

Sujit Pal said...

@kalai: there are some recognizers described here. The stop word recognizer simply checks to see if word tokens are in a given list, and if so, skips adding it to the list of recognized tokens.

@DavidLiauw: thanks for the link to the article, I am still not that familiar with ant-colony optimization, but I will look at the article and post here if I can make some sense out of it.

betty said...

hi sujit..
Recently I learned from you.Your articles give me a lot of help.Now
I have a question about the class IdfIndexer.In this part,normalize the word scores for a single document, why do you do?Another,you difinite "double sum" ,what it means?

Thanks
Betty

Sujit Pal said...

Hi Betty, thanks for your comment and I am glad that the post was helpful. IDF indexing is done to reduce the influence of words that occur too many times in the corpus - these words have low discriminatory value when it comes to searching through the documents. The "double sum" just assigns storage for the variable sum, not sure I understand the question?

Sujit Pal said...

@David, I posted an answer to your question above in the comments section of this page, as I think its more relevant there.

DavidLiauw said...

On your blog, you form a TF-idf matrix of all documents. how to form a matrix-based TF-idf query from some of the documents in the index.

for example, i need matrix fro query "computer".

here my code:

public Matrix getTFIDFMatrix(File indexDir) throws IOException, ParseException {
Directory fsDir = FSDirectory.getDirectory(indexDir, false);
IndexReader reader = IndexReader.open(fsDir);

int numberOfTerms = 0;
int numberOfDocuments = reader.numDocs();
TermEnum allTerms = reader.terms();

for (; allTerms.next();) {
allTerms.term();

numberOfTerms++;
}
System.out.println("Total number of terms in index is " + numberOfTerms);
System.out.println("Total number of documents in index is " + numberOfDocuments);

double[][] TFIDFMatrix = new double[numberOfTerms][numberOfDocuments];

for (int i = 0; i < numberOfTerms; i++) {
for (int j = 0; j < numberOfDocuments; j++) {
TFIDFMatrix[i][j] = 0.0;

}
}

allTerms = reader.terms();
for (int i = 0; allTerms.next(); i++) {

Term term = allTerms.term();
TermDocs td = reader.termDocs(term);
for (; td.next();) {
TFIDFMatrix[i][td.doc()] = td.freq();
}

}

allTerms = reader.terms();
for (int i = 0; allTerms.next(); i++) {
for (int j = 0; j < numberOfDocuments; j++) {

double tf = TFIDFMatrix[i][j];
double docFreq = (double) allTerms.docFreq();
double idf = (Math.log((double) numberOfDocuments / docFreq)) / 2.30258509299405;
// System.out.println( "Term: " + i + " Document " + j + " TF/DocFreq/IDF: " + tf + " " + docFreq + " " + idf);
TFIDFMatrix[i][j] = tf * idf;

}
}

reader.close();

return new Matrix(TFIDFMatrix);
}

Sujit Pal said...

Hi David, since you are using lucene indexes, perhaps you could first find the docIds that come back from a query "body:computer" or similar, then run through your code filtering on docId?

Paul T said...

In the IdfIndexer class the method CountDocsWithWord is called to produce a Double assigned to the variable dm. This method is called for every element in the matrix, but the dm variable is only used when the matrix element is non-zero. For text processing a significant amount of elements have a value of zero (obviously), yet this computationally intensive method runs for every single one. By moving the call to CountDocsWithWord inside the brackets after the

if (matrixElement > 0.0D)

check the code runs more efficiently by several orders of magnitude. While testing I reduced the processing of a large set of documents from 45 hours to 13 minutes by moving this method call. I'd suggest changing the code both on the blog and in the SourceForge project. As it is the code is almost unusable for any non-trivial sized documents.

Sujit Pal said...

Thanks Paul, I've updated your fix in both the code on the page and in the svn repository.

DRMag said...

Hi Sujit Pal,

I am a PhD student, n my area is recommender system. Currently I m working on extracting keywords (after stop word removal, stemming, and using only those words having more than some predefined TF-IDF threshold) from documents. The dimensions of my matrix can be very large.

I first looked at weka, it was not very good for this purpose, then at Lucene, it is also not providing good help in TF/IDF. I was going to build my own fraemwork, but now I found yours which is looking quite good.

I just wanted to ask, what is its performance in terms of speed/scalibility?

I am trying to reuse your code, which may arise some more questions to ask...;)

Thanks,

Rgds,

Sujit Pal said...

Hi Musi, thanks for your appreciation, but I would hardly call what I have here a framework - this was simply a way to teach myself a bit of IR from Manu Konchady's book. To answer your question on scalability, I haven't tried it on large documents - you probably want to use the SparseMatrix coming soon in commons-math 2.0 in order to scale memory-wise. I am no expert, but if you have questions, let me know, and I will try to answer them to the best of my knowledge.

From the description of your work, you are probably attempting to extract attributes for sentiment analysis. If so, you are probably looking at comments and/or reviews, which are typically not that large, so perhaps the stuff I have here will probably suit your needs.

Anonymous said...

hi sujit, thanks for your nice blog. it was really helpful to me.

i was testing your SVD code on some sample matrix. unfortunately it did not work. i was using an idf matrix and trying to perform svd on it.

heres the idf matrix:

Idf Matrix:

0.0 0.35774232409318646
0.3860085623910699 0.0
0.0 0.5366134861397797
0.3860085623910699 0.0
0.22798287521786023 0.10564418976703387

heres the svd matrix i got:

Svd Matrix:
0.2780378031824575 0.2780378031824575
0.08600220175459367 0.08600220175459365
0.4170567047736861 0.4170567047736861
0.08600220175459371 0.08600220175459371
0.13290108853466892 0.13290108853466895

you can see both of the cols are similar. which should not be the case. right?

H said...

hey sujit,
I very much appreciate your work, though I'm not really sure if (and why and how ;) ) the LSI part works correctly. how comes that the matrix you present afterwards still shows a term x document matrix containing the entire vocabulary, shouldn't it be reduced to k x documents matrix?

I experimented for a link recovery project with these values and it seemed to me that taking the squareroot, it results in inadequate values. how are your experiences?

cheers for your work and congrats,
henning

Sujit Pal said...

Thanks Henning and Anonymous, there seems to be a bug in the LsiIndexer. As H pointed out, the reduced matrix should have smaller dimensions after reconstruction - I will have to look at the code and get back to you on this. I will also run the example you gave me, Anonymous, and see if I can replicate this.

H said...

well, i have to correct my previous comment. LSI does not reduce the real dimension of the matrix, it stays the same size according to deerwester in his paper.
but in my calculations the resulting matrix seems to be way more "shuffled". anyway, thanks again for the work and examples,
henning

Sujit Pal said...

Thanks H, and yes, LSI is basically decomposing your (txd) matrix to a product of (txd) matrix U, (dxd) diagonal matrix S, and a (dxd) matrix V', then recomposing the original matrix with the first k values, ie (txk) matrix u, (kxk) matrix s and (kxd) matrix v', and the end result is the reduced matrix of size (t,d), which is the same as before. I wasn't thinking too clearly when I responded, sorry about that.

In any case, I think I know what the problem is now. My original code was written using the Jama library, for which the results are shown in the post. Now Jama offers only a dense matrix, and since the matrices one normally works with in a text processing are large and sparse, I went looking for a library that offered sparse matrices. The closest I could find was commons-math, which had a RealMatrix interface, although the only implementation was a dense RealMatrixImpl. I build a SparseRealMatrixImpl and contributed it to commons-math, which they accepted, so I took the snapshot and removed all the Jama references in favor of equivalent commons-math ones. I soon realized that certain parts of the existing commons-math package, such as SingularValueDecomposition, was built to work with dense matrices (RealMatrixImpl) only, although they exposed a RealMatrix reference. The devs were actively working on this, though, and I figured that I would just wait till they were done, and then I forgot all about it until you and Anonymous pointed it out recently.

I have replaced the commons-math-2.0-SNAPSHOT.jar with commons-math-2.0.jar in the pom.xml and the lib directory, so hopefully, and verified that my LsiIndexer call in IndexersTest return the exact same data that is posted on the blog. Hopefully this will work for you too.

Anonymous, can you please try out your example with the new libs and see if this fixes the problem too. One thing I would like to point out though - usually, LSI works for (txd) matrices where t >> d - in your case (and possibly to a lesser extent in my example), that does not seem to be true.

To answer your square root question, H, I haven't used it in LSI, since I don't have the kind of storage to work with large dense matrices. I have used it as an estimate of the number of clusters I should split a document set into and that worked fairly well.

Anonymous said...

hi sujit, im just curious the way you are normalising by dividing with a method "sum", is it a part of the procedure for tf, idf or lsi? if one does not go through this dividing procedure, is his way correct as well?

Sujit Pal said...

Hi, the reason for the TF normalization is that you may have a short document with say 10 occurrences of a term T, and a longer document with 50 - whether this means that the longer doc is a "better" result for the term T kind of depends. The premise behind TF normalization is that term density is a better indicator than term count - you have to decide whether that is true for you.

I think it makes sense to normalize if you want to do cosine similarity or clustering type of calculations on it, because you want the coordinate values of each document to be comparable.

sita said...

how to find the package for cosine similarity
// Source: src/main/java/src/com/mycompany/myapp/similarity/CosineSimilarity.java
package com.healthline.jrocker.similarity;
please help me to execute this code

Sujit Pal said...

Hi Sita, you can find it in the similarity package of the jtmt project.

Richard said...

Hi Sujit,
Thanks for providing a very informatic website. I have a question about LSI. How did you decide that which K value you will use to reduce dimentions? Also wikipedia said lsi can help in document similarity that has same meaning for example car and automobile. How is it possible? Because it is just matrix game nothing else. If i am wrong please correct me

Sujit Pal said...

Thanks Richard. For LSI, I chose the k value somewhat arbitarily as sqrt(N) - it seemed to be as good a guess as any (and kind of based on a heuristic to decide the number of clusters in K-Means, although there is really no logical connection between the two).

You are right about it being a matrix "game" :-). How it works is that you cut off noisy dimensions. So given a document about a car and another about an automobile, their term vectors will have a higher similarity after LSI because they are likely to talk about the same things, and have the same terms as their principal dimensions. Before LSI, there may be terms in the (N-k) group which will cause the term vectors to move apart from each other.

23011986 said...

hello, Mr Sujit.
I need to implement method TF-IDF on a set of documents, I have to use your code of "", but I could not carries out it. there is package which misses.
if you can send to me the code of TF-IDF and the package have to use and a small description of code so that I can continue.
thank you infinitely.
my e mail:
meryeme.hadni86@hotmail.com

Sujit Pal said...

Hi aaa, you can find the complete code at the SVN repository of the jtmt project at Sourceforge. In the lib directory you will also find a copy of the latest JARs that are needed. On the main page of the project you will find a link to some of the blog posts on this blog that talk about specific parts of the jtmt codebase. I don't have any more documentation than this, but I think this should give you what you are looking for.

Richard said...

Hi Sujit,
Thanks for your reply. I am using JAMA for LSI. My matrix size is more than 600 columns and it took 3 hours to process it. I didn't include anything. I just passed 600 plus column matrix to svd. That's it. It was just a test. Can you tell me what is the problem with JAMA??? I used this example http://www.cs.princeton.edu/introcs/95linear/SVD.java.html and passed it my own matrix.
2nd question:
Mostly my columns are more than rows. JAMA don't work on this :(. I have document (column) x terms(rows). Right now i just transpose matrix to terms by document. But I don't think so it is right as if I want to work on matrix it should be Document by matrix ( for similarity ).
3rd Question:
I used mostly your code to perform similarity and clustering. Now i have 10 cluster. Each cluster has like 15 documents. If i see manually they are in similar cluster but still they are different :(. Why is this?? if i see cosine table then i can see they are similar by value. But not all cluster documents are similar with each other. What am i doing wrong??
Thanks in advance!

Sujit Pal said...

Hi Richard, I don't know the answers to your questions, but can hazard some guesses...
1) I mainly used JAMA and commons-math as an end-user, so I am not really up to the math going on inside these components. However, I believe that JAMA provides an accurate implementation. There are some issues with the commons-math version which is being worked on currently. The problem with JAMA is that the matrix is dense, so its likely that the 3 hours it took, it was swapping much of the time.
2) Yes, I think you need to have a t-d matrix (not a d-t one) for LSI. Not sure why JAMA doesn't allow this...its been a long time since I looked at the LSI stuff, perhaps this is required by the SVD algo?
3) Yes, I think the clustering stuff is not very accurate. None of them are real industrial strength clustering algos, I wrote them based on some core algos, in order to understand them. I don't know a good clustering libraries well enough, but I am trying to learn Weka and LingPipe, some day I will write about them, they will probably give better results than the stuff I wrote.

Harika said...

Sir, where can I find the contents of the 7 documents that are being used as input to LSI?

Sujit Pal said...

Hi Harika, I did not use full documents here. These are just the document titles (look towards the top of the post for the list). They are from Dr Konchady's TMAP book.

sandeepyerramsetti said...

Hi Sujit. Your blog and the LSI material were very helpful for us, as we are doing a project on LSI. You gave seven document titles, but we are unable to find the content of those seven documents on which you applied the LSI code. Could you please send me the contents of those seven documents? They are very necessary for the further progress of our project.

Praveen said...

Awesome article. Thanks a lot Sujit. I'll be back with some doubts that I have.

Sujit Pal said...

@sandeep: thanks. Please see my answer to Harika's comment (one comment up from yours). There are no documents here, only titles.
@praveen: thanks, I will try to answer them to the best of my ability.

blogwalker said...

I want to ask about the data source used in the VectorGenerator. I checked the sql folder in the repository and found some SQL scripts, but I am still confused about which scripts are needed for this. I assume it should be the scripts beginning with "tmdb**", but I still have no idea what kind of data needs to be inserted into the database. I checked tmdb_load.sql; it needs many DAT files, but I can't find those files in the repository.
Thanks in advance.

blogwalker said...

I tried to calculate the IDF manually, but it gives a different result compared to the calculation done by your program. For example, the IDF for "human" in D1:
IDF = term_freq * (1 + log(N) - log(d)) = 0.25 * (1 + log7 - log2) = 0.25 * 1.544 = 0.386
Based on the calculation done by the computer, it is 0.2031 (I took it from the printed table).
Did I make a mistake in my calculation, or did you implement something else? In the transform function (IdfIndexer class) you have two phases for the calculation, and I don't really understand the purpose of the 2nd phase (normalization).
Thanks in advance.

Sujit Pal said...

Hi Blogwalker,

To answer the first question (two comments up), the dat files are from Dr Manu Konchady's TextMine (textmine.sf.net) project.

To answer your second question, it's possible, though unlikely, that the results differ - the numbers in the post are from the initial version of the code, and the changes I have made since were to switch out Jama and replace it with commons-math. So there is scope for a difference, since there has been /some/ change, but the two should function identically.
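
For reference, here is a rough sketch of the two-phase idea being discussed: phase 1 weights each raw frequency by (1 + log(N) - log(d_i)) as in the comment above, and phase 2 is assumed here to divide each document column by its column sum. The class name is illustrative; the authoritative code is the IdfIndexer in the jtmt repository, which may differ in details.

import org.apache.commons.math.linear.RealMatrix;

// Illustrative sketch only, not the actual jtmt IdfIndexer.
public class IdfSketch {

  // matrix is a term (rows) by document (columns) frequency matrix
  public static void transform(RealMatrix matrix) {
    int numTerms = matrix.getRowDimension();
    int numDocs = matrix.getColumnDimension();
    // phase 1: weight each term row by (1 + log(N) - log(d_i)),
    // where d_i is the number of documents containing term i
    for (int i = 0; i < numTerms; i++) {
      int docFreq = 0;
      for (int j = 0; j < numDocs; j++) {
        if (matrix.getEntry(i, j) > 0.0D) {
          docFreq++;
        }
      }
      if (docFreq == 0) {
        continue;
      }
      double weight = 1.0D + Math.log(numDocs) - Math.log(docFreq);
      for (int j = 0; j < numDocs; j++) {
        matrix.setEntry(i, j, matrix.getEntry(i, j) * weight);
      }
    }
    // phase 2 (assumed): scale each document column by its sum, which is
    // why the raw manual value above comes out smaller in the printed table
    for (int j = 0; j < numDocs; j++) {
      double colSum = 0.0D;
      for (int i = 0; i < numTerms; i++) {
        colSum += matrix.getEntry(i, j);
      }
      if (colSum > 0.0D) {
        for (int i = 0; i < numTerms; i++) {
          matrix.setEntry(i, j, matrix.getEntry(i, j) / colSum);
        }
      }
    }
  }
}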

Unknown said...

Hi, I am a final-year degree student. I am working on a project for a document allocation and classification system using TF-IDF and Naive Bayes. Is there any open source Java code that I can refer to? I am still a newbie in this area. My email is reitto@gmail.com. Thank you.

Sujit Pal said...

Hi Jason, you may want to check out weka. There are also others, but weka and lingpipe (not open source) are on my list to learn more about.

Unknown said...

Sir,
your code is nice and was a great help..
But why did you complicate life by using the Jama Matrix and Readers and all....
Your code is difficult to modify because of that...
anyway thanks
don't try to show off at times please

Sujit Pal said...

Hi Deepika, you are welcome and thanks for the comment. Your last sentence cracked me up, btw - took me back to my college days :-).

In my defense, most of the stuff I write is for myself - so I can refer back 6 months or a year later...the fact that it helps someone is kind of an incidental side effect for me. So while I am happy that the post helped you solve your problem, I don't know enough to anticipate your problem and how you choose to solve it.

I am curious, though, about your comment that the use of larger building blocks such as Readers and the Jama library complicates the code - I would have thought the opposite. At least with respect to the Jama library, it gives you a Matrix class and all the methods in it; the alternative is to write the code yourself inline. Using libraries lets you cut down the number of lines in your code base, replacing them with tested code from a third-party library. The downside is that you need to take the time to get familiar with the library's API - but that pays for itself in reduced coding, testing and maintenance (bug fixing) time, and it also increases the common vocabulary within your team, i.e. they can simply say "dot product" instead of describing how it's done.

What do you think?

Unknown said...

Looking at your code...
I found that you will divide by 0.0D in your TF, IDF and LSI indexer code when the sum is zero. The result is NaN. You should add a check like if (sum > 0.0D) ...

Sujit Pal said...

Thanks, updated this in the SVN repository.
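
For illustration, a minimal sketch of the guard being suggested, assuming a commons-math RealMatrix whose columns are normalized by their sums (names are illustrative, not the actual jtmt code):

import org.apache.commons.math.linear.RealMatrix;

// Minimal sketch: skip the division when a column sums to zero, so empty
// documents do not turn into NaN entries.
public class NaNGuardExample {
  public static void normalizeColumn(RealMatrix matrix, int col) {
    double sum = 0.0D;
    for (int row = 0; row < matrix.getRowDimension(); row++) {
      sum += matrix.getEntry(row, col);
    }
    if (sum > 0.0D) {  // without this check, 0.0 / 0.0 yields NaN
      for (int row = 0; row < matrix.getRowDimension(); row++) {
        matrix.setEntry(row, col, matrix.getEntry(row, col) / sum);
      }
    }
  }
}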

Unknown said...

Sir,
I think you misinterpreted my last comment..
It was meant to be on a lighter note..
Anyway, I felt inline was the better option in the matrix situation..
On top of that, I wanted to ask about ways I can improve the efficiency of the code.
In terms of time complexity it is inefficient..
The use of the database for phrases is also consuming time, in my opinion..
Anyway, great logic.. good code... really helpful...
Can you suggest some techniques to improve time efficiency, please...
Is the rule break iterator time efficient??
Anyway, I have gone through your other work... you really rock :)...

Unknown said...

Anyway, coming to the Jama Matrix..
I am still not convinced it's useful here..
Maybe everything you said is right..
Maybe in the long run it will turn out to be useful, but I felt inline was a better option here :)

Sujit Pal said...

Hi Deepika, thanks and no problems, I was not offended or anything, just thought it would be interesting to get your POV.

Regarding time complexity, can you point out the areas of the code where this is the case? Sometimes it is possible to rewrite the code to make it more efficient, but frequently it just involves replacing a piece of code with a heuristic, which would be dependent on your data set.

BTW, most of the credit for this stuff goes to Dr Konchady, the author of the TMAP book. I basically took his algorithms and attempted to reimplement them in Java in an attempt to teach myself data mining.

Amu said...

Sir,
I couldn't find the package net.sf.jtmt.indexers anywhere.. Will you please help me run this program?

Sujit Pal said...

Hi Amu, you can find the code in my jtmt.sf.net project repository. You can build it using your IDE (currently, unless you write an ant build.xml file yourself).

Srijith said...

Hi Sujit,
Thank you...
While processing a large (not too large, actually) set of documents, RealMatrix throws a MaxIterationsExceededException. It has not been resolved yet by the Apache Math team..
How can we overcome this issue? Is there any other Java math package that yields better results?

Thank you

Sujit Pal said...

You are welcome, Srijith. To answer your question, did you check out replacing the RealMatrix with a SparseRealMatrix in your code? Given that most of the matrices are going to be pretty sparse, this saves a good bit of memory. Not sure about the MaxIterationsExceededException though, haven't ever seen this (although I probably worked with smaller datasets than you are working with). Not sure if there are other math packages without this limitation, I have used Jama and then commons-math.

Srijith said...

Hi Sujit,
Thank you. The exception in RealMatrix is actually a bug that has already been reported to Apache, and I was informed that it will be resolved in a coming version of commons-math. The bug occurs when you pass a sparse matrix to SVD.

So I searched for other Java matrix packages and found a lot, like Colt, JAMATO, UJMP and Mahout..

Now I have fixed the issue by replacing all the matrix operations with the Mahout matrix.. It is working better than before..

Thank you for your support

Amu said...

Hello Sir,
It's me again, Amu. I am doing my project in semantic-based information retrieval, using an ontology as well as WordNet concepts. The code on your blog will help me a lot if I can get it to run, but since I am new to Java I couldn't run even a single program from your blog, and I don't know why. Will you please help me with my project?
Thanks in advance...

Sujit Pal said...

Thanks for the pointer Srijith - I've been meaning to look at Mahout myself... I did read sometime back that SVD with Sparse Matrices was not supported at one point, but then I read it was...anyway, good to know your problem is solved.

Sujit Pal said...

Hi Amu, all the programs have associated JUnit test cases (rather than run via main() methods). If you have an IDE, then you can probably run it directly as JUnit tests, or use mvn test -Dtest=classname (if I remember correctly). At some point I plan on changing over to ant but haven't had the time so far...

sai sashankh said...

Hi Sujith,

Nice work. It is of great help to beginners like us. In my research I am trying to use the TfIndexer and IdfIndexer classes. TfIndexer seems to work fine; however, IdfIndexer gives an error near matrix.setEntry.
I did include the required jar file commons-math-1.2.jar.. Is there anything that I am missing?

Thanks a lot!
Sashankh.

Sujit Pal said...

Thanks Sai. Not sure why the error would happen - can you post the error, perhaps that may help narrow down the cause?

sai sashankh said...

Hi Sujith,

the errors I am getting are:

- the method matrix.getEntry(int, int, double) is not defined for the type RealMatrix
- the method setSimilarity(AbstractSimilarity) in the type Searcher is not applicable for the argument (CosineSimilarity)

thank you for the help.

Anonymous said...

Respected Sir,
I would like to hear your suggestion on this. I need to fetch documents from a directory, tokenize them, find the term frequencies, and put them into a Vector of Hashtables. This has to be done for a total of 5 directories. Fetching documents from a single directory and finding the term frequencies works fine, but how do I extend the same to 5 domains? I'm totally confused.

Sujit Pal said...

@Sai: The first error is because you are using RealMatrix.setEntry() signature for getEntry(). The second error is stranger, maybe you've changed CosineSimilarity so it no longer extends AbstractSimilarity?

@Anonymous: I am pretty sure that I am misunderstanding your question, but I don't really see a problem, implementation wise. Your code already takes a directory object and produces a matrix right? So you can change your code to take a collection of directories and loop over them.

Anonymous said...

I'd like to point out to you that Stack Overflow is NOT a place to advertise your site. It is a place where programmers come to get help. If you are ever stuck with a challenging problem, feel free to come back and ask a question. And if you want to help other people like yourself, then you might answer their questions. But please do not join the community only to promote your own site.

Sujit Pal said...

I didn't...honest. And I wouldn't...I am aware of the rules and agree with them (not Stack Overflow's specific rules, but netiquette in general). Your comment doesn't reference a specific URL, so I went looking at Stack Overflow via google (sujitpal site:stackoverflow.com), but all I found were people referencing specific posts in response to questions...and they all seemed relevant. Can you please point me to the page where you find my site being advertised? If the reference is not relevant, I will try to contact SO to have it removed.

Unknown said...

Hi Sujit, great implementation of LSI. I am making one in PHP. Do you have anything on clustering with the GHSOM algorithm?

Sujit Pal said...

Thanks goliathlup...and thanks for the link to GHSOM - and no, I am not familiar with it, but looks interesting.

Anonymous said...

Dear Sujit,
This is an excellent piece of code and very helpful. But I have a little doubt. It seems to me (please correct me if I am wrong) that using the simple matrix operations provided in Jama, you have to calculate all the singular values and eigenvectors and then truncate them later. For a larger matrix, it might take too long to calculate all the singular values, which we are not using anyway. Is that right? If yes, is there any other way to do it? Also, is it common to use sqrt as the reduced rank? Thank you. Again, I appreciate your post very much.

Regards,
Uttam.

Sujit Pal said...

Thanks Uttam. I've used Jama's SVD method to calculate the singular values. So there is only 1 operation per matrix, although that is quite memory intensive. There is also the SVD method in commons-math which allows you to operate on sparse matrices, but I was told recently (by a commenter on this blog) that there may be a bug in it with the current version, and that the SVD method in Apache-Mahout was the way to go. I got the square root idea from the TMAP book, so I guess it must be fairly common, but not sure if it is considered best practice or not.

vignesh said...

Sir, I'm unable to access the packages from the link below..
Where can I download the packages instead?

http://jtmt.svn.sourceforge.net/viewvc/jtmt.tar.gz

ananta said...

Dear Sujit, I really like your post. I am using your blog to write code for LSI. However, after using your code I get this error: java.lang.OutOfMemoryError: Java heap space.

Could you help me figure out how to fix it? The error is on the line RealMatrix wordVector = svd.getU(); of LsiIndexer.java

Sujit Pal said...

Hi Vignesh, I just tried the link out myself and it should work, and that's the only place to get it from. Alternatively, you can use an SVN client to pull the source (check the Sourceforge online documentation on how to do this) and build it yourself.

Sujit Pal said...

Hi Ananta, the OutOfMemoryError indicates that your matrix is probably too large to fit in memory. Check whether you can allocate more memory to your JVM (-Xmx2048M is about the max on a 32-bit machine). If not (or in parallel), I would suggest checking whether commons-math can handle a sparse matrix in its SVD code - if it can, use the sparse implementation of RealMatrix.
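
A small sketch of that sparse-matrix suggestion, assuming commons-math 2.x where OpenMapRealMatrix implements SparseRealMatrix (whether the SVD step accepts a sparse matrix depends on the commons-math version, as discussed above):

import org.apache.commons.math.linear.OpenMapRealMatrix;
import org.apache.commons.math.linear.RealMatrix;

// An OpenMapRealMatrix only stores non-zero cells, so a large but mostly
// empty term-document matrix needs far less heap than a dense matrix.
public class SparseMatrixExample {
  public static RealMatrix buildSparseTermDocMatrix(int numTerms, int numDocs) {
    OpenMapRealMatrix matrix = new OpenMapRealMatrix(numTerms, numDocs);
    // populate only the cells that are actually non-zero, e.g.:
    matrix.setEntry(0, 0, 1.0D);
    return matrix;
  }
}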

Anonymous said...

Hello Mr. Pal!

I am using your LSI program and I am facing a problem with it.
For the following input to your program,
0.00 1.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 1.00
0.00 0.00 0.00 0.00 1.00
0.00 0.00 0.00 0.00 1.00
0.00 0.00 1.00 0.00 0.00
0.00 0.00 0.00 2.00 0.00
2.00 1.00 0.00 0.00 0.00
0.00 1.00 0.00 0.00 0.00
0.00 0.00 1.00 0.00 0.00
0.00 0.00 1.00 0.00 0.00
1.00 0.00 0.00 0.00 0.00
0.00 1.00 0.00 1.00 0.00
0.00 0.00 0.00 2.00 0.00
0.00 0.00 2.00 0.00 0.00
0.00 0.00 1.00 0.00 0.00
1.00 0.00 0.00 0.00 0.00
1.00 1.00 0.00 0.00 0.00

I am getting this output:

0.09 0.08 ? 0.02 ?
0.00 0.00 ? 0.00 ?
0.00 0.00 ? 0.00 ?
0.00 0.00 ? 0.00 ?
0.00 0.00 ? 0.00 ?
0.03 0.05 ? 0.39 ?
0.34 0.28 ? 0.01 ?
0.09 0.08 ? 0.02 ?
0.00 0.00 ? 0.00 ?
0.00 0.00 ? 0.00 ?
0.13 0.10 ? 0.01 ?
0.07 0.10 ? 0.21 ?
0.03 0.05 ? 0.39 ?
0.00 0.00 ? 0.00 ?
0.00 0.00 ? 0.00 ?
0.13 0.10 ? 0.01 ?
0.21 0.18 ? 0.01 ?

I am wondering why the question marks appear. Is this because of Jama?
Thank you!

Sujit Pal said...

Not sure, but my guess would be that these are NaN (not a number) values caused by floating-point overflow or underflow. Frequently NaNs are caused by dividing by numbers that are almost 0. It's been a while since I wrote the code, but take a look at the client code to see if there are checks for very low or very high values (Double.MIN_VALUE and Double.MAX_VALUE) in the input matrix; that may fix the problem. Otherwise you will have to look within Jama.

Emmaunueleater said...

Hello,
I'm working on incorporating some parts of your jtmt project (vector generation and searching) into a project of my own. I've got a question about an error: I get an exception in thread "main": org.springframework.jdbc.CannotGetJdbcConnectionException. I've included mysql-connector-java-5.05.jar in my build path in Eclipse, but the error persists. I'm wondering if it has to do with the connection,
jdbc:mysql://localhost:3306/tmdb
or user name tmdb
or password irstuff

Sujit Pal said...

Hi Emmaunueleater, I can't say for sure from the description whether it's to do with the database (or the lack of one), but some of the code does need a database, and maybe that's what you are running into. I used a MySQL database; the scripts for the schema and data are available in the src/main/sql directory.

Moha said...

Dear Mr. Sujit Pal. I am doing my undergraduate thesis, creating an IR system to find documents on a computer using LSI. Is it possible to use many documents (including .pdf, .doc, .ppt and .odf files) for searching with LSI? Doesn't it take a long time to process?
Looking forward to your advice and suggestions, please. Thanks.

Sujit Pal said...

Hi Moha, yes, you are absolutely right, LSI is very resource-intensive, so it's unlikely to be very useful for large datasets. I haven't needed to use it for large datasets, but I believe Mahout has a Hadoop-based SVD which may be useful. Another place to look may be the Semantic Vectors project. I haven't looked much at either yet, though.

Moha said...

Glad to hear your response. But what if I keep developing the IR system with LSI - any suggestions?
I have another question. According to Wikipedia, cosine similarity in an information retrieval context won't be negative. Still, in the example by Dr. E. Garcia, there are negative values. Can you explain this to me?

Sujit Pal said...

Yeah, apart from Mahout and Semantic Vectors, I don't have any suggestions for LSI. Also, per my understanding, cosine similarity is always 0-1 (0 for completely dissimilar, a 90 degree angle between the two vectors in vector space, to 1 for completely similar, a 0 degree angle). I am not sure which of Dr Garcia's articles you are referring to; I checked the one I linked to from the post and couldn't find any string with "cosine" -- you could probably ask him for clarification about it.

Moha said...

Thanks for the suggestion. About the negative value in cosine similarity, perhaps you can see this tutorial by Dr. Garcia; one of the similarity calculations shows a negative value. Anyway, I have got your JTMT project from Sourceforge and have tried it (especially IndexersTest.java). One thing I don't understand is how you use the database to store the words and get the matrix. When I check the database, I see nothing but empty tables. Can you explain the logic of the program to me, or maybe the steps you used to generate the matrix from the database, please? I understand how to do LSI from extracting the text, term weighting, generating the term-document matrix, calculating the SVD, creating the query-document matrix, and calculating the similarity using cosine similarity, but I don't know how to generate the matrix from the database. Did you erase the table records after generating the matrices? Would you please send the answer to my email as well: new.muha@yahoo.co.id. I really need your help, please. Thank you, Sir.

Sujit Pal said...

Hi Moha, thanks for the link to the page... it turns out I was wrong about the (0,1) range for cosine similarity - it can have values in (-1,1) - so thanks for the education as well :-). From Wikipedia, it seems that negative values do indicate things:

The resulting similarity ranges from −1 meaning exactly opposite, to 1 meaning exactly the same, with 0 usually indicating independence, and in-between values indicating intermediate similarity or dissimilarity.

To answer your question, the database is being used to hold the configuration for certain recognizers (in VectorGenerator, right?). The data is pulled from Dr Konchady's TextMine project. You can find copies of the data under src/main/sql, which can be loaded into the db. The db itself can be created using the *_schema.sql files.

Anonymous said...

Please, I need to download the jar file that contains the following class:

org.apache.commons.collections15.Transformer

Can you provide me the link?
Thanks very much

Sujit Pal said...

It's here:
http://larvalabs.com/collections/

Anonymous said...

Hello, Sujit.
I need to ask what the meaning of this part is:

this.dictionary = new Dictionary(
new URL("file", null, "/opt/wordnet-3.0/dict"));

in ContentWordRecognizer.java.
I'm getting an error there when I run IndexersTest.java: "java.io.IOException: Dictionary directory does not exist: \opt\wordnet-3.0\dict".
What should I do to fix the problem?
Thanks.

Sujit Pal said...

This is the wordnet dictionary. You will have to download and install Wordnet - I installed mine under /opt/wordnet-3.0.

Anonymous said...

Can you give me a URL to download the dictionary?
Note: I'm using Windows in my implementation.
Actually, what is the meaning of /opt/wordnet-3.0? Is it a directory location? Please give me a clue for Windows.
Thanks.

Sujit Pal said...

Remember, Google's your friend :-). I searched for "wordnet download" and the instructions, download location, etc. were in the first result. And yes, /opt/wordnet-3.0 refers to the directory location - on Windows, you would probably download and install into C:\Wordnet or something instead of /opt/wordnet-3.0.

Anonymous said...

Hi Sujit,
Please explain to me the main purpose of using a database in this implementation.
I still don't understand it.
Please tell me the process in detail.
Thanks, Sujit...

guru said...

hello sir,
I'm required to build a web crawler in Java incorporating the concept of LSI. I am totally new to Java. I was going through the LSI code and I want to ask where I can get the package
package com.mycompany.myapp.indexers;

I'm also not sure how these imports work and how I will be able to use them.
import org.apache.commons.collections15.Transformer;

import Jama.Matrix;
import Jama.SingularValueDecomposition;

Sujit Pal said...

@Anonymous: the database is used to store word collocations, "common" abbreviations, etc., so the recognizers can tag them as such - this is used for word and sentence boundary recognition. If you look back a few posts, you will probably find a more detailed description - in any case, the code is also available (with inline comments) on my JTMT project page (link in the answer to guru's comment).

@guru: Hi guru, I don't know your application well enough, but I am not sure how a crawler would incorporate LSI - an indexer I can see, though... Anyway, to answer your questions, my stuff (mycompany) is available at my JTMT project on Sourceforge, JAMA is available at this NIST site, and the collections15 package is available at Larvalabs. Good luck!

Raja Anbazhagan said...

I'm trying to cluster documents using LSI...
I need to know how documents are ranked using the SVD matrix...
Please...
Awaiting your response...

Sujit Pal said...

Hi Raja, LSI takes the TD matrix and finds the principal components. So the decomposed matrix will give you the terms that are considered "most relevant". You could use documents with the highest (normalized) frequency of each term as the cluster seeds.
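
A rough sketch of that seeding idea in Java, using Jama's SVD on a term-by-document matrix with at least as many terms as documents, and with k no larger than the number of documents; the class and method names are illustrative and this is not code from the jtmt project:

import Jama.Matrix;
import Jama.SingularValueDecomposition;

// For each of the top k latent dimensions, pick the term with the largest
// loading, then pick the document in which that term has the highest weight.
public class LsiClusterSeeds {
  public static int[] findSeedDocs(Matrix termDoc, int k) {
    SingularValueDecomposition svd = termDoc.svd();
    Matrix u = svd.getU();  // term loadings on the latent dimensions
    int[] seedDocs = new int[k];
    for (int dim = 0; dim < k; dim++) {
      // term with the largest absolute loading on this latent dimension
      int bestTerm = 0;
      for (int t = 1; t < u.getRowDimension(); t++) {
        if (Math.abs(u.get(t, dim)) > Math.abs(u.get(bestTerm, dim))) {
          bestTerm = t;
        }
      }
      // document in which that term has the highest (normalized) frequency
      int bestDoc = 0;
      for (int d = 1; d < termDoc.getColumnDimension(); d++) {
        if (termDoc.get(bestTerm, d) > termDoc.get(bestTerm, bestDoc)) {
          bestDoc = d;
        }
      }
      seedDocs[dim] = bestDoc;
    }
    return seedDocs;
  }
}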

23011986 said...

I'm trying to classify documents using SVM...
I need to know how to use multiclass SVM...
Please...
Awaiting your response...

Sujit Pal said...

Hi 23011986, sorry can't help you there. Don't know much about SVMs.

23011986 said...

ok, thank you

Navaraj said...

Hello Sujit sir,
This has been a really great learning blog for me, and there is also a lot to learn from the comments that follow. I tried your code and just have one small query: I imported the SQL files as required for IndexerTest and SimilarityTest, but when I check those tables there is nothing in them. Is that expected, or did I make a mistake somewhere? I used the resource files you provided.
Thank you very much,
Regards,

Sujit Pal said...

Hi Navaraj, the SQL files are in src/main/sql of the jtmt project on sourceforge. You will have to create the database yourself. The SQL files with "create table" are needed to set up the (empty) tables. The SQL files with "insert into" statements are for loading the data.

Anonymous said...

Hi, I'm new to Java programming. I already copied your code but got errors around calls like vectorGenerator.getMatrix(), vectorGenerator.getDocumentNames() and vectorGenerator.getWords() - everything about VectorGenerator. Can you help me? Thank you.

23011986 said...

Sujit, good evening,

I want to work with the Naive Bayes classifier with 11 classes, using feature selection with Information Gain and Chi-square. Could you send me your code?
Thank you in advance, and sorry for my English.


Thanks a lot,
Meryeme

Anonymous said...

Hi, I ran your code but only the JUnit tests appear. Can you help me figure out what the problem is? Thank you.

Sujit Pal said...

@Anonymous (04/07/2012) - are these compilation errors? You can find the VectorGenerator.java class in this post.

@Meryeme: It's been a while, so I am not sure where the code is, but you should find some references to Naive Bayes and the Chi-squared test (with code I wrote to do these things) scattered through my blog posts. I usually kept all of this in the jtmt.sf.net project, so that's another place you may want to look.

@Anonymous (04/08/2012) - yes, I like to use JUnit to run code from the command line. You can use the contents of the JUnit test to build yourself a main class if you prefer running them that way.

Java Developer said...

How can I detect Hindi transliteration? Any help appreciated. For example:
such as हम एस ब्लोक को लाइक करता हु .
Hum es blok to Like karta hu.

Sujit Pal said...

I was thinking about your question, and at first glance this seems to be a simple character match-and-replace problem... there are a fixed number of consonants which can occur on their own or be modified in one of 8 different ways (corresponding to each vowel) - so per consonant there are 9 characters. Now we can associate a "Hinglish" string with each. So if you read the Hindi string character by character and substitute the equivalent English string for each pattern, that should produce what you are looking for. No?
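
A tiny illustrative sketch of that substitution idea; the character map below covers only a handful of Devanagari characters, ignores inherent vowels and conjuncts, and the Latin strings chosen are assumptions rather than a standard transliteration scheme:

import java.util.HashMap;
import java.util.Map;

// Illustrative only: a real transliterator needs the full consonant and
// vowel-sign (matra) table plus rules for inherent vowels and conjuncts.
public class SimpleTransliterator {

  private static final Map<Character, String> CHAR_MAP = new HashMap<Character, String>();
  static {
    CHAR_MAP.put('ह', "h");   // consonant ha
    CHAR_MAP.put('म', "m");   // consonant ma
    CHAR_MAP.put('क', "k");   // consonant ka
    CHAR_MAP.put('र', "r");   // consonant ra
    CHAR_MAP.put('त', "t");   // consonant ta
    CHAR_MAP.put('ा', "aa");  // aa matra
    CHAR_MAP.put('ु', "u");   // u matra
    CHAR_MAP.put('े', "e");   // e matra
  }

  public static String transliterate(String hindi) {
    StringBuilder buf = new StringBuilder();
    for (char c : hindi.toCharArray()) {
      String latin = CHAR_MAP.get(c);
      if (latin != null) {
        buf.append(latin);
      } else if (Character.isWhitespace(c)) {
        buf.append(c);
      }
      // characters without a mapping are simply dropped in this sketch
    }
    return buf.toString();
  }
}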

somya said...

Sir, I am new to this and a student. I want to do text categorization, and before that I need to run the entire code on some dataset to calculate tf-idf for each term... please help me and guide me through the steps to run your code.
Please reply soon; my email is pieces23somya@gmail.com

I will be very grateful to you

Sujit Pal said...

Hi Somya, I would suggest looking at Weka or Mallet for small scale text categorization or at Mahout for large scale (ie lots of training data). The stuff here was just me trying to understand what went on under the covers, the code here will probably not do too well against large amounts of data.

Unknown said...

Hello sir, I am working on text document categorization... at present I have completed the preprocessing of the text by eliminating text noise like slang, and by doing concept expansion via WordNet. Now how do I proceed to SVM - how do I scale the text down to numbers? Please help me in this regard, as I am very new to the IR field.

Sujit Pal said...

Hi Swathi, to convert the text to numbers, you would tokenize the text into words (terms) and create a sparse term-document matrix which can be used as input to SVM. If you are language-agnostic, then I would suggest using Python and Scikits Learn since they already have these components - see this tutorial for more details.
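
A small Java sketch of that idea (names are illustrative): each document becomes a sparse map from a term id to its count, which is the kind of structure most SVM packages can consume after tf-idf weighting:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: build a sparse document -> (term id -> count) structure
// from tokenized documents. Tf-idf weighting and export to an SVM input format
// would follow the same pattern.
public class SparseTermDocBuilder {

  public static List<Map<Integer, Double>> build(List<List<String>> tokenizedDocs,
      Map<String, Integer> termIndex) {
    List<Map<Integer, Double>> docVectors = new ArrayList<Map<Integer, Double>>();
    for (List<String> tokens : tokenizedDocs) {
      Map<Integer, Double> docVector = new HashMap<Integer, Double>();
      for (String term : tokens) {
        Integer termId = termIndex.get(term);
        if (termId == null) {
          termId = termIndex.size();      // assign the next id to a new term
          termIndex.put(term, termId);
        }
        Double count = docVector.get(termId);
        docVector.put(termId, count == null ? 1.0D : count + 1.0D);
      }
      docVectors.add(docVector);
    }
    return docVectors;
  }
}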

zeusmain said...

Hi.
I just want to download your source code.

Can you send the whole source code to my email?

my email address :: blue3309@naver.com

Sujit Pal said...

Hi zeusmain, all my code is available in SVN on the jtmt.sf.net project, you can download it yourself from there.

Najam said...

Dear Sujit,

I have tried calculating tf-idf using your approach. Since you calculate TF and IDF separately, I thought it would be pretty easy to combine them, but I ran into some problems. First I tried the multiply method of the RealMatrix interface but got an error. I am also trying simple matrix calculations. Could you help me get the Tf*Idf values computed correctly?

Sujit Pal said...

Hi Najam, you would probably need to provide some more information. Can you point to the source name and line number which you are using, and what errors you are seeing, etc?

Najam said...

Dear Sujit,

You know TF is one matrix and IDF is another matrix. So if you need the TF-IDF value, which is TF * IDF (a multiplication of TF and IDF), you need to multiply the two matrices. Simple matrix multiplication requires the column count of one to equal the row count of the other. So when I tried to multiply the two matrices to get the TF*IDF values, I got a matrix multiplication error:

org.apache.commons.math.MathRuntimeException$4: 20x7 and 20x7 matrices are not multiplication compatible

So I tried another way, using a nested loop to get the values and multiply them, but the resulting values are not correct. Something like this:

public RealMatrix calculateTfIdf(RealMatrix tfRealMatrix, RealMatrix idfRealMatrix) {
  for (int i = 0; i < tfRealMatrix.getColumnDimension(); i++) {
    for (int j = 0; j < idfRealMatrix.getColumnDimension(); j++) {
      idfRealMatrix.setEntry(i, j, tfRealMatrix.getEntry(i, j) * idfRealMatrix.getEntry(i, j));
    }
  }
  return idfRealMatrix;
}

I thought it was better to use the matrix methods to multiply the values; otherwise I would have to change many things in the code, and being pretty new to it, that might lead me into more trouble. That's why I asked you for a solution with minimal changes that gets the exact values.

I hope the problem is clearer now... Looking forward.

Sujit Pal said...

Actually, I have always thought of tf-idf as a single combined transformation on each element of the TD matrix. In your scenario, where you calculate the TF matrix and then the IDF matrix, the TF-IDF matrix will be an element-wise (Hadamard) product of the two.

Najam said...

You are right, TF and IDF are calculated separately, but in some cases you need to take an element-wise product of them because it removes the discrepancies in both the TF and IDF matrices, and I need that product. That is why I asked for a solution. If you have any solution, please share.

Najam said...

"TF-IDF as a single combined transformation on each element of the TD matrix"

How can we achieve this?

Do you mean that all the functionality of TF and IDF (which is currently done separately) could be done in a single transformation and returned?

Sujit Pal said...

Both TF and IDF are applied to individual terms. If you need to apply them separately, then take the element-wise product of tf(X) and idf(X), where X is the TD matrix and tf(X) and idf(X) are the tf() and idf() operations applied to each element of X.
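
For illustration, a minimal sketch of that element-wise combination, assuming tfMatrix and idfMatrix have the same shape and looping over the full row and column dimensions; the names are illustrative, not from the jtmt code:

import org.apache.commons.math.linear.RealMatrix;

// Element-wise (Hadamard) combination of two equally-sized matrices.
public class TfIdfCombiner {
  public static RealMatrix combine(RealMatrix tfMatrix, RealMatrix idfMatrix) {
    RealMatrix result = idfMatrix.copy();
    for (int i = 0; i < tfMatrix.getRowDimension(); i++) {
      for (int j = 0; j < tfMatrix.getColumnDimension(); j++) {
        result.setEntry(i, j, tfMatrix.getEntry(i, j) * idfMatrix.getEntry(i, j));
      }
    }
    return result;
  }
}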

Anonymous said...

Hi there. I would be very glad if you could give me a step-by-step process for running it from the command prompt. I got some errors running it in Eclipse. Thanks!

Sujit Pal said...

Hi Anonymous, it's been a while since I did this work and I don't remember the specifics anymore, so at this point it would be really hard (and boring) for me to build up a step-by-step process. I did most of the work as a learning exercise and wrote about the interesting bits as I finished them, so sequence-wise, my blog posts are probably a good guide. All the functionality is written as JUnit tests and can be run from the command line using either mvn or ant. I started out using mvn but then switched to ant because I had to install a lot of libraries I couldn't find in the mvn repositories, so there is a lib directory in the jtmt (jtmt.sf.net) project. An easy way to start developing with Eclipse is to run "mvn eclipse:eclipse" to build the initial .classpath file, then go into Eclipse and point to the JARs in the lib directory for the build path. Hope the above description works well enough for you.

ndk said...

Thank you for the tutorial. I don't know if you are aware that the link to the LSI/SVD tutorial is broken.
Also, I wonder if your choice of k = sqrt(number_of_columns) is just arbitrary or if there is any intuition behind it?

How is k usually chosen in practice? Do you usually have a training dataset and vary the value of k until you get the best result on the training data?

Thanks,

Sujit Pal said...

Hi ndk, you are welcome - I was learning as well at that point :-). Thank you for pointing out the broken LSI/SVD link; I will fix it - apparently the articles on Dr Garcia's site are no longer free. I got the choice of k from the TMAP book, so I guess it must be fairly common, but I am not sure whether it is considered best practice. Now that I know a little more, I would determine k with a training set, by trying various values, plotting a learning curve, and choosing the smallest value that gives good results.

ndk said...

Yes, that's what I thought too. Thank you very much for confirming.
I also searched around for the tutorial you cited but couldn't find any working link. However, here is a good tutorial from Stanford if you want to replace the broken link with: http://nlp.stanford.edu/IR-book/pdf/18lsi.pdf
Thanks again for your good work

Sujit Pal said...

Thanks for the kind words and the link to the IR book chapter. I have updated the post with this link.

Unknown said...

Hi,
I am working with very big documents and I am getting a Java heap space error at OpenMapRealMatrix. How can I solve this problem?

thanks a lot.

Sujit Pal said...

Hi Hazal, hard to say for sure, but the matrices may get large (even with OpenMapRealMatrix, which helps because they are also quite sparse) depending on the input size. So you may need a bigger heap or more RAM. Did you set your heap size with the -Xmx parameter to Java? If so, then maybe you need more physical memory to accommodate it.

Unknown said...

thanks a lot Sujit :)

Sujit Pal said...

>> nguyễn trang52k3it said: Hello Sujit Pal. He gave me the complete application package code of the subject...
Hi nguyễn, I've not published your post for your privacy, since you had your email address in there. I am guessing you are looking for the source code for this post? If so, all the code is downloadable from my project (http://jtmt.sf.net, click on Browse SVN).

Unknown said...

Hello Sujit,
Your post is cool.
I have been working on a project to create a model for unsupervised aspect-based opinion mining on online reviews (broadly domain-independent).
I've followed these steps:
1. Preprocessing - stopword removal, stemming
2. Term-document matrix creation
3. Normalization of the term-document matrix, both with and without tf-idf
4. SVD with dimension reduction
5. Product of the left singular matrix (term vectors) and the scale matrix (sigma/weights) to produce k-dimensional points, where k is the reduction rank
6. Clustering of the points obtained, using bisecting K-means (hierarchical clustering)
I expected semantically equivalent words/terms/aspects to be clustered together.
For instance, in the case of a mobile review, resolution, display and screen should end up in one cluster.
But I'm not getting the desired results.
I need urgent help.
Thank you in advance.

Sujit Pal said...

Thanks for the kind words, Pooja. Your approach is very interesting; I don't know much about aspect-based opinion mining, but it looks like a valid way to decompose the problem. Between #5 and #6, I think you might need to multiply the vector for the original word with the reduction vector to create the vector for the word in the reduced space - you are probably doing that already and just forgot to list it. Did you try varying k to see if you get better results? Also, since you are using bisecting K-means, is your problem just 2 classes (positive/negative)? Would it make more sense to look only at the verbs in your review text instead of all the text?

Unknown said...

Thank you so much for taking the time.
1. In step 5 I'm multiplying only the term-vector matrix and the scale matrix, assuming that would give only the significant terms. I haven't tried multiplying all three decomposed matrices, as that would give the significance of the terms with respect to the documents (or opinions in my case), which I don't need.
2. Yes, I've considered filtering only nouns, as nouns are candidate aspects/features.
3. Bisecting K-means can produce more than two clusters. I do get a couple of them, but not all are classified properly :(
4. Varying k does make a difference, but the results are still not very good.
I have another query:
I've used your tf-idf implementation in my project. I understand IDF to be calculated as
IDF(t) = log_e(number of documents / number of documents containing term t), but in your implementation you have used
IDF(t) = 1 + log_e(number of documents) - log_e(number of documents containing term t).
Could you explain this, please?
Thank you.

Sujit Pal said...

It was very likely an attempt at smoothing, although in retrospect it should have been log(1 + N/d_i) rather than 1 + log(N/d_i). I was learning this stuff at the time and following the TMAP book, so it's likely that I translated the formula incorrectly - sorry about misleading you.
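
For comparison, a tiny sketch of the two smoothing variants mentioned above, using natural log and the N=7, d=2 numbers from the earlier "human" example (illustrative only):

public class IdfSmoothingDemo {
  public static void main(String[] args) {
    double numDocs = 7.0D;   // N
    double docFreq = 2.0D;   // d_i, documents containing the term
    double logOfOnePlus = Math.log(1.0D + numDocs / docFreq);  // log(1 + N/d_i)
    double onePlusLog = 1.0D + Math.log(numDocs / docFreq);    // 1 + log(N/d_i), as in the post's formula
    System.out.println("log(1 + N/d) = " + logOfOnePlus + ", 1 + log(N/d) = " + onePlusLog);
  }
}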

Unknown said...

Thank you

Unknown said...

Kindly help with this


Ranking Task.
1. Implement a custom ranking algorithm that ranks documents according to several metadata attributes (like word count). Devise a rating system for documents based on importance of document, age of document, word count, source, file type etc. (10 marks)
2. Implement a number of sophisticated ranking algorithms taking into account keywords, Metadata, page structure, Named Entities and User Models. Make systematic comparisons between competing ranking algorithms and provide graphs of recall and precision ROC as your output.(50-70 marks)

3. Implement a number of sophisticated ranking algorithms taking into account keywords, Metadata, page structure, Named Entities and Thesaurus expansion, Make systematic comparisons between competing ranking algorithms and provide a graph of recall and precision ROC as your output. This will require you to make use of test corpus. Provide description on interface of what ranking is currently in use, and provide a breakdown of why a document is ranked the way it is (go beyond a score – break the score down).(70-90 marks)

Sujit Pal said...

Hi Bright, this looks like an interesting homework assignment :-). The easy way to do this would be to extract values for the various metadata fields during indexing (i.e. count the words and put the count into a word-count field, find a way to quantify document importance, etc). Then build different queries with different fields and weights (boosts) to represent the different custom ranking algorithms. For the recall and precision, you will need to decide, for a set of queries (say 30-50), the documents that should be returned for each query - this is your gold set. For each query, compute precision as the number of returned results that are also in the gold set divided by the number of results returned, and recall as the number of gold-set results that showed up divided by the total number of gold-set results for that query. For the score breakdown, you should consider using the explain output. I am assuming you are using Lucene or Lucene-based search tools. Also, if you are feeling adventurous, check out function queries - they give you more control but are slow, so you should prefer baking metadata into the index over using function queries when possible.
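
A small sketch of that per-query computation (names are illustrative): precision is the overlap divided by what was returned, and recall is the overlap divided by the gold set:

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Per-query precision and recall against a gold set of relevant documents.
public class PrecisionRecall {
  public static double[] compute(List<String> returnedDocIds, Set<String> goldDocIds) {
    Set<String> overlap = new HashSet<String>(returnedDocIds);
    overlap.retainAll(goldDocIds);  // results that are also judged relevant
    double precision = returnedDocIds.isEmpty() ? 0.0D : (double) overlap.size() / returnedDocIds.size();
    double recall = goldDocIds.isEmpty() ? 0.0D : (double) overlap.size() / goldDocIds.size();
    return new double[] { precision, recall };
  }
}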

Unknown said...

Dear sir,

thank you for your post, it's really very good.

I'm interested in applying LSA to find similar documents, as follows:

A large document collection is represented in a VSM (vector space model), where every row represents a document name, the columns are unique words, and the entries are tf-idf values. My problem starts from here:

when applying LSA for dimensionality reduction, three matrices are produced. I want to apply a clustering algorithm on the appropriate matrix to find similar documents as groups (clusters), so which matrix is suitable for that? Thanks.

I'm looking for an explanation of how to apply that.

Sujit Pal said...

Thanks for the kind words Ahmed. If you are doing SVD, the document-term matrix gets split up into a document (U), singular (S) and term (V.T) matrix. Assume your original matrix M is of size (50, 5000) (50 docs x 5000 terms); with the thin SVD the sizes of U, S and V.T will be (50, 50), (50, 50) and (50, 5000). For clustering, I think you will have to reconstruct the matrix using a smaller S (and corresponding cuts on the U columns and V.T rows), then compute M * M.T to get the document-document similarity matrix.
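
A sketch of that reconstruction step using Jama (illustrative names, not code from the post). Note that Jama's SVD expects at least as many rows as columns, so for a 50 x 5000 document-term matrix you would decompose the transpose (i.e. the term-document matrix) and swap the roles of U and V accordingly:

import Jama.Matrix;
import Jama.SingularValueDecomposition;

// Rank-k reconstruction of a document-term matrix m (rows = docs, cols = terms),
// followed by a document-document similarity matrix of dot products.
public class LsiDocSimilarity {
  public static Matrix docSimilarity(Matrix m, int k) {
    SingularValueDecomposition svd = m.svd();
    int rows = m.getRowDimension();
    int cols = m.getColumnDimension();
    Matrix uk = svd.getU().getMatrix(0, rows - 1, 0, k - 1);               // docs x k
    Matrix sk = svd.getS().getMatrix(0, k - 1, 0, k - 1);                  // k x k
    Matrix vtk = svd.getV().transpose().getMatrix(0, k - 1, 0, cols - 1);  // k x terms
    Matrix mk = uk.times(sk).times(vtk);     // rank-k approximation of m
    return mk.times(mk.transpose());         // docs x docs similarity
  }
}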