## Saturday, September 27, 2008

### IR Math with Java : Similarity Measures

Last week, I wrote about building term document matrices based on Dr. Manu Konchady's Text Mining Application Programming book. This week, I continue working on computing some similarity metrics from the same set of data. Most of the material is covered in Chapter 8 of the book.

As before, our document collection consists of 7 document titles as shown below:

```
D1  Human machine interface for computer applications
D2  A survey of user opinion of computer system response time
D3  The EPS user interface management system
D4  System and human system engineering testing of EPS
D5  The generation of random, binary and ordered trees
D6  The intersection graph of paths in trees
D7  Graph minors: A survey
```

The document collection is first converted to a raw term frequency matrix. This is further normalized using either TF/IDF or LSI. The matrix thus normalized is the input to our similarity computations.

A document is represented by a column of data in our (normalized) term-document matrix. To compute the similarity between two documents, we perform computations between two columns of the matrix. Conceptually, each document is a point in an n-dimensional term space, where each term corresponds to a dimension. So in our example, we have 23 terms, and therefore a 23-dimensional term space. In case this is hard to visualize (it was for me), it helps to think of documents with 2 terms (and hence a two-dimensional term space), and extrapolate from there.
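To make the two-term picture concrete, here is a tiny standalone sketch (the two terms and their counts are made up for illustration): each document becomes a vector in a plane whose axes are the terms, and the angle between the two document vectors can be computed directly.

```java
// A toy two-term space: each document is a vector whose coordinates
// are its counts of the two terms. Counts here are hypothetical.
public class TwoTermSpace {

  // angle (in degrees) between two 2-d document vectors
  static double angleBetween(double[] a, double[] b) {
    double angleA = Math.atan2(a[1], a[0]);
    double angleB = Math.atan2(b[1], b[0]);
    return Math.toDegrees(Math.abs(angleB - angleA));
  }

  public static void main(String[] args) {
    // counts of the terms ("graph", "tree") in two imaginary documents
    double[] docA = {2, 1};
    double[] docB = {1, 2};
    System.out.printf("angle between docA and docB: %.1f degrees%n",
        angleBetween(docA, docB));
  }
}
```

The same idea extends unchanged to our 23-dimensional case; the angle just can't be drawn on paper anymore.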

### Jaccard Similarity

Jaccard Similarity is a simple and very intuitive measure of document to document similarity. It is defined as follows:

```  similarity(A,B) = n(A ∩ B) / n(A ∪ B)
```

Using the Inclusion-Exclusion Principle, this reduces to:

```  similarity(A,B) = n(A ∩ B) / (n(A) + n(B) - n(A ∩ B))
where:
n(A ∩ B) = Σ min(Ai, Bi)
n(A) = Σ Ai
n(B) = Σ Bi
for i = [0..n-1], where n = number of terms in our term-document matrix.
```
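Before looking at the Jama-based classes, the formula can be sanity-checked with a standalone sketch. This is an illustration only, not the code used in this post: plain double arrays stand in for the matrix columns.

```java
// Standalone sketch of the Jaccard formula above, with plain arrays
// standing in for term-document matrix columns.
public class JaccardSketch {

  static double jaccard(double[] a, double[] b) {
    double intersection = 0.0, sumA = 0.0, sumB = 0.0;
    for (int i = 0; i < a.length; i++) {
      intersection += Math.min(a[i], b[i]); // n(A ∩ B)
      sumA += a[i];                         // n(A)
      sumB += b[i];                         // n(B)
    }
    // inclusion-exclusion: n(A ∪ B) = n(A) + n(B) - n(A ∩ B)
    double union = sumA + sumB - intersection;
    return union == 0.0 ? 0.0 : intersection / union;
  }

  public static void main(String[] args) {
    double[] d1 = {1, 1, 0};
    double[] d2 = {1, 0, 1};
    System.out.println(jaccard(d1, d2));  // one shared term out of three
    System.out.println(jaccard(d1, d1));  // identical documents score 1.0
  }
}
```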

AbstractSimilarity.java

The AbstractSimilarity class shown below contains the code to compare all document pairs in our collection. This has a complexity of O(n²), which we could roughly halve by exploiting the facts that similarity(A,B) == similarity(B,A) and similarity(A,A) == 1.0, but which is still too high for large numbers of documents. I haven't added these optimizations here, though, since this is just toy data and the performance is tolerable.

```
// Source: src/main/java/com/mycompany/myapp/similarity/AbstractSimilarity.java
package com.mycompany.myapp.similarity;

import org.apache.commons.collections15.Transformer;

import Jama.Matrix;

public abstract class AbstractSimilarity implements Transformer<Matrix,Matrix> {

  public Matrix transform(Matrix termDocumentMatrix) {
    int numDocs = termDocumentMatrix.getColumnDimension();
    Matrix similarityMatrix = new Matrix(numDocs, numDocs);
    for (int i = 0; i < numDocs; i++) {
      Matrix sourceDocMatrix = termDocumentMatrix.getMatrix(
        0, termDocumentMatrix.getRowDimension() - 1, i, i);
      for (int j = 0; j < numDocs; j++) {
        Matrix targetDocMatrix = termDocumentMatrix.getMatrix(
          0, termDocumentMatrix.getRowDimension() - 1, j, j);
        similarityMatrix.set(i, j,
          computeSimilarity(sourceDocMatrix, targetDocMatrix));
      }
    }
    return similarityMatrix;
  }

  protected abstract double computeSimilarity(
      Matrix sourceDoc, Matrix targetDoc);
}
```

JaccardSimilarity.java

For the JaccardSimilarity class, we define the abstract method computeSimilarity() as shown below (based on our formula above). I have cheated here slightly: in Jama, norm1() is defined as the maximum column sum, but since our document matrices are column matrices (i.e., a single column), norm1() returns the same value as the sum of all elements in the document matrix. It does make the code less readable, in the sense that the reader/maintainer has to go through an extra mental hoop, but it keeps the code concise.

```
// Source: src/main/java/com/mycompany/myapp/similarity/JaccardSimilarity.java
package com.mycompany.myapp.similarity;

import Jama.Matrix;

public class JaccardSimilarity extends AbstractSimilarity {

  @Override
  protected double computeSimilarity(Matrix source, Matrix target) {
    double intersection = 0.0D;
    for (int i = 0; i < source.getRowDimension(); i++) {
      intersection += Math.min(source.get(i, 0), target.get(i, 0));
    }
    if (intersection > 0.0D) {
      double union = source.norm1() + target.norm1() - intersection;
      return intersection / union;
    } else {
      return 0.0D;
    }
  }
}
```

The result of our Jaccard similarity computation is shown below. The JUnit test that produces this output is SimilarityTest.java, which is shown later in this post. The two matrices are derived from our raw term-document matrix normalized using TF/IDF and LSI respectively. As you can see, the similarity matrix created off the LSI-normalized matrix shows more inter-document similarity than the one created off the TF/IDF-normalized matrix.

```
=== Jaccard Similarity (TF/IDF) ===
            D1      D2      D3      D4      D5      D6      D7
    D1  1.0000  0.0000  0.0000  0.0818  0.0000  0.0000  0.0000
    D2  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000  0.0710
    D3  0.0000  0.0000  1.0000  0.2254  0.0000  0.0000  0.0000
    D4  0.0818  0.0000  0.2254  1.0000  0.0000  0.0000  0.0000
    D5  0.0000  0.0000  0.0000  0.0000  1.0000  0.0000  0.0000
    D6  0.0000  0.0000  0.0000  0.0000  0.0000  1.0000  0.1781
    D7  0.0000  0.0710  0.0000  0.0000  0.0000  0.1781  1.0000

=== Jaccard Similarity (LSI) ===
            D1      D2      D3      D4      D5      D6      D7
    D1  1.0000  0.0000  1.0000  1.0000  0.0000  0.0000  0.0000
    D2  0.0000  1.0000  0.0000  0.0000  1.0000  1.0000  1.0000
    D3  1.0000  0.0000  1.0000  1.0000  0.0000  0.0000  0.0000
    D4  1.0000  0.0000  1.0000  1.0000  0.0000  0.0000  0.0000
    D5  0.0000  1.0000  0.0000  0.0000  1.0000  1.0000  1.0000
    D6  0.0000  1.0000  0.0000  0.0000  1.0000  1.0000  1.0000
    D7  0.0000  1.0000  0.0000  0.0000  1.0000  1.0000  1.0000
```

### Cosine Similarity

Cosine Similarity is the more popular but also a slightly more complex measure of similarity. It is defined as:

```  similarity(A,B) = cos θ = (A ⋅ B) / (|A| * |B|)
where:
A ⋅ B = Σ Ai * Bi
|A| = sqrt(Σ Ai²)
|B| = sqrt(Σ Bi²)
for i = [0..n-1], where n = number of terms in our term-document matrix.
```

Here, θ is the angle between the document vectors A and B in the n-dimensional term space. A and B are very similar as θ approaches 0° (cos 0° = 1), and very dissimilar as θ approaches 90° (cos 90° = 0).
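That geometric intuition can be checked with a small standalone sketch (plain arrays rather than Jama matrices, with hypothetical term weights): vectors pointing in the same direction score 1.0 regardless of magnitude, and orthogonal vectors score 0.0.

```java
// Standalone sketch of the cosine formula above, with plain arrays
// standing in for term-document matrix columns.
public class CosineSketch {

  static double cosine(double[] a, double[] b) {
    double dot = 0.0, sqSumA = 0.0, sqSumB = 0.0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];      // A . B
      sqSumA += a[i] * a[i];   // for |A|
      sqSumB += b[i] * b[i];   // for |B|
    }
    double denom = Math.sqrt(sqSumA) * Math.sqrt(sqSumB);
    return denom == 0.0 ? 0.0 : dot / denom;
  }

  public static void main(String[] args) {
    double[] d1 = {1, 2, 0};
    double[] d2 = {2, 4, 0};  // same direction, twice the magnitude
    double[] d3 = {0, 0, 5};  // orthogonal to both
    System.out.println(cosine(d1, d2));  // parallel vectors: ~1.0
    System.out.println(cosine(d1, d3));  // orthogonal vectors: 0.0
  }
}
```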

Dr. E. Garcia has a very informative tutorial on Cosine Similarity that explains the formula above in much more detail (and with more humor).

CosineSimilarity.java

The code below computes the cosine similarity based on the formula above. Here, normF() is the Frobenius norm, the square root of the sum of the squares of all elements, and norm1() is the maximum column sum, which works for me because we are computing it off a single-column matrix.

```
// Source: src/main/java/com/mycompany/myapp/similarity/CosineSimilarity.java
package com.mycompany.myapp.similarity;

import Jama.Matrix;

public class CosineSimilarity extends AbstractSimilarity {

  @Override
  protected double computeSimilarity(Matrix sourceDoc, Matrix targetDoc) {
    double dotProduct = sourceDoc.arrayTimes(targetDoc).norm1();
    double euclideanDist = sourceDoc.normF() * targetDoc.normF();
    return dotProduct / euclideanDist;
  }
}
```
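If the norm1()/normF() shortcut seems suspicious, here is a quick standalone check (plain arrays stand in for Jama's single-column Matrix, an assumption of this sketch): for a one-column matrix, the maximum column sum is just the sum of the entries, and the Frobenius norm is the square root of the sum of the squared entries.

```java
// Sanity-check the two norms used above, for a single-column "matrix".
public class NormCheck {

  // norm1 of a one-column matrix is its (only) column sum
  static double norm1(double[] col) {
    double sum = 0.0;
    for (double v : col) sum += Math.abs(v);
    return sum;
  }

  // normF is the square root of the sum of squared elements,
  // i.e. the Euclidean length of the column vector
  static double normF(double[] col) {
    double sum = 0.0;
    for (double v : col) sum += v * v;
    return Math.sqrt(sum);
  }

  public static void main(String[] args) {
    double[] col = {3.0, 4.0};
    System.out.println(norm1(col));  // 7.0
    System.out.println(normF(col));  // 5.0
  }
}
```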

The results of this computation for a term-document matrix normalized with TF/IDF and LSI respectively are shown below. As before, the inter-document similarity is higher for the LSI-normalized matrix.

```
=== Cosine Similarity (TF/IDF) ===
            D1      D2      D3      D4      D5      D6      D7
    D1  1.0000  0.0000  0.0000  0.1316  0.0000  0.0000  0.0000
    D2  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000  0.1680
    D3  0.0000  0.0000  1.0000  0.4198  0.0000  0.0000  0.0000
    D4  0.1316  0.0000  0.4198  1.0000  0.0000  0.0000  0.0000
    D5  0.0000  0.0000  0.0000  0.0000  1.0000  0.0000  0.0000
    D6  0.0000  0.0000  0.0000  0.0000  0.0000  1.0000  0.3154
    D7  0.0000  0.1680  0.0000  0.0000  0.0000  0.3154  1.0000

=== Cosine Similarity (LSI) ===
            D1      D2      D3      D4      D5      D6      D7
    D1  1.0000  0.0000  1.0000  1.0000  0.0000  0.0000  0.0000
    D2  0.0000  1.0000  0.0000  0.0000  1.0000  1.0000  1.0000
    D3  1.0000  0.0000  1.0000  1.0000  0.0000  0.0000  0.0000
    D4  1.0000  0.0000  1.0000  1.0000  0.0000  0.0000  0.0000
    D5  0.0000  1.0000  0.0000  0.0000  1.0000  1.0000  1.0000
    D6  0.0000  1.0000  0.0000  0.0000  1.0000  1.0000  1.0000
    D7  0.0000  1.0000  0.0000  0.0000  1.0000  1.0000  1.0000
```

### Searching - Similarity against a Query Vector

Using the similarity vectors above, it is now possible to build a Searcher. The objective is to find the most similar documents that match our query. For this, we will need to build a query vector consisting of the terms in the query, and then compute the similarity of the documents against the query vector. The computation is similar to what we have already done above if we treat the query vector as a specialized instance of a document vector. We use Cosine Similarity for our computations here.

Searcher.java

```
// Source: src/main/java/com/mycompany/myapp/similarity/Searcher.java
package com.mycompany.myapp.similarity;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import Jama.Matrix;

public class Searcher {

  public class SearchResult {
    public String title;
    public double score;
    public SearchResult(String title, double score) {
      this.title = title;
      this.score = score;
    }
  };

  private Matrix termDocumentMatrix;
  private List<String> documents;
  private List<String> terms;
  private AbstractSimilarity similarity;

  public void setTermDocumentMatrix(Matrix termDocumentMatrix) {
    this.termDocumentMatrix = termDocumentMatrix;
  }

  public void setDocuments(String[] documents) {
    this.documents = Arrays.asList(documents);
  }

  public void setTerms(String[] terms) {
    this.terms = Arrays.asList(terms);
  }

  public void setSimilarity(AbstractSimilarity similarity) {
    this.similarity = similarity;
  }

  public List<SearchResult> search(String query) {
    // build up query matrix
    Matrix queryMatrix = getQueryMatrix(query);
    final Map<String,Double> similarityMap = new HashMap<String,Double>();
    for (int i = 0; i < termDocumentMatrix.getColumnDimension(); i++) {
      double sim = similarity.computeSimilarity(queryMatrix,
        termDocumentMatrix.getMatrix(
          0, termDocumentMatrix.getRowDimension() - 1, i, i));
      if (sim > 0.0D) {
        similarityMap.put(documents.get(i), sim);
      }
    }
    return sortByScore(similarityMap);
  }

  private Matrix getQueryMatrix(String query) {
    Matrix queryMatrix = new Matrix(terms.size(), 1, 0.0D);
    String[] queryTerms = query.split("\\s+");
    for (String queryTerm : queryTerms) {
      int termIdx = 0;
      for (String term : terms) {
        if (queryTerm.equalsIgnoreCase(term)) {
          queryMatrix.set(termIdx, 0, 1.0D);
        }
        termIdx++;
      }
    }
    queryMatrix = queryMatrix.times(1 / queryMatrix.norm1());
    return queryMatrix;
  }

  private List<SearchResult> sortByScore(
      final Map<String,Double> similarityMap) {
    List<SearchResult> results = new ArrayList<SearchResult>();
    List<String> docNames = new ArrayList<String>();
    docNames.addAll(similarityMap.keySet());
    Collections.sort(docNames, new Comparator<String>() {
      public int compare(String s1, String s2) {
        return similarityMap.get(s2).compareTo(similarityMap.get(s1));
      }
    });
    for (String docName : docNames) {
      double score = similarityMap.get(docName);
      if (score < 0.00001D) {
        continue;
      }
      results.add(new SearchResult(docName, score));
    }
    return results;
  }
}
```

The results of searching with the query "human computer interface" against similarity matrices built off a TF/IDF and an LSI normalization are shown below. Notice that we get a few more results from an LSI-normalized matrix. I have manually copied the titles over to make the result more readable. I suppose I could make the code do it, and I probably would have, if the code were going to be used by anybody other than myself.

```
Results for query: [human computer interface] using TF/IDF
D1 (score = 0.8431) Human machine interface for computer applications
D4 (score = 0.1881) System and human system engineering testing of EPS

Results for query: [human computer interface] using LSI
D4 (score = 0.2467) System and human system engineering testing of EPS
D3 (score = 0.2467) The EPS user interface management system
D1 (score = 0.2467) Human machine interface for computer applications
```

### Calling the code - the JUnit test

For completeness, as well as to show you how the components above should be called, I show here the JUnit test that was used to run these components and derive results for this post.

```
// Source: src/test/java/com/mycompany/myapp/similarity/SimilarityTest.java
package com.mycompany.myapp.similarity;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

import org.apache.commons.lang.StringUtils;
import org.junit.Before;
import org.junit.Test;
import org.springframework.jdbc.datasource.DriverManagerDataSource;

import Jama.Matrix;

import com.mycompany.myapp.indexers.IdfIndexer;
import com.mycompany.myapp.indexers.LsiIndexer;
import com.mycompany.myapp.indexers.VectorGenerator;
import com.mycompany.myapp.similarity.Searcher.SearchResult;

public class SimilarityTest {

  private VectorGenerator vectorGenerator;

  @Before
  public void setUp() throws Exception {
    vectorGenerator = new VectorGenerator();
    vectorGenerator.setDataSource(new DriverManagerDataSource(
      "com.mysql.jdbc.Driver", "jdbc:mysql://localhost:3306/tmdb",
      "tmdb", "irstuff"));
    Map<String,Reader> documents = new LinkedHashMap<String,Reader>();
    BufferedReader reader = new BufferedReader(
      new FileReader("src/test/resources/data/indexing_sample_data.txt"));
    String line = null;
    while ((line = reader.readLine()) != null) {
      String[] docTitleParts = StringUtils.split(line, ";");
      documents.put(docTitleParts[0], new StringReader(docTitleParts[1]));
    }
    vectorGenerator.generateVector(documents);
  }

  @Test
  public void testJaccardSimilarityWithTfIdfVector() throws Exception {
    IdfIndexer indexer = new IdfIndexer();
    Matrix termDocMatrix = indexer.transform(vectorGenerator.getMatrix());
    JaccardSimilarity jaccardSimilarity = new JaccardSimilarity();
    Matrix similarity = jaccardSimilarity.transform(termDocMatrix);
    prettyPrintMatrix("Jaccard Similarity (TF/IDF)", similarity,
      vectorGenerator.getDocumentNames(),
      new PrintWriter(System.out, true));
  }

  @Test
  public void testJaccardSimilarityWithLsiVector() throws Exception {
    LsiIndexer indexer = new LsiIndexer();
    Matrix termDocMatrix = indexer.transform(vectorGenerator.getMatrix());
    JaccardSimilarity jaccardSimilarity = new JaccardSimilarity();
    Matrix similarity = jaccardSimilarity.transform(termDocMatrix);
    prettyPrintMatrix("Jaccard Similarity (LSI)", similarity,
      vectorGenerator.getDocumentNames(),
      new PrintWriter(System.out, true));
  }

  @Test
  public void testCosineSimilarityWithTfIdfVector() throws Exception {
    IdfIndexer indexer = new IdfIndexer();
    Matrix termDocMatrix = indexer.transform(vectorGenerator.getMatrix());
    CosineSimilarity cosineSimilarity = new CosineSimilarity();
    Matrix similarity = cosineSimilarity.transform(termDocMatrix);
    prettyPrintMatrix("Cosine Similarity (TF/IDF)", similarity,
      vectorGenerator.getDocumentNames(),
      new PrintWriter(System.out, true));
  }

  @Test
  public void testCosineSimilarityWithLsiVector() throws Exception {
    LsiIndexer indexer = new LsiIndexer();
    Matrix termDocMatrix = indexer.transform(vectorGenerator.getMatrix());
    CosineSimilarity cosineSimilarity = new CosineSimilarity();
    Matrix similarity = cosineSimilarity.transform(termDocMatrix);
    prettyPrintMatrix("Cosine Similarity (LSI)", similarity,
      vectorGenerator.getDocumentNames(),
      new PrintWriter(System.out, true));
  }

  @Test
  public void testSearchWithTfIdfVector() throws Exception {
    // generate the term document matrix via the appropriate indexer
    IdfIndexer indexer = new IdfIndexer();
    Matrix termDocMatrix = indexer.transform(vectorGenerator.getMatrix());
    // set up the query
    Searcher searcher = new Searcher();
    searcher.setDocuments(vectorGenerator.getDocumentNames());
    searcher.setTerms(vectorGenerator.getWords());
    searcher.setSimilarity(new CosineSimilarity());
    searcher.setTermDocumentMatrix(termDocMatrix);
    // run the query
    List<SearchResult> results = searcher.search("human computer interface");
    prettyPrintResults("human computer interface", results);
  }

  @Test
  public void testSearchWithLsiVector() throws Exception {
    // generate the term document matrix via the appropriate indexer
    LsiIndexer indexer = new LsiIndexer();
    Matrix termDocMatrix = indexer.transform(vectorGenerator.getMatrix());
    // set up the query
    Searcher searcher = new Searcher();
    searcher.setDocuments(vectorGenerator.getDocumentNames());
    searcher.setTerms(vectorGenerator.getWords());
    searcher.setSimilarity(new CosineSimilarity());
    searcher.setTermDocumentMatrix(termDocMatrix);
    // run the query
    List<SearchResult> results = searcher.search("human computer interface");
    prettyPrintResults("human computer interface", results);
  }

  private void prettyPrintMatrix(String legend, Matrix matrix,
      String[] documentNames, PrintWriter writer) {
    writer.printf("=== %s ===%n", legend);
    writer.printf("%6s", " ");
    for (int i = 0; i < documentNames.length; i++) {
      writer.printf("%8s", documentNames[i]);
    }
    writer.println();
    for (int i = 0; i < documentNames.length; i++) {
      writer.printf("%6s", documentNames[i]);
      for (int j = 0; j < documentNames.length; j++) {
        writer.printf("%8.4f", matrix.get(i, j));
      }
      writer.println();
    }
    writer.flush();
  }

  private void prettyPrintResults(String query, List<SearchResult> results) {
    System.out.printf("Results for query: [%s]%n", query);
    for (SearchResult result : results) {
      System.out.printf("%s (score = %8.4f)%n", result.title, result.score);
    }
  }
}
```

### Conclusion

I hope this post was helpful. I had looked at Term Vectors in Lucene in the past, but the idea of an n-dimensional term space never quite sank in until I started working through the stuff I describe above. I know that the data I am using is toy data, and that the computations described here are too expensive to be practical for real-world collection sizes, but I believe that knowing this stuff helps in building models that are useful and practical.

Update 2009-04-26: In recent posts, I have been building on code written and described in previous posts, so there were (and rightly so) quite a few requests for the code. So I've created a project on Sourceforge to host the code. You will find the complete source code built so far in the project's SVN repository.

## Saturday, September 20, 2008

### IR Math with Java : TF, IDF and LSI

Recently, I started working on a fairly large text-mining project. During that time I have seen several common sense heuristics being designed and applied with very good results (some of them from me :-)), so I think a big part of being an IR (Information Retrieval) programmer is the ability to think quantitatively and be able to model problems in simple mathematical or statistical terms. Unless you are some kind of math genius (which I am not) or already have a background in applied math, it helps to know something about the models that are being used or proposed to solve various classes of problems, in order to have a starting point.

Text Mining Application Programming by Dr. Manu Konchady, provides a lot of the math background I am looking for. The book targets programmers, not mathematicians or scientists, so it's easy to read (for me). It provides lucid explanations (with pseudo-code in the book and Perl code in the author's TextMine project) for basic algorithms used to solve some IR problems. The book doesn't cover advanced approaches, as one of my colleagues pointed out to me, but it provides a good base which one can use to research more advanced approaches.

I've been working through this book, off and on, since I bought it. I learn better by doing, so as I work through a chapter, I try to build the components it describes. I have written about these efforts earlier here and here. In this post, I describe my code for generating various types of "indexes" (really term-document matrices based off a toy collection of documents) based on the algorithms discussed in Chapter 3 of the book.

The book describes three types of indexing approaches - term frequency (TF), inverse document frequency (IDF) and latent semantic indexing (LSI). To compute the frequency matrix, it takes a collection of 7 titles and creates a term-document matrix by tokenizing the titles. The list of 7 document titles is shown below:

```
D1  Human machine interface for computer applications
D2  A survey of user opinion of computer system response time
D3  The EPS user interface management system
D4  System and human system engineering testing of EPS
D5  The generation of random, binary and ordered trees
D6  The intersection graph of paths in trees
D7  Graph minors: A survey
```

### Raw Frequency Extraction

To extract the frequencies, we must first extract content words and phrases from the text. I have described tokenizers and token recognizers in earlier posts. For this work, we create two additional recognizers, a stop word recognizer and a content word recognizer.

New recognizer: StopwordRecognizer.java

This recognizer has its own list of stop words, which is used when it is instantiated with the default (no-args) constructor. It can also be instantiated with a list of custom stopwords (e.g., derived from a custom document collection using Zipf's Law) in case that is desired. It only acts on tokens of TokenType.WORD (so if a word is already classified as an abbreviation or phrase, it will not be touched), and if the token's value is in the stop set, it is marked as TokenType.STOP_WORD.

```
// Source: src/main/java/com/mycompany/myapp/recognizers/StopwordRecognizer.java
package com.mycompany.myapp.recognizers;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.commons.lang.StringUtils;

import com.mycompany.myapp.tokenizers.Token;
import com.mycompany.myapp.tokenizers.TokenType;

/**
 * A recognizer that recognizes common stop words. Special stopwords may
 * be passed in through the non-default constructor.
 */
public class StopwordRecognizer implements IRecognizer {

  // this list is taken from the TextMine project
  private static final String DEFAULT_STOPWORDS =
    "a about add ago after all also an and another any are as at be " +
    "because been before being between big both but by came can come " +
    "could did do does due each else end far few for from get got had " +
    "has have he her here him himself his how if in into is it its " +
    "just let lie like low make many me might more most much must " +
    "my never no nor not now of off old on only or other our out over " +
    "per pre put re said same see she should since so some still such " +
    "take than that the their them then there these they this those " +
    "through to too under up use very via want was way we well were " +
    "what when where which while who will with would yes yet you your";

  private Set<String> stopwords = new HashSet<String>();

  public StopwordRecognizer() {
    super();
  }

  public StopwordRecognizer(String[] stopwords) {
    this.stopwords.addAll(Arrays.asList(stopwords));
  }

  public void init() throws Exception {
    if (stopwords.size() == 0) {
      String[] stopwordArray = StringUtils.split(DEFAULT_STOPWORDS, " ");
      stopwords.addAll(Arrays.asList(stopwordArray));
    }
  }

  public List<Token> recognize(List<Token> tokens) {
    List<Token> recognizedTokens = new ArrayList<Token>();
    for (Token token : tokens) {
      if (token.getType() == TokenType.WORD) {
        if (stopwords.contains(StringUtils.lowerCase(token.getValue()))) {
          token.setType(TokenType.STOP_WORD);
        }
      }
      recognizedTokens.add(token);
    }
    return recognizedTokens;
  }
}
```

New recognizer: ContentWordRecognizer.java

This recognizer identifies nouns, verbs, adjectives and adverbs and marks them as TokenType.CONTENT_WORD. As in the previous recognizer, only tokens of TokenType.WORD are acted on. The part-of-speech recognition is done using the WordNet dictionary, accessed through the MIT Java WordNet Interface.

```
// Source: src/main/java/com/mycompany/myapp/recognizers/ContentWordRecognizer.java
package com.mycompany.myapp.recognizers;

import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import com.mycompany.myapp.tokenizers.Token;
import com.mycompany.myapp.tokenizers.TokenType;

import edu.mit.jwi.Dictionary;
import edu.mit.jwi.IDictionary;
import edu.mit.jwi.item.IIndexWord;
import edu.mit.jwi.item.POS;

/**
 * Recognizes content words (noun, verb, adjective, and adverb) from a
 * List of Token objects. Only TokenType.WORD tokens are considered in
 * this recognizer, and are converted to TokenType.CONTENT_WORD. Words
 * are looked up against the WordNet dictionary.
 */
public class ContentWordRecognizer implements IRecognizer {

  private IDictionary dictionary;
  private List<POS> allowablePartsOfSpeech = Arrays.asList(new POS[] {
    POS.NOUN, POS.VERB, POS.ADJECTIVE, POS.ADVERB});

  public void init() throws Exception {
    this.dictionary = new Dictionary(
      new URL("file", null, "/opt/wordnet-3.0/dict"));
    dictionary.open();
  }

  public List<Token> recognize(List<Token> tokens) {
    List<Token> outputTokens = new ArrayList<Token>();
    for (Token token : tokens) {
      Token outputToken = new Token(token.getValue(), token.getType());
      if (token.getType() == TokenType.WORD) {
        String word = token.getValue();
        for (POS allowablePartOfSpeech : allowablePartsOfSpeech) {
          IIndexWord indexWord =
            dictionary.getIndexWord(word, allowablePartOfSpeech);
          if (indexWord != null) {
            outputToken.setType(TokenType.CONTENT_WORD);
            break;
          }
        }
      }
      outputTokens.add(outputToken);
    }
    return outputTokens;
  }
}
```

Generating the initial vector: VectorGenerator.java

The initial vector is created from a Map of document names to Readers pointing at the titles. This may seem overly complex for our particular situation, where we could have done with a Map<String,String>, but I was going for a more general solution with Map<String,Reader>, where the Reader reads files of content. The code for the VectorGenerator is shown below. It's fairly simple: it tokenizes the titles into words, then passes each word through a chain of recognizers. At the end of it, it extracts only the content words and creates a term-document matrix as shown below:

```
// Source: src/main/java/com/mycompany/myapp/indexers/VectorGenerator.java
package com.mycompany.myapp.indexers;

import java.io.Reader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

import javax.sql.DataSource;

import org.apache.commons.collections15.Bag;
import org.apache.commons.collections15.bag.HashBag;
import org.apache.commons.lang.ArrayUtils;
import org.apache.commons.lang.StringUtils;
import org.springframework.beans.factory.annotation.Required;

import Jama.Matrix;

import com.mycompany.myapp.recognizers.AbbreviationRecognizer;
import com.mycompany.myapp.recognizers.BoundaryRecognizer;
import com.mycompany.myapp.recognizers.ContentWordRecognizer;
import com.mycompany.myapp.recognizers.IRecognizer;
import com.mycompany.myapp.recognizers.PhraseRecognizer;
import com.mycompany.myapp.recognizers.RecognizerChain;
import com.mycompany.myapp.recognizers.StopwordRecognizer;
import com.mycompany.myapp.tokenizers.Token;
import com.mycompany.myapp.tokenizers.TokenType;
import com.mycompany.myapp.tokenizers.WordTokenizer;

/**
 * Generate the word occurrence vector for a document collection.
 */
public class VectorGenerator {

  private DataSource dataSource;
  private Map<Integer,String> wordIdValueMap =
    new HashMap<Integer,String>();
  private Map<Integer,String> documentIdNameMap =
    new HashMap<Integer,String>();
  private Matrix matrix;

  @Required
  public void setDataSource(DataSource dataSource) {
    this.dataSource = dataSource;
  }

  public void generateVector(Map<String,Reader> documents)
      throws Exception {
    Map<String,Bag<String>> documentWordFrequencyMap =
      new HashMap<String,Bag<String>>();
    SortedSet<String> wordSet = new TreeSet<String>();
    Integer docId = 0;
    for (String key : documents.keySet()) {
      String text = getText(documents.get(key));
      Bag<String> wordFrequencies = getWordFrequencies(text);
      wordSet.addAll(wordFrequencies.uniqueSet());
      documentWordFrequencyMap.put(key, wordFrequencies);
      documentIdNameMap.put(docId, key);
      docId++;
    }
    // create a Map of ids to words from the wordSet
    int wordId = 0;
    for (String word : wordSet) {
      wordIdValueMap.put(wordId, word);
      wordId++;
    }
    // we need a documents.keySet().size() x wordSet.size() matrix to
    // hold this info
    int numDocs = documents.keySet().size();
    int numWords = wordSet.size();
    double[][] data = new double[numWords][numDocs];
    for (int i = 0; i < numWords; i++) {
      for (int j = 0; j < numDocs; j++) {
        String docName = documentIdNameMap.get(j);
        Bag<String> wordFrequencies = documentWordFrequencyMap.get(docName);
        String word = wordIdValueMap.get(i);
        int count = wordFrequencies.getCount(word);
        data[i][j] = count;
      }
    }
    matrix = new Matrix(data);
  }

  public Matrix getMatrix() {
    return matrix;
  }

  public String[] getDocumentNames() {
    String[] documentNames =
      new String[documentIdNameMap.keySet().size()];
    for (int i = 0; i < documentNames.length; i++) {
      documentNames[i] = documentIdNameMap.get(i);
    }
    return documentNames;
  }

  public String[] getWords() {
    String[] words = new String[wordIdValueMap.keySet().size()];
    for (int i = 0; i < words.length; i++) {
      String word = wordIdValueMap.get(i);
      if (word.contains("|||")) {
        // phrases are stored with length for other purposes, strip it
        // off for this report.
        word = word.substring(0, word.indexOf("|||"));
      }
      words[i] = word;
    }
    return words;
  }

  private Bag<String> getWordFrequencies(String text) throws Exception {
    Bag<String> wordBag = new HashBag<String>();
    WordTokenizer wordTokenizer = new WordTokenizer();
    wordTokenizer.setText(text);
    List<Token> tokens = new ArrayList<Token>();
    Token token = null;
    while ((token = wordTokenizer.nextToken()) != null) {
      tokens.add(token);
    }
    RecognizerChain recognizerChain = new RecognizerChain(
      Arrays.asList(new IRecognizer[] {
        new BoundaryRecognizer(),
        new AbbreviationRecognizer(dataSource),
        new PhraseRecognizer(dataSource),
        new StopwordRecognizer(),
        new ContentWordRecognizer()
      }));
    recognizerChain.init();
    List<Token> recognizedTokens = recognizerChain.recognize(tokens);
    for (Token recognizedToken : recognizedTokens) {
      if (recognizedToken.getType() == TokenType.ABBREVIATION ||
          recognizedToken.getType() == TokenType.PHRASE ||
          recognizedToken.getType() == TokenType.CONTENT_WORD) {
        // lowercase words to treat Human and human as the same word
        wordBag.add(StringUtils.lowerCase(recognizedToken.getValue()));
      }
    }
    return wordBag;
  }

  private String getText(Reader reader) throws Exception {
    StringBuilder textBuilder = new StringBuilder();
    char[] cbuf = new char[1024];
    int len = 0;
    while ((len = reader.read(cbuf, 0, 1024)) != -1) {
      textBuilder.append(ArrayUtils.subarray(cbuf, 0, len));
    }
    reader.close();
    return textBuilder.toString();
  }
}
```

The test case (to run each example) consists of a single JUnit test class (shown below), which simply instantiates and runs each "indexer" implementation.

```
// Source: src/test/java/com/mycompany/myapp/indexers/IndexersTest.java
package com.mycompany.myapp.indexers;

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.commons.lang.StringUtils;
import org.junit.Before;
import org.junit.Test;
import org.springframework.jdbc.datasource.DriverManagerDataSource;

import Jama.Matrix;

public class IndexersTest {

  private VectorGenerator vectorGenerator;
  private Map<String,Reader> documents;

  @Before
  public void setUp() throws Exception {
    vectorGenerator = new VectorGenerator();
    vectorGenerator.setDataSource(new DriverManagerDataSource(
      "com.mysql.jdbc.Driver", "jdbc:mysql://localhost:3306/tmdb",
      "tmdb", "irstuff"));
    documents = new LinkedHashMap<String,Reader>();
    BufferedReader reader = new BufferedReader(
      new FileReader("src/test/resources/data/indexing_sample_data.txt"));
    String line = null;
    while ((line = reader.readLine()) != null) {
      String[] docTitleParts = StringUtils.split(line, ";");
      documents.put(docTitleParts[0], new StringReader(docTitleParts[1]));
    }
  }

  @Test
  public void testVectorGeneration() throws Exception {
    vectorGenerator.generateVector(documents);
    prettyPrintMatrix("Raw Term Frequencies", vectorGenerator.getMatrix(),
      vectorGenerator.getDocumentNames(), vectorGenerator.getWords(),
      new PrintWriter(System.out, true));
  }

  @Test
  public void testTfIndexer() throws Exception {
    vectorGenerator.generateVector(documents);
    TfIndexer indexer = new TfIndexer();
    Matrix tfMatrix = indexer.transform(vectorGenerator.getMatrix());
    prettyPrintMatrix("Term Frequency", tfMatrix,
      vectorGenerator.getDocumentNames(), vectorGenerator.getWords(),
      new PrintWriter(System.out, true));
  }

  @Test
  public void testIdfIndexer() throws Exception {
    vectorGenerator.generateVector(documents);
    IdfIndexer indexer = new IdfIndexer();
    Matrix idfMatrix = indexer.transform(vectorGenerator.getMatrix());
    prettyPrintMatrix("Inverse Document Frequency", idfMatrix,
      vectorGenerator.getDocumentNames(), vectorGenerator.getWords(),
      new PrintWriter(System.out, true));
  }

  @Test
  public void testLsiIndexer() throws Exception {
    vectorGenerator.generateVector(documents);
    LsiIndexer indexer = new LsiIndexer();
    Matrix lsiMatrix = indexer.transform(vectorGenerator.getMatrix());
    prettyPrintMatrix("Latent Semantic (LSI)", lsiMatrix,
      vectorGenerator.getDocumentNames(), vectorGenerator.getWords(),
      new PrintWriter(System.out, true));
  }

  private void prettyPrintMatrix(String legend, Matrix matrix,
      String[] documentNames, String[] words, PrintWriter writer) {
    writer.printf("=== %s ===%n", legend);
    writer.printf("%15s", " ");
    for (int i = 0; i < documentNames.length; i++) {
      writer.printf("%8s", documentNames[i]);
    }
    writer.println();
    for (int i = 0; i < words.length; i++) {
      writer.printf("%15s", words[i]);
      for (int j = 0; j < documentNames.length; j++) {
        writer.printf("%8.4f", matrix.get(i, j));
      }
      writer.println();
    }
    writer.flush();
  }
}
```

The first test in the JUnit test class prints our initial raw matrix. The other tests regenerate it (by calling generateVector() again at the top of each test) and transform it in different ways. Here is what the raw matrix looks like:

```
=== Raw Term Frequencies ===
                     D1      D2      D3      D4      D5      D6      D7
         binary  0.0000  0.0000  0.0000  0.0000  1.0000  0.0000  0.0000
       computer  1.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
computer system  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000  0.0000
    engineering  0.0000  0.0000  0.0000  1.0000  0.0000  0.0000  0.0000
            eps  0.0000  0.0000  1.0000  1.0000  0.0000  0.0000  0.0000
     generation  0.0000  0.0000  0.0000  0.0000  1.0000  0.0000  0.0000
          graph  0.0000  0.0000  0.0000  0.0000  0.0000  1.0000  1.0000
          human  1.0000  0.0000  0.0000  1.0000  0.0000  0.0000  0.0000
      interface  1.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
   intersection  0.0000  0.0000  0.0000  0.0000  0.0000  1.0000  0.0000
        machine  1.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
     management  0.0000  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000
         minors  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  1.0000
        opinion  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000  0.0000
        ordered  0.0000  0.0000  0.0000  0.0000  1.0000  0.0000  0.0000
         random  0.0000  0.0000  0.0000  0.0000  1.0000  0.0000  0.0000
       response  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000  0.0000
         survey  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000  1.0000
         system  0.0000  0.0000  1.0000  2.0000  0.0000  0.0000  0.0000
        testing  0.0000  0.0000  0.0000  1.0000  0.0000  0.0000  0.0000
           time  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000  0.0000
           user  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000  0.0000
 user interface  0.0000  0.0000  1.0000  0.0000  0.0000  0.0000  0.0000
```
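The counting behind this matrix can be illustrated with a tiny standalone sketch. Note that this is not the VectorGenerator code: it uses a plain Map instead of a Bag, and it skips the recognizer chain entirely (so no stopword removal, phrase detection, or abbreviation handling); the `countWords` helper name is mine.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class WordCountDemo {

  // Lowercase and count whitespace-separated tokens. A simplified
  // stand-in for VectorGenerator's Bag-based frequency counting.
  public static Map<String,Integer> countWords(String text) {
    Map<String,Integer> counts = new LinkedHashMap<String,Integer>();
    for (String token : text.toLowerCase().split("\\s+")) {
      Integer n = counts.get(token);
      counts.put(token, n == null ? 1 : n + 1);
    }
    return counts;
  }

  public static void main(String[] args) {
    // D4: "System and human system engineering testing of EPS"
    Map<String,Integer> counts =
        countWords("System and human system engineering testing of EPS");
    System.out.println(counts.get("system")); // 2, matching the raw matrix
  }
}
```

Because the sketch has no stopword recognizer, it would also count words like "and" and "of", which the real pipeline discards; but the count of 2 for "system" in D4 matches the raw matrix above.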

### Term Frequency Indexing

The term frequency indexing method is the simplest: all it does is normalize the raw frequencies within a single document. Thus, if a document had two words, one occurring twice and the other occurring three times, the first word would be normalized to 2/5 (0.4) and the other to 3/5 (0.6). We use Jama, a Java linear algebra library, because of its ability to do SVD (more on that later). Here is the code:

```
// Source: src/main/java/com/mycompany/myapp/indexers/TfIndexer.java
package com.mycompany.myapp.indexers;

import org.apache.commons.collections15.Transformer;

import Jama.Matrix;

/**
 * Normalizes the occurrence matrix per document. Divides each entry by
 * the sum of occurrence values for that column. That way a longer
 * document which has 2 occurrences of a certain word will not be ranked
 * higher than a shorter document with 1 occurrence of the word for that
 * word. At the end of this transformation, the values are the frequency
 * of the word in the document.
 */
public class TfIndexer implements Transformer<Matrix,Matrix> {

  public Matrix transform(Matrix matrix) {
    for (int j = 0; j < matrix.getColumnDimension(); j++) {
      double sum = sum(matrix.getMatrix(
        0, matrix.getRowDimension() - 1, j, j));
      for (int i = 0; i < matrix.getRowDimension(); i++) {
        matrix.set(i, j, (matrix.get(i, j) / sum));
      }
    }
    return matrix;
  }

  private double sum(Matrix colMatrix) {
    double sum = 0.0D;
    for (int i = 0; i < colMatrix.getRowDimension(); i++) {
      sum += colMatrix.get(i, 0);
    }
    return sum;
  }
}
```

The results of this computation are shown below. Notice that each column now adds up to 1, meaning that all documents are treated the same, regardless of their length (and consequently their number of content words).

```
=== Term Frequency ===
                     D1      D2      D3      D4      D5      D6      D7
         binary  0.0000  0.0000  0.0000  0.0000  0.2500  0.0000  0.0000
       computer  0.2500  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
computer system  0.0000  0.1667  0.0000  0.0000  0.0000  0.0000  0.0000
    engineering  0.0000  0.0000  0.0000  0.1667  0.0000  0.0000  0.0000
            eps  0.0000  0.0000  0.2500  0.1667  0.0000  0.0000  0.0000
     generation  0.0000  0.0000  0.0000  0.0000  0.2500  0.0000  0.0000
          graph  0.0000  0.0000  0.0000  0.0000  0.0000  0.5000  0.3333
          human  0.2500  0.0000  0.0000  0.1667  0.0000  0.0000  0.0000
      interface  0.2500  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
   intersection  0.0000  0.0000  0.0000  0.0000  0.0000  0.5000  0.0000
        machine  0.2500  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
     management  0.0000  0.0000  0.2500  0.0000  0.0000  0.0000  0.0000
         minors  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.3333
        opinion  0.0000  0.1667  0.0000  0.0000  0.0000  0.0000  0.0000
        ordered  0.0000  0.0000  0.0000  0.0000  0.2500  0.0000  0.0000
         random  0.0000  0.0000  0.0000  0.0000  0.2500  0.0000  0.0000
       response  0.0000  0.1667  0.0000  0.0000  0.0000  0.0000  0.0000
         survey  0.0000  0.1667  0.0000  0.0000  0.0000  0.0000  0.3333
         system  0.0000  0.0000  0.2500  0.3333  0.0000  0.0000  0.0000
        testing  0.0000  0.0000  0.0000  0.1667  0.0000  0.0000  0.0000
           time  0.0000  0.1667  0.0000  0.0000  0.0000  0.0000  0.0000
           user  0.0000  0.1667  0.0000  0.0000  0.0000  0.0000  0.0000
 user interface  0.0000  0.0000  0.2500  0.0000  0.0000  0.0000  0.0000
```
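To sanity-check a column: D4's non-zero raw counts are (1, 1, 1, 2, 1), summing to 6, so "system" (the 2) becomes 2/6 = 0.3333 as shown above. Here is a minimal plain-array sketch of the per-column normalization — no Jama, and the `normalizeColumn` helper name is mine, not part of TfIndexer:

```java
public class TfDemo {

  // Normalize one document column so its entries sum to 1, the same
  // operation TfIndexer applies to each column of the matrix.
  public static double[] normalizeColumn(double[] col) {
    double sum = 0.0;
    for (double v : col) {
      sum += v;
    }
    double[] normalized = new double[col.length];
    for (int i = 0; i < col.length; i++) {
      normalized[i] = col[i] / sum;
    }
    return normalized;
  }

  public static void main(String[] args) {
    // D4's non-zero raw counts; index 3 is the count of 2 for "system"
    double[] d4 = normalizeColumn(new double[] {1, 1, 1, 2, 1});
    System.out.println(d4[3]); // ~0.3333, as in the output above
  }
}
```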

### Inverse Document Frequency Indexing

Inverse Document Frequency attempts to smooth out the frequency of a word across documents. If a word occurs in many documents, it is a less "precise" discriminator, so its weight should go down. The code below takes the raw matrix, applies IDF in the form of a logarithmic smoothing operator, then normalizes the results, so it is effectively a combination of TF and IDF. The smoothing operator is:

```
weight(i,j) = term_frequency(i,j) * (1 + log(N) - log(d(i)))
where:
  weight(i,j)         = value of the IDF matrix for word i, document j
  term_frequency(i,j) = raw frequency of the word at position (i,j)
  N                   = number of documents
  d(i)                = number of documents containing word i
```
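Plugging our collection's numbers into the smoothing operator shows the effect: with N = 7, a word appearing in a single document gets a factor of 1 + ln(7) ≈ 2.95, while a word appearing in two documents (like "human") gets only about 2.25. A quick sketch, where the `idfFactor` helper name is mine:

```java
public class IdfDemo {

  // The smoothing factor from the formula above: 1 + log(N) - log(d).
  // Java's Math.log is the natural log, as used by the indexer below.
  public static double idfFactor(int numDocs, int docsWithWord) {
    return 1 + Math.log(numDocs) - Math.log(docsWithWord);
  }

  public static void main(String[] args) {
    System.out.println(idfFactor(7, 1)); // ~2.9459: rare word, big boost
    System.out.println(idfFactor(7, 2)); // ~2.2528: more common, smaller boost
  }
}
```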

The indexer code is shown below:

```
// Source: src/main/java/com/mycompany/myapp/indexers/IdfIndexer.java
package com.mycompany.myapp.indexers;

import org.apache.commons.collections15.Transformer;

import Jama.Matrix;

/**
 * Reduces the weight of words which are commonly found (ie in more
 * documents). The factor by which it is reduced is chosen from the book
 * as:
 *   f(m) = 1 + log(N/d(m))
 * where N = total number of docs in collection
 *       d(m) = number of docs containing word m
 * so where a word is more frequent (ie d(m) is high), f(m) is low.
 */
public class IdfIndexer implements Transformer<Matrix,Matrix> {

  public Matrix transform(Matrix matrix) {
    // Phase 1: apply IDF weight to the raw word frequencies
    int n = matrix.getColumnDimension();
    for (int j = 0; j < matrix.getColumnDimension(); j++) {
      for (int i = 0; i < matrix.getRowDimension(); i++) {
        double matrixElement = matrix.get(i, j);
        if (matrixElement > 0.0D) {
          double dm = countDocsWithWord(matrix.getMatrix(
            i, i, 0, matrix.getColumnDimension() - 1));
          matrix.set(i, j, matrix.get(i, j) *
            (1 + Math.log(n) - Math.log(dm)));
        }
      }
    }
    // Phase 2: normalize the word scores for a single document
    for (int j = 0; j < matrix.getColumnDimension(); j++) {
      double sum = sum(matrix.getMatrix(
        0, matrix.getRowDimension() - 1, j, j));
      for (int i = 0; i < matrix.getRowDimension(); i++) {
        matrix.set(i, j, (matrix.get(i, j) / sum));
      }
    }
    return matrix;
  }

  private double sum(Matrix colMatrix) {
    double sum = 0.0D;
    for (int i = 0; i < colMatrix.getRowDimension(); i++) {
      sum += colMatrix.get(i, 0);
    }
    return sum;
  }

  private double countDocsWithWord(Matrix rowMatrix) {
    double numDocs = 0.0D;
    for (int j = 0; j < rowMatrix.getColumnDimension(); j++) {
      if (rowMatrix.get(0, j) > 0.0D) {
        numDocs++;
      }
    }
    return numDocs;
  }
}
```

The resulting matrix after IDF and normalization is applied is shown below. Notice that the scores for words (such as human) which occur in more than one document have decreased.

```
=== Inverse Document Frequency ===
                     D1      D2      D3      D4      D5      D6      D7
         binary  0.0000  0.0000  0.0000  0.0000  0.2500  0.0000  0.0000
       computer  0.2656  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
computer system  0.0000  0.1735  0.0000  0.0000  0.0000  0.0000  0.0000
    engineering  0.0000  0.0000  0.0000  0.1977  0.0000  0.0000  0.0000
            eps  0.0000  0.0000  0.2167  0.1512  0.0000  0.0000  0.0000
     generation  0.0000  0.0000  0.0000  0.0000  0.2500  0.0000  0.0000
          graph  0.0000  0.0000  0.0000  0.0000  0.0000  0.4333  0.3023
          human  0.2031  0.0000  0.0000  0.1512  0.0000  0.0000  0.0000
      interface  0.2656  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
   intersection  0.0000  0.0000  0.0000  0.0000  0.0000  0.5667  0.0000
        machine  0.2656  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
     management  0.0000  0.0000  0.2833  0.0000  0.0000  0.0000  0.0000
         minors  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.3953
        opinion  0.0000  0.1735  0.0000  0.0000  0.0000  0.0000  0.0000
        ordered  0.0000  0.0000  0.0000  0.0000  0.2500  0.0000  0.0000
         random  0.0000  0.0000  0.0000  0.0000  0.2500  0.0000  0.0000
       response  0.0000  0.1735  0.0000  0.0000  0.0000  0.0000  0.0000
         survey  0.0000  0.1327  0.0000  0.0000  0.0000  0.0000  0.3023
         system  0.0000  0.0000  0.2167  0.3023  0.0000  0.0000  0.0000
        testing  0.0000  0.0000  0.0000  0.1977  0.0000  0.0000  0.0000
           time  0.0000  0.1735  0.0000  0.0000  0.0000  0.0000  0.0000
           user  0.0000  0.1735  0.0000  0.0000  0.0000  0.0000  0.0000
 user interface  0.0000  0.0000  0.2833  0.0000  0.0000  0.0000  0.0000
```

### Latent Semantic Indexing (LSI)

Latent Semantic Indexing attempts to uncover latent relationships among documents based on word co-occurrence. So if document A contains (w1,w2) and document B contains (w2,w3), we can conclude that there is something common between documents A and B. LSI does this by decomposing the input raw term frequency matrix (A, see below) into three matrices (U, S and V) using Singular Value Decomposition (SVD). The three matrices are then "reduced" and the original matrix rebuilt from the reduced ones. Because of the reduction, noisy relationships are suppressed and the remaining relations become clearly visible. In pseudo-code:

```
A  = U * S * V^T
Ak = Uk * Sk * Vk^T
where:
  A  = the original matrix
  U  = the word (left singular) vectors
  S  = the sigma matrix of singular values
  V  = the document (right singular) vectors
  Uk = the reduced word submatrix consisting of columns 0..k-1
  Sk = the reduced sigma submatrix consisting of rows and columns 0..k-1
  Vk = the reduced document submatrix consisting of columns 0..k-1
Note:
  Jama gives you back V, not V^T, so you need to reduce and transpose it
  before you compute Ak.
```
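To see what the truncation buys, here is a toy sketch with plain arrays (no Jama; the matrices are hand-picked so the SVD is exact, and the `times` helper is mine). The matrix A = [[3,2],[2,3]] has singular values 5 and 1; rebuilding with only the top singular value (k = 1) leaves the dominant structure and discards the "noise" carried by the small singular value:

```java
public class SvdTruncationDemo {

  // Plain-array matrix multiply, for illustration only.
  public static double[][] times(double[][] a, double[][] b) {
    double[][] c = new double[a.length][b[0].length];
    for (int i = 0; i < a.length; i++) {
      for (int j = 0; j < b[0].length; j++) {
        for (int k = 0; k < b.length; k++) {
          c[i][j] += a[i][k] * b[k][j];
        }
      }
    }
    return c;
  }

  public static void main(String[] args) {
    // Exact SVD of A = [[3,2],[2,3]]: orthonormal U and V^T built from
    // 1/sqrt(2), singular values 5 and 1 on the diagonal of S.
    double r = Math.sqrt(0.5);
    double[][] u = {{r, r}, {r, -r}};
    double[][] vt = {{r, r}, {r, -r}};
    // Truncate to k = 1 by zeroing all but the largest singular value,
    // then rebuild Ak = Uk * Sk * Vk^T.
    double[][] sk = {{5, 0}, {0, 0}};
    double[][] ak = times(times(u, sk), vt);
    // Every entry of Ak is now ~2.5 = 5/2: only the dominant
    // rank-1 structure of A survives the truncation.
    System.out.println(ak[0][0]); // ~2.5
  }
}
```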

Dr E Garcia used to have a really good tutorial on LSI/SVD which is sadly no longer available. However, the IR Book has a chapter dedicated to this. Thanks to ndk for suggesting this link.

As mentioned before, Jama was chosen because it was the only free Java library I knew of that could do SVD. The code for the LSI indexer is here:

```
// Source: src/main/java/com/mycompany/myapp/indexers/LsiIndexer.java
package com.mycompany.myapp.indexers;

import org.apache.commons.collections15.Transformer;

import Jama.Matrix;
import Jama.SingularValueDecomposition;

/**
 * Uses Latent Semantic Indexing to find word associations between docs.
 * Idea is to find the intersections of the words found in each document
 * and score them accordingly. We use SVD to accomplish this. We first
 * decompose the word frequency matrix into its three components, then
 * multiply the reduced components back to get our transformed matrix.
 */
public class LsiIndexer implements Transformer<Matrix,Matrix> {

  public Matrix transform(Matrix matrix) {
    // Phase 1: Singular value decomposition
    SingularValueDecomposition svd =
      new SingularValueDecomposition(matrix);
    Matrix wordVector = svd.getU();
    Matrix sigma = svd.getS();
    Matrix documentVector = svd.getV();
    // compute the value of k (ie where to truncate)
    int k = (int) Math.floor(Math.sqrt(matrix.getColumnDimension()));
    Matrix reducedWordVector = wordVector.getMatrix(
      0, wordVector.getRowDimension() - 1, 0, k - 1);
    Matrix reducedSigma = sigma.getMatrix(0, k - 1, 0, k - 1);
    Matrix reducedDocumentVector = documentVector.getMatrix(
      0, documentVector.getRowDimension() - 1, 0, k - 1);
    Matrix weights = reducedWordVector.times(
      reducedSigma).times(reducedDocumentVector.transpose());
    // Phase 2: normalize the word scores for a single document
    for (int j = 0; j < weights.getColumnDimension(); j++) {
      double sum = sum(weights.getMatrix(
        0, weights.getRowDimension() - 1, j, j));
      for (int i = 0; i < weights.getRowDimension(); i++) {
        weights.set(i, j, Math.abs(weights.get(i, j) / sum));
      }
    }
    return weights;
  }

  private double sum(Matrix colMatrix) {
    double sum = 0.0D;
    for (int i = 0; i < colMatrix.getRowDimension(); i++) {
      sum += colMatrix.get(i, 0);
    }
    return sum;
  }
}
```

And here is the output from the indexer. First the raw frequencies go through singular value decomposition, reduction and recomposition; then they are normalized per document. Notice that there are now many more non-zero elements, representing the latent "relationship" weights.

```
=== Latent Semantic (LSI) ===
                     D1      D2      D3      D4      D5      D6      D7
         binary  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
       computer  0.0198  0.0000  0.0198  0.0198  0.0000  0.0000  0.0000
computer system  0.0000  0.1405  0.0000  0.0000  0.1405  0.1405  0.1405
    engineering  0.1138  0.0000  0.1138  0.1138  0.0000  0.0000  0.0000
            eps  0.1733  0.0000  0.1733  0.1733  0.0000  0.0000  0.0000
     generation  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
          graph  0.0000  0.0559  0.0000  0.0000  0.0559  0.0559  0.0559
          human  0.1336  0.0000  0.1336  0.1336  0.0000  0.0000  0.0000
      interface  0.0198  0.0000  0.0198  0.0198  0.0000  0.0000  0.0000
   intersection  0.0000  0.0105  0.0000  0.0000  0.0105  0.0105  0.0105
        machine  0.0198  0.0000  0.0198  0.0198  0.0000  0.0000  0.0000
     management  0.0595  0.0000  0.0595  0.0595  0.0000  0.0000  0.0000
         minors  0.0000  0.0454  0.0000  0.0000  0.0454  0.0454  0.0454
        opinion  0.0000  0.1405  0.0000  0.0000  0.1405  0.1405  0.1405
        ordered  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
         random  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000  0.0000
       response  0.0000  0.1405  0.0000  0.0000  0.1405  0.1405  0.1405
         survey  0.0000  0.1859  0.0000  0.0000  0.1859  0.1859  0.1859
         system  0.2871  0.0000  0.2871  0.2871  0.0000  0.0000  0.0000
        testing  0.1138  0.0000  0.1138  0.1138  0.0000  0.0000  0.0000
           time  0.0000  0.1405  0.0000  0.0000  0.1405  0.1405  0.1405
           user  0.0000  0.1405  0.0000  0.0000  0.1405  0.1405  0.1405
 user interface  0.0595  0.0000  0.0595  0.0595  0.0000  0.0000  0.0000

Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 19.458 sec
```

### Conclusion

The first two indexing methods are probably familiar to a lot of people, and the ones in use in most shops (indirectly, via common IR libraries such as Lucene) are very likely quite a bit more advanced than the ones shown here. LSI using SVD was a new approach to me, but it became intuitively obvious once I understood the process. Hopefully this article was able to share some of my new-found insight with you. The Java code for each of these transforms illustrates how easy they are to implement, especially with libraries that do most of the heavy lifting.

Update 2009-04-26: In recent posts, I have been building on code written and described in previous posts, so there were (and rightly so) quite a few requests for the code. So I've created a project on Sourceforge to host the code. You will find the complete source code built so far in the project's SVN repository.