Saturday, March 21, 2009

Cartesian Joining Flat Files with Python Generators

I've been noticing that whenever I build something that does an interesting transformation (can't go into more detail than that, sorry) on some piece(s) of data, I inevitably get requests to do a similar transformation on data from a completely different source. The inputs are provided as large CSV text dumps from database tables, and are almost always in formats other than what my original program expects, since they contain key identification data that allows the output to be uploaded back to the database.

While I am not complaining (it's definitely good to have one's work be considered useful by one's colleagues), it becomes tedious to write wrappers that take these different formats, extract the information I need to process, and write the output back with the key identification data preserved. So for quite a while now, I have been toying with the idea of building a wrapper that allows me to do SQL-like transformations on flat files.

In this post, I show some Python code that does Cartesian joins to denormalize an arbitrary number of flat files into a single one. Conceptually, what I am after is something along these lines:

select *
from table_a a, table_b b, table_c c, ...
where a.col1 = 'foo'
and b.col2 = 'bar'

The idea is to denormalize the tables into one flat file. I could do it fairly simply by nesting for loops, something like this (pseudo-code):

for row_a in table_a:
  for row_b in table_b:
    for row_c in table_c:
      print sepchar.join([row_a, row_b, row_c])

However, I don't know the number of flat files I will be denormalizing until runtime, so the above approach won't work.

Another issue is the size of the files. They are expected to be large enough that sucking them into in-memory lists is not practical. Python supports generators, which let a function hand a line back to its caller via the yield keyword and then resume where it left off on the next call, without the caller having to save and restore any state. David Beazley has a nice presentation (PDF) along with some code samples that explains this in more detail. So I built my own FileIterator object that produces an infinite generator over a flat file (i.e., if it reaches EOF, it reopens the file and goes back to returning lines).
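
To make the yield mechanics concrete, here is a minimal sketch (separate from the FileIterator class below, with an illustrative function name) of a generator that cycles through a file's lines forever; the caller just keeps asking for the next line, and the generator remembers where it left off between calls:

def cycle_lines(filename):
  """Yield lines from filename forever, restarting at EOF."""
  while True:
    infile = open(filename)
    for line in infile:
      yield line.rstrip("\n")
    infile.close()

# usage: /tmp/aa2 is one of the test files shown later in this post
lines = cycle_lines("/tmp/aa2")
print lines.next()  # b1
print lines.next()  # b2 - no state kept by the caller
print lines.next()  # b3
print lines.next()  # b1 again, the generator reopened the file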

Here is the code. This Python recipe has some code which looks like it does something similar, but I ended up rolling my own because my needs were slightly different.

#!/usr/bin/python
# Computes a projection over an arbitrary number of comma-separated
# text files.

class FileIterator():
  """
  A generator which recurses infinitely through a flat file. If EOF
  is reached, then it reopens the file and starts again from the
  beginning. All generators but the first one will be closed and
  reopened. The first one will terminate on EOF because otherwise
  this program will run in an infinite loop.
  """
  def __init__(self, filename):
    self.filename = filename
    self.infile = open(filename, 'rb')

  def nextLine(self):
    while (True):
      try:
        line = self.infile.readline()[:-1]
      except ValueError:
        self.infile = open(self.filename, 'rb')
        continue
      if (line == ''):
        break
      yield line
    self.infile.close()


def project(filenames):
  """
  Delegates to the recursive _project method (see last line)
  """
  def _project(fileIterators, sepchar="|", fline=None):
    """
    Private recursive method to iterate through the generator outputs
    in reverse and build up the final results.
    """
    last = fileIterators[-1]
    for line in last.nextLine():
      if (len(fileIterators) == 1):
        print sepchar.join([line, fline])
      else:
        rest = fileIterators[:-1]
        if (fline is None):
          _project(rest, sepchar, line)
        else:
          _project(rest, sepchar, sepchar.join([line, fline]))
  _project(map(lambda f: FileIterator(f), filenames))

def main():
  """
  Test code
  """
  filenames = ["/tmp/aa1", "/tmp/aa2", "/tmp/aa3", "/tmp/aa4"]
  project(filenames)

if __name__ == "__main__":
  main()

The main workhorse here is the project function, which delegates to a private _project function that does the actual recursion.

Here are the input files I used for testing, along with some lines from the output, which should help you figure out how the program works.

file: /tmp/aa1
a1
a2
a3
file: /tmp/aa2
b1
b2
b3
file: /tmp/aa3
c1
c2
c3
file: /tmp/aa4
d1
d2

Which produces the following output:

a1|b1|c1|d1
a2|b1|c1|d1
a3|b1|c1|d1
a1|b2|c1|d1
a2|b2|c1|d1
a3|b2|c1|d1
a1|b3|c1|d1
a2|b3|c1|d1
a3|b3|c1|d1
...

As you can see, the last file in the array is iterated in the outermost loop and used as a pivot for the previous file, and so on, until there are no more files left, at which point the joined line is printed. With the test data above, the output contains 3 × 3 × 3 × 2 = 54 lines.

Friday, March 13, 2009

Vector Space Classifier using Lucene

I have only recently started playing with Lucene's term API, and it looks really useful. Over the past year, I have tried to go through and understand the ideas presented in the TMAP book, and in the process I have built up a small set of tools to tokenize text, create and normalize term-document matrices, etc. Lucene provides some of this functionality through its term API, but in a more memory-efficient way. I was kind of aware of the term API, but I was too busy learning the basics of IR to worry too much about it, so I never took the time to figure it out earlier.

Anyway, I've been playing with classification lately, as you can see from my previous post. This week, I try out another popular classification approach based on the Term Vector Space Model. The idea is to compute the position in term space of the "average" or centroid document for each category, and then to find how "close" the target document is to each of these centroids. The closest centroid wins, i.e. the document is assigned to that centroid's category.

Training

The classifier is trained with a pre-classified corpus of documents. Each document's term vector is computed and, based on the document's category, added to a Term-Document (TD) matrix for that category. Once all documents are read, the centroid document for each category's set of documents is calculated.

During the centroid calculation, we normalize each matrix using TF-IDF and then calculate the centroid of the documents in the matrix. The centroid is basically the average of the document vectors, computed term by term. If you think of a column of the TD matrix as representing a single document, then the elements of that column can be read as a coordinate that places the document in n-dimensional space, where n is the number of terms (rows) in our TD matrix. A synthetic document whose term coordinates are the averages across each row is therefore the centroid of the documents in that category.
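
To make the centroid idea concrete, here is a rough sketch in plain Python (made-up terms and already-normalized weights, nothing to do with the commons-math code further down):

# rows are terms, columns are documents in one category
td_matrix = [
  [0.5, 0.3, 0.4],  # weights of term "cocoa" in 3 documents
  [0.1, 0.4, 0.2],  # weights of term "price"
  [0.4, 0.3, 0.4],  # weights of term "export"
]
num_docs = len(td_matrix[0])
# the centroid is the term-by-term average over the documents,
# ie a single synthetic "average document" for the category
centroid = [sum(row) / float(num_docs) for row in td_matrix]
print centroid  # [0.4, 0.2333..., 0.3666...]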

In my example, the centroids are stored as in-memory member variables of the classifier, which can be accessed during the classification phase via an accessor. Another data structure is the term-to-position map, also created as a side effect of the training phase and accessible via an accessor. In real-world systems, you may want to train the classifier once and then reuse it many times over different documents, possibly over a period of days or months, so it's probably better to store this data in a database table or some other persistent medium. If you go the database table route, you can coalesce the two data structures needed by the classify method into a single table by keying the centroid coordinates off the term itself. I haven't done this because I am lazy, so you are stuck with handing the two data structures to the classify method at the moment.
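
If you do go the single-table route, the coalesced structure could look something like this hypothetical sketch (made-up terms and weights, keyed by term exactly as described above):

# one entry per term, holding that term's centroid weight in each category;
# a table with columns (term, category, weight) holds the same information
centroid_weights = {
  "cocoa":  {"cocoa": 0.40, "coffee": 0.02, "sugar": 0.01},
  "export": {"cocoa": 0.15, "coffee": 0.12, "sugar": 0.10},
}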

Classification

The classification process takes a body of text and the two data structures, creates an in-memory Lucene index from the text, and builds a document matrix out of the term vectors. As in the training phase, we pass the raw frequencies through our TF-IDF indexers. Similarities are then calculated between this document matrix and the centroid matrix for each category. The category whose centroid is most similar to the document matrix is assigned to the document. The default similarity implementation used in this classifier is Cosine Similarity.
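
Stripped of the Lucene and matrix machinery, the classification step boils down to the following toy sketch (plain Python with made-up vectors, just to show the pick-the-most-similar-centroid logic; the real implementation below works on the Lucene term vectors):

import math

def cosine(a, b):
  dot = sum(x * y for x, y in zip(a, b))
  norm_a = math.sqrt(sum(x * x for x in a))
  norm_b = math.sqrt(sum(x * x for x in b))
  return dot / (norm_a * norm_b)

# centroids computed during training, one vector per category
centroids = {
  "cocoa":  [0.40, 0.23, 0.37],
  "coffee": [0.05, 0.60, 0.35],
}
# normalized term vector of the document to classify
doc_vector = [0.45, 0.20, 0.35]

scores = dict((cat, cosine(doc_vector, c)) for cat, c in centroids.items())
best_category = max(scores, key=scores.get)
print best_category, scores  # the closest centroid wins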

Notice that unlike the Naive Bayes classifier, this classifier is not binary. You can use the cosine similarity measure to find the best matching category for a document across multiple categories. Of course, it doesn't have to be this way: a Naive Bayes classifier can be run multiple times to make it non-binary, and a Vector Space classifier can be trained appropriately to make it binary.

Classifier Code

The code for the classifier is shown below. There are a bunch of setters at the beginning, which allow the caller to configure the classifier. The caller then calls the train() method. Once training is complete, the caller can call the classify() method, which returns a String representing the best category for the document. Another method reports the similarity score for each category for the document, which can be used for debugging. There is some test code further down that illustrates the usage.

// Source: src/main/java/com/mycompany/myapp/classifiers/LuceneVectorSpaceModelClassifier.java
package com.mycompany.myapp.classifiers;

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.commons.collections15.Bag;
import org.apache.commons.collections15.Transformer;
import org.apache.commons.collections15.bag.HashBag;
import org.apache.commons.collections15.comparators.ReverseComparator;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.commons.math.linear.RealMatrix;
import org.apache.commons.math.linear.SparseRealMatrix;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.Field.TermVector;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.index.IndexWriter.MaxFieldLength;
import org.apache.lucene.store.RAMDirectory;

import com.mycompany.myapp.clustering.ByValueComparator;
import com.mycompany.myapp.indexers.IdfIndexer;
import com.mycompany.myapp.indexers.TfIndexer;
import com.mycompany.myapp.similarity.AbstractSimilarity;
import com.mycompany.myapp.similarity.CosineSimilarity;

/**
 * Computes the position in term space of the centroid for each 
 * category during the training phase. During the classify phase,
 * the position in term space of the document to be classified is
 * computed and the cosine similarity of this document with each
 * of the centroids is computed. The category of the centroid which
 * is closest to the document is assigned to the document.
 */
public class LuceneVectorSpaceModelClassifier {

  private final Log log = LogFactory.getLog(getClass());
  
  private String indexDir;
  private String categoryFieldName;
  private String bodyFieldName;

  private Analyzer analyzer = new StandardAnalyzer();
  
  @SuppressWarnings("unchecked") 
  private Transformer<RealMatrix,RealMatrix>[] indexers = 
    new Transformer[] {
      new TfIndexer(),
      new IdfIndexer()
  };

  private AbstractSimilarity similarity = new CosineSimilarity();
  
  private Map<String,RealMatrix> centroidMap;
  private Map<String,Integer> termIdMap;
  private Map<String,Double> similarityMap;
  
  /**
   * Set the directory where the Lucene index is located.
   * @param indexDir the index directory.
   */
  public void setIndexDir(String indexDir) {
    this.indexDir = indexDir;
  }
  
  /**
   * Set the name of the Lucene field containing the preclassified category.
   * @param categoryFieldName the category field name.
   */
  public void setCategoryFieldName(String categoryFieldName) {
    this.categoryFieldName = categoryFieldName;
  }
  
  /**
   * Set the name of the Lucene field containing the document body. The
   * document body must have been indexed with TermVector.YES.
   * @param bodyFieldName the name of the document body field.
   */
  public void setBodyFieldName(String bodyFieldName) {
    this.bodyFieldName = bodyFieldName;
  }

  /**
   * The Analyzer used for tokenizing the document body during indexing,
   * and to tokenize the text to be classified. If not specified, the
   * classifier uses the StandardAnalyzer.
   * @param analyzer the analyzer reference.
   */
  public void setAnalyzer(Analyzer analyzer) {
    this.analyzer = analyzer;
  }
  
  /**
   * A transformer chain of indexers (or normalizers) to normalize the
   * document matrices. If not specified, the default is a chain of TF-IDF.
   * @param indexers the normalizer chain.
   */
  public void setIndexers(
      Transformer<RealMatrix,RealMatrix>[] indexers) {
    this.indexers = indexers;
  }
  
  /**
   * The Similarity implementation used to calculate the similarity between
   * the text to be classified and the category centroid document matrices.
   * Uses CosineSimilarity if not specified.
   * @param similarity the similarity to use.
   */
  public void setSimilarity(AbstractSimilarity similarity) {
    this.similarity = similarity;
  }

  /**
   * Implements the logic for training the classifier. The input is a Lucene
   * index of preclassified documents. The classifier is provided the name
   * of the field which contains the document body, as well as the name of
   * the field which contains the category name. Additionally, the document
   * body must have had its Term Vectors computed during indexing (using 
   * TermVector.YES). The train method uses the Term Vectors to compute a
   * geometrical centroid for each category in the index. The set of category
   * names to their associated centroid document matrix is available via the
   * getCentroidMap() method after training is complete.
   * @throws Exception if one is thrown.
   */
  public void train() throws Exception {
    log.info("Classifier training started");
    IndexReader reader = IndexReader.open(indexDir);
    // Set up a data structure for the term versus the row id in the matrix.
    // This is going to be used for looking up the term's row in the matrix.
    this.termIdMap = computeTermIdMap(reader);
    // Initialize the data structures to hold the td matrices for the various
    // categories.
    Bag<String> docsInCategory = computeDocsInCategory(reader);
    Map<String,Integer> currentDocInCategory = 
      new HashMap<String,Integer>();
    Map<String,RealMatrix> categoryTfMap = 
      new HashMap<String,RealMatrix>();
    for (String category : docsInCategory.uniqueSet()) {
      int numDocsInCategory = docsInCategory.getCount(category);
      categoryTfMap.put(category, 
        new SparseRealMatrix(termIdMap.size(), numDocsInCategory));
      currentDocInCategory.put(category, new Integer(0));
    }
    // extract each document body's TermVector into the td matrix for
    // that document's category
    int numDocs = reader.numDocs();
    for (int i = 0; i < numDocs; i++) {
      Document doc = reader.document(i);
      String category = doc.get(categoryFieldName);
      RealMatrix tfMatrix = categoryTfMap.get(category);
      // get the term frequency map 
      TermFreqVector vector = reader.getTermFreqVector(i, bodyFieldName);
      String[] terms = vector.getTerms();
      int[] frequencies = vector.getTermFrequencies();
      for (int j = 0; j < terms.length; j++) {
        int row = termIdMap.get(terms[j]);
        int col = currentDocInCategory.get(category);
        tfMatrix.setEntry(row, col, new Double(frequencies[j]));
      }
      incrementCurrentDoc(currentDocInCategory, category);
    }
    reader.close();
    // compute centroid vectors for each category
    this.centroidMap = new HashMap<String,RealMatrix>();
    for (String category : docsInCategory.uniqueSet()) {
      RealMatrix tdmatrix = categoryTfMap.get(category);
      RealMatrix centroid = computeCentroid(tdmatrix);
      centroidMap.put(category, centroid);
    }
    log.info("Classifier training complete");
  }

  /**
   * Returns the centroid map of category name to TD Matrix containing the
   * centroid document of the category. This data is computed as a side
   * effect of the train() method.
   * @return the centroid map computed from the training.
   */
  public Map<String,RealMatrix> getCentroidMap() {
    return centroidMap;
  }
  
  /**
   * Returns the map of analyzed terms versus their positions in the centroid
   * matrices. The data is computed as a side-effect of the train() method.
   * @return a Map of analyzed terms to their position in the matrix.
   */
  public Map<String,Integer> getTermIdMap() {
    return termIdMap;
  }

  /**
   * Once the classifier is trained using the train() method, it creates a
   * Map of category to associated centroid documents for each category, and
   * a termIdMap, which is a mapping of tokenized terms to its row number in
   * the document matrix for the centroid documents. These two structures are
   * used by the classify method to match up terms from the input text with
   * corresponding terms in the centroids to calculate similarities. Builds 
   * a Map of category names and the similarities of the input text to the
   * centroids in each category as a side effect. Returns the category with
   * the highest similarity score, ie the category this text should be 
   * classified under.
   * @param centroids a Map of category names to centroid document matrices.
   * @param termIdMap a Map of terms to their positions in the document
   *                  matrix.
   * @param text the text to classify.
   * @return the best category for the text.
   * @throws Exception if one is thrown.
   */
  public String classify(Map<String,RealMatrix> centroids, 
      Map<String,Integer> termIdMap, String text) throws Exception {
    RAMDirectory ramdir = new RAMDirectory();
    indexDocument(ramdir, "text", text);
    // now find the (normalized) term frequency vector for this document
    RealMatrix docMatrix = buildMatrixFromIndex(ramdir, "text");
    // compute similarity using passed in Similarity implementation, we
    // use CosineSimilarity by default.
    this.similarityMap = new HashMap<String,Double>();
    for (String category : centroids.keySet()) {
      RealMatrix centroidMatrix = centroids.get(category);
      double sim = similarity.computeSimilarity(docMatrix, centroidMatrix);
      similarityMap.put(category, sim);
    }
    // sort the categories
    List<String> categories = new ArrayList<String>();
    categories.addAll(centroids.keySet());
    Collections.sort(categories, 
      new ReverseComparator<String>(
      new ByValueComparator<String,Double>(similarityMap)));
    // return the best category, the similarity map is also available
    // to the client for debugging or display.
    return categories.get(0);
  }

  /**
   * Returns the map of category to similarity with the document after
   * classification. The similarityMap is computed as a side-effect of
   * the classify() method, so the data is interesting only if this method
   * is called after classify() completes successfully.
   * @return map of category to similarity scores for text to classify.
   */
  public Map<String,Double> getSimilarityMap() {
    return similarityMap;
  }

  /**
   * Loops through the IndexReader's TermEnum enumeration, and creates a Map
   * of term to an integer id. This map is going to be used to assign string
   * terms to specific rows in the Term Document Matrix for each category.
   * @param reader a reference to the IndexReader.
   * @return a Map of terms to their integer ids (0-based).
   * @throws Exception if one is thrown.
   */
  private Map<String, Integer> computeTermIdMap(IndexReader reader) 
      throws Exception {
    Map<String,Integer> termIdMap = 
      new HashMap<String,Integer>();
    int id = 0;
    TermEnum termEnum = reader.terms();
    while (termEnum.next()) {
      String term = termEnum.term().text();
      if (termIdMap.containsKey(term)) {
        continue;
      }
      termIdMap.put(term, id);
      id++;
    }
    return termIdMap;
  }

  /**
   * Loops through the specified IndexReader and returns a Bag of categories
   * and their document counts. We don't use the BitSet/DocIdSet approach 
   * here because we don't know how many categories the training documents
   * have been classified into.
   * @param reader the reference to the IndexReader.
   * @return a Bag of category names and counts.
   * @throws Exception if one is thrown.
   */
  private Bag<String> computeDocsInCategory(IndexReader reader) 
      throws Exception {
    int numDocs = reader.numDocs();
    Bag<String> docsInCategory = new HashBag<String>();
    for (int i = 0; i < numDocs; i++) {
      Document doc = reader.document(i);
      String category = doc.get(categoryFieldName);
      docsInCategory.add(category);
    }
    return docsInCategory;
  }

  /**
   * Increments the counter for the category to point to the next document
   * index. This is used to manage the document index in the td matrix for
   * the category.
   * @param currDocs the Map of category-wise document Id counters.
   * @param category the category whose document-id we want to increment.
   */
  private void incrementCurrentDoc(Map<String,Integer> currDocs, 
      String category) {
    int currentDoc = currDocs.get(category);
    currDocs.put(category, currentDoc + 1);
  }
  
  /**
   * Compute the centroid document from the TD Matrix. Result is a matrix 
   * of term weights but for a single document only.
   * @param tdmatrix
   * @return
   */
  private RealMatrix computeCentroid(RealMatrix tdmatrix) {
    tdmatrix = normalizeWithTfIdf(tdmatrix);
    RealMatrix centroid = 
      new SparseRealMatrix(tdmatrix.getRowDimension(), 1);
    int numDocs = tdmatrix.getColumnDimension();
    int numTerms = tdmatrix.getRowDimension();
    for (int row = 0; row < numTerms; row++) {
      double rowSum = 0.0D;
      for (int col = 0; col < numDocs; col++) {
        rowSum += tdmatrix.getEntry(row, col); 
      }
      centroid.setEntry(row, 0, rowSum / ((double) numDocs));
    }
    return centroid;
  }

  /**
   * Builds an in-memory Lucene index using the text supplied for classification.
   * @param ramdir the RAM Directory reference.
   * @param fieldName the field name to index the text as.
   * @param text the text to index.
   * @throws Exception if one is thrown.
   */
  private void indexDocument(RAMDirectory ramdir, String fieldName, 
      String text) throws Exception {
    IndexWriter writer = 
      new IndexWriter(ramdir, analyzer, MaxFieldLength.UNLIMITED);
    Document doc = new Document();
    doc.add(new Field(
      fieldName, text, Store.YES, Index.ANALYZED, TermVector.YES));
    writer.addDocument(doc);
    writer.commit();
    writer.close();
  }

  /**
   * Given a Lucene index and a field name with pre-computed TermVectors,
   * creates a document matrix of terms. The document matrix is normalized
   * using the specified indexer chain.
   * @param ramdir the RAM Directory reference.
   * @param fieldName the name of the field to build the matrix from. 
   * @return a normalized Document matrix of terms and frequencies.
   * @throws Exception if one is thrown.
   */
  private RealMatrix buildMatrixFromIndex(RAMDirectory ramdir, 
      String fieldName) throws Exception {
    IndexReader reader = IndexReader.open(ramdir);
    TermFreqVector vector = reader.getTermFreqVector(0, fieldName);
    String[] terms = vector.getTerms();
    int[] frequencies = vector.getTermFrequencies();
    RealMatrix docMatrix = new SparseRealMatrix(termIdMap.size(), 1);
    for (int i = 0; i < terms.length; i++) {
      String term = terms[i];
      if (termIdMap.containsKey(term)) {
        int row = termIdMap.get(term);
        docMatrix.setEntry(row, 0, frequencies[i]);
      }
    }
    reader.close();
    // normalize the docMatrix using TF-IDF
    docMatrix = normalizeWithTfIdf(docMatrix);
    return docMatrix;
  }

  /**
   * Pass the input TD Matrix through a chain of transformers to normalize
   * the TD Matrix. Here we do TF-IDF normalization, although it is possible
   * to do other types of normalization (such as LSI) by passing in the
   * appropriate chain of normalizers (or indexers).
   * @param docMatrix the un-normalized TD Matrix.
   * @return the normalized TD Matrix.
   */
  private RealMatrix normalizeWithTfIdf(RealMatrix docMatrix) {
    for (Transformer<RealMatrix,RealMatrix> indexer : indexers) {
      docMatrix = indexer.transform(docMatrix);
    }
    return docMatrix;
  }
}

Related Code

I have reused some code that I had written to support other components developed earlier. When I wrote them, I was using the Jama matrix package. However, I switched sometime late last year to the linear algebra classes in commons-math instead. I started using commons-math in anticipation of being able to use the SparseRealMatrix implementation, which I had suggested and contributed a first cut for, but the 2.0 release is still not out, so it's likely you will have to download and build commons-math from the svn repository if you want to run my code. In each subsection below, I point out where you can get the Jama version if you want it.

TfIndexer

This indexer normalizes each term count by dividing it by the total number of terms in the document. This corrects for document length, so long documents and short ones become comparable. After normalization, each term frequency is a number between 0 and 1, and the term frequencies for a document sum to 1.
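
For example (made-up counts), a document in which a term occurs 3 times out of 5 total term occurrences ends up with a weight of 0.6 for that term:

raw_counts = [3, 1, 1]                 # term counts for one document (one column)
total = float(sum(raw_counts))
print [c / total for c in raw_counts]  # [0.6, 0.2, 0.2], sums to 1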

The Jama version of the code can be found in my post IR Math with Java : TF, IDF and LSI.

// Source: src/main/java/com/mycompany/myapp/indexers/TfIndexer.java
package com.mycompany.myapp.indexers;

import org.apache.commons.collections15.Transformer;
import org.apache.commons.math.linear.RealMatrix;

public class TfIndexer implements Transformer<RealMatrix,RealMatrix> {

  public RealMatrix transform(RealMatrix matrix) {
    for (int j = 0; j < matrix.getColumnDimension(); j++) {
      double sum = sum(matrix.getSubMatrix(
        0, matrix.getRowDimension() -1, j, j));
      for (int i = 0; i < matrix.getRowDimension(); i++) {
        matrix.setEntry(i, j, (matrix.getEntry(i, j) / sum));
      }
    }
    return matrix;
  }

  private double sum(RealMatrix colMatrix) {
    double sum = 0.0D;
    for (int i = 0; i < colMatrix.getRowDimension(); i++) {
      sum += colMatrix.getEntry(i, 0);
    }
    return sum;
  }
}

IdfIndexer

This transformation has the effect of reducing the relative weight of words that are commonly found in the document set. Each raw frequency of term w is multiplied by a weight fw, which is small for common words and large for rare ones:

  fw = 1 + log(N/nw)
  where:
    N = total number of documents in the collection
    nw = number of documents containing word w
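
As a quick back-of-the-envelope check (made-up counts, using the natural log like the Java code below does), a word that occurs in half of a 100-document collection gets a much smaller weight than one that occurs in only 2 of them:

import math

N = 100
for nw in (50, 2):
  print nw, 1 + math.log(float(N) / nw)
# 50 -> 1 + log(2)  = ~1.69
# 2  -> 1 + log(50) = ~4.91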

The code is shown below. The Jama version of the code can also be found in my post IR Math with Java : TF, IDF and LSI.

// Source: src/main/java/com/mycompany/myapp/indexers/IdfIndexer.java
package com.mycompany.myapp.indexers;

import org.apache.commons.collections15.Transformer;
import org.apache.commons.math.linear.RealMatrix;

public class IdfIndexer implements Transformer<RealMatrix,RealMatrix> {

  public RealMatrix transform(RealMatrix matrix) {
    // Phase 1: apply IDF weight to the raw word frequencies
    int n = matrix.getColumnDimension();
    for (int j = 0; j < matrix.getColumnDimension(); j++) {
      for (int i = 0; i < matrix.getRowDimension(); i++) {
        double dm = countDocsWithWord(
          matrix.getSubMatrix(i, i, 0, matrix.getColumnDimension() - 1));
        double matrixElement = matrix.getEntry(i, j);
        if (matrixElement > 0.0D) {
          matrix.setEntry(i, j, 
            matrix.getEntry(i,j) * (1 + Math.log(n) - Math.log(dm)));
        }
      }
    }
    // Phase 2: normalize the word scores for a single document
    for (int j = 0; j < matrix.getColumnDimension(); j++) {
      double sum = sum(matrix.getSubMatrix(
        0, matrix.getRowDimension() -1, j, j));
      for (int i = 0; i < matrix.getRowDimension(); i++) {
        matrix.setEntry(i, j, (matrix.getEntry(i, j) / sum));
      }
    }
    return matrix;
  }

  private double sum(RealMatrix colMatrix) {
    double sum = 0.0D;
    for (int i = 0; i < colMatrix.getRowDimension(); i++) {
      sum += colMatrix.getEntry(i, 0);
    }
    return sum;
  }

  private double countDocsWithWord(RealMatrix rowMatrix) {
    double numDocs = 0.0D;
    for (int j = 0; j < rowMatrix.getColumnDimension(); j++) {
      if (rowMatrix.getEntry(0, j) > 0.0D) {
        numDocs++;
      }
    }
    return numDocs;
  }
}

CosineSimilarity

Cosine Similarity calculates the cosine of the angle between the lines joining the origin of the term space to each document's position. The higher the value of the cosine, the smaller the angle between the two lines, and hence the more similar the documents. Cosine Similarity is calculated as:

  cos θ = (A • B) / (|A| × |B|)
  where A = document matrix for the first document,
        B = document matrix for the second document.

The code for the CosineSimilarity class is shown below. The Jama version can be found in my post IR Math with Java : Similarity Measures.

// Source: src/main/java/com/mycompany/myapp/similarity/CosineSimilarity.java
package com.mycompany.myapp.similarity;

import org.apache.commons.math.linear.RealMatrix;
import org.apache.commons.math.linear.SparseRealMatrix;

public class CosineSimilarity extends AbstractSimilarity {

  @Override
  public double computeSimilarity(
      RealMatrix sourceDoc, RealMatrix targetDoc) {
    if (sourceDoc.getRowDimension() != targetDoc.getRowDimension() ||
        sourceDoc.getColumnDimension() != targetDoc.getColumnDimension() ||
        sourceDoc.getColumnDimension() != 1) {
      throw new IllegalArgumentException(
        "Matrices are not column matrices or not of the same size");
    }
    // max col sum, only 1 col, so...
    double dotProduct = dot(sourceDoc, targetDoc);
    // sqrt of sum of squares of all elements, only one col, so...
    double eucledianDist = 
      sourceDoc.getFrobeniusNorm() * targetDoc.getFrobeniusNorm();
    return dotProduct / eucledianDist;
  }
  
  private double dot(RealMatrix source, RealMatrix target) {
    int maxRows = source.getRowDimension();
    int maxCols = source.getColumnDimension();
    RealMatrix dotProduct = new SparseRealMatrix(maxRows, maxCols);
    for (int row = 0; row < maxRows; row++) {
      for (int col = 0; col < maxCols; col++) {
        dotProduct.setEntry(row, col, 
          source.getEntry(row, col) * target.getEntry(row, col));
      }
    }
    return dotProduct.getNorm();
  }
}

Test Code

For the test, we use the same collection of Reuters news items from the TextMine project that was used for testing the Binary Naive Bayes Classifier described in my previous post. The indexing code is pretty much the same, except that we now compute the term vectors of the body field at indexing time. There is a single test, which trains the classifier and then classifies 5 documents. Here is the JUnit test.

// Source: src/test/java/com/mycompany/myapp/classifiers/LuceneVectorSpaceModelClassifierTest.java
package com.mycompany.myapp.classifiers;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.Map;

import org.apache.commons.collections15.Transformer;
import org.apache.commons.io.FileUtils;
import org.apache.commons.lang.StringUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.commons.math.linear.RealMatrix;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.Field.TermVector;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriter.MaxFieldLength;
import org.apache.lucene.store.FSDirectory;
import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.junit.Test;

import com.mycompany.myapp.indexers.IdfIndexer;
import com.mycompany.myapp.indexers.TfIndexer;
import com.mycompany.myapp.similarity.CosineSimilarity;
import com.mycompany.myapp.summarizers.SummaryAnalyzer;

public class LuceneVectorSpaceModelClassifierTest {

  private static final Log log = 
    LogFactory.getLog(LuceneVectorSpaceModelClassifierTest.class);
  
  private static String INPUT_FILE = 
    "src/test/resources/data/sugar-coffee-cocoa-docs.txt";
  private static String INDEX_DIR = "src/test/resources/data/scc-index";
  private static String[] DOCS_TO_CLASSIFY = new String[] {
    "src/test/resources/data/cocoa.txt",
    "src/test/resources/data/cocoa1.txt",
    "src/test/resources/data/cocoa2.txt",
    "src/test/resources/data/coffee.txt",
    "src/test/resources/data/coffee1.txt"
  };

  @BeforeClass
  public static void buildIndex() throws Exception {
    log.debug("Building index...");
    BufferedReader reader = new BufferedReader(new FileReader(INPUT_FILE));
    IndexWriter writer = 
      new IndexWriter(FSDirectory.getDirectory(INDEX_DIR), 
      new SummaryAnalyzer(), MaxFieldLength.UNLIMITED);
    String line = null;
    int lno = 0;
    StringBuilder bodybuf = new StringBuilder();
    String category = null;
    while ((line = reader.readLine()) != null) {
      if (line.endsWith(".sgm")) {
        // header line
        if (lno > 0) {
          // not the very first line, so dump current body buffer and
          // reinit the buffer.
          writeToIndex(writer, category, bodybuf.toString());
          bodybuf = new StringBuilder();
        }
        category = StringUtils.trim(StringUtils.split(line, ":")[1]);
        continue;
      } else {
        // not a header line, accumulate line into bodybuf
        bodybuf.append(line).append(" ");
      }
      lno++;
    }
    // last record
    writeToIndex(writer, category, bodybuf.toString());
    reader.close();
    writer.commit();
    writer.optimize();
    writer.close();
  }

  private static void writeToIndex(IndexWriter writer, String category, 
      String body) throws Exception {
    Document doc = new Document();
    doc.add(new Field("category", category, Store.YES, Index.NOT_ANALYZED));
    doc.add(
      new Field("body", body, Store.YES, Index.ANALYZED, TermVector.YES));
    writer.addDocument(doc);
  }

  @AfterClass
  public static void deleteIndex() throws Exception {
    log.info("Deleting index directory...");
    FileUtils.deleteDirectory(new File(INDEX_DIR));
  }
  
  @Test
  public void testLuceneVectorSpaceModelClassifier() throws Exception {
    LuceneVectorSpaceModelClassifier classifier = 
      new LuceneVectorSpaceModelClassifier();
    // setup
    classifier.setIndexDir(INDEX_DIR);
    classifier.setAnalyzer(new SummaryAnalyzer());
    classifier.setCategoryFieldName("category");
    classifier.setBodyFieldName("body");
    // this is the default but we set it anyway, to illustrate usage
    classifier.setIndexers(new Transformer[] {
      new TfIndexer(),
      new IdfIndexer()
    });
    // this is the default but we set it anyway, to illustrate usage.
    // Similarity need not be set before training, it can be set before
    // the classification step.
    classifier.setSimilarity(new CosineSimilarity());
    // training
    classifier.train();
    // classification
    Map<String,RealMatrix> centroidMap = classifier.getCentroidMap();
    Map<String,Integer> termIdMap = classifier.getTermIdMap();
    String[] categories = centroidMap.keySet().toArray(new String[0]);
    for (String testDoc : DOCS_TO_CLASSIFY) {
      File f = new File(testDoc);
      String category = classifier.classify(centroidMap, termIdMap, 
        FileUtils.readFileToString(f, "UTF-8"));
      System.out.println(">>> " + f.getName() + 
        " => category: " + category);
      Map<String,Double> similarityMap = classifier.getSimilarityMap();
      String[] pairs = new String[categories.length];
      for (int i = 0; i < categories.length; i++) {
         pairs[i] = categories[i] + ":" + similarityMap.get(categories[i]);
      }
      System.out.println("(" + StringUtils.join(pairs, ", ") + ")");
    }
  }
}

Results

Here are the results. It was a bit surprising to see such good results, so I went back and checked the code to see if I was doing something wrong :-). As you can see, it correctly classified all 5 of my documents.

>>> cocoa.txt => category: cocoa
(cocoa:0.7499364961896885, coffee:0.21387426054867117, sugar:0.15213562681433365)
>>> cocoa1.txt => category: cocoa
(cocoa:0.35404965894048845, coffee:0.15006958907480905, sugar:0.14425804054775068)
>>> cocoa2.txt => category: cocoa
(cocoa:0.2993396230523616, coffee:0.1754388455250711, sugar:0.18650205458278443)
>>> coffee.txt => category: coffee
(cocoa:0.18703846088862733, coffee:0.45354676135783173, sugar:0.20549314483406184)
>>> coffee1.txt => category: coffee
(cocoa:0.1436949323744925, coffee:0.3702669738594301, sugar:0.2316259997838632)

Possible Improvements

With the Naive Bayes approach, I had to enable feature selection and use the top √n terms to get it to classify correctly. I had thought of doing something similar here if required, basically using SVD to extract the principal √n components and computing the similarity against those. It would be quite easy to do, simply by setting a different chain of indexers.

Another interesting toolkit to try out for this kind of work is the Semantic Vectors project, which seems quite promising from the little I understand about it. A commenter on a previous related post pointed me to it; now that I've made the leap to using Lucene for the tokenization part, it seems logical to give it a try, something I plan to do next week.

Update 2009-04-26: In recent posts, I have been building on code written and described in previous posts, so there were (and rightly so) quite a few requests for the code. So I've created a project on Sourceforge to host the code. You will find the complete source code built so far in the project's SVN repository.