I have only recently started playing with Lucene's term API, and it looks really useful. Over the past year, I have tried to work through and understand the ideas presented in the TMAP book, and in the process I have built up a small set of tools to tokenize text, create and normalize term-document matrices, etc. Lucene provides some of this functionality through its term API, but in a more memory-efficient way. I was vaguely aware of the term API, but I was too busy learning the basics of IR to worry too much about it, so I never took the time to figure it out until now.
Anyway, I've been playing with classification lately, as you can see from my previous post. This week, I try out another popular classification approach based on the Term Vector Space Model. The idea is to compute the position in term space of the "average" or centroid document for each category, and then to find how "close" the target document is to each of these centroids. The closest centroid wins, i.e., the document is classified under that centroid's category.
Training
The classifier is trained with a pre-classified corpus of documents. Each document's term vectors are computed and, based on the document's category, put into a Term-Document (TD) matrix for that category. Once all the documents are read, the centroid document for each category is calculated.
During the centroid calculation, we normalize each matrix using TF-IDF and then calculate the centroid of the documents in the matrix. A centroid is basically just the average of the document columns in the TD matrix: for each term (row), we take the average of that row's entries. If you think of a column in the TD matrix as representing a single document, then the elements of that column can be read as the coordinates of a point in n-dimensional space, where n is the number of terms (rows) in our TD matrix. Thus a document whose term coordinates are these row-wise averages represents the centroid of the documents in that category.
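To make the centroid calculation concrete, here is a minimal standalone sketch. It uses a plain double[][] in place of the commons-math matrices the actual classifier uses, and skips the TF-IDF normalization step, so it is illustrative only:

```java
// Illustrative sketch only: computes the centroid of a term-document
// matrix represented as double[numTerms][numDocs]. Each row is a term,
// each column is a document; the centroid entry for a term is the
// average of that term's weights across all documents in the category.
public class CentroidSketch {
  static double[] centroid(double[][] tdMatrix) {
    int numTerms = tdMatrix.length;
    int numDocs = tdMatrix[0].length;
    double[] centroid = new double[numTerms];
    for (int row = 0; row < numTerms; row++) {
      double rowSum = 0.0D;
      for (int col = 0; col < numDocs; col++) {
        rowSum += tdMatrix[row][col];
      }
      centroid[row] = rowSum / ((double) numDocs);
    }
    return centroid;
  }
  public static void main(String[] args) {
    // two terms (rows), two documents (columns)
    double[][] td = {{1.0D, 3.0D}, {2.0D, 4.0D}};
    double[] c = centroid(td);
    System.out.println(c[0] + " " + c[1]); // 2.0 3.0
  }
}
```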
In my example, the centroids are stored as in-memory member variables of the classifier, which can be accessed during the classification phase via an accessor. Another data structure is the term to position map, also created as a side effect of the training phase and accessible via an accessor. In real-world systems, you may want to train the classifier once and then reuse it many times over different documents, possibly over a period of days or months, so it's probably better to store this data in a database table or some other persistent medium. If you go the database table route, you can coalesce the two data structures needed by the classify method into a single table by keying the centroid coordinates off the term itself. I haven't done this because I am lazy, so you are stuck with handing the two data structures to the classify method at the moment.
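For illustration, the coalesced structure could look something like this in memory, with the centroid weights keyed off the term itself. The class and method names here are hypothetical, not part of the classifier code:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: the term-to-row-id map and the per-category
// centroid matrices collapsed into a single lookup keyed by the term,
// mirroring the single-table layout suggested in the text.
public class CentroidTableSketch {
  // term -> (category -> centroid weight for that term)
  static Map<String, Map<String, Double>> table =
      new HashMap<String, Map<String, Double>>();

  static void put(String term, String category, double weight) {
    Map<String, Double> weights = table.get(term);
    if (weights == null) {
      weights = new HashMap<String, Double>();
      table.put(term, weights);
    }
    weights.put(category, weight);
  }

  // returns 0.0 for terms/categories the training never saw
  static double get(String term, String category) {
    Map<String, Double> weights = table.get(term);
    if (weights == null || weights.get(category) == null) {
      return 0.0D;
    }
    return weights.get(category);
  }

  public static void main(String[] args) {
    put("bean", "cocoa", 0.75D);
    put("bean", "coffee", 0.21D);
    System.out.println(get("bean", "cocoa")); // 0.75
  }
}
```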
Classification
The classification process takes a body of text and the two data structures, builds an in-memory Lucene index from the text, and creates a document matrix out of the normalized term vectors. As in the training phase, we pass the raw frequencies through our TF-IDF indexers. Similarities are then calculated between this document matrix and the centroid matrix for each category. The category whose centroid matrix has the highest similarity with the document matrix is assigned to the document. The default similarity implementation used in this classifier is Cosine Similarity.
Notice that unlike the Naive Bayes classifier, this classifier is not binary: you can use the cosine similarity measure to find the best matching category for a document across multiple categories. Of course, it doesn't have to be this way; a Naive Bayes classifier can be run multiple times to make it non-binary, and a Vector Space classifier can be trained appropriately to make it binary.
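To make the "closest centroid wins" decision concrete, here is a tiny standalone sketch of cosine similarity over plain double arrays. The actual classifier uses the CosineSimilarity class shown later, which works on commons-math matrices; this is just the same formula in miniature:

```java
// Illustrative sketch only: cosine similarity between two term-weight
// vectors. The classifier picks the category whose centroid yields the
// highest value of this measure against the document vector.
public class CosineSketch {
  static double cosine(double[] a, double[] b) {
    double dot = 0.0D;
    double normA = 0.0D;
    double normB = 0.0D;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }
  public static void main(String[] args) {
    double[] doc = {1.0D, 1.0D};
    double[] centroid = {1.0D, 0.0D};
    // 45 degree angle between the vectors, cos = 1/sqrt(2)
    System.out.println(cosine(doc, centroid)); // ~0.7071
  }
}
```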
Classifier Code
The code for the classifier is shown below. There are a bunch of setters at the beginning, which allow the caller to configure the classifier. The caller then calls the train() method. Once training is complete, the caller can call the classify() method, which returns a String representing the best category for the document. Another method reports the similarity scores for each category for the document, which can be used for debugging. Some test code further down illustrates the usage.
// Source: src/main/java/com/mycompany/myapp/classifiers/LuceneVectorSpaceModelClassifier.java
package com.mycompany.myapp.classifiers;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.commons.collections15.Bag;
import org.apache.commons.collections15.Transformer;
import org.apache.commons.collections15.bag.HashBag;
import org.apache.commons.collections15.comparators.ReverseComparator;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.commons.math.linear.RealMatrix;
import org.apache.commons.math.linear.SparseRealMatrix;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.Field.TermVector;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.index.IndexWriter.MaxFieldLength;
import org.apache.lucene.store.RAMDirectory;
import com.mycompany.myapp.clustering.ByValueComparator;
import com.mycompany.myapp.indexers.IdfIndexer;
import com.mycompany.myapp.indexers.TfIndexer;
import com.mycompany.myapp.similarity.AbstractSimilarity;
import com.mycompany.myapp.similarity.CosineSimilarity;
/**
* Computes the position in term space of the centroid for each
* category during the training phase. During the classify phase,
* the position in term space of the document to be classified is
* computed and the cosine similarity of this document with each
* of the centroids is computed. The category of the centroid which
* is closest to the document is assigned to the document.
*/
public class LuceneVectorSpaceModelClassifier {
private final Log log = LogFactory.getLog(getClass());
private String indexDir;
private String categoryFieldName;
private String bodyFieldName;
private Analyzer analyzer = new StandardAnalyzer();
@SuppressWarnings("unchecked")
private Transformer<RealMatrix,RealMatrix>[] indexers =
new Transformer[] {
new TfIndexer(),
new IdfIndexer()
};
private AbstractSimilarity similarity = new CosineSimilarity();
private Map<String,RealMatrix> centroidMap;
private Map<String,Integer> termIdMap;
private Map<String,Double> similarityMap;
/**
* Set the directory where the Lucene index is located.
* @param indexDir the index directory.
*/
public void setIndexDir(String indexDir) {
this.indexDir = indexDir;
}
/**
* Set the name of the Lucene field containing the preclassified category.
* @param categoryFieldName the category field name.
*/
public void setCategoryFieldName(String categoryFieldName) {
this.categoryFieldName = categoryFieldName;
}
/**
* Set the name of the Lucene field containing the document body. The
* document body must have been indexed with TermVector.YES.
* @param bodyFieldName the name of the document body field.
*/
public void setBodyFieldName(String bodyFieldName) {
this.bodyFieldName = bodyFieldName;
}
/**
* The Analyzer used for tokenizing the document body during indexing,
* and to tokenize the text to be classified. If not specified, the
* classifier uses the StandardAnalyzer.
* @param analyzer the analyzer reference.
*/
public void setAnalyzer(Analyzer analyzer) {
this.analyzer = analyzer;
}
/**
* A transformer chain of indexers (or normalizers) to normalize the
* document matrices. If not specified, the default is a chain of TF-IDF.
* @param indexers the normalizer chain.
*/
public void setIndexers(
Transformer<RealMatrix,RealMatrix>[] indexers) {
this.indexers = indexers;
}
/**
* The Similarity implementation used to calculate the similarity between
* the text to be classified and the category centroid document matrices.
* Uses CosineSimilarity if not specified.
* @param similarity the similarity to use.
*/
public void setSimilarity(AbstractSimilarity similarity) {
this.similarity = similarity;
}
/**
* Implements the logic for training the classifier. The input is a Lucene
* index of preclassified documents. The classifier is provided the name
* of the field which contains the document body, as well as the name of
* the field which contains the category name. Additionally, the document
* body must have had its Term Vectors computed during indexing (using
* TermVector.YES). The train method uses the Term Vectors to compute a
* geometrical centroid for each category in the index. The set of category
* names to their associated centroid document matrix is available via the
* getCentroidMap() method after training is complete.
* @throws Exception if one is thrown.
*/
public void train() throws Exception {
log.info("Classifier training started");
IndexReader reader = IndexReader.open(indexDir);
// Set up a data structure for the term versus the row id in the matrix.
// This is going to be used for looking up the term's row in the matrix.
this.termIdMap = computeTermIdMap(reader);
// Initialize the data structures to hold the td matrices for the various
// categories.
Bag<String> docsInCategory = computeDocsInCategory(reader);
Map<String,Integer> currentDocInCategory =
new HashMap<String,Integer>();
Map<String,RealMatrix> categoryTfMap =
new HashMap<String,RealMatrix>();
for (String category : docsInCategory.uniqueSet()) {
int numDocsInCategory = docsInCategory.getCount(category);
categoryTfMap.put(category,
new SparseRealMatrix(termIdMap.size(), numDocsInCategory));
currentDocInCategory.put(category, new Integer(0));
}
// extract each document body's TermVector into the td matrix for
// that document's category
int numDocs = reader.numDocs();
for (int i = 0; i < numDocs; i++) {
Document doc = reader.document(i);
String category = doc.get(categoryFieldName);
RealMatrix tfMatrix = categoryTfMap.get(category);
// get the term frequency map
TermFreqVector vector = reader.getTermFreqVector(i, bodyFieldName);
String[] terms = vector.getTerms();
int[] frequencies = vector.getTermFrequencies();
for (int j = 0; j < terms.length; j++) {
int row = termIdMap.get(terms[j]);
int col = currentDocInCategory.get(category);
tfMatrix.setEntry(row, col, new Double(frequencies[j]));
}
incrementCurrentDoc(currentDocInCategory, category);
}
reader.close();
// compute centroid vectors for each category
this.centroidMap = new HashMap<String,RealMatrix>();
for (String category : docsInCategory.uniqueSet()) {
RealMatrix tdmatrix = categoryTfMap.get(category);
RealMatrix centroid = computeCentroid(tdmatrix);
centroidMap.put(category, centroid);
}
log.info("Classifier training complete");
}
/**
* Returns the centroid map of category name to TD Matrix containing the
* centroid document of the category. This data is computed as a side
* effect of the train() method.
* @return the centroid map computed from the training.
*/
public Map<String,RealMatrix> getCentroidMap() {
return centroidMap;
}
/**
* Returns the map of analyzed terms versus their positions in the centroid
* matrices. The data is computed as a side-effect of the train() method.
* @return a Map of analyzed terms to their position in the matrix.
*/
public Map<String,Integer> getTermIdMap() {
return termIdMap;
}
/**
* Once the classifier is trained using the train() method, it creates a
* Map of category to associated centroid documents for each category, and
* a termIdMap, which is a mapping of tokenized terms to its row number in
* the document matrix for the centroid documents. These two structures are
* used by the classify method to match up terms from the input text with
* corresponding terms in the centroids to calculate similarities. Builds
* a Map of category names and the similarities of the input text to the
* centroids in each category as a side effect. Returns the category with
* the highest similarity score, ie the category this text should be
* classified under.
* @param centroids a Map of category names to centroid document matrices.
* @param termIdMap a Map of terms to their positions in the document
* matrix.
* @param text the text to classify.
* @return the best category for the text.
* @throws Exception if one is thrown.
*/
public String classify(Map<String,RealMatrix> centroids,
Map<String,Integer> termIdMap, String text) throws Exception {
RAMDirectory ramdir = new RAMDirectory();
indexDocument(ramdir, "text", text);
// now find the (normalized) term frequency vector for this
RealMatrix docMatrix = buildMatrixFromIndex(ramdir, "text");
// compute similarity using passed in Similarity implementation, we
// use CosineSimilarity by default.
this.similarityMap = new HashMap<String,Double>();
for (String category : centroids.keySet()) {
RealMatrix centroidMatrix = centroids.get(category);
double sim = similarity.computeSimilarity(docMatrix, centroidMatrix);
similarityMap.put(category, sim);
}
// sort the categories
List<String> categories = new ArrayList<String>();
categories.addAll(centroids.keySet());
Collections.sort(categories,
new ReverseComparator<String>(
new ByValueComparator<String,Double>(similarityMap)));
// return the best category, the similarity map is also available
// to the client for debugging or display.
return categories.get(0);
}
/**
* Returns the map of category to similarity with the document after
* classification. The similarityMap is computed as a side-effect of
* the classify() method, so the data is interesting only if this method
* is called after classify() completes successfully.
* @return map of category to similarity scores for text to classify.
*/
public Map<String,Double> getSimilarityMap() {
return similarityMap;
}
/**
* Loops through the IndexReader's TermEnum enumeration, and creates a Map
* of term to an integer id. This map is going to be used to assign string
* terms to specific rows in the Term Document Matrix for each category.
* @param reader a reference to the IndexReader.
* @return a Map of terms to their integer ids (0-based).
* @throws Exception if one is thrown.
*/
private Map<String, Integer> computeTermIdMap(IndexReader reader)
throws Exception {
Map<String,Integer> termIdMap =
new HashMap<String,Integer>();
int id = 0;
TermEnum termEnum = reader.terms();
while (termEnum.next()) {
String term = termEnum.term().text();
if (termIdMap.containsKey(term)) {
continue;
}
termIdMap.put(term, id);
id++;
}
return termIdMap;
}
/**
* Loops through the specified IndexReader and returns a Bag of categories
* and their document counts. We don't use the BitSet/DocIdSet approach
* here because we don't know how many categories the training documents
* have been classified into.
* @param reader the reference to the IndexReader.
* @return a Bag of category names and counts.
* @throws Exception if one is thrown.
*/
private Bag<String> computeDocsInCategory(IndexReader reader)
throws Exception {
int numDocs = reader.numDocs();
Bag<String> docsInCategory = new HashBag<String>();
for (int i = 0; i < numDocs; i++) {
Document doc = reader.document(i);
String category = doc.get(categoryFieldName);
docsInCategory.add(category);
}
return docsInCategory;
}
/**
* Increments the counter for the category to point to the next document
* index. This is used to manage the document index in the td matrix for
* the category.
* @param currDocs the Map of category-wise document Id counters.
* @param category the category whose document-id we want to increment.
*/
private void incrementCurrentDoc(Map<String,Integer> currDocs,
String category) {
int currentDoc = currDocs.get(category);
currDocs.put(category, currentDoc + 1);
}
/**
* Compute the centroid document from the TD Matrix. Result is a matrix
* of term weights but for a single document only.
* @param tdmatrix
* @return
*/
private RealMatrix computeCentroid(RealMatrix tdmatrix) {
tdmatrix = normalizeWithTfIdf(tdmatrix);
RealMatrix centroid =
new SparseRealMatrix(tdmatrix.getRowDimension(), 1);
int numDocs = tdmatrix.getColumnDimension();
int numTerms = tdmatrix.getRowDimension();
for (int row = 0; row < numTerms; row++) {
double rowSum = 0.0D;
for (int col = 0; col < numDocs; col++) {
rowSum += tdmatrix.getEntry(row, col);
}
centroid.setEntry(row, 0, rowSum / ((double) numDocs));
}
return centroid;
}
/**
* Builds an in-memory Lucene index using the text supplied for classification.
* @param ramdir the RAM Directory reference.
* @param fieldName the field name to index the text as.
* @param text the text to index.
* @throws Exception if one is thrown.
*/
private void indexDocument(RAMDirectory ramdir, String fieldName,
String text) throws Exception {
IndexWriter writer =
new IndexWriter(ramdir, analyzer, MaxFieldLength.UNLIMITED);
Document doc = new Document();
doc.add(new Field(
fieldName, text, Store.YES, Index.ANALYZED, TermVector.YES));
writer.addDocument(doc);
writer.commit();
writer.close();
}
/**
* Given a Lucene index and a field name with pre-computed TermVectors,
* creates a document matrix of terms. The document matrix is normalized
* using the specified indexer chain.
* @param ramdir the RAM Directory reference.
* @param fieldName the name of the field to build the matrix from.
* @return a normalized Document matrix of terms and frequencies.
* @throws Exception if one is thrown.
*/
private RealMatrix buildMatrixFromIndex(RAMDirectory ramdir,
String fieldName) throws Exception {
IndexReader reader = IndexReader.open(ramdir);
TermFreqVector vector = reader.getTermFreqVector(0, fieldName);
String[] terms = vector.getTerms();
int[] frequencies = vector.getTermFrequencies();
RealMatrix docMatrix = new SparseRealMatrix(termIdMap.size(), 1);
for (int i = 0; i < terms.length; i++) {
String term = terms[i];
if (termIdMap.containsKey(term)) {
int row = termIdMap.get(term);
docMatrix.setEntry(row, 0, frequencies[i]);
}
}
reader.close();
// normalize the docMatrix using TF-IDF
docMatrix = normalizeWithTfIdf(docMatrix);
return docMatrix;
}
/**
* Pass the input TD Matrix through a chain of transformers to normalize
* the TD Matrix. Here we do TF-IDF normalization, although it is possible
* to do other types of normalization (such as LSI) by passing in the
* appropriate chain of normalizers (or indexers).
* @param docMatrix the un-normalized TD Matrix.
* @return the normalized TD Matrix.
*/
private RealMatrix normalizeWithTfIdf(RealMatrix docMatrix) {
for (Transformer<RealMatrix,RealMatrix> indexer : indexers) {
docMatrix = indexer.transform(docMatrix);
}
return docMatrix;
}
}
Related Code
I have reused some code that I had written to support other components developed earlier. When I wrote them earlier, I was using the Jama Matrix package. However, I switched sometime late last year to using the linear algebra classes in commons-math instead. I started using commons-math in anticipation of being able to use the SparseRealMatrix implementation, which I had suggested and contributed a first cut for, but the 2.0 release is still not out, so it's likely you will have to download and build commons-math from the svn repository if you want to run my code. In each subsection below, I point out where you can get the Jama version if you want it.
TfIndexer
This indexer normalizes each term count by dividing by the total number of terms for a given document. This normalizes away the difference between long documents and shorter ones. At the end of the normalization, each term count becomes a number between 0 and 1, and the term frequencies for a document sum to 1.
The Jama version of the code can be found in my post IR Math with Java : TF, IDF and LSI.
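As a quick sanity check of the idea, here is a minimal standalone sketch of the same normalization over a plain double array; the real TfIndexer below operates column-by-column on a commons-math RealMatrix:

```java
// Illustrative sketch only: TF normalization of one document's raw term
// counts. Each count is divided by the total, so the results sum to 1.
public class TfSketch {
  static double[] normalize(double[] counts) {
    double sum = 0.0D;
    for (double c : counts) {
      sum += c;
    }
    double[] tf = new double[counts.length];
    for (int i = 0; i < counts.length; i++) {
      tf[i] = counts[i] / sum;
    }
    return tf;
  }
  public static void main(String[] args) {
    // raw counts 3, 1, 1 normalize to 0.6, 0.2, 0.2
    double[] tf = normalize(new double[] {3.0D, 1.0D, 1.0D});
    System.out.println(tf[0] + " " + tf[1] + " " + tf[2]); // 0.6 0.2 0.2
  }
}
```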
// Source: src/main/java/com/mycompany/myapp/indexers/TfIndexer.java
package com.mycompany.myapp.indexers;
import org.apache.commons.collections15.Transformer;
import org.apache.commons.math.linear.RealMatrix;
public class TfIndexer implements Transformer<RealMatrix,RealMatrix> {
public RealMatrix transform(RealMatrix matrix) {
for (int j = 0; j < matrix.getColumnDimension(); j++) {
double sum = sum(matrix.getSubMatrix(
0, matrix.getRowDimension() -1, j, j));
for (int i = 0; i < matrix.getRowDimension(); i++) {
matrix.setEntry(i, j, (matrix.getEntry(i, j) / sum));
}
}
return matrix;
}
private double sum(RealMatrix colMatrix) {
double sum = 0.0D;
for (int i = 0; i < colMatrix.getRowDimension(); i++) {
sum += colMatrix.getEntry(i, 0);
}
return sum;
}
}
IdfIndexer
This transformation has the effect of de-emphasizing words that are commonly found in the document set. The factor fw by which the frequency of term w is weighted is given by the formula:
fw = 1 + log(N/nw)
where:
N = total number of documents in the collection
nw = number of documents containing word w
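As a quick worked example (assuming natural logarithms, as in the code below), a term appearing in 10 out of 100 documents gets a weight of 1 + log(100/10) ≈ 3.30, while a term appearing in every document gets a weight of exactly 1:

```java
// Illustrative sketch only: the IDF weighting factor fw = 1 + log(N/nw).
public class IdfWeightSketch {
  static double idfWeight(double numDocs, double docsWithWord) {
    return 1.0D + Math.log(numDocs / docsWithWord);
  }
  public static void main(String[] args) {
    System.out.println(idfWeight(100.0D, 10.0D));  // ~3.3026 (rare term, boosted)
    System.out.println(idfWeight(100.0D, 100.0D)); // 1.0 (ubiquitous term, no boost)
  }
}
```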
The code is shown below. The Jama version of the code can also be found in my post IR Math with Java : TF, IDF and LSI.
// Source: src/main/java/com/mycompany/myapp/indexers/IdfIndexer.java
package com.mycompany.myapp.indexers;
import org.apache.commons.collections15.Transformer;
import org.apache.commons.math.linear.RealMatrix;
public class IdfIndexer implements Transformer<RealMatrix,RealMatrix> {
public RealMatrix transform(RealMatrix matrix) {
// Phase 1: apply IDF weight to the raw word frequencies
int n = matrix.getColumnDimension();
for (int j = 0; j < matrix.getColumnDimension(); j++) {
for (int i = 0; i < matrix.getRowDimension(); i++) {
double dm = countDocsWithWord(
matrix.getSubMatrix(i, i, 0, matrix.getColumnDimension() - 1));
double matrixElement = matrix.getEntry(i, j);
if (matrixElement > 0.0D) {
matrix.setEntry(i, j,
matrix.getEntry(i,j) * (1 + Math.log(n) - Math.log(dm)));
}
}
}
// Phase 2: normalize the word scores for a single document
for (int j = 0; j < matrix.getColumnDimension(); j++) {
double sum = sum(matrix.getSubMatrix(
0, matrix.getRowDimension() -1, j, j));
for (int i = 0; i < matrix.getRowDimension(); i++) {
matrix.setEntry(i, j, (matrix.getEntry(i, j) / sum));
}
}
return matrix;
}
private double sum(RealMatrix colMatrix) {
double sum = 0.0D;
for (int i = 0; i < colMatrix.getRowDimension(); i++) {
sum += colMatrix.getEntry(i, 0);
}
return sum;
}
private double countDocsWithWord(RealMatrix rowMatrix) {
double numDocs = 0.0D;
for (int j = 0; j < rowMatrix.getColumnDimension(); j++) {
if (rowMatrix.getEntry(0, j) > 0.0D) {
numDocs++;
}
}
return numDocs;
}
}
CosineSimilarity
Cosine Similarity calculates the cosine of the angle between the lines joining the origin of the term space to each document's position. The higher the value of the cosine, the smaller the angle between the two lines, and hence the more similar the documents. Cosine Similarity is calculated as:
cos θ = (A • B) / (|A| |B|)
where A = document matrix for the first document,
B = document matrix for the second document.
The code for the CosineSimilarity class is shown below. The Jama version can be found in my post IR Math with Java : Similarity Measures.
// Source: src/main/java/com/mycompany/myapp/similarity/CosineSimilarity.java
package com.mycompany.myapp.similarity;
import org.apache.commons.math.linear.RealMatrix;
import org.apache.commons.math.linear.SparseRealMatrix;
public class CosineSimilarity extends AbstractSimilarity {
@Override
public double computeSimilarity(
RealMatrix sourceDoc, RealMatrix targetDoc) {
if (sourceDoc.getRowDimension() != targetDoc.getRowDimension() ||
sourceDoc.getColumnDimension() != targetDoc.getColumnDimension() ||
sourceDoc.getColumnDimension() != 1) {
throw new IllegalArgumentException(
"Matrices are not column matrices or not of the same size");
}
// max col sum, only 1 col, so...
double dotProduct = dot(sourceDoc, targetDoc);
// sqrt of sum of squares of all elements, only one col, so...
double euclideanDist =
sourceDoc.getFrobeniusNorm() * targetDoc.getFrobeniusNorm();
return dotProduct / euclideanDist;
}
private double dot(RealMatrix source, RealMatrix target) {
int maxRows = source.getRowDimension();
int maxCols = source.getColumnDimension();
RealMatrix dotProduct = new SparseRealMatrix(maxRows, maxCols);
for (int row = 0; row < maxRows; row++) {
for (int col = 0; col < maxCols; col++) {
dotProduct.setEntry(row, col,
source.getEntry(row, col) * target.getEntry(row, col));
}
}
return dotProduct.getNorm();
}
}
Test Code
For the test, we use the same collection of Reuters news items from the TextMine project that was used for testing the Binary Naive Bayes Classifier described in my previous post. The indexing code is pretty much the same, except that we now compute the term vectors of the body during indexing time. There is a single test, which trains the classifier, then classifies 5 documents with the classifier. Here is the JUnit test.
// Source: src/test/java/com/mycompany/myapp/classifiers/LuceneVectorSpaceModelClassifierTest.java
package com.mycompany.myapp.classifiers;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.Map;
import org.apache.commons.collections15.Transformer;
import org.apache.commons.io.FileUtils;
import org.apache.commons.lang.StringUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.commons.math.linear.RealMatrix;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.Field.TermVector;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriter.MaxFieldLength;
import org.apache.lucene.store.FSDirectory;
import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.junit.Test;
import com.mycompany.myapp.indexers.IdfIndexer;
import com.mycompany.myapp.indexers.TfIndexer;
import com.mycompany.myapp.similarity.CosineSimilarity;
import com.mycompany.myapp.summarizers.SummaryAnalyzer;
public class LuceneVectorSpaceModelClassifierTest {
private static final Log log =
LogFactory.getLog(LuceneVectorSpaceModelClassifierTest.class);
private static String INPUT_FILE =
"src/test/resources/data/sugar-coffee-cocoa-docs.txt";
private static String INDEX_DIR = "src/test/resources/data/scc-index";
private static String[] DOCS_TO_CLASSIFY = new String[] {
"src/test/resources/data/cocoa.txt",
"src/test/resources/data/cocoa1.txt",
"src/test/resources/data/cocoa2.txt",
"src/test/resources/data/coffee.txt",
"src/test/resources/data/coffee1.txt"
};
@BeforeClass
public static void buildIndex() throws Exception {
log.debug("Building index...");
BufferedReader reader = new BufferedReader(new FileReader(INPUT_FILE));
IndexWriter writer =
new IndexWriter(FSDirectory.getDirectory(INDEX_DIR),
new SummaryAnalyzer(), MaxFieldLength.UNLIMITED);
String line = null;
int lno = 0;
StringBuilder bodybuf = new StringBuilder();
String category = null;
while ((line = reader.readLine()) != null) {
if (line.endsWith(".sgm")) {
// header line
if (lno > 0) {
// not the very first line, so dump current body buffer and
// reinit the buffer.
writeToIndex(writer, category, bodybuf.toString());
bodybuf = new StringBuilder();
}
category = StringUtils.trim(StringUtils.split(line, ":")[1]);
continue;
} else {
// not a header line, accumulate line into bodybuf
bodybuf.append(line).append(" ");
}
lno++;
}
// last record
writeToIndex(writer, category, bodybuf.toString());
reader.close();
writer.commit();
writer.optimize();
writer.close();
}
private static void writeToIndex(IndexWriter writer, String category,
String body) throws Exception {
Document doc = new Document();
doc.add(new Field("category", category, Store.YES, Index.NOT_ANALYZED));
doc.add(
new Field("body", body, Store.YES, Index.ANALYZED, TermVector.YES));
writer.addDocument(doc);
}
@AfterClass
public static void deleteIndex() throws Exception {
log.info("Deleting index directory...");
FileUtils.deleteDirectory(new File(INDEX_DIR));
}
@Test
public void testLuceneVectorSpaceModelClassifier() throws Exception {
LuceneVectorSpaceModelClassifier classifier =
new LuceneVectorSpaceModelClassifier();
// setup
classifier.setIndexDir(INDEX_DIR);
classifier.setAnalyzer(new SummaryAnalyzer());
classifier.setCategoryFieldName("category");
classifier.setBodyFieldName("body");
// this is the default but we set it anyway, to illustrate usage
classifier.setIndexers(new Transformer[] {
new TfIndexer(),
new IdfIndexer()
});
// this is the default but we set it anyway, to illustrate usage.
// Similarity need not be set before training, it can be set before
// the classification step.
classifier.setSimilarity(new CosineSimilarity());
// training
classifier.train();
// classification
Map<String,RealMatrix> centroidMap = classifier.getCentroidMap();
Map<String,Integer> termIdMap = classifier.getTermIdMap();
String[] categories = centroidMap.keySet().toArray(new String[0]);
for (String testDoc : DOCS_TO_CLASSIFY) {
File f = new File(testDoc);
String category = classifier.classify(centroidMap, termIdMap,
FileUtils.readFileToString(f, "UTF-8"));
System.out.println(">>> " + f.getName() +
" => category: " + category);
Map<String,Double> similarityMap = classifier.getSimilarityMap();
String[] pairs = new String[categories.length];
for (int i = 0; i < categories.length; i++) {
pairs[i] = categories[i] + ":" + similarityMap.get(categories[i]);
}
System.out.println("(" + StringUtils.join(pairs, ", ") + ")");
}
}
}
Results
Here are the results. It was a bit surprising to see such good results, so I went back and checked the code to see if I was doing something wrong :-). As you can see, it correctly classified all my 5 documents.
>>> cocoa.txt => category: cocoa
(cocoa:0.7499364961896885, coffee:0.21387426054867117, sugar:0.15213562681433365)
>>> cocoa1.txt => category: cocoa
(cocoa:0.35404965894048845, coffee:0.15006958907480905, sugar:0.14425804054775068)
>>> cocoa2.txt => category: cocoa
(cocoa:0.2993396230523616, coffee:0.1754388455250711, sugar:0.18650205458278443)
>>> coffee.txt => category: coffee
(cocoa:0.18703846088862733, coffee:0.45354676135783173, sugar:0.20549314483406184)
>>> coffee1.txt => category: coffee
(cocoa:0.1436949323744925, coffee:0.3702669738594301, sugar:0.2316259997838632)
Possible Improvements
With the Naive Bayes approach, I had to enable feature selection and use the top √n terms to get it to classify correctly. I had thought of doing something similar here if required, basically by using SVD to extract the principal √n components and using them to compute the similarity. It turned out not to be needed, but it would be quite easy to do, simply by setting a different chain of indexers.
Another interesting toolkit to try out for this stuff is the Semantic Vectors project, which seems to be quite promising from the little I understand about it. A commenter on a previous related post pointed me to it. Now that I've made the leap to using Lucene for the tokenization part, it seems logical to give it a try, something I plan to do next week.
Update 2009-04-26: In recent posts, I have been building on code written and described in previous posts, so there were (and rightly so) quite a few requests for the code. So I've created a project on Sourceforge to host the code. You will find the complete source code built so far in the project's SVN repository.