## Friday, March 13, 2009

### Vector Space Classifier using Lucene

I have only recently started playing with Lucene's term API, and it looks really useful. Over the past year, I have tried to go through and understand the ideas presented in the TMAP book, and in the process I have built up a small set of tools to tokenize text, create and normalize term-document matrices, etc. Lucene provides some of this functionality through its term API, but in a more memory-efficient way. I was kind of aware of the term API, but I was too busy learning the basics of IR to worry too much about it, so I never took the time to figure it out earlier.

Anyway, I've been playing with classification lately, as you can see from my previous post. This week, I try out another popular classification approach based on the Term Vector Space Model. The idea is to compute the position in term space for the "average" or centroid document for each category, and then to find how "close" the target document is to each of these centroids. The closest centroid wins, ie the document is classified to its category.

### Training

The classifier is trained with a pre-classified corpus of documents. Each document's term vectors are computed, and based on its category, put into a Term-Document (TD) matrix for that category. Once all documents are read, then the centroid document for each set of documents are calculated.

During the centroid calculation, we normalize each matrix using TF-IDF and then calculate the centroid for the documents in the matrix. A centroid is basically just the average of the rows in the TD matrix. If you think of a column in the TD Matrix as representing a single document, then a tuple of the elements of that column matrix can be considered as a coordinate that represents a point in n-dimensional space, where n is the number of terms (rows) in our TD Matrix. Thus a document made up of term coordinates which are the average of the rows would represent the centroid of the documents in that category.

In my example, the centroids are stored as in-memory member variables of the classifier, which can be accessed during the classification phase via an accessor. Another data structure is the term to position map, also created as a side effect of the training phase and accessible via an accessor. In real-world systems, you may want to train the classifier once and then reuse it many times over different documents, possibly over a period of days or months, so its probably better to store this data in a database table or some other persistent medium. If you go the database table route, you can coalesce the two data structures needed by the classify method into a single table by keying the centroid coordinates off the term itself. I haven't done this because I am lazy, so you are stuck with handing the two data structures to the classify method at the moment.

### Classification

The classification process takes a body of text and the two data structures, creates an in-memory Lucene index off the text, and creates a document matrix out of the normalized term vectors. As in the training phase, we pass the raw frequencies through our TF-IDF indexers. Similarities are then calculated for this document matrix against the document matrices for each category. The category with the highest similarity between its centroid matrix and the document matrix is assigned to the document. The default similarity implementation used in this classifier is Cosine Similarity.

Notice that unlike the Naive Bayes classifier, this classifier is not binary. You can use the cosine similarity measure to find the best matching category for a document for multiple categories. Of course, it doesn't have to be this way, a Naive Bayes classifier can be run multiple times to make it non-binary, and a Vector Space classifier can be trained appropriately to make it binary.

### Classifier Code

The code for the classifier is shown below. There is a bunch of setters at the beginning, which allow the caller to configure the classifier. Then the caller calls the train() method. Once the training is complete, the caller can call the classify() method, which returns a String representing the best category for the document. There is another method that will report the similarity scores for each category for the document, which can be used for debugging. There is some test code further down that illustrates the usage.

 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394 395 396 397 398 399 400``` ```// Source: src/main/java/com/mycompany/myapp/classifiers/LuceneVectorSpaceModelClassifier.java package com.mycompany.myapp.classifiers; import java.util.ArrayList; import java.util.Collections; import java.util.HashMap; import java.util.List; import java.util.Map; import org.apache.commons.collections15.Bag; import org.apache.commons.collections15.Transformer; import org.apache.commons.collections15.bag.HashBag; import org.apache.commons.collections15.comparators.ReverseComparator; import org.apache.commons.logging.Log; import org.apache.commons.logging.LogFactory; import org.apache.commons.math.linear.RealMatrix; import org.apache.commons.math.linear.SparseRealMatrix; import org.apache.lucene.analysis.Analyzer; import org.apache.lucene.analysis.standard.StandardAnalyzer; import org.apache.lucene.document.Document; import org.apache.lucene.document.Field; import org.apache.lucene.document.Field.Index; import org.apache.lucene.document.Field.Store; import org.apache.lucene.document.Field.TermVector; import org.apache.lucene.index.IndexReader; import org.apache.lucene.index.IndexWriter; import org.apache.lucene.index.TermEnum; import org.apache.lucene.index.TermFreqVector; import org.apache.lucene.index.IndexWriter.MaxFieldLength; import org.apache.lucene.store.RAMDirectory; import com.mycompany.myapp.clustering.ByValueComparator; import com.mycompany.myapp.indexers.IdfIndexer; import com.mycompany.myapp.indexers.TfIndexer; import com.mycompany.myapp.similarity.AbstractSimilarity; import com.mycompany.myapp.similarity.CosineSimilarity; /** * Computes the position in term space of the centroid for each * category during the training phase. During the classify phase, * the position in term space of the document to be classified is * computed and the cosine similarity of this document with each * of the centroids is computed. The category of the centroid which * is closest to the document is assigned to the document. */ public class LuceneVectorSpaceModelClassifier { private final Log log = LogFactory.getLog(getClass()); private String indexDir; private String categoryFieldName; private String bodyFieldName; private Analyzer analyzer = new StandardAnalyzer(); @SuppressWarnings("unchecked") private Transformer[] indexers = new Transformer[] { new TfIndexer(), new IdfIndexer() }; private AbstractSimilarity similarity = new CosineSimilarity(); private Map centroidMap; private Map termIdMap; private Map similarityMap; /** * Set the directory where the Lucene index is located. * @param indexDir the index directory. */ public void setIndexDir(String indexDir) { this.indexDir = indexDir; } /** * Set the name of the Lucene field containing the preclassified category. * @param categoryFieldName the category field name. */ public void setCategoryFieldName(String categoryFieldName) { this.categoryFieldName = categoryFieldName; } /** * Set the name of the Lucene field containing the document body. The * document body must have been indexed with TermVector.YES. * @param bodyFieldName the name of the document body field. */ public void setBodyFieldName(String bodyFieldName) { this.bodyFieldName = bodyFieldName; } /** * The Analyzer used for tokenizing the document body during indexing, * and to tokenize the text to be classified. If not specified, the * classifier uses the StandardAnalyzer. * @param analyzer the analyzer reference. */ public void setAnalyzer(Analyzer analyzer) { this.analyzer = analyzer; } /** * A transformer chain of indexers (or normalizers) to normalize the * document matrices. If not specified, the default is a chain of TF-IDF. * @param indexers the normalizer chain. */ public void setIndexers( Transformer[] indexers) { this.indexers = indexers; } /** * The Similarity implementation used to calculate the similarity between * the text to be classified and the category centroid document matrices. * Uses CosineSimilarity if not specified. * @param similarity the similarity to use. */ public void setSimilarity(AbstractSimilarity similarity) { this.similarity = similarity; } /** * Implements the logic for training the classifier. The input is a Lucene * index of preclassified documents. The classifier is provided the name * of the field which contains the document body, as well as the name of * the field which contains the category name. Additionally, the document * body must have had its Term Vectors computed during indexing (using * TermVector.YES). The train method uses the Term Vectors to compute a * geometrical centroid for each category in the index. The set of category * names to their associated centroid document matrix is available via the * getCentroidMap() method after training is complete. * @throws Exception if one is thrown. */ public void train() throws Exception { log.info("Classifier training started"); IndexReader reader = IndexReader.open(indexDir); // Set up a data structure for the term versus the row id in the matrix. // This is going to be used for looking up the term's row in the matrix. this.termIdMap = computeTermIdMap(reader); // Initialize the data structures to hold the td matrices for the various // categories. Bag docsInCategory = computeDocsInCategory(reader); Map currentDocInCategory = new HashMap(); Map categoryTfMap = new HashMap(); for (String category : docsInCategory.uniqueSet()) { int numDocsInCategory = docsInCategory.getCount(category); categoryTfMap.put(category, new SparseRealMatrix(termIdMap.size(), numDocsInCategory)); currentDocInCategory.put(category, new Integer(0)); } // extract each document body's TermVector into the td matrix for // that document's category int numDocs = reader.numDocs(); for (int i = 0; i < numDocs; i++) { Document doc = reader.document(i); String category = doc.get(categoryFieldName); RealMatrix tfMatrix = categoryTfMap.get(category); // get the term frequency map TermFreqVector vector = reader.getTermFreqVector(i, bodyFieldName); String[] terms = vector.getTerms(); int[] frequencies = vector.getTermFrequencies(); for (int j = 0; j < terms.length; j++) { int row = termIdMap.get(terms[j]); int col = currentDocInCategory.get(category); tfMatrix.setEntry(row, col, new Double(frequencies[j])); } incrementCurrentDoc(currentDocInCategory, category); } reader.close(); // compute centroid vectors for each category this.centroidMap = new HashMap(); for (String category : docsInCategory.uniqueSet()) { RealMatrix tdmatrix = categoryTfMap.get(category); RealMatrix centroid = computeCentroid(tdmatrix); centroidMap.put(category, centroid); } log.info("Classifier training complete"); } /** * Returns the centroid map of category name to TD Matrix containing the * centroid document of the category. This data is computed as a side * effect of the train() method. * @return the centroid map computed from the training. */ public Map getCentroidMap() { return centroidMap; } /** * Returns the map of analyzed terms versus their positions in the centroid * matrices. The data is computed as a side-effect of the train() method. * @return a Map of analyzed terms to their position in the matrix. */ public Map getTermIdMap() { return termIdMap; } /** * Once the classifier is trained using the train() method, it creates a * Map of category to associated centroid documents for each category, and * a termIdMap, which is a mapping of tokenized terms to its row number in * the document matrix for the centroid documents. These two structures are * used by the classify method to match up terms from the input text with * corresponding terms in the centroids to calculate similarities. Builds * a Map of category names and the similarities of the input text to the * centroids in each category as a side effect. Returns the category with * the highest similarity score, ie the category this text should be * classified under. * @param centroids a Map of category names to centroid document matrices. * @param termIdMap a Map of terms to their positions in the document * matrix. * @param text the text to classify. * @return the best category for the text. * @throws Exception if one is thrown. */ public String classify(Map centroids, Map termIdMap, String text) throws Exception { RAMDirectory ramdir = new RAMDirectory(); indexDocument(ramdir, "text", text); // now find the (normalized) term frequency vector for this RealMatrix docMatrix = buildMatrixFromIndex(ramdir, "text"); // compute similarity using passed in Similarity implementation, we // use CosineSimilarity by default. this.similarityMap = new HashMap(); for (String category : centroids.keySet()) { RealMatrix centroidMatrix = centroids.get(category); double sim = similarity.computeSimilarity(docMatrix, centroidMatrix); similarityMap.put(category, sim); } // sort the categories List categories = new ArrayList(); categories.addAll(centroids.keySet()); Collections.sort(categories, new ReverseComparator( new ByValueComparator(similarityMap))); // return the best category, the similarity map is also available // to the client for debugging or display. return categories.get(0); } /** * Returns the map of category to similarity with the document after * classification. The similarityMap is computed as a side-effect of * the classify() method, so the data is interesting only if this method * is called after classify() completes successfully. * @return map of category to similarity scores for text to classify. */ public Map getSimilarityMap() { return similarityMap; } /** * Loops through the IndexReader's TermEnum enumeration, and creates a Map * of term to an integer id. This map is going to be used to assign string * terms to specific rows in the Term Document Matrix for each category. * @param reader a reference to the IndexReader. * @return a Map of terms to their integer ids (0-based). * @throws Exception if one is thrown. */ private Map computeTermIdMap(IndexReader reader) throws Exception { Map termIdMap = new HashMap(); int id = 0; TermEnum termEnum = reader.terms(); while (termEnum.next()) { String term = termEnum.term().text(); if (termIdMap.containsKey(term)) { continue; } termIdMap.put(term, id); id++; } return termIdMap; } /** * Loops through the specified IndexReader and returns a Bag of categories * and their document counts. We don't use the BitSet/DocIdSet approach * here because we don't know how many categories the training documents * have been classified into. * @param reader the reference to the IndexReader. * @return a Bag of category names and counts. * @throws Exception if one is thrown. */ private Bag computeDocsInCategory(IndexReader reader) throws Exception { int numDocs = reader.numDocs(); Bag docsInCategory = new HashBag(); for (int i = 0; i < numDocs; i++) { Document doc = reader.document(i); String category = doc.get(categoryFieldName); docsInCategory.add(category); } return docsInCategory; } /** * Increments the counter for the category to point to the next document * index. This is used to manage the document index in the td matrix for * the category. * @param currDocs the Map of category-wise document Id counters. * @param category the category whose document-id we want to increment. */ private void incrementCurrentDoc(Map currDocs, String category) { int currentDoc = currDocs.get(category); currDocs.put(category, currentDoc + 1); } /** * Compute the centroid document from the TD Matrix. Result is a matrix * of term weights but for a single document only. * @param tdmatrix * @return */ private RealMatrix computeCentroid(RealMatrix tdmatrix) { tdmatrix = normalizeWithTfIdf(tdmatrix); RealMatrix centroid = new SparseRealMatrix(tdmatrix.getRowDimension(), 1); int numDocs = tdmatrix.getColumnDimension(); int numTerms = tdmatrix.getRowDimension(); for (int row = 0; row < numTerms; row++) { double rowSum = 0.0D; for (int col = 0; col < numDocs; col++) { rowSum += tdmatrix.getEntry(row, col); } centroid.setEntry(row, 0, rowSum / ((double) numDocs)); } return centroid; } /** * Builds an in-memory Lucene index using the text supplied for classification. * @param ramdir the RAM Directory reference. * @param fieldName the field name to index the text as. * @param text the text to index. * @throws Exception if one is thrown. */ private void indexDocument(RAMDirectory ramdir, String fieldName, String text) throws Exception { IndexWriter writer = new IndexWriter(ramdir, analyzer, MaxFieldLength.UNLIMITED); Document doc = new Document(); doc.add(new Field( fieldName, text, Store.YES, Index.ANALYZED, TermVector.YES)); writer.addDocument(doc); writer.commit(); writer.close(); } /** * Given a Lucene index and a field name with pre-computed TermVectors, * creates a document matrix of terms. The document matrix is normalized * using the specified indexer chain. * @param ramdir the RAM Directory reference. * @param fieldName the name of the field to build the matrix from. * @return a normalized Document matrix of terms and frequencies. * @throws Exception if one is thrown. */ private RealMatrix buildMatrixFromIndex(RAMDirectory ramdir, String fieldName) throws Exception { IndexReader reader = IndexReader.open(ramdir); TermFreqVector vector = reader.getTermFreqVector(0, fieldName); String[] terms = vector.getTerms(); int[] frequencies = vector.getTermFrequencies(); RealMatrix docMatrix = new SparseRealMatrix(termIdMap.size(), 1); for (int i = 0; i < terms.length; i++) { String term = terms[i]; if (termIdMap.containsKey(term)) { int row = termIdMap.get(term); docMatrix.setEntry(row, 0, frequencies[i]); } } reader.close(); // normalize the docMatrix using TF-IDF docMatrix = normalizeWithTfIdf(docMatrix); return docMatrix; } /** * Pass the input TD Matrix through a chain of transformers to normalize * the TD Matrix. Here we do TF-IDF normalization, although it is possible * to do other types of normalization (such as LSI) by passing in the * appropriate chain of normalizers (or indexers). * @param docMatrix the un-normalized TD Matrix. * @return the normalized TD Matrix. */ private RealMatrix normalizeWithTfIdf(RealMatrix docMatrix) { for (Transformer indexer : indexers) { docMatrix = indexer.transform(docMatrix); } return docMatrix; } } ```

### Related Code

I have reused some code that I had written to support other components developed earlier. When I wrote them earlier, I was using the Jama Matrix package. However, I switched sometime late last year to using the linear algebra classes in commons-math instead. I started using commons-math in anticipation of being able to use the SparseRealMatrix implementation which I had suggested and contributed a first cut for, but the 2.0 release is still not out, so its likely you will have to download and build from the svn repository if you want to run my code. In each subsection below, I point out where you can get the Jama version if you want it.

TfIndexer

This indexer normalizes each term count by dividing by the total number of terms for a given document. This has the effect of normalizing the effect of long documents versus shorter ones. At the end of the normalization, the term count becomes a number between 0 and 1, with the total of all the term frequencies for a document being equal to 1.

The Jama version of the code can be found in my post IR Math with Java : TF, IDF and LSI.

 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27``` ```// Source: src/main/java/com/mycompany/myapp/indexers/TfIndexer.java package com.mycompany.myapp.indexers; import org.apache.commons.collections15.Transformer; import org.apache.commons.math.linear.RealMatrix; public class TfIndexer implements Transformer { public RealMatrix transform(RealMatrix matrix) { for (int j = 0; j < matrix.getColumnDimension(); j++) { double sum = sum(matrix.getSubMatrix( 0, matrix.getRowDimension() -1, j, j)); for (int i = 0; i < matrix.getRowDimension(); i++) { matrix.setEntry(i, j, (matrix.getEntry(i, j) / sum)); } } return matrix; } private double sum(RealMatrix colMatrix) { double sum = 0.0D; for (int i = 0; i < colMatrix.getRowDimension(); i++) { sum += colMatrix.getEntry(i, 0); } return sum; } } ```

IdfIndexer

This transformation has the effect of reducing the frequency of words that are commonly found in the document set. The factor fw by which the frequency of term w is reduced is given by the formula:

```  fw = 1 + log(N/nw)
where:
N = total number of documents in the collection
nw = number of documents containing word w
```

The code is shown below. The Jama version of the code can also be found in my post IR Math with Java : TF, IDF and LSI.

 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51``` ```// Source: src/main/java/com/mycompany/myapp/indexers/IdfIndexer.java package com.mycompany.myapp.indexers; import org.apache.commons.collections15.Transformer; import org.apache.commons.math.linear.RealMatrix; public class IdfIndexer implements Transformer { public RealMatrix transform(RealMatrix matrix) { // Phase 1: apply IDF weight to the raw word frequencies int n = matrix.getColumnDimension(); for (int j = 0; j < matrix.getColumnDimension(); j++) { for (int i = 0; i < matrix.getRowDimension(); i++) { double dm = countDocsWithWord( matrix.getSubMatrix(i, i, 0, matrix.getColumnDimension() - 1)); double matrixElement = matrix.getEntry(i, j); if (matrixElement > 0.0D) { matrix.setEntry(i, j, matrix.getEntry(i,j) * (1 + Math.log(n) - Math.log(dm))); } } } // Phase 2: normalize the word scores for a single document for (int j = 0; j < matrix.getColumnDimension(); j++) { double sum = sum(matrix.getSubMatrix( 0, matrix.getRowDimension() -1, j, j)); for (int i = 0; i < matrix.getRowDimension(); i++) { matrix.setEntry(i, j, (matrix.getEntry(i, j) / sum)); } } return matrix; } private double sum(RealMatrix colMatrix) { double sum = 0.0D; for (int i = 0; i < colMatrix.getRowDimension(); i++) { sum += colMatrix.getEntry(i, 0); } return sum; } private double countDocsWithWord(RealMatrix rowMatrix) { double numDocs = 0.0D; for (int j = 0; j < rowMatrix.getColumnDimension(); j++) { if (rowMatrix.getEntry(0, j) > 0.0D) { numDocs++; } } return numDocs; } } ```

CosineSimilarity

Cosine Similarity calculates the cosine of the angle between the lines joining the origin of the term space to the each document's position. The higher the value of the cosine, the smaller the angle between the two lines, and hence more similar the documents. Cosine Similarity is calculated as:

```  cos θ = A • B / |A| * |B|
where A = document matrix for the first document,
B = document matrix for the second document.
```

The code for the CosineSimilarity class is shown below. The Jama version can be found in my post IR Math with Java : Similarity Measures.

 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38``` ```// Source: src/main/java/com/mycompany/myapp/similarity/CosineSimilarity.java package com.mycompany.myapp.similarity; import org.apache.commons.math.linear.RealMatrix; import org.apache.commons.math.linear.SparseRealMatrix; public class CosineSimilarity extends AbstractSimilarity { @Override public double computeSimilarity( RealMatrix sourceDoc, RealMatrix targetDoc) { if (sourceDoc.getRowDimension() != targetDoc.getRowDimension() || sourceDoc.getColumnDimension() != targetDoc.getColumnDimension() || sourceDoc.getColumnDimension() != 1) { throw new IllegalArgumentException( "Matrices are not column matrices or not of the same size"); } // max col sum, only 1 col, so... double dotProduct = dot(sourceDoc, targetDoc); // sqrt of sum of squares of all elements, only one col, so... double eucledianDist = sourceDoc.getFrobeniusNorm() * targetDoc.getFrobeniusNorm(); return dotProduct / eucledianDist; } private double dot(RealMatrix source, RealMatrix target) { int maxRows = source.getRowDimension(); int maxCols = source.getColumnDimension(); RealMatrix dotProduct = new SparseRealMatrix(maxRows, maxCols); for (int row = 0; row < maxRows; row++) { for (int col = 0; col < maxCols; col++) { dotProduct.setEntry(row, col, source.getEntry(row, col) * target.getEntry(row, col)); } } return dotProduct.getNorm(); } } ```

### Test Code

For the test, we use the same collection of Reuters news items from the TextMine project that was used for testing the Binary Naive Bayes Classifier described in my previous post. The indexing code is pretty much the same, except that we now compute the term vectors of the body during indexing time. There is a single test, which trains the classifier, then classifies 5 documents with the classifier. Here is the JUnit test.

 ``` 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137``` ```// Source: src/test/java/com/mycompany/myapp/classifiers/LuceneVectorSpaceModelClassifierTest.java package com.mycompany.myapp.classifiers; import java.io.BufferedReader; import java.io.File; import java.io.FileReader; import java.util.Map; import org.apache.commons.collections15.Transformer; import org.apache.commons.io.FileUtils; import org.apache.commons.lang.StringUtils; import org.apache.commons.logging.Log; import org.apache.commons.logging.LogFactory; import org.apache.commons.math.linear.RealMatrix; import org.apache.lucene.document.Document; import org.apache.lucene.document.Field; import org.apache.lucene.document.Field.Index; import org.apache.lucene.document.Field.Store; import org.apache.lucene.document.Field.TermVector; import org.apache.lucene.index.IndexWriter; import org.apache.lucene.index.IndexWriter.MaxFieldLength; import org.apache.lucene.store.FSDirectory; import org.junit.AfterClass; import org.junit.BeforeClass; import org.junit.Test; import com.mycompany.myapp.indexers.IdfIndexer; import com.mycompany.myapp.indexers.TfIndexer; import com.mycompany.myapp.similarity.CosineSimilarity; import com.mycompany.myapp.summarizers.SummaryAnalyzer; public class LuceneVectorSpaceModelClassifierTest { private static final Log log = LogFactory.getLog(LuceneVectorSpaceModelClassifierTest.class); private static String INPUT_FILE = "src/test/resources/data/sugar-coffee-cocoa-docs.txt"; private static String INDEX_DIR = "src/test/resources/data/scc-index"; private static String[] DOCS_TO_CLASSIFY = new String[] { "src/test/resources/data/cocoa.txt", "src/test/resources/data/cocoa1.txt", "src/test/resources/data/cocoa2.txt", "src/test/resources/data/coffee.txt", "src/test/resources/data/coffee1.txt" }; @BeforeClass public static void buildIndex() throws Exception { log.debug("Building index..."); BufferedReader reader = new BufferedReader(new FileReader(INPUT_FILE)); IndexWriter writer = new IndexWriter(FSDirectory.getDirectory(INDEX_DIR), new SummaryAnalyzer(), MaxFieldLength.UNLIMITED); String line = null; int lno = 0; StringBuilder bodybuf = new StringBuilder(); String category = null; while ((line = reader.readLine()) != null) { if (line.endsWith(".sgm")) { // header line if (lno > 0) { // not the very first line, so dump current body buffer and // reinit the buffer. writeToIndex(writer, category, bodybuf.toString()); bodybuf = new StringBuilder(); } category = StringUtils.trim(StringUtils.split(line, ":")[1]); continue; } else { // not a header line, accumulate line into bodybuf bodybuf.append(line).append(" "); } lno++; } // last record writeToIndex(writer, category, bodybuf.toString()); reader.close(); writer.commit(); writer.optimize(); writer.close(); } private static void writeToIndex(IndexWriter writer, String category, String body) throws Exception { Document doc = new Document(); doc.add(new Field("category", category, Store.YES, Index.NOT_ANALYZED)); doc.add( new Field("body", body, Store.YES, Index.ANALYZED, TermVector.YES)); writer.addDocument(doc); } @AfterClass public static void deleteIndex() throws Exception { log.info("Deleting index directory..."); FileUtils.deleteDirectory(new File(INDEX_DIR)); } @Test public void testLuceneNaiveBayesClassifier() throws Exception { LuceneVectorSpaceModelClassifier classifier = new LuceneVectorSpaceModelClassifier(); // setup classifier.setIndexDir(INDEX_DIR); classifier.setAnalyzer(new SummaryAnalyzer()); classifier.setCategoryFieldName("category"); classifier.setBodyFieldName("body"); // this is the default but we set it anyway, to illustrate usage classifier.setIndexers(new Transformer[] { new TfIndexer(), new IdfIndexer() }); // this is the default but we set it anyway, to illustrate usage. // Similarity need not be set before training, it can be set before // the classification step. classifier.setSimilarity(new CosineSimilarity()); // training classifier.train(); // classification Map centroidMap = classifier.getCentroidMap(); Map termIdMap = classifier.getTermIdMap(); String[] categories = centroidMap.keySet().toArray(new String[0]); for (String testDoc : DOCS_TO_CLASSIFY) { File f = new File(testDoc); String category = classifier.classify(centroidMap, termIdMap, FileUtils.readFileToString(f, "UTF-8")); System.out.println(">>> " + f.getName() + " => category: " + category); Map similarityMap = classifier.getSimilarityMap(); String[] pairs = new String[categories.length]; for (int i = 0; i < categories.length; i++) { pairs[i] = categories[i] + ":" + similarityMap.get(categories[i]); } System.out.println("(" + StringUtils.join(pairs, ", ") + ")"); } } } ```

### Results

Here are the results. It was a bit surprising to see such good results, so I went back and checked the code to see if I was doing something wrong :-). As you can see, it correctly classified all my 5 documents.

 ``` 1 2 3 4 5 6 7 8 9 10``` ```>>> cocoa.txt => category: cocoa (cocoa:0.7499364961896885, coffee:0.21387426054867117, sugar:0.15213562681433365) >>> cocoa1.txt => category: cocoa (cocoa:0.35404965894048845, coffee:0.15006958907480905, sugar:0.14425804054775068) >>> cocoa2.txt => category: cocoa (cocoa:0.2993396230523616, coffee:0.1754388455250711, sugar:0.18650205458278443) >>> coffee.txt => category: coffee (cocoa:0.18703846088862733, coffee:0.45354676135783173, sugar:0.20549314483406184) >>> coffee1.txt => category: coffee (cocoa:0.1436949323744925, coffee:0.3702669738594301, sugar:0.2316259997838632) ```

### Possible Improvements

With the Naive Bayes approach, I had to enable feature selection and use the top √n terms to get it to classify correctly. I had thought of doing something similar here if required, basically by using SVD to extract the principal √n components and using them to compute the similarity. It is quite easy to do if needed though, simply by setting a different chain of indexers.

Another interesting toolkit to try out for this stuff is the Semantic Vectors project, which seems to be quite promising from the little I understand about it. A commenter on a previous related post pointed me to this - now that I've made the leap to using Lucene for the tokenization part, it seems logical to give this a try, something I plan to do next week.

Update 2009-04-26: In recent posts, I have been building on code written and described in previous posts, so there were (and rightly so) quite a few requests for the code. So I've created a project on Sourceforge to host the code. You will find the complete source code built so far in the project's SVN repository.

#### 63 comments (moderated to prevent spam):

Nitin said...

currently common-math nightly build do not contains SparseMatrix.

:|
nitin

Sujit Pal said...

Thanks Nitin, looks like the commons-math nightly build is still for the 1.2 version, so it doesn't contain the SparseRealMatrix code. Its there in the svn repository though (obviously :-)), so I've changed the post to point to the svn repository instead. Thanks for catching this.

Anonymous said...

Do you have your test code available for download? I am dilligently trying to recreate your experiment, but have several pieces missing (AbstractSimilarity, JAMA vs common.math, etc). Would it be possilbe to zip up this experiment?

Sujit Pal said...

I don't yet, but should have soon. I have been reusing some of my existing code in more recent blogs, and its hard to remember all the dependencies, and I guess it gets confusing for you. You are not the first person to ask, but so far I've been too lazy to package it up properly...but this morning, I have requested project space on sf.net, and I have my code ready for upload into their repository, so once (and if) I get approval (they say 3 biz days), the code will be at jtmt.sf.net.

Anonymous said...

could u let me know how to obtain the test data? sugar-coffee-cocoa-docs.txt and the 54 articles?

Sujit Pal said...

@Anonymous: its in the project at jtmt.sf.net, look under src/test/resources. I copied this from textmine.sf.net project so you will also find it in there.

Anonymous said...

I'm quite new to IR so got a question. It seems to me that your code didn't use "idf" for documents to be classified.

In function "buildMatrixFromIndex", the line "docMatrix=normalizeWithTfIdf(docMatrix)" will just simply do normalizing with tf only because docMatrix is a column matrix.

If so, is there any easy way to modify the code?

If not so, sorry for my misunderstanding.

Sujit Pal said...

Hi, good observation, actually didn't think of that when I was writing the code. However, given the situation - I am trying to classify documents based on the similarity of one document's term vector to another's, IDF probably doesn't make sense. Normalizing the term frequencies across the document, however, does make sense, since we want to compare them in the same space, and density rather than count is what we should compare. What do you think?

Anonymous said...

I got your idea. It's logical.
However, I still don't understand the IR principle in general. Perhaps you have some answer.

Suppose there are only two terms "the" and "dog" in a document.
Now imagine that we are going to visualize the document vector in 2D space of "the" (X axis) and "dog" (Y axis).

Without "idf", we have a vector with the angle of 45. However, with idf, we get one with a larger angle, assuming that the importance of "the" is suppressed by "idf".

The question is: cosine similarity measure should yield different scores for these different angles unless you use the same normalizing scheme for both training and testing documents, right?

Also I found another more confusing statement (to me) in a textbook: "Introduction to IR" by Christopher D. Manning, Prabhakar Raghavan and Hinrich SchÃ¼tze.

"For example, a very standard weighting scheme is lnc.ltc, where the document vector has log-weighted term frequency, no idf (for both effectiveness and efficiency reasons), and cosine normalization, while the query vector uses log-weighted term frequency, idf weighting, and cosine normalization"

http://nlp.stanford.edu/IR-book/pdf/06vect.pdf pp.128

They use idf for query, but not for documents. Interesting.

So, it comes to a conclusion that it's not uncommon to compare query vector with document vector using different schemes of normalization although I got confused myself.

Sujit Pal said...

Nope, I don't understand it either, but it could very well be because I am not as smart/knowledgeable enough about this stuff as the author. I've been meaning to start reading the IR book you mention myself, so far haven't gotten around to it. But using different normalization schemes and being able to compare the two reliably seems counter-intuitive to me.

Also, the query vector is for a single document, right? In that case, IDF is a no-op (something you discovered in my code), so perhaps that is how this scheme seems to work?

Dave said...

Really appreciate what you have done with your jtmt project. Am currently using your vector space classifier for some testing and having very good success with it. Am wondering if you can think of an easy was to integrate in 10 fold cross validation. Believe I want to read in the whole dataset into an index and then partition the index into 10 individual indexes. Would then somehow train/test on these. Just having a bit of a hard time with the implementation details. Any help would be greatly appreciated. Keep up the good work.

Sujit Pal said...

Thanks Dave. My example works off a set of pre-classified documents and attempts to match a new document to one of the classes. But your approach using 10-fold cross validation is probably more useful for when you have no classified documents to begin with. Let me think about it and see if there is some way the code can be adapted to use that.

Sujit Pal said...

Dave, I checked out cross-validation based on your comment above, and added a cross-validation method to the classifier. You've probably already solved this yourself, but here is a link to the post describing this.

Elif said...

Hello,

Thank you for this post. It's really informative and well-explained.

I downloaded the jtmt code. I wanted to play with it a bit but I get an error which says:

The type org.mortbay.component.AbstractLifeCycle cannot be resolved. It is indirectly referenced from required .class files JohHandler.java jtmt/src/main/java/net/sf/jtmt/httpwrapper line 23

I found out that this is about jetty-6.1.5.jar . But the AbstractLifeCycle is supposed to be in it. I tried to compile with jetty-6.1.6 which didn't work. Then with java5 instead of java6, which didn't work again.

Do you have any idea what the problem might be?

(I guess this is not the right place for this question, but I couldn't find any more suitable post.)

Sujit Pal said...

Thanks Elif. A more relevant place for this comment is this page. To answer your question, though, this class is in jetty-util in my 6.1.5 version - its not declared in the POM because Maven2 takes care of transitive dependencies.

- said...

I am having a problem with the line in the CosineSimilarity class.

RealMatrix dotProduct = new SparseRealMatrix(maxRows, maxCols);

The apache common math jar does not contain a class SparseRealMatrix. I also used your libs from sourceforge. Can you help further? Thanks in advance

Anonymous said...

Hello,
how can i build jtmt project .. i m using
Vector Space Classifier for focused crawler

Sujit Pal said...

@-: very imaginative name, btw :-). So heres the story with SparseRealMatrix... I needed one, so I built one in commons-math and contributed it back. However, the developers there are more knowledgeable than me about the ripple effect of adding this kind of thing, so they added in the functionality but under a different class name - I forget exactly the usage, but check the commons-math 2.0+ javadocs and you should be able to figure it out.
@Anonymous: there is a pom.xml - but it contains jar references that are not necessarily public. So I would suggest building an Ant buildfile using mvn ant:ant, then in the classpath, point everything to the lib directory, then run ant jar.

Anonymous said...

Dear sijit pal, your code for vector space classifier using lucene help me a lot. Do u have a code for gram schmidt process. If yes could you post that in the block.

Anonymous said...

Hi Pal:
we are new to IR and Lucene, yet we need to build a searching system, scoring the similarity only by the positions of the terms and with NO regard of TF/IDF.
e.g. suppose we have
q:{t1,t2,t3}
d1:{t1,t3,t2,t4}
d2:{t1,t3}
d3:{t1,t2,t5,t3}
d4:{t4,t1,t2,t3}
then the result is to be
d4>d3>d1>d2
just like a comparison among different bus paths and every single path consists of many stops, quite straightforward; however, we didn't find yet any lucene API fit for this job.
Would you be kind to give us some advice for our task? Any suggestion is appreciated.
Thank you.

Jack H.

Sujit Pal said...

Thanks for the feedback, glad it was helpful. I did not know about the Gram-Schmidt process, thanks for the pointer - if I ever end up building a classifier based on it, I will post it.

Sujit Pal said...

This answer is for the comment about looking for a lucene query to compare bus stops (2 comments up).

This is just a suggestion, haven't tried it yet, so not sure... would it make sense to store the tn values in a multi-field along with its position, and then search on them with an OR query? And have a custom similarity class which boosts based on the ordering. There is an example of something similar in the Lucene in Action (Second Edition) book.

Anonymous said...

Hi Pal:

Heading with your suggestion, we do also wish to have further hints on..
1. the topic or the sample talking about similar requirement in Lucene in Action II.
2. we are not sure how we can do this by boosting on order, for, as far as our understanding, there seems no mechanics for boosting query on order, hence much likely we cannot have a comparison according to the query content.

We do suffer much from these issues, and that's why we keep trying to ask further advice.
We hope it did not make you feel bothered.

Thank you again for your kindness.

Jack H.

Sujit Pal said...

Hi Jack, I believe its section 13.4: Searching entities with SIREn, I am guessing it will need to be adapted for your purpose. Another possibility may be SpanQueries - I want to learn about them myself, so I will use it to try out your use case and see if it would work.

Anonymous said...

Hi, I've assignment about lucene, and I accidentally found your site. I've copied the codes above, I have download the packages you uploaded before, but the program still can't run because I can't import the packages. Can you tell me how to convert a file into a package like you did in the program?? I use JCreator as my editor. Thanks a lot..

Sujit Pal said...

@Jack: I think I may have (at least partial) solution to your problem on this post.

@Anonymous: I won't argue about the ethics of what you are asking, but you do realize that you are shooting yourself in the foot, right? That is, unless you are not really looking for a programming career? Anyway, to answer your question, take a look at the JTMT project, that has a pom.xml which takes care of the packaging (mvn jar:jar).

Teeraphol said...

do u have the .zip file for vector space model for searching in java ???

Sujit Pal said...

No, but you can find the code in the repository for the JTMT project on Sourceforge.

deathlike_silence said...

Hi Sujit!

Great post, great blog!
I was wondering if it feasible to create a binary vector space classifier without training negative documents based on your code.
What are your thoughts on that?

Regards,
Yiannis

Sujit Pal said...

Thanks Yannis. I guess you can some form of binary classification (although it seems to be more like clustering) by specifying a radius cutoff, ie if certain documents do not fall within a certain distance in the term space from the centroid of the cluster of positive docs, then it is a negative doc. But then you would have to analyze your data to find what the radius cutoff should be.

Anonymous said...

Hi Sujit!

I used your technique in a project in my postgraduate program. I used wikipedia and it's categories for training. As you mentioned, I had to modify some things, in order to store the centroids in SQL.

The program returns as a category one of the top categories of eurovoc ( http://eurovoc.europa.eu/drupal/?q=download/subject_oriented&cl=en )

You can test it here:
http://143.233.226.74:8080/NLPproject/categorization.jsp

BR,
Yiannis

Sujit Pal said...

Hi Yannis, this is very cool, congratulations! I tested your categorizer with a few news items from Google, and as long as I stay within the classifications of the Eurovoc thesauri, it seems to be pretty accurate. With something outside (such as sports), it still classifies (wrongly) but with a sufficiently low score to indicate a low confidence. Great stuff, thanks for sharing!

deathlike_silence said...

Hi Sujit!

I don't have the source code for doing this in a public repository. If at any time are you interested in it just let me know. My email is johngouf@iit.demokritos.gr

BR,
Yiannis

Sujit Pal said...

Thanks Yannis, and you are welcome. Once again, great work. Not sure if you are okay with your email address being public, let me know if you are not and I will delete your last comment.

Anonymous said...

hai sujit sir..
how can i build this jtmt project
for Vector Space Classifier.
is it possible with out lucene.. if it is possible how can i import org.apache.commons.* and
org.apache.lucene.*,ect...

Sujit Pal said...

Hi Anonymous, in this particular case we are using Lucene's inverted index to find term frequencies to do the classification. There are definitely other ways to do this, including without using Lucene. The JTMT project can be built using ant, you will find the lucene jars in the lib directory.

Shereen Albitar said...

Hi,
I've been testing this classifier for a while for my research project. The cosinus similarity demands that both matrix has the same dimensions. I would like to know how the classifier can ensure that while centroids are calculated apart from the vector to classify.

Traca said...

Hi. I am a Student and i need to build a classifier. I´ve tryed run run and test your code but, it was not possible. How can i run it in NetBeans? i´ve downloaded everything. Thanks

Sujit Pal said...

@Shereen: I think I ensure that by considering all the terms and all the documents in the vector space for each class. So the centroid for class A is a point in the same TD vector space as the centroid for class B. I use sparse matrices though. Another way to weed out non-interesting terms is to use some form of feature reduction. A simple way is to eliminate any term which scores below a certain number.

@Traca: not sure what your comfort level with Java is, so not sure how to interpret your "... was not possible" comment. There is a build.xml file from which it should be possible to compile on the command line. If you are saying compile failed from the IDE, make sure that your IDE's project classpath contains the jars in the lib directory. Also I mostly run using JUnit tests (which can be run from the command line using ant unittest). Hope this helps...

Shereen Albitar said...

Thanks for your answer. As for centroid matrices during training, the same vector space is used but the point that is not clear for me is how this is applied to new documents during test(classification). Their indexes are created independently to the centroid matrix. Then their vectors are compared to those of centroids using cosinus similarity measure that demands that both compared vectors has to be created in the same vector space.
My question concerns classify function in LuceneVectorSpaceModelClassifier class.

Traca said...

Hi. Thanks for the answer. I just put the classes in a project (NetBeans) and try to compile it. There is no error from the compiler,and i´´ve made a class where i call your test class. I did not try wiht the command line yet.

chandu said...

hai sujith sir...

i'm a student.. using eclipse tool.i Run entire JTMT(but i want to run classifier only)project As JUnit Test i'm getting Java Build path Probleams(100 of 111 items)
saying
Project 'jtmt' is missing required library: '\Users\sujit\Healthline\prod\common\lib\external\dyuproject-openid-1.1.7.jar' ....etc..
eventhough i add jar to Libraries....

Sujit Pal said...

@Shereen: good catch, its probably a bug in the code, thanks for pointing it out. Since the new document is put into a ram index, its term frequency vector is going to have different (subset of) terms than in the main document index. The buildDocumentFromIndex method should probably merge the terms from the index vector and those from the document (and set the ones from the document to 0).

@Traca: thanks for the feedback, good to know that you are (at least for the moment) all set.

@Chandu: Sorry, that reference to the openid.jar must have been added erroneously. Delete this jar reference from the .classpath file (root of the project). I tried updating it and committing to SVN but its not recognizing me at the moment.

VIDYA BHUSHAN SINGH said...

Hi,

I am new to test mining area, but I have to use it for my project. Your blog is the best source of information and tools. I am trying to run the jtmt program using Netbeans, but I am getting following errors. can you identify what could be the problem. There are three directories in the scc-index directory containing the segment files, I am using the same code provided by you, but I don't know why I am getting this error.

what is segment file and how can I generate that for initial training.

thanks

Exception:
11/12/27 05:15:11 INFO classifiers.LuceneVectorSpaceModelClassifier: Classifier training started
java.io.FileNotFoundException: no segments* file found in org.apache.lucene.store.SimpleFSDirectory@C:\Users\VIDYA\Documents\NetBeansProjects\PalSoftware\resources\data\scc-index: files: [doc_cocoa.bin, doc_coffee.bin, doc_sugar.bin, term_cocoa.bin, term_coffee.bin, term_sugar.bin]
at org.apache.lucene.index.SegmentInfos\$FindSegmentsFile.run(SegmentInfos.java:628)
at net.sf.jtmt.classifiers.LuceneVectorSpaceModelClassifier.train(LuceneVectorSpaceModelClassifier.java:156)
at net.sf.jtmt.classifiers.LuceneVectorSpaceModelClassifier.train(LuceneVectorSpaceModelClassifier.java:143)
at net.sf.jtmt.classifiers.LuceneVectorSpaceModelClassifierTest.testLuceneVectorSpaceClassifier(LuceneVectorSpaceModelClassifierTest.java:119)
at net.sf.jtmt.classifiers.LuceneVectorSpaceModelClassifierTest.main(LuceneVectorSpaceModelClassifierTest.java:165)

Sujit Pal said...

Hi Vidya, read through my code again, and I believe the scc-index directory contains 3 lucene indexes, one for each category. Basically the training phase reads the indexes for each category and calculates the centroids to compare against the new input to be classified. The code seems to be pointing at the parent of these 3 directories, not the index directories themselves), and not finding the index as a result. Segment files are lucene indexes - an index is a directory containing one or more segment files.

venkat said...

I want to understand the source code for vector space model only ,so from where i have to start reading.

Sujit Pal said...

Hi Venkat, I think the classifier code itself should be enough. The basic premise is to feed the classifier a set of documents in each category, allow it to compute centroids for each category, then pass it one or more test documents and ask it what the category is, which is the one whose centroid is closest to the documents location in term space. From experience, the hardest thing to wrap your head around is the n dimensional term space idea - each term (that is considered for calculation) in the document corresponds to a dimension - once you get that the rest is gravy.

venkat said...

Thank u sir . I understood the source code and used in my Project.

Sujit Pal said...

You are welcome Venkat.

mahesh said...

Can u plz post Text classification using Support vector machine(SVM) code ... I badly need it ... can u plz help me ....

Sujit Pal said...

Hi Mahesh, sorry, don't know too much about SVMs, I'll probably check it out, heard about this from someone else as well, so its probably worth looking into... But I won't be able to do this right away.

Anonymous said...

Hi, can I get the source code ? Thankyou very much :)

Sujit Pal said...

Sure its on jtmt.sf.net's SVN.

Anonymous said...

Thanks for this great post!

I got the example up and running just fine. However when I try to use my own data which is quite large a get a NumberIsTooLargeException when I try to create a OpenMapRealMatrix(col, row). The reason for this exception is obvious col * row > Integer.MAX_VALUE.

Are you aware of any solutions that solve this limit and still use OpenMapRealMatrix?

A possible solution I am thinking of is to split the total OpenMapRealMatrix in parts but a more simple solution would be great.

Sujit Pal said...

Thanks, you are welcome. I have never encountered this error, you must be dealing with huge matrices, I am (pleasantly) surprised that my code works with such large sizes of data. However, I looked through the OpenMapRealMatrix Javadocs and it seems that NumberIsTooLargeException is thrown in cases where the condition you refer to may not necessarily be true. Unless you are certain that you are encountering the col * row > MAX_VALUE case, you may want to check the method call at the line number the exception is being thrown from. Of course, in case it /is/ a size issue , one other solution (probably easier than splitting and merging) is to consider building an OpenMapRealMatrix implementation using BigInteger for the row and column indexes?

Vignesh Srinivasan said...

is the algorithm same as Rocchio classification.I have tried some general wikipedia documents .it seems to work well..But will it be useful for Email classification on a large scale?

Sujit Pal said...

Yes, it is the same as Rocchio classification. Once the centroids are computed based on training data, each new point is classified based on the closest centroid. To answer your second question, I think it will depend on your scale - the training phase is memory intensive, but if your training set is not too large, it should work out fine. The classification stage takes each 1 document at a time and constructs its term vector in memory, so that should work out fine.

Vignesh Srinivasan said...

Hi Sujit,

How can i Store the training model in DB or some serialized object so that i don,t need to do training again..

Sujit Pal said...

The trained model is contained in the Lucene index, so its already serialized.

Vignesh Srinivasan said...

Hi sujit,
I understand we use lucene index for training.
In real-world systems, you may want to train the classifier once and then reuse it many times over different documents, possibly over a period of days or months, so its probably better to store this data in a database table or some other persistent medium. If you go the database table route, you can coalesce the two data structures needed by the classify method into a single table by keying the centroid coordinates off the term itself.

I did not understand completly the above point..can you please explain how i can do this.It will be really useful..

Sujit Pal said...

Well, during the classification step, you are trying to build the document vector for the document in question, then comparing against the document vectors corresponding to each of your cluster centroids. So you could extract the cluster centroids into a database table. Currently, each cluster centroid looks like this:

{"A": [x1, x2, x3, ..., xn]}

where the A corresponds to the cluster label and the xi elements stands for the value of the i-th dimension of the centroid document vector. The other data structure is the term-id map which maps the {term: position} where position corresponds to i in the cluster centroid data above. You will need the {term:position} mapping when you want to create the document vector for the input document.

So I guess what I was trying to say is that if you use a database table, you can get rid of the i value and map the term directly to the cluster, ie, the data structure you could use would look like this:

{"A": {t1: x1, t2: x2, ..., tn: xn}}

Obviously the associated code would need to change accordingly...

Zainil Abidin said...