## Friday, March 13, 2009

### Vector Space Classifier using Lucene

I have only recently started playing with Lucene's term API, and it looks really useful. Over the past year, I have tried to go through and understand the ideas presented in the TMAP book, and in the process I have built up a small set of tools to tokenize text, create and normalize term-document matrices, etc. Lucene provides some of this functionality through its term API, but in a more memory-efficient way. I was kind of aware of the term API, but I was too busy learning the basics of IR to worry too much about it, so I never took the time to figure it out earlier.

Anyway, I've been playing with classification lately, as you can see from my previous post. This week, I try out another popular classification approach based on the Term Vector Space Model. The idea is to compute the position in term space of the "average" or centroid document for each category, and then to find how "close" the target document is to each of these centroids. The closest centroid wins, i.e., the document is assigned to that centroid's category.

### Training

The classifier is trained with a pre-classified corpus of documents. Each document's term vector is computed and, based on the document's category, added to a Term-Document (TD) matrix for that category. Once all documents are read, the centroid for each category's set of documents is calculated.

During the centroid calculation, we normalize each matrix using TF-IDF and then compute the centroid of the documents in the matrix. The centroid is simply the element-wise average of the document columns. If you think of a column in the TD matrix as representing a single document, then the elements of that column form the coordinates of a point in n-dimensional space, where n is the number of terms (rows) in our TD matrix. A document whose coordinate in each dimension is the average of the corresponding row is therefore the centroid of the documents in that category.
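To make the geometry concrete, here is a minimal sketch of the centroid calculation using plain double[][] arrays instead of the commons-math RealMatrix used in the actual code; the class and method names are made up for illustration only.

```java
// Sketch only: rows are terms, columns are documents. The centroid is a
// column vector whose i-th entry is the average of row i across all
// document columns. (Hypothetical class, not part of the real classifier.)
public class CentroidSketch {

  public static double[] centroid(double[][] tdMatrix) {
    int numTerms = tdMatrix.length;       // rows
    int numDocs = tdMatrix[0].length;     // columns
    double[] centroid = new double[numTerms];
    for (int i = 0; i < numTerms; i++) {
      double sum = 0.0D;
      for (int j = 0; j < numDocs; j++) {
        sum += tdMatrix[i][j];
      }
      centroid[i] = sum / numDocs;        // average across documents
    }
    return centroid;
  }
}
```

For a 2-term, 2-document matrix {{1,3},{2,4}} this yields the centroid {2,3}, i.e. the midpoint of the two document points in term space.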

In my example, the centroids are stored as in-memory member variables of the classifier, which can be accessed during the classification phase via an accessor. Another data structure is the term-to-position map, also created as a side effect of the training phase and accessible via an accessor. In real-world systems, you may want to train the classifier once and then reuse it many times over different documents, possibly over a period of days or months, so it's probably better to store this data in a database table or some other persistent medium. If you go the database table route, you can coalesce the two data structures needed by the classify method into a single table by keying the centroid coordinates off the term itself. I haven't done this because I am lazy, so you are stuck with handing the two data structures to the classify method at the moment.

### Classification

The classification process takes a body of text and the two data structures, builds an in-memory Lucene index from the text, and creates a document matrix out of the normalized term vectors. As in the training phase, we pass the raw frequencies through our TF-IDF indexers. Similarities are then calculated between this document matrix and the centroid matrices for each category. The category whose centroid has the highest similarity to the document matrix is assigned to the document. The default similarity implementation used in this classifier is Cosine Similarity.
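Stripped of the Lucene plumbing, the classification step reduces to a nearest-centroid search. Here is a sketch using plain double arrays and an inlined cosine similarity rather than the index-backed matrices and Similarity classes the actual classifier uses; the names are illustrative, not the real API.

```java
import java.util.Map;

// Sketch only: pick the category whose centroid vector has the highest
// cosine similarity to the document's term vector. (Hypothetical class.)
public class ClassifySketch {

  // cosine similarity between two equal-length term vectors
  public static double cosine(double[] a, double[] b) {
    double dot = 0.0D, normA = 0.0D, normB = 0.0D;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  // return the category whose centroid is closest to the document vector
  public static String classify(
      Map<String,double[]> centroids, double[] docVector) {
    String bestCategory = null;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (Map.Entry<String,double[]> e : centroids.entrySet()) {
      double score = cosine(e.getValue(), docVector);
      if (score > bestScore) {
        bestScore = score;
        bestCategory = e.getKey();
      }
    }
    return bestCategory;
  }
}
```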

Notice that unlike the Naive Bayes classifier, this classifier is not binary. You can use the cosine similarity measure to find the best matching category for a document across multiple categories. Of course, it doesn't have to be this way: a Naive Bayes classifier can be run multiple times to make it non-binary, and a Vector Space classifier can be trained appropriately to make it binary.

### Classifier Code

The code for the classifier is shown below. There are a bunch of setters at the beginning that allow the caller to configure the classifier. The caller then calls the train() method. Once training is complete, the caller can call the classify() method, which returns a String representing the best category for the document. Another method reports the similarity scores of each category for the document, which can be used for debugging. Some test code further down illustrates the usage.

### Related Code

I have reused some code that I had written to support other components developed earlier. When I wrote them, I was using the Jama Matrix package. However, I switched sometime late last year to the linear algebra classes in commons-math instead. I started using commons-math in anticipation of being able to use the SparseRealMatrix implementation, which I had suggested and contributed a first cut for, but the 2.0 release is still not out, so it's likely you will have to download and build from the svn repository if you want to run my code. In each subsection below, I point out where you can get the Jama version if you want it.

#### TfIndexer

This indexer normalizes each term count by dividing it by the total number of terms in the document, which corrects for the difference between long documents and shorter ones. After normalization, each term frequency is a number between 0 and 1, and the term frequencies for a document sum to 1.

The Jama version of the code can be found in my post IR Math with Java : TF, IDF and LSI.

```java
// Source: src/main/java/com/mycompany/myapp/indexers/TfIndexer.java
package com.mycompany.myapp.indexers;

import org.apache.commons.collections15.Transformer;
import org.apache.commons.math.linear.RealMatrix;

public class TfIndexer implements Transformer<RealMatrix,RealMatrix> {

  public RealMatrix transform(RealMatrix matrix) {
    for (int j = 0; j < matrix.getColumnDimension(); j++) {
      double sum = sum(matrix.getSubMatrix(
        0, matrix.getRowDimension() - 1, j, j));
      for (int i = 0; i < matrix.getRowDimension(); i++) {
        matrix.setEntry(i, j, (matrix.getEntry(i, j) / sum));
      }
    }
    return matrix;
  }

  private double sum(RealMatrix colMatrix) {
    double sum = 0.0D;
    for (int i = 0; i < colMatrix.getRowDimension(); i++) {
      sum += colMatrix.getEntry(i, 0);
    }
    return sum;
  }
}
```

#### IdfIndexer

This transformation reduces the weight of words that are commonly found across the document set. The factor fw applied to the frequency of term w is given by the formula:

```
fw = 1 + log(N / nw)

where:
  N  = total number of documents in the collection
  nw = number of documents containing word w
```

The code is shown below. The Jama version of the code can also be found in my post IR Math with Java : TF, IDF and LSI.

```java
// Source: src/main/java/com/mycompany/myapp/indexers/IdfIndexer.java
package com.mycompany.myapp.indexers;

import org.apache.commons.collections15.Transformer;
import org.apache.commons.math.linear.RealMatrix;

public class IdfIndexer implements Transformer<RealMatrix,RealMatrix> {

  public RealMatrix transform(RealMatrix matrix) {
    // Phase 1: apply IDF weight to the raw word frequencies
    int n = matrix.getColumnDimension();
    for (int j = 0; j < matrix.getColumnDimension(); j++) {
      for (int i = 0; i < matrix.getRowDimension(); i++) {
        double dm = countDocsWithWord(
          matrix.getSubMatrix(i, i, 0, matrix.getColumnDimension() - 1));
        double matrixElement = matrix.getEntry(i, j);
        if (matrixElement > 0.0D) {
          matrix.setEntry(i, j,
            matrix.getEntry(i, j) * (1 + Math.log(n) - Math.log(dm)));
        }
      }
    }
    // Phase 2: normalize the word scores for a single document
    for (int j = 0; j < matrix.getColumnDimension(); j++) {
      double sum = sum(matrix.getSubMatrix(
        0, matrix.getRowDimension() - 1, j, j));
      for (int i = 0; i < matrix.getRowDimension(); i++) {
        matrix.setEntry(i, j, (matrix.getEntry(i, j) / sum));
      }
    }
    return matrix;
  }

  private double sum(RealMatrix colMatrix) {
    double sum = 0.0D;
    for (int i = 0; i < colMatrix.getRowDimension(); i++) {
      sum += colMatrix.getEntry(i, 0);
    }
    return sum;
  }

  private double countDocsWithWord(RealMatrix rowMatrix) {
    double numDocs = 0.0D;
    for (int j = 0; j < rowMatrix.getColumnDimension(); j++) {
      if (rowMatrix.getEntry(0, j) > 0.0D) {
        numDocs++;
      }
    }
    return numDocs;
  }
}
```

#### CosineSimilarity

Cosine Similarity calculates the cosine of the angle between the lines joining the origin of the term space to each document's position. The higher the value of the cosine, the smaller the angle between the two lines, and hence the more similar the documents. Cosine Similarity is calculated as:

```
cos θ = (A • B) / (|A| * |B|)

where:
  A = document matrix for the first document
  B = document matrix for the second document
```

The code for the CosineSimilarity class is shown below. The Jama version can be found in my post IR Math with Java : Similarity Measures.

```java
// Source: src/main/java/com/mycompany/myapp/similarity/CosineSimilarity.java
package com.mycompany.myapp.similarity;

import org.apache.commons.math.linear.RealMatrix;
import org.apache.commons.math.linear.SparseRealMatrix;

public class CosineSimilarity extends AbstractSimilarity {

  @Override
  public double computeSimilarity(
      RealMatrix sourceDoc, RealMatrix targetDoc) {
    if (sourceDoc.getRowDimension() != targetDoc.getRowDimension() ||
        sourceDoc.getColumnDimension() != targetDoc.getColumnDimension() ||
        sourceDoc.getColumnDimension() != 1) {
      throw new IllegalArgumentException(
        "Matrices are not column matrices or not of the same size");
    }
    // max col sum, only 1 col, so...
    double dotProduct = dot(sourceDoc, targetDoc);
    // sqrt of sum of squares of all elements, only one col, so...
    double euclideanDist =
      sourceDoc.getFrobeniusNorm() * targetDoc.getFrobeniusNorm();
    return dotProduct / euclideanDist;
  }

  private double dot(RealMatrix source, RealMatrix target) {
    int maxRows = source.getRowDimension();
    int maxCols = source.getColumnDimension();
    RealMatrix dotProduct = new SparseRealMatrix(maxRows, maxCols);
    for (int row = 0; row < maxRows; row++) {
      for (int col = 0; col < maxCols; col++) {
        dotProduct.setEntry(row, col,
          source.getEntry(row, col) * target.getEntry(row, col));
      }
    }
    return dotProduct.getNorm();
  }
}
```

### Test Code

For the test, we use the same collection of Reuters news items from the TextMine project that was used for testing the Binary Naive Bayes Classifier described in my previous post. The indexing code is pretty much the same, except that we now compute the term vectors of the body during indexing time. There is a single test, which trains the classifier, then classifies 5 documents with the classifier. Here is the JUnit test.

### Results

Here are the results. It was a bit surprising to see such good results, so I went back and checked the code to see if I was doing something wrong :-). As you can see, it correctly classified all my 5 documents.

```
>>> cocoa.txt => category: cocoa
    (cocoa:0.7499364961896885, coffee:0.21387426054867117, sugar:0.15213562681433365)
>>> cocoa1.txt => category: cocoa
    (cocoa:0.35404965894048845, coffee:0.15006958907480905, sugar:0.14425804054775068)
>>> cocoa2.txt => category: cocoa
    (cocoa:0.2993396230523616, coffee:0.1754388455250711, sugar:0.18650205458278443)
>>> coffee.txt => category: coffee
    (cocoa:0.18703846088862733, coffee:0.45354676135783173, sugar:0.20549314483406184)
>>> coffee1.txt => category: coffee
    (cocoa:0.1436949323744925, coffee:0.3702669738594301, sugar:0.2316259997838632)
```

### Possible Improvements

With the Naive Bayes approach, I had to enable feature selection and use the top √n terms to get it to classify correctly. I had thought of doing something similar here if required, basically by using SVD to extract the principal √n components and using them to compute the similarity. It is quite easy to do if needed though, simply by setting a different chain of indexers.
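The "chain of indexers" idea can be sketched generically. The illustrative version below uses java.util.function.UnaryOperator over plain double[][] arrays, whereas the actual code chains commons-collections Transformer instances over RealMatrix; the class name is hypothetical.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Sketch only: each indexer is a transform over the term-document matrix,
// and changing the classifier's behavior (TF only, TF-IDF, SVD-based
// reduction, ...) amounts to passing a different list of transforms.
public class IndexerChainSketch {

  public static double[][] applyChain(
      List<UnaryOperator<double[][]>> chain, double[][] matrix) {
    for (UnaryOperator<double[][]> indexer : chain) {
      matrix = indexer.apply(matrix);   // feed each output to the next
    }
    return matrix;
  }
}
```

Since each transform has the same shape (matrix in, matrix out), swapping the TF-IDF chain for one that ends in an SVD-based reduction step would not require changing the classifier itself.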

Another interesting toolkit to try out for this stuff is the Semantic Vectors project, which seems to be quite promising from the little I understand about it. A commenter on a previous related post pointed me to this - now that I've made the leap to using Lucene for the tokenization part, it seems logical to give this a try, something I plan to do next week.

Update 2009-04-26: In recent posts, I have been building on code written and described in previous posts, so there were (and rightly so) quite a few requests for the code. So I've created a project on Sourceforge to host the code. You will find the complete source code built so far in the project's SVN repository.

#### 63 comments (moderated to prevent spam):

Nitin said...

currently common-math nightly build do not contains SparseMatrix.

:|
nitin

Sujit Pal said...

Thanks Nitin, looks like the commons-math nightly build is still for the 1.2 version, so it doesn't contain the SparseRealMatrix code. Its there in the svn repository though (obviously :-)), so I've changed the post to point to the svn repository instead. Thanks for catching this.

Anonymous said...

Do you have your test code available for download? I am dilligently trying to recreate your experiment, but have several pieces missing (AbstractSimilarity, JAMA vs common.math, etc). Would it be possilbe to zip up this experiment?

Sujit Pal said...

I don't yet, but should have soon. I have been reusing some of my existing code in more recent blogs, and its hard to remember all the dependencies, and I guess it gets confusing for you. You are not the first person to ask, but so far I've been too lazy to package it up properly...but this morning, I have requested project space on sf.net, and I have my code ready for upload into their repository, so once (and if) I get approval (they say 3 biz days), the code will be at jtmt.sf.net.

Anonymous said...

could u let me know how to obtain the test data? sugar-coffee-cocoa-docs.txt and the 54 articles?

Sujit Pal said...

@Anonymous: its in the project at jtmt.sf.net, look under src/test/resources. I copied this from textmine.sf.net project so you will also find it in there.

Anonymous said...

I'm quite new to IR so got a question. It seems to me that your code didn't use "idf" for documents to be classified.

In function "buildMatrixFromIndex", the line "docMatrix=normalizeWithTfIdf(docMatrix)" will just simply do normalizing with tf only because docMatrix is a column matrix.

If so, is there any easy way to modify the code?

If not so, sorry for my misunderstanding.

Sujit Pal said...

Hi, good observation, actually didn't think of that when I was writing the code. However, given the situation - I am trying to classify documents based on the similarity of one document's term vector to another's, IDF probably doesn't make sense. Normalizing the term frequencies across the document, however, does make sense, since we want to compare them in the same space, and density rather than count is what we should compare. What do you think?

Anonymous said...

I got your idea. It's logical.
However, I still don't understand the IR principle in general. Perhaps you have some answer.

Suppose there are only two terms "the" and "dog" in a document.
Now imagine that we are going to visualize the document vector in 2D space of "the" (X axis) and "dog" (Y axis).

Without "idf", we have a vector with the angle of 45. However, with idf, we get one with a larger angle, assuming that the importance of "the" is suppressed by "idf".

The question is: cosine similarity measure should yield different scores for these different angles unless you use the same normalizing scheme for both training and testing documents, right?

Also I found another more confusing statement (to me) in a textbook: "Introduction to IR" by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze.

"For example, a very standard weighting scheme is lnc.ltc, where the document vector has log-weighted term frequency, no idf (for both effectiveness and efficiency reasons), and cosine normalization, while the query vector uses log-weighted term frequency, idf weighting, and cosine normalization"

http://nlp.stanford.edu/IR-book/pdf/06vect.pdf pp.128

They use idf for query, but not for documents. Interesting.

So, it comes to a conclusion that it's not uncommon to compare query vector with document vector using different schemes of normalization although I got confused myself.

Sujit Pal said...

Nope, I don't understand it either, but it could very well be because I am not as smart/knowledgeable enough about this stuff as the author. I've been meaning to start reading the IR book you mention myself, so far haven't gotten around to it. But using different normalization schemes and being able to compare the two reliably seems counter-intuitive to me.

Also, the query vector is for a single document, right? In that case, IDF is a no-op (something you discovered in my code), so perhaps that is how this scheme seems to work?

Dave said...

Really appreciate what you have done with your jtmt project. Am currently using your vector space classifier for some testing and having very good success with it. Am wondering if you can think of an easy was to integrate in 10 fold cross validation. Believe I want to read in the whole dataset into an index and then partition the index into 10 individual indexes. Would then somehow train/test on these. Just having a bit of a hard time with the implementation details. Any help would be greatly appreciated. Keep up the good work.

Sujit Pal said...

Thanks Dave. My example works off a set of pre-classified documents and attempts to match a new document to one of the classes. But your approach using 10-fold cross validation is probably more useful for when you have no classified documents to begin with. Let me think about it and see if there is some way the code can be adapted to use that.

Sujit Pal said...

Dave, I checked out cross-validation based on your comment above, and added a cross-validation method to the classifier. You've probably already solved this yourself, but here is a link to the post describing this.

Elif said...

Hello,

Thank you for this post. It's really informative and well-explained.

I downloaded the jtmt code. I wanted to play with it a bit but I get an error which says:

The type org.mortbay.component.AbstractLifeCycle cannot be resolved. It is indirectly referenced from required .class files JohHandler.java jtmt/src/main/java/net/sf/jtmt/httpwrapper line 23

I found out that this is about jetty-6.1.5.jar . But the AbstractLifeCycle is supposed to be in it. I tried to compile with jetty-6.1.6 which didn't work. Then with java5 instead of java6, which didn't work again.

Do you have any idea what the problem might be?

(I guess this is not the right place for this question, but I couldn't find any more suitable post.)

Sujit Pal said...

Thanks Elif. A more relevant place for this comment is this page. To answer your question, though, this class is in jetty-util in my 6.1.5 version - its not declared in the POM because Maven2 takes care of transitive dependencies.

- said...

I am having a problem with the line in the CosineSimilarity class.

RealMatrix dotProduct = new SparseRealMatrix(maxRows, maxCols);

The apache common math jar does not contain a class SparseRealMatrix. I also used your libs from sourceforge. Can you help further? Thanks in advance

Anonymous said...

Hello,
how can i build jtmt project .. i m using
Vector Space Classifier for focused crawler

Sujit Pal said...

@-: very imaginative name, btw :-). So heres the story with SparseRealMatrix... I needed one, so I built one in commons-math and contributed it back. However, the developers there are more knowledgeable than me about the ripple effect of adding this kind of thing, so they added in the functionality but under a different class name - I forget exactly the usage, but check the commons-math 2.0+ javadocs and you should be able to figure it out.
@Anonymous: there is a pom.xml - but it contains jar references that are not necessarily public. So I would suggest building an Ant buildfile using mvn ant:ant, then in the classpath, point everything to the lib directory, then run ant jar.

Anonymous said...

Dear sijit pal, your code for vector space classifier using lucene help me a lot. Do u have a code for gram schmidt process. If yes could you post that in the block.

Anonymous said...

Hi Pal:
we are new to IR and Lucene, yet we need to build a searching system, scoring the similarity only by the positions of the terms and with NO regard of TF/IDF.
e.g. suppose we have
q:{t1,t2,t3}
d1:{t1,t3,t2,t4}
d2:{t1,t3}
d3:{t1,t2,t5,t3}
d4:{t4,t1,t2,t3}
then the result is to be
d4>d3>d1>d2
just like a comparison among different bus paths and every single path consists of many stops, quite straightforward; however, we didn't find yet any lucene API fit for this job.
Would you be kind to give us some advice for our task? Any suggestion is appreciated.
Thank you.

Jack H.

Sujit Pal said...

Thanks for the feedback, glad it was helpful. I did not know about the Gram-Schmidt process, thanks for the pointer - if I ever end up building a classifier based on it, I will post it.

Sujit Pal said...

This answer is for the comment about looking for a lucene query to compare bus stops (2 comments up).

This is just a suggestion, haven't tried it yet, so not sure... would it make sense to store the tn values in a multi-field along with its position, and then search on them with an OR query? And have a custom similarity class which boosts based on the ordering. There is an example of something similar in the Lucene in Action (Second Edition) book.

Anonymous said...

Hi Pal:

Heading with your suggestion, we do also wish to have further hints on..
1. the topic or the sample talking about similar requirement in Lucene in Action II.
2. we are not sure how we can do this by boosting on order, for, as far as our understanding, there seems no mechanics for boosting query on order, hence much likely we cannot have a comparison according to the query content.

We do suffer much from these issues, and that's why we keep trying to ask further advice.
We hope it did not make you feel bothered.

Thank you again for your kindness.

Jack H.

Sujit Pal said...

Hi Jack, I believe its section 13.4: Searching entities with SIREn, I am guessing it will need to be adapted for your purpose. Another possibility may be SpanQueries - I want to learn about them myself, so I will use it to try out your use case and see if it would work.

Anonymous said...

Hi, I've assignment about lucene, and I accidentally found your site. I've copied the codes above, I have download the packages you uploaded before, but the program still can't run because I can't import the packages. Can you tell me how to convert a file into a package like you did in the program?? I use JCreator as my editor. Thanks a lot..

Sujit Pal said...

@Jack: I think I may have (at least partial) solution to your problem on this post.

@Anonymous: I won't argue about the ethics of what you are asking, but you do realize that you are shooting yourself in the foot, right? That is, unless you are not really looking for a programming career? Anyway, to answer your question, take a look at the JTMT project, that has a pom.xml which takes care of the packaging (mvn jar:jar).

Teeraphol said...

do u have the .zip file for vector space model for searching in java ???

Sujit Pal said...

No, but you can find the code in the repository for the JTMT project on Sourceforge.

deathlike_silence said...

Hi Sujit!

Great post, great blog!
I was wondering if it feasible to create a binary vector space classifier without training negative documents based on your code.
What are your thoughts on that?

Regards,
Yiannis

Sujit Pal said...

Thanks Yannis. I guess you can do some form of binary classification (although it seems to be more like clustering) by specifying a radius cutoff, ie if certain documents do not fall within a certain distance in the term space from the centroid of the cluster of positive docs, then it is a negative doc. But then you would have to analyze your data to find what the radius cutoff should be.

Anonymous said...

Hi Sujit!

I used your technique in a project in my postgraduate program. I used wikipedia and it's categories for training. As you mentioned, I had to modify some things, in order to store the centroids in SQL.

The program returns as a category one of the top categories of eurovoc ( http://eurovoc.europa.eu/drupal/?q=download/subject_oriented&cl=en )

You can test it here:
http://143.233.226.74:8080/NLPproject/categorization.jsp

BR,
Yiannis

Sujit Pal said...

Hi Yannis, this is very cool, congratulations! I tested your categorizer with a few news items from Google, and as long as I stay within the classifications of the Eurovoc thesauri, it seems to be pretty accurate. With something outside (such as sports), it still classifies (wrongly) but with a sufficiently low score to indicate a low confidence. Great stuff, thanks for sharing!

deathlike_silence said...

Hi Sujit!

I don't have the source code for doing this in a public repository. If at any time are you interested in it just let me know. My email is johngouf@iit.demokritos.gr

BR,
Yiannis

Sujit Pal said...

Thanks Yannis, and you are welcome. Once again, great work. Not sure if you are okay with your email address being public, let me know if you are not and I will delete your last comment.

Anonymous said...

hai sujit sir..
how can i build this jtmt project
for Vector Space Classifier.
is it possible with out lucene.. if it is possible how can i import org.apache.commons.* and
org.apache.lucene.*,ect...

Sujit Pal said...

Hi Anonymous, in this particular case we are using Lucene's inverted index to find term frequencies to do the classification. There are definitely other ways to do this, including without using Lucene. The JTMT project can be built using ant, you will find the lucene jars in the lib directory.

Shereen Albitar said...

Hi,
I've been testing this classifier for a while for my research project. The cosinus similarity demands that both matrix has the same dimensions. I would like to know how the classifier can ensure that while centroids are calculated apart from the vector to classify.

Traca said...

Hi. I am a Student and i need to build a classifier. I´ve tryed run run and test your code but, it was not possible. How can i run it in NetBeans? i´ve downloaded everything. Thanks

Sujit Pal said...

@Shereen: I think I ensure that by considering all the terms and all the documents in the vector space for each class. So the centroid for class A is a point in the same TD vector space as the centroid for class B. I use sparse matrices though. Another way to weed out non-interesting terms is to use some form of feature reduction. A simple way is to eliminate any term which scores below a certain number.

@Traca: not sure what your comfort level with Java is, so not sure how to interpret your "... was not possible" comment. There is a build.xml file from which it should be possible to compile on the command line. If you are saying compile failed from the IDE, make sure that your IDE's project classpath contains the jars in the lib directory. Also I mostly run using JUnit tests (which can be run from the command line using ant unittest). Hope this helps...

Shereen Albitar said...

Thanks for your answer. As for centroid matrices during training, the same vector space is used but the point that is not clear for me is how this is applied to new documents during test(classification). Their indexes are created independently to the centroid matrix. Then their vectors are compared to those of centroids using cosinus similarity measure that demands that both compared vectors has to be created in the same vector space.
My question concerns classify function in LuceneVectorSpaceModelClassifier class.

Traca said...

Hi. Thanks for the answer. I just put the classes in a project (NetBeans) and try to compile it. There is no error from the compiler,and i´´ve made a class where i call your test class. I did not try wiht the command line yet.

chandu said...

hai sujith sir...

i'm a student.. using eclipse tool.i Run entire JTMT(but i want to run classifier only)project As JUnit Test i'm getting Java Build path Probleams(100 of 111 items)
saying
Project 'jtmt' is missing required library: '\Users\sujit\Healthline\prod\common\lib\external\dyuproject-openid-1.1.7.jar' ....etc..
eventhough i add jar to Libraries....

Sujit Pal said...

@Shereen: good catch, its probably a bug in the code, thanks for pointing it out. Since the new document is put into a ram index, its term frequency vector is going to have different (subset of) terms than in the main document index. The buildDocumentFromIndex method should probably merge the terms from the index vector and those from the document (and set the ones from the document to 0).

@Traca: thanks for the feedback, good to know that you are (at least for the moment) all set.

@Chandu: Sorry, that reference to the openid.jar must have been added erroneously. Delete this jar reference from the .classpath file (root of the project). I tried updating it and committing to SVN but its not recognizing me at the moment.

VIDYA BHUSHAN SINGH said...

Hi,

I am new to test mining area, but I have to use it for my project. Your blog is the best source of information and tools. I am trying to run the jtmt program using Netbeans, but I am getting following errors. can you identify what could be the problem. There are three directories in the scc-index directory containing the segment files, I am using the same code provided by you, but I don't know why I am getting this error.

what is segment file and how can I generate that for initial training.

thanks

Exception:
11/12/27 05:15:11 INFO classifiers.LuceneVectorSpaceModelClassifier: Classifier training started
java.io.FileNotFoundException: no segments* file found in org.apache.lucene.store.SimpleFSDirectory@C:\Users\VIDYA\Documents\NetBeansProjects\PalSoftware\resources\data\scc-index: files: [doc_cocoa.bin, doc_coffee.bin, doc_sugar.bin, term_cocoa.bin, term_coffee.bin, term_sugar.bin]
at org.apache.lucene.index.SegmentInfos\$FindSegmentsFile.run(SegmentInfos.java:628)
at net.sf.jtmt.classifiers.LuceneVectorSpaceModelClassifier.train(LuceneVectorSpaceModelClassifier.java:156)
at net.sf.jtmt.classifiers.LuceneVectorSpaceModelClassifier.train(LuceneVectorSpaceModelClassifier.java:143)
at net.sf.jtmt.classifiers.LuceneVectorSpaceModelClassifierTest.testLuceneVectorSpaceClassifier(LuceneVectorSpaceModelClassifierTest.java:119)
at net.sf.jtmt.classifiers.LuceneVectorSpaceModelClassifierTest.main(LuceneVectorSpaceModelClassifierTest.java:165)

Sujit Pal said...

Hi Vidya, read through my code again, and I believe the scc-index directory contains 3 lucene indexes, one for each category. Basically the training phase reads the indexes for each category and calculates the centroids to compare against the new input to be classified. The code seems to be pointing at the parent of these 3 directories, not the index directories themselves), and not finding the index as a result. Segment files are lucene indexes - an index is a directory containing one or more segment files.

venkat said...

I want to understand the source code for vector space model only ,so from where i have to start reading.

Sujit Pal said...

Hi Venkat, I think the classifier code itself should be enough. The basic premise is to feed the classifier a set of documents in each category, allow it to compute centroids for each category, then pass it one or more test documents and ask it what the category is, which is the one whose centroid is closest to the documents location in term space. From experience, the hardest thing to wrap your head around is the n dimensional term space idea - each term (that is considered for calculation) in the document corresponds to a dimension - once you get that the rest is gravy.

venkat said...

Thank you sir. I understood the source code and used it in my project.

Sujit Pal said...

You are welcome Venkat.

mahesh said...

Can you please post text classification code using a Support Vector Machine (SVM)? I badly need it - can you please help me?

Sujit Pal said...

Hi Mahesh, sorry, I don't know too much about SVMs. I'll probably check them out - I've heard about this from someone else as well, so it's probably worth looking into... but I won't be able to do this right away.

Anonymous said...

Hi, can I get the source code? Thank you very much :)

Sujit Pal said...

Sure, it's on jtmt.sf.net's SVN.

Anonymous said...

Thanks for this great post!

I got the example up and running just fine. However, when I try to use my own data, which is quite large, I get a NumberIsTooLargeException when I try to create an OpenMapRealMatrix(col, row). The reason for this exception is obvious: col * row > Integer.MAX_VALUE.

Are you aware of any solutions that work around this limit and still use OpenMapRealMatrix?

A possible solution I am thinking of is to split the total OpenMapRealMatrix into parts, but a simpler solution would be great.

Sujit Pal said...

Thanks, you are welcome. I have never encountered this error - you must be dealing with huge matrices, and I am (pleasantly) surprised that my code works with such large data sizes. However, I looked through the OpenMapRealMatrix Javadocs, and it seems that NumberIsTooLargeException is thrown in cases where the condition you refer to may not necessarily be true. Unless you are certain that you are hitting the col * row > Integer.MAX_VALUE case, you may want to check the method call at the line number the exception is thrown from. Of course, in case it /is/ a size issue, one other solution (probably easier than splitting and merging) is to consider building an OpenMapRealMatrix implementation that uses BigInteger for the row and column indexes.
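As a side note on that size condition: the check itself is easy to get wrong, because computed in int arithmetic, col * row silently overflows. A minimal sketch (the class and method names here are made up for illustration):

```java
// Illustrative check for whether a rows x cols sparse matrix stays within
// the Integer.MAX_VALUE cell-count limit. The product must be computed in
// long arithmetic, or it silently overflows int.
public class MatrixSizeCheck {

  static boolean fitsIntCellCount(int rows, int cols) {
    return (long) rows * (long) cols <= Integer.MAX_VALUE;
  }

  public static void main(String[] args) {
    // 40000 * 40000 = 1.6e9, which fits under 2^31 - 1.
    System.out.println(fitsIntCellCount(40_000, 40_000));
    // 50000 * 50000 = 2.5e9, which does not.
    System.out.println(fitsIntCellCount(50_000, 50_000));
  }
}
```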

Vignesh Srinivasan said...

Is the algorithm the same as Rocchio classification? I have tried some general Wikipedia documents and it seems to work well. But will it be useful for email classification on a large scale?

Sujit Pal said...

Yes, it is the same as Rocchio classification. Once the centroids are computed from the training data, each new point is classified based on the closest centroid. To answer your second question, I think it will depend on your scale - the training phase is memory intensive, but if your training set is not too large, it should work out fine. The classification stage takes one document at a time and constructs its term vector in memory, so that should work out fine too.

Vignesh Srinivasan said...

Hi Sujit,

How can I store the training model in a DB or as a serialized object, so that I don't need to do the training again?

Sujit Pal said...

The trained model is contained in the Lucene index, so it's already serialized.

Vignesh Srinivasan said...

Hi Sujit,
I understand we use the Lucene index for training. You wrote:

"In real-world systems, you may want to train the classifier once and then reuse it many times over different documents, possibly over a period of days or months, so its probably better to store this data in a database table or some other persistent medium. If you go the database table route, you can coalesce the two data structures needed by the classify method into a single table by keying the centroid coordinates off the term itself."

I did not understand the above point completely. Can you please explain how I can do this? It will be really useful.

Sujit Pal said...

Well, during the classification step, you are building the document vector for the document in question, then comparing it against the document vectors corresponding to each of your cluster centroids. So you could extract the cluster centroids into a database table. Currently, each cluster centroid looks like this:

{"A": [x1, x2, x3, ..., xn]}

where A corresponds to the cluster label and the xi elements stand for the values of the i-th dimensions of the centroid document vector. The other data structure is the term-id map, which maps {term: position}, where position corresponds to i in the cluster centroid data above. You will need the {term: position} mapping when you want to create the document vector for the input document.

So I guess what I was trying to say is that if you use a database table, you can get rid of the i value and map the term directly to the cluster, i.e., the data structure you could use would look like this:

{"A": {t1: x1, t2: x2, ..., tn: xn}}

Obviously the associated code would need to change accordingly...
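For illustration, here is a minimal in-memory sketch of that coalescing step - folding the {label: [x1, ..., xn]} centroids and the {term: position} map into the single {label: {term: weight}} structure described above. All class, method, and variable names below are made up for this example; they are not part of the jtmt code:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch: merge the per-cluster centroid arrays with the
// {term: position} map, so each centroid weight is keyed off the term
// itself. The result maps directly onto a (label, term, weight) table.
public class CoalesceCentroids {

  static Map<String, Map<String, Double>> coalesce(
      Map<String, double[]> centroids,       // {"A": [x1, ..., xn]}
      Map<String, Integer> termPositions) {  // {term: i}
    Map<String, Map<String, Double>> result = new LinkedHashMap<>();
    for (Map.Entry<String, double[]> c : centroids.entrySet()) {
      Map<String, Double> termWeights = new LinkedHashMap<>();
      for (Map.Entry<String, Integer> t : termPositions.entrySet()) {
        // Look up the term's weight by its position in the centroid array.
        termWeights.put(t.getKey(), c.getValue()[t.getValue()]);
      }
      result.put(c.getKey(), termWeights);
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, double[]> centroids = new LinkedHashMap<>();
    centroids.put("A", new double[] {0.5, 0.2, 0.3});
    Map<String, Integer> positions = new LinkedHashMap<>();
    positions.put("cocoa", 0);
    positions.put("coffee", 1);
    positions.put("sugar", 2);
    System.out.println(coalesce(centroids, positions));
  }
}
```

With this shape, classifying an input document no longer needs the separate position map: you look up each of the document's terms directly in each cluster's {term: weight} map.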

Zainil Abidin said...