Comments on Salmon Run: "Vector Space Classifier using Lucene"

Sujit Pal (2014-02-05):
Hi Zainil, I'm not sure what you are looking to do. The categories come from the training data; if you want something different, you can change it to something else in the training data.

Anonymous (2014-02-05):
Please help me: how do I change the new category from "cocoa" to another category?

Sujit Pal (2013-02-02):
Well, during the classification step you are trying to build the document vector for the document in question, then comparing it against the document vectors corresponding to each of your cluster centroids. So you could extract the cluster centroids into a database table. Currently, each cluster centroid looks like this:

{"A": [x1, x2, x3, ..., xn]}

where the A ...

Vignesh Srinivasan (2013-02-01):
Hi Sujit, thanks for your reply. I understand we use the Lucene index for training. Quoting the post: "In real-world systems, you may want to train the classifier once and then reuse it many times over different documents, possibly over a period of days or months, so it's probably better to store this data in a database table or some other persistent medium. If you go the database table route, you can ..."

Sujit Pal (2013-02-01):
The trained model is contained in the Lucene index, so it's already serialized.
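For the storage question discussed above, one simple approach is to flatten each category's centroid into a delimited string that can be stored one row per category and parsed back at startup. This is purely an illustrative sketch, not part of the JTMT code; the text format here is an assumption:

```java
import java.util.Arrays;
import java.util.stream.Collectors;

/** Round-trip a category centroid to a flat text form, e.g. for a
 *  (category, weights) database table with one row per category. */
public class CentroidStore {

    /** Serialize a centroid as "x1,x2,...,xn". */
    public static String serialize(double[] centroid) {
        return Arrays.stream(centroid)
                     .mapToObj(d -> Double.toString(d))
                     .collect(Collectors.joining(","));
    }

    /** Parse a stored row back into a centroid vector. */
    public static double[] deserialize(String row) {
        String[] parts = row.split(",");
        double[] centroid = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            centroid[i] = Double.parseDouble(parts[i]);
        }
        return centroid;
    }
}
```

With centroids persisted this way, the training phase runs once and the classifier only needs to load and parse the stored rows on startup.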
Vignesh Srinivasan (2013-02-01):
Hi Sujit, thanks for your sincere reply. How can I store the training model in a database or as a serialized object, so that I don't need to do the training again? Please kindly suggest.

Sujit Pal (2013-01-26):
Yes, it is the same as Rocchio classification. Once the centroids are computed from the training data, each new point is classified based on the closest centroid. To answer your second question, I think it will depend on your scale: the training phase is memory intensive, but if your training set is not too large, it should work out fine. The classification stage takes one document at a time ...

Vignesh Srinivasan (2013-01-24):
Is the algorithm the same as Rocchio classification? I have tried some general Wikipedia documents and it seems to work well. But will it be useful for email classification on a large scale?

Sujit Pal (2012-07-11):
Thanks, you are welcome. I have never encountered this error; you must be dealing with huge matrices, and I am (pleasantly) surprised that my code works with such large data sizes. However, I looked through the OpenMapRealMatrix Javadocs, and it seems that NumberIsTooLargeException is thrown in cases where the condition you refer to may not necessarily be true. Unless you are certain that you are ...
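For context on the exception discussed above: as far as I can tell from the Javadocs, OpenMapRealMatrix (Apache Commons Math) addresses cells through a single int index, so its constructor rejects dimensions where rows * cols exceeds Integer.MAX_VALUE. One possible workaround, sketched here from scratch rather than taken from any library, is a map-based sparse matrix keyed by a long cell index:

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal sparse matrix keyed by a long cell index, sidestepping the
 *  int-sized rows*cols limit that OpenMapRealMatrix enforces. */
public class LongSparseMatrix {
    private final long rows, cols;
    private final Map<Long, Double> cells = new HashMap<>();

    public LongSparseMatrix(long rows, long cols) {
        this.rows = rows;
        this.cols = cols;
    }

    private long key(long row, long col) {
        if (row < 0 || row >= rows || col < 0 || col >= cols) {
            throw new IndexOutOfBoundsException(row + "," + col);
        }
        // row * cols fits comfortably in a long for any realistic term-document matrix
        return row * cols + col;
    }

    public void setEntry(long row, long col, double value) {
        if (value == 0.0) {
            cells.remove(key(row, col));   // keep only non-zero cells
        } else {
            cells.put(key(row, col), value);
        }
    }

    public double getEntry(long row, long col) {
        return cells.getOrDefault(key(row, col), 0.0);
    }
}
```

A 100,000 x 100,000 term-document matrix overflows the int product but works here, since only the non-zero cells are stored.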
Anonymous (2012-07-04):
Thanks for this great post! I got the example up and running just fine. However, when I try to use my own data, which is quite large, I get a NumberIsTooLargeException when I try to create an OpenMapRealMatrix(col, row). The reason for this exception is obvious: col * row > Integer.MAX_VALUE. Are you aware of any solutions that get around this limit and still use OpenMapRealMatrix?

Sujit Pal (2012-04-04):
Sure, it's on jtmt.sf.net's SVN.

Anonymous (2012-04-03):
Hi, can I get the source code? Thank you very much :)

Sujit Pal (2012-03-17):
Hi Mahesh, sorry, I don't know too much about SVMs. I'll probably check it out; I heard about this from someone else as well, so it's probably worth looking into. But I won't be able to do this right away.

mahesh (2012-03-17):
Can you please post text classification code using a Support Vector Machine (SVM)? I badly need it. Can you please help me? Thanks in advance.

Sujit Pal (2012-01-20):
You are welcome, Venkat.

venkat (2012-01-19):
Thank you, sir. I understood the source code and used it in my project.

Sujit Pal (2012-01-14):
Hi Venkat, I think the classifier code itself should be enough. The basic premise is to feed the classifier a set of documents in each category, allow it to compute centroids for each category, then pass it one or more test documents and ask it for the category, which is the one whose centroid is closest to the document's location in term space. From experience, the hardest thing to wrap your ...

venkat (2012-01-09):
I want to understand the source code for the vector space model only, so where should I start reading? I downloaded the entire source code of this project.

Sujit Pal (2011-12-30):
Hi Vidya, I read through my code again, and I believe the scc-index directory contains 3 Lucene indexes, one for each category. Basically, the training phase reads the indexes for each category and calculates the centroids to compare against the new input to be classified. The code seems to be pointing at the parent of these 3 directories (not the index directories themselves), and not finding the ...
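The premise described above (one centroid per category, classify by the closest centroid) can be sketched in a few lines. This toy version works on dense double[] vectors rather than Lucene term vectors, so it only illustrates the math, not the actual JTMT code; the category names used are made up:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy nearest-centroid (Rocchio-style) classifier over dense term vectors. */
public class CentroidClassifier {
    private final Map<String, double[]> centroids = new HashMap<>();

    /** Average the training vectors of each category into a centroid. */
    public void train(Map<String, List<double[]>> trainingDocs) {
        for (Map.Entry<String, List<double[]>> e : trainingDocs.entrySet()) {
            List<double[]> docs = e.getValue();
            double[] centroid = new double[docs.get(0).length];
            for (double[] doc : docs) {
                for (int i = 0; i < doc.length; i++) centroid[i] += doc[i];
            }
            for (int i = 0; i < centroid.length; i++) centroid[i] /= docs.size();
            centroids.put(e.getKey(), centroid);
        }
    }

    /** Return the category whose centroid has the highest cosine similarity. */
    public String classify(double[] doc) {
        String best = null;
        double bestSim = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, double[]> e : centroids.entrySet()) {
            double sim = cosine(doc, e.getValue());
            if (sim > bestSim) { bestSim = sim; best = e.getKey(); }
        }
        return best;
    }

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-12);
    }
}
```

Training reduces each category to a single vector, so classification is just one cosine computation per category, which is why the classification stage is cheap compared to training.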
VIDYA BHUSHAN SINGH (2011-12-27):
Hi, I am new to the text mining area, but I have to use it for my project. Your blog is the best source of information and tools. I am trying to run the jtmt program using NetBeans, but I am getting the following errors; can you identify what the problem could be? There are three directories in the scc-index directory containing the segment files. I am using the same code provided by you, but ...

Sujit Pal (2011-06-25):
@Shereen: good catch, it's probably a bug in the code; thanks for pointing it out. Since the new document is put into a RAM index, its term frequency vector is going to have different (a subset of) terms than the main document index. The buildDocumentFromIndex method should probably merge the terms from the index vector and those from the document (and set the ones missing from the document to 0).

chandu (2011-06-25):
Hi Sujit sir, I'm a student using Eclipse. I ran the entire JTMT project (but I want to run the classifier only) as a JUnit test, and I'm getting Java build path problems (100 of 111 items) saying: Project 'jtmt' is missing required library: '\Users\sujit\Healthline\prod\common\lib\external\dyuproject-openid-1.1.7.jar', etc. Even though I ...

Traca (2011-06-20):
Hi, thanks for the answer. I just put the classes in a project (NetBeans) and tried to compile it. There is no error from the compiler, and I've made a class where I call your test class. I did not try with the command line yet.

Shereen Albitar (2011-06-19):
Thanks for your answer. As for the centroid matrices during training, the same vector space is used, but the point that is not clear for me is how this is applied to new documents during test (classification). Their indexes are created independently of the centroid matrix. Then their vectors are compared to those of the centroids using the cosine similarity measure, which demands that both compared vectors have ...

Sujit Pal (2011-06-18):
@Shereen: I think I ensure that by considering all the terms and all the documents in the vector space for each class. So the centroid for class A is a point in the same term-document vector space as the centroid for class B. I use sparse matrices, though. Another way to weed out non-interesting terms is to use some form of feature reduction. A simple way is to eliminate any term which scores below a certain ...
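The fix Sujit suggests for the bug Shereen found, merging both term sets so the two vectors live in the same space before computing cosine similarity, can be sketched like this. This is a standalone illustration using plain term-frequency maps, not the actual buildDocumentFromIndex code:

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Align two term-frequency maps to a shared vocabulary so their vectors
 *  are comparable, then compute cosine similarity. */
public class VectorAligner {

    public static double alignedCosine(Map<String, Double> a, Map<String, Double> b) {
        // Union of both vocabularies; a term absent from one side counts as 0,
        // which is exactly the merge step the bug fix calls for.
        Set<String> vocab = new TreeSet<>(a.keySet());
        vocab.addAll(b.keySet());

        double dot = 0, na = 0, nb = 0;
        for (String term : vocab) {
            double x = a.getOrDefault(term, 0.0);
            double y = b.getOrDefault(term, 0.0);
            dot += x * y;
            na += x * x;
            nb += y * y;
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-12);
    }
}
```

Without the union step, a vector built from the small RAM index and a centroid built from the main index would have different dimensions, and the cosine computation would be comparing mismatched coordinates.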