Comments on Salmon Run: "Vector Space Classifier using Lucene"

Sujit Pal (2014-02-05):
Hi Zainil, I'm not sure what you are looking to do. The categories come from the training data; if you want something different, you can change it to something else in the training data.

Anonymous (2014-02-05):
Please help me: how do I change the new category from "cocoa" to another category?

Sujit Pal (2013-02-02):
Well, during the classification step you are trying to build the document vector for the document in question, then comparing it against the document vectors corresponding to each of your cluster centroids. So you could extract the cluster centroids into a database table. Currently, each cluster centroid looks like this:

{"A": [x1, x2, x3, ..., xn]}

where the A ...

Vignesh Srinivasan (2013-02-01):
Hi Sujit, thanks for your reply. I understand we use the Lucene index for training. Quoting the post: "In real-world systems, you may want to train the classifier once and then reuse it many times over different documents, possibly over a period of days or months, so it's probably better to store this data in a database table or some other persistent medium. If you go the database table route, you can ..."

Sujit Pal (2013-02-01):
The trained model is contained in the Lucene index, so it's already serialized.
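For the storage question discussed above, one simple approach is to flatten each category's centroid into a delimited string that can be stored one row per category and parsed back at startup. This is purely an illustrative sketch, not part of the JTMT code; the text format here is an assumption:

```java
import java.util.Arrays;
import java.util.stream.Collectors;

/** Round-trip a category centroid to a flat text form, e.g. for a
 *  (category, weights) database table with one row per category. */
public class CentroidStore {

    /** Serialize a centroid as "x1,x2,...,xn". */
    public static String serialize(double[] centroid) {
        return Arrays.stream(centroid)
                     .mapToObj(d -> Double.toString(d))
                     .collect(Collectors.joining(","));
    }

    /** Parse a stored row back into a centroid vector. */
    public static double[] deserialize(String row) {
        String[] parts = row.split(",");
        double[] centroid = new double[parts.length];
        for (int i = 0; i < parts.length; i++) {
            centroid[i] = Double.parseDouble(parts[i]);
        }
        return centroid;
    }
}
```

With centroids persisted this way, the training phase runs once and the classifier only needs to load and parse the stored rows on startup.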
Vignesh Srinivasan (2013-02-01):
Hi Sujit, thanks for your sincere reply. How can I store the training model in a database or as a serialized object, so that I don't need to do the training again? Please kindly suggest.

Sujit Pal (2013-01-26):
Yes, it is the same as Rocchio classification. Once the centroids are computed from the training data, each new point is classified based on the closest centroid. To answer your second question, I think it will depend on your scale: the training phase is memory intensive, but if your training set is not too large, it should work out fine. The classification stage takes one document at a time ...

Vignesh Srinivasan (2013-01-24):
Is the algorithm the same as Rocchio classification? I have tried some general Wikipedia documents and it seems to work well. But will it be useful for email classification on a large scale?

Sujit Pal (2012-07-11):
Thanks, you are welcome. I have never encountered this error; you must be dealing with huge matrices, and I am (pleasantly) surprised that my code works with such large data sizes. However, I looked through the OpenMapRealMatrix Javadocs, and it seems that NumberIsTooLargeException is thrown in cases where the condition you refer to may not necessarily be true. Unless you are certain that you are ...
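For context on the exception discussed above: as far as I can tell from the Javadocs, OpenMapRealMatrix (Apache Commons Math) addresses cells through a single int index, so its constructor rejects dimensions where rows * cols exceeds Integer.MAX_VALUE. One possible workaround, sketched here from scratch rather than taken from any library, is a map-based sparse matrix keyed by a long cell index:

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal sparse matrix keyed by a long cell index, sidestepping the
 *  int-sized rows*cols limit that OpenMapRealMatrix enforces. */
public class LongSparseMatrix {
    private final long rows, cols;
    private final Map<Long, Double> cells = new HashMap<>();

    public LongSparseMatrix(long rows, long cols) {
        this.rows = rows;
        this.cols = cols;
    }

    private long key(long row, long col) {
        if (row < 0 || row >= rows || col < 0 || col >= cols) {
            throw new IndexOutOfBoundsException(row + "," + col);
        }
        // row * cols fits comfortably in a long for any realistic term-document matrix
        return row * cols + col;
    }

    public void setEntry(long row, long col, double value) {
        if (value == 0.0) {
            cells.remove(key(row, col));   // keep only non-zero cells
        } else {
            cells.put(key(row, col), value);
        }
    }

    public double getEntry(long row, long col) {
        return cells.getOrDefault(key(row, col), 0.0);
    }
}
```

A 100,000 x 100,000 term-document matrix overflows the int product but works here, since only the non-zero cells are stored.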
Anonymous (2012-07-04):
Thanks for this great post! I got the example up and running just fine. However, when I try to use my own data, which is quite large, I get a NumberIsTooLargeException when I try to create an OpenMapRealMatrix(col, row). The reason for this exception is obvious: col * row > Integer.MAX_VALUE. Are you aware of any solutions that get around this limit and still use OpenMapRealMatrix?

Sujit Pal (2012-04-04):
Sure, it's on jtmt.sf.net's SVN.

Anonymous (2012-04-03):
Hi, can I get the source code? Thank you very much :)

Sujit Pal (2012-03-17):
Hi Mahesh, sorry, I don't know too much about SVMs. I'll probably check it out; I heard about this from someone else as well, so it's probably worth looking into. But I won't be able to do this right away.

mahesh (2012-03-17):
Can you please post text classification code using a Support Vector Machine (SVM)? I badly need it. Can you please help me? Thanks in advance.

Sujit Pal (2012-01-20):
You are welcome, Venkat.

venkat (2012-01-19):
Thank you, sir. I understood the source code and used it in my project.

Sujit Pal (2012-01-14):
Hi Venkat, I think the classifier code itself should be enough. The basic premise is to feed the classifier a set of documents in each category, allow it to compute centroids for each category, then pass it one or more test documents and ask it for the category, which is the one whose centroid is closest to the document's location in term space. From experience, the hardest thing to wrap your ...

venkat (2012-01-09):
I want to understand the source code for the vector space model only, so where should I start reading? I downloaded the entire source code of this project.

Sujit Pal (2011-12-30):
Hi Vidya, I read through my code again, and I believe the scc-index directory contains 3 Lucene indexes, one for each category. Basically, the training phase reads the indexes for each category and calculates the centroids to compare against the new input to be classified. The code seems to be pointing at the parent of these 3 directories (not the index directories themselves), and not finding the ...
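The premise described above (one centroid per category, classify by the closest centroid) can be sketched in a few lines. This toy version works on dense double[] vectors rather than Lucene term vectors, so it only illustrates the math, not the actual JTMT code; the category names used are made up:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy nearest-centroid (Rocchio-style) classifier over dense term vectors. */
public class CentroidClassifier {
    private final Map<String, double[]> centroids = new HashMap<>();

    /** Average the training vectors of each category into a centroid. */
    public void train(Map<String, List<double[]>> trainingDocs) {
        for (Map.Entry<String, List<double[]>> e : trainingDocs.entrySet()) {
            List<double[]> docs = e.getValue();
            double[] centroid = new double[docs.get(0).length];
            for (double[] doc : docs) {
                for (int i = 0; i < doc.length; i++) centroid[i] += doc[i];
            }
            for (int i = 0; i < centroid.length; i++) centroid[i] /= docs.size();
            centroids.put(e.getKey(), centroid);
        }
    }

    /** Return the category whose centroid has the highest cosine similarity. */
    public String classify(double[] doc) {
        String best = null;
        double bestSim = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, double[]> e : centroids.entrySet()) {
            double sim = cosine(doc, e.getValue());
            if (sim > bestSim) { bestSim = sim; best = e.getKey(); }
        }
        return best;
    }

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-12);
    }
}
```

Training reduces each category to a single vector, so classification is just one cosine computation per category, which is why the classification stage is cheap compared to training.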
VIDYA BHUSHAN SINGH (2011-12-27):
Hi, I am new to the text mining area, but I have to use it for my project. Your blog is the best source of information and tools. I am trying to run the jtmt program using NetBeans, but I am getting the following errors; can you identify what the problem could be? There are three directories in the scc-index directory containing the segment files. I am using the same code provided by you, but ...

Sujit Pal (2011-06-25):
@Shereen: good catch, it's probably a bug in the code; thanks for pointing it out. Since the new document is put into a RAM index, its term frequency vector is going to have different (a subset of) terms than the main document index. The buildDocumentFromIndex method should probably merge the terms from the index vector and those from the document (and set the ones missing from the document to 0).

chandu (2011-06-25):
Hi Sujit sir, I'm a student using Eclipse. I ran the entire JTMT project (but I want to run the classifier only) as a JUnit test, and I'm getting Java build path problems (100 of 111 items) saying: Project 'jtmt' is missing required library: '\Users\sujit\Healthline\prod\common\lib\external\dyuproject-openid-1.1.7.jar', etc. Even though I ...

Traca (2011-06-20):
Hi, thanks for the answer. I just put the classes in a project (NetBeans) and tried to compile it. There is no error from the compiler, and I've made a class where I call your test class. I did not try with the command line yet.

Shereen Albitar (2011-06-19):
Thanks for your answer. As for the centroid matrices during training, the same vector space is used, but the point that is not clear for me is how this is applied to new documents during test (classification). Their indexes are created independently of the centroid matrix. Then their vectors are compared to those of the centroids using the cosine similarity measure, which demands that both compared vectors have ...

Sujit Pal (2011-06-18):
@Shereen: I think I ensure that by considering all the terms and all the documents in the vector space for each class. So the centroid for class A is a point in the same term-document vector space as the centroid for class B. I use sparse matrices, though. Another way to weed out non-interesting terms is to use some form of feature reduction. A simple way is to eliminate any term which scores below a certain ...
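The fix Sujit suggests for the bug Shereen found, merging both term sets so the two vectors live in the same space before computing cosine similarity, can be sketched like this. This is a standalone illustration using plain term-frequency maps, not the actual buildDocumentFromIndex code:

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Align two term-frequency maps to a shared vocabulary so their vectors
 *  are comparable, then compute cosine similarity. */
public class VectorAligner {

    public static double alignedCosine(Map<String, Double> a, Map<String, Double> b) {
        // Union of both vocabularies; a term absent from one side counts as 0,
        // which is exactly the merge step the bug fix calls for.
        Set<String> vocab = new TreeSet<>(a.keySet());
        vocab.addAll(b.keySet());

        double dot = 0, na = 0, nb = 0;
        for (String term : vocab) {
            double x = a.getOrDefault(term, 0.0);
            double y = b.getOrDefault(term, 0.0);
            dot += x * y;
            na += x * x;
            nb += y * y;
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-12);
    }
}
```

Without the union step, a vector built from the small RAM index and a centroid built from the main index would have different dimensions, and the cosine computation would be comparing mismatched coordinates.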