Comments on Salmon Run: Binary Naive Bayes Classifier using Lucene
Sujit Pal (http://www.blogger.com/profile/06835223352394332155)
Feed updated: 2024-03-05T03:17:02.289-08:00

Sujit Pal, 2013-09-22 09:43:
I think it may be because I am accumulating the (term, score) tuples into the trainingSet map. If you change the code to write these out to a database table, it will probably work out. Remember to batch up your inserts during training and create your index only after training is complete, otherwise the training process can become very slow.

Anonymous, 2013-09-21 21:17:
Awesome post. I am trying to classify documents with a large data set (the training set is about 6M with about 50k categories). I see that this approach would throw an out of memory error. Any thoughts on how to do it?

Thanks

Sujit Pal, 2012-04-22 12:08:
Hi Hardik, the files I have in there are pulled from the data in the TMAP book, and they have a considerable amount of overlap because the subject areas overlap considerably, which is why I did the feature selection. Even so, the results are not perfect - I did not do any calculations to find out how accurate the classifier is - I did not know how to at that point.
One useful starting point would be...

Hardik, 2012-04-19 22:54:
Hi Sujit, please help.
I tried your above logic with cocoa as the classification in C# with feature selection=true.
My results are:
cocoa.txt=true
cocoa1.txt=true
cocoa1.txt=true
coffee.txt=true
coffee1.txt=false
Looking at your result, only coffee.txt shows the wrong result.

Now when I classify against the coffee category, my results are:
cocoa.txt=true
...

Hardik, 2012-04-19 04:30:
Hi Sujit,
I have converted the above code to C# but have not used your summary analyzer. I have used the standard analyzer. I have merged all the different indexes (sugar, cocoa, coffee) into one index located at src > test > resources > data > scc-index. Now I classify the 5 files given by you.
According to your above output, I am getting accurate results only when I pass false,true(...

Hardik, 2012-04-15 23:03:
Continuing the above comment: or, I have tried Lucene for document categorization, which gives me a score according to the document classified.
Can I use the Lucene logic for my document classification, or do I have to use both Naive Bayes and Lucene for document classification?
Please help.
Thanks

Hardik, 2012-04-15 22:59:
Hello Sujit Pal,
Thanks for the reply.
As I don't know Java, I am not able to understand the above code.
I really need the above code to be converted into .NET.
Please help.
Thanks

Sujit Pal, 2012-03-28 16:01:
Hi Hardik, publishing your comment, maybe someone can help you...

Hardik, 2012-03-26 22:39:
It's an awesome blog by Salmon Run.
Can anyone help me out to use the above logic in C#?
It would be very helpful.
Thanks

Sujit Pal, 2012-03-24 13:03:
Hi ch.naveen, I don't have a main class, but the closest to that is ClassiferTest, which is a JUnit test case and which is run using the "unittest" target in the build.xml (calls the Ant junit target). You can use that to build your own main method if you want.

CH.M.N.NAVEEN, 2012-03-21 01:07:
Hello sir,

Can you please tell me which class is the main class of the program?

Regards,
ch.naveen

Sujit Pal, 2011-10-02 10:10:
Hi Sab, you may want to take a look at classifier4j (classifier4j.sf.net).

sab, 2011-09-25 23:46:
Could you tell me where I can find a pure Java solution for a naive Bayes classifier?

Sujit Pal, 2011-09-04 11:26:
Hi Savita, not sure if I can provide additional help. Did the code in this post not help? If not, can you pose specific questions?
There is also a non-Lucene Naive Bayes classifier from classifier4j (classifier4j.sf.net) that may be helpful if you want a pure-Java (i.e. no Lucene) solution.

Savita, 2011-08-29 10:23:
Thank you for the post. I want to implement Bayes' theorem using the Java programming language, can you please help me out? I am a beginner with classifiers and data mining concepts. Can you help me out, if it is OK with you?

Sujit Pal, 2011-06-18 21:42:
Hi Anonymous, thanks, and you are welcome. By "build" do you mean compile? If so, yes, in Eclipse, your build path should contain the contents of the lib directory (the lib directory contains the Lucene jars as well).

Anonymous, 2011-06-16 03:12:
Hi Sujit,

Thanks a lot for your blog.

I am not familiar with Lucene. Is it possible to build the Binary Naive Bayes classifier and all the JTMT (summarizers, classifiers) projects with the Eclipse tool?

Sujit Pal, 2010-05-01 10:46:
Thanks Nakamoto. To run this stuff, I used the JUnit test shown in the post. Did that not run for you? Can you post the stack trace?

nakamoto, 2010-04-05 09:32:
Thanks for your post.
It is necessary for me, but I can't execute your code. Please tell me the details. Thanks!

Sujit Pal, 2010-02-07 12:51:
Thanks, yet another reason to learn how to use Weka... I have been putting it off for a while, have been through the Witten-Frank book, but haven't actually used it for anything.

Unknown, 2010-02-04 12:46:
Sujit,

Thanks for sharing your thoughts.

I recently used the Weka package against the same training set and test set. Weka's Naive Bayes based classifier got 3 out of 5 (coffee.txt, coffee1.txt and cocoa.txt) right. However, Weka's support vector classifier is able to do a better job. It classified all 5 correctly.

Like you said, there might be some areas...

Sujit Pal, 2009-12-25 19:23:
Thanks, Lisong. I suspect that turning on feature selection results in the classifier overfitting the data, so we get stellar results for one category but really bad ones for another. I recently ran a cross validation for another of my "bright ideas" (a Lucene based classifier), and it came back with an accuracy score of 35% :-/.

Anonymous, 2009-12-08 14:09:
Sujit, thanks for the great post!

Did you happen to run this Binary Naive Bayes classifier against the category coffee using the 5 documents: cocoa.txt, cocoa1.txt, cocoa2.txt, coffee.txt and coffee1.txt?

When I tried to run the classifier against category cocoa, I got the same correct result as you posted.

But when I tried to classify these 5 documents against...

Sujit Pal, 2009-11-26 12:38:
Hi thushara, I was trying to derive the formula for r in terms of data values I had already. I looked at the formula you mention above, and I can derive my formula for r from it. That said, my formula for r can be simplified somewhat by factoring out the product so that it becomes a product of fractions.

thushara, 2009-11-23 11:45:
The formula for the ratio (r) seems different from the ratio given by the Wikipedia entry you mention in the post. I think it should be:

r = P(C)/P(!C) * product [P(wi|C)/P(wi|!C)]

Sorry, I can't use the regular notation with this reduced functionality input box.
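
Editorial note: the ratio in thushara's comment can be written out in regular notation as below. This is only a sketch of that one formula. The index n (number of words in the document) and the symbol \neg C for "not in category C" are notational assumptions, and the formula actually used in the post is not reproduced in this comment thread, so nothing here should be read as the post's exact expression.

```latex
% Likelihood ratio for category C versus not-C, as given in the comment
r = \frac{P(C)}{P(\neg C)} \prod_{i=1}^{n} \frac{P(w_i \mid C)}{P(w_i \mid \neg C)}

% Equivalent log form: the product becomes a sum, which avoids
% floating-point underflow when n is large
\log r = \log \frac{P(C)}{P(\neg C)}
       + \sum_{i=1}^{n} \bigl[ \log P(w_i \mid C) - \log P(w_i \mid \neg C) \bigr]

% Illustrative numbers only (not from the post): equal priors and two
% words with per-word ratios 4 and 1/2
r = 1 \cdot 4 \cdot \tfrac{1}{2} = 2
```

Under the usual reading of this ratio, r > 1 (equivalently log r > 0) favours C, so the made-up example above would be assigned to the category. Whether the post thresholds r at exactly 1 is not stated in these comments.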