Comments on Salmon Run: Entity Discovery using Mahout CollocDriver
Blog author: Sujit Pal (http://www.blogger.com/profile/06835223352394332155)
8 comments, newest first. Feed updated 2024-03-05T03:17:02.289-08:00.

Sujit Pal, 2013-10-29T10:15:51.592-07:00
I see, thank you for the detailed explanation. We do something similar (but a cruder version of what you are doing) during our concept mapping process: we restrict analysis to noun phrases (although that part is now disabled because of concerns about throwing the baby out with the bathwater) and we post-process the concepts so that the longest match is considered. Thanks to your explanation, I think

Anonymous, 2013-10-28T11:29:01.967-07:00
Not really synonym normalization. This is how I did it. For example, if I take the document http://www.washingtonpost.com/blogs/the-fix/wp/2013/10/24/john-boehners-next-big-test-immigration-reform/

Just taking the (NN || NNS) and (NNP || NNPS) would give the following candidates (also note that there are some editorial mistakes in them as well :-) which will get discarded)

[

Sujit Pal, 2013-10-26T10:59:29.393-07:00
My goal is more modest: it's to help human taxonomy editors discover new concepts in a corpus of text. Do you normalize synonyms automatically? If you do, I would appreciate some pointers.

Anonymous, 2013-10-24T18:08:13.259-07:00
I used it to normalize concepts extracted via POS tagging, since I did not want extraneous words. For example, if a doc contains several variations of a concept (computer security, computer vulnerability, hacking, computer system vulnerability, etc.), I use this to normalize and find base concept

Sujit Pal, 2013-10-24T10:08:35.559-07:00
Hi Ravi, it looks interesting, similar to RAKE with the rules but uses ngrams (which almost every other approach uses). I was going to try implementing and benchmarking it against some job description data I had (from the Adzuna challenge on Kaggle) and see how it compares with some of the other approaches.

Anonymous, 2013-10-23T11:22:32.978-07:00
Let me know your thoughts, as everybody understands an algo differently. I just want to corroborate my understanding ;-)

Sujit Pal, 2013-10-21T18:55:34.762-07:00
Thanks Ravi, this looks very interesting. Definitely something worth trying out.

Anonymous, 2013-10-21T08:43:04.931-07:00
Hello Sujit,
I tried several algorithms (RAKE, PMI, N-Grams, Maximum Entropy, etc.) for concept/theme extraction from document texts and found this decent paper from Stanford, which gave reasonably good results although the algorithm itself is pretty basic.

http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=85057D4ADAAD516A5F763D7EC94F5B66?doi=10.1.1.173.5881&rep=
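
For readers following the thread, here is a minimal sketch of the two ideas mentioned in the comments: keeping only noun-tagged tokens (NN, NNS, NNP, NNPS) as concept candidates, and post-processing so that the longest match wins. It is not the code used by either commenter. The input is assumed to be already POS-tagged with Penn Treebank tags, and the function names `noun_runs`, `is_subphrase` and `keep_longest` are hypothetical, chosen only for this illustration.

```python
# Assumption: tokens arrive as (word, tag) pairs from some POS tagger.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}


def noun_runs(tagged):
    """Group maximal runs of consecutive noun-tagged tokens into phrases."""
    phrases, current = [], []
    for word, tag in tagged:
        if tag in NOUN_TAGS:
            current.append(word.lower())
        elif current:
            # A non-noun token closes the current candidate phrase.
            phrases.append(tuple(current))
            current = []
    if current:
        phrases.append(tuple(current))
    return phrases


def is_subphrase(short, long):
    """True if `short` occurs as a contiguous word sequence inside `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def keep_longest(phrases):
    """Drop any candidate that is contained in a longer kept candidate."""
    # Longest first, then alphabetical, so the result order is deterministic.
    unique = sorted(set(phrases), key=lambda p: (-len(p), p))
    kept = []
    for phrase in unique:
        if not any(is_subphrase(phrase, k) for k in kept):
            kept.append(phrase)
    return [" ".join(p) for p in kept]


# Hand-tagged toy input, made up for this example.
tagged = [
    ("Computer", "NN"), ("security", "NN"), ("is", "VBZ"), ("hard", "JJ"),
    ("computer", "NN"), ("system", "NN"), ("vulnerability", "NN"),
    ("reports", "NNS"), ("mention", "VBP"), ("security", "NN"),
]
candidates = keep_longest(noun_runs(tagged))
print(candidates)
```

In this toy input the standalone "security" is dropped because it is contained in "computer security". Mapping variants such as "computer vulnerability" and "hacking" onto one base concept, as described in the thread, would need an additional step that this sketch does not attempt.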