Comments on Salmon Run: Hierarchical Agglomerative Clustering with Hadoop

Sujit Pal (2017-03-07 19:51):
One thing to note is that the implementation of the algorithm described here is inefficient. There are more efficient implementations, such as the one in the Packt book Hadoop MapReduce Cookbook (mentioned in an earlier comment).

Hiral (2017-03-07 19:34):
Ohhh great...
Thank you so much, sir, for that :)

Sujit Pal (2017-03-07 08:07):
Hi Hiral, this is covered in the blog post previous to this one: http://sujitpal.blogspot.com/2009/09/generating-term-document-matrix-using.html

Hiral (2017-03-06 23:57):
Hi Sujit,
First of all, a great post. I want to debug your code, and you have mentioned that the input should be raw TFs. Please let me know how I can prepare your input format.

Sujit Pal (2016-01-18 08:00):
Hi vitthal, while the code works, it is quite slow. I would advise using something more efficient. Look around the net; you should find HAC described with code in quite a few textbooks. I found a reference using the Google search "hadoop hierarchical clustering": it's the book at https://www.packtpub.com/big-data-and-business-intelligence/hadoop-mapreduce-cookbook [...]

vitthal (2016-01-18 00:07):
Hi, how can I run this code? I am new to Hadoop. Please help.

Sujit Pal (2010-08-30 12:44):
NOTE: the post above has been deleted because it contained incoherent, expletive-ridden ramblings in Hinglish (Hindi using the English alphabet). I deleted it because (a) it was offensive, (b) it contained no useful information, and (c) it would be incomprehensible to most English-speaking readers.

@Vijay: From your post, it appears that you could not make the code work. I [...]

vijay dinanath chauhan (2010-08-30 02:25):
This comment has been removed by a blog administrator.

Sujit Pal (2010-05-01 11:26):
Hi abeppu, thanks for your detailed and thoughtful comment. I will try to answer to the best of my ability.

1) I wrote my own identity reducer because one was not available with the new API when Hadoop 0.20 had just come out. I haven't looked at Hadoop lately, but I think it's there now.
2) This is from the definition of HAC. At each stage a single item is merged into a cluster. As [...]

abeppu (2010-04-10 20:59):
Hi,

I know this post is rather old, but I just saw it linked to from Hacker News this week. I have to say I'm confused by several points of your approach. Perhaps you could shed some light on the choices you made in your implementation?

Why, for example, did you write your own identity reducer rather than using the one that comes with Hadoop (http://hadoop.apache.org/ [...]

Sujit Pal (2010-03-02 09:53):
Hi Anjana, the HAC algorithm is very simple. You can find a simple explanation on Wikipedia (http://en.wikipedia.org/wiki/Cluster_analysis#Hierarchical_clustering); in fact, it's the top hit on Google for this term.

anjana pai (2010-02-22 04:49):
Hey, I need a program that implements the hierarchical agglomerative algorithm to cluster just a small number of sample datasets. Can you please help me out with it? Also, I want it to be as simple as possible.

Sujit Pal (2010-02-13 11:45):
Hi Alpesh, I am using cosine similarity between the document vectors to compute the similarity between them. Perhaps they are very similar, i.e. below the threshold, and hence clustering stops?
I would suggest checking how many iterations it has run through, etc., in order to debug the issue.

alpesh dhamelia (2010-02-12 15:14):
Hi Sujit,

I followed your blog post for hierarchical clustering, but it did not work for me.

I submitted 5 txt files of articles from Wikipedia (mahatma gandhi.txt, ratan tata.txt, hitler.txt, microsoft.txt, apple.txt),

but when I ran the clustering algorithm, it gave me this output:

gandhi.txt 0.0015904572564612327,0.002584493041749503,5.964214711729623E [...]

Sujit Pal (2010-02-07 13:32):
Hi,

Once you already have most of your clusters (as would be the case for incremental indexing), you would have a good idea of how wide your clusters should be. At that point, I think it would be far simpler to just use that information to figure out if your new document should join an existing cluster or start a new one.

For N >>> k, where N is the number of [...]

Anonymous (2010-02-01 14:09):
Hi Sujit,

How can I do incremental clustering using your code?

Can I handle millions of documents using your code?

Sujit Pal (2009-11-26 12:30):
Thanks, Alex, and you are welcome.

Alex Kamil (2009-11-26 02:16):
Excellent post, thanks a lot,
this is one of the most comprehensive examples of a Hadoop implementation I've seen.
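
Illustration for the incremental-clustering question in this thread: Sujit Pal's 2010-02-07 reply suggests deciding whether a new document should join an existing cluster or start a new one, and his 2010-02-13 reply mentions cosine similarity between document vectors. The sketch below combines the two ideas in plain Python. It is not code from the post: the function names, the use of a cluster centroid, the dict-based term vectors, and the 0.3 similarity cutoff are all assumptions made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = {}
    for v in vectors:
        for t, w in v.items():
            total[t] = total.get(t, 0.0) + w
    return {t: w / len(vectors) for t, w in total.items()}


def assign_incremental(clusters, doc, min_similarity=0.3):
    """Add doc to the most similar cluster, or start a new cluster.

    clusters is a list of lists of vectors and is modified in place.
    min_similarity is a hypothetical cutoff; a real value would come from
    how wide the existing clusters turned out to be.
    Returns the index of the cluster that received the document.
    """
    best_index, best_sim = -1, -1.0
    for i, members in enumerate(clusters):
        sim = cosine_similarity(centroid(members), doc)
        if sim > best_sim:
            best_index, best_sim = i, sim
    if best_index >= 0 and best_sim >= min_similarity:
        clusters[best_index].append(doc)
        return best_index
    clusters.append([doc])
    return len(clusters) - 1


# Two existing single-document clusters (made-up term weights).
clusters = [
    [{"hadoop": 1.0, "cluster": 1.0}],
    [{"salmon": 1.0, "river": 1.0}],
]
joined = assign_incremental(clusters, {"hadoop": 1.0, "mapreduce": 1.0})
started = assign_incremental(clusters, {"guitar": 1.0})
```

Here the first new document shares a term with the first cluster and joins it, while the second shares no terms with any cluster and starts its own. Recomputing each centroid on every call keeps the sketch short; caching centroids would be the obvious change for a larger collection.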