Comments on Salmon Run: IR Math with Java : Similarity Measures

Sujit Pal (2018-02-07): Yes, all of these need your input as a matrix. The easiest way to convert your text to a matrix (where each row represents a record and each column represents a word) is to use one of Scikit-learn's vectorizers. In general the frequency distribution of words in text is very long-tailed, so it is customary to look only at the N most important words. There are two approaches, the first is to consider …

Anonymous (2018-02-05): Thank you for your effort. Can you please tell me how to convert my dataset? It is a .csv file with a lot of words, and I want to convert each word to a vector for similarity and text mining. My big problem is converting my dataset; can you help me get started? I have read about bag of words, cosine similarity, and the vector space model, but I can't work out how to start. Thanks.

Sujit Pal (2017-01-30): Hi What2Watch, I wrote this code a long time ago, when I was working almost exclusively in Java and trying to explore text mining. To answer your question, the general steps to go from text files to document matrices are outlined very nicely on the Working with Text Data page (http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html) on the …

What2Watch (2017-01-22): Hello Mr. Pal, thank you for such a great post. I would like to ask something related to my project. I will just use the similarity methods, not all of your work. I have documents like yours, but as one text file containing many lines with sentences, like:

D1: word1, word2, word3, ...
D2: word1, word2, word3, ...
D3: ...

To use your similarity functions, how can I convert …
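The text-to-matrix conversion discussed in these comments can be sketched in plain Python; scikit-learn's CountVectorizer does essentially this (with many more options, such as keeping only the N most frequent words). The two-document corpus here is made up for illustration.

```python
from collections import Counter

def vectorize(docs):
    """Build a document-term matrix: one row per document, one column per word."""
    # Vocabulary over the whole corpus, sorted for a stable column order.
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    rows = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        rows.append([counts.get(word, 0) for word in vocab])
    return vocab, rows

vocab, matrix = vectorize(["the cat sat", "the cat ate the rat"])
print(vocab)      # ['ate', 'cat', 'rat', 'sat', 'the']
print(matrix[1])  # [1, 1, 1, 0, 2]
```

Once every document is a row of counts like this, any of the similarity measures in the post can be applied to pairs of rows.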
Sujit Pal (2016-12-14): Thanks for the kind words, Sandi. Also thanks for the pointer to the Spark cosine similarity code. I had heard of the DIMSUM algorithm, and realized after seeing the link that it is already implemented in Spark. The code you pointed to computes cosine similarity (exact and approximate) between all pairs of sentences in the matrix; each sentence is represented by a column in the matrix. I guess you …

sandi (2016-12-13): Hi Sujit, this is indeed a great post, and it helps in understanding a lot of concepts. On the cosine similarity part: I have a structured dataset of addresses (name, father's name, city, state, country, pincode), and I would like to associate a weight with each of these entities when calculating the cosine similarity between the sentences. I found an approach using Spark MLLIB in the below …

Sujit Pal (2013-10-24): You're welcome. They are not meant to be run by themselves, but as part of an indexing process. You can see how that is set up in SimilarityTest.java (the full source code is available at jtmt.sf.net).

Anonymous (2013-10-22): Yes, I understand; I do the same with English. I have a question: I downloaded the 4 classes, but I cannot run them because none of them has a main method. I am referring to the classes CosineSimilarity, JaccardSimilarity, AbstractSimilarity, and Searcher. Please help me resolve this. THANKS.

Sujit Pal (2013-10-15): Thank you very much for your kind words. My Spanish is not good, but Google Translate is my friend :-).

Anonymous (2013-10-15): Hello, what I have just read is very interesting. I have to work with similarity measures, and having read about cosine similarity, I find your post a great help.

Sujit Pal (2013-08-07): Hi, if you already have an (n x m) term-document matrix of your documents (call it D, with one row per document and one column per term), you can use the same vectorizer to create an (m x 1) vector for your query (call it Q); then you can generate an (n x 1) similarity vector S = D * Q.
Depending on the size of your document set, you can do the computation with dense or sparse matrices in memory, or you could … (http://sujitpal.blogspot.com/2012/07/…)
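The S = D * Q idea above can be sketched with a tiny made-up example. If the rows of D and the query Q are L2-normalized, the raw matrix-vector product already yields cosine scores; the sketch below simply computes the cosine directly for each document row.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

# Hypothetical document-term matrix D: one row per document, one column per term.
D = [[1, 1, 0],   # doc 0
     [0, 1, 2]]   # doc 1
Q = [1, 1, 0]     # query vector over the same 3 terms

# One similarity score per document: S[i] = cosine(D[i], Q).
S = [cosine(row, Q) for row in D]
```

Here doc 0 matches the query exactly (score 1.0), while doc 1 shares only one term and scores lower.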
Anonymous (2013-08-05): Hi Sujit, I am a newbie to vector space modeling. I recently had to create a tool which takes in a query and returns a list of matching files from a source code base, using tf-idf. Can you suggest how to go about it? VSM

Sujit Pal (2013-07-31): Hi Rob, d1 and d2 correspond to row and column indexes 0 and 1 respectively, so similarity(d1, d2) appears at both [0,1] and [1,0] of the matrix (the two values are equal, since the matrix is symmetric). Also notice that the diagonal is all 1s, indicating that any document is perfectly similar to itself.
Anonymous (2013-07-30): Hi Sujit, I am trying to understand the similarity matrix output you have printed. You have five docs, say d1, d2, ..., d5. How is the similarity represented here? Say we compute the similarity between d1 and d2; how is it represented in the matrix? Please explain. Thanks, Rob.

Sujit Pal (2012-12-10): Oh, okay, I see now why you could get negative values. Thanks for sharing; I will look out for this in the future.
Anonymous (2012-12-10): Hello, thanks for your reply. Yes, you are right: in the example you gave it is always positive. I tried to use your cosine similarity method on vectors taken from the V matrix generated by SVD, and it produced negative values. I have now written my own. Anyway, thank you for your post and the prompt reply. Best

Sujit Pal (2012-12-08): Thanks, glad it was useful. The use of norm1 is not a bug; you can never get a negative cosine in the case of text, since term frequencies are always zero or positive.
Anonymous (2012-12-06): Hello, thank you for your post. In the cosine similarity part, you used norm1 (double dotProduct = sourceDoc.arrayTimes(targetDoc).norm1()) to compute the dot product. From what I have read, norm1 always gives a positive result, which makes the cosine similarity always fall between 0 and 1 (while cosine similarity can also be between -1 and 0). Am I missing something, or is it a bug?

Sujit Pal (2012-11-21): You're welcome; hopefully you will be able to fix the problem.
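The norm1 point in the exchange above can be illustrated directly. Assuming the document vectors are column vectors, Jama's norm1 of the element-wise product reduces to a sum of absolute values, which agrees with the signed dot product only when all components are nonnegative (true for term counts, not for vectors taken from an SVD factor matrix).

```python
def norm1_dot(u, v):
    # Mimics Jama's arrayTimes(...).norm1(): element-wise product, then the
    # sum of ABSOLUTE values -- fine for term counts, which are never negative.
    return sum(abs(a * b) for a, b in zip(u, v))

def signed_dot(u, v):
    # The true dot product, which can go negative.
    return sum(a * b for a, b in zip(u, v))

u, v = [1, -2], [3, 4]
# With a negative component the two diverge:
print(norm1_dot(u, v), signed_dot(u, v))  # 11 -5
```

For nonnegative term-frequency vectors the two functions always agree, which is why the norm1 shortcut is safe for plain text.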
sara (2012-11-20): Thanks Sujit. Unfortunately, using a dense matrix didn't help me, and I still have the problem.

Sujit Pal (2012-11-19): Well, the idea is to add 1 to the numerator of every element in the TD matrix and v to its denominator, where v is the vocabulary size. So assume a term t has occurred n times in document d, and let N be the total number of words in the corpus; then the frequency of t is n/N, and with the Laplace correction it becomes (n+1)/(N+v). This changes your sparse matrix into a dense one.

Sara (2012-11-19): Actually, I am trying to cluster the documents using Gaussian mixtures. The problem is that since the sparse matrix has many zeros, the variance is zero and the program does not work. It happens when I am using Dirichlet mixtures too. I was wondering if you could explain in more detail how I can change the data to a dense matrix? What is the denominator? Many thanks for your time. Sara

Sujit Pal (2012-11-18): I am guessing you are trying to find similarities between "average" documents across genres, yes? If so, very cool idea. The NaNs are most likely because of underflow problems.
One way to work around that is to apply a Laplace correction (add 1 to the numerator and v to the denominator, where v is the vocabulary size) to the TD vectors, although if you do that it will change the data …
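A minimal sketch of the (n+1)/(N+v) Laplace correction described above, with made-up counts; for simplicity, N here is taken as the word count of the single document row rather than of the whole corpus.

```python
def laplace_smooth(counts, v):
    """Apply the (n+1)/(N+v) correction to a row of term counts, where N is
    the total word count for the row and v is the vocabulary size."""
    N = sum(counts)
    return [(n + 1) / (N + v) for n in counts]

row = [3, 0, 1, 0]                  # raw term counts; sparse (two zeros)
smoothed = laplace_smooth(row, v=len(row))
# Every zero becomes a small positive probability, so the row is now dense,
# and the smoothed values still sum to 1.
print(smoothed)  # [0.5, 0.125, 0.25, 0.125]
```

With no zeros left, variance estimates in a mixture-model clustering step no longer collapse to zero.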
Sara (2012-11-16): Hi, I did the same thing and made all the zeros live. But when I use Gaussian mixtures for clustering, it does not work, because I have many zeros in my matrix and all of my parameters become NaN. Thanks.

Sujit Pal (2012-11-15): Hi Sara, if you are using cosine similarity then you cannot. In the case of document vectors, they should be the same size, since each element of the vector corresponds to one word/term in the vocabulary (which is a superset of all terms found in the corpus). If they do differ in size for some reason, then you can pad the shorter one with zeros (in the case of text you are saying that the extra terms …
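The zero-padding idea in this last comment can be sketched as follows (a minimal illustration: padding with zeros says the extra terms simply never occur in the shorter document, and it leaves the cosine score unchanged).

```python
import math

def pad(u, v):
    """Zero-pad the shorter vector so both have the same length."""
    n = max(len(u), len(v))
    return u + [0] * (n - len(u)), v + [0] * (n - len(v))

def cosine(u, v):
    u, v = pad(u, v)
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

print(cosine([3, 4], [3, 4, 0]))  # 1.0 -- the trailing zero changes nothing
```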