Comments on Salmon Run: IR Math with Java : Similarity Measures

Apologies for the late reply. If you are still int...

2020-04-29T14:34:10.023-07:00

Apologies for the late reply. If you are still interested in the answer, following on from the comment that "position (i, j) of the TF matrix gives the frequency of the word j in document i. So the i-th row represents document i, and all the j entries of the row i represent the term frequencies of words in the vocabulary. Not all the words in the vocabulary will be in document i, so these
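
A minimal sketch of the layout described above, using only the Python standard library. The helper name `build_tf_matrix`, the whitespace tokenization, and the two toy documents are assumptions made for illustration; they are not taken from the original post. Entry (i, j) of the result is the count of vocabulary word j in document i, and a word that does not occur in a document gets a count of 0.

```python
from collections import Counter


def build_tf_matrix(documents):
    """Return (vocabulary, matrix) with matrix[i][j] = count of vocabulary[j] in documents[i]."""
    # Naive tokenization (lowercase + whitespace split); an assumption for this sketch.
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary is every distinct word across all documents, in sorted order.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One row per document; Counter returns 0 for words the document lacks.
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


docs = ["the cat sat", "the dog sat on the mat"]
vocabulary, tf = build_tf_matrix(docs)
for row in tf:
    print(row)
```

Each printed row is one whole document expressed over the same vocabulary, which is why rows of different documents can be compared directly.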

Hi Mr.Pal, I need to clarify more on creating sour...

2018-06-09T20:44:33.276-07:00

Hi Mr. Pal,
I need more clarification on creating the source matrix and the query matrix.
How can I do that in Python? Would you please be kind enough to show how to use them? I have read your previous comment reply, but that confused me more. You said that position (i, j) gives the frequency of a word in the relevant document, but how can we represent a whole document in a row? Please be kind enough to explain with a

Yes, all of these need your input as a matrix. Eas...

2018-02-07T13:39:08.046-08:00

Yes, all of these need your input as a matrix. The easiest way to convert your text to a matrix (where each row represents a record and each column represents a word) is to use one of scikit-learn's vectorizers. In general the frequency distribution of words in text is very long-tailed, so it is customary to look only at the N most important words. There are two approaches; the first is to consider
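
As a rough illustration of keeping only N words, here is a standard-library sketch that ranks words by total count across all records and builds the record-by-word matrix over the top N. The ranking rule, the helper names `top_n_vocabulary` and `vectorize`, and the sample records are assumptions for this example; the reply above does not say which criterion it has in mind. In scikit-learn, `CountVectorizer(max_features=N)` applies a similar frequency-based cutoff.

```python
from collections import Counter


def top_n_vocabulary(records, n):
    """Return the n most frequent words across all records."""
    totals = Counter()
    for record in records:
        totals.update(record.lower().split())
    return [word for word, _ in totals.most_common(n)]


def vectorize(records, vocabulary):
    """Return one row of word counts per record, with columns ordered as in vocabulary."""
    rows = []
    for record in records:
        counts = Counter(record.lower().split())
        rows.append([counts[word] for word in vocabulary])
    return rows


records = ["red apple red", "green apple", "red pear"]
vocab = top_n_vocabulary(records, 2)
matrix = vectorize(records, vocab)
print(vocab)
print(matrix)
```

Words outside the top N ("green" and "pear" here) are simply dropped, which keeps the matrix narrow at the cost of ignoring rare terms.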

Thank you for your effort. Please can you tell me ...

2018-02-05T05:02:53.922-08:00

Thank you for your effort.
Can you please tell me how to convert my dataset? It is a .csv file with a lot of words, and I want to convert each word to a vector for similarity and text mining. My big problem is converting my dataset. Can you help me with how to start? I have read about bag of words, cosine similarity, and the vector space model, but I don't know how to start.

Hi What2Watch, I wrote this code a long time ago w...

2017-01-30T10:05:31.382-08:00

Hi What2Watch, I wrote this code a long time ago when I was working almost exclusively in Java and trying to explore text mining. To answer your question, the general steps to go from text files to document matrices are outlined very nicely on this Working with Text Data page on the

Hello Mr.Pal, thank you for such a great post. I w...

2017-01-22T03:16:49.315-08:00

Hello Mr. Pal, thank you for such a great post. I would like to ask something related to my project. I will use only your similarity methods, not all of your work. I have documents like yours, but in one text file that contains many lines of sentences, like:

D1: word1,word2,word3,.....
D2: word1,word2,word3,.....
D3:..

To use your similarity functions, how can I convert

Thanks for the kind words, Sandi. Also thanks for ...

2016-12-14T08:21:46.344-08:00

Thanks for the kind words, Sandi. Also thanks for the pointer to the Spark cosine similarity code. I had heard of the DIMSUM algorithm, and realized after seeing the link that it's already implemented in Spark. The code you pointed to computes cosine similarity (exact and approximate) between all pairs of sentences in the matrix. Each sentence is represented by a column in the matrix. I guess you
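
To make the "all pairs of columns" idea concrete, here is a small brute-force sketch in plain Python. It computes the exact cosine similarity for every pair of columns; it is not the DIMSUM sampling scheme and does not use the Spark API. The function name `column_cosine_similarities` and the toy matrix are assumptions for illustration.

```python
import math


def column_cosine_similarities(matrix):
    """Return {(a, b): cosine similarity} for every pair of columns a < b."""
    n_cols = len(matrix[0])
    # Each column is treated as one item (a sentence, in the setting above).
    columns = [[row[j] for row in matrix] for j in range(n_cols)]
    norms = [math.sqrt(sum(v * v for v in col)) for col in columns]
    sims = {}
    for a in range(n_cols):
        for b in range(a + 1, n_cols):
            dot = sum(x * y for x, y in zip(columns[a], columns[b]))
            denom = norms[a] * norms[b]
            # An all-zero column has no direction; report 0.0 instead of dividing by zero.
            sims[(a, b)] = dot / denom if denom else 0.0
    return sims


matrix = [
    [1, 1, 0],
    [0, 0, 2],
    [1, 1, 0],
]
sims = column_cosine_similarities(matrix)
for pair, value in sorted(sims.items()):
    print(pair, round(value, 3))
```

The nested loop is quadratic in the number of columns, which is the cost that an approximate method such as DIMSUM is meant to reduce on large matrices.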

Hi Sujit, This indeed is a great post which helps ...

2016-12-13T03:12:42.312-08:00

Hi Sujit, this is indeed a great post that helps in understanding a lot of concepts.
On the cosine similarity part: I have a structured dataset of addresses (name, father's name, city, state, country, pincode), and I would like to associate a weight with each entity when calculating the cosine similarity between the sentences.

I found an approach using Spark MLlib in the below

You're welcome. They are not meant to be run on th...

2013-10-24T10:14:52.921-07:00

You're welcome. They are not meant to be run on their own, but as part of an indexing process. You can see how that is set up in SimilarityTest.java (the complete source code is available at jtmt.sf.net).

Yes, I understand; I do the same with English. I h...

2013-10-22T14:15:34.903-07:00

Yes, I understand; I do the same with English. I have a question: I have already downloaded the 4 classes, but I cannot run them because none of them has a main method. I mean the classes CosineSimilarity, JaccardSimilarity, AbstractSimilarity, and Searcher. Please help me resolve this. THANK YOU

Thank you very much for your kind words. My Spanis...

2013-10-15T09:58:50.632-07:00

Thank you very much for your kind words. My Spanish is not good, but Google Translate is my friend :-).

Hello, what I just read is very interesting. I hav...

2013-10-15T09:08:25.757-07:00

Hello, what I just read is very interesting. I have to work with similarity measures and have read about cosine similarity; your post seems very helpful to me.

Hi, if you already have a (nxm) TD matrix of your ...

2013-08-07T08:32:49.971-07:00

Hi, if you already have an (n x m) TD matrix of your documents (call it D), with the n terms as rows and the m documents as columns, you could use the same vectorizer to create an (n x 1) vector for your query (call it Q), then you can generate an (m x 1) similarity vector S = D^T * Q. Depending on the size of your document set, you can do the computation with dense or sparse matrices in memory, or you could
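
A small dense, in-memory sketch of that product in plain Python. It assumes D has one row per term and one column per document, so the score for each document is the dot product of its column with Q (that is, D transposed times Q). The function name `similarity_scores` and the numbers are made up for illustration. The raw dot product equals cosine similarity only if the document columns and the query are L2-normalized first.

```python
def similarity_scores(td_matrix, query):
    """Return one score per document: the dot product of its column with the query."""
    n_terms = len(query)
    n_docs = len(td_matrix[0])
    return [
        sum(td_matrix[t][d] * query[t] for t in range(n_terms))
        for d in range(n_docs)
    ]


# 4 terms (rows) x 3 documents (columns); values are illustrative term weights.
D = [
    [1, 0, 2],
    [0, 1, 1],
    [3, 0, 0],
    [0, 2, 1],
]
# Query over the same 4 terms: it mentions term 0 and term 3.
Q = [1, 0, 0, 1]

S = similarity_scores(D, Q)
best = max(range(len(S)), key=lambda d: S[d])
print(S, best)
```

Sorting the document indices by their score in S gives a ranked result list for the query.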