Friday, August 29, 2014

Topic Modeling with gensim


Over the past couple of weeks, I've been trying different ways to gain insight into my little corpus of 2000+ Patient Health Records (PHRs). Topic modeling is one such way, and I've been meaning to learn gensim, a Python library for topic modeling, so I decided to use it on this dataset. I have used Mahout for topic modeling before, but my data is quite small and doesn't need the complexity of Map-Reduce (besides, the objective is to learn gensim :-)). As a bonus, gensim also offers a wrapper for Mallet, another popular Java topic modeling library.

This post describes my experiment. Ultimately, the insights I got were not particularly interesting, but the exercise got me familiar with gensim. I hope you find it useful.

The source format of the dataset is a collection of JSON files. My first step is to pre-process the data so that the text portion is written out into a collection of text files. In hindsight this step seems redundant, since it could be merged with the next one, but keeping them separate makes things a bit clearer.

# Source: gensim_preprocess.py
import json
import os

JSONS_DIR = "/path/to/jsons/dir"
TEXTS_DIR = "/path/to/texts/dir"

for fn in os.listdir(JSONS_DIR):
    print "Converting JSON: %s" % (fn)
    fjson = open(os.path.join(JSONS_DIR, fn), 'rb')
    data = json.load(fjson)
    fjson.close()
    tfn = os.path.splitext(fn)[0] + ".txt"
    ftext = open(os.path.join(TEXTS_DIR, tfn), 'wb')
    ftext.write(data["text"].encode("utf-8"))
    ftext.close()

The next step is to convert the texts to a format that gensim can use - namely a Bag of Words (BoW) representation. Gensim expects to be fed a corpus data structure, basically a list of sparse vectors. The sparse vector elements consist of (id, score) pairs, where the id is a numeric ID that is mapped to the term via a dictionary. Gensim's author has taken pains to keep its memory requirements down by using a streaming approach to build corpora and dictionaries. The iter_docs() function below implements this streaming approach.
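
To make the format concrete, here is a tiny illustration (not part of the pipeline below) of what gensim's Dictionary and a BoW vector look like; the documents and token ids are made up.

# Toy illustration of gensim's BoW format; actual token ids may differ.
import gensim

docs = [["patient", "reports", "pain"],
        ["patient", "denies", "pain", "pain"]]
dictionary = gensim.corpora.Dictionary(docs)
print dictionary.token2id
# e.g. {'pain': 0, 'patient': 1, 'reports': 2, 'denies': 3}
print dictionary.doc2bow(docs[1])
# e.g. [(0, 2), (1, 1), (3, 1)]  -- (term id, term count) pairs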

In the code below, the custom MyCorpus class reads the text of each file, passes the words through gensim's tokenizer, and filters out stopwords (from NLTK's English stopword list). These words are used to create a dictionary and a BoW corpus, which are serialized to files for use in the next step.

# Source: bow_model.py
import logging
import os
import nltk
import gensim

def iter_docs(topdir, stoplist):
    for fn in os.listdir(topdir):
        fin = open(os.path.join(topdir, fn), 'rb')
        text = fin.read()
        fin.close()
        yield (x for x in 
            gensim.utils.tokenize(text, lowercase=True, deacc=True, 
                                  errors="ignore")
            if x not in stoplist)

class MyCorpus(object):

    def __init__(self, topdir, stoplist):
        self.topdir = topdir
        self.stoplist = stoplist
        self.dictionary = gensim.corpora.Dictionary(iter_docs(topdir, stoplist))
        
    def __iter__(self):
        for tokens in iter_docs(self.topdir, self.stoplist):
            yield self.dictionary.doc2bow(tokens)


logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', 
                    level=logging.INFO)

TEXTS_DIR = "/path/to/texts/dir"
MODELS_DIR = "/path/to/models/dir"

stoplist = set(nltk.corpus.stopwords.words("english"))
corpus = MyCorpus(TEXTS_DIR, stoplist)

corpus.dictionary.save(os.path.join(MODELS_DIR, "mtsamples.dict"))
gensim.corpora.MmCorpus.serialize(os.path.join(MODELS_DIR, "mtsamples.mm"), 
                                  corpus)

I didn't know (and didn't have an opinion about) how many topics this corpus should yield, so I decided to estimate it by reducing the features to two dimensions, then clustering the points for different values of K (the number of clusters) to find an optimal value. Gensim offers various transforms that allow us to project the vectors in a corpus into a different coordinate space. One such transform is the Latent Semantic Indexing (LSI) transform, which we use to project the original data to 2D.

# Source: lsi_model.py
import logging
import os
import gensim

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', 
                    level=logging.INFO)

MODELS_DIR = "/path/to/models/dir"

dictionary = gensim.corpora.Dictionary.load(os.path.join(MODELS_DIR, 
                                            "mtsamples.dict"))
corpus = gensim.corpora.MmCorpus(os.path.join(MODELS_DIR, "mtsamples.mm"))

tfidf = gensim.models.TfidfModel(corpus, normalize=True)
corpus_tfidf = tfidf[corpus]

# project to 2 dimensions for visualization
lsi = gensim.models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)

# write out coordinates to file
fcoords = open(os.path.join(MODELS_DIR, "coords.csv"), 'wb')
for vector in lsi[corpus_tfidf]:   # apply LSI to the same tf-idf weighted vectors it was trained on
    if len(vector) != 2:
        continue
    fcoords.write("%6.4f\t%6.4f\n" % (vector[0][1], vector[1][1]))
fcoords.close()

Next I clustered the points in the reduced 2D LSI space using KMeans, varying the number of clusters (K) from 1 to 10. The objective function used is the Inertia of the clustering, defined as the sum of squared distances of each point to its cluster centroid. This value is provided directly by Scikit-Learn's KMeans implementation. Other popular measures include the Distortion (Inertia divided by the number of points) and the Percentage of Variance Explained, as described on this StackOverflow page.
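
For reference, here is a minimal sketch (not from the original scripts) of how Distortion and the Percentage of Variance Explained could be derived from the same fitted KMeans object; the random X below is just a stand-in for the 2D LSI coordinates.

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100, 2)                    # stand-in for the points loaded from coords.csv
kmeans = KMeans(5).fit(X)

inertia = kmeans.inertia_                     # within-cluster sum of squared distances
distortion = inertia / X.shape[0]             # inertia averaged over the number of points
total_ss = ((X - X.mean(axis=0)) ** 2).sum()  # total sum of squares about the grand mean
pove = 1.0 - (inertia / total_ss)             # fraction of variance explained by the clustering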

# Source: num_topics.py
import os
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

MODELS_DIR = "/path/to/models/dir"
MAX_K = 10

X = np.loadtxt(os.path.join(MODELS_DIR, "coords.csv"), delimiter="\t")
ks = range(1, MAX_K + 1)

inertias = np.zeros(MAX_K)
diff = np.zeros(MAX_K)
diff2 = np.zeros(MAX_K)
diff3 = np.zeros(MAX_K)
for k in ks:
    kmeans = KMeans(k).fit(X)
    inertias[k - 1] = kmeans.inertia_
    # first difference    
    if k > 1:
        diff[k - 1] = inertias[k - 1] - inertias[k - 2]
    # second difference
    if k > 2:
        diff2[k - 1] = diff[k - 1] - diff[k - 2]
    # third difference
    if k > 3:
        diff3[k - 1] = diff2[k - 1] - diff2[k - 2]

elbow = np.argmin(diff3[3:]) + 3

plt.plot(ks, inertias, "b*-")
plt.plot(ks[elbow], inertias[elbow], marker='o', markersize=12,
         markeredgewidth=2, markeredgecolor='r', markerfacecolor=None)
plt.ylabel("Inertia")
plt.xlabel("K")
plt.show()

I plotted the Inertias for different values of K, then used Vincent Granville's approach of taking the third difference to find an elbow point. The elbow occurs here at K=5 and is marked with a red dot in the graph below.

I then re-ran the KMeans algorithm with K=5 and generated the clusters.

# Source: viz_topics_scatter.py
import os
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

MODELS_DIR = "/path/to/models/dir"
NUM_TOPICS = 5

X = np.loadtxt(os.path.join(MODELS_DIR, "coords.csv"), delimiter="\t")
kmeans = KMeans(NUM_TOPICS).fit(X)
y = kmeans.labels_

colors = ["b", "g", "r", "m", "c"]
for i in range(X.shape[0]):
    plt.scatter(X[i][0], X[i][1], c=colors[y[i]], s=10)    
plt.show()

which gives me clusters that look like this.


I then ran the full LDA transform against the BoW corpus, with the number of topics set to 5. As with LSI, I load the corpus and dictionary from files, then apply the transform to project the documents into the LDA topic space. Notice that LDA and LSI are conceptually similar in gensim - both are transforms that map one vector space to another.

# Source: lda_model.py
import logging
import os
import gensim

MODELS_DIR = "/path/to/models/dir"
NUM_TOPICS = 5

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', 
                    level=logging.INFO)

dictionary = gensim.corpora.Dictionary.load(os.path.join(MODELS_DIR, 
                                            "mtsamples.dict"))
corpus = gensim.corpora.MmCorpus(os.path.join(MODELS_DIR, "mtsamples.mm"))

# Project to LDA space
lda = gensim.models.LdaModel(corpus, id2word=dictionary, num_topics=NUM_TOPICS)
lda.print_topics(NUM_TOPICS)
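
The script above only prints the topic-term distributions. To see the topic mixture for an individual document, the trained model can be applied to that document's BoW vector - a minimal sketch, continuing with the lda and corpus objects defined above:

doc_bow = corpus[0]      # BoW vector of the first document
doc_lda = lda[doc_bow]   # list of (topic_id, probability) pairs
print doc_lda            # e.g. [(0, 0.12), (3, 0.85)]; very low probability topics are omitted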

Coming back to the topics, the output of print_topics() is shown below. As you can see, each topic is made up of a mixture of terms. The top 10 terms from each topic cover between 69% and 80% of the probability space for that topic.

2014-08-27 20:59:04,850 : INFO : topic #0 (0.200): 0.015*patient + \
      0.014*right + 0.009*left + 0.009*procedure + 0.006*well + \
      0.005*placed + 0.005*history + 0.004*skin + 0.004*normal + \
      0.004*anesthesia
2014-08-27 20:59:04,852 : INFO : topic #1 (0.200): 0.015*patient + \
      0.009*placed + 0.008*right + 0.007*normal + 0.007*left + \
      0.006*procedure + 0.005*using + 0.004*anterior + 0.004*history + \
      0.004*skin
2014-08-27 20:59:04,854 : INFO : topic #2 (0.200): 0.012*left + \
      0.010*patient + 0.010*right + 0.007*history + 0.005*well + \
      0.005*normal + 0.004*pain + 0.004*also + 0.003*without + \
      0.003*disease
2014-08-27 20:59:04,856 : INFO : topic #3 (0.200): 0.019*patient + \
      0.011*left + 0.011*normal + 0.010*right + 0.006*well + \
      0.006*procedure + 0.005*pain + 0.004*placed + 0.004*history + \
      0.004*using
2014-08-27 20:59:04,858 : INFO : topic #4 (0.200): 0.022*patient + \
      0.013*history + 0.007*mg + 0.006*normal + 0.005*pain + \
      0.005*also + 0.005*left + 0.004*right + 0.004*year + 0.004*well

and here is the same information shown graphically as word clouds, generated with the code below.

import os
import wordcloud

MODELS_DIR = "models"

final_topics = open(os.path.join(MODELS_DIR, "final_topics.txt"), 'rb')
curr_topic = 0
for line in final_topics:
    line = line.strip()
    line = line[line.rindex(":") + 2:]
    scores = [float(x.split("*")[0]) for x in line.split(" + ")]
    words = [x.split("*")[1] for x in line.split(" + ")]
    freqs = []
    for word, score in zip(words, scores):
        freqs.append((word, score))
    elements = wordcloud.fit_words(freqs, width=120, height=120)
    wordcloud.draw(elements, "gs_topic_%d.png" % (curr_topic),
                   width=120, height=120)
    curr_topic += 1
final_topics.close()
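
As noted in the comments below, the wordcloud package's API has changed since this was written - the module-level fit_words() and draw() functions are gone. Under the newer API, the body of the loop above might look roughly like this (a sketch, reusing the same freqs and curr_topic variables; not tested against every wordcloud version):

from wordcloud import WordCloud

wc = WordCloud(width=120, height=120)
wc.generate_from_frequencies(dict(freqs))     # newer versions expect a word -> score mapping
wc.to_file("gs_topic_%d.png" % (curr_topic))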

As I mentioned earlier, the only insight I get from the above is that the data seems to be predominantly about patients, history and procedures. But we already knew that :-).

Some ideas for future improvement. One would be to get much more aggressive about removing noise words than the NLTK English stopword list allows; that way we could drop terms such as "without" and "year" from the input, possibly resulting in a cleaner set of topics (a sketch of this follows below). Another would be to check higher values of K, perhaps against a higher-dimensional space, since we are computing the elbow analytically anyway.
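
For the first idea, a minimal sketch of what a domain-specific stoplist might look like - the extra terms below are just examples picked from the topic output above, not a vetted list:

import nltk

extra_stopwords = set(["patient", "history", "normal", "left", "right",
                       "well", "also", "without", "year", "procedure"])
stoplist = set(nltk.corpus.stopwords.words("english")) | extra_stopwords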

46 comments:

  1. Very useful tutorial and discussion! I'm an information systems PhD student trying to do a literature review for a paper, and starting to feel quite lost. I was planning to do this with gensim just to check if I'm missing something obvious, but I'm not much of a coder so it really helps to see the steps laid out clearly.

  2. Thanks Brendon, glad it helped!

  3. Thanks for this! No question that the nltk stopword defaults are far too minimal.

  4. Hi there
    This is one of my favorite posts on the subject. I am implementing it with my own corpus; however, no matter the size of the corpus (1,000, 10,000, or 90,000 docs), I always get the same number of topics (3). Could you please give me some idea of why that might happen?

    thanks a lot
    angelo

  5. Thanks Angelo. I am guessing you are referring to the method I described to automatically find the optimal number of clusters? I can't think of a reason why it should always return 3. Did you try applying the method to some of the other measures to see if the result changes?

  6. Very informative overview Sujit. I'm fairly new to Python but do understand the statistics behind it. How would you code this when you only have a single .txt file as the corpus? Each line in the text represents a document. I do not have JSON files.

  7. Thanks Patrick. The easiest way to do what you want would be to just modify the gensim_preprocess.py code to read a text file and dump out each line of the file into a separate text file, and use that corpus going forward.
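
    A minimal sketch of that modification (assuming the single input file is /path/to/corpus.txt, with one document per line):

    import os

    TEXTS_DIR = "/path/to/texts/dir"
    fin = open("/path/to/corpus.txt", 'rb')
    for i, line in enumerate(fin):
        fout = open(os.path.join(TEXTS_DIR, "doc-%d.txt" % (i)), 'wb')
        fout.write(line)
        fout.close()
    fin.close()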

  8. Hi Sujit Pal,

    I have split my large text file into multiple text files, so each line in my original text file is now a separate text document. This is to adhere to your project as much as possible. My setup is the following:

    All text files and Python (.py)files are in:
    F:\SNA\Project\Corpus

    so there is: text1.txt, text2.txt, text3.txt etc. In total 300 text files.

    Now I run the "bow_model.py" file and I get no errors, but it seems the dictionary has not been built. See the log message below:

    2015-02-16 22:43:11,214 : INFO : built Dictionary(0 unique tokens: []) from 0 documents (total 0 corpus positions)
    2015-02-16 22:43:11,217 : INFO : saving Dictionary object under F:/SNA/Project/gensim/models/mtsamples.dict, separately None
    2015-02-16 22:43:11,219 : INFO : storing corpus in Matrix Market format to F:/SNA/Project/gensim/models/mtsamples.mm
    2015-02-16 22:43:11,220 : INFO : saving sparse matrix to F:/SNA/Project/gensim/models/mtsamples.mm
    2015-02-16 22:43:11,220 : INFO : saving MmCorpus index to F:/SNA/Project/gensim/models/mtsamples.mm.index

    Did the actual script "bow_model.py" iterate over the text files? What am I doing wrong? Below is the code I have used:

    -----------------------

    # Source: bow_model.py
    import logging
    import os
    import nltk
    import gensim

    def iter_docs(topdir, stoplist):
        for fn in os.listdir(topdir):
            fin = open(os.path.join(topdir, fn), 'rb')
            text = fin.read()
            fin.close()
            yield (x for x in
                   gensim.utils.tokenize(text, lowercase=True, deacc=True, errors="ignore")
                   if x not in stoplist)

    class MyCorpus(object):

        def __init__(self, topdir, stoplist):
            self.topdir = topdir
            self.stoplist = stoplist
            self.dictionary = gensim.corpora.Dictionary(iter_docs(topdir, stoplist))

        def __iter__(self):
            for tokens in iter_docs(self.topdir, self.stoplist):
                yield self.dictionary.doc2bow(tokens)


    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',level=logging.INFO)

    TEXTS_DIR = "F:/SNA/Project/gensim/text/"
    MODELS_DIR = "F:/SNA/Project/gensim/models/"

    stoplist = set(nltk.corpus.stopwords.words("english"))
    corpus = MyCorpus(TEXTS_DIR, stoplist)

    corpus.dictionary.save(os.path.join(MODELS_DIR, "mtsamples.dict"))
    gensim.corpora.MmCorpus.serialize(os.path.join(MODELS_DIR, "mtsamples.mm"),corpus)

    -------------------

  9. Succeeded, thanks! The solution was to put the corpus text files in the
    project/text/ folder.

    Ran the analysis and am now looking for a way to be more aggressive (adding more stop words) than the NLTK English stop word list. There is a lot of noise in my data set - words that recur in almost every document. Do you happen to know how I can import these additional stop words?

  10. Cool, congratulations! One way to find additional candidates for stop words could be to find the ones which occur in a large number of documents (and hence have low predictive power). This is typically done by computing each word's IDF (Inverse Document Frequency) = log(number of documents / (number of documents containing the word + 1)), and cutting off low values of IDF (maybe using a chart to detect a knee).

  11. Hi Sujit,

    "This is typically done by computing each word's IDF (Inverse Document Frequency)"

    Is there a specific Python script for this so I can work through my corpus of .txt documents?

    Based on IDF I want to select those words that occur very frequently. Also, how would you add those words to the stop word list?

  12. Sorry about the delay, missed answering earlier. You can build up a data structure mapping each term to the set of documents it occurs in as you scan through the documents, and then sort the resulting structure, something like this...

    import math
    import nltk
    import os
    from operator import itemgetter

    term_docs = dict()
    ndocs = 0
    for fn in os.listdir(DATADIR):
        f = open(os.path.join(DATADIR, fn), 'rb')
        text = f.read()
        f.close()
        for sent in nltk.sent_tokenize(text):
            for term in nltk.word_tokenize(sent):
                if term in term_docs:
                    term_docs[term].add(ndocs)
                else:
                    term_docs[term] = set([ndocs])
        ndocs += 1
    idfs = []
    for term in term_docs:
        idf = math.log(float(ndocs) / (len(term_docs[term]) + 1))
        idfs.append((term, idf))
    print sorted(idfs, key=itemgetter(1), reverse=True)

  13. Thanks Sujit for your clear explanation and the code.

    I wonder how I can print the final topic/document mapping, i.e. the topic ID for each doc ID.

    Thanks
    Walid

  14. You are welcome. Regarding your question about the list of topics for a given document, I haven't tried this, but according to this page, you can do:

    >>> doc_lda = lda[doc_bow]

    where doc_bow is the vector for each individual document in the corpus.

  15. Thanks for the interesting work. Could you please explain the idea of projecting to a 2D semantic space with LSI to find the number of topics, and then applying that number to the LDA modeling? What is the reason for using LSI first instead of LDA?

  16. Hi 라일, you are welcome. I used LSI to estimate the number of topics to pass into LDA. I guess I could have tried to estimate the number of topics by cross-validation with LDA. But I couldn't think of an objective measure to evaluate LDA to figure out which number of topics is "best".

  17. Hello Sujit:

    I am sorry if this question has already been asked and answered, but I just wanted to get your sound bite on gensim vs pylucene. I need to decide which one to use for a project of mine and I would be grateful to hear your opinion about which you like better. I need only basic IR implementations, i.e. TF-IDF, cosine score, etc. However, my dataset is extremely huge, so I need efficient implementations. Thanks in advance.

    -Pavithra

  18. Hi Pavithra, I often use Lucene (although less using pyLucene and more through Solr's and now ES's HTTP interface) as a persistence mechanism for large datasets, but usually when I can get away with comparing one document to another. Since you are looking for absolute numbers (TF-IDF, cosine similarity), Lucene may not be the best choice - the reason being that the default Lucene similarity implementation, TFIDFSimilarity, approximates cosine similarity but has some differences introduced for performance (this similarity is still very good for relative scoring, which is what it is used for anyway). OTOH, gensim does implement TF-IDF and cosine similarity accurately and is built to be used in streaming mode, so as long as you can accommodate one document of your dataset in memory at a time, you should be good.

  19. Hi Sujit,

    Thanks for this post! It was a great help and very insightful (especially on using LSI to determine the number of topics for LDA).

    One question -- what is in the file final_topics.txt in the wordcloud step? I didn't see a previous step to create/export out this file before it's called in the wordcloud part.

    Thanks!

    -Ryan

  20. Hi Ryan, you are welcome, glad you found it helpful. The credit for the insight around using LSI really belongs to Dr Vincent Granville, who suggested it in one of his posts on LinkedIn. The final_topics.txt file is just the data that lda.print_topics writes to STDOUT; I captured it into a file and parsed it. Using LdaModel.show_topics() would probably have been a better way to get a structured set of words and probabilities instead of parsing the output of LdaModel.print_topics(), but at the time I didn't know about it.
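
    Something along these lines (a sketch only; the exact return format of show_topics depends on the gensim version):

    for topic_id, word_probs in lda.show_topics(num_topics=NUM_TOPICS,
                                                num_words=10, formatted=False):
        print topic_id, word_probs   # word_probs: list of (word, probability) pairs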

  21. Got it working -- show_topics() definitely made it a bit easier. Thanks again for your help and look forward to more posts in the future! (and will make sure to cite both you and Dr. Granville in my project)

  22. Hi Sujit,
    Thank you very much for this great, informative post on gensim; I am glad I got your post as a Google search result.
    I am a software engineer doing web and server-side general programming, but I have little knowledge of Machine Learning. If you can provide your thoughts on my problem described below, that would be a great help.

    I am planning to develop a web service which will do classification of Elementary Mathematics Word Problems for grades 1 to 5. The following URL gives an idea of the type of classification I am trying to achieve.
    http://www.math.niu.edu/courses/math402/packet/packet-2.pdf

    I am looking at these 4 software solutions: 1) Gensim 2) MALLET 3) Stanford Topic Modeling Toolkit 4) Princeton topic modeling software (URLs are given below for these systems).

    http://mallet.cs.umass.edu/topics.php
    http://nlp.stanford.edu/software/tmt/tmt-0.4/
    https://www.cs.princeton.edu/~blei/topicmodeling.html

    It would be a great help if you could throw some light on WHICH of the above FOUR may suit the kind of problem I am trying to solve.

    Here is a reference to a research paper on a similar kind of mathematics topic modeling problem:
    https://t.co/qeKDcvFUbT


    Thank you very much
    Reddy

  23. Hi Reddy, all of these packages are good, although I am only familiar with Gensim and MALLET. You will have to try them out against your test data - in general they might have slightly different default values, but otherwise tuned results should be similar. However, from the PDF it looks like you are trying to solve a classification problem. Topic Models will return you a set of (unnamed) topics and a set of (word, probability) pairs for each, where the probability is the probability of the word appearing in documents for that topic, whereas what you probably want is just a single number indicating what type of problem it is. If so, you can still do the topic modeling to reduce the input data to a small set of weighted features (where the features are top n terms from each topic and the probabilities are the weights) and then run it through a classifier. Alternatively, you can start with terms directly without the topic modeling step.

  24. Thank you very much Sujit for taking the time to understand my lengthy question and providing your views. Your answer clears the fog in my head about the difference between Classification and Topic Modeling.

  25. You are welcome, glad it helped.

  26. Great tutorial! Thanks for this. I am following the above tutorial but I am not able to generate any coordinates in the csv file. Kindly advise, thanks again.

  27. Thanks Ole. Are you getting an error message when trying to generate coords.csv? Can you post your error stack trace if so? Maybe the gensim API changed from when I used it for this stuff.

  28. Awesome tutorial. Thanks a lot!

  29. Thanks Soham, and you are welcome!

  30. The topics you are getting seem very typical when using few topics. They tend to be uninformative and have a great deal of overlap. I would suggest experimenting with the number of topics.

  31. Thanks Ólavur, good suggestion, maybe the reason words are bleeding into each other is that there are too few topics. I don't think I have the data anymore, but will keep this in mind for future experiments.

  32. You can try Hierarchical Dirichlet instead of LDA. It will automatically determine the number of topics. Gensim has a method. Usually it suggests more topics than a typical person would intuit.
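
    For reference, a minimal sketch of the gensim call, reusing the dictionary and corpus built earlier in the post:

    # Hierarchical Dirichlet Process: the number of topics is inferred from the data
    hdp = gensim.models.HdpModel(corpus, id2word=dictionary)
    hdp.print_topics()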

  33. Thank you, I didn't know about this, will check it out.

  34. Great and detailed tutorial. It has been insightful and very helpful.

    One comment regarding the elbow point of the inertia curve. I was having the same problem as angelo337: argmin sometimes returned 0, and the +3 in "elbow = np.argmin(diff3[3:]) + 3" was the reason the elbow point appeared to be at K=3 for many runs.

    Might it be the case that the elbow point is actually at: np.argmax(diff3[3:])+3
    I tried this with K from 1 to 50 in 2D (LSI with 2 topics), and the number of topics was 8, which seemed like an elbow point.

  35. Man, thanks a lot for the detailed step-by-step tutorial
    It is two years old, so some changes need to be made (wordcloud has been updated) .. but overall very useful. Thanks

  36. Hi Abd, thank you for the kind words, glad it was helpful.

  37. Thank you for the kind words Mitchall. I looked at the original link for Dr Granville's method (it was set wrongly in the post, corrected now), and it talks about using the argmax of the third derivative of the percentage of variance explained (PoVE). Scikit-learn gives me inertia, i.e., the sum of distances of samples to their closest centroid, and as I increase k it seems to fall. Intuitively, the tighter the clustering (lower inertia), the higher the PoVE, which you can also see in the chart in Dr Granville's post, so it is almost like PoVE = 1 - inertia; that is why I chose argmin (since the negative sign would carry over into the third derivative). But maybe a better approach might be to do argmax on 1 - inertia instead. Also, from the link, if the third derivative is 0, then we should probably fall back to the argmax of the second derivative.

  38. Yes you are right, argmax on the 3rd differences on PoVE is equivalent to argmin on the 3rd differences for the inertias. It was the 0 third derivative case causing the problems.
    Thanks for the response

  39. Hi,

    Thank you very much for sharing the code!

    I am getting an error while running the 'wordcloud' section. The error is as follows.

    elements = wordcloud.fit_words(freqs, width=120, height=120)
    AttributeError: 'module' object has no attribute 'fit_words'

    I saved the topics in 'final_topics.txt' file. Format is:

    topic #0 (0.020): 0.007*"market" + 0.006*"pension" + 0.005*"stock" + 0.005*"credit" + 0.004*"stake" + 0.004*"said" + 0.004*"inc" + 0.004*"million" + 0.004*"companies" + 0.004*"new"
    topic #1 (0.020): 0.014*"fed" + 0.011*"federal" + 0.010*"mr" + 0.010*"said" + 0.009*"chairman" + 0.007*"bernanke" + 0.006*"policy" + 0.006*"financial" + 0.006*"would" + 0.005*"year"
    topic #2 (0.020): 0.017*"u" + 0.016*"bank" + 0.014*"currency" + 0.011*"german" + 0.010*"japan" + 0.008*"west" + 0.008*"banks" + 0.007*"japanese" + 0.007*"germany" + 0.006*"said"
    topic #3 (0.020): 0.013*"stores" + 0.009*"said" + 0.007*"store" + 0.007*"year" + 0.007*"wal" + 0.007*"sales" + 0.006*"mart" + 0.006*"retailers" + 0.005*"shopping" + 0.005*"inc"
    topic #4 (0.020): 0.136*"airplane" + 0.039*"leaves" + 0.014*"finding" + 0.014*"new" + 0.008*"recession" + 0.007*"says" + 0.007*"york" + 0.007*"statistics" + 0.006*"know" + 0.006*"year"
    topic #5 (0.020): 0.046*"euro" + 0.045*"dollar" + 0.019*"yen" + 0.016*"u" + 0.014*"currency" + 0.010*"ecb" + 0.010*"europe" + 0.008*"european" + 0.008*"currencies" + 0.007*"new"
    topic #6 (0.020): 0.031*"investors" + 0.031*"stocks" + 0.028*"talking" + 0.021*"brian" + 0.020*"engines" + 0.020*"gate" + 0.020*"montagu" + 0.017*"rolling" + 0.017*"catch" + 0.016*"worried"
    topic #7 (0.020): 0.011*"would" + 0.010*"tax" + 0.006*"new" + 0.005*"year" + 0.004*"prices" + 0.004*"capital" + 0.004*"cut" + 0.004*"market" + 0.004*"bill" + 0.004*"said"
    topic #8 (0.020): 0.062*"contradictory" + 0.062*"balanced" + 0.043*"leaves" + 0.016*"budget" + 0.015*"rolling" + 0.015*"neither" + 0.012*"tax" + 0.008*"inflation" + 0.008*"spending" + 0.007*"economic"
    topic #9 (0.020): 0.215*"midland" + 0.007*"stage" + 0.007*"fowl" + 0.007*"shuts" + 0.007*"taxis" + 0.007*"connotes" + 0.005*"said" + 0.005*"economy" + 0.005*"india" + 0.005*"balanced"

    Please help me in this regard.

    Thanks in advance!

    Best regards,
    Mohammed

  40. Hi Mohammed, looks like the wordcloud package (https://github.com/amueller/word_cloud) has changed. Take a look at examples/simple.py to see the new usage pattern.

  41. Hi,

    Many thanks for your response!

    I am a Python beginner. I tried to modify the code but it did not work. Could you please update the wordcloud part of your code?

    Thanks in advance.

    Best regards,
    Mohammed

  42. Sujit,

    This is a wonderful process. I was contemplating how to best analyze the output and I like both your K-means and textual approach. I used to work in bioinformatics and found the end-user needed both.

    Could you make a recommendation as to how I could try your approach on Pubmed, or Google Scholar?

    Thank you in advance.

    -Tim Maguire, PhD

  43. @Mohammed: I will update once I find some time. Unfortunately too many things going on at the same time right now.

    @Tim: thanks for the kind words. Actually I cannot take credit for the K-Means approach, I learned about this from Dr Vincent Granville's blogs. I have found it to be mostly useful in my experiments. Regarding running on Pubmed or Google Scholar, you probably need something that will scale to that kind of data volume - my current big data tool is Spark thanks to my employer. Spark supports LDA using GraphX. The value of k would vary based on whether you were trying to cluster/featurize for some downstream process or simply try to understand the data. In the latter case, the advice I have read is generally to keep k <= 20.

  44. Hi Sujit,
    I want to find the similarity between documents in my corpus of 15000 documents, each consisting of at most 10 words. I am using the LSI model from gensim, but I am unable to decide on the training parameters. The similarity scores I get from the generated index are unsatisfactory.
    Also, I feel that it is because of my data that LSI is unable to do so.
    Can you please help, I am new to Gensim.

  45. One way to decide is to use coherence scores; gensim supports them natively now. The idea is to build different topic models with different hyperparameters (mostly the number of topics) and then compute the coherence score for each. Check out this blog post by Selva Prabhakaran for more details.
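
    A rough sketch of how that could look, reusing the corpus and dictionary from the post (texts here is assumed to be the tokenized documents as a list of token lists):

    for k in range(2, 20):
        lda_k = gensim.models.LdaModel(corpus, id2word=dictionary, num_topics=k)
        cm = gensim.models.CoherenceModel(model=lda_k, texts=texts,
                                          dictionary=dictionary, coherence="c_v")
        print(k, cm.get_coherence())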

