Friday, August 29, 2014

Topic Modeling with gensim


Over the past couple of weeks, I've been trying different ways to gain insight into my little corpus of 2000+ Patient Health Records (PHRs). Topic modeling is one way to do it, and since I've been meaning to learn gensim, a Python library for topic modeling, I decided to use it on this dataset. I have used Mahout for topic modeling before, but my data is quite small and doesn't need the complexity of Map-Reduce (besides, the objective is to learn gensim :-)). As a bonus, gensim also offers a wrapper for Mallet, another popular Java topic modeling library.

This post describes my experiment. Ultimately, the insights I got were not particularly interesting, but it got me familiar with gensim. I hope you find it useful.

The source format of the dataset is a collection of JSON files. My first step is to pre-process the data so that the text portion is written out into a collection of text files. In hindsight, this seems redundant since this could be merged with the next step, but it makes things a bit clearer if separated.

# Source: gensim_preprocess.py
import json
import os

JSONS_DIR = "/path/to/jsons/dir"
TEXTS_DIR = "/path/to/texts/dir"

for fn in os.listdir(JSONS_DIR):
    print "Converting JSON: %s" % (fn)
    fjson = open(os.path.join(JSONS_DIR, fn), 'rb')
    data = json.load(fjson)
    fjson.close()
    tfn = os.path.splitext(fn)[0] + ".txt"
    ftext = open(os.path.join(TEXTS_DIR, tfn), 'wb')
    ftext.write(data["text"].encode("utf-8"))
    ftext.close()

The next step is to convert the texts to a format that gensim can use - namely a Bag of Words (BoW) representation. Gensim expects to be fed a corpus data structure, basically a list of sparse vectors. The sparse vector elements consist of (id, score) pairs, where the id is a numeric ID that is mapped to the term via a dictionary. Gensim's author has taken pains to keep its memory requirements down by using a streaming approach to build corpora and dictionaries. The iter_docs() function below implements this streaming approach.
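
As a small illustration of that representation (a minimal sketch; the toy token lists below are made up for this example):

import gensim

# Toy documents: lists of tokens, purely for illustration
docs = [["heart", "rate", "normal"],
        ["heart", "surgery", "normal", "recovery"]]
dictionary = gensim.corpora.Dictionary(docs)

# doc2bow maps each token to its integer id and counts occurrences,
# producing one sparse (id, count) vector per document
for doc in docs:
    print dictionary.doc2bow(doc)
# prints something like [(0, 1), (1, 1), (2, 1)] for the first document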

In the code below, I read the text of each file, pass the words through gensim's tokenizer, and filter out stopwords (from NLTK's English stopword list) using our custom MyCorpus class. These words are used to create a dictionary and a BoW corpus, which are serialized to files for use in the next step.

# Source: bow_model.py
import logging
import os
import nltk
import gensim

def iter_docs(topdir, stoplist):
    for fn in os.listdir(topdir):
        fin = open(os.path.join(topdir, fn), 'rb')
        text = fin.read()
        fin.close()
        yield (x for x in 
            gensim.utils.tokenize(text, lowercase=True, deacc=True, 
                                  errors="ignore")
            if x not in stoplist)

class MyCorpus(object):

    def __init__(self, topdir, stoplist):
        self.topdir = topdir
        self.stoplist = stoplist
        self.dictionary = gensim.corpora.Dictionary(iter_docs(topdir, stoplist))
        
    def __iter__(self):
        for tokens in iter_docs(self.topdir, self.stoplist):
            yield self.dictionary.doc2bow(tokens)


logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', 
                    level=logging.INFO)

TEXTS_DIR = "/path/to/texts/dir"
MODELS_DIR = "/path/to/models/dir"

stoplist = set(nltk.corpus.stopwords.words("english"))
corpus = MyCorpus(TEXTS_DIR, stoplist)

corpus.dictionary.save(os.path.join(MODELS_DIR, "mtsamples.dict"))
gensim.corpora.MmCorpus.serialize(os.path.join(MODELS_DIR, "mtsamples.mm"), 
                                  corpus)

I didn't know (and didn't have an opinion about) how many topics this corpus should yield, so I decided to compute it by reducing the features to two dimensions and then clustering the points for different values of K (number of clusters) to find an optimal value. Gensim offers various transforms that allow us to project the vectors in a corpus to a different coordinate space. One such transform is the Latent Semantic Indexing (LSI) transform, which we use here to project the original data to 2D.

# Source: lsi_model.py
import logging
import os
import gensim

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', 
                    level=logging.INFO)

MODELS_DIR = "/path/to/models/dir"

dictionary = gensim.corpora.Dictionary.load(os.path.join(MODELS_DIR, 
                                            "mtsamples.dict"))
corpus = gensim.corpora.MmCorpus(os.path.join(MODELS_DIR, "mtsamples.mm"))

tfidf = gensim.models.TfidfModel(corpus, normalize=True)
corpus_tfidf = tfidf[corpus]

# project to 2 dimensions for visualization
lsi = gensim.models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)

# write out coordinates to file
fcoords = open(os.path.join(MODELS_DIR, "coords.csv"), 'wb')
for vector in lsi[corpus]:
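    # skip any document for which LSI did not return both components (e.g. empty docs)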
    if len(vector) != 2:
        continue
    fcoords.write("%6.4f\t%6.4f\n" % (vector[0][1], vector[1][1]))
fcoords.close()

Next I clustered the points in the reduced 2D LSI space using KMeans, varying the number of clusters (K) from 1 to 10. The objective function used is the Inertia of the clustering, defined as the sum of squared distances of each point to its cluster centroid. This value is provided directly by the Scikit-Learn KMeans implementation. Other popular functions include the Distortion (Inertia divided by the number of points) and the Percentage of Variance Explained, as described on this StackOverflow page.
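
As a quick sanity check on that definition, the value reported by kmeans.inertia_ can be recomputed by hand (a minimal sketch on random 2D points, independent of the actual coords.csv data):

import numpy as np
from sklearn.cluster import KMeans

# random 2D points stand in for the LSI coordinates in coords.csv
X = np.random.rand(100, 2)
kmeans = KMeans(5).fit(X)

# inertia = sum of squared distances of each point to its assigned centroid
manual = sum(np.sum((X[i] - kmeans.cluster_centers_[kmeans.labels_[i]]) ** 2)
             for i in range(X.shape[0]))
print manual, kmeans.inertia_   # the two numbers should agree (up to floating point)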

# Source: num_topics.py
import os
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

MODELS_DIR = "/path/to/models/dir"
MAX_K = 10

X = np.loadtxt(os.path.join(MODELS_DIR, "coords.csv"), delimiter="\t")
ks = range(1, MAX_K + 1)

inertias = np.zeros(MAX_K)
diff = np.zeros(MAX_K)
diff2 = np.zeros(MAX_K)
diff3 = np.zeros(MAX_K)
for k in ks:
    kmeans = KMeans(k).fit(X)
    inertias[k - 1] = kmeans.inertia_
    # first difference    
    if k > 1:
        diff[k - 1] = inertias[k - 1] - inertias[k - 2]
    # second difference
    if k > 2:
        diff2[k - 1] = diff[k - 1] - diff[k - 2]
    # third difference
    if k > 3:
        diff3[k - 1] = diff2[k - 1] - diff2[k - 2]

elbow = np.argmin(diff3[3:]) + 3

plt.plot(ks, inertias, "b*-")
plt.plot(ks[elbow], inertias[elbow], marker='o', markersize=12,
         markeredgewidth=2, markeredgecolor='r', markerfacecolor=None)
plt.ylabel("Inertia")
plt.xlabel("K")
plt.show()

I plotted the Inertias for different values of K, then used Vincent Granville's approach of taking the third difference to find an elbow point. The elbow point here occurs at K=5 and is marked with a red dot in the graph below.

I then re-ran the KMeans algorithm with K=5 and generated the clusters.

# Source: viz_topics_scatter.py
import os
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

MODELS_DIR = "/path/to/models/dir"
NUM_TOPICS = 5

X = np.loadtxt(os.path.join(MODELS_DIR, "coords.csv"), delimiter="\t")
kmeans = KMeans(NUM_TOPICS).fit(X)
y = kmeans.labels_

colors = ["b", "g", "r", "m", "c"]
for i in range(X.shape[0]):
    plt.scatter(X[i][0], X[i][1], c=colors[y[i]], s=10)    
plt.show()

which gives me clusters that look like this.


I then ran the full LDA transform against the BoW corpus, with the number of topics set to 5. As in LSI, I load up the corpus and dictionary from files, then apply the transform to project the documents into the LDA Topic space. Notice that LDA and LSI are conceptually similar in gensim - both are transforms that map one vector space to another.

# Source: lda_model.py
import logging
import os
import gensim

MODELS_DIR = "/path/to/models/dir"
NUM_TOPICS = 5

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', 
                    level=logging.INFO)

dictionary = gensim.corpora.Dictionary.load(os.path.join(MODELS_DIR, 
                                            "mtsamples.dict"))
corpus = gensim.corpora.MmCorpus(os.path.join(MODELS_DIR, "mtsamples.mm"))

# Project to LDA space
lda = gensim.models.LdaModel(corpus, id2word=dictionary, num_topics=NUM_TOPICS)
lda.print_topics(NUM_TOPICS)

The results are as follows. As you can see, each topic is made up of a mixture of terms. The top 10 terms from each topic are shown in the output below and cover between 69% and 80% of the probability space for each topic.

2014-08-27 20:59:04,850 : INFO : topic #0 (0.200): 0.015*patient + \
      0.014*right + 0.009*left + 0.009*procedure + 0.006*well + \
      0.005*placed + 0.005*history + 0.004*skin + 0.004*normal + \
      0.004*anesthesia
2014-08-27 20:59:04,852 : INFO : topic #1 (0.200): 0.015*patient + \
      0.009*placed + 0.008*right + 0.007*normal + 0.007*left + \
      0.006*procedure + 0.005*using + 0.004*anterior + 0.004*history + \
      0.004*skin
2014-08-27 20:59:04,854 : INFO : topic #2 (0.200): 0.012*left + \
      0.010*patient + 0.010*right + 0.007*history + 0.005*well + \
      0.005*normal + 0.004*pain + 0.004*also + 0.003*without + \
      0.003*disease
2014-08-27 20:59:04,856 : INFO : topic #3 (0.200): 0.019*patient + \
      0.011*left + 0.011*normal + 0.010*right + 0.006*well + \
      0.006*procedure + 0.005*pain + 0.004*placed + 0.004*history + \
      0.004*using
2014-08-27 20:59:04,858 : INFO : topic #4 (0.200): 0.022*patient + \
      0.013*history + 0.007*mg + 0.006*normal + 0.005*pain + \
      0.005*also + 0.005*left + 0.004*right + 0.004*year + 0.004*well

and here is the same information shown graphically using wordclouds (generated using the code below).

import os
import wordcloud

MODELS_DIR = "models"

final_topics = open(os.path.join(MODELS_DIR, "final_topics.txt"), 'rb')
curr_topic = 0
for line in final_topics:
    line = line.strip()[line.rindex(":") + 2:]
    scores = [float(x.split("*")[0]) for x in line.split(" + ")]
    words = [x.split("*")[1] for x in line.split(" + ")]
    freqs = []
    for word, score in zip(words, scores):
        freqs.append((word, score))
    elements = wordcloud.fit_words(freqs, width=120, height=120)
    wordcloud.draw(elements, "gs_topic_%d.png" % (curr_topic),
                   width=120, height=120)
    curr_topic += 1
final_topics.close()
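
As noted in the comments below, LdaModel.show_topics() returns the same words and probabilities in structured form, which avoids parsing the printed log output. Here is a sketch; the exact return format of show_topics() has varied across gensim versions, so treat the unpacking as an assumption:

# Assumes the lda model and NUM_TOPICS from lda_model.py above. In recent
# gensim versions show_topics(formatted=False) returns a list of
# (topic_id, [(word, prob), ...]) tuples; older versions returned
# (prob, word) pairs without the topic id, so adjust the unpacking if needed.
for topic_id, word_probs in lda.show_topics(num_topics=NUM_TOPICS,
                                            num_words=10,
                                            formatted=False):
    freqs = [(word, float(prob)) for word, prob in word_probs]
    # freqs can be passed to wordcloud.fit_words() in place of the parsed values
    print topic_id, freqs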

As I mentioned earlier, the only insight I get from the above is that the data seems to be predominantly about patients, history and procedures. But we already knew that :-).

Some ideas for future improvement. One would be to get much more aggressive about removing noise words than just the NLTK English stopwords; that way we could drop terms such as "without" and "year" from the input, possibly resulting in a cleaner set of topics. Another would be to check higher values of K, perhaps against a higher-dimensional space, since we are computing the elbow analytically anyway.
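
For the first idea, one low-effort option I did not try here is gensim's Dictionary.filter_extremes(), which drops terms that occur in too many (or too few) documents in addition to the stopword list. A sketch against bow_model.py, with illustrative (untuned) thresholds:

# Sketch: in bow_model.py, after building MyCorpus, drop terms that appear
# in more than half of the documents (likely boilerplate such as "patient"
# or "history") and terms that appear in fewer than 5 documents, before
# serializing the corpus. Since doc2bow only emits ids still present in the
# dictionary, the filtered terms disappear from the BoW vectors too.
corpus = MyCorpus(TEXTS_DIR, stoplist)
corpus.dictionary.filter_extremes(no_below=5, no_above=0.5, keep_n=50000)
corpus.dictionary.save(os.path.join(MODELS_DIR, "mtsamples.dict"))
gensim.corpora.MmCorpus.serialize(os.path.join(MODELS_DIR, "mtsamples.mm"),
                                  corpus)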

34 comments (moderated to prevent spam):

Brendon said...

Very useful tutorial and discussion! I'm an information systems PhD student trying to do a literature review for a paper, and starting to feel quite lost. I was planning to do this with gensim just to check if I'm missing something obvious, but I'm not much of a coder so it really helps to see the steps laid out clearly.

Sujit Pal said...

Thanks Brendon, glad it helped!

Anonymous said...

Thanks for this! No question that the nltk stopword defaults are far too minimal.

Sujit Pal said...

You're welcome!

angelo337 said...

Hi there
I see this as one of my favorite posts on the subject. I am implementing it with my own corpus, however no matter the size of the corpus (1,000, 10,000, or 90,000 docs) I always get the same number of topics (3). Could you please give me some idea of why that could happen?

thanks a lot
angelo

Sujit Pal said...

Thanks Angelo. I am guessing you are referring to the method I described to automatically find the optimal number of clusters? I can't think of a reason why it should always return 3. Did you try applying the method to some of the other measures to see if it changes?

Patrick said...

Very informative overview Sujit. I'm fairly new to Python but do understand the statistics behind it. How would you code this when you only have a single .txt file as the corpus? Each line in the text represents a document. I do not have JSON files.

Sujit Pal said...

Thanks Patrick. The easiest way to do what you want would be to just modify the gensim_preprocess.py code to read a text file and dump out each line of the file into a separate text file, and use that corpus going forward.
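
Something like this minimal sketch would do it (the input and output paths are placeholders):

import os

INPUT_FILE = "/path/to/docs.txt"   # one document per line
TEXTS_DIR = "/path/to/texts/dir"

fin = open(INPUT_FILE, 'rb')
for i, line in enumerate(fin):
    # write each line out as its own document file
    fout = open(os.path.join(TEXTS_DIR, "doc-%d.txt" % (i)), 'wb')
    fout.write(line.strip())
    fout.close()
fin.close()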

Patrick said...

Hi Sujit Pal,

I have split my large text file into multiple text files, so each line in my original text file is now a separate text document. This is to adhere to your project as much as possible. My setup is the following:

All text files and Python (.py)files are in:
F:\SNA\Project\Corpus

so there is: text1.txt, text2.txt, text3.txt etc. In total 300 text files.

Now I run the "bow_model.py" file and I get no errors but it seems the dicitionary has not built. See log message below:

2015-02-16 22:43:11,214 : INFO : built Dictionary(0 unique tokens: []) from 0 documents (total 0 corpus positions)
2015-02-16 22:43:11,217 : INFO : saving Dictionary object under F:/SNA/Project/gensim/models/mtsamples.dict, separately None
2015-02-16 22:43:11,219 : INFO : storing corpus in Matrix Market format to F:/SNA/Project/gensim/models/mtsamples.mm
2015-02-16 22:43:11,220 : INFO : saving sparse matrix to F:/SNA/Project/gensim/models/mtsamples.mm
2015-02-16 22:43:11,220 : INFO : saving MmCorpus index to F:/SNA/Project/gensim/models/mtsamples.mm.index

Did the actual script "bow_model.py" iterate over the text files? What am I doing wrong? Below is the code I have used:

-----------------------

# Source: bow_model.py
import logging
import os
import nltk
import gensim

def iter_docs(topdir, stoplist):
    for fn in os.listdir(topdir):
        fin = open(os.path.join(topdir, fn), 'rb')
        text = fin.read()
        fin.close()
        yield (x for x in
            gensim.utils.tokenize(text, lowercase=True, deacc=True, errors="ignore")
            if x not in stoplist)

class MyCorpus(object):

    def __init__(self, topdir, stoplist):
        self.topdir = topdir
        self.stoplist = stoplist
        self.dictionary = gensim.corpora.Dictionary(iter_docs(topdir, stoplist))

    def __iter__(self):
        for tokens in iter_docs(self.topdir, self.stoplist):
            yield self.dictionary.doc2bow(tokens)


logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s',level=logging.INFO)

TEXTS_DIR = "F:/SNA/Project/gensim/text/"
MODELS_DIR = "F:/SNA/Project/gensim/models/"

stoplist = set(nltk.corpus.stopwords.words("english"))
corpus = MyCorpus(TEXTS_DIR, stoplist)

corpus.dictionary.save(os.path.join(MODELS_DIR, "mtsamples.dict"))
gensim.corpora.MmCorpus.serialize(os.path.join(MODELS_DIR, "mtsamples.mm"),corpus)

-------------------

Patrick said...

Succeeded, thanks! The solution was to put the corpus text files in the project/text/ folder.

I ran the analysis and am now looking for a way to be more aggressive (adding more stop words) than the NLTK English stop word list. There is a lot of noise in my data set: words that appear in almost every document. Do you happen to know how I can add these additional stop words?

Sujit Pal said...

Cool, congratulations! One way to find additional candidates for stop words could be to find the ones which occur in a large number of documents (hence their predictive power is low). This is typically done by computing each word's IDF (Inverse Document Frequency) = log(number of documents / (number of documents containing the word + 1)), and cutting off low values of IDF (maybe using a chart to detect a knee).

Patrick said...

Hi Sujit,

"This is typically done by computing each word's IDF (Inverse Document Frequency)"

Is there a specific Python script for this so I can work through my corpus of .txt documents?

Based on IDF I want to select those words that occur very frequently. Also, how would you add those words to the stop word list?

Sujit Pal said...

Sorry about the delay, missed answering this earlier. You can build up a data structure mapping each term to the set of documents it appears in as you scan through the documents, and then sort the resulting structure, something like this...

import math
import os
import nltk
from operator import itemgetter

term_docs = dict()
ndocs = 0
for fn in os.listdir(DATADIR):
    f = open(os.path.join(DATADIR, fn), 'rb')
    text = f.read()
    f.close()
    for sent in nltk.sent_tokenize(text):
        for term in nltk.word_tokenize(sent):
            if term in term_docs:
                term_docs[term].add(ndocs)
            else:
                term_docs[term] = set([ndocs])
    ndocs += 1
idfs = []
for term in term_docs:
    idf = math.log(float(ndocs) / (len(term_docs[term]) + 1))
    idfs.append((term, idf))
print sorted(idfs, key=itemgetter(1), reverse=True)

- said...

Thanks Sujit for your clear explanation and the code.

I wonder how I can print the final topic/document mapping, i.e. the topic ID for each doc ID.

Thanks
Walid

Sujit Pal said...

You are welcome. Regarding your question about list of topics for a given document, I haven't tried this, but according to this page, you can do:

>>> doc_lda = lda[doc_bow]

where doc_bow is the vector for each individual document in the corpus.
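
For example, to get a topic assignment for every document you could loop over the whole corpus (a sketch, assuming the lda model and BoW corpus built earlier in the post):

# lda[doc_bow] returns a sparse list of (topic_id, probability) pairs; the
# highest-probability topic can be used as the document's label.
for doc_id, doc_bow in enumerate(corpus):
    doc_topics = lda[doc_bow]
    if doc_topics:
        best_topic = max(doc_topics, key=lambda x: x[1])[0]
        print "doc %d -> topic %d" % (doc_id, best_topic)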

라일 said...

Thanks for the interesting work. Could you please explain the idea of projecting to a 2D semantic space with LSI to find the number of topics and then applying that number to LDA modeling? What is the reason for using LSI first instead of LDA?

Sujit Pal said...

Hi 라일, you are welcome. I used LSI to estimate the number of topics to pass into LDA. I guess I could have tried to estimate the number of topics by cross-validation with LDA. But I couldn't think of an objective measure to evaluate LDA to figure out which number of topics is "best".

Pavithra Srinath said...

Hello Sujit:

I am sorry if this question has already been asked and answered, but I just wanted to get your soundbite on gensim vs pylucene. I need to decide which one to use for a project of mine and I would be grateful to hear your opinion about which you like better. I need only basic IR implementations, i.e. TF-IDF, cosine score, etc. However, my dataset is extremely large, so I need efficient implementations. Thanks in advance.

-Pavithra

Sujit Pal said...

Hi Pavithra, I often use Lucene (although less via pyLucene and more through Solr's and now ES's HTTP interface) as a persistence mechanism for large datasets, but usually when I can get away with comparing one document to another. Since you are looking for absolute numbers (TF-IDF, cosine similarity), Lucene may not be the best choice - the reason being that the default Lucene similarity implementation, TFIDFSimilarity, approximates cosine similarity but has some differences introduced for performance (this similarity is still very good for relative scoring, which is what it is used for anyway). OTOH, gensim does implement TF-IDF and cosine accurately and is built to be used in streaming mode, so as long as you can accommodate one document of your dataset in memory at a time, you should be good.

Rhymenoceros said...

Hi Sujit,

Thanks for this post! It was a great help and very insightful (especially on using LSI to determine the number of topics for LDA).

One question -- what is in the file final_topics.txt in the wordcloud step? I didn't see a previous step that creates/exports this file before it's called in the wordcloud part.

Thanks!

-Ryan

Sujit Pal said...

Hi Ryan, you are welcome, glad you found it helpful. The credit for the insight around using LSI really belongs to Dr Vincent Granville, who suggested it in one of his posts on LinkedIn. The final_topics.txt file is just the data written out to STDOUT by lda.print_topics(); I captured it into a file and parsed it. Using LdaModel.show_topics() would probably have been a better way to get a structured set of words and probabilities instead of parsing the output of LdaModel.print_topics(), but at the time I didn't know about it.

Rhymenoceros said...

Got it working -- show_topics() definitely made it a bit easier. Thanks again for your help and look forward to more posts in the future! (and will make sure to cite both you and Dr. Granville in my project)

Vision said...

Hi Sujit,
Thank you very much for this great, informative post on gensim; I am glad I found your post as a Google search result for gensim.
I am a software engineer in general web and server-side programming, but have little knowledge of Machine Learning. If you can provide your thoughts on my problem described below, that will be a great help.

I am planning to develop a web service which will do classification of Elementary Mathematics Word Problems for grades 1 to 5. The following URL gives an idea of what type of classification I am trying to achieve:
http://www.math.niu.edu/courses/math402/packet/packet-2.pdf

I am looking at these 4 software solutions: 1) Gensim 2) MALLET 3) Stanford Topic Modeling Toolkit 4) Princeton topic modeling software (URLs are given below for these software systems).

http://mallet.cs.umass.edu/topics.php
http://nlp.stanford.edu/software/tmt/tmt-0.4/
https://www.cs.princeton.edu/~blei/topicmodeling.html

It would be a great help if you could throw some light on which of the above four may suit the kind of problem I am trying to solve.

Here is a reference to a research paper on a similar kind of mathematics topic modeling problem:
https://t.co/qeKDcvFUbT


Thank you very much
Reddy

Sujit Pal said...

Hi Reddy, all of these packages are good, although I am only familiar with Gensim and MALLET. You will have to try them out against your test data - in general they might have slightly different default values, but otherwise tuned results should be similar. However, from the PDF it looks like you are trying to solve a classification problem. Topic Models will return you a set of (unnamed) topics and a set of (word, probability) pairs for each, where the probability is the probability of the word appearing in documents for that topic, whereas what you probably want is just a single number indicating what type of problem it is. If so, you can still do the topic modeling to reduce the input data to a small set of weighted features (where the features are top n terms from each topic and the probabilities are the weights) and then run it through a classifier. Alternatively, you can start with terms directly without the topic modeling step.

Vision said...

Thank you very much, Sujit, for taking the time to understand my lengthy question and providing your views. Your answer clears the fog in my head about the difference between classification and topic modeling.

Sujit Pal said...

You are welcome, glad it helped.

OLE LEO said...

Great tutorial! Thanks for this. I am following the tutorial above but am not able to generate any coordinates in the CSV file. Kindly advise, thanks again.

Sujit Pal said...

Thanks Ole. Are you getting an error message when trying to generate coords.csv? Can you post your error stack trace if so? Maybe the gensim API changed from when I used it for this stuff.

soham lawar said...

Awesome tutorial. Thanks a lot!

Sujit Pal said...

Thanks Soham, and you are welcome!

Ólavur said...

The topics you are getting seem very typical when using few topics. They tend to be uninformative and have a great deal of overlap. I would suggest experimenting with the number of topics.

Sujit Pal said...

Thanks Ólavur, good suggestion; maybe the reason words are bleeding into each other is that there are too few topics. I don't think I have the data anymore, but will keep this in mind for future experiments.

Cameron said...

You can try a Hierarchical Dirichlet Process instead of LDA. It will automatically determine the number of topics. Gensim has an implementation. Usually it suggests more topics than a typical person would intuit.

Sujit Pal said...

Thank you, I didn't know about this, will check it out.