Wednesday, October 02, 2013

Topic Modeling with Mahout on Amazon EMR


Introduction


The motivation for this work was a desire to understand the structure of a corpus in a manner different from what I am used to. Central to all our applications is a knowledge graph derived from our medical taxonomy, so any document corpus can easily be described by a small set (50-100) of high-level concepts, simply by rolling up document concepts into their parents until an adequate degree of granularity is achieved. I wanted to see if standard topic modeling techniques would yield comparable results. If so, perhaps the output of such a process could be used as feedback for concept creation.

This post describes Topic Modeling a smallish corpus (2,285 documents) from our document collection, using Apache Mahout's Latent Dirichlet Allocation (LDA) algorithm, running on the Amazon Elastic MapReduce (EMR) platform. Mahout provides the LDA implementation, as well as utilities for IO. The code I wrote works at the two ends of the pipeline: first to download and parse data for Mahout to consume, and then to produce a report of the top terms in each topic category.

Even though Mahout (I used version 0.8) provided most of the functionality for this work, the experience was hardly straightforward. The official documentation is outdated, and I had to repeatedly refer to discussions on the Mahout mailing lists to find solutions to problems I faced along the way. I found only one blog post, based on Mahout version 0.5, that I could use as a starting point. Of course, all's well that ends well, and I was ultimately able to get the top terms for each topic and the topic composition of each document.

Theory


The math behind LDA is quite formidable as you can see from its Wikipedia page, but here is a somewhat high-level view, selectively gleaned from this paper by Steyvers and Griffiths.

Topic Models are based upon the idea that documents are mixtures of topics, where a topic is a probability distribution over words. To make a new document, one chooses a distribution over topics. Then, for each word in the document, one chooses a topic at random from that distribution and draws a word from that topic.

In order to answer the (IMO more interesting) question of what topics make up a collection of documents, you invert this process. Each Topic Modeling algorithm does this differently; LDA uses an approximate iterative method that samples values sequentially, proceeding until the sampled values converge to the target distribution.
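To make the generative story concrete, here is a tiny, purely illustrative Python sketch. The vocabulary, topic count and Dirichlet parameters below are invented for illustration and are not part of the Mahout pipeline (the alpha/beta smoothing values are roughly what the -a and -e options to the cvb subcommand control later in this post):

import numpy as np

np.random.seed(42)
vocab = ["heart", "valve", "blood", "diet", "sugar", "insulin"]  # toy vocabulary
num_topics = 2
alpha, beta = 0.1, 0.1   # Dirichlet hyperparameters (illustrative values)

# each topic is a probability distribution over words
topic_word = np.random.dirichlet([beta] * len(vocab), num_topics)

def generate_document(num_words=8):
  # a document is a mixture of topics...
  doc_topic = np.random.dirichlet([alpha] * num_topics)
  words = []
  for _ in range(num_words):
    # ...so for each word, pick a topic from the document's mixture,
    # then draw a word from that topic's distribution
    z = np.random.choice(num_topics, p=doc_topic)
    w = np.random.choice(len(vocab), p=topic_word[z])
    words.append(vocab[w])
  return " ".join(words)

print generate_document()

LDA's job is to run this process in reverse: given only the documents, recover plausible topic_word and doc_topic distributions.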

Preparing the data


The documents for this work come from our Content Management System (CMS), and this section describes the extraction code. It is included for completeness; your setup is likely very different, so it may be of limited use to you. In any case, our CMS is loosely coupled to our web front end via a publisher, which serializes documents in JSON format onto a network filesystem. Content can be pulled off this filesystem most efficiently through a REST API if you know the "file ID". I use Solr to get a list of these file IDs, then download the files to my local filesystem for further processing.

Processing consists of parsing out the text content of the files (each content type can define its own JSON format), then using NLTK to remove HTML tags, stopwords, numeric tokens and punctuation. The text versions of the JSON files are written out to another directory for feeding into the Mahout pipeline.

The code is in Python and is shown below. Hostnames and such have been changed to protect the innocent.

import json
import nltk
import os
import os.path
import string
import urllib
import urllib2

SOLR_SERVER = "http://solrserver.mycompany.com:8983/solr/select"
CONTENT_SERVER = "http://contentserver.mycompany.com/view"
JSON_OUTPUTDIR = "/path/to/data/hlcms_jsons"
TEXT_OUTPUTDIR = "/path/to/data/hlcms_text"
FILENAMES_FILE = "/tmp/hlcms_filenames.txt"

STOPWORDS = nltk.corpus.stopwords.words("english")
PUNCTUATIONS = {c:"" for c in string.punctuation}

def textify(s):
  text = nltk.clean_html(s)
  sentences = nltk.sent_tokenize(text)
  words = []
  for sentence in sentences:
    sent = sentence.encode("utf-8", "ignore")
    sent = "".join([PUNCTUATIONS[c] if PUNCTUATIONS.has_key(c) else c 
                                    for c in sent])
    ws = nltk.word_tokenize(sent)
    for w in ws:
      if w in STOPWORDS: continue
      if w.replace(",", "").replace(".", "").isdigit(): continue
      words.append(w.lower())
  return " ".join(words)

# build list of all file parameter values from solr
params = urllib.urlencode({
  "q" : "sourcename:hlcms",
  "start" : "0",
  "rows" : "0",
  "fl" : "contenttype,cmsurl",
  "wt" : "json"
})
conn = urllib.urlopen(SOLR_SERVER, params)
rsp = json.load(conn)
numfound = rsp["response"]["numFound"]
print "# of CMS articles to download: ", numfound
filenames = open(FILENAMES_FILE, 'wb')
npages = int(numfound/10) + 1
for pg in range(0, npages):
  if pg % 100 == 0:
    print "Downloading HLCMS page #: %d" % (pg)
  params = urllib.urlencode({
    "q" : "sourcename:hlcms",
    "start" : str(pg * 10),
    "rows" : "10",
    "fl" : "contenttype,cmsurl",
    "wt" : "json"
  })
  conn = urllib.urlopen(SOLR_SERVER, params)
  rsp = json.load(conn)
  for doc in rsp["response"]["docs"]:
    try:
      contenttype = doc["contenttype"]
      cmsurl = doc["cmsurl"]
      filenames.write("%s-%s\n" % (contenttype, cmsurl))
    except KeyError:
      continue
filenames.close()

# for each file parameter, build URL and extract data into local dir
filenames2 = open(FILENAMES_FILE, 'rb')
for filename in filenames2:
  fn = filename.strip()
  ofn = os.path.join(JSON_OUTPUTDIR, fn + ".json")
  print "Downloading file: ", fn
  try:
    output = open(ofn, 'wb')
    response = urllib2.urlopen(CONTENT_SERVER + "?file=" + fn + "&raw=true")
    output.write(response.read())
    output.close()
  except IOError:
    continue
filenames2.close()
print "All files downloaded"

# build parser for each content type to extract title and body
for file in os.listdir(JSON_OUTPUTDIR):
  print "Parsing file: %s" % (file)
  fin = open(os.path.join(JSON_OUTPUTDIR, file), 'rb')
  ofn = os.path.join(TEXT_OUTPUTDIR, 
    os.path.basename(file[0:file.rindex(".json")]) + ".txt")
  fout = open(ofn, 'wb')
  try:
    doc_json = json.load(fin)
    # parsing out title and body based on content type
    # since different content types can have own format
    if file.startswith("ctype1-"):
      for fval in ["title", "bm_intro", "bm_seo_body"]:
        fout.write("%s\n" % (textify(doc_json[fval])))
    elif file.startswith("ctype2-"):
      for fval in ["body"]:
        fout.write("%s\n" % (textify(doc_json[fval])))
    elif file.startswith("ctype3-"):
      for fval in ["title", "body"]:
        fout.write("%s\n" % (textify(doc_json[fval])))
    elif file.startswith("ctype4-"):
      fout.write("%s\n" % (textify(doc_json["recipeDeck"])))
      fout.write("%s\n" % (textify(". ".join([x.values()[0] 
                           for x in doc_json["directions"]]))))
    elif file.startswith("ctype5-"):
      for fval in ["title", "body"]:
        fout.write("%s\n" % (textify(doc_json[fval])))
    else:
      continue
  except (KeyError, ValueError) as e:
    print "ERROR!", e
    continue
  fout.close()
  fin.close()

# filter out files with 0 bytes and remove them from text output directory
for file in os.listdir(TEXT_OUTPUTDIR):
  fname = os.path.join(TEXT_OUTPUTDIR, file)
  size = os.path.getsize(fname)
  if size == 0:
    print "Deleting zero byte file:", os.path.basename(fname)
    os.remove(fname)

Converting Text Files to Sequence File


The end product of the step above is a directory of text files. Punctuation, stopwords and number tokens have been stripped (because they are of limited value as topic terms), and all characters have been lowercased (not strictly necessary, since the vectorization step takes care of that). So each file is now essentially a bag of words.

Our pipeline is Hadoop based, and Hadoop likes a small number of large files, so this step converts the directory of 2,258 text files into a single large sequence file, where each row represents a single file. I run the mahout seqdirectory subcommand locally to do this, then copy the output up to S3 (where the EMR jobs can read it) using s3cmd (available on Ubuntu via apt-get and on Mac OS via MacPorts).

sujit@localhost:data$ $MAHOUT_HOME/bin/mahout seqdirectory \
    --input /path/to/data/hlcms_text \
    --output /path/to/data/hlcms_seq \
    -c UTF-8
sujit@localhost:data$ s3cmd put /path/to/data/hlcms_seq \
    s3://mybucket/cmstopics/

Vectorizing the Input


The next step is to create a term-document matrix out of the sequence files. Once again, we could do this locally with the Mahout seq2sparse subcommand, but I chose to do this on Amazon EMR - the only change is to specify the name of the class that corresponds to the seq2sparse subcommand (you can find this mapping in $MAHOUT_HOME/conf/driver.classes.default.props). You also need to copy the Mahout job JAR over to S3.

JAR location: s3n://mybucket/cmstopics/mahout-core-0.8-job.jar
JAR arguments:
org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles \
-i s3n://mybucket/cmstopics/hlcms_seq \
-o s3n://mybucket/cmstopics/hlcms_vec \
-wt tf

Using Amazon's Hadoop distribution (i.e., choosing Amazon Distribution at the Hadoop Version prompt in the AWS EMR console) results in this error.

Error running child : java.lang.NoSuchFieldError: LUCENE_43

This is very likely caused by the Amazon distribution gratuitously including old Lucene JARs (older than the Lucene 4.3 that the Mahout 0.8 job JAR includes). At runtime, Lucene classes from the Amazon JARs are being loaded, and they don't know anything about LUCENE_43 because that version constant does not (yet) exist in them. My solution was to try the MapR M7 distribution (at least partly based on the reasoning that Ted Dunning works for MapR and he is a committer for Mahout :-)). However, MapR (all distributions) requires m1.large instances at minimum, so it's a bit more expensive.

This step creates an output directory hlcms_vec that looks like this. Of these, the only ones of interest to this pipeline are the tf-vectors folder and the dictionary.file-0 file.

hlcms_vec/
+-- df-count
|   +-- _SUCCESS
|   +-- part-r-00000
+-- dictionary.file-0
+-- frequency.file-0
+-- tf-vectors
|   +-- _SUCCESS
|   +-- part-r-00000
+-- tokenized-documents
|   +-- _SUCCESS
|   +-- part-m-00000
+-- wordcount
    +-- _SUCCESS
    +-- part-r-00000

Converting Keys to IntWritables


This step is not mentioned in the official documentation. The blog post referenced earlier does not mention it either, but that's probably because Mahout 0.5's lda subcommand has since been deprecated in favor of the cvb subcommand. The tf-vectors file contains (Text, VectorWritable) tuples, but cvb expects to read (IntWritable, VectorWritable) tuples. The rowid subcommand does this conversion. Interestingly, the rowid job is contained in mahout-examples-0.8-job.jar and not in the main job JAR. Attempting to run it on Amazon EMR, on either the Amazon or MapR distributions, produces errors to the effect that it can only be run locally.

# running under MapR distribution
java.io.IOException: \
Could not resolve any CLDB hostnames for cluster: mybucket:7222
# running under Amazon distribution
java.lang.IllegalArgumentException: \
This file system object (hdfs://10.255.35.8:9000) does not support \
access to the request path 's3n://mybucket/cmstopics/cvb-vectors/docIndex'\
You possibly called FileSystem.get(conf) when you should have called \
FileSystem.get(uri, conf) to obtain a file system supporting your path.

So I ended up pulling tf-vectors down locally, converting it to tf-vectors-cvb, and then uploading the result back to S3.

sujit@localhost:data$ s3cmd get --recursive \
  s3://mybucket/cmstopics/hlcms_vec/tf-vectors/ \
  hlcms_vec/tf-vectors
sujit@localhost:data$ $MAHOUT_HOME/bin/mahout rowid \
  -i /path/to/data/hlcms_vec/tf-vectors \
  -o /path/to/data/hlcms_vec/tf-vectors-cvb
sujit@localhost:data$ s3cmd put --recursive hlcms_vec/tf-vectors-cvb \
  s3://mybucket/cmstopics/hlcms_vec/

After this subcommand is run, there is an additional folder tf-vectors-cvb in the hlcms_vec folder. The tf-vectors-cvb folder contains 2 files, matrix and docindex. Our pipeline only cares about the data in the matrix file.

hlcms_vec
+-- df-count
|   +-- _SUCCESS
|   +-- part-r-00000
+-- dictionary.file-0
+-- frequency.file-0
+-- tf-vectors
|   +-- _SUCCESS
|   +-- part-r-00000
+-- tf-vectors-cvb
|   +-- docindex
|   +-- matrix
+-- tokenized-documents
|   +-- _SUCCESS
|   +-- part-m-00000
+-- wordcount
    +-- _SUCCESS
    +-- part-r-00000

Run LDA on Modified term-vector input


Finally, we are ready to run LDA on our corpus. The Mahout lda subcommand has been deprecated and replaced with the cvb subcommand, which uses the Collapsed Variational Bayes (CVB) algorithm to do LDA. We run LDA with 50 topics (-k) for 30 iterations (-x) on Amazon EMR using a MapR distribution, with the following parameters.

JAR location: s3n://mybucket/cmstopics/mahout-core-0.8-job.jar
JAR arguments:
org.apache.mahout.clustering.lda.cvb.CVB0Driver \
-i s3n://mybucket/cmstopics/hlcms_vec/tf-vectors-cvb/matrix \
-dict s3n://mybucket/cmstopics/hlcms_vec/dictionary.file-0 \
-o s3n://mybucket/cmstopics/hlcms_lda/topicterm \
-dt s3n://mybucket/cmstopics/hlcms_lda/doctopic \
-k 50 \
-ow \
-x 30 \
-a 1 \
-e 1

There are a number of things to keep in mind here. For one, -nt (number of terms) should not be specified if -dict is specified, since it can be inferred from -dict (or your job may fail). Also, don't specify -mt (model directory): if you do and the directory cannot be found, the job will fail.

The output of the job is two folders, doctopic and topicterm. Both contain sequence files with (IntWritable, VectorWritable) tuples. Each row of doctopic represents a document, and its VectorWritable holds the p(topic|doc) values, one per topic. Each row of topicterm represents a topic, and its VectorWritable holds the p(term|topic) values, one per term.

hlcms_lda
+-- doctopic
|   +-- _SUCCESS
|   +-- part-m-00000
+-- topicterm
    +-- _SUCCESS
    +-- part-m-00001
    +-- ...
    +-- part-m-00009

Dump results into CSV


The official documentation says to use Mahout's ldatopics subcommand, but according to this StackOverflow page, ldatopics is deprecated and you should use the vectordump subcommand instead.

The vectordump subcommand merges the information from the dictionary file with one of doctopic or topicterm, and writes out a CSV file representing a matrix of p(topic|doc) or p(term|topic) values respectively. I wasn't sure how to dump the output to a local filesystem on Amazon EMR, so I just copied the files down locally using s3cmd and ran vectordump on them.

sujit@localhost:data$ s3cmd get --recursive \
  s3://mybucket/cmstopics/hlcms_lda hlcms_lda
sujit@localhost:data$ $MAHOUT_HOME/bin/mahout vectordump \
  -i /path/to/data/hlcms_lda/topicterm \
  -d /path/to/data/hlcms_vec/dictionary.file-0 \
  -dt sequencefile \
  -c csv \
  -p true \
  -o ./p_term_topic.txt \
  -sort /path/to/data/hlcms_lda/topicterm \
  -vs 10
sujit@localhost:data$ $MAHOUT_HOME/bin/mahout vectordump \
  -i /path/to/data/hlcms_lda/doctopic \
  -d /path/to/data/hlcms_vec/dictionary.file-0 \
  -dt sequencefile \
  -c csv \
  -p true \
  -o ./p_topic_doc.txt \
  -sort /path/to/data/hlcms_lda/doctopic \
  -vs 10 

The p_term_topic.txt contains the p(term|topic) for each of the 50 topics, one topic per row. The p_topic_doc.txt contains the p(topic|doc) values for each document, one document per row.

Create Reports


We can create some interesting reports out of the data computed above. One such would be to find the top 10 words for each topic cluster. Here is the code for this report:

import operator
import string

terms = {}

f = open("/path/to/data/p_term_topic.txt", 'rb')
ln = 0
for line in f:
  if len(line.strip()) == 0: continue
  if ln == 0:
    # make {id,term} dictionary for use later
    tn = 0
    for term in line.strip().split(","):
      terms[tn] = term
      tn += 1
  else:
    # parse out topic and probability, then build map of term to score
    # finally sort by score and print top 10 terms for each topic.
    topic, probs = line.strip().split("\t")
    termProbs = {}
    pn = 0
    for prob in probs.split(","):
      termProbs[terms[pn]] = float(prob)
      pn += 1
    toptermProbs = sorted(termProbs.iteritems(),
      key=operator.itemgetter(1), reverse=True)
    print "Topic: %s" % (topic)
    print "\n".join([(" "*3 + x[0]) for x in toptermProbs[0:10]])
  ln += 1
f.close()

And the results are shown (after some editing to make them easier to read) below:

Topic 0: droids applaud technique explosions sufferers born delight succeed compliant warming
Topic 1: responds stools technique explosions applaud proposal stern centers warming succeed
Topic 2: responds applaud droids explosions proposal born delight sexually upsidedown hemophilia
Topic 3: elisa responds sufferers born delight sexually fully hemophilia fury upsidedown
Topic 4: technique sufferers stools droids explosions knees amount stabilized centers stern
Topic 5: group's technique stools applaud born amount stern vascular vectors knees
Topic 6: technique droids stools authored interchangeably stern households vectors bleed muchneeded
Topic 7: sufferers technique responds explosions applaud born compliant stabilized recording punch
Topic 8: droids explosions responds technique born upsidedown hypogastric compliant flinn bleed
Topic 9: group's responds applaud explosions technique born vectors delight punch fully
Topic 10: group's responds sufferers explosions droids authored proposal centers thick flinn
Topic 11: applaud droids sufferers technique responds stools born vectors delight succeed
Topic 12: explosions applaud stools stern born upsidedown delight fury recording hypogastric
Topic 13: sufferers applaud interchangeably muchneeded households stabilized sexually ninety succeed flinn
Topic 14: technique stools responds droids interchangeably centers muchneeded thick upsidedown punch
Topic 15: group's responds sufferers technique stools explosions flinn hemophilia delight centers
Topic 16: responds applaud technique vectors knees stern stabilized vascular sexually recording
Topic 17: responds stools sufferers vectors centers ninety warming households muchneeded interchangeably
Topic 18: technique sufferers explosions proposal born hemophilia centers delight fury compliant
Topic 19: group's sufferers applaud droids stools born centers punch compliant delight
Topic 20: technique responds sufferers applaud droids stools interchangeably amount born ninety
Topic 21: responds applaud sufferers droids born delight sexually flinn vascular thick
Topic 22: applaud explosions droids born delight upsidedown interchangeably amount compliant punch
Topic 23: technique explosions vectors fury stern vascular households untreatable hemophilia stabilized
Topic 24: technique droids applaud sufferers stools stern amount interchangeably households centers
Topic 25: stools sufferers responds born knees amount vectors flinn untreatable upsidedown
Topic 26: stools explosions proposal authored droids vectors knees fury amount succeed
Topic 27: stools proposal responds applaud born knees amount vascular untreatable hypogastric
Topic 28: applaud technique explosions sufferers droids responds stabilized centers punch muchneeded
Topic 29: responds stools droids explosions interchangeably stern households ninety upsidedown amount
Topic 30: responds explosions applaud sufferers stools droids centers compliant vectors thick
Topic 31: stools explosions droids technique vectors centers muchneeded thick flinn stabilized
Topic 32: responds technique droids stools explosions born interchangeably households fury hypogastric
Topic 33: applaud explosions droids technique compliant punch centers warming hemophilia fully
Topic 34: droids technique vectors stern interchangeably fury households muchneeded amount knees
Topic 35: sufferers technique responds authored centers vectors interchangeably punch fully warming
Topic 36: technique stools responds droids authored stern fury ninety bleed compliant
Topic 37: elisa sufferers group's technique droids interchangeably centers vectors punch thick
Topic 38: stools proposal technique sexually upsidedown stabilized thick punch muchneeded compliant
Topic 39: interchangeably stabilized vectors centers punch compliant ninety delight hemophilia droids
Topic 40: stools applaud responds sufferers authored born flinn interchangeably hypogastric fury
Topic 41: group's responds sufferers applaud authored centers fury bleed hypogastric stern
Topic 42: responds stools technique sufferers applaud vectors amount knees untreatable upsidedown
Topic 43: elisa technique explosions responds stools proposal stern succeed born warming
Topic 44: stools applaud authored interchangeably stern born ninety muchneeded households warming
Topic 45: responds droids sufferers interchangeably fury vectors households ninety muchneeded stern
Topic 46: group's droids stools explosions applaud authored proposal sufferers interchangeably stabilized
Topic 47: sufferers group's explosions applaud responds droids technique stools interchangeably fury
Topic 48: amount stern knees flinn compliant sexually thick bleed upsidedown punch
Topic 49: technique droids applaud sufferers explosions born amount knees centers succeed

Another interesting report would be to see the composition of topics within the corpus. We calculate the "topic" of a document as the topic with the highest p(topic|doc) value for that document. We then display the number of documents across various topics as a histogram. Here is the code:

import numpy as np
import pylab as pl

f = open("/path/to/data/p_topic_doc.txt", 'rb')
xvals = range(0, 50)
tcounts = np.zeros((50))
for line in f:
  line = line.strip()
  if len(line) == 0 or line.startswith("#"): continue
  docid, probs = line.split("\t")
  plist = [float(p) for p in probs.split(",")]
  topic = plist.index(max(plist))
  tcounts[topic] += 1
f.close()
yvals = list(tcounts)
print xvals
print yvals
fig = pl.figure()
ax = pl.subplot(111)
ax.bar(xvals, yvals)
pl.ylabel("#-Documents")
pl.xlabel("Topics")
pl.show()

And here is the resulting histogram. As you can see, the distribution appears fairly uniform, with a few popular topics. We could try to correlate these topics with the popular words in each topic to figure out what our corpus is all about.

Yet another application could be to think of LDA as a feature reduction strategy, reducing the problem to only 50 features (the number of topics) represented by the p(topic|doc) values.
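As a rough sketch of that idea, the p_topic_doc.txt file produced by vectordump above can be turned directly into a document-feature matrix (what you feed the features into downstream is up to you):

import numpy as np

NUM_TOPICS = 50

# build a (num_docs x 50) feature matrix from the vectordump output;
# each row is the p(topic|doc) distribution for one document
docids, rows = [], []
f = open("/path/to/data/p_topic_doc.txt", 'rb')
for line in f:
  line = line.strip()
  if len(line) == 0 or line.startswith("#"): continue
  docid, probs = line.split("\t")
  docids.append(docid)
  rows.append([float(p) for p in probs.split(",")])
f.close()

features = np.array(rows)
assert features.shape[1] == NUM_TOPICS
print features.shape
# these 50-dimensional rows can now replace the much noisier raw term
# vectors in downstream clustering, classification or similarity tasks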

Conclusion


Topic Modeling can be a powerful tool and provides interesting insights into your data. Mahout is one of the few packages that can do Topic Modeling at scale. However, using it was daunting because of poor/outdated documentation. Mahout hasn't yet reached the 1.0 release milestone, and there is already some work being done within the Mahout community to improve documentation, so hopefully it will all be ironed out by that time.

36 comments (moderated to prevent spam):

Ram Awasthi said...

Thanks Sujit for this post this helped me on my project.

Regards
Ram

Sujit Pal said...

Hi Ram, you're very welcome, glad I could help.

Anonymous said...

Hi Sujit, I have read about LDA but still confused. Does it require that you have a set of seed topics or is it totally unsupervised? Does it detect phrases or only single words ?

Thanks,
Ravi Kiran Bhaskar

Sujit Pal said...

Hi Ravi, I've tried to capture the gist of my understanding of LDA in the blockquote (under Theory) in the post. Its completely unsupervised, it uses an EM like algorithm to converge to the optimal topic distribution. In this particular setup the topics are single words - but if you feed it a stream of bigrams instead, then its topics could be bigrams. One other application could be the feature reduction possibilities - given a large volume of text, you could do a pass on it to extract topics then represent each document with a set of topic features instead of (noisier) word features.

Nitin said...

Thanks Sujit. Just to point out, your top 10 words might not be correct: on my data, your Python code is not considering decimals. For example, it is considering 0.9999E-6 > 0.9666E-4 and so giving wrong output.

Sujit Pal said...

Thanks very much for pointing it out, Nitin. I suspect the problem is on line 23 of the report creation code; I should wrap it in float() or something, since currently it's doing a string comparison in sorted(). I will fix it and regenerate my reports.

nitin agrawal said...

Yes, I used float(prob) and got the correct results.

Regards,
Nitin

Sujit Pal said...

Thanks again, I have updated the code in the post as well.

Ambarish Hazarnis said...

Hi Sujit,
Thanks for the post. Do you know if there is any method to infer a new document using the LDA model obtained?

Thanks,
Ambarish

Sujit Pal said...

Hi Ambarish, I don't think there is anything in Mahout to do this, but I could be wrong. Maybe you could use the p(term|topic) values to calculate this?
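Something along the lines of this rough, untested sketch might work - it just scores a new document against each topic using the p(term|topic) values parsed out of p_term_topic.txt (a crude bag-of-words approximation, not proper LDA folding-in):

import math

# topic_term[topic][term] = p(term|topic), parsed from p_term_topic.txt
# the same way as in the report code in the post
def infer_topics(new_doc, topic_term, eps=1e-12):
  scores = {}
  for topic, term_probs in topic_term.iteritems():
    # log p(doc|topic) under a naive bag-of-words assumption
    logp = 0.0
    for word in new_doc.lower().split():
      logp += math.log(term_probs.get(word, eps))
    scores[topic] = logp
  # topics sorted from most to least likely for this document
  return sorted(scores.items(), key=lambda x: x[1], reverse=True)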

Jeff said...

Sujit,
Do you know how to map the doc_id back to your original corpus? Assuming that some documents were removed due to seq2sparse, I want to find out which original document corresponds to each docID.

Sujit Pal said...

Hi Jeff, sorry about the delay in responding. I don't think seq2sparse actually drops documents, each row in the sequence file corresponds to a block of text from a file in the input directory, and it converts to a sparse vector of terms for that file. Seqdirectory reads the input directory in the same order as "ls" so you could use that to map the doc_id back to the document.
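For example, something like this rough sketch, assuming that order really is preserved (the directory is the text output directory from earlier in the post):

import os

# docid N in the rowid matrix corresponds to the N-th file in "ls" order
files = sorted(os.listdir("/path/to/data/hlcms_text"))
docid_to_file = dict(enumerate(files))
print docid_to_file[0]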

Unknown said...

Thanks for the great article and information.

On the doc_id map, I am guessing the vectordump on matrix/docIndex may serve the mapping purpose.

Also, I have to use '-c csv' to retrieve the correct doc/topic information on 0.9.

Sam

Sujit Pal said...

Thanks Sam.

Anonymous said...

Thanks Sujit for this post, it helped a lot.
Can you please show me how to print each topic's words to a file?
And how to find out exactly which topic each document belongs to?
Thanks in advance

Sujit Pal said...

I am guessing you are asking about the mechanics of writing the topic words into a file rather than to STDOUT, right? In that case, in the code block just above the table containing topic words, you would replace the print statements on lines 27-28 with fout.write() calls, open fout somewhere around line 7, and close it somewhere around line 30. For finding the documents in a topic, you use the other file, p_topic_doc.txt. Something like this:

fout = open("/path/to/output.txt", 'wb')
...
toptermProbs = sorted(termProbs.iteritems(), key=operator.itemgetter(1), reverse=True)
fout.write("%s\t%s\n" % (topic, "\t".join([x[0] for x in toptermProbs])))
...
fout.close()

Anonymous said...

-"I am guessing you are asking about the mechanics of writing the topic words into a file rather than to STDOUT right?" yes I meant this, I tried this before posting the question ,but I kept read from different file and it always cause "Cannot convert String to Float error" and I'm sorry for my naive question it is my fault, Thank you very much , it helps me alot.
-What do you suggest to visual the topics ,docs,and associated words?

My respect.

Sujit Pal said...

Thank you, and really no need to apologize. Regarding visualization, how about tag clouds for each topic where each word is sized based on the p(topic|term)? For the p(doc|topic) I can't think of anything fancier than a simple histogram (similar to the figure in the post).

Anonymous said...

"how about tag clouds for each topic where each word is sized based on the p(topic|term)?"
Sounds good
- Can you please suggest a library to implement this in Python? I am almost a complete novice in Python.
- How can I subscribe to this blog? The RSS button did not work.
Many thanks.

Sujit Pal said...

I used a nice library called wordcloud only last week (code and output here), maybe this can help as an example.

Thanks for pointing out the broken RSS/Atom feed links. I have replaced the widget with Blogspot's follow by mail widget - you should see "Subscribe" followed by a text prompt to enter your email.

adismart said...

Hi,
I am using the cvb algorithm of Mahout (0.9) for topic modeling.
Following is the code used: http://stackoverflow.com/questions/26340247/interpreting-mahout0-9-cvb-output

But the output for the document-to-topic mapping is not coming out in a proper format. Please guide me on the correct way.

Sujit Pal said...

Hi adismart, going by the output in the SO page, it looks like perhaps Mahout 0.9 may have a cosmetic bug in printing the reports where lines are not terminated (or maybe it was there in 0.8 and I just fixed it myself without too much thought). You can easily fix it with the following Unix command:

sed -e 's/} /}\n/g' doc_topic.txt > fixed_doc_topic.txt

The resulting document looks something like this:

0 {2d:0.0199,3d:0.0199,...}
1 {2d:0.0200,3d:0.0199,...}
...
4 {2d:0.0200,3d:0.0199,...}

It means that for doc#0, the topic "2d" has a probability of 0.0199, "3d" has a probability of 0.0199, etc.

adismart said...

Thanks Sujit for the prompt response.
I was not able to verify whether the output is correct, because topic names like 2d, 3d, etc. were coming out in the output.
I tried removing the stop words and even did stemming, but the output (doc-topic) remains the same.
I also tried working with bigrams and trigrams, and the document-topic mapping still didn't show much improvement.
Am I missing some step that is causing the problem?
Please guide

Thanks & Regards,
Aditya

adismart said...

The other problem I see in the output is that only the terms starting with "a" are getting printed in the document-topic mapping.

It seems I missed something in the code?

Sujit Pal said...

Yes, didn't notice initially that the topics all start with either numbers or "a", very likely because of this...

String[] rowIDArgs = {"--input",inputVectorDir + "/tf-vectors/part-r-00000", "--output", rowIDOutFile};

ie, you are using a subset of the tf-vectors (these are sorted so it probably stops at "a").

adismart said...

Thanks Sujit. I just looked at the directory for the topic vectors. It contains only two files: part-r-00000 and .part-r-00000.crc

Thanks,
Aditya

Sujit Pal said...

Bummer :-). Well, I can't think of anything that would cause this sort of behavior...in my case things worked mostly out of the box. I noticed you posted your problem to the Mahout ML, I think maybe you will get a better response there since people there are more in touch with Mahout.

adismart said...

I am also surprised to see this behavior. I checked most of the forums, and everywhere they are using similar code to get the desired output, while I am still struggling to find a solution to this problem.
Anyway, thanks for your suggestions.

Thanks ,
Aditya

Vinay said...

Thanks for the nice article Sujit. Under what conditions would you use LingPipe vs Apache Mahout?
My use case is that I have crawled data using Nutch and would like to create a taxonomy / categorization of the data. I am not very clear on which one is the right choice for creating the taxonomy.

Sujit Pal said...

Thanks Vinay. I think the choice would primarily be based on data volume: Mahout for large volumes, and LingPipe otherwise (although AFAIK LingPipe doesn't have anything for topic modeling; there is another toolkit called Mallet which does). In your case, you may want to try Mallet against your data first, and if it fails, try Mahout.

adismart said...

Hi Sujit,
Is there a way to give meaningful names to the topics created? I tried naming each topic with its top 5 highest-probability terms, but the name doesn't correctly represent the group.
Regards Aditya

Sujit Pal said...

Hi Aditya, I don't think it's possible to do this automatically. Usually you look at the terms in the topic and come up with some representative topic name.

Unknown said...

hi Sujit,

I am trying to run the LDA algorithm in Mahout, but I keep getting the following error:

Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected


I thought it was something to do with JAR files and compatibility, so I added the JARs to the trunk directory, but it still throws the same error. I don't know which path I should place the JAR files in. Kindly help me, boss.

Sujit Pal said...

Hi Nikitha, I found a Stack Overflow page by copy-pasting the error message into Google. It says that this is because of a change in the API between Hadoop 1.x and Hadoop 2.x, and it also lists some things you should try, which should fix it.

Unknown said...

Thank you so much for replying :) I am trying to do LDA on the Reuters dataset, but when I try to plot the output as a graph using matplotlib, I get an error at the split line, as below:
hduser@ubuntu:/usr/local/Mahoutold$ python ldaplot.py
Traceback (most recent call last):
File "ldaplot.py", line 10, in
docid, probs = line.split("\t")
ValueError: need more than 1 value to unpack

Can you suggest something, please?

Sujit Pal said...

This is probably because it doesn't have two values separated by tab in the line it is seeing. You can try to see why by printing the line immediately before the split. Most likely it might be a comment line or something that the previous if conditions are not catching.