Introduction
The motivation for this work was a desire to understand the structure of a corpus in a manner different from what I am used to. Central to all our applications is a knowledge graph derived from our medical taxonomy, so any document corpus can easily be characterized by a small set (50-100) of high-level concepts, simply by rolling up document concepts into their parents until an adequate degree of granularity is reached. I wanted to see if standard topic modeling techniques would yield comparable results. If so, perhaps the output of such a process could be used as feedback for concept creation.
This post describes topic modeling a smallish corpus (2,285 documents) from our document collection using Apache Mahout's Latent Dirichlet Allocation (LDA) algorithm, run on the Amazon Elastic MapReduce (EMR) platform. Mahout provides the LDA implementation as well as utilities for IO; the code I wrote works at the two ends of the pipeline, first to download and parse the data for Mahout to consume, and then to produce a report of the top terms in each topic category.
Even though Mahout (I used version 0.8) provided most of the functionality for this work, the experience was hardly straightforward. The official documentation is outdated, and I had to repeatedly refer to discussions on the Mahout mailing lists to find solutions for problems I faced along the way. I found only one blog post, based on Mahout version 0.5, that I could use as a starting point. Of course, all's well that ends well, and I was ultimately able to get the top terms for each topic and the topic composition of each document.
Theory
The math behind LDA is quite formidable as you can see from its Wikipedia page, but here is a somewhat high-level view, selectively gleaned from this paper by Steyvers and Griffiths.
Topic Models are based upon the idea that documents are mixtures of topics, where a topic is a probability distribution over words. To make a new document, one chooses a distribution over topics. Then for each word in the document, one chooses a topic at random and draws a word from the topic.
To answer the (in my opinion more interesting) question of what topics make up a collection of documents, you invert this process. Each topic modeling algorithm does this differently; LDA uses an approximate iterative method that samples values sequentially until the samples converge to the target distribution.
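To make the generative story above concrete, here is a toy simulation in numpy. This is just an illustration and not part of the Mahout pipeline: the number of topics, vocabulary size and Dirichlet parameters are invented for the example (they loosely correspond to the -a and -e smoothing options passed to Mahout's cvb later in this post).

import numpy as np

def generate_document(num_words, topic_word, alpha=1.0):
    # topic_word: one probability distribution over the vocabulary per topic
    num_topics, vocab_size = topic_word.shape
    # 1. choose this document's mixture over topics
    doc_topic = np.random.dirichlet(alpha * np.ones(num_topics))
    words = []
    for _ in range(num_words):
        # 2. pick a topic for this word position, then draw a word from that topic
        z = np.random.choice(num_topics, p=doc_topic)
        words.append(np.random.choice(vocab_size, p=topic_word[z]))
    return doc_topic, words

# three topics over a five-word vocabulary
topic_word = np.random.dirichlet(0.1 * np.ones(5), size=3)
mixture, doc = generate_document(20, topic_word)
print(mixture)   # the document's topic proportions
print(doc)       # word ids drawn according to those proportions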
Preparing the data
The documents for this work come from our Content Management System (CMS), and this section describes the extraction code. It's included for completeness; your setup is likely very different, so it may be of limited use to you. In any case, our CMS is loosely coupled to our web front end via a publisher, which serializes documents in JSON format onto a network filesystem. Content is pulled off this filesystem most efficiently through a REST API if you know the "file ID", so I use Solr to get a list of these file IDs and download the files to my local filesystem for further processing.
Processing consists of parsing out the text content of the files (each content type can define its own JSON format), then using NLTK to remove HTML tags, stopwords, numeric tokens and punctuation. The text versions of the JSON files are written out to another directory for feeding into the Mahout pipeline.
The code is in Python and is shown below. Hostnames and such have been changed to protect the innocent.
import json
import nltk
import os
import os.path
import string
import urllib
import urllib2

SOLR_SERVER = "http://solrserver.mycompany.com:8983/solr/select"
CONTENT_SERVER = "http://contentserver.mycompany.com/view"
JSON_OUTPUTDIR = "/path/to/data/hlcms_jsons"
TEXT_OUTPUTDIR = "/path/to/data/hlcms_text"
FILENAMES_FILE = "/tmp/hlcms_filenames.txt"

STOPWORDS = nltk.corpus.stopwords.words("english")
PUNCTUATIONS = {c: "" for c in string.punctuation}

def textify(s):
    # strip HTML, then drop punctuation, stopwords and numeric tokens
    text = nltk.clean_html(s)
    sentences = nltk.sent_tokenize(text)
    words = []
    for sentence in sentences:
        sent = sentence.encode("utf-8", 'ascii')
        sent = "".join([PUNCTUATIONS[c] if PUNCTUATIONS.has_key(c) else c
                        for c in sent])
        ws = nltk.word_tokenize(sent)
        for w in ws:
            if w in STOPWORDS: continue
            if w.replace(",", "").replace(".", "").isdigit(): continue
            words.append(w.lower())
    return " ".join(words)

# build list of all file parameter values from solr
params = urllib.urlencode({
    "q" : "sourcename:hlcms",
    "start" : "0",
    "rows" : "0",
    "fl" : "contenttype,cmsurl",
    "wt" : "json"
})
conn = urllib.urlopen(SOLR_SERVER, params)
rsp = json.load(conn)
numfound = rsp["response"]["numFound"]
print "# of CMS articles to download: ", numfound

filenames = open(FILENAMES_FILE, 'wb')
npages = int(numfound / 10) + 1
for pg in range(0, npages):
    if pg % 100 == 0:
        print "Downloading HLCMS page #: %d" % (pg)
    params = urllib.urlencode({
        "q" : "sourcename:hlcms",
        "start" : str(pg * 10),
        "rows" : "10",
        "fl" : "contenttype,cmsurl",
        "wt" : "json"
    })
    conn = urllib.urlopen(SOLR_SERVER, params)
    rsp = json.load(conn)
    for doc in rsp["response"]["docs"]:
        try:
            contenttype = doc["contenttype"]
            cmsurl = doc["cmsurl"]
            filenames.write("%s-%s\n" % (contenttype, cmsurl))
        except KeyError:
            continue
filenames.close()

# for each file parameter, build URL and extract data into local dir
filenames2 = open(FILENAMES_FILE, 'rb')
for filename in filenames2:
    fn = filename.strip()
    ofn = os.path.join(JSON_OUTPUTDIR, fn + ".json")
    print "Downloading file: ", fn
    try:
        output = open(ofn, 'wb')
        response = urllib2.urlopen(CONTENT_SERVER + "?file=" + fn + "&raw=true")
        output.write(response.read())
        output.close()
    except IOError:
        continue
filenames2.close()
print "All files downloaded"

# build parser for each content type to extract title and body
for file in os.listdir(JSON_OUTPUTDIR):
    print "Parsing file: %s" % (file)
    fin = open(os.path.join(JSON_OUTPUTDIR, file), 'rb')
    ofn = os.path.join(TEXT_OUTPUTDIR,
        os.path.basename(file[0:file.rindex(".json")]) + ".txt")
    fout = open(ofn, 'wb')
    try:
        doc_json = json.load(fin)
        # parsing out title and body based on content type
        # since different content types can have own format
        if file.startswith("ctype1-"):
            for fval in ["title", "bm_intro", "bm_seo_body"]:
                fout.write("%s\n" % (textify(doc_json[fval])))
        elif file.startswith("ctype2-"):
            for fval in ["body"]:
                fout.write("%s\n" % (textify(doc_json[fval])))
        elif file.startswith("ctype3-"):
            for fval in ["title", "body"]:
                fout.write("%s\n" % (textify(doc_json[fval])))
        elif file.startswith("ctype4-"):
            fout.write("%s\n" % (textify(doc_json["recipeDeck"])))
            fout.write("%s\n" % (textify(". ".join([x.values()[0]
                for x in doc_json["directions"]]))))
        elif file.startswith("ctype5-"):
            for fval in ["title", "body"]:
                fout.write("%s\n" % (textify(doc_json[fval])))
        else:
            continue
    except ValueError as e:
        print "ERROR!", e
        continue
    fout.close()
    fin.close()

# filter out files with 0 bytes and remove them from text output directory
for file in os.listdir(TEXT_OUTPUTDIR):
    fname = os.path.join(TEXT_OUTPUTDIR, file)
    size = os.path.getsize(fname)
    if size == 0:
        print "Deleting zero byte file:", os.path.basename(fname)
        os.remove(fname)
Converting Text Files to Sequence File
The end product of the step above is a directory of text files. Punctuation, stopwords and numeric tokens have been stripped (because they are of limited value as topic terms) and all characters have been lowercased (not strictly necessary, since the vectorization step takes care of that). Each file is now essentially a bag of words.
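A quick sanity check at this point is to look at the most frequent tokens across the whole text directory; stopwords, numbers and punctuation should not show up here if the preprocessing did its job. This is just a sketch that assumes the hlcms_text directory produced by the script above:

import collections
import os

TEXT_OUTPUTDIR = "/path/to/data/hlcms_text"

# accumulate token counts over every preprocessed text file
counter = collections.Counter()
for fname in os.listdir(TEXT_OUTPUTDIR):
    with open(os.path.join(TEXT_OUTPUTDIR, fname)) as f:
        counter.update(f.read().split())

print(counter.most_common(20))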
Our pipeline is Hadoop based, and Hadoop likes a small number of large files, so this step converts the directory of 2,258 text files into a single large sequence file, where each row represents a single file. I run the Mahout seqdirectory subcommand locally to do this, then copy the output up to S3 using s3cmd (available on Ubuntu via apt-get and on Mac OS via MacPorts) so it is accessible to Amazon EMR.
sujit@localhost:data$ $MAHOUT_HOME/bin/mahout seqdirectory \
    --input /path/to/data/hlcms_text \
    --output /path/to/data/hlcms_seq \
    -c UTF-8
sujit@localhost:data$ s3cmd put /path/to/data/hlcms_seq \
    s3://mybucket/cmstopics/
Vectorizing the Input
The next step is to create a term-document matrix out of the sequence file. Once again, we could do this locally with the Mahout seq2sparse subcommand, but I chose to do it on Amazon EMR - the only change is to specify the name of the class that corresponds to the seq2sparse subcommand (you can find this mapping in $MAHOUT_HOME/conf/driver.classes.default.props). You also need to copy the Mahout job JAR over to S3.
JAR location: s3n://mybucket/cmstopics/mahout-core-0.8-job.jar
JAR arguments:
org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles \
-i s3n://mybucket/cmstopics/hlcms_seq \
-o s3n://mybucket/cmstopics/hlcms_vec \
-wt tf
Running this with Amazon's Hadoop distribution (i.e., choosing "Amazon Distribution" for the Hadoop Version prompt in the AWS EMR console) results in this error:
Error running child : java.lang.NoSuchFieldError: LUCENE_43
This is very likely caused by the Amazon distribution gratuitously including Lucene JARs older than the Lucene 4.3 that the Mahout 0.8 job JAR bundles. At runtime, Lucene classes from the Amazon JARs are loaded, and they know nothing about the LUCENE_43 version constant because it does not (yet) exist in those versions. My solution was to try the MapR M7 distribution instead (at least partly because Ted Dunning works for MapR and is a committer on Mahout :-)). However, MapR (all distributions) requires m1.large instances at minimum, so it's a bit more expensive.
This step creates an output directory hlcms_vec that looks like this. Of these, the only ones of interest to this pipeline are the tf-vectors folder and the dictionary.file-0 file.
hlcms_vec/
+-- df-count
|   +-- _SUCCESS
|   +-- part-r-00000
+-- dictionary.file-0
+-- frequency.file-0
+-- tf-vectors
|   +-- _SUCCESS
|   +-- part-r-00000
+-- tokenized-documents
|   +-- _SUCCESS
|   +-- part-m-00000
+-- wordcount
    +-- _SUCCESS
    +-- part-r-00000
Converting Keys to IntWritables
This step is not documented in the official documentation. The blog post mentioned above does not cover it either, but that's probably because Mahout 0.5's lda subcommand has since been deprecated in favor of the cvb subcommand. The tf-vectors sequence file contains (Text, VectorWritable) tuples, but cvb expects to read (IntWritable, VectorWritable), and the rowid subcommand does this conversion. Interestingly, the rowid job is contained in mahout-examples-0.8-job.jar and not in the main job JAR. Attempting to run it on Amazon EMR, on either the Amazon or the MapR distribution, produces errors to the effect that it can only be run locally.
# running under MapR distribution
java.io.IOException: \
    Could not resolve any CLDB hostnames for cluster: mybucket:7222

# running under Amazon distribution
java.lang.IllegalArgumentException: \
    This file system object (hdfs://10.255.35.8:9000) does not support \
    access to the request path 's3n://mybucket/cmstopics/cvb-vectors/docIndex' \
    You possibly called FileSystem.get(conf) when you should have called \
    FileSystem.get(uri, conf) to obtain a file system supporting your path.
So I ended up pulling down tf-vectors locally, converting to tf-vectors-cvb and then uploading back to S3.
sujit@localhost:data$ s3cmd get --recursive \
    s3://mybucket/cmstopics/hlcms_vec/tf-vectors/ \
    hlcms_vec/tf-vectors
sujit@localhost:data$ $MAHOUT_HOME/bin/mahout rowid \
    -i /path/to/data/hlcms_vec/tf-vectors \
    -o /path/to/data/hlcms_vec/tf-vectors-cvb
sujit@localhost:data$ s3cmd put tf-vectors-cvb \
    s3://mybucket/cmstopics/hlcms_vec/
After this subcommand is run, there is an additional folder tf-vectors-cvb in the hlcms_vec folder. The tf-vectors-cvb folder contains two files, matrix and docIndex. Our pipeline only cares about the data in the matrix file.
hlcms_vec/
+-- df-count
|   +-- _SUCCESS
|   +-- part-r-00000
+-- dictionary.file-0
+-- frequency.file-0
+-- tf-vectors
|   +-- _SUCCESS
|   +-- part-r-00000
+-- tf-vectors-cvb
|   +-- docIndex
|   +-- matrix
+-- tokenized-documents
|   +-- _SUCCESS
|   +-- part-m-00000
+-- wordcount
    +-- _SUCCESS
    +-- part-r-00000
Run LDA on Modified term-vector input
Finally, we are ready to run LDA on our corpus. The Mahout lda subcommand has been deprecated and replaced with the cvb subcommand, which uses the Collapsed Variational Bayes (CVB) algorithm to do LDA. We run LDA with 50 topics (-k) for 30 iterations (-x) on Amazon EMR using a MapR distribution, with the following parameters.
JAR location: s3n://mybucket/cmstopics/mahout-core-0.8-job.jar
JAR arguments:
org.apache.mahout.clustering.lda.cvb.CVB0Driver \
-i s3n://mybucket/cmstopics/hlcms_vec/tf-vectors-cvb/matrix \
-dict s3n://mybucket/cmstopics/hlcms_vec/dictionary.file-0 \
-o s3n://mybucket/cmstopics/hlcms_lda/topicterm \
-dt s3n://mybucket/cmstopics/hlcms_lda/doctopic \
-k 50 \
-ow \
-x 30 \
-a 1 \
-e 1
There are a number of things to keep in mind here. For one, -nt (number of terms) should not be specified if -dict is specified, since it can be inferred from the dictionary (specifying both may fail the job). Also, don't specify -mt (model directory), since the job will fail if it can't find one.
The output of the job is two folders, doctopic and topicterm. Both contain sequence files of (IntWritable, VectorWritable) tuples. Each row of doctopic represents a document, and its VectorWritable holds the p(topic|doc) value for each topic. Each row of topicterm represents a topic, and its VectorWritable holds the p(term|topic) value for each term.
hlcms_lda/
+-- doctopic
|   +-- _SUCCESS
|   +-- part-m-00000
+-- topicterm
    +-- _SUCCESS
    +-- part-m-00001
    +-- ...
    +-- part-m-00009
Dump results into CSV
The official documentation says to use Mahout's ldatopics subcommand, but according to this StackOverflow page, ldatopics is deprecated and you should use the vectordump subcommand instead.
The vectordump subcommand merges the information in the dictionary file with either doctopic or topicterm, and writes out a CSV file representing a matrix of p(topic|doc) or p(term|topic) values respectively. I wasn't sure how to dump to a local filesystem from Amazon EMR, so I just copied the files down with s3cmd and ran vectordump on them locally.
sujit@localhost:data$ s3cmd get --recursive \
    s3://mybucket/cmstopics/hlcms_lda hlcms_lda
sujit@localhost:data$ $MAHOUT_HOME/bin/mahout vectordump \
    -i /path/to/data/hlcms_lda/topicterm \
    -d /path/to/data/hlcms_vec/dictionary.file-0 \
    -dt sequencefile \
    -c csv \
    -p true \
    -o ./p_term_topic.txt \
    -sort /path/to/data/hlcms_lda/topicterm \
    -vs 10
sujit@localhost:data$ $MAHOUT_HOME/bin/mahout vectordump \
    -i /path/to/data/hlcms_lda/doctopic \
    -d /path/to/data/hlcms_vec/dictionary.file-0 \
    -dt sequencefile \
    -c csv \
    -p true \
    -o ./p_topic_doc.txt \
    -sort /path/to/data/hlcms_lda/doctopic \
    -vs 10
The p_term_topic.txt file contains the p(term|topic) values for each of the 50 topics, one topic per row. The p_topic_doc.txt file contains the p(topic|doc) values for each document, one document per row.
Create Reports
We can create some interesting reports out of the data computed above. One such would be to find the top 10 words for each topic cluster. Here is the code for this report:
import operator
import string

terms = {}
f = open("/path/to/data/p_term_topic.txt", 'rb')
ln = 0
for line in f:
    if len(line.strip()) == 0: continue
    if ln == 0:
        # make {id,term} dictionary for use later
        tn = 0
        for term in line.strip().split(","):
            terms[tn] = term
            tn += 1
    else:
        # parse out topic and probability, then build map of term to score
        # finally sort by score and print top 10 terms for each topic.
        topic, probs = line.strip().split("\t")
        termProbs = {}
        pn = 0
        for prob in probs.split(","):
            termProbs[terms[pn]] = float(prob)
            pn += 1
        toptermProbs = sorted(termProbs.iteritems(),
            key=operator.itemgetter(1), reverse=True)
        print "Topic: %s" % (topic)
        print "\n".join([(" "*3 + x[0]) for x in toptermProbs[0:10]])
    ln += 1
f.close()
And the results are shown (after some editing to make them easier to read) below:
Topic 0: droids applaud technique explosions sufferers born delight succeed compliant warming
Topic 1: responds stools technique explosions applaud proposal stern centers warming succeed
Topic 2: responds applaud droids explosions proposal born delight sexually upsidedown hemophilia
Topic 3: elisa responds sufferers born delight sexually fully hemophilia fury upsidedown
Topic 4: technique sufferers stools droids explosions knees amount stabilized centers stern
Topic 5: group's technique stools applaud born amount stern vascular vectors knees
Topic 6: technique droids stools authored interchangeably stern households vectors bleed muchneeded
Topic 7: sufferers technique responds explosions applaud born compliant stabilized recording punch
Topic 8: droids explosions responds technique born upsidedown hypogastric compliant flinn bleed
Topic 9: group's responds applaud explosions technique born vectors delight punch fully
Topic 10: group's responds sufferers explosions droids authored proposal centers thick flinn
Topic 11: applaud droids sufferers technique responds stools born vectors delight succeed
Topic 12: explosions applaud stools stern born upsidedown delight fury recording hypogastric
Topic 13: sufferers applaud interchangeably muchneeded households stabilized sexually ninety succeed flinn
Topic 14: technique stools responds droids interchangeably centers muchneeded thick upsidedown punch
Topic 15: group's responds sufferers technique stools explosions flinn hemophilia delight centers
Topic 16: responds applaud technique vectors knees stern stabilized vascular sexually recording
Topic 17: responds stools sufferers vectors centers ninety warming households muchneeded interchangeably
Topic 18: technique sufferers explosions proposal born hemophilia centers delight fury compliant
Topic 19: group's sufferers applaud droids stools born centers punch compliant delight
Topic 20: technique responds sufferers applaud droids stools interchangeably amount born ninety
Topic 21: responds applaud sufferers droids born delight sexually flinn vascular thick
Topic 22: applaud explosions droids born delight upsidedown interchangeably amount compliant punch
Topic 23: technique explosions vectors fury stern vascular households untreatable hemophilia stabilized
Topic 24: technique droids applaud sufferers stools stern amount interchangeably households centers
Topic 25: stools sufferers responds born knees amount vectors flinn untreatable upsidedown
Topic 26: stools explosions proposal authored droids vectors knees fury amount succeed
Topic 27: stools proposal responds applaud born knees amount vascular untreatable hypogastric
Topic 28: applaud technique explosions sufferers droids responds stabilized centers punch muchneeded
Topic 29: responds stools droids explosions interchangeably stern households ninety upsidedown amount
Topic 30: responds explosions applaud sufferers stools droids centers compliant vectors thick
Topic 31: stools explosions droids technique vectors centers muchneeded thick flinn stabilized
Topic 32: responds technique droids stools explosions born interchangeably households fury hypogastric
Topic 33: applaud explosions droids technique compliant punch centers warming hemophilia fully
Topic 34: droids technique vectors stern interchangeably fury households muchneeded amount knees
Topic 35: sufferers technique responds authored centers vectors interchangeably punch fully warming
Topic 36: technique stools responds droids authored stern fury ninety bleed compliant
Topic 37: elisa sufferers group's technique droids interchangeably centers vectors punch thick
Topic 38: stools proposal technique sexually upsidedown stabilized thick punch muchneeded compliant
Topic 39: interchangeably stabilized vectors centers punch compliant ninety delight hemophilia droids
Topic 40: stools applaud responds sufferers authored born flinn interchangeably hypogastric fury
Topic 41: group's responds sufferers applaud authored centers fury bleed hypogastric stern
Topic 42: responds stools technique sufferers applaud vectors amount knees untreatable upsidedown
Topic 43: elisa technique explosions responds stools proposal stern succeed born warming
Topic 44: stools applaud authored interchangeably stern born ninety muchneeded households warming
Topic 45: responds droids sufferers interchangeably fury vectors households ninety muchneeded stern
Topic 46: group's droids stools explosions applaud authored proposal sufferers interchangeably stabilized
Topic 47: sufferers group's explosions applaud responds droids technique stools interchangeably fury
Topic 48: amount stern knees flinn compliant sexually thick bleed upsidedown punch
Topic 49: technique droids applaud sufferers explosions born amount knees centers succeed
Another interesting report would be to see the composition of topics within the corpus. We calculate the "topic" of a document as the topic with the highest p(topic|doc) value for that document. We then display the number of documents across various topics as a histogram. Here is the code:
import numpy as np
import pylab as pl

f = open("/path/to/data/p_topic_doc.txt", 'rb')
xvals = range(0, 50)
tcounts = np.zeros((50))
for line in f:
    line = line.strip()
    if len(line) == 0 or line.startswith("#"): continue
    docid, probs = line.split("\t")
    plist = [float(p) for p in probs.split(",")]
    topic = plist.index(max(plist))
    tcounts[topic] += 1
f.close()

yvals = list(tcounts)
print xvals
print yvals

fig = pl.figure()
ax = pl.subplot(111)
ax.bar(xvals, yvals)
pl.ylabel("#-Documents")
pl.xlabel("Topics")
pl.show()
And here is the resulting histogram. As you can see, the distribution appears fairly uniform, with a few popular topics. We could try to correlate these topics with the top words in each topic to figure out what our corpus is all about; a rough sketch of that is shown below.
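This sketch reuses the same parsing conventions as the two report scripts above (a document's topic is the argmax of its p(topic|doc) row, the first row of p_term_topic.txt is the term dictionary, and topic ids are assumed to be plain integers), and prints the ten most populous topics together with their top five terms.

import operator

# documents per topic: a document's topic is the argmax of its p(topic|doc) row
tcounts = {}
f = open("/path/to/data/p_topic_doc.txt")
for line in f:
    line = line.strip()
    if len(line) == 0 or line.startswith("#"):
        continue
    docid, probs = line.split("\t")
    plist = [float(p) for p in probs.split(",")]
    best = plist.index(max(plist))
    tcounts[best] = tcounts.get(best, 0) + 1
f.close()

# top 5 terms per topic, parsed from p_term_topic.txt as in the report code
topterms = {}
header = None
f = open("/path/to/data/p_term_topic.txt")
for line in f:
    line = line.strip()
    if len(line) == 0:
        continue
    if header is None:
        header = line.split(",")   # first row holds the term dictionary
        continue
    topic, probs = line.split("\t")
    scored = zip(header, [float(p) for p in probs.split(",")])
    ranked = sorted(scored, key=operator.itemgetter(1), reverse=True)
    topterms[int(topic)] = [term for term, score in ranked[0:5]]
f.close()

# print the ten most populous topics together with their top terms
for topic, ndocs in sorted(tcounts.items(), key=operator.itemgetter(1), reverse=True)[0:10]:
    print("Topic %d (%d docs): %s" % (topic, ndocs, ", ".join(topterms.get(topic, []))))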
Yet another application could be to treat LDA as a feature reduction strategy, reducing each document to just 50 features (the number of topics) represented by its p(topic|doc) values, for example as input to a downstream clustering or classification step.
Conclusion
Topic Modeling can be a powerful tool that provides interesting insights into your data, and Mahout is one of the few packages that can do Topic Modeling at scale. However, using it was daunting because of poor and outdated documentation. Mahout hasn't yet reached its 1.0 release milestone, and there is already some work being done within the Mahout community to improve the documentation, so hopefully these issues will be ironed out by then.
Comments

Thanks Sujit for this post, it helped me on my project. Regards, Ram

Hi Ram, you're very welcome, glad I could help.

Hi Sujit, I have read about LDA but am still confused. Does it require a set of seed topics, or is it totally unsupervised? Does it detect phrases or only single words? Thanks, Ravi Kiran Bhaskar

Hi Ravi, I've tried to capture the gist of my understanding of LDA in the blockquote (under Theory) in the post. It's completely unsupervised; it uses an EM-like algorithm to converge to the optimal topic distribution. In this particular setup the topics are single words, but if you feed it a stream of bigrams instead, its topics could be bigrams. One other application could be feature reduction: given a large volume of text, you could do a pass over it to extract topics, then represent each document with a set of topic features instead of (noisier) word features.

Thanks Sujit. Just to point out, your top 10 words might not be correct: on my data your Python code is not treating the probabilities as decimals. For example it considers 0.9999E-6 > 0.9666E-4, and so gives the wrong output.

Thanks very much for pointing it out, Nitin. I suspect the problem is the sort in the report creation code; I should wrap the probability in float() or something, since it's currently doing a string comparison in sorted(). I will fix and regenerate my reports.

Yes, I used float(prob) and got the correct results. Regards, Nitin

Thanks again, I have updated the code in the post as well.

Hi Sujit, thanks for the post. Do you know if there is any method to infer topics for a new document using the LDA model obtained? Thanks, Ambarish

Hi Ambarish, I don't think there is anything in Mahout to do this, but I could be wrong. Maybe you could use the p(term|topic) values to calculate this?

Sujit, do you know how to map the doc_id back to the original corpus? Assuming that some documents were removed during seq2sparse, I want to find out which original document corresponds to each docID.

Hi Jeff, sorry about the delay in responding. I don't think seq2sparse actually drops documents; each row in the sequence file corresponds to a block of text from a file in the input directory, and it converts to a sparse vector of terms for that file. Seqdirectory reads the input directory in the same order as "ls", so you could use that to map the doc_id back to the document.

Thanks for the great article and information. On the doc_id mapping, I am guessing a vectordump on matrix/docIndex may serve the purpose. Also, I had to use '-c csv' to retrieve the correct doc/topic information on 0.9. Sam

Thanks Sam.

Thanks Sujit for this post, it helped a lot. Can you please show me how to print each topic's words to a file, and how to find which topic each document belongs to? Thanks in advance.

I am guessing you are asking about the mechanics of writing the topic words into a file rather than to STDOUT, right? In that case, in the code block just above the table containing topic words, you would replace the two print statements at the end of the loop with fout.write() calls, open fout near the top of the script, and close it at the end. For finding the documents in a topic, you use the other file, p_topic_doc.txt. Something like this:

fout = open("/path/to/output.txt", 'wb')
...
toptermProbs = sorted(termProbs.iteritems(), key=operator.itemgetter(1), reverse=True)
fout.write("%s\t%s\n" % (topic, "\t".join([x[0] for x in toptermProbs])))
...
fout.close()

Yes, that is what I meant. I tried this before posting the question, but I kept reading from a different file and it always caused a "Cannot convert String to Float" error; sorry for my naive question, and thank you very much, it helped me a lot. What do you suggest for visualizing the topics, docs and associated words? My respects.

Thank you, and really no need to apologize. Regarding visualization, how about tag clouds for each topic, where each word is sized based on its probability within the topic? For the per-document topic distributions I can't think of anything fancier than a simple histogram (similar to the figure in the post).

"How about tag clouds for each topic where each word is sized based on its probability within the topic?" Sounds good. Can you please suggest a library to implement this in Python? I am almost a novice in Python. Also, how can I subscribe to this blog? The RSS button did not work. Many thanks.

I used a nice library called wordcloud only last week (code and output here), maybe that can help as an example. Thanks for pointing out the broken RSS/Atom feed links; I have replaced the widget with Blogspot's follow-by-mail widget - you should see "Subscribe" followed by a text prompt to enter your email.

Hi, I am using the cvb algorithm of Mahout (0.9) for topic modelling. The code I used is here: http://stackoverflow.com/questions/26340247/interpreting-mahout0-9-cvb-output. But the output for the document-to-topic mapping is not coming out in a proper format. Please guide me on the correct way.

Hi adismart, going by the output on the SO page, it looks like perhaps Mahout 0.9 has a cosmetic bug in printing the reports where lines are not terminated (or maybe it was there in 0.8 and I just fixed it myself without too much thought). You can easily fix it with the following Unix command:

sed -e 's/} /}\n/g' doc_topic.txt > fixed_doc_topic.txt

The resulting document looks something like this:

0 {2d:0.0199,3d:0.0199,...}
1 {2d:0.0200,3d:0.0199,...}
...
4 {2d:0.0200,3d:0.0199,...}

It means that for doc #0, the topic "2d" has a probability of 0.0199, "3d" has a probability of 0.0199, and so on.

Thanks Sujit for the prompt response. I was not able to verify whether the output is correct, because topic names like "2d", "3d", etc. were coming out in the output. I tried removing stop words and even did stemming, but the output (doc-topic) remains the same. I also tried working with bigrams and trigrams, and the document-topic mapping still didn't show much improvement. Am I missing some step that is causing the problem? Please guide. Thanks & Regards, Aditya

The other problem I see in the output is that only the terms starting with "a" are getting printed in the document-topic mapping. It seems I missed something in the code?

Yes, I didn't notice initially that the topics all start with either numbers or "a", very likely because of this...

String[] rowIDArgs = {"--input", inputVectorDir + "/tf-vectors/part-r-00000", "--output", rowIDOutFile};

i.e., you are using a subset of the tf-vectors (these are sorted, so it probably stops at "a").

Thanks Sujit. I just looked at the directory for topic vectors. It contains only two files: part-r-00000 and .part-r-00000.crc. Thanks, Aditya

Bummer :-). Well, I can't think of anything that would cause this sort of behavior... in my case things worked mostly out of the box. I noticed you posted your problem to the Mahout mailing list; I think you may get a better response there, since people there are more in touch with Mahout.

I am also surprised to see this behavior. I checked most of the forums and places, and everywhere they use similar code to get the desired output, while I am still struggling to find a solution to this problem. Anyway, thanks for your suggestions. Thanks, Aditya

Thanks for the nice article Sujit. Under what conditions would you use LingPipe vs Apache Mahout? My use case is that I have crawled data using Nutch and would like to create a taxonomy / categorization of the data. I am not very clear on which one is the right choice to create the taxonomy.

Thanks Vinay. I think the choice would primarily be based on data volume: Mahout for large volumes, and LingPipe otherwise (although AFAIK LingPipe doesn't have anything for topic modeling; there is another toolkit called Mallet which does). In your case, you may want to try Mallet against your data first, and if it fails, try Mahout.

Hi Sujit, is there a way to give meaningful names to the topics created? I tried naming each topic with its top 5 highest-probability terms, but the name doesn't correctly represent the group. Regards, Aditya

Hi Aditya, I don't think it's possible to do that automatically. Usually you look at the terms in the topic and come up with some representative topic name.

Hi Sujit, I am trying to run the LDA algorithm in Mahout, but I keep getting the following error:

Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected

I thought it was something to do with JAR files and compatibility, so I added the JARs in the trunk directory, but it still throws the same error. I don't know in which path I should place the JAR files. Kindly help.

Hi Nikitha, I found a Stack Overflow page by copy-pasting the error message into Google. It says that this is because of a change in API between Hadoop 1.x and Hadoop 2.x, and it also lists some things you should try, which should fix it.

Thank you so much for replying :) I am trying to do LDA on the Reuters dataset, but when I try to plot the output with matplotlib I get an error in the split line:

hduser@ubuntu:/usr/local/Mahoutold$ python ldaplot.py
Traceback (most recent call last):
  File "ldaplot.py", line 10, in <module>
    docid, probs = line.split("\t")
ValueError: need more than 1 value to unpack

Can you suggest something please?

This is probably because the line it is seeing does not have two values separated by a tab. You can try to see why by printing the line immediately before the split. Most likely it is a comment line or something that the previous if conditions are not catching.