I spent most of last week trying out various approaches to extract "interesting" phrases from a collection of articles. The objective was to identify candidate concepts that could be added to our taxonomy. There are various approaches, ranging from simple NGram frequencies, to algorithms such as RAKE (Rapid Automatic Keyword Extraction), to rescoring NGrams using Log Likelihood or Chi-squared measures. In this post, I describe how I used Mahout's CollocDriver (which uses the Log Likelihood measure) to find interesting phrases from a small corpus of about 200 articles.
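Before diving in, a quick note on what the Log Likelihood measure actually computes: it is Dunning's log likelihood ratio, which scores an n-gram by how much more often its words occur together than chance alone would predict, based on a 2x2 contingency table of counts. Here is a toy sketch of the bigram calculation (my own illustration with made-up counts, not Mahout's code, although Mahout computes essentially this statistic during its LLR step):

// Toy implementation of Dunning's log likelihood ratio (LLR) for a bigram "A B".
// k11 = count(A followed by B), k12 = count(A followed by something else),
// k21 = count(something else followed by B), k22 = count of all other bigrams.
// All counts below are made up.
object LlrExample extends App {

  def xLogX(x: Long): Double =
    if (x == 0L) 0.0 else x * math.log(x.toDouble)

  // unnormalized entropy-style term: xLogX(total) - sum of xLogX(count)
  def entropy(counts: Long*): Double =
    xLogX(counts.sum) - counts.map(xLogX).sum

  def llr(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
    val rowEntropy = entropy(k11 + k12, k21 + k22)
    val colEntropy = entropy(k11 + k21, k12 + k22)
    val matEntropy = entropy(k11, k12, k21, k22)
    // clamp tiny negative values caused by floating point error
    math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy))
  }

  // hypothetical counts for a bigram that co-occurs far more often than chance
  println(llr(110, 2442, 535, 572464))
}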
The articles were in various formats (PDF, DOC, HTML), and I used Apache Tika to parse them into text (yes, I finally found the opportunity to learn Tika :-)). Tika provides parsers for many common formats, so all we had to do was hook them up to produce text from the various file formats. Here is my code:
// Source: src/main/scala/com/mycompany/mia/preprocess/TextExtractor.scala
package com.mycompany.mia.preprocess

import java.io.{File, FileInputStream, FileWriter, InputStream, PrintWriter}

import scala.collection.JavaConversions.asScalaIterator

import org.apache.commons.io.{FileUtils, FilenameUtils, IOUtils}
import org.apache.commons.io.filefilter.{DirectoryFileFilter, WildcardFileFilter}
import org.apache.tika.exception.TikaException
import org.apache.tika.metadata.Metadata
import org.apache.tika.parser.{AutoDetectParser, ParseContext, Parser}
import org.apache.tika.parser.audio.AudioParser
import org.apache.tika.parser.html.HtmlParser
import org.apache.tika.parser.image.ImageParser
import org.apache.tika.parser.microsoft.OfficeParser
import org.apache.tika.parser.opendocument.OpenOfficeParser
import org.apache.tika.parser.pdf.PDFParser
import org.apache.tika.parser.rtf.RTFParser
import org.apache.tika.parser.txt.TXTParser
import org.apache.tika.parser.xml.XMLParser
import org.apache.tika.sax.WriteOutContentHandler

object TextExtractor extends App {
  val extractor = new TextExtractor()
  val idir = new File("/path/to/raw/files")
  val odir = new File("/path/to/text/files")
  extractor.extractDirToFiles(idir, odir, null)
}

class TextExtractor {

  object FileType extends Enumeration {
    type FileType = Value
    val Text, Html, Xml, Pdf, Rtf, OOText,
        MsExcel, MsWord, MsPowerpoint, MsOutlook, Visio,
        Png, Jpeg, Mp3, Undef = Value
  }

  object DocPart extends Enumeration {
    type DocPart = Value
    val Title, Author, Body, Error = Value
  }

  val parsers = Map[FileType.Value,Parser](
    (FileType.Text, new TXTParser()),
    (FileType.Html, new HtmlParser()),
    (FileType.Xml, new XMLParser()),
    (FileType.Pdf, new PDFParser()),
    (FileType.Rtf, new RTFParser()),
    (FileType.OOText, new OpenOfficeParser()),
    (FileType.MsExcel, new OfficeParser()),
    (FileType.MsWord, new OfficeParser()),
    (FileType.MsPowerpoint, new OfficeParser()),
    (FileType.MsOutlook, new OfficeParser()),
    (FileType.Visio, new OfficeParser()),
    (FileType.Png, new ImageParser()),
    (FileType.Jpeg, new ImageParser()),
    (FileType.Mp3, new AudioParser()),
    (FileType.Undef, new AutoDetectParser())
  )

  /** Extract single file into map of name-value pairs */
  def extract(file: File): Map[DocPart.Value,String] = {
    var istream: InputStream = null
    try {
      istream = new FileInputStream(file)
      val handler = new WriteOutContentHandler(-1)
      val metadata = new Metadata()
      val parser = parsers(detectFileType(file))
      val ctx = new ParseContext()
      parser.parse(istream, handler, metadata, ctx)
      Map[DocPart.Value,String](
        (DocPart.Author, metadata.get(Metadata.CREATOR)),
        (DocPart.Title, metadata.get(Metadata.TITLE)),
        (DocPart.Body, handler.toString))
    } catch {
      case e: TikaException => Map[DocPart.Value,String](
        (DocPart.Error, e.getMessage()))
    } finally {
      IOUtils.closeQuietly(istream)
    }
  }

  /** Detect FileType based on file name suffix */
  def detectFileType(file: File): FileType.Value = {
    val suffix = FilenameUtils.getExtension(file.getName()).
      toLowerCase()
    suffix match {
      case "text" | "txt" => FileType.Text
      case "html" | "htm" => FileType.Html
      case "xml" => FileType.Xml
      case "pdf" => FileType.Pdf
      case "rtf" => FileType.Rtf
      case "odt" => FileType.OOText
      case "xls" | "xlsx" => FileType.MsExcel
      case "doc" | "docx" => FileType.MsWord
      case "ppt" | "pptx" => FileType.MsPowerpoint
      case "pst" => FileType.MsOutlook
      case "vsd" => FileType.Visio
      case "png" => FileType.Png
      case "jpg" | "jpeg" => FileType.Jpeg
      case "mp3" => FileType.Mp3
      case _ => FileType.Undef
    }
  }

  /** Extract all files in directory with specified file name
      pattern. Accepts a renderer function to convert name-value
      pairs into an output file (or files). */
  def extract(dir: File, pattern: String, odir: File,
      renderer: (File, File, Map[DocPart.Value,String]) => Unit):
      Unit = {
    val filefilter = pattern match {
      case null => new WildcardFileFilter("*.*")
      case _ => new WildcardFileFilter(pattern)
    }
    FileUtils.iterateFiles(dir, filefilter,
        DirectoryFileFilter.DIRECTORY).foreach(file => {
      Console.println("Parsing file: " + file.getName())
      val data = extract(file)
      renderer(file, odir, data)
    })
  }

  /** Convenience method to write out text extracted from a file
      into the specified directory as filename.txt */
  def extractDirToFiles(dir: File, odir: File, pattern: String): Unit = {
    def renderDirToFiles(file: File, odir: File,
        data: Map[DocPart.Value,String]): Unit = {
      val ofname = file.getName() + ".txt"
      val writer = new PrintWriter(new FileWriter(new File(odir, ofname)), true)
      writer.println(data(DocPart.Title))
      writer.println(data(DocPart.Author))
      writer.println(data(DocPart.Body))
      writer.flush()
      writer.close()
    }
    extract(dir, pattern, odir, renderDirToFiles)
  }
}
We then use the Mahout seqdirectory subcommand to convert the directory of text files into a set of sequence files. I found lots of useful information in this thread from the Mahout mailing list. Since I only had around 200 documents to work with, I just ran Mahout on my laptop with Hadoop in local mode.
sujit@localhost:mahout-distribution-0.8$ bin/mahout seqdirectory \
-i /path/to/colloc/text \
-o /path/to/colloc/text-seqdir \
-c UTF8 \
-chunk 5
We then run another Mahout subcommand, seq2sparse, which vectorizes the sequence files created by seqdirectory.
sujit@localhost:mahout-distribution-0.8$ bin/mahout seq2sparse \
-i /path/to/colloc/text-seqdir \
-o /path/to/colloc/text-seqdir-sparse
This produces the directory structure shown below. Of the data produced, we use only the files in the tokenized-documents subdirectory for the following steps.
text-seqdir-sparse/
+-- df-count
| +-- _SUCCESS
| +-- part-r-00000
+-- dictionary.file-0
+-- frequency.file-0
+-- tf-vectors
| +-- _SUCCESS
| +-- part-r-00000
+-- tfidf-vectors
| +-- _SUCCESS
| +-- part-r-00000
+-- tokenized-documents
| +-- _SUCCESS
| +-- part-m-00000
+-- wordcount
+-- _SUCCESS
+-- part-r-00000
We are now ready to run the CollocDriver. The CollocDriver uses Lucene's ShingleFilter to produce the n-grams, and a specified (Lucene) analyzer to tokenize the text in tokenized-documents. We specify the DefaultAnalyzer below, which is a Mahout subclass of Lucene's StandardAnalyzer with a no-arg constructor (hence suitable for specifying on the command line). We only compute bigrams (n=2) and trigrams (n=3), one run for each.
sujit@localhost:mahout-distribution-0.8$ bin/mahout \
org.apache.mahout.vectorizer.collocations.llr.CollocDriver \
-i /path/to/colloc/text-seqdir-sparse/tokenized-documents \
-o /path/to/colloc/text-bigrams \
-ng 2 \
-a org.apache.mahout.vectorizer.DefaultAnalyzer \
-u
sujit@localhost:mahout-distribution-0.8$ bin/mahout \
org.apache.mahout.vectorizer.collocations.llr.CollocDriver \
-i /path/to/colloc/text-seqdir-sparse/tokenized-documents \
-o /path/to/colloc/text-trigrams \
-ng 3 \
-a org.apache.mahout.vectorizer.DefaultAnalyzer \
-u
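As an aside, if you want to see what the shingling step produces outside of Mahout, here is a small standalone sketch that runs a Lucene StandardAnalyzer token stream through a ShingleFilter to emit bigrams. This is only an illustration, not the CollocDriver's actual code; the sample text and field name are made up, and it assumes the Lucene 4.x jars bundled with Mahout 0.8 are on the classpath.

// Illustration only: tokenize a sample string and emit 2-word shingles,
// similar in spirit to what the CollocDriver does before counting n-grams.
import java.io.StringReader

import org.apache.lucene.analysis.shingle.ShingleFilter
import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute
import org.apache.lucene.util.Version

object ShingleDemo extends App {
  val analyzer = new StandardAnalyzer(Version.LUCENE_43)
  val text = "rapid automatic keyword extraction"  // made-up sample text
  val stream = analyzer.tokenStream("body", new StringReader(text))
  val shingles = new ShingleFilter(stream, 2, 2)   // bigrams only
  shingles.setOutputUnigrams(false)
  val term = shingles.addAttribute(classOf[CharTermAttribute])
  shingles.reset()
  while (shingles.incrementToken()) println(term.toString)
  shingles.end()
  shingles.close()
}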
This creates two subdirectories, ngrams and subgrams, under each output directory. We will only use the ngrams subdirectory for our next steps.
text-bigrams
+-- ngrams
| +-- _SUCCESS
| +-- part-r-00000
+-- subgrams
+-- _SUCCESS
+-- part-r-00000
To get the data in human-readable form, we dump the sequence files in the ngrams subdirectory using Mahout's seqdumper subcommand. The key in these files is the bigram or trigram (space separated) and the value is the Log Likelihood ratio. There are good results in here, but also lots of noise. The process yields about 90K significant bigrams and 213K significant trigrams.
sujit@localhost:mahout-distribution-0.8$ bin/mahout seqdumper \
-i /path/to/colloc/text-bigrams/ngrams
Running on hadoop, using /opt/hadoop/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /opt/mahout-distribution-0.8/mahout-examples-0.8-job.jar
Input Path: file:/path/to/colloc/text-bigrams/ngrams/part-r-00000
Key class: class org.apache.hadoop.io.Text \
Value Class: class org.apache.hadoop.io.DoubleWritable
Key: zelboraf combo: Value: 48.62653920799494
...
sujit@localhost:mahout-distribution-0.8$ bin/mahout seqdumper \
-i /path/to/colloc/text-trigrams/ngrams
Running on hadoop, using /opt/hadoop/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /opt/mahout-distribution-0.8/mahout-examples-0.8-job.jar
Input Path: file:/path/to/colloc/text-trigrams/ngrams/part-r-00000
Key class: class org.apache.hadoop.io.Text \
Value Class: class org.apache.hadoop.io.DoubleWritable
Key: zelboraf lebrikizumab dc: Value: 33.451520167291164
...
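If you would rather filter these programmatically than eyeball the full seqdumper output, the ngrams sequence files can also be read directly with the Hadoop API; as the dump shows, the keys are Text and the values DoubleWritable. Here is a rough sketch (the part file path and LLR cutoff are placeholders you would adjust):

// Sketch: read the CollocDriver's ngrams output and keep only n-grams
// whose log likelihood ratio clears a hand-picked threshold.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{DoubleWritable, SequenceFile, Text}

object NgramFilter extends App {
  val conf = new Configuration()
  val fs = FileSystem.get(conf)
  val path = new Path("/path/to/colloc/text-bigrams/ngrams/part-r-00000")
  val minLlr = 100.0  // arbitrary cutoff, tune by inspecting the dump

  val reader = new SequenceFile.Reader(fs, path, conf)
  val key = new Text()
  val value = new DoubleWritable()
  try {
    while (reader.next(key, value)) {
      if (value.get() >= minLlr) {
        println("%s\t%.4f".format(key.toString, value.get()))
      }
    }
  } finally {
    reader.close()
  }
}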
In an attempt to get better phrases, I used OpenNLP to extract only the noun phrase chunks from the input sentences. The code below parses the directory of text files and writes out the noun phrases, one per line, each terminated by a period (to signal "sentence boundaries" to the Mahout tokenizer).
// Source: src/main/scala/com/mycompany/mia/preprocess/Tokenizer.scala
package com.mycompany.mia.preprocess

import java.io.{File, FileInputStream, FileWriter, InputStream, PrintWriter}

import scala.Array.canBuildFrom
import scala.io.Source

import org.apache.commons.io.IOUtils

import opennlp.tools.chunker.{ChunkerME, ChunkerModel}
import opennlp.tools.postag.{POSModel, POSTaggerME}
import opennlp.tools.sentdetect.{SentenceDetectorME, SentenceModel}
import opennlp.tools.tokenize.{TokenizerME, TokenizerModel}

object Tokenizer extends App {
  val tokenizer = new Tokenizer()
  val idir = new File("/path/to/colloc/text")
  val writer = new PrintWriter(new FileWriter(
    new File("/path/to/colloc/text-nphrases/phrases.txt")), true)
  idir.listFiles().foreach(file => {
    Source.fromFile(file).getLines().foreach(line => {
      val sentences = tokenizer.sentTokenize(line)
      sentences.foreach(sentence => {
        tokenizer.phraseChunk(sentence)
          .filter(phrase => "NP".equals(phrase._2))
          .foreach(phrase => writer.println(phrase._1 + "."))
      })
    })
  })
  writer.flush()
  writer.close()
}

class Tokenizer {

  val ModelDir = "/path/to/opennlp/models"

  val sentenceDetectorFn = (model: SentenceModel) =>
    new SentenceDetectorME(model)
  val sentenceDetector = sentenceDetectorFn({
    var smis: InputStream = null
    try {
      smis = new FileInputStream(new File(ModelDir, "en_sent.bin"))
      val model = new SentenceModel(smis)
      model
    } finally {
      IOUtils.closeQuietly(smis)
    }
  })

  val tokenizerFn = (model: TokenizerModel) =>
    new TokenizerME(model)
  val tokenizer = tokenizerFn({
    var tmis: InputStream = null
    try {
      tmis = new FileInputStream(new File(ModelDir, "en_token.bin"))
      val model = new TokenizerModel(tmis)
      model
    } finally {
      IOUtils.closeQuietly(tmis)
    }
  })

  val posTaggerFn = (model: POSModel) =>
    new POSTaggerME(model)
  val posTagger = posTaggerFn({
    var pmis: InputStream = null
    try {
      pmis = new FileInputStream(new File(ModelDir, "en_pos_maxent.bin"))
      val model = new POSModel(pmis)
      model
    } finally {
      IOUtils.closeQuietly(pmis)
    }
  })

  val chunkerFn = (model: ChunkerModel) =>
    new ChunkerME(model)
  val chunker = chunkerFn({
    var cmis: InputStream = null
    try {
      cmis = new FileInputStream(new File(ModelDir, "en_chunker.bin"))
      val model = new ChunkerModel(cmis)
      model
    } finally {
      IOUtils.closeQuietly(cmis)
    }
  })

  /** Split a paragraph into sentences. */
  def sentTokenize(para: String): List[String] = {
    sentenceDetector.sentDetect(para).toList
  }

  /** Split a sentence into word tokens. */
  def wordTokenize(sentence: String): List[String] = {
    tokenizer.tokenize(sentence).toList
  }

  /** Return (token, POS tag) pairs for a sentence. */
  def posTag(sentence: String): List[(String,String)] = {
    val tokenSpans = tokenizer.tokenizePos(sentence)
    val tokens = tokenSpans.map(span =>
      span.getCoveredText(sentence).toString())
    val tags = posTagger.tag(tokens)
    tokens.zip(tags).toList
  }

  /** Return (chunk text, chunk type) pairs for a sentence. */
  def phraseChunk(sentence: String): List[(String,String)] = {
    val tokenSpans = tokenizer.tokenizePos(sentence)
    val tokens = tokenSpans.map(span =>
      span.getCoveredText(sentence).toString())
    val tags = posTagger.tag(tokens)
    chunker.chunkAsSpans(tokens, tags).map(chunk => {
      val start = tokenSpans(chunk.getStart()).getStart()
      val end = tokenSpans(chunk.getEnd() - 1).getEnd()
      (sentence.substring(start, end), chunk.getType())
    }).toList
  }
}
Running the pipeline against the /path/to/colloc/text-nphrases directory instead of the /path/to/colloc/text directory cuts the number of significant bigrams and trigrams down to about 76K and 184K respectively. The resulting quality is better, but still not good enough to be used without manual filtering. However, it is still useful, because going through the file of keywords and deleting the ones that don't make sense is easier than reading all the documents and trying to find phrases.
Hello Sujit,
I tried several algorithms (RAKE, PMI, N-Grams, Maximum Entropy, etc.) for concept/theme extraction from document texts and found this decent paper from Stanford, which gave reasonably good results although the algorithm itself is pretty basic.
http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=85057D4ADAAD516A5F763D7EC94F5B66?doi=10.1.1.173.5881&rep=rep1&type=pdf
Let me know your thoughts.
Thanks
Ravi Kiran Bhaskar
Thanks Ravi, this looks very interesting. Definitely something worth trying out.
Let me know your thoughts, as everybody understands an algo differently. I just want to corroborate my understanding ;-)
Hi Ravi, it looks interesting - similar to RAKE with the rules, but it uses ngrams (which almost every other approach uses). I was going to try implementing it and benchmarking it against some job description data I had (from the Adzuna challenge on Kaggle) to see how it compares with some of the other approaches.
I used it to normalize concepts extracted via POS tagging, since I did not want extraneous words. For example, if a doc contains several variations of a concept - computer security, computer vulnerability, hacking, computer system vulnerability, etc. - I use this to normalize them and find the base concept.
My goal is more modest: it's to help human taxonomy editors discover new concepts in a corpus of text. Do you normalize synonyms automatically? If you do, I would appreciate some pointers.
Not really synonym normalization. This is how I did it. For example, if I take the document http://www.washingtonpost.com/blogs/the-fix/wp/2013/10/24/john-boehners-next-big-test-immigration-reform/
Just taking the (NN || NNS) and (NNP || NNPS) tokens would give the following candidates (also note that there are some editorial mistakes in them as well :-) which will get discarded):
[remarks, morning, call, immigration reform, President Obama, Thursday, Congress, sentiment, year, year.But, weeks, Washington, necessity, immigration.House, House Speaker John A. Boehner, Speaker John Boehner House Speaker John Boehner, defeat, budget, debt ceiling showdown, party, policy concessions, immigration legislation, time, calendar, Obama, White House, Fresh, GOP, Democrats, Boehner, Wednesday, subject, question, speaker, hopefulness, matter, cast-iron, members, reforms, outlook, Months, immigration, bill, possibility, bet, House Republicans, Senate, times, position, standoff, brand, way, things, ways, damage, government shutdown showdown, part, need, risk, voters, skepticism, independents, Republicans, imperative, policy imperative, vote, reform, type, majority, minority, week, deal, government shutdown, debt ceiling, House Democrats, path, support, back, wall, pressure, Senate Democrats, example, plan, conservatives, thing, flank, conference, fight, hard-liners, bit, list, administration, insurance, marketplaces, penalties, state Democrats, delays, deadline, allegation, meeting, Americans, Obamacare, Sen. Dick Durbin, GOP House Leader, comment, Durbin, end, Steve Daines, senator, mega-donor, weapon, desert, message, governor, run, office, Cory Booker, Oct. 31.GOP, Sheldon Adelson, Iran.Former Arkansas, Mike Huckabee, Sen. Mike Enzi, Liz Cheney .Will Eliot Spitzer]
Now if you take the frequency of the most-occurring words within the candidates and then "percentage threshold" the phrases (I use 90%), you get the top results as follows:
{theme=[GOP House Leader, Boehner, House Speaker John A. Boehner, House Republicans, White House, House Democrats, Speaker John Boehner House Speaker John Boehner]}
However, you still see that Boehner is repeated multiple times, which can cause trouble, so we normalize using the Stanford algorithm I was mentioning, which ends up giving:
{theme=[GOP House Leader, House Republicans, House Speaker John A. Boehner, White House, House Democrats, Speaker John Boehner House Speaker John Boehner]}
Which is way saner and looks more appropriate for the doc.
Hope I made sense :-)
I see, thank you for the detailed explanation. We do something similar (but a cruder version of what you are doing) during our concept mapping process - we restrict analysis to noun phrases (although that part is now disabled because of concerns about throwing the baby out with the bathwater) and we post-process the concepts so that the longest match is considered. Thanks to your explanation, I think I now understand the Stanford/Kosmix algo better - there was a mention of using POS tagging, but the example was with query logs, so I missed that part on my first pass.