Tuesday, October 15, 2013

Entity Discovery using Mahout CollocDriver


I spent most of last week trying out various approaches to extract "interesting" phrases from a collection of articles. The objective was to identify candidate concepts that could be added to our taxonomy. There are various approaches, ranging from simple NGram frequencies, to algorithms such as RAKE (Rapid Automatic Keyword Extraction), to rescoring NGrams using Log Likelihood or Chi-squared measures. In this post, I describe how I used Mahout's CollocDriver (which uses the Log Likelihood measure) to find interesting phrases from a small corpus of about 200 articles.
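
For reference, the Log Likelihood measure scores how much more often two words appear together than chance alone would predict. Below is a minimal Scala sketch of Dunning's log-likelihood ratio for a bigram, which is essentially the statistic the CollocDriver computes; the counts in the example are made up purely for illustration.

// Sketch: Dunning's log-likelihood ratio for a bigram "A B", built from
// the 2x2 contingency table of counts:
//   k11 = count(A followed by B)       k12 = count(A followed by not-B)
//   k21 = count(not-A followed by B)   k22 = count(everything else)
// Higher scores indicate a stronger-than-chance collocation.
object LlrSketch extends App {

  def xLogX(x: Long): Double = if (x == 0L) 0.0 else x * math.log(x)

  // unnormalized entropy term used in Dunning's derivation
  def entropy(ks: Long*): Double = xLogX(ks.sum) - ks.map(xLogX).sum

  def llr(k11: Long, k12: Long, k21: Long, k22: Long): Double = {
    val rowEntropy = entropy(k11 + k12, k21 + k22)
    val colEntropy = entropy(k11 + k21, k12 + k22)
    val matEntropy = entropy(k11, k12, k21, k22)
    math.max(0.0, 2.0 * (rowEntropy + colEntropy - matEntropy))
  }

  // hypothetical counts for a small corpus
  println(llr(20, 30, 50, 100000))      // strongly associated pair: large LLR
  println(llr(200, 4800, 3800, 91200))  // co-occurs exactly as chance predicts: LLR ~ 0
}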

The articles were in various formats (PDF, DOC, HTML), and I used Apache Tika to parse them into text (yes, I finally found the opportunity to learn Tika :-)). Tika provides parsers for many common formats, so all we had to do was hook them up to produce text from the various file formats. Here is my code:

// Source: src/main/scala/com/mycompany/mia/preprocess/TextExtractor.scala
package com.mycompany.mia.preprocess

import java.io.{File, FileInputStream, FileWriter, InputStream, PrintWriter}

import scala.collection.JavaConversions.asScalaIterator

import org.apache.commons.io.{FileUtils, FilenameUtils, IOUtils}
import org.apache.commons.io.filefilter.{DirectoryFileFilter, WildcardFileFilter}
import org.apache.tika.exception.TikaException
import org.apache.tika.metadata.Metadata
import org.apache.tika.parser.{AutoDetectParser, ParseContext, Parser}
import org.apache.tika.parser.audio.AudioParser
import org.apache.tika.parser.html.HtmlParser
import org.apache.tika.parser.image.ImageParser
import org.apache.tika.parser.microsoft.OfficeParser
import org.apache.tika.parser.opendocument.OpenOfficeParser
import org.apache.tika.parser.pdf.PDFParser
import org.apache.tika.parser.rtf.RTFParser
import org.apache.tika.parser.txt.TXTParser
import org.apache.tika.parser.xml.XMLParser
import org.apache.tika.sax.WriteOutContentHandler

object TextExtractor extends App {
  val extractor = new TextExtractor()
  val idir = new File("/path/to/raw/files")
  val odir = new File("/path/to/text/files")
  extractor.extractDirToFiles(idir, odir, null)
}

class TextExtractor {

  object FileType extends Enumeration {
    type FileType = Value
    val Text, Html, Xml, Pdf, Rtf, OOText,
    MsExcel, MsWord, MsPowerpoint, MsOutlook, Visio,
    Png, Jpeg, Mp3, Undef = Value
  }
  object DocPart extends Enumeration {
    type DocPart = Value
    val Title, Author, Body, Error = Value
  }
  
  val parsers = Map[FileType.Value,Parser](
    (FileType.Text, new TXTParser()),
    (FileType.Html, new HtmlParser()),
    (FileType.Xml, new XMLParser()),
    (FileType.Pdf, new PDFParser()),
    (FileType.Rtf, new RTFParser()),
    (FileType.OOText, new OpenOfficeParser()),
    (FileType.MsExcel, new OfficeParser()),
    (FileType.MsWord, new OfficeParser()),
    (FileType.MsPowerpoint, new OfficeParser()),
    (FileType.MsOutlook, new OfficeParser()),
    (FileType.Visio, new OfficeParser()),
    (FileType.Png, new ImageParser()),
    (FileType.Jpeg, new ImageParser()),
    (FileType.Mp3, new AudioParser()),
    (FileType.Undef, new AutoDetectParser())
  )

  /** Extract single file into map of name value pairs */
  def extract(file: File): Map[DocPart.Value,String] = {
    var istream: InputStream = null
    try {
      istream = new FileInputStream(file)
      val handler = new WriteOutContentHandler(-1)
      val metadata = new Metadata()
      val parser = parsers(detectFileType(file))
      val ctx = new ParseContext()
      parser.parse(istream, handler, metadata, ctx)
      Map[DocPart.Value,String](
        (DocPart.Author, metadata.get(Metadata.CREATOR)),
        (DocPart.Title, metadata.get(Metadata.TITLE)),
        (DocPart.Body, handler.toString))
    } catch {
      case e: TikaException => Map[DocPart.Value,String](
        (DocPart.Error, e.getMessage()))      
    } finally {
      IOUtils.closeQuietly(istream)
    }
  }
  
  /** Detect FileType based on file name suffix */
  def detectFileType(file: File): FileType.Value = {
    val suffix = FilenameUtils.getExtension(file.getName()).
      toLowerCase()
    suffix match {
      case "text" | "txt" => FileType.Text
      case "html" | "htm" => FileType.Html
      case "xml"          => FileType.Xml
      case "pdf"          => FileType.Pdf
      case "rtf"          => FileType.Rtf
      case "odt"          => FileType.OOText
      case "xls" | "xlsx" => FileType.MsExcel
      case "doc" | "docx" => FileType.MsWord
      case "ppt" | "pptx" => FileType.MsPowerpoint
      case "pst"          => FileType.MsOutlook
      case "vsd"          => FileType.Visio
      case "png"          => FileType.Png
      case "jpg" | "jpeg" => FileType.Jpeg
      case "mp3"          => FileType.Mp3
      case _              => FileType.Undef
    }
  }
  
  /** Extract all files in directory with specified file name
      pattern. Accepts a renderer function to convert name-
      value pairs into an output file (or files).*/
  def extract(dir: File, pattern: String, odir: File,
      renderer: (File, File, Map[DocPart.Value,String]) => Unit): 
      Unit = {
    val filefilter = pattern match {
      case null => new WildcardFileFilter("*.*")
      case _ => new WildcardFileFilter(pattern)
    }
    FileUtils.iterateFiles(dir, filefilter, 
        DirectoryFileFilter.DIRECTORY).foreach(file => {
      Console.println("Parsing file: " + file.getName())
      val data = extract(file)
      renderer(file, odir, data)
    })
  }

  /** Convenience method to write out text extracted from a file
      into the specified directory as filename.txt */
  def extractDirToFiles(dir: File, odir: File, pattern: String): Unit = {
    def renderDirToFiles(file: File, odir: File, 
        data: Map[DocPart.Value,String]): Unit = {
      val ofname = file.getName() + ".txt"
      val writer = new PrintWriter(new FileWriter(new File(odir, ofname)), true)
      // use getOrElse so a parse failure (only DocPart.Error present)
      // does not blow up with a NoSuchElementException
      writer.println(data.getOrElse(DocPart.Title, ""))
      writer.println(data.getOrElse(DocPart.Author, ""))
      writer.println(data.getOrElse(DocPart.Body, ""))
      writer.flush()
      writer.close()
    } 
    extract(dir, pattern, odir, renderDirToFiles)
  }
}

We then use Mahout subcommands to convert the directory of text files to a set of sequence files. I found lots of useful information in this thread from the Mahout Mailing list. Since I only had around 200 documents to work with, I just ran Mahout on my laptop with Hadoop in local mode.

sujit@localhost:mahout-distribution-0.8$ bin/mahout seqdirectory \
    -i /path/to/colloc/text \
    -o /path/to/colloc/text-seqdir \
    -c UTF8 \
    -chunk 5

We then run another Mahout subcommand seq2sparse, which vectorizes the sequence file created by seqdirectory.

sujit@localhost:mahout-distribution-0.8$ bin/mahout seq2sparse \
    -i /path/to/colloc/text-seqdir \
    -o /path/to/colloc/text-seqdir-sparse

This produces the directory structure shown below. Of the data produced, we use only the files in the tokenized-documents subdirectory for the following steps.

text-seqdir-sparse/
+-- df-count
|   +-- _SUCCESS
|   +-- part-r-00000
+-- dictionary.file-0
+-- frequency.file-0
+-- tf-vectors
|   +-- _SUCCESS
|   +-- part-r-00000
+-- tfidf-vectors
|   +-- _SUCCESS
|   +-- part-r-00000
+-- tokenized-documents
|   +-- _SUCCESS
|   +-- part-m-00000
+-- wordcount
    +-- _SUCCESS
    +-- part-r-00000

We are now ready to run the CollocDriver. The CollocDriver uses Lucene's ShingleFilter to produce n-grams, and a specified (Lucene) analyzer to tokenize the text in tokenized-documents. We specify the DefaultAnalyzer below, a thin Mahout wrapper around Lucene's StandardAnalyzer with a no-arg constructor (hence suitable for specifying on the command line). We compute only bigrams (n=2) and trigrams (n=3).

sujit@localhost:mahout-distribution-0.8$ bin/mahout \
    org.apache.mahout.vectorizer.collocations.llr.CollocDriver \
    -i /path/to/colloc/text-seqdir-sparse/tokenized-documents \
    -o /path/to/colloc/text-bigrams \
    -ng 2 \
    -a org.apache.mahout.vectorizer.DefaultAnalyzer \
    -u
sujit@localhost:mahout-distribution-0.8$ bin/mahout \
    org.apache.mahout.vectorizer.collocations.llr.CollocDriver \
    -i /path/to/colloc/text-seqdir-sparse/tokenized-documents \
    -o /path/to/colloc/text-trigrams \
    -ng 3 \
    -a org.apache.mahout.vectorizer.DefaultAnalyzer \
    -u

This creates two subdirectories, ngrams and subgrams, under each output directory. We will only use the ngrams subdirectory for our next steps.

text-bigrams
+-- ngrams
|   +-- _SUCCESS
|   +-- part-r-00000
+-- subgrams
    +-- _SUCCESS
    +-- part-r-00000

To get the data into human-readable form, we dump the sequence files in the ngrams subdirectory using Mahout's seqdumper subcommand. The key in these files is the bigram or trigram (space separated) and the value is the Log Likelihood ratio. There are good results in here, but there is also a lot of noise. The process yields about 90K significant bigrams and 213K significant trigrams.

sujit@localhost:mahout-distribution-0.8$ bin/mahout seqdumper \
  -i /path/to/colloc/text-bigrams/ngrams
Running on hadoop, using /opt/hadoop/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /opt/mahout-distribution-0.8/mahout-examples-0.8-job.jar
Input Path: file:/path/to/colloc/text-bigrams/ngrams/part-r-00000
Key class: class org.apache.hadoop.io.Text \
    Value Class: class org.apache.hadoop.io.DoubleWritable
Key: zelboraf combo: Value: 48.62653920799494
...

sujit@localhost:mahout-distribution-0.8$ bin/mahout seqdumper \
  -i /path/to/colloc/text-trigrams/ngrams
Running on hadoop, using /opt/hadoop/bin/hadoop and HADOOP_CONF_DIR=
MAHOUT-JOB: /opt/mahout-distribution-0.8/mahout-examples-0.8-job.jar
Input Path: file:/path/to/colloc/text-trigrams/ngrams/part-r-00000
Key class: class org.apache.hadoop.io.Text \
    Value Class: class org.apache.hadoop.io.DoubleWritable
Key: zelboraf lebrikizumab dc: Value: 33.451520167291164
...
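
Rather than eyeballing the raw seqdumper output, it is convenient to pull out the top-scoring collocations programmatically. Below is a sketch (not part of the original pipeline; the part-file path is an assumption) that reads the ngrams sequence file with the Hadoop API and prints the 50 highest-scoring bigrams.

// Sketch: read the ngrams sequence file (Text key = ngram,
// DoubleWritable value = LLR score) and print the top 50 by score.
// The part-file path below is an assumption; point it at your output.
package com.mycompany.mia.preprocess

import scala.collection.mutable.ArrayBuffer

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{DoubleWritable, SequenceFile, Text}

object TopNgrams extends App {
  val conf = new Configuration()
  val fs = FileSystem.get(conf)
  val path = new Path("/path/to/colloc/text-bigrams/ngrams/part-r-00000")
  val reader = new SequenceFile.Reader(fs, path, conf)
  val key = new Text()
  val value = new DoubleWritable()
  val scored = ArrayBuffer[(String,Double)]()
  while (reader.next(key, value)) {
    scored += ((key.toString, value.get))
  }
  reader.close()
  scored.sortBy(-_._2).take(50).foreach(kv =>
    println("%10.4f  %s".format(kv._2, kv._1)))
}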

In an attempt to get better phrases, I used OpenNLP to extract only the noun phrase chunks from the input sentences. The code below parses the directory of text files and writes out the noun phrases, one per line, each terminated with a period (to signal a "sentence boundary" to the Mahout tokenizer).

// Source: src/main/scala/com/mycompany/mia/preprocess/Tokenizer.scala
package com.mycompany.mia.preprocess

import java.io.{File, FileInputStream, FileWriter, InputStream, PrintWriter}

import scala.Array.canBuildFrom
import scala.io.Source

import org.apache.commons.io.IOUtils

import opennlp.tools.chunker.{ChunkerME, ChunkerModel}
import opennlp.tools.postag.{POSModel, POSTaggerME}
import opennlp.tools.sentdetect.{SentenceDetectorME, SentenceModel}
import opennlp.tools.tokenize.{TokenizerME, TokenizerModel}

object Tokenizer extends App {
  val tokenizer = new Tokenizer()
  val idir = new File("/path/to/colloc/text")
  val writer = new PrintWriter(new FileWriter(
    new File("/path/to/colloc/text-nphrases/phrases.txt")), true)
  idir.listFiles().foreach(file => {
    Source.fromFile(file).getLines().foreach(line => {
      val sentences = tokenizer.sentTokenize(line)
      sentences.foreach(sentence => {
        tokenizer.phraseChunk(sentence)
          .filter(phrase => "NP".equals(phrase._2))
          .foreach(phrase => writer.println(phrase._1 + "."))
      })
    })
  })
  writer.flush()
  writer.close()
}

class Tokenizer {

  val ModelDir = "/path/to/opennlp/models"

  val sentenceDetectorFn = (model: SentenceModel) =>
    new SentenceDetectorME(model)    
  val sentenceDetector = sentenceDetectorFn({
    var smis: InputStream = null
    try {
      smis = new FileInputStream(new File(ModelDir, "en_sent.bin"))
      val model = new SentenceModel(smis)
      model
    } finally {
      IOUtils.closeQuietly(smis)
    }   
  })
  val tokenizerFn = (model: TokenizerModel) => 
    new TokenizerME(model)
  val tokenizer = tokenizerFn({
    var tmis: InputStream = null
    try {
      tmis = new FileInputStream(new File(ModelDir, "en_token.bin"))
      val model = new TokenizerModel(tmis)
      model
    } finally {
      IOUtils.closeQuietly(tmis)
    }
  })
  val posTaggerFn = (model: POSModel) => 
    new POSTaggerME(model)
  val posTagger = posTaggerFn({
    var pmis: InputStream = null
    try {
      pmis = new FileInputStream(new File(ModelDir, "en_pos_maxent.bin"))
      val model = new POSModel(pmis)
      model
    } finally {
      IOUtils.closeQuietly(pmis)
    }
  })
  val chunkerFn = (model: ChunkerModel) => 
    new ChunkerME(model)
  val chunker = chunkerFn({
    var cmis: InputStream = null
    try {
      cmis = new FileInputStream(new File(ModelDir, "en_chunker.bin"))
      val model = new ChunkerModel(cmis)
      model
    } finally {
      IOUtils.closeQuietly(cmis)
    }
  })

  def sentTokenize(para: String): List[String] = {
    sentenceDetector.sentDetect(para).toList
  }
  
  def wordTokenize(sentence: String): List[String] = {
    return tokenizer.tokenize(sentence).toList
  }
  
  def posTag(sentence: String): List[(String,String)] = {
    val tokenSpans = tokenizer.tokenizePos(sentence)
    val tokens = tokenSpans.map(span => 
      span.getCoveredText(sentence).toString())
    val tags = posTagger.tag(tokens)
    tokens.zip(tags).toList
  }
  
  def phraseChunk(sentence: String): List[(String,String)] = {
    val tokenSpans = tokenizer.tokenizePos(sentence)
    val tokens = tokenSpans.map(span => 
      span.getCoveredText(sentence).toString())
    val tags = posTagger.tag(tokens)
    return chunker.chunkAsSpans(tokens, tags).map(chunk => {
      val start = tokenSpans(chunk.getStart()).getStart()
      val end = tokenSpans(chunk.getEnd() - 1).getEnd()
      (sentence.substring(start, end), chunk.getType())
    }).toList
  }
}

Running the pipeline against the /path/to/colloc/text-nphrases directory instead of the /path/to/colloc/text directory cuts the number of significant bigrams and trigrams down to about 76K and 184K respectively. The resulting quality is better, but still not good enough to be used without manual filtering. It is still useful, though, because going through the file of keywords and deleting the ones that don't make sense is easier than reading all the documents and trying to find phrases.
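
To make that manual pass a little easier, one option (not something this post's run did; the paths and the LLR cutoff are assumptions) is to merge the bigram and trigram output into a single tab-separated file, keeping only candidates above a minimum Log Likelihood score, and hand that file to the taxonomy editors.

// Sketch: merge bigram and trigram candidates above an LLR cutoff into
// one flat file for manual review. Paths and the cutoff are assumptions.
package com.mycompany.mia.preprocess

import java.io.{File, FileWriter, PrintWriter}

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.io.{DoubleWritable, SequenceFile, Text}

object CandidateMerger extends App {
  val minLlr = 100.0  // tune by inspecting the score distribution
  val conf = new Configuration()
  val fs = FileSystem.get(conf)
  val writer = new PrintWriter(new FileWriter(
    new File("/path/to/colloc/candidates.txt")), true)
  List("/path/to/colloc/text-bigrams/ngrams/part-r-00000",
       "/path/to/colloc/text-trigrams/ngrams/part-r-00000").foreach(p => {
    val reader = new SequenceFile.Reader(fs, new Path(p), conf)
    val key = new Text()
    val value = new DoubleWritable()
    while (reader.next(key, value)) {
      if (value.get >= minLlr) {
        writer.println("%s\t%.4f".format(key.toString, value.get))
      }
    }
    reader.close()
  })
  writer.flush()
  writer.close()
}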

8 comments (moderated to prevent spam):

Anonymous said...

Hello Sujit,
I tried several algorithms (RAKE, PMI, N-Grams, Maximum Entropy, etc.) for concept/theme extraction from document texts, and found this decent paper from Stanford which gave reasonably good results, although the algorithm itself is pretty basic.

http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=85057D4ADAAD516A5F763D7EC94F5B66?doi=10.1.1.173.5881&rep=rep1&type=pdf

Let me know your thoughts.

Thanks
Ravi Kiran Bhaskar

Sujit Pal said...

Thanks Ravi this looks very interesting. Definitely something worth trying out.

Anonymous said...

Let me know your thoughts as everybody understands an algo differently. I just want to corroborate my understanding ;-)

Sujit Pal said...

Hi Ravi, it looks interesting, similar to RAKE with the rules but uses ngrams (which almost every other approach uses). I was going to try implementing and benchmarking it against some job description data I had (from the Adzuna challenge on Kaggle) and see how it compares with some of the other approaches.

Anonymous said...

I used it to normalize concepts extracted via POS tagging, since I did not want extraneous words. For example, if a doc contains several variations of a concept - computer security, computer vulnerability, hacking, computer system vulnerability, etc. - I use this to normalize them and find the base concept.

Sujit Pal said...

My goal is more modest: it's to help human taxonomy editors discover new concepts in a corpus of text. Do you normalize synonyms automatically? If you do, I would appreciate some pointers.

Anonymous said...

Not really synonym normalization. This is how I did it. For example, if I take the document http://www.washingtonpost.com/blogs/the-fix/wp/2013/10/24/john-boehners-next-big-test-immigration-reform/

Just taking the (NN || NNS) and (NNP || NNPS) would give the following candidates (also note that there are some editorial mistakes in them as well :-) which will get discarded)

[remarks, morning, call, immigration reform, President Obama, Thursday, Congress, sentiment, year, year.But, weeks, Washington, necessity, immigration.House, House Speaker John A. Boehner, Speaker John Boehner House Speaker John Boehner, defeat, budget, debt ceiling showdown, party, policy concessions, immigration legislation, time, calendar, Obama, White House, Fresh, GOP, Democrats, Boehner, Wednesday, subject, question, speaker, hopefulness, matter, cast-iron, members, reforms, outlook, Months, immigration, bill, possibility, bet, House Republicans, Senate, times, position, standoff, brand, way, things, ways, damage, government shutdown showdown, part, need, risk, voters, skepticism, independents, Republicans, imperative, policy imperative, vote, reform, type, majority, minority, week, deal, government shutdown, debt ceiling, House Democrats, path, support, back, wall, pressure, Senate Democrats, example, plan, conservatives, thing, flank, conference, fight, hard-liners, bit, list, administration, insurance, marketplaces, penalties, state Democrats, delays, deadline, allegation, meeting, Americans, Obamacare, Sen. Dick Durbin, GOP House Leader, comment, Durbin, end, Steve Daines, senator, mega-donor, weapon, desert, message, governor, run, office, Cory Booker, Oct. 31.GOP, Sheldon Adelson, Iran.Former Arkansas, Mike Huckabee, Sen. Mike Enzi, Liz Cheney .Will Eliot Spitzer]


Now if you take the frequency of the most commonly occurring words within the candidates and then "percentage threshold" the phrases (I use 90%), you get the top results as follows

{theme=[GOP House Leader, Boehner, House Speaker John A. Boehner, House Republicans, White House, House Democrats, Speaker John Boehner House Speaker John Boehner]}

However, you still see that Boehner is repeated multiple times, which can cause trouble, so we normalize using the Stanford algorithm I mentioned, which ends up giving

{theme=[GOP House Leader, House Republicans, House Speaker John A. Boehner, White House, House Democrats, Speaker John Boehner House Speaker John Boehner]}

Which is way saner and looks more appropriate for the doc.

Hope I made sense :-)

Sujit Pal said...

I see, thank you for the detailed explanation. We do something similar (a cruder version of what you are doing) during our concept mapping process - we restrict analysis to noun phrases (although that part is now disabled because of concerns about throwing the baby out with the bathwater) and we post-process the concepts so that the longest match is considered. Thanks to your explanation, I think I now understand the Stanford/Kosmix algo better - there was a mention of using POS tagging, but the example was with query logs, so I missed that part on my first shot.