Saturday, October 24, 2009

Finding Phrases with Binomial Distribution

Last week, I wrote about the Binomial Distribution and worked through the "New York" example from Dr Manu Konchady's "Building Search Applications" book. This week, I enhance my old Phrase Extractor code to filter the generated word grams, using the ratio of the actual probability of a phrase to its estimated probability (according to the Binomial Distribution) as a guide.

Overview of Approach

Since I am just exploring the approach, I limit myself to 1 and 2 word grams. However, it can easily be extended to 3, 4 and 5 word grams by considering the phrases found in the previous stage. Our input is a preprocessed sequence file of sentences, one per line, generated by combining the "Alice in Wonderland", "Moby Dick" and "The Adventures of Sherlock Holmes" ebooks from the Gutenberg project. The processing is done using a sequence of two map-reduce jobs. The first job writes the word 1-grams and 2-grams and their occurrence counts into a database table. The second job reads through the 2-grams, looks up the occurrence counts of the corresponding 1-grams, computes the estimated and actual probabilities, and does the filtering.
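To make the filtering criterion concrete, here is a small standalone sketch of the computation the second job performs for a single bigram. The counts below are made up for illustration (they are not from the actual corpus); the formula is the binomial estimate from my previous post, with n the occurrence count of the trailing word, p its probability in the corpus, and x the occurrence count of the bigram.

/**
 * Standalone sketch of the binomial filtering criterion for a single
 * bigram. All counts below are hypothetical, for illustration only.
 */
public class BinomialPhraseCheck {

  // log of the binomial coefficient C(n, x), computed incrementally
  private static double logBinomialCoefficient(long n, long x) {
    double result = 0.0;
    for (long i = 0; i < x; i++) {
      result += Math.log(n - i) - Math.log(i + 1);
    }
    return result;
  }

  public static void main(String[] args) {
    long numWordsInCorpus = 100000L; // N: total words (hypothetical)
    long trailingWordOccurs = 500L;  // n: occurrences of "york" (hypothetical)
    long phraseOccurs = 50L;         // x: occurrences of "new york" (hypothetical)

    // probability of the trailing word under the independence assumption
    double p = (double) trailingWordOccurs / numWordsInCorpus;

    // estimated log probability of seeing x occurrences of the bigram if
    // the two words were independent: log[C(n,x) * p^x * (1-p)^(n-x)]
    double estLogProb =
      logBinomialCoefficient(trailingWordOccurs, phraseOccurs)
      + phraseOccurs * Math.log(p)
      + (trailingWordOccurs - phraseOccurs) * Math.log(1 - p);

    // actual log probability of the bigram observed in the corpus
    double actLogProb = Math.log((double) phraseOccurs / numWordsInCorpus);

    // keep the bigram as a likely phrase if the actual probability is
    // much larger than the independence estimate
    double diff = actLogProb - estLogProb;
    System.out.println("log likelihood ratio = " + diff
      + ", likely phrase? " + (diff > 100));
  }
}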

Using DBInput/Output

One thing that was new to me is the use of a relational database as a Hadoop input and output format. Since I am using the new Hadoop API introduced in Hadoop 0.20, and the corresponding DBInput/OutputFormat classes are not available in that version, I had to create local copies of these artifacts from the (as yet unreleased) Hadoop 0.22 distribution. I have checked these into JTMT's svn repository temporarily so that everything compiles.

I needed a database because I wanted to be able to pull individual occurrence counts out for a given phrase. I suppose I could have done this with HBase, but I know next to nothing about it, and using an RDBMS with Hadoop seemed like a good thing to know about. In any case, I am using a local MySQL instance as my database.

Apart from copying over all the DB* classes from org.apache.hadoop.mapreduce.lib.db, I had to create a custom DBWritable implementation representing a record in the MySQL table. It is shown below:

// Source: src/main/java/net/sf/jtmt/concurrent/hadoop/phraseextractor/PhraseCounterDbWritable.java
package net.sf.jtmt.concurrent.hadoop.phraseextractor;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import net.sf.jtmt.concurrent.hadoop.phraseextractor.db.DBWritable;

import org.apache.commons.lang.builder.ReflectionToStringBuilder;
import org.apache.commons.lang.builder.ToStringStyle;
import org.apache.hadoop.io.Writable;

/**
 * A writable that can be used to read and write data from the table
 * tmdb:my_phrase_counts:
 * +-----------+--------------+------+-----+---------+-------+
 * | Field     | Type         | Null | Key | Default | Extra |
 * +-----------+--------------+------+-----+---------+-------+
 * | phrase    | varchar(255) | NO   | MUL | NULL    |       | 
 * | gram_size | int(11)      | NO   | MUL | NULL    |       | 
 * | occurs    | mediumtext   | NO   |     | NULL    |       | 
 * +-----------+--------------+------+-----+---------+-------+
 */
public class PhraseCounterDbWritable implements Writable, DBWritable {

  private String phrase;
  private int gramSize;
  private long occurs;
  
  public String getPhrase() {
    return phrase;
  }

  public void setPhrase(String phrase) {
    this.phrase = phrase;
  }

  public int getGramSize() {
    return gramSize;
  }

  public void setGramSize(int gramSize) {
    this.gramSize = gramSize;
  }

  public long getOccurs() {
    return occurs;
  }

  public void setOccurs(long occurs) {
    this.occurs = occurs;
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    phrase = in.readUTF();
    gramSize = in.readInt();
    occurs = in.readLong();
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeUTF(phrase);
    out.writeInt(gramSize);
    out.writeLong(occurs);
  }

  @Override
  public void readFields(ResultSet resultSet) throws SQLException {
    phrase = resultSet.getString(1);
    gramSize = resultSet.getInt(2);
    occurs = resultSet.getLong(3);
  }

  @Override
  public void write(PreparedStatement statement) throws SQLException {
    statement.setString(1, phrase);
    statement.setInt(2, gramSize);
    statement.setLong(3, occurs);
  }
  
  @Override
  public String toString() {
    return ReflectionToStringBuilder.reflectionToString(
      this, ToStringStyle.DEFAULT_STYLE);
  }
}

The code

Here is the main driver code. It is a simple pipeline of two map-reduce jobs, with some straightforward JDBC code thrown in. The first step drops and recreates the my_phrase_counts table. The second step reads the sequence file containing the sentences and writes the 1 and 2 word grams to the table. The third step puts indexes on the table - inserts are faster without indexes, but we definitely need indexes to look up by phrase and gram size in the next step. The fourth step filters phrases based on the estimated and actual probabilities and writes them out to a text output file.

// Source: src/main/java/net/sf/jtmt/concurrent/hadoop/phraseextractor/PhraseExtractor.java
package net.sf.jtmt.concurrent.hadoop.phraseextractor;

import java.io.IOException;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// TODO: remove these when moving to Hadoop 0.22+
import net.sf.jtmt.concurrent.hadoop.phraseextractor.db.DBConfiguration;
import net.sf.jtmt.concurrent.hadoop.phraseextractor.db.DBInputFormat;
import net.sf.jtmt.concurrent.hadoop.phraseextractor.db.DBOutputFormat;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

/**
 * Driver for the phrase extraction pipeline: generates word n-grams from
 * a sequence file of sentences, writes their counts to a database table,
 * and filters likely phrases using the binomial distribution.
 */
public class PhraseExtractor {

  public static final int MIN_GRAM_SIZE = 1;
  public static final int MAX_GRAM_SIZE = 3;
  public static final int MIN_OCCURRENCES_THRESHOLD = 5;
  public static final int MIN_LOG_LIKELIHOOD_RATIO = 100;
  
  public static final String MIN_GRAM_SIZE_KEY =
    "phrase_extractor.min.gram.size";
  public static final String MAX_GRAM_SIZE_KEY =
    "phrase_extractor.max.gram.size";
  public static final String MIN_OCCURRENCES_KEY =
    "phrase_extractor.min.occurs";
  public static final String MIN_LOG_LIKELIHOOD_RATIO_KEY =
    "phrase_extractor.min.llr";
  public static final String NUM_WORDS_IN_CORPUS_KEY =
    "phrase_extractor.num.words";
  
  private static void recreateTable(Configuration conf) throws Exception {
    Connection conn = null;
    PreparedStatement ps = null;
    try {
      DBConfiguration dbconf = new DBConfiguration(conf);
      conn = dbconf.getConnection();
      ps = conn.prepareStatement("drop table my_phrase_counts");
      ps.executeUpdate();
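      // note: MySQL treats LONG as a synonym for MEDIUMTEXT, which is why
      // the occurs column shows up as mediumtext in the table description
      // in the PhraseCounterDbWritable javadoc above.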
      ps = conn.prepareStatement("create table my_phrase_counts (" +
        "phrase varchar(255) not null, " +
        "gram_size int not null," +
        "occurs long not null" +
        ") type=ISAM");
      ps.executeUpdate();
    } catch (SQLException e) {
      throw new Exception(e);
    } finally {
      if (ps != null) {
        try { ps.close(); } catch (SQLException e) {}
        try { conn.close(); } catch (SQLException e) {}
      }
    }
  }

  private static void writeNGrams(Configuration conf, Path inputPath) 
      throws Exception {
    DBConfiguration dbconf = new DBConfiguration(conf);
    dbconf.setOutputTableName("my_phrase_counts");
    dbconf.setOutputFieldNames("phrase", "gram_size", "occurs");
    Job job = new Job(conf, "write-ngrams");
    job.getConfiguration().setInt(MIN_GRAM_SIZE_KEY, MIN_GRAM_SIZE);
    job.getConfiguration().setInt(MAX_GRAM_SIZE_KEY, MAX_GRAM_SIZE);
    job.getConfiguration().setInt(MIN_OCCURRENCES_KEY, 
      MIN_OCCURRENCES_THRESHOLD);
    job.setJarByClass(PhraseExtractor.class);
    FileInputFormat.addInputPath(job, inputPath);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setMapperClass(WordNGramMapper.class);
    job.setReducerClass(WordNGramReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(LongWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(PhraseCounterDbWritable.class);
    job.setOutputFormatClass(DBOutputFormat.class);
    job.setNumReduceTasks(2);
    boolean status = job.waitForCompletion(true);
    if (! status) {
      throw new Exception("Job " + job.getJobName() + " failed!");
    }
  }

  private static void createIndex(Configuration conf) throws Exception {
    Connection conn = null;
    PreparedStatement ps = null;
    try {
      DBConfiguration dbconf = new DBConfiguration(conf);
      conn = dbconf.getConnection();
      ps = conn.prepareStatement("create index my_phrase_counts_ix1 " +
        "on my_phrase_counts(gram_size)");
      ps.executeUpdate();
      ps = conn.prepareStatement("create index my_phrase_counts_ix2 " +
        "on my_phrase_counts(phrase)");
      ps.executeUpdate();
    } catch (SQLException e) {
      throw(e);
    } finally {
      if (ps != null) {
        try { ps.close(); } catch (SQLException e) {}
        try { conn.close(); } catch (SQLException e) {}
      }
    }
  }

  private static long getNumWordsInCorpus(DBConfiguration dbconf) 
      throws IOException {
    Connection conn = null;
    PreparedStatement ps = null;
    ResultSet rs = null;
    try {
      conn = dbconf.getConnection();
      ps = conn.prepareStatement("select sum(occurs) " +
        "from my_phrase_counts where gram_size=1");
      rs = ps.executeQuery();
      rs.next();
      return rs.getLong(1);
    } catch (SQLException e) {
      throw new IOException(e);
    } catch (ClassNotFoundException e) {
      throw new IOException(e);
    } finally {
      if (rs != null) {
        try { rs.close(); } catch (SQLException e) {}
        try { ps.close(); } catch (SQLException e) {}
        try { conn.close(); } catch (SQLException e) {}
      }
    }
  }

  private static void findLikelyPhrases(Configuration conf, int gramSize,
      Path outputPath) throws Exception {
    DBConfiguration dbconf = new DBConfiguration(conf);
    dbconf.setInputTableName("my_phrase_counts");
    dbconf.setInputFieldNames("phrase", "gram_size", "occurs");
    dbconf.setInputConditions("gram_size=" + gramSize);
    dbconf.setInputClass(PhraseCounterDbWritable.class);
    Job job = new Job(conf, "find-likely-phrases");
    job.getConfiguration().setLong(NUM_WORDS_IN_CORPUS_KEY, 
      getNumWordsInCorpus(dbconf));
    job.getConfiguration().setInt(MIN_LOG_LIKELIHOOD_RATIO_KEY, 
      MIN_LOG_LIKELIHOOD_RATIO);
    job.setJarByClass(PhraseExtractor.class);
    job.setInputFormatClass(DBInputFormat.class);
    job.setMapperClass(LikelyPhraseMapper.class);
    job.setReducerClass(LikelyPhraseReducer.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(DoubleWritable.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileOutputFormat.setOutputPath(job, outputPath);
    job.setNumReduceTasks(2);
    boolean status = job.waitForCompletion(true);
    if (! status) {
      throw new Exception("Job " + job.getJobName() + " failed!");
    }
  }

  public static void main(String[] argv) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = 
      new GenericOptionsParser(conf, argv).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: calc input_path output_path");
      System.exit(-1);
    }
    // set up database properties
    DBConfiguration.configureDB(conf, 
      "com.mysql.jdbc.Driver", "jdbc:mysql://localhost:3306/tmdb", 
      "root", "5ec4e1");
    recreateTable(conf);
    writeNGrams(conf, new Path(otherArgs[0]));
    createIndex(conf);
    findLikelyPhrases(conf, 2, new Path(otherArgs[1]));
  }
}

The WordNGramMapper is not very different from the previous version. All it does is find the N-grams in a given sentence and write them out to the context with a count of 1. Details about the WordNGramGenerator itself can be found in my previous post; a rough sketch is shown below.
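The actual WordNGramGenerator lives in the JTMT repository and may differ in tokenization and normalization, but a minimal sketch of a generator honoring the same generate(sentence, minGramSize, maxGramSize) contract would be something like this:

import java.util.ArrayList;
import java.util.List;

/**
 * Rough sketch of a word n-gram generator; the actual WordNGramGenerator
 * in the JTMT repository may differ in tokenization and normalization.
 */
public class SimpleWordNGramGenerator {

  /** Returns all n-grams of the sentence with minSize <= n <= maxSize. */
  public List<String> generate(String sentence, int minSize, int maxSize) {
    List<String> grams = new ArrayList<String>();
    String[] words = sentence.trim().toLowerCase().split("\\s+");
    for (int n = minSize; n <= maxSize; n++) {
      for (int i = 0; i + n <= words.length; i++) {
        StringBuilder buf = new StringBuilder();
        for (int j = i; j < i + n; j++) {
          if (j > i) {
            buf.append(' ');
          }
          buf.append(words[j]);
        }
        grams.add(buf.toString());
      }
    }
    return grams;
  }
}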

// Source: src/main/java/net/sf/jtmt/concurrent/hadoop/phraseextractor/WordNGramMapper.java
package net.sf.jtmt.concurrent.hadoop.phraseextractor;

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Reads a sequence file and converts it to {ngram, count}.
 */
public class WordNGramMapper 
    extends Mapper<LongWritable,Text,Text,LongWritable> {

  private static final LongWritable ONE = new LongWritable(1L);

  private static int minGramSize;
  private static int maxGramSize;

  @Override
  public void setup(Context context) {
    minGramSize = context.getConfiguration().getInt(
      PhraseExtractor.MIN_GRAM_SIZE_KEY, PhraseExtractor.MIN_GRAM_SIZE);
    maxGramSize = context.getConfiguration().getInt(
      PhraseExtractor.MAX_GRAM_SIZE_KEY, PhraseExtractor.MAX_GRAM_SIZE);
  }

  @Override
  public void map(LongWritable key, Text value, Context context) 
  throws IOException, InterruptedException {
    String sentence = value.toString();
    WordNGramGenerator ngramGenerator = new WordNGramGenerator();
    List<String> grams = ngramGenerator.generate(
      sentence, minGramSize, maxGramSize);
    for (String gram : grams) {
      context.write(new Text(gram), ONE);
    }
  }
}

The WordNGramReducer aggregates the n-gram counts and writes them out to the database table. DBOutputFormat requires the data to be populated into the key, so that's what I have done here.

// Source: src/main/java/net/sf/jtmt/concurrent/hadoop/phraseextractor/WordNGramReducer.java
package net.sf.jtmt.concurrent.hadoop.phraseextractor;

import java.io.IOException;

import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * Writes out n-gram counts, filtering out the n-grams whose occurrence
 * counts fall below a predefined threshold.
 */
public class WordNGramReducer extends  
    Reducer<Text,LongWritable,PhraseCounterDbWritable,NullWritable> {

  private static int minOccurrences;

  @Override
  public void setup(Context context) {
    minOccurrences = 
      context.getConfiguration().getInt(PhraseExtractor.MIN_OCCURRENCES_KEY, 
      PhraseExtractor.MIN_OCCURRENCES_THRESHOLD);
  }

  @Override
  public void reduce(Text key, Iterable<LongWritable> values, 
    Context context) throws IOException, InterruptedException {
    long sum = 0L;
    for (LongWritable value : values) {
      sum += value.get();
    }
    if (sum > minOccurrences) {
      int gramSize = StringUtils.split(key.toString(), " ").length;
      PhraseCounterDbWritable okey = new PhraseCounterDbWritable();
      okey.setPhrase(key.toString());
      okey.setGramSize(gramSize);
      okey.setOccurs(sum);
      context.write(okey, NullWritable.get());
    }
  }
}

The LikelyPhraseMapper reads bigrams from the database table. For each bigram phrase, it looks up the occurrence count of the corresponding trailing word, computes the actual and estimated probabilities for the phrase, and writes the qualifying phrases into the context.

There is a bug in the WordNGramGenerator where some trailing words are counted in the bigram but not by themselves. I discovered this when trying to run the mapper - the binomial coefficient nCr was hitting a case where r > n, which is like saying that in a corpus "New York" occurred more times than "York", which obviously cannot happen. I haven't had time to fix this bug, but I will (it's marked by a TODO in the code below).

// Source: src/main/java/net/sf/jtmt/concurrent/hadoop/phraseextractor/LikelyPhraseMapper.java
package net.sf.jtmt.concurrent.hadoop.phraseextractor;

import java.io.IOException;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import net.sf.jtmt.concurrent.hadoop.phraseextractor.db.DBConfiguration;

import org.apache.commons.math.util.MathUtils;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Reads in bigrams from the database and computes the likelihood of the
 * phrase using Binomial distribution. Writes out phrases which satisfy
 * the test.
 */
public class LikelyPhraseMapper extends
    Mapper<LongWritable,PhraseCounterDbWritable,Text,DoubleWritable> {

  private static DBConfiguration dbconf;
  private static long nWords;
  private static double logLikelihoodThreshold;

  @Override
  public void setup(Context context) 
      throws IOException, InterruptedException {
    dbconf = new DBConfiguration(context.getConfiguration());
    nWords = context.getConfiguration().getLong(
      PhraseExtractor.NUM_WORDS_IN_CORPUS_KEY, -1L);
    logLikelihoodThreshold = context.getConfiguration().getInt(
      PhraseExtractor.MIN_LOG_LIKELIHOOD_RATIO_KEY, 
      PhraseExtractor.MIN_LOG_LIKELIHOOD_RATIO);
  }

  @Override
  public void map(LongWritable key, PhraseCounterDbWritable value, 
    Context context) throws IOException, InterruptedException {
    String phrase = value.getPhrase();
    long phraseOccurs = value.getOccurs();
    String trailingWord = phrase.substring(phrase.lastIndexOf(' ') + 1);
    long trailingWordOccurs = getOccurrence(trailingWord);
    double pTrailingWord = (double) trailingWordOccurs / (double) nWords;
    if (phraseOccurs > trailingWordOccurs) {
      // TODO: fix this bug, this is impossible, and points to a bug in
      // the NGram generator code.
    } else {
      double estPhraseLogProbability = getEstimatedLogProbability(
        trailingWordOccurs, pTrailingWord, phraseOccurs);
      double actPhraseLogProbability = 
        Math.log(phraseOccurs) - Math.log(nWords);
      // if the actual log probability exceeds the estimated log probability
      // (given by the binomial distribution) by more than the threshold,
      // then we consider this to be a likely phrase.
      double diff = actPhraseLogProbability - estPhraseLogProbability;
      if (diff > logLikelihoodThreshold) {
        context.write(new Text(phrase), new DoubleWritable(diff));
      }
    }
  }

  private long getOccurrence(String trailingWord) throws IOException {
    Connection conn = null;
    PreparedStatement ps = null;
    ResultSet rs = null;
    try {
      conn = dbconf.getConnection();
      ps = conn.prepareStatement("select occurs " +
      "from my_phrase_counts where phrase = ?");
      ps.setString(1, trailingWord);
      rs = ps.executeQuery();
      rs.next();
      return rs.getLong(1);
    } catch (SQLException e) {
      throw new IOException(e);
    } catch (ClassNotFoundException e) {
      throw new IOException(e);
    } finally {
      if (rs != null) {
        try { rs.close(); } catch (SQLException e) {}
        try { ps.close(); } catch (SQLException e) {}
        try { conn.close(); } catch (SQLException e) {}
      }
    }
  }

  private double getEstimatedLogProbability(long n, double p, long x) {
    return MathUtils.binomialCoefficientLog((int) n, (int) x) + 
    (x * Math.log(p)) + ((n - x) * Math.log(1 - p));
  }
}

The LikelyPhraseReducer is just a simple identity reducer. The mapper writes out zero or one row for each word bigram, and the reducer just writes the data out to the text output.

// Source: src/main/java/net/sf/jtmt/concurrent/hadoop/phraseextractor/LikelyPhraseReducer.java
package net.sf.jtmt.concurrent.hadoop.phraseextractor;

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * Simple identity reducer, writes out qualifying phrase and its log
 * likelihood ratio to a text output.
 */
public class LikelyPhraseReducer
    extends Reducer<Text,DoubleWritable,Text,DoubleWritable> {

  @Override
  public void reduce(Text key, Iterable<DoubleWritable> values,
      Context context) throws IOException, InterruptedException {
    for (DoubleWritable value : values) {
      // will only have 1 value in the iterable
      context.write(key, value);
    }
  }
}

Results

The results are definitely better than with the previous implementation, but still contain a lot of junk. Here are some of the nicer phrases it caught (one from each of the books above).

sherlock holmes 348.64636504741196
sperm whale     327.2002806998291
your majesty    213.70131001002
...

Out of the 5447 two word grams that were considered, 1358 were flagged as likely phrases with a log likelihood ratio threshold of 100 (i.e. the log of the actual probability exceeds the log of the estimated probability by at least 100). Perhaps I should increase the threshold - as you can see, the "nice" results shown above all have ratios of 200 or more.

Another thing I noticed is that a lot of the "likely" phrases have stopwords in them - perhaps I should filter out such phrases. I will post updates as I find out more.
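A simple filter along these lines could be built as shown in the sketch below; the stopword.txt format (one word per line) and the way the filter is wired into the mapper are assumptions on my part.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

/**
 * Sketch of a stopword filter; the stopword.txt format (one word per
 * line) is an assumption.
 */
public class StopwordFilter {

  private final Set<String> stopwords = new HashSet<String>();

  public StopwordFilter(String stopwordFile) throws IOException {
    BufferedReader reader = new BufferedReader(new FileReader(stopwordFile));
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        stopwords.add(line.trim().toLowerCase());
      }
    } finally {
      reader.close();
    }
  }

  /** Returns true if any word in the phrase is a stopword. */
  public boolean containsStopword(String phrase) {
    for (String word : phrase.toLowerCase().split("\\s+")) {
      if (stopwords.contains(word)) {
        return true;
      }
    }
    return false;
  }
}

In the LikelyPhraseMapper, such a filter would be loaded once in setup() and any phrase for which containsStopword() returns true would simply be skipped.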

Update 2009-10-28: Putting in a simple filter to remove phrases which contain stopwords (based on my stopword.txt file) resulted in 27 very nice phrases, which are shown below:

aye aye           141.51576072034712
baker street      129.34515160313717
captain ahab      168.36076624969655
captain peleg     165.97023802793268
good deal         138.35536009692996
hosmer angel      116.01526169240735
lord st           150.75121049953745
march hare        246.8881927527065
miss hunter       118.52058386372536
miss stoner       110.05045443880385
moby dick         563.8792649078067
mock turtle       381.22084668006727
mr holmes         253.46200914806514
mr starbuck       105.5923146454597
mr windibank      105.90936635039294
project gutenberg 326.82789982895486
sherlock holmes   350.16072531114384
small print       139.89748896049025
sperm whale       329.133237299884
sperm whale's     140.82282422992682
st clair          171.47400592362465
st simon          298.5832451161461
thou art          130.54357132297693
white rabbit      144.84973106411437
white whale       198.02535045342663
years ago         190.3057543422783
young lady        124.60954738818677

On reading a bit more about the Binomial Distribution, I learned that in certain cases it can be approximated by a Gaussian Distribution, and one can therefore use a cutoff that reflects a confidence level (i.e. a multiple of its z-score). So I changed the code in LikelyPhraseMapper to compute the mean and standard deviation of the Binomial Distribution for each phrase, then its z-score, and then the distance of the observed probability from the mean as a multiple of the z-score. If the distance exceeds 0.1 of z, we consider the independence hypothesis (i.e. that the two words in the phrase are independent) refuted at the 10% confidence level.

With this approach, I get a larger number of phrases, and these phrases look quite good too. Most of the original 27 phrases are included in this larger set. Right now the code is in a pretty hacked-up state, so I need to do a bit of cleanup before I can post it to the svn repository, but just to put into code what I explained in the previous paragraph, here is the snippet that needs to go into the LikelyPhraseMapper.

  @Override
  public void map(LongWritable key, PhraseCounterDbWritable value, Context context) 
      throws IOException, InterruptedException {
    ...
    String phrase = value.getPhrase();
    long phraseOccurs = value.getOccurs();
    String trailingWord = phrase.substring(phrase.lastIndexOf(' ') + 1);
    long trailingWordOccurs = getOccurrence(trailingWord);
    double pTrailingWord = (double) trailingWordOccurs / (double) nWords;
    // BinomialDistribution(Impl) comes from commons-math
    // (org.apache.commons.math.distribution)
    BinomialDistribution dist = 
      new BinomialDistributionImpl((int) trailingWordOccurs, pTrailingWord);
    double estPhraseProbability = dist.probability(phraseOccurs);
    double distMean = trailingWordOccurs * pTrailingWord;
    double distStdDev = Math.sqrt(distMean * (1 - pTrailingWord));
    // cast to double, otherwise this is integer division; the actual
    // phrase probability is the phrase count over the corpus size
    double actPhraseProbability = (double) phraseOccurs / (double) nWords;
    double zScore = (estPhraseProbability - distMean) / distStdDev;
    double distance = Math.abs((actPhraseProbability - distMean) / zScore);
    if (distance >= 0.1257D) { // 10% confidence level
      // if the condition holds, we can refute the independence hypothesis
      // with 10% confidence.
      context.write(new Text(phrase), new DoubleWritable(distance));
    }
  }

I plan to work some more on some of the other phrase extraction ideas outlined in Dr Konchady's book, so I will do the necessary refactoring to make the filtering algorithm pluggable.
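Just to sketch what that refactoring might look like (this interface is hypothetical and does not exist in the repository yet), both the log likelihood ratio filter and the z-score based filter could implement something like:

/**
 * Hypothetical interface for pluggable phrase likelihood filters; this is
 * a sketch of the planned refactoring, not code that exists in the JTMT
 * repository yet.
 */
public interface PhraseLikelihoodFilter {

  /**
   * Returns a score for the phrase, or a negative value if the phrase does
   * not qualify as likely.
   *
   * @param phraseOccurs       number of occurrences of the full phrase.
   * @param trailingWordOccurs number of occurrences of the trailing word.
   * @param numWordsInCorpus   total number of words in the corpus.
   */
  double score(long phraseOccurs, long trailingWordOccurs,
    long numWordsInCorpus);
}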

10 comments (moderated to prevent spam):

webdesing said...

nice post mate.....


web design

Sujit Pal said...

Thank you, glad you liked it.

Anonymous said...

can you identify the differences from the traditional Xtract ? or the Dragon toolkit phrase indexer ?

Sujit Pal said...

Hi, thanks for the pointers... I am aware of Dragon (but haven't used it), but was not aware of Xtract. From the abstract, the approach I describe is similar to the one used in Xtract, although it is possible that the filtering criteria may be different.

Anonymous said...

Well sorry for commenting this post again, but i have few things to say here. Xtract is not a Dragon toolkit algorithm, it is much older than Dragon toolkit, and have much solid filtering scheme. However, if you try it on a small news corpus (Such like 20 Newsgroup or Reuters) it will give a set of meaningless bi-grams with high ranks. Further, considering POS tagging with a simple regular expression for extracting Proper Nouns or Noun Phrases would give more meaningful phrases, and no matter how good their rank is, they will be much more expressive.

If you please share why did you bring this up, maybe we can understand the idea better :)

Sujit Pal said...

Hi Anonymous, to answer your last question first, the objective of the exercise is to find candidate concepts for a taxonomy from large bodies of text. At this point I am just trying to perfect the algorithm, hence the inputs are ebooks from Gutenberg.

I like your idea of using a regular expression (or NLP) to filter phrases which are noun phrases. That was one of the things I considered doing when filtering them out - ultimately I decided to just filter the ones with stopwords in them, that gives me quite good results, but I do believe that filtering on noun phrases will be even better, so thank you for that.

As for your comment on Xtract vs Dragon, I re-read my reply to you, and I don't think I implied that Xtract is a part of Dragon. The only thing I saw about Xtract was a paper from Microsoft, so I assumed that it was something from them. Reading through the abstract led me to believe that our approaches are similar, but that is about all I could figure out from the abstract.

Anonymous said...

As you are working on taxonomies, i would suggest you (if you didn't do that earlier) to start looking for clues in text. That was of great help for me once :)

Sujit Pal said...

Thanks Anonymous, both for this tip and your previous one. I did think of this, but was putting it off, since it's a lot of work... first tag, then look for patterns... but it's definitely worth trying.

Anonymous said...

Nice article ,Can you guide or give some pointers on how we can combine KEA with UIMA.

Sujit Pal said...

Thanks Anonymous, but my preference would be to keep this processing offline... the reason is you are effectively mining your entire corpus for phrases. Once you have those phrases however, you could (probably) use UIMA to annotate them as such. Or even better, just mark them as keywords so they are not tokenized by your analyzer chain. Depending on how many phrases you are looking at, even this may be too time-consuming to do inline.