Some time back, I wrote about a Hadoop Phrase Extractor that went through a collection of electronic books from the Gutenberg project and extracted candidate phrases by parsing out 3-5 word grams. The extractor works fairly well for phrases of 3 words or more with a simple count threshold. However, for 2 word phrases, a count threshold (i.e. only considering word bigrams which occur more than a specific cutoff number of times) did not work so well for me.
In his book Building Search Applications - Lucene, LingPipe and Gate, Dr Manu Konchady touches on using Binomial Distributions to compute the likelihood of a word bigram being a phrase, based on the number of occurrences of the bigram and the number of occurrences of its component words in a document corpus. I decided to explore this a bit further, and in this post I try to explain my understanding of the approach.
Binomial Distribution - Theory
Although I learned about Binomial Distributions in high school, I never really understood them beyond being able to solve simple toy problems, and in any case, it's been a while, so I decided to brush up on the theory. The AP Statistics Tutorial: Binomial Distribution page from StatTrek explains it in great detail, so you may find that page helpful if you are in a similar situation.
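For reference, if an experiment consists of n independent trials, each of which succeeds with probability p, the probability of observing exactly x successes is given by the binomial formula:

b(x; n, p) = C(n, x) * p^x * (1 - p)^(n - x)

where C(n, x) is the binomial coefficient ("n choose x"). This is the quantity the code below computes, in log form, since the individual terms overflow or underflow quickly for large n.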
Essentially, for a word bigram, we treat each occurrence of the second word as a trial, and count it a success when that occurrence is preceded by the first word. We then estimate the likelihood of observing the actual number of bigram occurrences, given the number of occurrences of the second word alone and the observed probability of the second word in the corpus. This reduces our phrase detection problem to a Binomial Distribution problem.
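To make this concrete with the example used later in this post: for the bigram "new york", the number of trials n is the number of occurrences of "york", the number of successes x is the number of occurrences of "new york", and the per-trial success probability p is the corpus probability of "york", i.e. its count divided by the total number of words in the corpus. If "new" and "york" were unrelated, the observed bigram count should be about what the binomial model predicts; a far higher count is evidence that "new york" is a phrase.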
Binomial Distribution Implementation
The commons-math library does not have an implementation for this distribution (this turned out not to be true, as I found out later; please see the update below for details), but it does provide a Distribution Framework and guidelines for creating your own custom distributions, which I used to create my own subclass. The code is shown below:
// Source: src/main/java/net/sf/jtmt/concurrent/hadoop/phraseextractor/BinomialDistribution.java
package net.sf.jtmt.concurrent.hadoop.phraseextractor;

import org.apache.commons.math.MathException;
import org.apache.commons.math.distribution.AbstractIntegerDistribution;
import org.apache.commons.math.distribution.DiscreteDistribution;
import org.apache.commons.math.util.MathUtils;

/**
 * Provides various probability density functions for a Binomial
 * Distribution. Definitions and formulae for the Binomial Distribution
 * can be found here:
 * {@link http://stattrek.com/Lesson2/Binomial.aspx?Tutorial=AP}
 */
public class BinomialDistribution extends AbstractIntegerDistribution
    implements DiscreteDistribution {

  private static final long serialVersionUID = -1858690105951636184L;

  private int n;    // number of trials
  private double p; // probability of success on an individual trial

  /**
   * Construct a Binomial Distribution experiment instance.
   * @param n the number of trials.
   * @param p the probability of success on an individual trial.
   */
  public BinomialDistribution(int n, double p) {
    super();
    this.n = n;
    this.p = p;
  }

  /**
   * Using logs is useful for very large values of n.
   * @param x the number of successes.
   * @return the log of the probability.
   */
  public double logProbability(int x) {
    return MathUtils.binomialCoefficientLog(n, x) +
      (x * Math.log(p)) + ((n - x) * Math.log(1 - p));
  }

  /**
   * Computes the probability that the experiment results in
   * exactly x successes. The computation is done with logarithms
   * internally and converted back to the probability value to
   * prevent overflow.
   * @param x the number of successes, should be an integer.
   * @return the probability of exactly x successes.
   */
  @Override
  public double probability(int x) {
    return Math.exp(MathUtils.binomialCoefficientLog(n, x) +
      (x * Math.log(p)) + ((n - x) * Math.log(1 - p)));
  }

  /**
   * Computes the probability that the experiment results in at
   * most x (i.e. 0 to x) successes.
   * @param x the number of successes, should be an integer.
   * @return the probability of at most x successes.
   */
  @Override
  public double cumulativeProbability(int x) throws MathException {
    double cumulativeProbability = 0.0D;
    for (int i = 0; i <= x; i++) {
      cumulativeProbability += probability(i);
    }
    return cumulativeProbability;
  }

  @Override
  protected int getDomainLowerBound(double p) {
    return 0;
  }

  @Override
  protected int getDomainUpperBound(double p) {
    // the number of successes can range from 0 to n
    return n;
  }
}
To test the component out, I used the problems and their solutions described on the StatTrek page. Here is my JUnit test; the class is not hard to use, and the test also illustrates its usage.
// Source: src/test/java/net/sf/jtmt/concurrent/hadoop/phraseextractor/BinomialDistributionTest.java
package net.sf.jtmt.concurrent.hadoop.phraseextractor;

import java.math.RoundingMode;

import org.apache.commons.math.util.MathUtils;
import org.junit.Assert;
import org.junit.Test;

/**
 * Tests for Binomial Distribution. The tests solve some problems
 * found in the StatTrek AP Statistics page and compare the results
 * with those provided in the page. Link for the StatTrek page is:
 * {@link http://stattrek.com/Lesson2/Binomial.aspx?Tutorial=AP}
 */
public class BinomialDistributionTest {

  /**
   * Problem: Suppose a die is tossed 5 times. What is the probability
   * of getting exactly 2 fours?
   * @throws Exception if thrown.
   */
  @Test
  public void test1() throws Exception {
    double p = 1.0D / 6.0D;
    BinomialDistribution bd = new BinomialDistribution(5, p);
    double probability = bd.probability(2);
    System.out.println(">> b(x = 2; 5, " + p + ") = " + probability);
    // RoundingMode.HALF_UP.ordinal() happens to equal BigDecimal.ROUND_HALF_UP
    Assert.assertEquals(0.161, MathUtils.round(probability, 3,
      RoundingMode.HALF_UP.ordinal()), 0.0001D);
  }

  /**
   * Problem: What is the probability of obtaining 45 or fewer heads in
   * 100 tosses of a coin?
   * @throws Exception if thrown.
   */
  @Test
  public void test2() throws Exception {
    BinomialDistribution bd = new BinomialDistribution(100, 0.5D);
    double cumulativeProbability = bd.cumulativeProbability(45);
    System.out.println(">> b(x <= 45; 100, 0.5) = " +
      cumulativeProbability);
    Assert.assertEquals(0.184, MathUtils.round(cumulativeProbability, 3,
      RoundingMode.HALF_UP.ordinal()), 0.0001D);
  }

  /**
   * Problem: The probability that a student is accepted into a prestigious
   * college is 0.3. If 5 students from the same school apply, what is the
   * probability that at most 2 are accepted?
   * @throws Exception if thrown.
   */
  @Test
  public void test3() throws Exception {
    BinomialDistribution bd = new BinomialDistribution(5, 0.3D);
    double cumulativeProbability = bd.cumulativeProbability(2);
    System.out.println(">> b(x <= 2; 5, 0.3) = " +
      cumulativeProbability);
    Assert.assertEquals(0.8369, MathUtils.round(cumulativeProbability, 4,
      RoundingMode.HALF_UP.ordinal()), 0.00001D);
  }
}
Application in Phrase Identification
I want to use this in the context of a Hadoop job that will extract word unigrams and bigrams and store their counts. Before that, I wanted to write some code that, given parameters I know I can get from a previous stage of my job, computes the likelihood value for a phrase using the problem definition above. Here is a unit test working from that definition.
@Test
public void test4() throws Exception {
  // known parameters (available from a previous stage of the Hadoop job)
  int nNew = 1044;      // number of occurrences of first word "new"
  int nYork = 597;      // number of occurrences of second word "york"
  int nNewYork = 588;   // number of bigram occurrences of "new york"
  int nWords = 1900000; // total words in document corpus
  // observed probability of second word
  double pYork = (double) nYork / (double) nWords;
  BinomialDistribution bd = new BinomialDistribution(nYork, pYork);
  // our probabilities in data mining situations are usually too small,
  // so we should use the log probabilities. The logEstimatedProbability
  // is the log likelihood of "new" preceding the word "york" in the
  // case where "new" and "york" are assumed to have no relation.
  double logEstimatedProbability = bd.logProbability(nNewYork);
  // compare with the log of the observed probability of the bigram
  double logActualProbability =
    Math.log((double) nNewYork / (double) nWords);
  // determine whether this is a phrase by a simple test
  boolean likely = (logActualProbability - logEstimatedProbability > 0);
  System.out.println("likely? " + likely);
  Assert.assertTrue(likely);
}
The method outlined in the book is significantly more complex: it sets up a null hypothesis of independence and a corresponding dependence hypothesis, and computes likelihood ratios for the cases where "New" precedes "York" and where "New" does not precede "York", under both hypotheses. I traced through the method but could not figure out why the algorithm needs to be so complex, so I decided to do it the simpler way described above (my reading of the more complex approach is sketched below). If you find flaws in this approach, please let me know.
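For completeness, here is a minimal sketch of what I believe the more complex approach computes. This is the standard log-likelihood ratio test for bigram collocations (Dunning's test, as described in Manning and Schütze's Foundations of Statistical Natural Language Processing), which the book's method appears to be based on; it is my interpretation, not the book's exact algorithm, and the class name is just a placeholder. The binomial coefficients cancel out of the ratio, so only the log-likelihood terms remain.

// A sketch of the log-likelihood ratio test for a bigram (w1, w2).
// c1 = count(w1), c2 = count(w2), c12 = count(w1 w2), n = total words.
// H0 (independence): P(w2|w1) = P(w2|not w1) = c2/n
// H1 (dependence):   P(w2|w1) = c12/c1, P(w2|not w1) = (c2-c12)/(n-c1)
public class LogLikelihoodRatio {

  // binomial log-likelihood of k successes in m trials with success
  // probability x; the C(m,k) term is omitted since it cancels in the
  // ratio. Terms with a zero coefficient are skipped to avoid 0 * log(0).
  private static double logL(long k, long m, double x) {
    double ll = 0.0D;
    if (k > 0) {
      ll += k * Math.log(x);
    }
    if (m - k > 0) {
      ll += (m - k) * Math.log(1.0D - x);
    }
    return ll;
  }

  public static double llr(long c1, long c2, long c12, long n) {
    double p = (double) c2 / (double) n;
    double p1 = (double) c12 / (double) c1;
    double p2 = (double) (c2 - c12) / (double) (n - c1);
    return logL(c12, c1, p) + logL(c2 - c12, n - c1, p)
      - logL(c12, c1, p1) - logL(c2 - c12, n - c1, p2);
  }
}

Under the null hypothesis, -2 times the returned value is approximately chi-square distributed with one degree of freedom, so it can be compared against a critical value (for example, about 7.88 at the 0.005 significance level) to decide whether the bigram is a phrase.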
Update 2009-10-23: commons-math does provide a Binomial Distribution implementation (called BinomialDistributionImpl), and has at least since version 1.0; I am not sure how I missed it, but unfortunately I did. The DistributionFactory class referenced in the older documentation seems to have disappeared from commons-math 2.0 (which I am using), so you can instantiate the implementation directly with a BinomialDistribution bd = new BinomialDistributionImpl(n, p) call. However, the current implementation does not use the MathUtils.binomialCoefficientLog() call, which makes it unsuitable for my usage (see the logProbability() method in my implementation). So one possibility is to subclass BinomialDistributionImpl and add the log version there. Another way is to simply create a method that takes the n, p and x parameters and returns the log probability, which is probably the path I will be taking; a sketch of that follows.
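Here is a minimal sketch of that last option, with the same log-space computation as the logProbability() method above; the class name BinomialFunctions is just a placeholder.

import org.apache.commons.math.util.MathUtils;

public class BinomialFunctions {

  /**
   * Log probability of exactly x successes in n trials with success
   * probability p, computed in log space to avoid overflow for large n.
   */
  public static double logProbability(int n, double p, int x) {
    return MathUtils.binomialCoefficientLog(n, x) +
      (x * Math.log(p)) + ((n - x) * Math.log(1.0D - p));
  }
}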