Friday, November 06, 2009

Finding Phrases - Two Statistical Approaches

In my previous post, I described how I used the Binomial Distribution to compute the expected probability of a word n-gram, and compared it with the observed probability to decide whether the n-gram is a phrase. I first learned about using a Binomial Distribution for phrase detection from Dr Konchady's Building Search Applications (BSA) book. The book describes two approaches, but the explanation is a bit sketchy, and I could not understand the methods fully at first, so I developed my own.

Over the past couple of weeks, I have been trying to figure out these two methods, and I think I now understand them well enough to be comfortable using them. For the benefit of others who are as perplexed as I was by the mental leaps in the book's explanation, I have included my notes in this post. Hopefully they will help in understanding the code.

As I mentioned in my update to last week's post, I now have a pluggable interface for filtering phrases, called IPhraseFilter. There are currently four implementations available in the repository, two from last week and two from this week. The LikelyPhraseMapper calls the desired implementation. To keep the post down to a reasonable length, I'll only describe the phrase filter implementations - if you want to see how they fit in, you can refer to my previous post for the Mapper and Reducer classes, or download the whole thing from the jtmt repository.
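
For reference, here is roughly what the IPhraseFilter interface looks like. This is a sketch reconstructed from the two implementations shown later in this post; the actual source is in the jtmt repository.

// Approximate shape of IPhraseFilter, inferred from the implementations below;
// see the jtmt repository for the real source.
package net.sf.jtmt.phrase.filters;

public interface IPhraseFilter {

  /**
   * Returns a determinant for the bigram. A positive value means the bigram
   * should be retained as a phrase, a negative value means it should be dropped.
   */
  double getPhraseDeterminant();
}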

One thing I liked about both of the approaches described below is that there is no need to choose an arbitrary cutoff to ensure the results are good. In both cases the cutoff is 0, ie if the determinant is negative the bigram is dropped, and if it is positive it is retained.

Likelihood Ratio Approach

This approach sets up two hypotheses and compares their likelihoods. The hypotheses are as follows:

For any word bigram {w1,w2} in the document corpus:

H0 = w2 and w1 are independent
H1 = w2 is dependent on w1

We first calculate the actual probabilities for the independent and dependent cases from the occurrence counts of the words in the corpus, like so:


   | P(w2|w1)       | P(w2|¬w1)
---+----------------+----------------------------
H0 | p00 = n2 / N   | p01 = n2 / N
H1 | p10 = n12 / n1 | p11 = (n2 - n12) / (N - n1)

where:


n1  = number of occurrences of w1
n2  = number of occurrences of w2
n12 = number of occurrences of {w1,w2}
N   = number of words

The derivations of the pij values in the table above are shown below.

Under the null hypothesis H0:

p00  = P(w2|w1)

        P(w2 ∩ w1)
     = ------------           ... by definition of conditional
          P(w1)                   probability

        P(w2) * P(w1)
     = --------------         ... since w1 and w2 are independent
          P(w1)

     = n2 / N

             
p01  = P(w2|¬w1) 
     = P(w2)                  ... since under H0, w2 is independent of w1,
                                  so conditioning on w1 (or ¬w1) changes nothing
     = n2 / N

Under the alternative hypothesis H1:

p10  = P(w2|w1)  

        P(w2 ∩ w1)
     = -------------          ... by definition of conditional
           P(w1)                  probability

         n12 / N
     = -------------
          n1 / N

     = n12 / n1

p11  = P(w2|¬w1)

        P(w2 ∩ ¬w1)
     = -------------          ... by definition of conditional
           P(¬w1)                 probability

         (n2 - n12) / N
     = ------------------
         (N - n1) / N

     = (n2 - n12) / (N - n1)
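
To make the four probabilities concrete, here is a small sketch that computes them for some made-up counts (the numbers are purely illustrative, they are not from my corpus):

// Illustrative only: hypothetical counts, not taken from any real corpus.
public class ObservedProbabilities {
  public static void main(String[] args) {
    long N = 100000, n1 = 250, n2 = 300, n12 = 40;
    double p00 = (double) n2 / N;                 // H0: P(w2|w1)
    double p01 = p00;                             // H0: P(w2|~w1)
    double p10 = (double) n12 / n1;               // H1: P(w2|w1)
    double p11 = (double) (n2 - n12) / (N - n1);  // H1: P(w2|~w1)
    System.out.printf("p00=p01=%.5f p10=%.5f p11=%.5f%n", p00, p10, p11);
  }
}

For these counts, p10 = 0.16 is much larger than p00 = 0.003, which is exactly the kind of gap the likelihood ratio below is designed to quantify.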

We then use the Binomial distribution to calculate the expected probabilities from the observed probabilities:

   | E(w2|w1)            | E(w2|¬w1)
---+---------------------+-------------------------
H0 | b(n12; n2, p00)     | b((n12-n2); (N-n1), p01)
H1 | b(n12; n2, p10)     | b((n12-n2); (N-n1), p11)
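
The b(k; n, p) terms above are plain binomial probabilities, b(k; n, p) = C(n,k) * p^k * (1-p)^(n-k). Because the counts involved can be large, it is safer to evaluate them in log space; the helper below (not part of jtmt, shown only to make the notation concrete) uses the log-gamma function from Commons Math to compute log b(k; n, p):

// Sketch of a log-space binomial probability, valid for 0 <= k <= n.
// Not part of jtmt; uses Gamma.logGamma from Commons Math.
import org.apache.commons.math.special.Gamma;

public class LogBinomial {
  /** Returns log b(k; n, p) = log C(n,k) + k*log(p) + (n-k)*log(1-p). */
  public static double logB(double k, double n, double p) {
    double logCoeff = Gamma.logGamma(n + 1) 
      - Gamma.logGamma(k + 1) - Gamma.logGamma(n - k + 1);
    return logCoeff + k * Math.log(p) + (n - k) * Math.log(1 - p);
  }
}

In the ratio we compute next, the C(n,k) terms cancel out, which is why the filter code further down only needs the p and (1-p) factors.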

We are now able to calculate the likelihood of each hypothesis, H0 and H1. Here the likelihood is defined as the ratio of the probability of a true positive to the probability of a false positive.

L(H0)    = E(w2|w1) / E(w2|¬w1)    ... by definition of likelihood

               b(n12; n2, p00)
         = --------------------------
            b((n12-n2); (N-n1), p01)

             C(n2,n12) * p00^n12 * (1 - p00)^(n2-n12)
         = ---------------------------------------------------------
            C(N-n1,n12-n2) * p01^(n12-n2) * (1 - p01)^(N-n1-n12+n2)


L(H1)    = E(w2|w1) / E(w2|¬w1)    ... by definition of likelihood

               b(n12; n2, p10)
         = --------------------------
            b((n12-n2); (N-n1), p11)

             C(n2,n12) * p10^n12 * (1 - p10)^(n2-n12)
         = ---------------------------------------------------------
            C(N-n1,n12-n2) * p11^(n12-n2) * (1 - p11)^(N-n1-n12+n2)

The ratio of the two likelihoods tells us how much more likely the dependence assumption is than the independence assumption. The binomial coefficients cancel out in this ratio.

             p10^n12 * (1-p10)^(n2-n12) * p01^(n12-n2) * (1-p01)^(N-n1-n12+n2)
R(H1/H0) = ---------------------------------------------------------------------
             p00^n12 * (1-p00)^(n2-n12) * p11^(n12-n2) * (1-p11)^(N-n1-n12+n2)

LikelihoodRatioPhraseFilter

The code for the LikelihoodRatioPhraseFilter is shown below. The calculations are based on the formula for R(H1/H0) developed in the last section. Rather than return a yes/no answer, the filter returns the logarithm of R. The client should therefore check whether the logarithm is greater than 0, ie whether the dependence hypothesis H1 (that the words are dependent on each other) is more likely than the independence hypothesis H0 (that the words are independent).

// Source: src/main/java/net/sf/jtmt/phrase/filters/LikelihoodRatioPhraseFilter.java
package net.sf.jtmt.phrase.filters;

/**
 * Uses the ratio of the likelihood of a dependence hypothesis to the
 * likelihood of the independence hypothesis for phrase bigrams to 
 * filter out unlikely phrases. The log of the ratio is returned as 
 * the determinant.
 */
public class LikelihoodRatioPhraseFilter implements IPhraseFilter {

  private long n;    // number of words in the corpus
  private long n1;   // number of occurrences of leading word in corpus
  private long n2;   // number of occurrences of trailing word in corpus
  private long n12;  // number of occurrences of phrase in corpus
  
  public LikelihoodRatioPhraseFilter(long n, long n1, long n2, long n12) {
    this.n = n;
    this.n1 = n1;
    this.n2 = n2;
    this.n12 = n12;
  }
  
  /**
   * Returns the log of the likelihood ratio between the dependence hypothesis
   * and the independence hypothesis. If the value is positive, then the 
   * dependence hypothesis is more likely than the independence hypothesis.
   */
  @Override
  public double getPhraseDeterminant() {
    double p00 = (double) n2 / (double) n;
    double p01 = p00;
    double p10 = (double) n12 / (double) n1;
    double p11 = (double) (n2 - n12) / (double) (n - n1);
    double llr = n12 * (Math.log(p10) - Math.log(p00)) + 
      (n2 - n12) * (Math.log(1 - p10) - Math.log(1 - p00)) + 
      (n12 - n2) * (Math.log(p01) - Math.log(p11)) + 
      (n - n1 - n12 + n2) * (Math.log(1 - p01) - Math.log(1 - p11));
    return llr;
  }
}
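
Here is a quick usage sketch for the filter. The counts are made up for illustration; in the actual Hadoop job the LikelyPhraseMapper supplies them from the corpus.

// Hypothetical usage example; the counts are illustrative only.
import net.sf.jtmt.phrase.filters.IPhraseFilter;
import net.sf.jtmt.phrase.filters.LikelihoodRatioPhraseFilter;

public class LikelihoodRatioPhraseFilterExample {
  public static void main(String[] args) {
    long n = 100000, n1 = 250, n2 = 300, n12 = 40;
    IPhraseFilter filter = new LikelihoodRatioPhraseFilter(n, n1, n2, n12);
    double llr = filter.getPhraseDeterminant();
    // a positive log ratio means H1 (dependence) is more likely than H0
    System.out.println((llr > 0 ? "keep" : "drop") + " (log ratio = " + llr + ")");
  }
}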

The results are similar to those from the previous two approaches: I get 91 phrases from my corpus of three electronic books, and a visual inspection shows that the phrases are quite good. The top 10 are shown below:

sperm whale     409.39943002166274
white whale     205.1643105456127
march hare      200.58942539551933
mr holmes       191.6443128051443
baker street    151.41014908579413
captain ahab    124.85466211360614
lord st         108.17580790802768
aye aye         91.86846403214781
cape horn       74.58676644925315
ivory leg       71.68738725289398
...

Chi-Square Approach

With the Chi-Square Approach, we assume the independence hypothesis H0, and then attempt to prove or disprove it. This is apparently a popular technique applicable in other fields as well. A high level description of this approach is available in the BSA book, and here is a more detailed explanation. Observed frequencies are first recorded in a contingency table.

      | w2       | ¬w2           | Total
------+----------+---------------+---------------------
w1    | n12      | n1 - n12      | n1
¬w1   | n2 - n12 | N - n12       | N + n2 - 2n12
------+----------+---------------+----------------------
Total | n2       | N + n1 - 2n12 | N + n1 + n2 - 2n12

where:
n1  = number of occurrences of w1
n2  = number of occurrences of w2
n12 = number of occurrences of phrase {w1w2}
N   = number of 2 word phrases in our corpus

We then compute a table of expected frequencies from the marginal value of the observed frequencies.

      | w2       | ¬w2          
------+-------------------------
w1    | E00      | E01
¬w1   | E10      | E11

Values of Eij are calculated thus:

Eij = N * Pm(i) * Pm(j)            ... by Multiplicative law of probability

          Nm(i)    Nm(j)
    = N * ----- * -------
            N        N

    = Nm(i) * Nm(j) / N

where Pm(i) = marginal probability of event i
            = P(wi,wj) + P(wi,¬wj)
      Nm(i) = marginal frequency corresponding to Pm(i)
              (ie the row or column total from the table above)
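
As a sanity check, here is a small sketch (with made-up counts) that builds the expected frequency table from the marginal totals, exactly as Eij = Nm(i) * Nm(j) / N prescribes:

// Illustrative only: computes the expected frequencies for the 2x2 table
// above using hypothetical counts.
public class ExpectedFrequencies {
  public static void main(String[] args) {
    long n = 100000, n1 = 250, n2 = 300, n12 = 40;
    double[][] obs = {
      { n12,      n1 - n12 },  // w1 row:  {w1,w2} and {w1,~w2}
      { n2 - n12, n - n12  }   // ~w1 row, laid out as in the table above
    };
    double[] rowTotal = { obs[0][0] + obs[0][1], obs[1][0] + obs[1][1] };
    double[] colTotal = { obs[0][0] + obs[1][0], obs[0][1] + obs[1][1] };
    double total = rowTotal[0] + rowTotal[1];
    for (int i = 0; i < 2; i++) {
      for (int j = 0; j < 2; j++) {
        System.out.printf("E%d%d = %.2f%n", i, j, rowTotal[i] * colTotal[j] / total);
      }
    }
  }
}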

We then calculate the chi-square statistic Χ^2 and the degrees of freedom ν for our sample. In the code we will just use the Commons Math ChiSquareTest and pass in two arrays of observed and expected frequencies, but the formula below is what we would use if we were doing this by hand.

               (Oij - Eij)^2
Χ^2 = Σi Σj -----------------
                   Eij

ν = (r - 1) * (c - 1) = 1           ... where r = number of rows in table (2)
                                              c = number of columns in table (2)

For a given α = 0.05, we find the critical value of a hypothetical "fair" (independent) distribution, ie the point c where P(Χ^2 ≤ c) = 1 - α. We use the inverseCumulativeProbability() method of the chi-square distribution with ν = 1 for this, so 95% of the hypothetical distribution lies below the critical value. If our computed value of Χ^2 exceeds the critical value, we reject the independence hypothesis and conclude that the words {w1,w2} are indeed related as a phrase.
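
For example, with the Commons Math 2.x API used elsewhere in this post, the critical value for ν = 1 and α = 0.05 can be looked up like this (a minimal sketch, not part of the filter code):

// Looks up the upper-tail chi-square critical value for 1 degree of freedom.
import org.apache.commons.math.MathException;
import org.apache.commons.math.distribution.ChiSquaredDistribution;
import org.apache.commons.math.distribution.ChiSquaredDistributionImpl;

public class ChiSquareCriticalValue {
  public static void main(String[] args) throws MathException {
    ChiSquaredDistribution dist = new ChiSquaredDistributionImpl(1.0D);
    // the point c such that P(X <= c) = 0.95; roughly 3.84 for 1 degree of freedom
    double criticalValue = dist.inverseCumulativeProbability(1.0D - 0.05D);
    System.out.println(criticalValue);
  }
}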

The ChiSquare approach is used by LingPipe for phrase extraction, according to the BSA book.

ChiSquarePhraseFilter

The code for the ChiSquarePhraseFilter is shown below. The theory leading to the code is explained above. Essentially, we compute the Χ^2 statistic from the observed and expected frequencies (the latter based on the marginal totals of the observations), and compare it with the critical value at a 0.05 level of significance. If the result is positive, the observed value lies outside the 95% range of values expected under a "fair" distribution, ie one where the words are independent, and hence the bigram can be considered a phrase.

// Source: src/main/java/net/sf/jtmt/phrase/filters/ChiSquarePhraseFilter.java
package net.sf.jtmt.phrase.filters;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.commons.math.MathException;
import org.apache.commons.math.distribution.ChiSquaredDistribution;
import org.apache.commons.math.distribution.ChiSquaredDistributionImpl;
import org.apache.commons.math.linear.Array2DRowRealMatrix;
import org.apache.commons.math.linear.RealMatrix;
import org.apache.commons.math.linear.RealVector;
import org.apache.commons.math.stat.inference.ChiSquareTest;
import org.apache.commons.math.stat.inference.ChiSquareTestImpl;

/**
 * Calculates the Chi-Squared statistic for the expected vs observed 
 * frequencies, and finds if the statistic is higher than the critical 
 * value computed at a 0.05 significance level (ie 95% coverage).
 */
public class ChiSquarePhraseFilter implements IPhraseFilter {

  private final Log log = LogFactory.getLog(getClass());
  
  private static final double ALPHA = 0.05D;
  
  private long n;    // number of 2 word phrases in corpus
  private long n1;   // number of occurrences of leading word in corpus
  private long n2;   // number of occurrences of trailing word in corpus
  private long n12;  // number of occurrences of phrase in corpus

  public ChiSquarePhraseFilter(long n, long n1, long n2, long n12) {
    this.n = n;
    this.n1 = n1;
    this.n2 = n2;
    this.n12 = n12;
  }
  
  @Override
  public double getPhraseDeterminant() {
    // set up contingency table of observed frequencies
    RealMatrix obs = new Array2DRowRealMatrix(2, 2);
    obs.setEntry(0, 0, n12);
    obs.setEntry(0, 1, (n1 - n12));
    obs.setEntry(1, 0, (n2 - n12));
    obs.setEntry(1, 1, (n - n12));
    // compute marginal frequencies (summing the row vectors gives the
    // column totals, summing the column vectors gives the row totals)
    RealVector colTotals = obs.getRowVector(0).add(obs.getRowVector(1));
    RealVector rowTotals = obs.getColumnVector(0).add(obs.getColumnVector(1));
    double total = rowTotals.getL1Norm();
    // flatten contingency table of observed frequencies
    long[] observed = new long[4];
    int k = 0;
    for (int i = 0; i < obs.getRowDimension(); i++) {
      for (int j = 0; j < obs.getColumnDimension(); j++) {
        observed[k++] = (long) obs.getEntry(i, j);
      }
    }
    // compute expected frequencies based on marginal frequencies
    double[] expected = new double[4];
    k = 0;
    for (int i = 0; i < obs.getRowDimension(); i++) {
      for (int j = 0; j < obs.getColumnDimension(); j++) {
        expected[k++] = 
          rowTotals.getEntry(i) * colTotals.getEntry(j) / total;
      }
    }
    // find the test statistic
    ChiSquareTest test = new ChiSquareTestImpl();
    double chiSquare = test.chiSquare(expected, observed);
    // find the critical value: the 2x2 table has (2-1)*(2-1) = 1 degree of freedom
    ChiSquaredDistribution dist = new ChiSquaredDistributionImpl(1.0D);
    double criticalValue = 0.0D;
    try {
      // point where P(X <= criticalValue) = 1 - ALPHA = 0.95
      criticalValue = dist.inverseCumulativeProbability(1.0D - ALPHA);
    } catch (MathException e) {
      log.warn(e);
    }
    // return the difference between the test statistic and critical value
    return (chiSquare - criticalValue);
  }
}
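
A usage sketch mirroring the one for the likelihood ratio filter (again, the counts are made up for illustration):

// Hypothetical usage example; the counts are illustrative only.
import net.sf.jtmt.phrase.filters.ChiSquarePhraseFilter;
import net.sf.jtmt.phrase.filters.IPhraseFilter;

public class ChiSquarePhraseFilterExample {
  public static void main(String[] args) {
    long n = 100000, n1 = 250, n2 = 300, n12 = 40;
    IPhraseFilter filter = new ChiSquarePhraseFilter(n, n1, n2, n12);
    double det = filter.getPhraseDeterminant();
    // positive means the chi-square statistic exceeds the critical value
    System.out.println((det > 0 ? "keep" : "drop") + " (determinant = " + det + ")");
  }
}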

As before, the results look quite believable. 103 phrases are recognized by this algorithm, and the top 10 are shown below. In the full output there is considerable overlap with the likelihood ratio results, even though it does not show up in the top 10 here.

irene adler       1811.5261599783912
march hare        1530.794875612239
um um             1194.2290051976697
project gutenberg 1136.9018182378147
copper beeches    862.8478133287713
don sebastian     789.5537133726627
st simon          788.6790165231548
jabez wilson      727.0858527238903
boscombe valley   708.5084636022881
...

Resources

While figuring out this stuff, I came across a couple of very good online resources, which I would like to share here.

  • An Intuitive Explanation of Bayes' Theorem by Eliezer S. Yudkowsky - As the title suggests, this is a very easy to understand, commonsense explanation of Bayes' theorem. Read the rest of the articles on his site if you have some time; I found them quite interesting, and maybe you will too.

  • AP* Statistics Tutorial - The last time I used any statistics other than the mean and standard deviation was in college, and that was geared toward solving toy problems. This time around, I read through the entire tutorial (it takes about 2-3 hours) with a view to applying it to real-life (data mining) situations. If you are in a similar situation, I highly recommend reading through it. If nothing else, it will help you understand the more statistically oriented data mining algorithms out there.

10 comments (moderated to prevent spam):

Yuval said...

Here is a survey of several methods for finding collocations from Manning and Schutze's "Foundations of Statistical Natural Language Processing". It includes the likelihood ratio and Chi-square methods, and adds some others:

http://nlp.stanford.edu/fsnlp/promo/colloc.pdf

Ashwin Jayaprakash said...

Hi Sujit, I follow your posts with great interest. Thanks for making all this info available to everyone.

Cheers!
Ashwin.

Sujit Pal said...

@Yuval: thanks very much for this info, and also for the info about the Lucene and Hadoop meetups at ApacheCon. Attended both and learned quite a lot from both.

@Ashwin: thanks, and you are welcome.

Unknown said...

You rock man, you are an inspiration....I was always afraid of converting mathematical formulas into code..your detailed postings helped me a lot...I even bookmarked your blog and read it regularly!!!!

Have you ever dabbled with unsupervised text/document classification with LDA or pLSI ?

Ravi

Sujit Pal said...

Thanks Ravi, I am just learning this stuff myself, I write it down so I can refer back to them later...something like my own internet-enabled cheat sheet :-).

I have played with LSI but I found it to be too memory intensive for any real work...but I do plan to revisit it via some data mining libraries in the future.

Anonymous said...

Well, to save you time, LSI and LDA sucks when it comes to phrase extraction, especially LDA (As you should look for a sampling method to approximate it, which is BTW memory expensive).

In this context i would like to ask if it possible to use more than 1.5 GB RAM on JVM ? I have 4 GB ram, but unfortunatly because JVM i couldn't extend the maximum memory limit to more than 1.5 GB RAM. IF anyone have an idea please post ;-)

Unknown said...

Hi Salmon.
I can't understand usage of chi square test, can you explain few moments ?
1. In all chi square descriptions: chi square test reject H0 hypothesys if result X^2 value more then critical, in other words this mean that difference between observed result and expected distribution is very great. But in you description you check X^2 > critial, why?
2. You receive very great numbers in test and these values is very depends on count of phrases in corpus: sample n1=5, n2=5, n12=1: for N=1000000 give X^2 ~ 39998, for N = 1000 give ~ 38, how interpret this results? I think that dependency on N isn't correct.

Sujit Pal said...

Answers to your questions, to the best of my (perhaps limited) understanding...

1. In our case H0 = words (w1,w2) are independent. So to determine if (w1,w2) are likely to be together in a phrase, we want to know if they are dependent, ie, we have to disprove H0. Hence the rejection test.
2. I think the dependency on N (total number of words in text being analyzed) is correct. So the larger the value of N, the higher must be the value of n12 for H0 to be rejected.

The algorithm is not mine (in the post I have a link to a page which shows its use in an engineering context), so chances of it being wrong is low :-).

Srijith said...

Hi..Sujit..Great piece of code..
I have developed a program to create Bigrams using the Likelihood ratio.How can i specify a particular threshold or how can i determine the most appropriate bigrams for my process??

Sujit Pal said...

Thanks Srijith. I don't know of any way other than to eyeball the results and select a cutoff. While this involves a manual step, its less work than finding the bigrams manually. There may be other better ways though.