Sunday, November 30, 2008

IR Math in Java : Citation based Ranking

If you are a regular reader, you know that I have been working my way through Dr Manu Konchady's TMAP book in an effort to teach myself some Information Retrieval theory. This week, I talk about my experience implementing Google's PageRank algorithm in Java, as described in Chapter 6 of the book and on the PageRank Wikipedia page. In the process, I also ended up developing a sparse matrix implementation in order to compute PageRank for real data collections, which I contributed back to the commons-math project.

The PageRank algorithm was originally proposed by Google's founders, and while it does form part of the core of what SEO types refer to as The Google Algorithm, the Algorithm is significantly more comprehensive and complex. My intent is not to reverse engineer this stuff, nor to hack it. I think the algorithm is interesting, and thought it would be worth figuring out how to code this up in Java.

The PageRank algorithm is based on the citation model (hence the title of this post), i.e., if a scholarly paper is considered to be of interest, other scholarly papers cite it as a reference. Similarly, a page with good information gets linked to by other pages on the web. The PageRank of a page is the sum of the normalized PageRanks of the pages that point to it. If a page links out to more than one page, its contribution to each target page's PageRank is its own PageRank divided by the number of pages it links out to. Obviously, this is a chicken-and-egg definition (a page's rank depends on ranks that in turn depend on it), so it has to be solved iteratively.

In addition, there is a damping factor d to simulate a random surfer, who clicks on links but eventually gets bored, does a new search and starts over. To compensate for the damping factor, a constant term c is added to the PageRank formula. The formula is thus:

  rj = c + (d * Σ ri / ni)
  where:
    rj = PageRank for page j
    d = damping factor, usually 0.85
    c = (1 - d) / N
    ri = PageRank for page i which points to page j
    ni = Number of outbound links from page i
    N = number of documents in the collection
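
To make this concrete, here is the formula applied to a small made-up example (the numbers are invented purely for illustration): suppose N = 4, d = 0.85, and page j is pointed to by two pages, each currently ranked 0.25, one with 2 outbound links and the other with 4. Then:

  c  = (1 - 0.85) / 4 = 0.0375
  rj = 0.0375 + 0.85 * (0.25/2 + 0.25/4)
     = 0.0375 + 0.85 * 0.1875
     ≈ 0.197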

Written out for every page, this is a set of linear equations, and could thus be rewritten as a recursive matrix equation. As much as I would like to say that I arrived at this epiphany all by myself, I really just worked backwards from the formula on the Wikipedia page.

  R = C + d * A * R0
  where:
    R  = a column vector of size N containing the ranks of the pages in the collection.
    C  = a column vector of size N with every entry set to the constant c = (1 - d) / N.
    d  = the scalar damping factor.
    A  = an NxN square matrix whose (i,j) entry is the initial probability 1/N
         if page(i) links to page(j), and 0 otherwise.
    R0 = the previous estimate of the page ranks, initially all set to 1/N.
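
Just to make the matrix form concrete before the real implementation, here is a minimal throwaway sketch of a single iteration, using the dense RealMatrixImpl that ships with commons-math. The 3-page graph (1 links to 2 and 3, 2 links to 3, 3 links to 1) and the class name ToyPageRank are made up for illustration; the production code below uses a sparse matrix instead.

// Toy, hand-built illustration of one step of R = C + d * A * R0
// for a made-up 3-page graph (names here are illustrative only).
import org.apache.commons.math.linear.RealMatrix;
import org.apache.commons.math.linear.RealMatrixImpl;

public class ToyPageRank {
  public static void main(String[] args) {
    int n = 3;
    double d = 0.85D;
    // a(i,j) = 1/n where page(i+1) links to page(j+1), 0 otherwise
    RealMatrix a = new RealMatrixImpl(new double[][] {
      {0.0D,     1.0D / n, 1.0D / n},  // page 1 -> 2, 3
      {0.0D,     0.0D,     1.0D / n},  // page 2 -> 3
      {1.0D / n, 0.0D,     0.0D}       // page 3 -> 1
    });
    // constant vector C, every entry (1 - d) / n
    RealMatrix c = new RealMatrixImpl(new double[][] {
      {(1.0D - d) / n}, {(1.0D - d) / n}, {(1.0D - d) / n}
    });
    // initial guess R0, every entry 1/n
    RealMatrix r0 = new RealMatrixImpl(new double[][] {
      {1.0D / n}, {1.0D / n}, {1.0D / n}
    });
    // one iteration of R = C + d * A * R0
    RealMatrix r = c.add(a.scalarMultiply(d).multiply(r0));
    System.out.println("after one iteration: " + r);
  }
}

Repeating the last two steps with r copied back into r0 is all the full implementation does, plus a convergence check.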

We populate the matrices on the right hand side, then compute R. At each iteration we check for convergence, i.e., whether R is close enough to the previous estimate R0. If not, we copy R into R0 and recompute. Here is the code to do this:

// Source: src/main/java/com/mycompany/myapp/ranking/PageRanker.java
package com.mycompany.myapp.ranking;

import java.util.List;
import java.util.Map;

import org.apache.commons.lang.StringUtils;
import org.apache.commons.math.linear.RealMatrix;
import org.apache.commons.math.linear.SparseRealMatrixImpl;

public class PageRanker {

  private Map<String,Boolean> linkMap;
  private double d;
  private double threshold;
  private List<String> docIds;
  private int numDocs;
  
  public void setLinkMap(Map<String,Boolean> linkMap) {
    this.linkMap = linkMap;
  }
  
  public void setDocIds(List<String> docIds) {
    this.docIds = docIds;
    this.numDocs = docIds.size();
  }
  
  public void setDampingFactor(double dampingFactor) {
    this.d = dampingFactor;
  }
  
  public void setConvergenceThreshold(double threshold) {
    this.threshold = threshold;
  }
  
  public RealMatrix rank() throws Exception {
    // create and initialize the probability matrix: p(i,j) is 1/n if
    // there is a link from page i to page j, and 0 otherwise
    RealMatrix a = new SparseRealMatrixImpl(numDocs, numDocs);
    for (int i = 0; i < numDocs; i++) {
      for (int j = 0; j < numDocs; j++) {
        String key = StringUtils.join(new String[] {
          docIds.get(i), docIds.get(j)
        }, ",");
        if (linkMap.containsKey(key)) {
          a.setEntry(i, j, 1.0D / numDocs);
        }
      }
    }
    // create and initialize the constant matrix
    RealMatrix c = new SparseRealMatrixImpl(numDocs, 1);
    for (int i = 0; i < numDocs; i++) {
      c.setEntry(i, 0, ((1.0D - d) / numDocs));
    }
    // create and initialize the rank matrix
    RealMatrix r0 = new SparseRealMatrixImpl(numDocs, 1);
    for (int i = 0; i < numDocs; i++) {
      r0.setEntry(i, 0, (1.0D / numDocs));
    }
    // solve for the pagerank matrix r by iterating r = c + d * a * r0
    // until successive estimates stop changing appreciably
    RealMatrix r;
    for (;;) {
      r = c.add(a.scalarMultiply(d).multiply(r0));
      // check for convergence against the previous estimate
      if (r.subtract(r0).getNorm() < threshold) {
        break;
      }
      r0 = r.copy();
    }
    return r;
  }
}

Here is the JUnit code to call the class. We set the damping factor and the convergence threshold, and use the link graph pictured in the Wikipedia PageRank article as our initial dataset. The dataset is represented as comma-delimited pairs of linked page ids, one pair per line.

  @Test
  public void testRankWithToyData() throws Exception {
    Map<String,Boolean> linkMap = getLinkMapFromDatafile(
      "src/test/resources/pagerank_links.txt");
    PageRanker ranker = new PageRanker();
    ranker.setLinkMap(linkMap);
    ranker.setDocIds(Arrays.asList(new String[] {
      "1", "2", "3", "4", "5", "6", "7"
    }));
    ranker.setDampingFactor(0.85D);
    ranker.setConvergenceThreshold(0.001D);
    RealMatrix pageranks = ranker.rank();
    log.debug("pageRank=" + pageranks.toString());
  }

  private Map<String,Boolean> getLinkMapFromDatafile(String filename) 
      throws Exception {
    Map<String,Boolean> linkMap = new HashMap<String,Boolean>();
    BufferedReader reader = new BufferedReader(new FileReader(filename));
    try {
      String line;
      while ((line = reader.readLine()) != null) {
        if (StringUtils.isEmpty(line) || line.startsWith("#")) {
          continue;
        }
        // each non-comment line is a comma-delimited "fromId,toId" pair,
        // which is exactly the key format PageRanker builds internally
        linkMap.put(line.trim(), Boolean.TRUE);
      }
    } finally {
      reader.close();
    }
    return linkMap;
  }
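
For reference, the link file is just one comma-delimited pair of page ids per line, with '#' marking comment lines. The lines below are a hypothetical sample showing the format only, not the actual test data:

# pagerank_links.txt (hypothetical sample, format only)
# fromId,toId
1,2
2,3
3,1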

You may have noticed that I am using calls to SparseRealMatrixImpl, a class which does not exist in the commons-math codebase at the time of this writing. The reason I implemented SparseRealMatrixImpl is that when I tried to run the algorithm against a real interlinked data collection of 6000+ documents, I would consistently get an OutOfMemoryError with the code that used RealMatrixImpl (which uses a two-dimensional double array as its backing store).
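
A rough back-of-the-envelope estimate (assuming roughly 6000 documents and 8 bytes per double) shows why:

  6000 * 6000 entries * 8 bytes/entry ≈ 288 MB

and that is before counting the intermediate matrices created during each iteration.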

The SparseRealMatrixImpl subclasses RealMatrixImpl, but uses a Map<Point,Double> as its backing store. Point is a simple private data-holder class (essentially a struct) that encapsulates the row and column indices of an element. Only non-zero matrix elements are actually stored in the map. This works out well because the largest matrix (A) consists mostly of zeros, i.e., comparatively few pages actually link to each other. The patch is available here.
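
To give a feel for the approach, here is a standalone sketch of the storage idea only, not the actual patch; the class name SparseStoreSketch and its methods are made up for illustration.

// Standalone sketch of a map-backed sparse store; names are illustrative only.
import java.util.HashMap;
import java.util.Map;

public class SparseStoreSketch {

  // simple struct-like holder for a (row, column) pair, used as the map key
  private static class Point {
    private final int row;
    private final int col;

    Point(int row, int col) {
      this.row = row;
      this.col = col;
    }

    @Override
    public boolean equals(Object obj) {
      if (!(obj instanceof Point)) {
        return false;
      }
      Point other = (Point) obj;
      return row == other.row && col == other.col;
    }

    @Override
    public int hashCode() {
      return 31 * row + col;
    }
  }

  // only non-zero entries are ever stored in this map
  private final Map<Point,Double> entries = new HashMap<Point,Double>();

  public double getEntry(int row, int col) {
    Double value = entries.get(new Point(row, col));
    return (value == null) ? 0.0D : value.doubleValue();
  }

  public void setEntry(int row, int col, double value) {
    if (value == 0.0D) {
      // keep the map free of explicit zeros
      entries.remove(new Point(row, col));
    } else {
      entries.put(new Point(row, col), Double.valueOf(value));
    }
  }
}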

Update 2009-04-26: In recent posts, I have been building on code written and described in previous posts, so there were (and rightly so) quite a few requests for the code. So I've created a project on Sourceforge to host the code. You will find the complete source code built so far in the project's SVN repository.

6 comments (moderated to prevent spam):

shilpa said...

hi sujit,
could you please tell me if there is any lucene class that implements document ranking? if so will i be able to get the source code? i need it urgently..

Sujit Pal said...

Hi Shilpa, I take it you want to implement a custom document ranking? If so, you may want to look at Similarity and extend it. Lucene is an open source project, so you should be able to download the source for it from the lucene site.

Ved said...

Hello Sir,
I was working on page rank from so many days , and i was about to give up . Suddenly i found you on web and Your contribution helped me a lot . You saved my grade and my future !!!
Thank you very very much .

Sujit Pal said...

Hi Ved, thanks for the kind words, my contribution here is really the java code based on Dr Konchady's TMAP book.

Anonymous said...

Hello sir..I need application that will read bodies of text and scan for things like:
Verb counts
Noun counts
Nouns that are food related
Food slang or popular foreign dish names
Adverb counts
Adjective counts
Certain punctuation marks
Number of words
Average word length
Sentiment – Positive words vs. negative words
Person – first person or second person or third person words
Tense – present past or future tense word usage
Numbers of bodies of text
#Did u have any idea relating on the above mention tasks. I will be very thankful, if u share me some idea about these.

Sujit Pal said...

Hi Puru, you may want to check out some NLP libraries like Lingpipe (free for non-commercial use) or OpenNLP or NLTK (if you like Python) - I believe all of them have tools that detect noun, verb, adjectives, noun phrases, etc. For food dishes, not sure but you can probably check out foodnetwork.com, I believe their pages are very richly annotated, so you can mine that for information. For sentiment analysis, you probably want to really understand the domain in which you are going to work in - positive/negative terms vary greatly across domains.