If you are a regular reader, you know that I have been working my way through Dr Manu Konchady's TMAP book in an effort to teach myself some Information Retrieval theory. This week, I talk about my experience implementing Google's PageRank algorithm in Java, as described in Chapter 6 of this book and the PageRank Wikipedia page. In the process, I also ended up developing a Sparse Matrix implementation in order to compute PageRank for real data collections, which I contributed back to the commons-math project.

The PageRank algorithm was originally proposed by Google's founders, and while it does form part of the core of what SEO types refer to as The Google Algorithm, the full Algorithm is significantly more comprehensive and complex. My intent is not to reverse engineer this stuff, nor to hack it. I simply think the algorithm is interesting, and thought it would be worth figuring out how to code it up in Java.

The PageRank algorithm is based on the citation model (hence the title of this post), i.e., if a scholarly paper is considered to be of interest, other scholarly papers cite it as a reference. Similarly, a page with good information is linked to by other pages on the web. The PageRank of a page is the sum of the normalized PageRanks of the pages that point to it. If a page links out to more than one page, its contribution to each target page's PageRank is its own PageRank divided by the number of pages it links out to. Obviously, this is a chicken-and-egg problem, so it needs to be solved iteratively.

In addition, there is a damping factor d to simulate a random surfer, who clicks on links but eventually gets bored, does a new search, and starts over. To compensate for the damping factor, a constant term c is added to the PageRank formula. The formula is thus:

    r_j = c + d * Σ (r_i / n_i)

where:

    r_j = PageRank of page j
    d   = damping factor, usually 0.85
    c   = (1 - d) / N
    r_i = PageRank of page i, which links to page j
    n_i = number of outbound links from page i
    N   = number of documents in the collection
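To make the formula concrete, here is a small sketch in plain Java (the graph, class, and method names are my own invention, purely for illustration, and are not from the book or from the code later in this post). It applies one round of the update to a toy three-page graph:

```java
// Toy illustration of the scalar PageRank formula (names and graph are
// mine, purely for illustration).
public class ToyPageRank {

    // One application of r_j = c + d * sum(r_i / n_i) over inbound links.
    static double[] update(double[] r, int[][] inlinks, int[] outdegree,
            double d) {
        int n = r.length;
        double c = (1.0 - d) / n;
        double[] next = new double[n];
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int i : inlinks[j]) {       // pages i that link to page j
                sum += r[i] / outdegree[i];  // normalized contribution
            }
            next[j] = c + d * sum;
        }
        return next;
    }

    public static void main(String[] args) {
        // graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0
        int[][] inlinks = { {2}, {0}, {0, 1} };  // inlinks[j] = sources of j
        int[] outdegree = { 2, 1, 1 };
        double[] r = { 1.0 / 3, 1.0 / 3, 1.0 / 3 };  // initial equal guess
        r = update(r, inlinks, outdegree, 0.85);
        System.out.printf("r = [%.4f, %.4f, %.4f]%n", r[0], r[1], r[2]);
    }
}
```

For this toy graph, a single update from equal starting ranks already gives page 2 the highest rank, since both other pages link to it.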

This translates to a set of linear equations, and can thus be rewritten as a recursive matrix equation. As much as I would like to say that I arrived at this epiphany all by myself, I really just worked backwards from the formula on the Wikipedia page.

    R = C + d * A * R_0

where:

    R   = a column vector of size N, containing the ranks of the pages in the collection
    C   = a constant column vector containing [c_i]
    d   = scalar damping factor
    A   = an N x N square matrix containing the initial probability 1/N at each (i,j) where page i links to page j, and 0 at all other (i,j)
    R_0 = the initial guess for the page ranks, all set to 1/N
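As a minimal sketch of this recurrence (my own illustrative code, using plain 2-D arrays instead of a matrix library, and taking the link matrix A as given), the iteration looks like this:

```java
import java.util.Arrays;

// Sketch of the matrix recurrence R = C + d * A * R0 using plain 2-D
// arrays (my own illustration, not the commons-math based code).
public class MatrixPageRank {

    // Multiply the N x N matrix a by the column vector r.
    static double[] multiply(double[][] a, double[] r) {
        double[] out = new double[r.length];
        for (int i = 0; i < r.length; i++) {
            for (int j = 0; j < r.length; j++) {
                out[i] += a[i][j] * r[j];
            }
        }
        return out;
    }

    // Iterate R = C + d * A * R0 until the L1 change drops below threshold.
    static double[] rank(double[][] a, double d, double threshold) {
        int n = a.length;
        double c = (1.0 - d) / n;
        double[] r0 = new double[n];
        Arrays.fill(r0, 1.0 / n);    // initial guess R0
        for (;;) {
            double[] ar = multiply(a, r0);
            double[] r = new double[n];
            double delta = 0.0;
            for (int i = 0; i < n; i++) {
                r[i] = c + d * ar[i];
                delta += Math.abs(r[i] - r0[i]);
            }
            if (delta < threshold) {
                return r;            // converged
            }
            r0 = r;                  // recompute from the new estimate
        }
    }

    public static void main(String[] args) {
        // Toy graph 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0, built here with the
        // usual 1/n_i link normalization from the Wikipedia article
        // (a[j][i] = 1/outdegree(i) if page i links to page j).
        double[][] a = {
            { 0.0, 0.0, 1.0 },
            { 0.5, 0.0, 0.0 },
            { 0.5, 1.0, 0.0 }
        };
        double[] r = rank(a, 0.85, 1e-9);
        System.out.println(Arrays.toString(r));
    }
}
```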

We populate the matrices on the right-hand side, then compute R. At each stage we check for convergence, i.e., whether R is close enough to its previous value R_0. If not, we set R_0 from R and recompute. Here is the code to do this:

```
// Source: src/main/java/com/mycompany/myapp/ranking/PageRanker.java
package com.mycompany.myapp.ranking;

import java.util.List;
import java.util.Map;

import org.apache.commons.lang.StringUtils;
import org.apache.commons.math.linear.RealMatrix;
import org.apache.commons.math.linear.SparseRealMatrixImpl;

public class PageRanker {

  private Map<String,Boolean> linkMap;
  private double d;
  private double threshold;
  private List<String> docIds;
  private int numDocs;

  public void setLinkMap(Map<String,Boolean> linkMap) {
    this.linkMap = linkMap;
  }

  public void setDocIds(List<String> docIds) {
    this.docIds = docIds;
    this.numDocs = docIds.size();
  }

  public void setDampingFactor(double dampingFactor) {
    this.d = dampingFactor;
  }

  public void setConvergenceThreshold(double threshold) {
    this.threshold = threshold;
  }

  public RealMatrix rank() throws Exception {
    // create and initialize the probability matrix: the entry (i,j) is
    // 1/n if there is a link from page i to page j, and 0 otherwise
    RealMatrix a = new SparseRealMatrixImpl(numDocs, numDocs);
    for (int i = 0; i < numDocs; i++) {
      for (int j = 0; j < numDocs; j++) {
        String key = StringUtils.join(new String[] {
          docIds.get(i), docIds.get(j)
        }, ",");
        if (linkMap.containsKey(key)) {
          a.setEntry(i, j, 1.0D / numDocs);
        }
      }
    }
    // create and initialize the constant matrix
    RealMatrix c = new SparseRealMatrixImpl(numDocs, 1);
    for (int i = 0; i < numDocs; i++) {
      c.setEntry(i, 0, ((1.0D - d) / numDocs));
    }
    // create and initialize the rank matrix
    RealMatrix r0 = new SparseRealMatrixImpl(numDocs, 1);
    for (int i = 0; i < numDocs; i++) {
      r0.setEntry(i, 0, (1.0D / numDocs));
    }
    // solve for the pagerank matrix r
    RealMatrix r;
    for (;;) {
      r = c.add(a.scalarMultiply(d).multiply(r0));
      // check for convergence
      if (r.subtract(r0).getNorm() < threshold) {
        break;
      }
      r0 = r.copy();
    }
    return r;
  }
}
```

Here is the JUnit code that calls the class. We set the damping factor and the convergence threshold, and use the example graph from the Wikipedia PageRank article as our initial dataset. The dataset is represented as comma-delimited pairs of linked page IDs, one pair per line.

```
@Test
public void testRankWithToyData() throws Exception {
  Map<String,Boolean> linkMap = getLinkMapFromDatafile(
    "src/test/resources/pagerank_links.txt");
  PageRanker ranker = new PageRanker();
  ranker.setLinkMap(linkMap);
  ranker.setDocIds(Arrays.asList(new String[] {
    "1", "2", "3", "4", "5", "6", "7"
  }));
  ranker.setDampingFactor(0.85D);
  ranker.setConvergenceThreshold(0.001D);
  RealMatrix pageranks = ranker.rank();
  log.debug("pageRank=" + pageranks.toString());
}

private Map<String,Boolean> getLinkMapFromDatafile(String filename)
    throws Exception {
  Map<String,Boolean> linkMap = new HashMap<String,Boolean>();
  BufferedReader reader = new BufferedReader(new FileReader(filename));
  try {
    String line;
    while ((line = reader.readLine()) != null) {
      if (StringUtils.isEmpty(line) || line.startsWith("#")) {
        continue;
      }
      // the first tab-separated field holds the comma-delimited
      // source,target pair, which becomes the lookup key
      String[] pairs = StringUtils.split(line, "\t");
      linkMap.put(pairs[0], Boolean.TRUE);
    }
  } finally {
    reader.close();
  }
  return linkMap;
}
```

You may have noticed that I am making calls to SparseRealMatrixImpl, which does not exist in the commons-math codebase at the time of this writing. The reason I implemented SparseRealMatrixImpl is that when I tried to run the algorithm against a real interlinked data collection of about 6000+ documents, I would consistently get an OutOfMemoryError with code that used RealMatrixImpl (which uses a two-dimensional double array as its backing store).

SparseRealMatrixImpl subclasses RealMatrixImpl, but uses a Map<Point,Double> as its backing store. Point is a simple struct-type private data-holder class that encapsulates the row and column numbers of a data element. Only non-zero matrix elements are actually stored in the Map. This works out because the largest matrix (A) contains mostly zeros, i.e., comparatively few pages are actually linked. The patch is available here.
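To give a rough idea of the approach, here is a simplified sketch of my own (not the actual patch submitted to commons-math, and without the RealMatrixImpl inheritance) showing what a map-backed store looks like:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of a map-backed sparse matrix (illustration only,
// not the actual commons-math patch).
public class SparseMatrixSketch {

  // Immutable (row, col) key; plays the role of the Point class.
  private static final class Point {
    final int row, col;
    Point(int row, int col) { this.row = row; this.col = col; }
    @Override public boolean equals(Object o) {
      if (!(o instanceof Point)) return false;
      Point p = (Point) o;
      return row == p.row && col == p.col;
    }
    @Override public int hashCode() { return 31 * row + col; }
  }

  private final Map<Point,Double> entries = new HashMap<Point,Double>();

  // Only non-zero values are stored; zeros are implicit.
  public void setEntry(int row, int col, double value) {
    if (value == 0.0D) {
      entries.remove(new Point(row, col));
    } else {
      entries.put(new Point(row, col), Double.valueOf(value));
    }
  }

  public double getEntry(int row, int col) {
    Double value = entries.get(new Point(row, col));
    return (value == null) ? 0.0D : value.doubleValue();
  }
}
```

Since the map only grows with the number of links, memory usage is proportional to the number of non-zero entries rather than N squared.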

**Update 2009-04-26:** In recent posts, I have been building on code written and described in previous posts, so there were (and rightly so) quite a few requests for the code. So I've created a project on Sourceforge to host the code. You will find the complete source code built so far in the project's SVN repository.