Saturday, June 14, 2008

Web Page Summarizer using Jericho

Recently, I needed to build a component that, given a URL, would try to extract a summary of the page. One of the first things I do when I need to build something I don't know much about is to check whether other people have built similar things and have been kind enough to post their code or create a project I can reuse. While a lot of people may consider this mildly unethical, I think it is good practice because you get to know what's available and go down the paths that have the highest probability of success (based on the theory that if something didn't work out, people won't post articles and blogs about it). I also credit my sources in my code, and post code (admittedly of dubious value) myself in the hope that it may help someone else.

One of the first results for 'Web Page Summarizer' from Google is Aaron Weiss's article from the Web Developer's Virtual Library (WDVL). It's written in Perl and depends on modules developed earlier in the series.

I work mostly in Java, so I needed to do the same thing in Java. I have been looking at the Jericho HTML Parser, and this appeared to be quite a good use case for it. In this post, I replicate the functionality of Aaron Weiss's Web Page Summarizer in Java. Here is the code:

// WebPageSummarizer.java
package com.mycompany.myapp.utils;

import java.net.URL;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.commons.httpclient.NameValuePair;

import au.id.jericho.lib.html.CharacterReference;
import au.id.jericho.lib.html.Element;
import au.id.jericho.lib.html.HTMLElementName;
import au.id.jericho.lib.html.Source;
import au.id.jericho.lib.html.StartTag;

public class WebPageSummarizer {

  /**
   * Return a Map of extracted attributes from the web page identified by url.
   * @param url the url of the web page to summarize.
   * @return a Map of extracted attributes and their values.
   */
  public Map<String,Object> summarize(String url) throws Exception {
    Map<String,Object> summary = new HashMap<String,Object>();
    Source source = new Source(new URL(url));
    source.fullSequentialParse();
    summary.put("title", getTitle(source));
    summary.put("description", getMetaValue(source, "description"));
    summary.put("keywords", getMetaValue(source, "keywords"));
    summary.put("images", getElementText(source, HTMLElementName.IMG, "src", "alt"));
    summary.put("links", getElementText(source, HTMLElementName.A, "href"));
    return summary;
  }

  public String getTitle(Source source) {
    Element titleElement = source.findNextElement(0, HTMLElementName.TITLE);
    if (titleElement == null) {
      return null;
    }
    // TITLE element never contains other tags so just decode it collapsing whitespace:
    return CharacterReference.decodeCollapseWhiteSpace(titleElement.getContent());
  }
  
  private String getMetaValue(Source source, String key) {
    for (int pos = 0; pos < source.length(); ) {
      StartTag startTag = source.findNextStartTag(pos, "name", key, false);
      if (startTag == null) {
        return null;
      }
      if (startTag.getName() == HTMLElementName.META) {
        String metaValue = startTag.getAttributeValue("content");
        if (metaValue != null) {
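          // LcppStringUtils is an internal helper class; removeLineBreaks() just removes "\r\n" from the string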
          metaValue = LcppStringUtils.removeLineBreaks(metaValue);
        }
        return metaValue;
      }
      pos = startTag.getEnd();
    }
    return null;
  }

  private List<NameValuePair> getElementText(Source source, String tagName, 
      String urlAttribute) {
    return getElementText(source, tagName, urlAttribute, null);
  }

  @SuppressWarnings("unchecked")
  private List<NameValuePair> getElementText(Source source, String tagName, 
      String urlAttribute, String srcAttribute) {
    List<NameValuePair> pairs = new ArrayList<NameValuePair>();
    List<Element> elements = source.findAllElements(tagName);
    for (Element element : elements) {
      String url = element.getAttributeValue(urlAttribute);
      if (url == null) {
        continue;
      }
      // A element can contain other tags so need to extract the text from it:
      String label = element.getContent().getTextExtractor().toString();
      if (label == null) {
        // if text content is not available, get info from the srcAttribute
        label = element.getAttributeValue(srcAttribute);
      }
      // if still null, replace label with the url
      if (label == null) {
        label = url;
      }
      pairs.add(new NameValuePair(label, url));
    }
    return pairs;
  }
}

As you can see, it's all quite simple. Probably too simple, since there is a lot more I need to do to make this robust enough for general use. If you look at the Jericho site, you will find examples that cover everything I have done above, as well as some other things. However, using Jericho makes it easy to extend the code to cover corner cases, since I no longer have to rely on the messy regular expression matching strategies I had used until now to parse HTML.
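
One such corner case is that the href and src values extracted above are usually relative URLs. Here is a minimal sketch (the class name is mine, and the page URL would need to be passed down from summarize(), which the code above does not do yet) of how they could be resolved against the page URL before being stored:

// UrlResolver.java (sketch, not part of the summarizer above)
package com.mycompany.myapp.utils;

import java.net.MalformedURLException;
import java.net.URL;

public class UrlResolver {

  /** Resolve a possibly relative href/src value against the base (page) URL. */
  public static String resolve(URL baseUrl, String href) {
    try {
      return new URL(baseUrl, href).toExternalForm();
    } catch (MalformedURLException e) {
      // if the value cannot be parsed, fall back to the raw attribute value
      return href;
    }
  }
}

Since java.net.URL already knows how to resolve a spec against a context URL, the extra code comes down to a single try/catch.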

Here is the test case, which hits the page that inspired this summarizer.

// WebPageSummarizerTest.java
package com.mycompany.myapp.utils;

import java.util.List;
import java.util.Map;

import org.apache.commons.httpclient.NameValuePair;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.junit.Test;

public class WebPageSummarizerTest {

  private final Log log = LogFactory.getLog(getClass());
  
  private static final String[] TEST_URLS = {
    "http://www.wdvl.com/Authoring/Languages/Perl/PerlfortheWeb/summarizer.html",
  };
  
  @Test
  public void testGetSummary() throws Exception {
    WebPageSummarizer summarizer = new WebPageSummarizer();
    for (String testUrl : TEST_URLS) {
      System.out.println("==\nSummary for url:" + testUrl);
      Map<String,Object> summaryMap = summarizer.summarize(testUrl);
      for (String tag : summaryMap.keySet()) {
        Object value = summaryMap.get(tag);
        if (value == null) {
          continue;
        }
        if (value instanceof String) {
          System.out.println(tag + " => " + summaryMap.get(tag));
        } else if (value instanceof List) {
          List<NameValuePair> pairs = (List<NameValuePair>) value;
          System.out.println("#-" + tag + " => " + pairs.size());
        } else {
          log.warn("Unknown value of class:" + value.getClass().getName());
          continue;
        }
      }
    }
  }
}

and the output:

Summary for url: http://www.wdvl.com/Authoring/Languages/Perl/PerlfortheWeb/summarizer.html
#-images => 58
title => WDVL: The Proof is in the Parsing: A Web Page Summarizer
keywords => Perl, PERL, programming, scripting, CGI, LWP, TokeParser
description => The Web Developer's Virtual Library is a resource for web development, including
a JavaScript tutorial, html tag info, JavaScript events, html special characters, paint shop pro, 
database normalization, PHP and more.
#-links => 246

On a slightly different note, I notice that I have been indulging in a bad practice - that of running my large batch programs as JUnit tests. I recently had a new developer inadvertently start some of these tests and wonder what they were doing, after they had cleaned up quite a few database tables :-). In the Ant world, I would use the java target to run my classes through a main() method, but it's only recently that I found out about the Maven2 exec plugin, so I will look at using that instead for my batch programs.
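
For the record, here is a minimal sketch (the runner class below is mine, it does not exist in any project) of what such a main() driver could look like; it could then be run with something like mvn exec:java -Dexec.mainClass=com.mycompany.myapp.utils.WebPageSummarizerRunner -Dexec.args="http://some.url/page.html" instead of being disguised as a JUnit test.

// WebPageSummarizerRunner.java (sketch)
package com.mycompany.myapp.utils;

import java.util.Map;

public class WebPageSummarizerRunner {

  public static void main(String[] args) throws Exception {
    if (args.length == 0) {
      System.err.println("Usage: WebPageSummarizerRunner url [url...]");
      System.exit(1);
    }
    WebPageSummarizer summarizer = new WebPageSummarizer();
    for (String url : args) {
      Map<String,Object> summary = summarizer.summarize(url);
      System.out.println(url + " => " + summary);
    }
  }
}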

10 comments (moderated to prevent spam):

Anonymous said...

I'm sorry if the question I'm asking here is inappropriate in the context of what you have posted, but I don't know where else to ask this and I would be really grateful if you could help me with this.
Is there a way to save the links clicked in the results returned by a search engine to a file/DB (maybe using JavaScript/jQuery)? I need to maintain a search history profile of a user, i.e. the number of pages he has browsed for a particular query.

Sujit Pal said...

Hi Anonymous, your question /is/ OT for this post, but apology accepted :-). Here is how I would do it (with my rather limited knowledge of Javascript)...

Have an onclick javascript handler on the result link that would basically send back the URL and the userId (from the session) or the referer (if sessions are disabled) in an AJAX request to a server side controller. This controller would persist the URL against the userId into the database, then proxy over to the URL and send back the contents.
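
For completeness, here is a rough sketch (all the names below are hypothetical, this is not code from any existing project) of what that server-side controller could look like as a plain servlet; it records the clicked URL against the user and then sends the browser on to the target page (the proxying variant described above would fetch and write back the contents instead of redirecting).

// ClickTrackingServlet.java (sketch)
package com.mycompany.myapp.web;

import java.io.IOException;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.servlet.http.HttpSession;

public class ClickTrackingServlet extends HttpServlet {

  @Override
  protected void doGet(HttpServletRequest request, HttpServletResponse response)
      throws ServletException, IOException {
    String url = request.getParameter("url");
    HttpSession session = request.getSession(false);
    String userId = (session == null) ? null : (String) session.getAttribute("userId");
    if (userId == null) {
      // fall back to the referer if sessions are disabled, as described above
      userId = request.getHeader("Referer");
    }
    saveClick(userId, url);
    // redirect to the clicked result; a proxying controller would fetch the
    // page here and stream its contents back instead
    response.sendRedirect(url);
  }

  private void saveClick(String userId, String url) {
    // placeholder: persist the (userId, url) pair into the search history table
  }
}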

Anonymous said...

Thank you, I'm working on it. Thanks a bunch for accepting the post and answering it. I appreciate it a lot.

Anonymous said...

Hi, I'm currently working on an app for Android mobile phones that outputs a summarized page of your bookmarks. I tried using your code but I got errors with "metaValue = LcppStringUtils.removeLineBreaks(metaValue);" and "pairs.add(new NameValuePair(label, url));", the latter saying it cannot instantiate the type NameValuePair. Please help me out, many thanks!

Sujit Pal said...

Hi, the LcppStringUtils.removeLineBreaks() just removes "\r\n" from the string; you can replace it with something like s.replaceAll("\r\n|\r|\n", " "). Not sure why it will not allow you to instantiate NameValuePair(String, String) - according to the 3.x NameValuePair Javadocs it supports such a constructor call - maybe check whether you have an older/newer version of httpclient.jar; perhaps in that version it is either an abstract class or an interface, in which case you will need to call the constructor of the appropriate implementation/subclass.
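
For anyone hitting the same problem: LcppStringUtils is an internal helper class that is not included in this post, so a stand-in (assuming the behavior described above, i.e. line breaks replaced with spaces) could be as simple as:

  // stand-in for LcppStringUtils.removeLineBreaks(): replace CR/LF sequences with spaces
  private String removeLineBreaks(String s) {
    return (s == null) ? null : s.replaceAll("\r\n|\r|\n", " ");
  }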

Erwin said...

Hi Sir, this works perfectly given the test URL. I'm using Eclipse for an Android project, and I get access to the internet using the Android device emulator. Do you have any suggestions on how to use this with a user-supplied online URL? Thanks a lot!

Erwin said...

Good day Sir! I found out that the summarizer doesn't work for all kinds of websites. Can you give me some advice or revisions to make it work, specifically with Wikipedia? Thank you very much! I appreciate your replies to my earlier posts.

Sujit Pal said...

>> Do you have any suggestions on how to use this with a user-supplied online URL?
The summarizer is relatively lightweight; all it does is fetch the page and parse out the various meta fields, so you can probably do it online from the phone itself - the cost is only slightly higher than making a browser call.

>> ...I found out that the summarizer doesn't work for all kinds of websites. Can you give me some advice or revisions to make it work, specifically with Wikipedia?
Yes, this summarizer depends on the meta tags being specified correctly, which is what keeps it lightweight. However, some pages don't have the tags defined (or have them defined incorrectly). For these cases, an alternative is to just parse the body and take the first 200-500 characters as the summary. For titles and such, you may have to develop custom rules for some sites by looking at their HTML (for example, some pages I have seen have an incorrect meta title, but the "correct" title is in the first <H1> tag).
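
A rough sketch of that fallback, using the same Jericho calls as in the summarizer above (the 500-character cutoff and the H1 rule are just the heuristics mentioned here, nothing standard), could look like this:

  // fallback when meta tags are missing or wrong: first ~500 characters of body text
  private String getFallbackDescription(Source source) {
    String text = source.getTextExtractor().toString();
    if (text == null || text.length() <= 500) {
      return text;
    }
    return text.substring(0, 500);
  }

  // fallback title: the contents of the first H1 element, if any
  private String getFallbackTitle(Source source) {
    Element h1 = source.findNextElement(0, HTMLElementName.H1);
    if (h1 == null) {
      return null;
    }
    return CharacterReference.decodeCollapseWhiteSpace(h1.getContent());
  }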

Anonymous said...

I got an error at the line
metaValue = LcppStringUtils.removeLineBreaks(metaValue);
when I tried HTML summarization. Can you give a solution to this?

Sujit Pal said...

Depends on the error I guess... I am guessing it's most likely a NullPointerException? If so, add a null check on the meta value before attempting to strip line breaks off it.