Saturday, March 29, 2008

Crawling web pages with WebSPHINX

Some proof-of-concept projects I have worked on in the past involve crawling pages off public web sites. Since these are one-time crawls involving a few hundred pages at most, it's generally impractical to set up a full-blown crawl using your massively distributed parallel computing crawling juggernaut of choice. I recently discovered WebSPHINX, a really cool little GUI tool and Java class library that allows you to do this fairly easily and quickly. This blog post describes a few of these use cases (with code) that show how easy it is to use the class library. In other words, it describes a programmer's hack for crawling a few hundred pages off the web once in a while, for one-time use.

The GUI tool is nice and lets you experiment with some basic crawls. However, to actually do something with the pages inline with the crawl, you need the JavaScript library that comes bundled with Netscape. This is not surprising, since the last release of WebSPHINX was in 2002, when Netscape was the browser technology leader. I use Firefox now, though, and I didn't want to download Netscape just to get at its embedded JAR file. In any case, the class library is simple enough to use, so I just went with that.

At its most basic, all you have to do to build your own crawler is to subclass the WebSPHINX Crawler and override its visit(Page) method to specify what your crawler should do with each Page it visits; this is where you would put your parsing or persistence logic. There are other methods you can override as well, such as shouldVisit(Link), which lets you weed out URLs you don't want before they are even fetched, so you incur less overhead.
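
For example, a minimal crawler that just prints the URL of every page it visits might look something like the sketch below (the class name and root URL are made up):

// MinimalCrawler.java - minimal sketch; class name and root URL are made up
package com.mycompany.mycrawler;

import java.net.URL;

import websphinx.Crawler;
import websphinx.Link;
import websphinx.Page;

public class MinimalCrawler extends Crawler {

  @Override
  public void visit(Page page) {
    // do something useful with the page; here we just print its URL
    System.out.println(page.getURL());
  }

  public static void main(String[] argv) throws Exception {
    MinimalCrawler crawler = new MinimalCrawler();
    crawler.setRoot(new Link(new URL("http://www.example.com/")));
    crawler.run();
  }
}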

I created a base crawler class, MyCrawler, which all my other crawlers extend. MyCrawler contains a few things that make the subclasses better behaved, such as obeying the robots.txt exclusion file, waiting 1s between page visits, and setting a User-Agent string that tells the webmaster of my target site who I am and how to contact me if necessary. Here is the code for this:

// MyCrawler.java
package com.mycompany.mycrawler;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

import websphinx.Crawler;
import websphinx.DownloadParameters;
import websphinx.Page;

public abstract class MyCrawler extends Crawler {

  private static final long serialVersionUID = 2383514014091378008L;

  protected final Log log = LogFactory.getLog(getClass());

  public MyCrawler() {
    super();
    DownloadParameters dp = new DownloadParameters();
    // changeXXX() methods return a new DownloadParameters instead of modifying
    // this one in place, so reassign dp each time
    dp = dp.changeObeyRobotExclusion(true);
    dp = dp.changeUserAgent("MyCrawler Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.1.4) " + 
      "WebSPHINX 0.5 contact me_at_mycompany_dot_com");
    setDownloadParameters(dp);
    setDomain(Crawler.SUBTREE);
    setLinkType(Crawler.HYPERLINKS);
  }
  
  @Override
  public void visit(Page page) {
    doVisit(page);
    try {
      Thread.sleep(1000L); // be nice: wait 1 second between page visits
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
  
  /**
   * Extend me, not visit(Page)!
   */
  protected abstract void doVisit(Page page);
}

Downloading a small site

This class hits a small public site and downloads it to your local disk. The URL structure determines the file and directory names on the local disk. You may need to tweak the logic that maps the URL path to the local file path; it worked for my test but may not work for an arbitrary site. The init() method contains application-level setup and is called from the main() method.

// SiteDownloadingCrawler.java
package com.mycompany.mycrawler;

import java.io.File;
import java.net.URL;

import org.apache.commons.io.FileUtils;
import org.apache.commons.io.FilenameUtils;
import org.apache.commons.lang.StringUtils;

import websphinx.Link;
import websphinx.Page;

public class SiteDownloadingCrawler extends MyCrawler {

  private static final long serialVersionUID = 64989986095789110L;

  private String targetDir;
  
  public void setTargetDir(String targetDir) {
    this.targetDir = targetDir;
  }
  
  private void init() throws Exception {
    File targetDirFile = new File(targetDir);
    if (targetDirFile.exists()) {
      FileUtils.forceDelete(targetDirFile);
    }
    FileUtils.forceMkdir(targetDirFile);
  }
  
  @Override
  protected void doVisit(Page page) {
    URL url = page.getURL();
    try {
      String path = url.getPath().replaceFirst("/", "");
      if (StringUtils.isNotEmpty(path)) {
        String targetPathName = FilenameUtils.concat(targetDir, path);
        File targetFile = new File(targetPathName);
        // use getFullPath() so the leading "/" is preserved (getPath() strips the prefix)
        File targetPath = new File(FilenameUtils.getFullPath(targetPathName));
        if (! targetPath.exists()) {
          FileUtils.forceMkdir(targetPath);
        }
        FileUtils.writeByteArrayToFile(targetFile, page.getContentBytes());
      }
    } catch (Exception e) {
      log.error("Could not download url:" + url.toString(), e);
    }
  }
  
  /**
   * This is how we are called.
   * @param argv command line args.
   */
  public static void main(String[] argv) {
    SiteDownloadingCrawler crawler = new SiteDownloadingCrawler();
    try {
      crawler.setTargetDir("/tmp/some-public-site");
      crawler.init();
      crawler.setRoot(new Link(new URL("http://www.some-public-site.com")));
      crawler.run();
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}

Downloading from a site protected by basic authentication

Sometimes you may need to download files off a remote but non-public site that is protected by HTTP Basic Authentication, so the contents are only available to trusted clients (of which you are one). I used JavaWorld's Java Tip 46 to figure out how to do this. The actual persistence work is identical to SiteDownloadingCrawler, so we just inherit the superclass's doVisit(Page) method here.

// RemoteFileHarvester.java
package com.mycompany.mycrawler;

import java.net.Authenticator;
import java.net.PasswordAuthentication;
import java.net.URL;

import websphinx.Link;

public class RemoteFileHarvester extends SiteDownloadingCrawler {

  private static final long serialVersionUID = 3466884716433043917L;
  
  /**
   * This is how we are called.
   * @param argv command line args.
   */
  public static void main(String[] argv) {
    RemoteFileHarvester crawler = new RemoteFileHarvester();
    try {
      crawler.setTargetDir("/tmp/private-remote-site");
      URL rootUrl = new URL("http://private.site.com/protected/");
      Authenticator.setDefault(new Authenticator() {
        protected PasswordAuthentication getPasswordAuthentication() {
          return new PasswordAuthentication("myuser", "mypassword".toCharArray());
        }
      });
      crawler.setRoot(new Link(rootUrl));
      crawler.run();
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}

Harvesting URLs from a public site

Sometimes all you want is a bunch of URLs of pages from a public site, perhaps to use as input to another process. In most cases, the URLs you are interested in have a distinct structure which you can exploit to reduce the I/O your crawler does and the load it puts on the public site. We override the shouldVisit(Link) method here to tell the crawler not to bother visiting pages whose URLs don't match the pattern. Additionally, we have application-level init() and destroy() methods that open and close the handle to the output file.

// UrlHarvestingCrawler.java
package com.mycompany.mycrawler;

import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.URL;

import websphinx.Crawler;
import websphinx.Link;
import websphinx.Page;

public class UrlHarvestingCrawler extends MyCrawler {

  private static final long serialVersionUID = 9015164947202781853L;
  private static final String URL_PATTERN = "some_pattern";

  private PrintWriter output;
  
  protected void init() throws Exception {
    output = new PrintWriter(new OutputStreamWriter(
      new FileOutputStream("/tmp/urls-from-public-site.txt")), true);
  }
  
  protected void destroy() {
    // guard against init() having failed before the output file was opened
    if (output != null) {
      output.flush();
      output.close();
    }
  }
  
  @Override
  protected void doVisit(Page page) {
    URL url = page.getURL();
    output.println(url.toString());
  }
  
  @Override
  public boolean shouldVisit(Link link) {
    URL linkUrl = link.getURL();
    return (linkUrl.toString().contains(URL_PATTERN));
  }
  
  /**
   * This is how we are called.
   * @param argv command line args.
   */
  public static void main(String[] argv) {
    UrlHarvestingCrawler crawler = new UrlHarvestingCrawler();
    try {
      crawler.init();
      crawler.setRoot(new Link(new URL("http://www.public-site.com/page")));
      crawler.setDomain(Crawler.SERVER); // reset this since we are interested in siblings
      crawler.setMaxDepth(2); // only crawl 2 levels deep, default 5
      crawler.run();
    } catch (Exception e) {
      e.printStackTrace();
    } finally {
      crawler.destroy();
    }
  }
}

Obviously, there are many more interesting use cases WebSPHINX can be applied to. You can probably also do a lot of this with a well-crafted wget call, or by wrapping one or more such calls inside a shell script. However, WebSPHINX offers a neat way to do this with Java, which, at least in my opinion, is more maintainable.

A few cautions. If you are going to run a crawler frequently (possibly from your home account), be aware that your ISP may come down hard on you for excessive bandwidth use. But this probably doesn't apply if you are going to use WebSPHINX for the type of one-off situations I have described above.

A few other cautions to keep in mind when actually running this (or any) crawler. Crawling other people's web sites is an intrusion on them. Sure, they put up their sites to be viewed, so in a sense they are public property, but they put them up to be viewed by humans, and they usually have advertising and other content on the page from which they hope to make some money. Not only is your crawler not helping them do this, it is also consuming bandwidth they reserve for their human visitors, so be as nice and non-intrusive as you can. Don't hit them too hard, don't crawl parts of their site they tell you not to, and let them know who you are, so they can get back to you with a cease-and-desist request (and obey it if they hand you one).

As you have seen in the article, the MyCrawler subclass of the WebSPHINX Crawler applies these checks, so my crawls are as well behaved as I know how to make them. Obviously, I am not a crawling guru, so if you have suggestions that would make them better behaved, by all means let me know, and I will be happy to add them in.

30 comments (moderated to prevent spam):

Sujit Pal said...

This message is posted on behalf of Ashwin Ittoo, who is trying to use WebSPHINX. Here is his email:
--
I was wondering if you were aware of any websphinx forums. Currently,
I'm developing by reading the entire API at first, to know the different
classes etc, and then only i can implement, which is time consuming.
So, if you know of any forums or communities or other resources, please
let me know. There are a few individual users pages, like yours - but
that's about it

I've emailed Rob Miller, but he told me he's no longer involved and
does not know of any community using WebSPHINX.
--
If you are using WebSPHINX and know about any user forums, please comment. Thanks.

Unknown said...

Hi I try to extend websphinx since a week as I would like to perform search and save some output to a database but as I am not a java programmer
I have no idea how to manipulate the class to do so.

can you give me some ideas

Thanks

R.

Sujit Pal said...

Hi Remi, you can extend websphinx.Crawler directly and override the visit() method to parse the web page and dump the results into a database. Typically (and I only say this because you mentioned you are not a Java programmer), I have one method that parses the page and populates a bean, and another method that writes the bean to the database; that way you can test both methods in isolation. If you want the well-behavedness stuff in MyCrawler, you could extend that instead, in which case you should override the doVisit() method. Here is a pseudo-code example:

public class RemisCrawler extends MyCrawler {
  private Connection dbconn;
  private void init() {
    dbconn = getDbConn();
  }
  public void doVisit(Page page) {
    Bean bean = parse(page);
    save(bean);
  }
  private Bean parse(Page page) {...}
  private void save(Bean bean) {...}
}

Unknown said...

Many thanks, I am progressing, like a turtle but progressing

I am looking for more infos about the patterns as well, I want to match a specific word in an url then extract the whole url (this is quite the same as the exemple, but I would like to go deeper.)
Which kind of pattern are used, I have found so many different regular expression infos and tutorials,
Do you know any one to go straight to results.

Great to find somebody active on this tool.
Sorry for annoying with newbbies question

R.

Sujit Pal said...

Hi Remi, perhaps override the shouldVisit(Link) method to return true if the URL matches your pattern and false otherwise? Something like this:

private static Pattern foo = Pattern.compile("foo");

public boolean shouldVisit(Link link) {
  URL url = link.getURL();
  Matcher matcher = foo.matcher(url.toString());
  return matcher.find();
}

You should take a look at the Javadocs for Pattern and Matcher.

troy said...

master guruji.... i m arun....i juz now learning data mining paper for my project....i have to work on WEBSPHINX ....but i don't know how to work on it...i downloaded websphinx but don't where to start with.....so please provide materials and steps to work on websphinx......

troy said...

hi tis is raja.i had a trouble in executing source code ofwebsphinx.i would like u to help me with required explanations and steps to be known.kindly reply to my mail.

Sujit Pal said...

Hi Troy, I am just another user of WebSPHINX. It's small and the API is quite easy to understand; my post details some of my understanding of it. Did you have specific questions about stuff you don't understand?

Unknown said...

I have a big problem!!!
I set root http://vnexpress.net/GL/Home/ (with domain vnexpress.net)
but I only get urls with domain mvnexpress.net???
--------------
Please try using root http://vnexpress.net/GL/Home/
and give me an idea to solve this problem.
Thanks!

Sujit Pal said...

Hi Le Van, I tried going to the /GL/Home URL you sent (with my browser), then moused over a few of the links. It looks like the links point to /GL/*, i.e., /GL/Home is a sibling of the pages it links to. If you are only interested in the /GL/Home links, then you should filter the URLs in code.
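
Something like this in your crawler subclass (just a sketch; adjust the prefix to whatever you actually want to keep):

  @Override
  public boolean shouldVisit(Link link) {
    // only follow links under /GL/Home/
    return link.getURL().toString().startsWith("http://vnexpress.net/GL/Home/");
  }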

Chirag said...

Hi Sujit,

Thanks for your wonderful posts. They are very informative and are leading me down the right path!

I am currently using the WebSPHINX java program to crawl a list of web pages and extract the data contained between specific tags.

At the moment, the crawler outputs the data to an HTML which creates an ordered list with the extracted values. The values generated are in seemingly random order and have links to the respective URL.

My issue is two fold. I need the output to not have an ordered list and I need the output to stay in the same order as the URLs.

So if I put the URLS as:

URL1
URL2
URL3

I need the output to look like this:

Value from URL1
Value from URL2
Value from URL3

Can you offer any advice Sujit? I'm very new to all this!

Sujit Pal said...

Thanks Chirag. To answer your question, I don't think you can control the order unless you force websphinx to run in a single thread...

If the order is required because you want to present the URL and the HTML together, then it may make sense to write your HTML and URL at the same time, and use that file as your input.

Alternatively, you could consider storing the mapping in some kind of in-memory hash or database as they are being generated.
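
Here is a rough (untested) sketch of the in-memory mapping idea; extractValue() below is a placeholder for whatever tag-scraping logic you already have, and keep in mind that the URL you get back from the Page may be normalized slightly differently from the one you put in:

// OrderedOutputCrawler.java - rough sketch of the in-memory mapping idea, untested
package com.mycompany.mycrawler;

import java.net.URL;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import websphinx.Crawler;
import websphinx.Link;
import websphinx.Page;

public class OrderedOutputCrawler extends Crawler {

  // extracted values keyed by URL; concurrent because the crawler may use several threads
  private final Map<String,String> results = new ConcurrentHashMap<String,String>();

  @Override
  public void visit(Page page) {
    results.put(page.getURL().toString(), extractValue(page.getContent()));
  }

  private String extractValue(String html) {
    return html; // placeholder for the real tag extraction
  }

  public static void main(String[] argv) throws Exception {
    List<String> inputUrls = Arrays.asList(
      "http://www.site.com/url1",
      "http://www.site.com/url2",
      "http://www.site.com/url3");
    OrderedOutputCrawler crawler = new OrderedOutputCrawler();
    for (String inputUrl : inputUrls) {
      crawler.addRoot(new Link(new URL(inputUrl)));
    }
    crawler.setMaxDepth(1); // keep the crawl shallow; tune if you also want linked pages
    crawler.run();
    // print the values in the same order as the input URLs, whatever order they were crawled in
    for (String inputUrl : inputUrls) {
      System.out.println(crawler.results.get(inputUrl));
    }
  }
}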

Usman said...

Hy Sujit

I want to use websphinx in my project , but im still not sure how to use it. i copy/paste mycrawler.java & sitedownloadingcrawler.java in a package.

wat i want to do is .that ik takes a link from user and returns all the sub-sequent links in a file.

I dont want to download huge pages,bcoz they will eat up all memory very quickly.

It will be great , if you can help me . :)

Thanks
Usman Raza.

Please also contact me on hotmail/msn

my id is usman000000@hotmail.com

Sujit Pal said...

Hi Usman, as far as I understood it, Websphinx will internally call visit(Page page) for each page it crawls - so once you set a root page and start the crawl, it will automatically follow the links on the page. If you make your visit(Page page) just capture the URL of the visited page, you should have what you want.

mike said...

Hi Sujit, I would like to know how to connect event listener to monitor the crawler. Instead of setting the root to be crawled, I prefer to input my own set of URLs during runtime using the Workbench.

So, the questions are:
How to connect an event listener to monitor the crawler?
How to get the input from the Workbench?

Thanks for your help.

--
Mike

Sujit Pal said...

Hi Mike, I have never used the WebSPHINX Workbench tool, so I'm not very sure, sorry. I have mainly used the class library to quickly build ad-hoc crawlers for a single site. Using that approach, you could add multiple seed URLs with multiple Crawler.addRoot() calls.
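
For example (just a sketch, URLs made up):

crawler.addRoot(new Link(new URL("http://www.first-site.com/")));
crawler.addRoot(new Link(new URL("http://www.second-site.com/")));
crawler.run();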

teckk said...

Hi Sujit,

Very nice posts. I was just wondering if you knew how to specify in shouldVisit() how to make it only download links that are in a certain HTML element. For example I only want to go to links that are in a certain list, do you have any idea how one would do this? Thanks.

Sujit Pal said...

Thanks teckk. Not sure if there is direct support for what you want to do, but you could probably do it like so: (1) instantiate a Set<String> in your crawler; (2) have shouldVisit(Link) return true if the Set is empty or contains the link's URL; (3) in visit(Page), call page.getContent(), parse out the links you care about, and put them in the Set.
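
In code, the idea would look roughly like this (a sketch, untested; extractWantedLinks() is a placeholder for your own parsing of the list element you care about):

// ListFilteringCrawler.java - sketch of the Set-based idea, untested
package com.mycompany.mycrawler;

import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

import websphinx.Crawler;
import websphinx.Link;
import websphinx.Page;

public class ListFilteringCrawler extends Crawler {

  // URLs pulled out of the HTML element we care about; synchronized because
  // the crawler may call visit() from several threads
  private final Set<String> wantedUrls =
    Collections.synchronizedSet(new HashSet<String>());

  @Override
  public boolean shouldVisit(Link link) {
    // visit everything until we have seen the first page, then only the URLs we collected
    return wantedUrls.isEmpty() || wantedUrls.contains(link.getURL().toString());
  }

  @Override
  public void visit(Page page) {
    // parse the raw HTML and remember the links inside the element we care about
    wantedUrls.addAll(extractWantedLinks(page.getContent()));
    // ...do whatever else you need with the page here...
  }

  private Set<String> extractWantedLinks(String html) {
    return new HashSet<String>(); // placeholder for your own HTML parsing
  }
}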

Alberto said...

Hi Sujit,
how do you compile some of the java files you have created with javac and the websphinx classes.

I'm new with java and when I try some errors occurs like:
C:\Users\Alberto\Desktop\websphinx>javac mycrawler\UrlHarvestingCrawler.java
mycrawler\UrlHarvestingCrawler.java:13: cannot find symbol
symbol: class MyCrawler
public class UrlHarvestingCrawler extends MyCrawler {
^
mycrawler\UrlHarvestingCrawler.java:30: method does not override or implement a
method from a supertype
@Override
^
mycrawler\UrlHarvestingCrawler.java:36: method does not override or implement a
method from a supertype
@Override
^
mycrawler\UrlHarvestingCrawler.java:50: cannot find symbol
symbol : method setRoot(websphinx.Link)
location: class mycrawler.UrlHarvestingCrawler
crawler.setRoot(new Link(new URL("http://www.public-site.com/page")));
^
mycrawler\UrlHarvestingCrawler.java:51: cannot find symbol
symbol : method setDomain(java.lang.String[])
location: class mycrawler.UrlHarvestingCrawler
crawler.setDomain(Crawler.SERVER); // reset this since we are interested i
n siblings
^
mycrawler\UrlHarvestingCrawler.java:52: cannot find symbol
symbol : method setMaxDepth(int)
location: class mycrawler.UrlHarvestingCrawler
crawler.setMaxDepth(2); // only crawl 2 levels deep, default 5
^
mycrawler\UrlHarvestingCrawler.java:53: cannot find symbol
symbol : method run()
location: class mycrawler.UrlHarvestingCrawler
crawler.run();
^
7 errors

Thank you

Sujit Pal said...

I think you are missing the MyCrawler class from your classpath. See the first error:
...UrlHarvestingCrawler.java:13: cannot find symbol
symbol: class MyCrawler

From there it's complaining about methods defined in MyCrawler that UrlHarvestingCrawler overrides. You can cut out the middleman by extending Crawler directly and overriding the methods like I did in MyCrawler.

somewherepeace said...

Dear Sujit Pal !
thanks for your arctice. It's useful for me. I have a question, how do i get title of a tags of previous Page
example: previous Page: < a href="http://link1.com">title 1< a/>.
When i crawl http://link1.com page, i get string "title 1"

Sorry about my English

Sujit Pal said...

Thanks, Nguyễn Anh Đức. Not sure I follow your question completely, but generally the "Previous page" would be enclosed in the A tag, something like: <a href="www.link1.com">Previous</a>, so you could parse the first page looking for the A tags. There are various HTML parsers available, but my personal (and probably biased, since I use it exclusively for HTML parsing) recommendation is Jericho.
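
A rough sketch with Jericho (untested; the package name below is from the 3.x releases, and the URL is made up):

// AnchorTextLister.java - sketch: list each link's href and anchor text using Jericho
import java.net.URL;
import java.util.List;

import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.HTMLElementName;
import net.htmlparser.jericho.Source;

public class AnchorTextLister {
  public static void main(String[] argv) throws Exception {
    Source source = new Source(new URL("http://www.first-page.com/"));
    List<Element> anchors = source.getAllElements(HTMLElementName.A);
    for (Element anchor : anchors) {
      String href = anchor.getAttributeValue("href");
      String text = anchor.getContent().getTextExtractor().toString();
      // remember text keyed by href, so when you crawl href later you know its anchor text
      System.out.println(href + " => " + text);
    }
  }
}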

propovip said...

I have been asked to crawl the links my application has. Is there a way to generate a report with hierarchy?

For Ex :- If i crawl a link "X" whose parent is "A" and "A"s parent is "B". i should see the report as
Page B URL --> Page A URL--> Page X URL
How do i achieve this?

Sujit Pal said...

Hi propovip, if you want to do this with WebSPHINX, you can override visit(Page) to extract the links in the page (page.getLinks()) and store them in, say, a database table as (parent, child) pairs. Then you could write something that reads this table and produces the report, starting from the root of your crawl and proceeding recursively down to each child until the current child has no children of its own.
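
Here is a rough sketch of that idea, using an in-memory map instead of a database table (untested; root URL and depth are made up):

// HierarchyReportingCrawler.java - sketch, untested
package com.mycompany.mycrawler;

import java.net.URL;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

import websphinx.Crawler;
import websphinx.Link;
import websphinx.Page;

public class HierarchyReportingCrawler extends Crawler {

  // child URL -> parent URL (first parent seen wins)
  private final ConcurrentMap<String,String> parents =
    new ConcurrentHashMap<String,String>();

  @Override
  public void visit(Page page) {
    String parentUrl = page.getURL().toString();
    Link[] links = page.getLinks();
    if (links == null) {
      return;
    }
    for (Link link : links) {
      parents.putIfAbsent(link.getURL().toString(), parentUrl);
    }
  }

  // walks back up from a URL to the root, producing "Page B URL --> Page A URL --> Page X URL" style paths
  public String pathTo(String url) {
    StringBuilder path = new StringBuilder(url);
    Set<String> seen = new HashSet<String>(); // guard against link cycles
    String parent = parents.get(url);
    while (parent != null && seen.add(parent)) {
      path.insert(0, parent + " --> ");
      parent = parents.get(parent);
    }
    return path.toString();
  }

  public static void main(String[] argv) throws Exception {
    HierarchyReportingCrawler crawler = new HierarchyReportingCrawler();
    crawler.setRoot(new Link(new URL("http://www.my-application.com/")));
    crawler.setMaxDepth(3);
    crawler.run();
    for (String url : crawler.parents.keySet()) {
      System.out.println(crawler.pathTo(url));
    }
  }
}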

francesco said...

a tip: if you use DownloadParameters.changeXXX() methods there will be no results indeed; DPs are like Java.Strings in this implementation, and you have to use them like this:

dp = dp.changeXXX()

anyway, thanks a lot for your samples, they introduced me into websphinx!

cheers, francesco

Sujit Pal said...

Thanks for your kind words, Francesco, and thank you for the information - I did not know about DownloadParameters but it looks like a simple and useful way of filtering web pages.

Francesco said...

I'm referring to your class MyCrawler; if you write like this

...
dp.changeObeyRobotExclusion(true);
...

nothing happens;

I want you to note that the correct syntax (in order to make something happen :) ) is

dp = dp.changeObeyRobotExclusion(true);

as you are trying to modify a Java.String.

cheers again,

Francesco

Sujit Pal said...

Ah, got it, thanks! :-).

Unknown said...

Hi,can i download javascript links also using this crawler?
Thanks in advance

Sujit Pal said...

I guess you could if you had full links inside your script sections, but since JavaScript is a programming language and URLs can be represented as something other than plain strings, it's not guaranteed.