Some proof-of-concept projects I have worked on in the past involve crawling pages off public web sites. Since these are one-time crawls involving a few hundred pages at most, it's generally impractical to set up a full-blown crawl using your massively distributed parallel computing crawling juggernaut of choice. I recently discovered WebSPHINX, a really cool little GUI tool and Java class library that allows you to do this fairly easily and quickly. This blog post describes a few of these use cases (with code) that show how easy the class library is to use. In other words, it describes a programmer's hack to crawl a few hundred pages off the web once in a while for one-time use.
The GUI tool is nice and allows you to experiment with some basic crawls. However, to actually do something with the pages you crawl, inline with the crawl, you need the JavaScript library that comes bundled with Netscape. Not surprising, since the last release of WebSPHINX was in 2002, when Netscape was the browser technology leader. I use Firefox now, though, and I didn't want to download Netscape just to get at its embedded JAR file. In any case, the class library is simple enough to use on its own, so I just went with that.
At its most basic, all you have to do to build your own crawler is subclass the WebSPHINX Crawler and override the visit(Page) method to specify what your crawler should do with each Page it visits. This is where you would put parsing or persistence logic for the Page in question. There are other methods you can override as well, such as shouldVisit(Link), which lets you weed out URLs you don't want before they are even fetched, so you incur less overhead.
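To make this concrete, here is a minimal, illustrative sketch that uses only the two methods just described: it prints the URL of every page it visits and skips image links. The class name and root URL are made up for the example.

// SimpleCrawler.java (illustrative sketch only; class name and root URL are placeholders)
package com.mycompany.mycrawler;

import java.net.URL;

import websphinx.Crawler;
import websphinx.Link;
import websphinx.Page;

public class SimpleCrawler extends Crawler {

  private static final long serialVersionUID = 1L;

  @Override
  public void visit(Page page) {
    // do whatever you want with the page here
    System.out.println("Visited: " + page.getURL());
  }

  @Override
  public boolean shouldVisit(Link link) {
    // prune the crawl before the page is even fetched
    return !link.getURL().toString().endsWith(".jpg");
  }

  public static void main(String[] argv) throws Exception {
    SimpleCrawler crawler = new SimpleCrawler();
    crawler.setRoot(new Link(new URL("http://www.example.com/")));
    crawler.run();
  }
}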
I created a base crawler class MyCrawler, which all my other crawlers extend. The MyCrawler class contains a few things that make the subclasses a little better behaved, such as obeying the robots.txt exclusion file, waiting 1s between page visits, and including a User-Agent string that tells the webmaster of my target site who I am and how to contact me if necessary. Here is the code:
// MyCrawler.java
package com.mycompany.mycrawler;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

import websphinx.Crawler;
import websphinx.DownloadParameters;
import websphinx.Page;

/**
 * Base crawler that all my other crawlers extend. Obeys robots.txt,
 * identifies itself via the User-Agent string, and sleeps 1s between
 * page visits so the target site is not hammered.
 */
public abstract class MyCrawler extends Crawler {

  private static final long serialVersionUID = 2383514014091378008L;

  protected final Log log = LogFactory.getLog(getClass());

  public MyCrawler() {
    super();
    // DownloadParameters is immutable: the change* methods return a new
    // instance rather than modifying the receiver, so chain the calls.
    DownloadParameters dp = new DownloadParameters()
      .changeObeyRobotExclusion(true)
      .changeUserAgent("MyCrawler Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.1.4) " +
        "WebSPHINX 0.5 contact me_at_mycompany_dot_com");
    setDownloadParameters(dp);
    setDomain(Crawler.SUBTREE);
    setLinkType(Crawler.HYPERLINKS);
  }

  @Override
  public void visit(Page page) {
    doVisit(page);
    try {
      // be polite: wait 1s between page visits
      Thread.sleep(1000L);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }

  /**
   * Extend me, not visit(Page)!
   */
  protected abstract void doVisit(Page page);
}
Downloading a small site
This class hits a small public site and downloads it to your local disk. The URL structure determines the file and directory names on the local disk. You may need to tweak the logic that maps the URL path to the local file path; it worked for my test, but may not work for an arbitrary site (a slightly more defensive variant is sketched after the listing). The init() method contains application logic and is called from the main() method.
// SiteDownloadingCrawler.java
package com.mycompany.mycrawler;

import java.io.File;
import java.net.URL;

import org.apache.commons.io.FileUtils;
import org.apache.commons.io.FilenameUtils;
import org.apache.commons.lang.StringUtils;

import websphinx.Link;
import websphinx.Page;

/**
 * Downloads a small public site to local disk, using the URL structure
 * to decide the local file and directory names.
 */
public class SiteDownloadingCrawler extends MyCrawler {

  private static final long serialVersionUID = 64989986095789110L;

  private String targetDir;

  public void setTargetDir(String targetDir) {
    this.targetDir = targetDir;
  }

  private void init() throws Exception {
    // start with a clean target directory
    File targetDirFile = new File(targetDir);
    if (targetDirFile.exists()) {
      FileUtils.forceDelete(targetDirFile);
    }
    FileUtils.forceMkdir(targetDirFile);
  }

  @Override
  protected void doVisit(Page page) {
    URL url = page.getURL();
    try {
      // map the URL path to a path under targetDir, creating
      // intermediate directories as needed
      String path = url.getPath().replaceFirst("/", "");
      if (StringUtils.isNotEmpty(path)) {
        String targetPathName = FilenameUtils.concat(targetDir, path);
        File targetFile = new File(targetPathName);
        File targetPath = new File(FilenameUtils.getPath(targetPathName));
        if (! targetPath.exists()) {
          FileUtils.forceMkdir(targetPath);
        }
        FileUtils.writeByteArrayToFile(targetFile, page.getContentBytes());
      }
    } catch (Exception e) {
      log.error("Could not download url:" + url.toString(), e);
    }
  }

  /**
   * This is how we are called.
   * @param argv command line args.
   */
  public static void main(String[] argv) {
    SiteDownloadingCrawler crawler = new SiteDownloadingCrawler();
    try {
      crawler.setTargetDir("/tmp/some-public-site");
      crawler.init();
      crawler.setRoot(new Link(new URL("http://www.some-public-site.com")));
      crawler.run();
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}
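The mapping from URL path to local file name is the part most likely to need tweaking. For instance, directory-style URLs that end in a trailing slash have no file name to write to. A hypothetical helper like the one below (localPathFor() is my own name, not part of WebSPHINX or Commons IO) could replace the inline logic in doVisit() if you run into that; it simply falls back to an index.html in that case.

  // Hypothetical helper: map a crawled URL to a local file path.
  // Assumes Commons IO's FilenameUtils, as imported above.
  private String localPathFor(URL url, String targetDir) {
    String path = url.getPath().replaceFirst("/", "");
    if (path.length() == 0 || path.endsWith("/")) {
      // directory-style URLs get an index.html so there is a file to write
      path = path + "index.html";
    }
    return FilenameUtils.concat(targetDir, path);
  }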
Downloading from a site protected by basic authentication
Sometimes you may need to download files off a remote but non-public site. The site is protected by HTTP Basic Authentication, so the contents are only available to trusted clients (of which you are one). I used Javaworld's Java Tip 46 to figure out how to do this. The actual persistence work is identical to SiteDownloadingCrawler, so we just reuse the superclass's doVisit(Page) method here (a caveat about the JVM-wide Authenticator is noted after the listing).
// RemoteFileHarvester.java
package com.mycompany.mycrawler;

import java.net.Authenticator;
import java.net.PasswordAuthentication;
import java.net.URL;

import websphinx.Link;

/**
 * Downloads files off a site protected by HTTP Basic Authentication.
 * The persistence logic is inherited from SiteDownloadingCrawler.
 */
public class RemoteFileHarvester extends SiteDownloadingCrawler {

  private static final long serialVersionUID = 3466884716433043917L;

  /**
   * This is how we are called.
   * @param argv command line args.
   */
  public static void main(String[] argv) {
    RemoteFileHarvester crawler = new RemoteFileHarvester();
    try {
      crawler.setTargetDir("/tmp/private-remote-site");
      URL rootUrl = new URL("http://private.site.com/protected/");
      // register credentials for the Basic Auth challenge (Java Tip 46)
      Authenticator.setDefault(new Authenticator() {
        protected PasswordAuthentication getPasswordAuthentication() {
          return new PasswordAuthentication("myuser", "mypassword".toCharArray());
        }
      });
      crawler.setRoot(new Link(rootUrl));
      crawler.run();
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}
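One caveat with this approach: Authenticator.setDefault() is JVM-wide, so the credentials above would be offered to any server that issues an authentication challenge during the crawl, not just the site you intend to harvest. If your crawl can wander off the protected host, a slightly more careful Authenticator could check the requesting host first, along these lines (the host name is the same placeholder as above):

      // Only answer challenges coming from the host we actually intend to crawl.
      Authenticator.setDefault(new Authenticator() {
        protected PasswordAuthentication getPasswordAuthentication() {
          if ("private.site.com".equals(getRequestingHost())) {
            return new PasswordAuthentication("myuser", "mypassword".toCharArray());
          }
          return null;  // decline to authenticate anywhere else
        }
      });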
Harvesting URLs from a public site
Sometimes all you want is a bunch of URLs of pages from a public site, perhaps to use as input to another process. In most cases, the URLs you are interested in have a distinct structure that you can exploit to reduce the I/O your crawler does, and also reduce the load on the public site. We override the shouldVisit(Link) method here to tell the crawler not to even visit pages whose URLs don't match the pattern (a regex-based variant is sketched after the listing). Additionally, we have application-level init() and destroy() methods that open and close the handle to the output file.
// UrlHarvestingCrawler.java
package com.mycompany.mycrawler;

import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.URL;

import websphinx.Crawler;
import websphinx.Link;
import websphinx.Page;

/**
 * Harvests URLs matching a pattern off a public site and writes them,
 * one per line, to an output file.
 */
public class UrlHarvestingCrawler extends MyCrawler {

  private static final long serialVersionUID = 9015164947202781853L;

  private static final String URL_PATTERN = "some_pattern";

  private PrintWriter output;

  protected void init() throws Exception {
    output = new PrintWriter(new OutputStreamWriter(
      new FileOutputStream("/tmp/urls-from-public-site.txt")), true);
  }

  protected void destroy() {
    // guard against init() having failed before the writer was opened
    if (output != null) {
      output.flush();
      output.close();
    }
  }

  @Override
  protected void doVisit(Page page) {
    URL url = page.getURL();
    output.println(url.toString());
  }

  @Override
  public boolean shouldVisit(Link link) {
    // only visit pages whose URLs contain the pattern we care about
    URL linkUrl = link.getURL();
    return (linkUrl.toString().contains(URL_PATTERN));
  }

  /**
   * This is how we are called.
   * @param argv command line args.
   */
  public static void main(String[] argv) {
    UrlHarvestingCrawler crawler = new UrlHarvestingCrawler();
    try {
      crawler.init();
      crawler.setRoot(new Link(new URL("http://www.public-site.com/page")));
      crawler.setDomain(Crawler.SERVER); // reset this since we are interested in siblings
      crawler.setMaxDepth(2); // only crawl 2 levels deep, default 5
      crawler.run();
    } catch (Exception e) {
      e.printStackTrace();
    } finally {
      crawler.destroy();
    }
  }
}
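If a simple substring match is too coarse for the URLs you are after, the same shouldVisit(Link) hook can apply a compiled regular expression instead. A sketch of that variant (the pattern below is just a placeholder) would look like this:

  // Variant of shouldVisit(Link) that filters on a regular expression
  // instead of a substring; the pattern here is only a placeholder.
  private static final java.util.regex.Pattern URL_REGEX =
    java.util.regex.Pattern.compile("^http://www\\.public-site\\.com/page/\\d+$");

  @Override
  public boolean shouldVisit(Link link) {
    return URL_REGEX.matcher(link.getURL().toString()).matches();
  }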
Obviously, there are many more interesting use cases WebSPHINX can be applied to. You could probably also do a lot of this with a well-crafted wget call (or calls) wrapped in a shell script. However, WebSPHINX offers a neat way to do this in Java, which, at least in my opinion, is more maintainable.
A few cautions. If you are going to run a crawler frequently (possibly from your home account), be aware that your ISP may come down hard on you for excessive bandwidth use. But this probably doesn't apply if you are going to use WebSPHINX for the type of one-off situations I have described above.
A few other cautions to keep in mind when actually running this (or any) crawler. Crawling other people's web sites is an intrusion to them. Sure, they put up their sites to be viewed, so in a sense it's public property, but they put it up to be viewed by humans, and they usually have advertising and other stuff on the page from which they hope to make some money. Not only is your crawler not helping them do this, it is also consuming bandwidth they reserve for their human visitors, so be as nice and non-intrusive as you can. Don't hit them too hard, don't crawl parts of their site they tell you not to, and let them know who you are, so they can get back to you with a cease-and-desist order (and obey it if they hand you one).
As you have seen in the article, the MyCrawler subclass of the WebSPHINX Crawler applies these checks, so my crawls are as well behaved as I know how to make them. Obviously, I am not a crawling guru, so if you have suggestions to make them better behaved, by all means let me know, and I will be happy to add them in.