The genesis of this "feature" was my misunderstanding of how Nutch works. So far I had been running fairly small batches as I built my plugin and other application-specific custom code. But I was now at the point where I could try ingesting all the XML files provided by this provider, so I did. There are about 6,000 XML files in this collection, but Nutch fetched exactly 318, every time (I tried it a couple of times to make sure I was doing it right).
I initially thought that perhaps, because the seed list for the provider was in HTML, Nutch's default HTML parser was doing some magic "above the fold" scoring that discounted items further down the page, so I hit upon the idea of using a sitemap XML file. I figured that since Nutch didn't provide sitemap support, I'd have to write my own parser (which wouldn't have any magic scoring). Since my XML parser plugin already allowed for multiple parsers, this just involved writing a sitemap XML processor and calling it through my plugin.
Of course, this did not fix the problem; the fetcher just stopped after a different number of files. It turns out that by default, Nutch only reads the first 64KB of each file and drops the rest. A quick peek at webpage["f"]["cnt"] in the database confirmed this. So the fix for my original problem was really just adding this block to my nutch-site.xml file:
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>
But I had already written the sitemap parser, and using a seed file in sitemap format would allow me to also support vertical crawls for partner sites (who typically provide us with a URL to their sitemap) using the same infrastructure I was building for ingesting provider XML files. So I decided to go with it.
The sitemap format is quite simple, but the only information that's usable during the inject stage is the urlset/url/loc value. The provider_index method in the updated CherryPy server code below generates a dynamic sitemap XML from the contents of the filesystem.
#!/usr/bin/python
import cherrypy
import os
import os.path
import urllib

from cherrypy.lib.static import serve_file

SITES_DIR = "/path/to/your/sites/directory"
SERVER_HOST = "localhost"
SERVER_PORT = 8080

def _accumulate_files(files, dirname, fnames):
  """
  This function gets called for every file (and directory) that is walked
  by os.path.walk. It accumulates the file names found into a flat array.
  The file names accumulated are relative to the providers directory.
  """
  for fname in fnames:
    abspath = os.path.join(dirname, fname)
    if os.path.isfile(abspath):
      abspath = abspath.replace(os.path.join(SITES_DIR, "providers"), "")[1:]
      files.append(abspath)

class Root:

  @cherrypy.expose
  def test(self, name):
    """
    Expose the mock site for testing.
    """
    return serve_file(os.path.join(SITES_DIR, "test", "%s.html" % (name)), \
      content_type="text/html")

  @cherrypy.expose
  def provider_index(self, name):
    """
    Builds an index page of links to all the files for the specified
    provider. The files are stored under sites/providers/$name. The
    function will recursively walk the filesystem under this directory
    and dynamically generate a flat list of links. Path separators in
    the filename are converted to "__" in the URL. The index page can
    be used as the seed URL for this content.
    """
    files = []
    name = name.replace("-sitemap.xml", "")
    os.path.walk(os.path.join(SITES_DIR, "providers", name), \
      _accumulate_files, files)
    index = """<?xml version=\"1.0\" encoding=\"UTF-8\"?>
      <urlset xmlns=\"http://www.sitemaps.org/schemas/sitemap/0.9\">
    """
    for file in files:
      url = "http://%s:%s/provider/%s" % (SERVER_HOST, SERVER_PORT, \
        urllib.quote_plus(file.replace(os.path.sep, "__")))
      index += """
        <url><loc>%s</loc></url>
      """ % (url)
    index += """
      </urlset>
    """
    cherrypy.response.headers["Content-Type"] = "application/xml"
    return [index]

  @cherrypy.expose
  def provider(self, name):
    """
    Returns the contents of the XML file stored at the location
    corresponding to the URL provided. The "__" in the URL are converted
    back to file path separators.
    """
    ct = None
    if name.endswith(".xml"):
      ct = "application/xml"
    elif name.endswith(".json"):
      ct = "application/json"
    if ct is None:
      return serve_file(os.path.join(SITES_DIR, "providers", \
        "%s" % name.replace("__", os.path.sep)), \
        content_type="text/html")
    else:
      return serve_file(os.path.join(SITES_DIR, "providers", \
        "%s" % (urllib.unquote_plus(name).replace("__", os.path.sep))), \
        content_type=ct)

if __name__ == '__main__':
  current_dir = os.path.dirname(os.path.abspath(__file__))
  # Set up site-wide config first so we get a log if errors occur.
  cherrypy.config.update({'environment': 'production',
                          'log.access_file': 'site.log',
                          'log.screen': True,
                          "server.socket_host": SERVER_HOST,
                          "server.socket_port": SERVER_PORT})
  cherrypy.quickstart(Root(), '/')
The ProviderXmlProcessorFactory (described in a previous post) was modified slightly to check the basename of the URL: if the basename contains the string "sitemap", it delegates to the sitemap processing component first, and only then looks at the value of the u_idx metadata field. The code for the SitemapXmlProcessor is shown below, followed by a sketch of the factory-side check.
// Source: src/plugin/mycompany/src/java/com/mycompany/nutch/parse/xml/sitemap/SitemapXmlProcessor.java
package com.mycompany.nutch.parse.xml.sitemap;

import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.codehaus.jackson.map.ObjectMapper;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.Namespace;
import org.jdom.input.SAXBuilder;
import org.xml.sax.InputSource;

import com.mycompany.nutch.parse.xml.IProviderXmlProcessor;
import com.mycompany.nutch.parse.xml.ProviderXmlFields;

public class SitemapXmlProcessor implements IProviderXmlProcessor {

  public static final String OUTLINKS_KEY = "_outlinks";

  private ObjectMapper mapper;

  public SitemapXmlProcessor() {
    mapper = new ObjectMapper();
  }

  @SuppressWarnings("unchecked")
  @Override
  public Map<String,String> parse(String content) throws Exception {
    Map<String,String> parsedFields = new HashMap<String,String>();
    SAXBuilder builder = new SAXBuilder();
    Document doc = builder.build(new InputSource(
      new ByteArrayInputStream(content.getBytes())));
    Element root = doc.getRootElement();
    Namespace ns = root.getNamespace();
    if ("urlset".equals(root.getName())) {
      List<String> urls = new ArrayList<String>();
      List<Element> eUrls = root.getChildren("url", ns);
      for (Element eUrl : eUrls) {
        urls.add(eUrl.getChildTextTrim("loc", ns));
        // sitemap 0.9 also specifies optional elements lastmod,
        // changefreq and priority. The first two could be handled
        // if we changed the Outlink object to hold these values
        // as metadata, which would be used to update the modified
        // and fetchInterval values in the outlink once it's put
        // into the fetchlist. But we ignore these currently.
      }
      parsedFields.put(OUTLINKS_KEY, mapper.writeValueAsString(urls));
      // set some fields to prevent Nutch from choking on NPEs
      parsedFields.put(ProviderXmlFields.title.name(), "sitemap");
      parsedFields.put(ProviderXmlFields.content.name(), "sitemap");
    }
    return parsedFields;
  }
}
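For context, here is a minimal sketch of what the factory-side dispatch might look like. The getProcessor(url, idx) signature is taken from the parser code further below; everything else (the registry lookup, the class internals) is an assumption, since the real factory was covered in the previous post.

// Sketch only: a hypothetical version of ProviderXmlProcessorFactory that
// illustrates the basename check which routes sitemap URLs to the
// SitemapXmlProcessor before consulting the u_idx mapping.
package com.mycompany.nutch.parse.xml;

import java.util.HashMap;
import java.util.Map;

import com.mycompany.nutch.parse.xml.sitemap.SitemapXmlProcessor;

public class ProviderXmlProcessorFactory {

  // assumed registry mapping u_idx values to provider-specific processors
  private static final Map<String,IProviderXmlProcessor> PROCESSORS =
    new HashMap<String,IProviderXmlProcessor>();

  public static IProviderXmlProcessor getProcessor(String url, String idx) {
    // strip any query string, then take the last path component
    String path = url.split("\\?")[0];
    String basename = path.substring(path.lastIndexOf('/') + 1);
    if (basename.contains("sitemap")) {
      // sitemap URLs bypass the u_idx lookup entirely
      return new SitemapXmlProcessor();
    }
    return PROCESSORS.get(idx);
  }
}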
As you can see, the SitemapXmlProcessor parses out the URLs and accumulates them into a List, which is then written out to the parsedFields map as a JSON string under the magic key "_outlinks". The ProviderXmlParser plugin does a bit of special processing for this key: specifically, it converts the JSON string back to a List and writes the list elements out to the webpage["f"]["ol"] column. Here is the modified ProviderXmlParser class.
// Source: src/plugin/mycompany/src/java/com/mycompany/nutch/parse/xml/ProviderXmlParser.java
package com.mycompany.nutch.parse.xml;

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.apache.avro.util.Utf8;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseStatusCodes;
import org.apache.nutch.parse.Parser;
import org.apache.nutch.storage.ParseStatus;
import org.apache.nutch.storage.WebPage;
import org.apache.nutch.storage.WebPage.Field;
import org.apache.nutch.util.Bytes;
import org.codehaus.jackson.map.ObjectMapper;
import org.codehaus.jackson.type.TypeReference;

import com.mycompany.nutch.parse.xml.sitemap.SitemapXmlProcessor;

public class ProviderXmlParser implements Parser {

  private static final Log LOG = LogFactory.getLog(ProviderXmlParser.class);

  private static final Set<WebPage.Field> FIELDS = new HashSet<WebPage.Field>();
  private static final Utf8 IDX_KEY = new Utf8("u_idx");

  static {
    FIELDS.add(WebPage.Field.METADATA);
    FIELDS.add(WebPage.Field.OUTLINKS);
  }

  private Configuration conf;
  private ObjectMapper mapper;
  private TypeReference<List<String>> outlinksTypeRef;

  public ProviderXmlParser() {
    this.mapper = new ObjectMapper();
    this.outlinksTypeRef = new TypeReference<List<String>>() {};
  }

  @Override
  public Parse getParse(String url, WebPage page) {
    Parse parse = new Parse();
    parse.setParseStatus(new ParseStatus());
    parse.setOutlinks(new Outlink[0]);
    Map<Utf8,ByteBuffer> metadata = page.getMetadata();
    if (metadata.containsKey(IDX_KEY)) {
      String idx = Bytes.toString(Bytes.toBytes(metadata.get(IDX_KEY)));
      IProviderXmlProcessor processor = ProviderXmlProcessorFactory.getProcessor(url, idx);
      if (processor != null) {
        try {
          LOG.info("Parsing URL:[" + url + "] with " +
            processor.getClass().getSimpleName());
          Map<String,String> parsedFields = processor.parse(
            Bytes.toString(Bytes.toBytes(page.getContent())));
          parse.setText(parsedFields.get(ProviderXmlFields.content.name()));
          parse.setTitle(parsedFields.get(ProviderXmlFields.title.name()));
          // set the rest of the metadata back into the page
          for (String key : parsedFields.keySet()) {
            if (ProviderXmlFields.content.name().equals(key) ||
                ProviderXmlFields.title.name().equals(key) ||
                SitemapXmlProcessor.OUTLINKS_KEY.equals(key)) {
              continue;
            }
            page.putToMetadata(new Utf8(key),
              ByteBuffer.wrap(parsedFields.get(key).getBytes()));
          }
          if (parsedFields.containsKey(
              SitemapXmlProcessor.OUTLINKS_KEY)) {
            // if we have OUTLINKS data, then populate it as well
            List<String> outlinkUrls = mapper.readValue(
              parsedFields.get(SitemapXmlProcessor.OUTLINKS_KEY),
              outlinksTypeRef);
            Outlink[] outlinks = new Outlink[outlinkUrls.size()];
            for (int i = 0; i < outlinks.length; i++) {
              String outlinkUrl = outlinkUrls.get(i);
              outlinks[i] = new Outlink(outlinkUrl, outlinkUrl);
            }
            parse.setOutlinks(outlinks);
          }
          parse.getParseStatus().setMajorCode(ParseStatusCodes.SUCCESS);
        } catch (Exception e) {
          LOG.warn("Parse of URL: " + url + " failed", e);
          parse.getParseStatus().setMajorCode(ParseStatusCodes.FAILED);
        }
      }
    }
    return parse;
  }

  @Override
  public Collection<Field> getFields() {
    return FIELDS;
  }

  @Override
  public Configuration getConf() {
    return conf;
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }
}
And that's pretty much it. We now inject the following seed URL for this provider and run through two iterations (depth 2) of the generate, fetch, parse and updatedb cycle, and as expected, all the provider XML files are ingested without problems.
http://localhost:8080/provider_index/prov1-sitemap.xml u_idx=prov1
The approach I describe is probably not what you would normally think of when you hear "Nutch" and "sitemap enabled" in the same sentence. After all, we are throwing away the optional metadata being provided to us, such as the change frequency and last modified time. Unfortunately, with the approach I have chosen - using the sitemap XML file as the seed URL - the only way I know of to ingest this information in a single pass is to change the Outlink data structure. However, there is nothing preventing you from making a second pass over the sitemap after the fetch and resetting the fetch interval for a page based on its sitemap properties - sort of like my Delta Indexer on auto-pilot.
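To make that second-pass idea a little more concrete, here is a minimal, self-contained sketch (my own assumption, not code from the crawl described above) that maps the sitemap's optional changefreq and lastmod values to a fetch interval in seconds and an epoch timestamp. A post-fetch job could then write these values into the corresponding pages' fetchInterval and modifiedTime fields instead of changing the Outlink structure.

// Sketch: convert sitemap 0.9 changefreq/lastmod values into numbers a
// post-fetch "sitemap metadata" pass could apply. Class and method names
// are hypothetical.
package com.mycompany.nutch.parse.xml.sitemap;

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.HashMap;
import java.util.Map;

public class SitemapMetadataMapper {

  // rough changefreq-to-seconds mapping; "always" and "never" are clamped
  private static final Map<String,Integer> CHANGEFREQ_SECONDS =
      new HashMap<String,Integer>();
  static {
    CHANGEFREQ_SECONDS.put("always", 60 * 60);            // refetch hourly at most
    CHANGEFREQ_SECONDS.put("hourly", 60 * 60);
    CHANGEFREQ_SECONDS.put("daily", 24 * 60 * 60);
    CHANGEFREQ_SECONDS.put("weekly", 7 * 24 * 60 * 60);
    CHANGEFREQ_SECONDS.put("monthly", 30 * 24 * 60 * 60);
    CHANGEFREQ_SECONDS.put("yearly", 365 * 24 * 60 * 60);
    CHANGEFREQ_SECONDS.put("never", 365 * 24 * 60 * 60);  // effectively "rarely"
  }

  /** Returns a fetch interval in seconds, or -1 if changefreq is absent or unknown. */
  public int toFetchIntervalSecs(String changefreq) {
    if (changefreq == null) {
      return -1;
    }
    Integer secs = CHANGEFREQ_SECONDS.get(changefreq.trim().toLowerCase());
    return (secs == null) ? -1 : secs.intValue();
  }

  /** Parses the date part of a W3C datetime lastmod value into epoch millis, or -1. */
  public long toModifiedTime(String lastmod) {
    if (lastmod == null) {
      return -1L;
    }
    try {
      SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
      Date d = fmt.parse(lastmod.trim().substring(0, 10));
      return d.getTime();
    } catch (Exception e) {
      return -1L;
    }
  }
}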
Hi Sujit,
I know you have moved on from this; however, I wonder if you ever saw the sitemap support in Crawler Commons [0]?
The intention is to drag this aspect of the code into Nutch (e.g. NUTCH-1465 [1]) and I really like your idea of "The ProviderXmlProcessorFactory (described in a previous post) was modified slightly to check for the basename of the URL. If the basename contains the string "sitemap", then it will delegate to the Sitemap Processing component first, and only then look at the value of the u_idx metadata field." as this makes perfect sense for a parse-sitemap plugin for Nutch.
Thanks for the post and inspiration
Lewis
[0] http://code.google.com/p/crawler-commons/
[1] https://issues.apache.org/jira/browse/NUTCH-1465
Hi Lewis, as a matter of fact, we are implementing something along these lines at work with Nutch 2.1 as our workhorse. I did not know about the crawler-commons code, thanks for the pointer, and I am glad my comment served as inspiration for a new feature in Nutch :-).
The approach we are choosing, however, is to have a component preprocess the sitemap (or set of sitemaps, as the case may be) and come up with a seed URL list (with metadata as name-value pairs) that we feed to inject. This is because we want to use the lastmod and changefreq data as well.