Thursday, July 23, 2009

Nutch: Custom Plugin to parse and add a field

Last week, I described my initial explorations with Nutch and the code for a really simple plugin. This week, I describe a pair of plugin components that parse out the blog tags (the "Labels:" towards the bottom of this page) and add them to the index. The plugins are fairly useless in the general case, because you cannot depend on a particular page format unless, like me, you are looking at a very small subset of the web. I wrote them in order to understand Nutch's plugin architecture, and to see what is involved in using it as a crawler-indexer combo.

Most of the code here is based on the information in the Nutch Writing Plugin Example wiki page, which targets Nutch 0.9. There are some API changes between Nutch 0.9 and Nutch 1.0 (which I use) that I had to figure out by looking at the source code of the contributed plugins, but other than that, my example is quite vanilla.

I am adding to the same myplugins plugin that I described in my previous post. My new plugin pair consists of a Parsing filter to parse out the tags from the HTML page, and an Indexing filter to put the tags into the Lucene index. My new plugin.xml file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<plugin id="myplugins" name="My test plugins for Nutch"
     version="0.0.1" provider-name="mycompany.com">

   <runtime>
      <library name="myplugins.jar">
         <export name="*"/>
      </library>
   </runtime>

   <extension id="com.mycompany.nutch.indexing.InvalidUrlIndexFilter"
       name="Invalid URL Index Filter"
       point="org.apache.nutch.indexer.IndexingFilter">
     <implementation id="MyPluginsInvalidUrlFilter"
         class="com.mycompany.nutch.indexing.InvalidUrlIndexFilter"/>
   </extension>
   
   <extension id="com.mycompany.nutch.parsing.TagExtractorParseFilter"
       name="Tag Extractor Parse Filter"
       point="org.apache.nutch.parse.HtmlParseFilter">
     <implementation id="MyPluginsTagExtractorParseFilter"
         class="com.mycompany.nutch.parsing.TagExtractorParseFilter"/>
   </extension>
   
   <extension id="com.mycompany.nutch.parsing.TagExtractorIndexFilter"
       name="Tag Extractor Index Filter"
       point="org.apache.nutch.indexer.IndexingFilter">
     <implementation id="MyPluginsTagExtractorIndexFilter"
         class="com.mycompany.nutch.indexing.TagExtractorIndexFilter"/>
   </extension>
</plugin>

The code for the TagExtractorParseFilter is shown below. It reads the content byte array line by line, applies a regular expression to a particular portion of the page to extract the tags, and stuffs them into a named slot in the parse metadata map for retrieval and use by the corresponding indexing filter. The class implements the HtmlParseFilter interface, and will be called as one of the configured HtmlParseFilters when the Nutch parse subcommand is run (described below).

// Source: src/plugin/myplugins/src/java/com/mycompany/nutch/parsing/TagExtractorParseFilter.java
package com.mycompany.nutch.parsing;

import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.log4j.Logger;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.ParseResult;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

/**
 * The parse portion of the Tag Extractor module. Parses out blog tags 
 * from the body of the document and sets them into the ParseResult object.
 */
public class TagExtractorParseFilter implements HtmlParseFilter {

  public static final String TAG_KEY = "labels";
  
  private static final Logger LOG = 
    Logger.getLogger(TagExtractorParseFilter.class);
  
  private static final Pattern tagPattern = 
    Pattern.compile(">(\\w+)<");
  
  private Configuration conf;

  /**
   * We use regular expressions to parse out the Labels section from
   * the section snippet shown below:
   * <pre>
   * Labels:
   * <a href='http://sujitpal.blogspot.com/search/label/ror' rel='tag'>ror</a>,
   * ...
   * </span>
   * </pre>
   * Accumulate the tag values into a List, then stuff the list into the
   * parseResult with a well-known key (exposed as a public static variable
   * here, so the indexing filter can pick it up from here).
   */
  public ParseResult filter(Content content, ParseResult parseResult,
      HTMLMetaTags metaTags, DocumentFragment doc) {
    LOG.debug("Parsing URL: " + content.getUrl());
    BufferedReader reader = new BufferedReader(
      new InputStreamReader(new ByteArrayInputStream(
      content.getContent())));
    String line;
    boolean inTagSection = false;
    List<String> tags = new ArrayList<String>();
    try {
      while ((line = reader.readLine()) != null) {
        if (line.contains("Labels:")) {
          inTagSection = true;
          continue;
        }
        if (inTagSection && line.contains("</span>")) {
          inTagSection = false;
          break;
        }
        if (inTagSection) {
          Matcher m = tagPattern.matcher(line);
          if (m.find()) {
            LOG.debug("Adding tag=" + m.group(1));
            tags.add(m.group(1));
          }
        }
      }
      reader.close();
    } catch (IOException e) {
      LOG.warn("IOException encountered parsing file:", e);
    }
    Parse parse = parseResult.get(content.getUrl());
    Metadata metadata = parse.getData().getParseMeta();
    for (String tag : tags) {
      metadata.add(TAG_KEY, tag);
    }
    return parseResult;
  }

  public Configuration getConf() {
    return conf;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }
}
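
To convince myself that the regular expression does what I want without running a full parse job, I found it handy to exercise it against the "Labels:" markup shown in the javadoc snippet above. This is just a throwaway check, not part of the plugin (the class name here is mine):

// Throwaway sanity check for the ">(\w+)<" pattern used by
// TagExtractorParseFilter, run against a line of the "Labels:" markup
// from the javadoc snippet above.
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TagPatternCheck {
  public static void main(String[] args) {
    Pattern tagPattern = Pattern.compile(">(\\w+)<");
    String line =
      "<a href='http://sujitpal.blogspot.com/search/label/ror' rel='tag'>ror</a>,";
    Matcher m = tagPattern.matcher(line);
    while (m.find()) {
      System.out.println("tag=" + m.group(1));  // prints: tag=ror
    }
  }
}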

The TagExtractorIndexFilter is the other half of the pair. It retrieves the label values from the Parse object and sticks them into the Lucene index. The code is shown below.

// Source: src/plugin/myplugins/src/java/com/mycompany/nutch/indexing/TagExtractorIndexFilter.java
package com.mycompany.nutch.indexing;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.log4j.Logger;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.indexer.lucene.LuceneWriter;
import org.apache.nutch.indexer.lucene.LuceneWriter.INDEX;
import org.apache.nutch.indexer.lucene.LuceneWriter.STORE;
import org.apache.nutch.parse.Parse;

import com.mycompany.nutch.parsing.TagExtractorParseFilter;

/**
 * The indexing portion of the TagExtractor module. Retrieves the
 * tag information stuffed into the ParseResult object by the parse
 * portion of this module.
 */
public class TagExtractorIndexFilter implements IndexingFilter {

  private static final Logger LOGGER = 
    Logger.getLogger(TagExtractorIndexFilter.class);
  
  private Configuration conf;
  
  public void addIndexBackendOptions(Configuration conf) {
    LuceneWriter.addFieldOptions(
      TagExtractorParseFilter.TAG_KEY, STORE.YES, INDEX.UNTOKENIZED, conf);
  }

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    String[] tags = 
      parse.getData().getParseMeta().getValues(
      TagExtractorParseFilter.TAG_KEY);
    if (tags == null || tags.length == 0) {
      return doc;
    }
    // add to the nutch document, the properties of the field are set in
    // the addIndexBackendOptions method.
    for (String tag : tags) {
      LOGGER.debug("Adding tag: [" + tag + "] for URL: " + url.toString());
      doc.add(TagExtractorParseFilter.TAG_KEY, tag);
    }
    return doc;
  }

  public Configuration getConf() {
    return this.conf;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
  }
}

We already have the myplugins plugin registered with Nutch, so to exercise the parser, we generate the set of URLs to fetch from the crawldb, fetch the pages without parsing, then parse. Once that is done, we run updatedb to update the crawldb, then index, dedup and merge. The entire sequence of commands is listed below.

sujit@sirocco:/opt/nutch-1.0$ CRAWL_DIR=/home/sujit/tmp
sujit@sirocco:/opt/nutch-1.0$ bin/nutch generate \
  $CRAWL_DIR/data/crawldb $CRAWL_DIR/data/segments

# this will create a segments subdirectory which is used in the
# following commands (we set it to SEGMENTS_DIR below)
sujit@sirocco:/opt/nutch-1.0$ SEGMENTS_DIR=20090720105503
sujit@sirocco:/opt/nutch-1.0$ bin/nutch fetch \
  $CRAWL_DIR/data/segments/$SEGMENTS_DIR -noParsing

# The parse command is where our custom parsing happens.
# To run this (for testing) multiple times, remove the 
# crawl_parse, parse_data and parse_text under the
# segments subdirectory after a failed run.
sujit@sirocco:/opt/nutch-1.0$ bin/nutch parse \
  $CRAWL_DIR/data/segments/$SEGMENTS_DIR
sujit@sirocco:/opt/nutch-1.0$ bin/nutch updatedb \
  $CRAWL_DIR/data/crawldb -dir $CRAWL_DIR/data/segments/*

# The index command is where our custom indexing happens
# you should remove $CRAWL_DIR/data/index and 
# $CRAWL_DIR/data/indexes from previous run before running
# these commands.
sujit@sirocco:/opt/nutch-1.0$ bin/nutch index \
  $CRAWL_DIR/data/indexes $CRAWL_DIR/data/crawldb \
  $CRAWL_DIR/data/linkdb $CRAWL_DIR/data/segments/*
sujit@sirocco:/opt/nutch-1.0$ bin/nutch dedup \
  $CRAWL_DIR/data/indexes
sujit@sirocco:/opt/nutch-1.0$ bin/nutch merge \
  -workingdir $CRAWL_DIR/data/work $CRAWL_DIR/data/index \
  $CRAWL_DIR/data/indexes

I tried setting fetcher.parse to false in my conf/nutch-site.xml, but it did not seem to have any effect - I had to pass the -noParsing flag to the nutch fetch command. Here is the snippet anyway, just in case.

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<configuration>
  ...
  <property>
    <name>fetcher.parse</name>
    <value>false</value>
  </property>
  ...
</configuration>

I noticed this error message when I was running the parse.

Error parsing: http://sujitpal.blogspot.com/feeds/6781582861651651982
/comments/default: org.apache.nutch.parse.ParseException: parser not found
for contentType=application/atom+xml url=http://sujitpal.blogspot.com
/feeds/6781582861651651982/comments/default
 at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:74)
 at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:766)
 at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:552)

I noticed that application/atom+xml was not explicitly mapped to a plugin, so I copied the application/rss+xml mapping for it in conf/parse-plugins.xml and added parse-rss to plugin.includes in conf/nutch-site.xml. The error message disappeared, but was replaced with a warning about parse-rss not being able to parse the atom+xml content properly. I did not investigate further, because I was throwing away these pages anyway (via the InvalidUrlIndexFilter). Here is the snippet from parse-plugins.xml.

        <mimeType name="application/atom+xml">
            <plugin id="parse-rss" />
            <plugin id="feed" />
        </mimeType>

You can verify that the TagExtractor combo worked (apart from checking the log traces in hadoop.log) by looking at the generated Lucene index. Here is a screenshot of the top terms for the "labels" field.
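
If you don't want to fire up Luke, a few lines of Lucene code will dump the same information. Below is a minimal, throwaway sketch; it assumes the Lucene 2.x jars bundled with Nutch 1.0 are on the classpath, and that the merged index lives at the path produced by the merge command above.

// Quick term dump for the "labels" field, as an alternative to browsing
// the index in Luke. Adjust the index path as needed.
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;

public class LabelTermDumper {
  public static void main(String[] args) throws Exception {
    IndexReader reader = IndexReader.open("/home/sujit/tmp/data/index");
    TermEnum terms = reader.terms(new Term("labels", ""));
    try {
      do {
        Term term = terms.term();
        if (term == null || !"labels".equals(term.field())) {
          break;  // ran past the end of the labels field
        }
        System.out.println(term.text() + " (docFreq=" + terms.docFreq() + ")");
      } while (terms.next());
    } finally {
      terms.close();
      reader.close();
    }
  }
}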

I probably should have gone all the way and built a QueryFilter to handle this field in search queries, but I really have no intention of ever using Nutch's search service (except perhaps as an online debugging tool, similar to Luke, which would be a lot of work at this point), so I decided not to.
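
Had I built it, then going by the Writing Plugin Example wiki page this post is based on, the query filter would probably have been little more than a subclass of FieldQueryFilter, plus the corresponding extension declaration in plugin.xml. The untested sketch below shows the idea; the package and class names are my own invention.

// Source (hypothetical): src/plugin/myplugins/src/java/com/mycompany/nutch/searching/TagQueryFilter.java
// Untested sketch - would make query clauses like labels:ror search the
// "labels" field added by the TagExtractorIndexFilter.
package com.mycompany.nutch.searching;

import org.apache.nutch.searcher.FieldQueryFilter;

public class TagQueryFilter extends FieldQueryFilter {
  public TagQueryFilter() {
    super("labels");
  }
}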

You may have noticed that the code above has nothing to do with Map-Reduce; these are simply (well, almost) plain Java hooks that plug in to Nutch's published extension points. However, the Nutch core, which calls these plugins during its life-cycle, uses Hadoop Map-Reduce pretty heavily, as these Map-Reduce in Nutch slides by Doug Cutting show. I plan on looking at this stuff more over the coming week, and will possibly write about it if I come up with anything interesting.

64 comments:

  1. Hi Sujit,
    The information given by u is very useful. I have started working on nutch for customized crawling. Will need ur guidance in future. thanks in advance.

  2. Thanks, Jagdeep, but the information here is already on the Nutch website, the only way the post probably helps you is as a simplistic case study. Feel free to ask questions, and I will try my best to help, but be aware that I am no expert on Nutch, and you can probably get (better and more authoritative) answers on the Nutch mailing list.

  3. Thanks a lot sir for your warm response.
    I want to crawl blog sites to get the meaning ful content . For this purpose i am planning to configure nutch to get text from my defined list of HTML tags only. After getting success in this approach i will try to use NLP to extarct meaningful text while crawling itself. Please guide me how far this approach is feaseable and is it the right approach to crawl relevant text from blog sites. Thanks

  4. Hi Jagdeep, your approach should work if you are crawling a (known) subset of pages, or if your patterns are broad enough. You may want to be more aggressive and use an HTML parser (I like Jericho) to first remove all content from tags you consider "useless", such as script and style tags, perhaps the head tag, then strip out all XML markup from the rest of it - that gives you clean text to start with, then your NLP code has less cruft to deal with. As for feasibility of doing NLP in the crawl, I guess that depends on what you are going to do and how fast your algorithms are, or how much time you want your crawl to take.

  5. Hi, Well Boss I can just say thanks a lot. It seems that you have a great knowledge in this field. Basically what you have mentioned is very true as I am looking for a parser which can get me text from specified tags and to filter HTML also.
    I will try the tool that you have suggested. Will respond you soon with my approach
    Thanks

  6. Hi sir, I was working on webharvest tool to get text from described tags. I was able to get the relevant text from the set to given HTML tags. But to crawl blog sites for which its not easy to describe the set of tags. With this tool i can remove HTML tags like you have suggested but I am still not sure that how far NLP approach will be helpful. I even have very short time line to properly figure out and implement NLP approach so, can you please suggest me something new.
    Thanks a lot

  7. Hi Jagdeep, sorry about the delay in responding. Another approach you can try (I have tried it for some experimental stuff, got quite good results, though not perfect) is explained here. Seems to be more maintainable than parsing known subset of tags, and less noisy than removing markup and parsing the rest.

    As for using NLP, typically you would have some kind of goal in mind...the only place I can think of when NLP would be useful in this scenario is if you were going to do some form of sentiment analysis. Usually, if you are just going to index it, just parsing out the useful text is sufficient, there are various (standard) filters and normalizations that happen during indexing that will ensure that noise is removed.

  8. Thanks a lot for your important suggestion. You got it right that sentiment analysis is my ultimate aim. First i have to extract relevant text then i will do the sentiment analysis.
    I trying to go through the approach suggested by you its just taking bit time to understand python code. Will try to simplfy the algorithm a bit and will write java code.
    Will get back to you soon.... in meanwhile if u come across any good link or code .... pease do me a favor and post it.
    Thanks a lot

  9. very good post, I would suggest you make a tutorial of how to integrate tika0.4 (a tool parse apache) to nutch. Because we have almost no content on the Internet about it. This would be of great help, hug!

  10. @juliano:
    Thanks, and yes, I heard about tika in the Lucene meetup at the ApacheCon last week. Thanks for the suggestion, and I will take a look at Apache tika and see if I can use it.

    @jagdeep:
    You may find my current post interesting, depending on our conversation.

  11. hats Off to u boss.... just saw your post... it looks really great....i have just tried it and in first run it looks good...will get back to you with its detail study....
    thanks a lot

  12. Hi Everyone,

    I have followed this link to add custom fields to my Index.

    while doing a search I do see my fields and their values when I click the explain link.

    but my urls have this appended to them. It is something like this http://www.ontla.on.ca/library/repository/ser/140213/2003//2003v14no04.pdf sha1:LQARYIDT5UHWATV3LPTKARIBUTPVQ2FB

    How do I get rid of the part sha1:LQARYIDT5UHWATV3LPTKARIBUTPVQ2FB

    Is there any configuration I am missing.

    Please help

    Thanks

  13. Hi Pramila, the problem you are noticing is probably not linked to the filters you got from here. To verify, remove them from the chain and rerun the job. I am guessing that this is probably some parsing error - not sure where, but to fix it, you could write an Indexing filter that will extract the url from the NutchDocument, strip off the trailing portion, and reset the stripped URL back into the NutchDocument. See my previous post for an example.

  14. Many institutions limit access to their online information. Making this information available will be an asset to all.

  15. Hi Custom, thanks, but I am not an institution...I am just a programmer who enjoys working with open source stuff to solve some (real and imagined) problems, which I write about in the hope it is useful to someone (or even me six months later) :-). Most of the stuff I describe here are little proofs of concept solutions to "real" problems - some end up never getting used, and some do.

  16. Hi Sujit,

    I am new to Nutch and found that your post is very useful! I would like to say big thanks to you.

    Now i am working on plugin too, wish to extract all the text within certain html tag.

    Actually I have some doubts regarding to the crawl comment in nutch, when I tried to run the crawl comment for second time in different day ("/crawl" folder already exist for the first time running), it seems the segments generated doesn't include the updated information, may I know why will this happen? When I delete the "/crawl" folder and run the crawl command, I can get the updated information.

    Besides that, I would also like to ask that the url list that we put is it must be the root page of the web site, (eg: http://myweb.com/), can we put something like --> (http://myweb.com/comment) ? When I tried (http://myweb.com/comment) and crawl the web, it seems cannot fetch all the links inside this page. Many are missing. May I know why this will happen? I had tried to edit the crawl-urlfilter.txt but the results also is the same.

    Thank you so much for your time..

    Really appreciate it.

    Thanks.

  17. Thanks, theng. I am not much of an expert on actual nutch usage, so it may be better if you asked this question on the nutch mailing list.

  18. Hi Sujit,

    Your articles are really a good starting point for using nutch.

    I have a requirement of storing some extra information for links while parsing the links of a given page.Basically I have requirement of creating video search ,so I want to fetch thumbnail along with the ulr.Please help me how this can be done in nutch

  19. Thanks Shashwat. I am guessing you are crawling and indexing pages containing videos, given that you are trying to do video search. In that case, you would need to build a parser and indexer extension similar to the (rather lame) example in my post. I like Jericho for HTML parsing, but there are other libraries too. Basically once you know the various locations (you would have to analyze the pages you are crawling) where the thumbnail is available on the video page, you would extract it out into a ParseResult during parsing, and then in your indexer extension, you would put that information into your index.

  20. Hi, I read your blog ..I installed nutch it and everything works great ... but I have a big doubt.

    When I run the crawler, for example in the url directory I have a *. txt in the interior contains:
    http://www.opentechlearning.com/

    And inside the folder 'conf' there are a file 'crawl-urlfilter' must have:

    + ^ Http:// ([a-z0-9] * \.) * Opentechlearning.com /

    My question is how (I put the pages in the crawl-urlfilter file) to the following pages:

    http://cnx.org/lenses/ccotp/endorsements/atom

    http://ocw.nd.edu/courselist/rss

    http://openlearn.open.ac.uk/file.php/1/learningspace.xml

    .... and not starting with www. and that causes me problems

    I put for example:

    http://cnx.org/lenses/ccotp/endorsements/atom
    and
    + ^ Http:// ([a-z0-9] * \.) *cnx.org/lenses/ccotp/endorsements/atom

    but when i do the search....nothing appears

    I want to know if I need a plugin for rss xml atom or because of those pages I presents the results when searching, or if there are plugins for that

  21. Hi Israel, since you are trying to do a crawl from multiple domains, you probably want to inject the seeds first (nutch inject seeds), then replace the regex in the crawl-filter with a custom one. I believe there is also a regex filter that you may want to use. For RSS/Atom pages, you may want to build a custom indexing filter that parses out the content for indexing, if such a thing doesn't already exist.

  22. Hi Sujit,
    Thansk for your information.It was very knowlegable. I am trying to acquire web content and to do that we want to crawl a link and then convert the html content of the link into a text file. We know that nutch parses the url link and creates files like the crawldb,linkdb. I understand the actual content of the html is stored under segments/timestamped_folder/parse_text. My question is how dow we built our cown content aquisition engine and convert the web page into text.
    Once we have the etxt may be we can parse the text like html parser and then we can index it using lucene.
    Please Advice
    Thanks
    Anshu

  23. Thanks Anonymous. To answer your question, you could write a map-reduce job (model it on the class that backs the parse command, in my copy of nutch-1.0 it is ParseSegment, in nutch-2.0 it is ParserJob) - in your mapper define the conversion from HTML to plaintext.

  24. Hi, I followed your guide but I get an error when trying to index my own field: org.apache.solr.comoon.SolrException: ERROR: unknown field 'contingut'.

    How do you do to add an own field? The only solution I've found is defined above in the schema.xml. Can not create a field from a java file?

    Thanks!

  25. Yes, the only way to add your new field to the Solr index, as you've found out, is to define the index in the Solr schema.xml file. Currently, the Solr schema definition in Nutch (for 1.1 at least) explicitly defines each field name. If you are going to add many fields, you may want to consider using Solr's dynamic field feature - it will still require you to define the field name patterns once, so its not truly dynamic.

  26. its a great post , since very less details are avail about nutch --

    Ive gone through intranet crawling , i have merged segments and used readsegs to dump URLs in to one file called dump ,
    these URLS consist of image urls also, ex: http://localhost/abc.jpg
    my question is? can i get these pictures downloaded using hadoop or solr or nutch any tech if yes pls reply

  27. Hi Arun, never did this myself, but googled for "nutch download images" and this post on Allen Day's Blog came up on top. I think this may be what you are looking for?

  28. hello sujit i want to crawl and i have to fetch data within these tag i tried to to fetch using plugin but i only successful to fetch tag not data within it any tutorial related to it thanks in advance

  29. Hi, you are probably missing some logic in your ParseFilter.filter() implementation. If you know (and can depend on) the structure of your input page HTML, you can probably get away with code like I have written, otherwise you may need to write a more robust custom parser based on Nutch's html-parse plugin or if you don't mind adding an extra JAR dependency, using some third-party HTML parsing library such as Jericho.

  30. Hi,

    I try the tutorial above but i only made the htmlparserfilter part because i need to do more pre-processing tasks in data. When i activate my plugin in nutch-site.xml, the parse content disappear in hadoop files. I check the configuration files and i think they are fine. I change the pattern to match everything but nothing change. Any suggestion? PS: I am using Nutch 1.5.1 and Hadoop 1.0.3

  31. Hi Antonio, hard to tell what may be wrong, but maybe check for an exception inside the parse filter that may be sending empty parse results?

  32. I want to parse specific part of html page..for example div element with class = "details" . I know that I have to write a separate plugin for this purpose. Also, if want to parse multiple parts of every html page crawled by nutch, what should I do?For example, div element with class=details or uploadedBy or so on.

  33. Hi Abhijeet, you can get the HTML body from content, then use standard tools like XPath to extract content from various div elements, then stick them back into the metadata as a set of name-value pairs. You should be able to do this within a ParseFilter similar to my example.

  34. Sujit...thanks a lot for your response..I am using parse-html plugin which removes css, javascript and html tags and leaves only text. Because of this , html element is not available. Should I remove this plugin from nutch-default.xml?

    Also, I want to parse html pages on
    multiple elements with specific values of attributes.

    Your example talks about only one tag whereas I want to parse against multiple tags(like div, p, etc). Also, if possible, provide an example for xpath for extracting content from different elements.

    Thanks a lot

  35. Hi Abhijeet, yes, I think you may have to rethink your flow. Perhaps you can remove the html-parse plugin and do the HTML parsing later along with extracting the other tags. I don't have an example for parsing out multiple tags from XML/HTML handy, but there are plenty of examples on the web such as this one.

  36. Sujit..thanks for your help..It is really very helpful...I have parsed the webpage using htmlcleaner(for extracting elements with specific tags and attributes). I am crawling a website where every webpage displays a list of items. I am keeping every item in a json format and, therefore, for every webpage, I have a list of json objects. In other words, for every Nutch Document object, I have a list of json objects. Now, instead of whole page, I want to implement indexing on every item(from list) for each Nutch Docuement. Is it possible and if yes, please let me know the way.

  37. I can think of two ways of doing this. One approach is something similar to crawling RSS feeds with Nutch (link to my post about this). Here add a pre-processing step to split up the items at the fetch stage itself. Another approach could be, since you are storing a list of JSON objects for each document, you could very well store a single JSON object which is a list of JSON objects per document. Since your objective is to treat each of these items as a separately searchable object, you can then write a custom Solr publisher that writes out a document for each element of the list.

  38. Thanks again.. Inspired by your answers, I am following approach,

    In the plugin's parsing filter class, I am adding key-value pair to the metatag to the metadata of the NutchDocument object where value is an array of json object.

    I am creating a Solr publisher class and I have gone through one of your blog http://sujitpal.blogspot.in/2012/02/nutchgora-indexing-sections-and.html and found that it is something I am looking for. But it is written for 2.x version of Nutch and I am working on 1.6 verison. It will be great if you can provide guidance about how to write solr publisher for 1.6 version.

  39. You are welcome, glad I could help. However, I don't know much about Nutch 1.6 except that the input is a sequence file instead of a NoSQL database. The nice thing about Nutch 2.x is that the entire WebPage object is made available in the GoraMapper and GoraReducer so its easier to work with. But I think the Parse metadata which you are populating should be available in the input file to the Solr publisher. You should find example in Nutch documentation about the structure that you should expect. If you don't, ask on the Nutch Mailing lists, they should be able to help.

  40. Sujit,

    Thanks for your help and useful advice, I am on the verge of completing the assignment. You helped me in enhancing my knowledge base. Thanks a lot!!

  41. Cool! You are welcome, glad I could help.

  42. Thanks for your blog,it's very help ful to me , right now i am using nutch 2.x and i want to create custom plug-in for image Store and search but. i have no idea about that so how can i create... and implement in nutch. please help me

  43. Hi Vicky, you are welcome, glad it helped you. When you say store images, I am guessing you are planning on extracting features and metadata from incoming crawled images and making it searchable using these features? If so, you could store your image as a byte array (or a Base64 encoded text block if the API only allows for string). In your parse, you could use something like Apache Tika to pull out any location or tag metadata added by the photographer or image publisher. The code in this post does this for tags, so you can probably adapt this to your requirements.

  44. Hi Sujith,

    Your blog is very informative really, thanks for sharing your knowledge. Here am dealing with a problem to crawl the docs,pdfs, ppts,xls,xlsx file from web through Nutch 1.4 and Solr 4.3.1, i have configured things in regex_urlfilter.text file from Nutch but still not getting properly what happening and where can i found those docs if i could crawl the Docs and etc from web? Answer could be very appreciated.

    Thank you,
    Boyer.

  45. Thanks Santosh, glad you found it informative. I can't tell what the problem is from your comment, can you please elaborate (unless the rest of my comment gives you the answer)? If the question is where you will find the crawled documents at the end of the run, it would be in Solr. You will need to run a sequence of commands to do this - check out the Nutch Tutorial for more info. Also given the document types you want to crawl and make searchable, you may want to consider using the Tika parser plugin.

  46. Thank you Sujit Pal! I found this post very informative. I am new to this topic and I am trying to write a web crawler that extracts the news content from the news websites. Is it possible for you to tell me whether the development for nutch 2.x is the same as the one for nutch 1.x so that I can follow your blog and write my own plugin :)

  47. Hi Xiahao, I haven't worked much with Nutch 1.x so can't say for sure, but plugin development should be the same between 1.x and 2.x.

  48. Thank you for your quick response!
    I am new to nutch and I am trying to have a plugin that allows me to extract certain content under certain html tag. I noticed that in the previous comment you mentioned about some htmlparser that can be used. Is it possible for you to recommend me the steps to take in order to write this plugin. Based on what I have found out so far, there are quite a number of files to be changed and also the nutch has to compile again for the plugin to work. Is that correct? Thank you for your help!

  49. You're welcome. If you need XML like parsing for your HTML, you may want to check out Jericho, you feed it the entire content and then query it with a simple API. There /are/ quite a few files to modify - the post lists all the ones I needed to (I think this is a complete list). I don't think you need to recompile Nutch, if you can package up your custom code into a JAR and drop it into the $NUTCH_HOME/lib directory, it should be sufficient.

  50. I don't know what I am going to do without you! You are such a helper! Thank you. I have configured and build nutch on my PC. I just need to write the plugin and put it there is it?

  51. Pretty much, and update the configuration files.

  52. hi sujid, iam new in nutch.. i want to ask about creating plugin.. i had create one.. but it always said myplugin not present or inactive.. can you help me please

  53. Hi Indah, sorry about the delay. The only thing I can think of is that maybe you haven't registered it into plugin.includes in the nutch-site.xml? More details on this Nutch wiki page.

  54. This comment has been removed by the author.

  55. Hi Ruchi, its been a while since I used Nutch 2, so you would probably get better answers from the folks on the Nutch mailing list. IIRC the communication between Nutch2 and Solr is via the HTTP interface so if you have everything else running this should be quite simple. Not sure if you saw this wiki page on github, but looks fairly complete from what I remember, this came up as the first result from a google search "nutch2 solr".

  56. Hi dear, I am looking for a way to do focus crawling with Nutch so that it finds out and download some specific data from the web, e.g. find out specific images like all images available for a specific brand etc. How can we do that?

  57. Hi Asmat, you probably want a parse plugin that looks for and parses out what you want from the web pages, either using regular expressions or using a ML model that detects the specific type of images you are after.

  58. Can you please state all this in a little detail, or refer me to some blog post addressing this use case?

  59. The current blog post addresses the basic mechanics, although it is probably a much more trivial use case than the one you are looking for, but you can adapt the ideas. I am guessing you would limit your crawl to pages served by a particular manufacturer or group of manufacturers, so you would need a seed list. Then you would need some way to detect images in the page, maybe just looking for img tags. Once you detect an image, you either need some kind of rule based mechanism (which may vary per manufacturer) or train a ML model either on the pixels themselves or the neighboring caption, that would identify the image as one you want vs not. For the pixel based idea, here is a link to a talk I gave last year and the Github repo. For caption based classification take a look at the scikit-learn text classification tutorial on their website. Both examples are in Python, these are just to give you ideas, you probably want to build the components in Java - some Java libraries suggestions for this task maybe using DeepLearning4j and Weka, but there might be others. Another option could be wrap your trained model in a REST wrapper and call it that way.

  60. Thanks for the detailed reply. I am basically trying to build a system that will look for images and will make decision based on the file name and the surrounding text whether or not this is the required image. I am working in Java...

  61. Hi Sujit Pal,
    I have some issue regarding developing a nutch plugin. I want a bit of help via personal email, Could you help any?

  62. Hi Asmat, I don't want to publish my email as a comment where it can be scraped. If you send me a comment with your email on it, I will delete that comment and reply directly to you. Not sure if I can help, but happy to try.

  63. Hi Sujit,
    I am confused in writing a plugin i didnt find a good documentation any where on how to extract only certain tags from HTMl page(example div class=body) but when i saw you post i thought you might be the good person to conatact is there any way that you can help me or any documents that i can follow.


    Thanks in advance!

  64. Hi, it has been some time since I did this, hopefully this should still work. I based the work on a Nutch documentation example whose link I have in the post, if the instructions on my post are hard to understand or don't work for you, I would advise looking at that example, that is more likely to be updated as versions change.

