According to the Nutch2Roadmap wiki page, one of the features of Nutch 2.0 (as yet unreleased, but available in SVN) is storage abstraction. Instead of segment files, it can use MySQL or HBase as its backend datastore (support for Cassandra is also planned).
Support for multiple backends is achieved using GORA, an ORM framework (originally written for Nutch) that works against column databases. So changing backends would probably mean adding the appropriate GORA implementation JAR to Nutch's classpath (I haven't looked at the GORA code yet).
Currently, even though the code is pre-release, there is a working HBase backend, and adequate documentation on how to set it up. Since we use Cassandra as part of our crawl/indexing infrastructure, I figured it would be worth checking out, so once Nutch 2.0 is out, maybe we could use it with the Cassandra backend.
So this post is basically an attempt to figure out what Nutch does to the HBase datastore as each of its subcommands is run. You can find the list of subcommands here.
The first step is to download the Nutch 2.0 and GORA sources, and build them. This page has detailed instructions, which I followed almost to the letter. The only thing to remember is to set the GORA backend in conf/nutch-site.xml after generating the Nutch runtime.
Two other changes are to set http.agent.name and http.robots.agents in nutch-default.xml (so Nutch actually does the crawl), and hbase.rootdir in hbase-default.xml to something other than /tmp (to prevent data loss across system restarts).
I just ran a subset of Nutch commands (we use Nutch for crawling, not its indexing and search functionality), and looked at what happened in the HBase datastore as a result. The attempt was to understand what each Nutch command does and correlate it to the code, so I can write similar code to hook into various phases of the Nutch lifecycle.
First, we have to start up HBase so Nutch can write to it. Part of the Nutch/GORA integration instructions was to install HBase, so now we can start up a local instance and then log in to the HBase shell.
sujit@cyclone:~$ cd /opt/hbase-0.20.6
sujit@cyclone:hbase-0.20.6$ bin/start-hbase.sh
localhost: starting zookeeper, logging to /opt/hbase-0.20.6/bin/../logs/hbase-sujit-zookeeper-cyclone.hl.local.out
starting master, logging to /opt/hbase-0.20.6/bin/../logs/hbase-sujit-master-cyclone.hl.local.out
localhost: starting regionserver, logging to /opt/hbase-0.20.6/bin/../logs/hbase-sujit-regionserver-cyclone.hl.local.out
sujit@cyclone:hbase-0.20.6$ bin/hbase shell
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Version: 0.20.6, r965666, Mon Jul 19 15:48:07 PDT 2010
hbase(main):001:0> list
0 row(s) in 0.1090 seconds
hbase(main):002:0>
We use a single URL (this blog) as the seed URL. So we create a one-line file as shown below:
http://sujitpal.blogspot.com/
and then inject this URL into HBase:
sujit@cyclone:local$ bin/nutch inject /tmp/seed.txt
This results in a single table called "webpage" being created in HBase, with the following structure. I used list to list the tables, and scan to list the contents of the table. For ease of understanding, I reformatted the output manually into a JSON-like structure. Each leaf-level column (cell in HBase-speak) consists of a (key, timestamp, value) triplet, so if we ignore the timestamp, we could have written the first leaf more compactly as {fi : "\x00'\x8D\x00"}.
It might help to refer to the conf/gora-hbase-mapping.xml file in your Nutch runtime as you read this. If you haven't set up Nutch 2.0 locally, then this information is also available in the GORA_HBase wiki page.
webpage : {
key : "com.blogspot.sujitpal:http/",
f : {
fi : {
timestamp : 1293676557658,
value : "\x00'\x8D\x00"
},
ts : {
timestamp : 1293676557658,
value : "\x00\x00\x01-5!\x9D\xE5"
}
},
mk : {
_injmrk_ : {
timestamp : 1293676557658,
value : "y"
}
},
mtdt : {
_csh_ : {
timestamp : 1293676557658,
value : "x80\x00\x00"
}
},
s : {
s : {
timestamp : 1293676557658,
value : "x80\x00\x00"
}
}
}
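The row key in the listing above is the seed URL with its host name reversed. Here is a minimal Python sketch of that transformation, inferred only from the single key shown here. The function name `reverse_url` is made up for illustration, and the handling of ports and query strings is my assumption rather than something verified against the Nutch code.

```python
from urllib.parse import urlsplit

def reverse_url(url):
    """Turn a URL into a reversed-host row key like the one shown above."""
    parts = urlsplit(url)
    # Reverse the dot-separated host labels:
    # sujitpal.blogspot.com -> com.blogspot.sujitpal
    reversed_host = ".".join(reversed(parts.hostname.split(".")))
    key = reversed_host + ":" + parts.scheme
    # Assumption: an explicit port is appended after a second colon.
    if parts.port is not None:
        key += ":" + str(parts.port)
    # Assumption: path and query string follow unchanged.
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return key + path

print(reverse_url("http://sujitpal.blogspot.com/"))  # com.blogspot.sujitpal:http/
```

A key of this shape presumably keeps pages from the same domain next to each other in the table, since HBase stores rows sorted by key.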
I then run the generate command, which generates the fetchlist:
sujit@cyclone:local$ bin/nutch generate
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: done
GeneratorJob: generated batch id: 1293732622-2092819984
This creates an additional column "mk:_gnmrk_" containing the batch id, in the webpage table for the record keyed by the seed URL.
webpage : {
key : "com.blogspot.sujitpal:http/",
f : {
fi : {
timestamp : 1293676557658,
value : "\x00'\x8D\x00"
},
ts : {
timestamp : 1293676557658,
value : "\x00\x00\x01-5!\x9D\xE5"
}
},
mk : {
_injmrk_ : {
timestamp : 1293676557658,
value : "y"
},
_gnmrk_ : {
timestamp : 1293732629430,
value : "1293732622-2092819984"
}
},
mtdt : {
_csh_ : {
timestamp : 1293676557658,
value : "x80\x00\x00"
}
},
s : {
s : {
timestamp : 1293676557658,
value : "x80\x00\x00"
}
}
}
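The batch id looks like it has two parts: a Unix time in seconds and a second integer. That reading is my assumption from the value above, not something I have checked in the GeneratorJob code, but it is consistent with the "_gnmrk_" cell timestamp (1293732629430 ms), which is about seven seconds later. The helper name `split_batch_id` below is hypothetical.

```python
from datetime import datetime, timezone

def split_batch_id(batch_id):
    """Split a batch id of the form '<epoch seconds>-<number>' into its parts."""
    seconds, _, suffix = batch_id.partition("-")
    # Assumption: the first part is seconds since the Unix epoch (UTC).
    generated_at = datetime.fromtimestamp(int(seconds), tz=timezone.utc)
    return generated_at, int(suffix)

generated_at, suffix = split_batch_id("1293732622-2092819984")
print(generated_at.isoformat())  # 2010-12-30T18:10:22+00:00
print(suffix)                    # 2092819984
```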
Next I ran a fetch with the batch id returned by the generate command:
sujit@cyclone:local$ bin/nutch fetch 1293732622-2092819984
FetcherJob: starting
FetcherJob : timelimit set for : -1
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob: batchId: 1293732622-2092819984
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 1 records. Hit by time limit :0
fetching http://sujitpal.blogspot.com/
-finishing thread FetcherThread1, activeThreads=1
-finishing thread FetcherThread2, activeThreads=1
-finishing thread FetcherThread3, activeThreads=1
-finishing thread FetcherThread4, activeThreads=1
-finishing thread FetcherThread5, activeThreads=1
-finishing thread FetcherThread6, activeThreads=1
-finishing thread FetcherThread7, activeThreads=1
-finishing thread FetcherThread8, activeThreads=1
-finishing thread FetcherThread9, activeThreads=1
-finishing thread FetcherThread0, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues= 0, fetchQueues.totalSize=0
-activeThreads=0
FetcherJob: done
This creates some more columns, as shown below. It adds columns under the "f" column family, most notably the raw page content in "f:cnt", and a new "h" column family holding the HTTP response headers. It also adds a batch id marker ("_ftcmrk_") in the "mk" column family.
webpage : {
key : "com.blogspot.sujitpal:http/",
f : {
bas : {
timestamp : 1293732801833,
value : "http://sujitpal.blogspot.com/"
},
cnt : {
timestamp : 1293732801833,
value : "DOCTYPE html PUBLIC "-//W3C//DTD X...rest of page content"
},
fi : {
timestamp : 1293676557658,
value : "\x00'\x8D\x00"
},
prot : {
timestamp : 1293732801833,
value : "x02\x00\x00"
},
st : {
timestamp : 1293732801833,
value : "x00\x00\x00\x02"
},
ts : {
timestamp : 1293676557658,
value : "\x00\x00\x01-5!\x9D\xE5"
},
typ : {
timestamp : 1293732801833,
value : "application/xhtml+xml"
}
},
h : {
Cache-Control : {
timestamp : 1293732801833,
value : "private"
},
Content-Type : {
timestamp : 1293732801833,
value : "text/html; charset=UTF-8"
},
Date : {
timestamp : 1293732801833,
value : "Thu, 30 Dec 2010 18:13:21 GMT"
},
ETag : {
timestamp : 1293732801833,
value : "40bdf8b9-8c0a-477e-9ee4-b19995601dde"
},
Expires : {
timestamp : 1293732801833,
value : "Thu, 30 Dec 2010 18:13:21 GMT"
},
Last-Modified : {
timestamp : 1293732801833,
value : "Thu, 30 Dec 2010 15:01:20 GMT"
},
Server : {
timestamp : 1293732801833,
value : "GSE"
},
Set-Cookie : {
timestamp : 1293732801833,
value : "blogger_TID=130c0c57a66d0704;HttpOnly"
},
X-Content-Type-Options : {
timestamp : 1293732801833,
value : "nosniff"
},
X-XSS-Protection : {
timestamp : 1293732801833,
value : "1; mode=block"
}
},
mk : {
_injmrk_ : {
timestamp : 1293676557658,
value : "y"
},
_gnmrk_ : {
timestamp : 1293732629430,
value : "1293732622-2092819984"
},
_ftcmrk_ : {
timestamp : 1293732801833,
value : "1293732622-2092819984"
}
},
mtdt : {
_csh_ : {
timestamp : 1293676557658,
value : "x80\x00\x00"
}
},
s : {
s : {
timestamp : 1293676557658,
value : "x80\x00\x00"
}
}
}
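Some of the values in these listings are raw bytes, which the HBase shell prints with non-printable bytes escaped as \xNN. As a rough sketch, assuming the "f:fi" and "f:ts" values are big-endian integers (a 4-byte int and an 8-byte long respectively; I have not confirmed this in the GORA code), they can be decoded in Python like this:

```python
# Values copied from the listing above, written as Python bytes literals.
fetch_interval_raw = b"\x00'\x8D\x00"          # f:fi
fetch_time_raw = b"\x00\x00\x01-5!\x9D\xE5"    # f:ts

# Assumption: both are big-endian integers.
fetch_interval = int.from_bytes(fetch_interval_raw, "big")
fetch_time_ms = int.from_bytes(fetch_time_raw, "big")

print(fetch_interval)           # 2592000
print(fetch_interval // 86400)  # 30
print(fetch_time_ms)            # 1293676551653
```

Read this way, "f:fi" would be an interval of 30 days expressed in seconds, and "f:ts" would be a time in milliseconds about six seconds before the cell's own timestamp (1293676557658), which at least makes the interpretation plausible.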
Finally we parse the fetched content. This extracts the links and parses the text content out of the HTML.
sujit@cyclone:local$ bin/nutch parse 1293732622-2092819984
ParserJob: starting
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: batchId: 1293732622-2092819984
ParserJob: success
This results in more columns being written to the webpage table. The parse step stores the links extracted from the page in the "ol" (outlinks) column family and the parsed content in the "p" column family. It also adds a parse marker ("__prsmrk__") to the "mk" column family.
webpage : {
key : "com.blogspot.sujitpal:http/",
f : {
bas : {
timestamp : 1293732801833,
value : "http://sujitpal.blogspot.com/"
},
cnt : {
timestamp : 1293732801833,
value : "DOCTYPE html PUBLIC "-//W3C//DTD X...rest of page content"
},
fi : {
timestamp : 1293676557658,
value : "\x00'\x8D\x00"
},
prot : {
timestamp : 1293732801833,
value : "x02\x00\x00"
},
st : {
timestamp : 1293732801833,
value : "x00\x00\x00\x02"
},
ts : {
timestamp : 1293676557658,
value : "\x00\x00\x01-5!\x9D\xE5"
},
typ : {
timestamp : 1293732801833,
value : "application/xhtml+xml"
}
},
h : {
Cache-Control : {
timestamp : 1293732801833,
value : "private"
},
Content-Type : {
timestamp : 1293732801833,
value : "text/html; charset=UTF-8"
},
Date : {
timestamp : 1293732801833,
value : "Thu, 30 Dec 2010 18:13:21 GMT"
},
ETag : {
timestamp : 1293732801833,
value : "40bdf8b9-8c0a-477e-9ee4-b19995601dde"
},
Expires : {
timestamp : 1293732801833,
value : "Thu, 30 Dec 2010 18:13:21 GMT"
},
Last-Modified : {
timestamp : 1293732801833,
value : "Thu, 30 Dec 2010 15:01:20 GMT"
},
Server : {
timestamp : 1293732801833,
value : "GSE"
},
Set-Cookie : {
timestamp : 1293732801833,
value : "blogger_TID=130c0c57a66d0704;HttpOnly"
},
X-Content-Type-Options : {
timestamp : 1293732801833,
value : "nosniff"
},
X-XSS-Protection : {
timestamp : 1293732801833,
value : "1; mode=block"
}
},
mk : {
_injmrk_ : {
timestamp : 1293676557658,
value : "y"
},
_gnmrk_ : {
timestamp : 1293732629430,
value : "1293732622-2092819984"
},
_ftcmrk_ : {
timestamp : 1293732801833,
value : "1293732622-2092819984"
},
__prsmrk__ : {
timestamp : 1293732957501,
value : "1293732622-2092819984"
}
},
mtdt : {
_csh_ : {
timestamp : 1293676557658,
value : "x80\x00\x00"
}
},
s : {
s : {
timestamp : 1293676557658,
value : "x80\x00\x00"
}
},
ol : {
http://pagead2.googlesyndication.com/pagead/show_ads.js : {
timestamp : 1293732957501,
value : ""
},
http://sujitpal.blogspot.com/ : {
timestamp : 1293732957501,
value : "Home"
},
http://sujitpal.blogspot.com/2005_03_01_archive.html : {
timestamp : 1293732957501,
value : "March"
},
// ... (more outlinks below) ...
},
p : {
c : {
timestamp : 1293732957501,
value : "Salmon Run skip to main ... (rest of parsed content)"
},
sig : {
timestamp : 1293732957501,
value : "cW\xA5\xB7\xDD\xD3\xBF`\x80oYR8\x1F\x80\x16"
},
st : {
timestamp : 1293732957501,
value : "\x02\x00\x00"
},
t : {
timestamp : 1293732957501,
value : "Salmon Run"
},
s : {
timestamp : 1293732629430,
value : "?\x80\x00\x00"
}
}
}
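Each command run so far has left a marker in the "mk" column family, so the set of markers on a row gives a rough picture of how far that page has progressed. The sketch below illustrates the idea with a plain Python dict standing in for a row. The stage names and the `completed_stages` helper are my own, not part of Nutch; only the marker qualifiers are copied from the listings above.

```python
# Marker qualifiers as they appear in the "mk" column family above,
# in the order the commands were run. Stage names are illustrative.
STAGE_MARKERS = [
    ("injected", "_injmrk_"),
    ("generated", "_gnmrk_"),
    ("fetched", "_ftcmrk_"),
    ("parsed", "__prsmrk__"),
]

def completed_stages(mk_columns):
    """Return the stage names whose marker is present in a row's mk family."""
    return [stage for stage, marker in STAGE_MARKERS if marker in mk_columns]

# The mk family of the seed URL row after the parse step.
row_mk = {
    "_injmrk_": "y",
    "_gnmrk_": "1293732622-2092819984",
    "_ftcmrk_": "1293732622-2092819984",
    "__prsmrk__": "1293732622-2092819984",
}
print(completed_stages(row_mk))  # ['injected', 'generated', 'fetched', 'parsed']
```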
We then run the updatedb command to add the outlinks discovered during the parse to the list of URLs to be fetched.
sujit@cyclone:local$ bin/nutch updatedb
DbUpdaterJob: starting
DbUpdaterJob: done
This results in 152 rows in the HBase table. Each of the additional rows corresponds to an outlink discovered during the parse stage above.
hbase(main):010:0> scan "webpage"
...
152 row(s) in 1.0400 seconds
hbase(main):011:0>
We can then go back to running generate, fetch, parse and updatedb, in that order, until we have crawled to the desired depth.
That's all for today. Happy New Year, and I hope you all had fun during the holidays. As I mentioned above, this exercise was for me to understand what Nutch does to the HBase datastore when each command is invoked. In the coming weeks, I plan on using this information to write some plugins that would drop "user" data into the database, and use it in later steps.
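One round of that cycle could be scripted roughly as follows. This is only a sketch: it assumes the generate command prints the batch id on standard output in the format seen in the log earlier (it may well go to a log file or stderr in your setup), and the `run_round` and `extract_batch_id` names are made up for illustration. The `run_round` function is defined but not called here, since it needs a working Nutch runtime.

```python
import re
import subprocess

BATCH_ID_RE = re.compile(r"generated batch id:\s*(\S+)")

def extract_batch_id(generate_output):
    """Pull the batch id out of the text printed by 'bin/nutch generate'."""
    match = BATCH_ID_RE.search(generate_output)
    if match is None:
        raise ValueError("no batch id found in generate output")
    return match.group(1)

def run_round(nutch="bin/nutch"):
    """Run one generate/fetch/parse/updatedb round and return the batch id."""
    # Assumption: the batch id line is written to stdout.
    generated = subprocess.run([nutch, "generate"], check=True,
                               capture_output=True, text=True)
    batch_id = extract_batch_id(generated.stdout)
    subprocess.run([nutch, "fetch", batch_id], check=True)
    subprocess.run([nutch, "parse", batch_id], check=True)
    subprocess.run([nutch, "updatedb"], check=True)
    return batch_id

# Try the parsing step on the generate output shown earlier in this post.
sample = ("GeneratorJob: done\n"
          "GeneratorJob: generated batch id: 1293732622-2092819984\n")
print(extract_batch_id(sample))  # 1293732622-2092819984
```

Calling `run_round()` once per desired depth level, from the Nutch runtime directory, would repeat the cycle.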
How can I execute a query string to search data with Nutch 2.0 using HBase? In Nutch 1.0 and 1.1, I can show the results through the web page UI.
Hi Phạm, if you are looking for a similar interface to Nutch 1.x, you could use the SolrIndexerJob to generate the index out to a Solr instance, and use its UI to view it. But since the data is in HBase, I think a better approach would be to just query the table.
Dear Sujit,
I cannot get HBase to work as the Gora backend for Nutch when running the updatedb command after the generate/fetch/parse ones.
See http://techvineyard.blogspot.com/2011/01/trying-nutch-20-hbase-storage.html for the details.
Any idea about tweaking HBase to make it run properly with limited memory?
Hi Alexis, no sorry, don't know much about HBase tuning. Thanks for your initial writeup on how to set up Nutch and Eclipse though (did not realize it was you at first :-)). I see you are planning to take a look at Cassandra - do publish your findings - we are a Cassandra shop so we would prefer to run Nutch with Cassandra - I figured that since Nutch uses GORA, I could just do development with HBase and switch to Cassandra when documentation becomes available.
I was interested in the Cassandra backend because I could not get Nutch to work with HBase at a scale bigger than just testing.
So here are my findings. This is a very alpha version of the gora-cassandra module. Don't hesitate to run some tests.
The patch:
https://issues.apache.org/jira/browse/GORA-22
The "user" guide:
http://techvineyard.blogspot.com/2010/12/build-nutch-20.html#Cassandra
More info:
http://techvineyard.blogspot.com/2011/02/gora-orm-framework-for-hadoop-jobs.html#Cassandra_in_Gora
Thank you Alexis, much appreciated. I will take your patch and user guide and try it out and let you know what I find. Many thanks for all your hard work on this.
Please give it a shot again. A new version of the code is now in trunk.
Thanks Alexis, sorry for not getting back to you earlier.
Thanks for the kind words Himanshu, glad it helped.
Dear Sujit, what if I only want outlines to be stored in HBase at parse time, and not the parsed text? Where should I make the change in the source code and recompile?
Hi Suyash, it's been a while since I used Nutch, but my guess would be the HtmlParser. However, my preference would be to write a parse plugin that extracts the outline in addition to the content, as I describe in this blog post. My understanding (not 100% sure of this) is that this runs after Nutch's own parse has completed, so you have access to the content, which you can post-process to extract whatever you want. But if you are looking to save space, this may not be what you want. Also take a look at this Stack Overflow question and the reply.