Saturday, January 07, 2012

Exploring Nutch-GORA with Cassandra

Background

I started looking at this early last year, with a view to building an incremental indexing pipeline for our Lucene indexes. For most of last year, a group of us were building an integrated content ingestion and search service around our proprietary concept search algorithms, using Cassandra and Solr. During the initial design phase, I proposed a design for the indexing pipeline that was kind of where I wanted to go with Nutch/GORA. However, my main focus for the project was Solr and search, and for various reasons, the backend did not turn out quite the way I had hoped.

While I worked on Solr quite heavily and learned a lot over the past year, I did not get time to work on (and learn how to use) Cassandra. It's biting me a bit at the moment, now that the project is winding down and we are uncovering integration issues - our index is now quite large, and many times the issues can be fixed long-term by fixing the parsing code, and (additionally) short-term by updating the data in Cassandra and republishing to Solr. Being able to do the latter would have significantly decreased turnaround time in quite a few cases.

So I decided to resurrect this project: using Nutch/GORA with Cassandra to build a pipeline that can work with content from different sources - web crawls, feeds, content management systems, and semi-structured content (XML/JSON) from external providers - and incrementally index it all into a single large Solr index. Mainly because I believe this would be a nice generic solution, but a nice side effect would be gaining familiarity with Cassandra, as well as Nutch, GORA and Hadoop. So, hopefully (unless I get sidetracked again by something shinier :-)), the next few posts will be about this project.

In this post, I follow a path similar to my previous experiment with Nutch/GORA (previously Nutch 2.0 trunk, now in its own branch) and HBase, basically running the various nutch sub-commands to crawl a small external site and finally building the index using Nutch's solrindex sub-command. At each stage, I run some Python/Pycassa scripts to see what went into the Cassandra database, mainly to understand what each subcommand does, so I can modify/extend the behavior for my purposes if required.

Setup

For setup, I followed this excellent Tech Vineyard post almost to the letter. Broadly here are the steps:

  • Download and install Cassandra - I downloaded Cassandra version 1.0.6 from here. Installation is simply untarring into some location on disk.
  • Download Nutch from the NutchGora branch and run the default ant target; this creates the runtime/local subdirectory which can be used to run Nutch.
  • Download GORA from the GORA trunk. I couldn't get the ant/ivy build to work, but I was able to compile the project and build the GORA JAR using mvn.
  • Follow the instructions in this TechVineyard post to set up Nutch with Cassandra. Of particular importance is setting the storage engine in nutch-site.xml (the storage.data.store.class property, which should point at GORA's Cassandra store) and making sure that the gora-cassandra-mapping.xml file is in your runtime/local/conf directory.
  • Set the http.agent.name (your crawler name) and the http.robot.agents in nutch-default.xml in runtime/local/conf; otherwise nutch will complain when you try to run it.
  • Download Hector, a Cassandra Java client API which GORA seems to use. We need it for some of the library dependencies. I could probably have pulled these libraries in via GORA if I had gotten ivy to work, but since I couldn't, I took them from Hector instead.
  • In the runtime/local/lib directory, replace cassandra-thrift-0.8.jar with cassandra-thrift-1.0.1.jar from the Cassandra distribution. From the Hector distribution, copy over the hector-core-1.0-1.jar and guava-r09.jar. Also copy over the gora-cassandra-0.2-SNAPSHOT.jar (and perhaps the other gora JARs as well, but so far I haven't had to).
  • Download Solr from the Solr download page. I got Solr 3.5.0. Copy runtime/local/conf/schema.xml over to Solr's example/solr/conf directory. This replaces Solr's default schema with Nutch's.
  • Download Pycassa from the Pycassa download page - I needed this to write little scripts to peek inside Cassandra as I executed the various nutch subcommands. Installation instructions can be found here.
  • Start Cassandra with $CASSANDRA_HOME/bin/cassandra -f. This sets up a Cassandra server on port 9160. Cassandra can be terminated with Ctrl-C.
  • Start Solr with java -jar start.jar from the solr/example directory. This starts up a Jetty server running Solr on port 8983.

At this point we are ready to run the nutch subcommands and observe the effects of running these commands in the Cassandra database.

Nutch Inject

The nutch inject subcommand allows you to add a list of seed URLs for your crawl. For my test, I used a small site I was familiar with from earlier crawler testing. Nutch expects the seed URLs to be supplied as a directory of flat files. Each line in a flat file consists of the seed URL, optionally followed by name-value metadata elements. Here is my seed list file.

# Source: web_seeds/somesite.com
http://www.somesite.com u_idx=web

I want all pages crawled from this seed to have an extra metadata attribute u_idx set to "web". This is so I can differentiate between crawled content and other sources in my Solr index.

To inject this URL we run the following command. This command takes the URLs from the seed list(s) and puts them on Nutch's list of URLs to fetch.

sujit@cyclone:local$ bin/nutch inject ../../../web_seeds
InjectorJob: starting
InjectorJob: urlDir: ../../../web_seeds
InjectorJob: finished

At this point Cassandra has created a keyspace "webpage", and its "f" column family (fetched content) contains a single record, as shown below (this is the output of the display_webpage.py script, provided further down). Notice that our user-supplied metadata "u_idx" is available in the webpage["sc"]["mtdt"]["u_idx"] column.

webpage: {
.. key: "com.somesite.www:http/" ,
.. f: {
.... fi : "2592000" ,
.... s : "1.0" ,
.... ts : "1325623018630" ,
.. },
.. p: {
.. },
.. sc: {
.... mk : {
...... _injmrk_ : "y" ,
.... }
.... mtdt : {
...... _csh_ : "?" ,
...... u_idx : "web" ,
.... }
.. }
}
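
If you just want a quick peek at a single row rather than a full dump, Pycassa makes that easy. Here is a minimal sketch (assuming the same keyspace and row key shown above, and Pycassa's default connection to localhost:9160):

import pycassa
from pycassa.pool import ConnectionPool

# read the seed row back from the "sc" super column family
pool = ConnectionPool("webpage")
sc = pycassa.ColumnFamily(pool, "sc")
row = sc.get("com.somesite.www:http/")
print row["mtdt"]["u_idx"]   # expecting the "web" value we injected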

Nutch Generate

The nutch generate subcommand takes the outlinks discovered in the previous cycle, promotes them to the fetch list, and returns a batch ID for this cycle. You will need this batch ID for subsequent calls in this cycle. Here's the command:

sujit@cyclone:local$ bin/nutch generate
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: done
GeneratorJob: generated batch id: 1325623123-1500197250

Once more we look at what's in Cassandra at the end of this run. Notice that not much has changed, except that a webpage["sc"]["mk"]["_gnmrk_"] column has been created, containing the batch ID reported above.

webpage: {
.. key: "com.somesite.www:http/" ,
.. f: {
.... fi : "2592000" ,
.... s : "1.0" ,
.... ts : "1325623018630" ,
.. },
.. p: {
.. },
.. sc: {
.... mk : {
...... _gnmrk_ : "1325623123-1500197250" ,
...... _injmrk_ : "y" ,
.... }
.... mtdt : {
...... _csh_ : "?" ,
...... u_idx : "web" ,
.... }
.. }
}

Nutch Fetch

Next we execute a fetch. The nutch fetch subcommand crawls the pages listed in the "f" column family and writes the fetched content into new columns in "f". We need to pass in the batch ID from the previous step. The command looks like this:

sujit@cyclone:local$ bin/nutch fetch 1325623123-1500197250
FetcherJob: starting
FetcherJob : timelimit set for : -1
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob: batchId: 1325623123-1500197250
Using queue mode : byHost
Fetcher: threads: 10
QueueFeeder finished: total 1 records. Hit by time limit :0
fetching http://www.somesite.com/
-finishing thread FetcherThread1, activeThreads=1
-finishing thread FetcherThread2, activeThreads=1
-finishing thread FetcherThread3, activeThreads=1
-finishing thread FetcherThread4, activeThreads=1
-finishing thread FetcherThread5, activeThreads=1
-finishing thread FetcherThread6, activeThreads=1
-finishing thread FetcherThread7, activeThreads=1
-finishing thread FetcherThread8, activeThreads=1
-finishing thread FetcherThread9, activeThreads=1
-finishing thread FetcherThread0, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues= 0, fetchQueues.totalSize=0
-activeThreads=0
FetcherJob: done

As you can see, the fetch command wrote out new columns in the "f" and "sc" column families. These new columns are additional data and page metadata found during the fetch. But since we started with only one injected URL in "f", there is still only one record in "f".

webpage: {
.. key: "com.somesite.www:http/" ,
.. f: {
.... bas : "http://www.somesite.com/" ,
.... cnt : "<html>content omitted here</html>",
.... fi : "2592000" ,
.... s : "1.0" ,
.... st : "2" ,
.... ts : "1325623263386" ,
.... typ : "text/html" ,
.. },
.. p: {
.. },
.. sc: {
.... h : {
...... Accept-Ranges : "bytes" ,
...... Connection : "close" ,
...... Content-Length : "64557" ,
...... Content-Location : "http://www.somesite.com/index.htm" ,
...... Content-Type : "text/html" ,
...... Date : "Tue, 03 Jan 2012 20:41:02 GMT" ,
...... ETag : None ,
...... Last-Modified : "Fri, 30 Dec 2011 20:20:45 GMT" ,
...... MicrosoftOfficeWebServer : "5.0_Pub" ,
...... Server : "Microsoft-IIS/6.0" ,
...... X-Powered-By : "ASP.NET" ,
.... }
.... mk : {
...... _ftcmrk_ : "1325623123-1500197250" ,
...... _gnmrk_ : "1325623123-1500197250" ,
...... _injmrk_ : "y" ,
.... }
.... mtdt : {
...... _csh_ : "?" ,
...... u_idx : "web" ,
.... }
.... prs : {
...... code : "1" ,
...... lastModified : "0" ,
.... }
.. }
}

Nutch Parse

The nutch parse subcommand loops through the pages in the "f" column family (one in our case), extracts the parsed text and title into the "p" column family, and records the outgoing links it finds under webpage["sc"]["ol"]. Here is the command:

sujit@cyclone:local$ bin/nutch parse 1325623123-1500197250
ParserJob: starting
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: batchId: 1325623123-1500197250
Parsing http://www.somesite.com/
ParserJob: success

And the data inside Cassandra now looks like this:

webpage: {
.. key: "com.somesite.www:http/" ,
.. f: {
.... bas : "http://www.somesite.com/" ,
.... cnt : "<html>HTML content omitted</html>",
.... fi : "2592000" ,
.... s : "1.0" ,
.... st : "2" ,
.... ts : "1325623263386" ,
.... typ : "text/html" ,
.. },
.. p: {
.... c : "a very long plaintext string representing parsed content",
.... sig : "gm" ,
.... t : "Title of document extracted from HTML/HEAD/TITLE tag",
.. },
.. sc: {
.... h : {
...... Accept-Ranges : "bytes" ,
...... Connection : "close" ,
...... Content-Length : "64557" ,
...... Content-Location : "http://www.somesite.com/index.htm" ,
...... Content-Type : "text/html" ,
...... Date : "Tue, 03 Jan 2012 20:41:02 GMT" ,
...... ETag : None ,
...... Last-Modified : "Fri, 30 Dec 2011 20:20:45 GMT" ,
...... MicrosoftOfficeWebServer : "5.0_Pub" ,
...... Server : "Microsoft-IIS/6.0" ,
...... X-Powered-By : "ASP.NET" ,
.... }
.... mk : {
...... __prsmrk__ : "1325623123-1500197250" ,
...... _ftcmrk_ : "1325623123-1500197250" ,
...... _gnmrk_ : "1325623123-1500197250" ,
...... _injmrk_ : "y" ,
.... }
.... mtdt : {
...... _csh_ : "?" ,
...... u_idx : "web" ,
.... }
.... ol : {
...... http://www.somesite.com/ : "" ,
...... http://www.somesite.com/10_somesite_rules.htm : "10SomediseaseRules" ,
...... http://www.somesite.com/Dr.Norman.htm : "Dr.Norman" ,
...... http://www.somesite.com/FAQ.htm : "FrequentQuestions" ,
...... http://www.somesite.com/FHH.htm : "FHH" ,
...... http://www.somesite.com/MEN-Syndrome.htm : "MEN Syndromes" ,
...... http://www.somesite.com/MIRP-Surgery.htm : "MIRPMiniSurgery" ,
...... http://www.somesite.com/MIRP-publications.htm : "Publications" ,
...... http://www.somesite.com/Somedisease-Surgeon-Map.htm : "PatientMap" ,
...... http://www.somesite.com/Somedisease-Surgeon.htm : "BecomeAPatient" ,
...... http://www.somesite.com/Re-Operation.htm : "Re-Operate" ,
...... http://www.somesite.com/Sensipar-high-calcium.htm : "Sensipar" ,
...... http://www.somesite.com/about-Somedisease.htm : "Norman Somedisease Center" ,
...... http://www.somesite.com/age.htm : "WhoGetsIt?" ,
...... http://www.somesite.com/causes.htm : "WhatCausesIt?" ,
...... http://www.somesite.com/contents.htm : "TableofContents" ,
...... http://www.somesite.com/diagnosis.htm : "Diagnosis" ,
...... http://www.somesite.com/disclaimer.htm : "Disclaimer" ,
...... http://www.somesite.com/endocrinology.htm : "WhatExpertsSay" ,
...... http://www.somesite.com/finding-somesite.htm : "FindingtheTumor" ,
...... http://www.somesite.com/high-calcium.htm : "HighBloodCalcium" ,
...... http://www.somesite.com/hypersomesiteism-diagnosis.htm : "Diagnosis-ADVANCED" ,
...... http://www.somesite.com/hypersomesiteism-videos.htm : "TeachingVideos" ,
...... http://www.somesite.com/hyposomesiteism.htm : "Hyp0somesite" ,
...... http://www.somesite.com/index.htm : "" ,
...... http://www.somesite.com/low-calcium.htm : "Surgeon-Induced HypOsomesiteism." ,
...... http://www.somesite.com/low-vitamin-d.htm : "LowVitaminD" ,
...... http://www.somesite.com/mini-surgery.htm : "Mini-Surgery" ,
...... http://www.somesite.com/osteoporosis.htm : "Osteoporosis" ,
...... http://www.somesite.com/osteoporosis2.htm : "Life insurance companies" ,
...... http://www.somesite.com/somesite-adenoma.htm : "DoIHaveJustOne?" ,
...... http://www.somesite.com/somesite-anatomy.htm : "Somedisease Anatomy" ,
...... http://www.somesite.com/somesite-cancer.htm : "SomediseaseCancer" ,
...... http://www.somesite.com/somesite-disease.htm : "Hypersomesiteism" ,
...... http://www.somesite.com/somesite-function.htm : "NormalFunction" ,
...... http://www.somesite.com/somesite-pictures.htm : "SomediseasePictures" ,
...... http://www.somesite.com/somesite-surgery.htm : "SurgeryVideo" ,
...... http://www.somesite.com/somesite-symptoms-cartoon.htm : "SymptomCartoon" ,
...... http://www.somesite.com/somesite-symptoms.htm : "Symptoms" ,
...... http://www.somesite.com/somesite.htm : "SomediseaseIntro" ,
...... http://www.somesite.com/paratiroide : "" ,
...... http://www.somesite.com/paratiroide/index.html : "Espanol/Spanish" ,
...... http://www.somesite.com/pregnancy.htm : "Somedisease disease during Pregnancy." ,
...... http://www.somesite.com/sestamibi.htm : "SestamibiScan" ,
...... http://www.somesite.com/surgery_cure_rates.htm : "SurgeryCureRates" ,
...... http://www.somesite.com/testimonials.htm : "WhatPatientsSay" ,
...... http://www.somesite.com/treatment-surgery.htm : "Treatment/Surgery" ,
...... http://www.somesite.com/who's_eligible.htm : "Who is Eligible to Have a MIRP Mini-Somedisease Operation?" ,
.... }
.... pas : {
...... majorCode : "1" ,
...... minorCode : "0" ,
.... }
.... prs : {
...... code : "1" ,
...... lastModified : "0" ,
.... }
.. }
}

The parse command parses the fetched page content in webpage["f"]["cnt"] and places the extracted plain text and title into the "p" column family (the "c" and "t" columns above). In addition, it places the outgoing HTML links found in the page into the webpage["sc"]["ol"] column.

At this point there is still only 1 record (the seed URL) in Cassandra.
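
Since these outlinks drive the next stage, it is handy to be able to list them directly. Here is a minimal Pycassa sketch (same assumptions as the earlier sketch) that dumps the contents of webpage["sc"]["ol"] for the seed page:

import pycassa
from pycassa.pool import ConnectionPool

# list the outlinks the parse step recorded for the seed page
pool = ConnectionPool("webpage")
sc = pycassa.ColumnFamily(pool, "sc")
row = sc.get("com.somesite.www:http/")
for url, anchor_text in row.get("ol", {}).items():
  print url, "->", anchor_text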

Nutch UpdateDB

The nutch updatedb subcommand takes the webpage["sc"]["ol"] values from the previous stage and adds them as new rows in the "f" column family, so they can be fetched in the next crawl cycle.

sujit@cyclone:local$ bin/nutch updatedb
DbUpdaterJob: starting
DbUpdaterJob: done

At this point, I switched to counting the number of rows in the "f" and "p" column families, since the full JSON-style dump would be too large to read. At the end of the updatedb, I have the following URLs in the "f" and "p" column families.

sujit@cyclone:scripts$ ./count_pages.py f
com.somesite.www:http/endocrinology.htm
com.somesite.www:http/causes.htm
com.somesite.www:http/MEN-Syndrome.htm
com.somesite.www:http/Somedisease-Surgeon.htm
com.somesite.www:http/Sensipar-high-calcium.htm
com.somesite.www:http/who's_eligible.htm
com.somesite.www:http/somesite-cancer.htm
com.somesite.www:http/mini-surgery.htm
com.somesite.www:http/surgery_cure_rates.htm
com.somesite.www:http/
com.somesite.www:http/age.htm
com.somesite.www:http/index.htm
com.somesite.www:http/10_somesite_rules.htm
com.somesite.www:http/somesite-function.htm
com.somesite.www:http/contents.htm
com.somesite.www:http/FAQ.htm
com.somesite.www:http/paratiroide
com.somesite.www:http/MIRP-publications.htm
com.somesite.www:http/Dr.Norman.htm
com.somesite.www:http/MIRP-Surgery.htm
com.somesite.www:http/somesite-adenoma.htm
com.somesite.www:http/somesite-disease.htm
com.somesite.www:http/FHH.htm
com.somesite.www:http/treatment-surgery.htm
com.somesite.www:http/somesite-surgery.htm
com.somesite.www:http/hypersomesiteism-diagnosis.htm
com.somesite.www:http/Somedisease-Surgeon-Map.htm
com.somesite.www:http/hyposomesiteism.htm
com.somesite.www:http/somesite-symptoms.htm
com.somesite.www:http/somesite-symptoms-cartoon.htm
com.somesite.www:http/somesite-pictures.htm
com.somesite.www:http/osteoporosis.htm
com.somesite.www:http/somesite.htm
com.somesite.www:http/about-Somedisease.htm
com.somesite.www:http/Re-Operation.htm
com.somesite.www:http/diagnosis.htm
com.somesite.www:http/pregnancy.htm
com.somesite.www:http/high-calcium.htm
com.somesite.www:http/somesite-anatomy.htm
com.somesite.www:http/low-vitamin-d.htm
com.somesite.www:http/testimonials.htm
com.somesite.www:http/low-calcium.htm
com.somesite.www:http/finding-somesite.htm
com.somesite.www:http/hypersomesiteism-videos.htm
com.somesite.www:http/paratiroide/index.html
com.somesite.www:http/sestamibi.htm
com.somesite.www:http/osteoporosis2.htm
com.somesite.www:http/disclaimer.htm
#-records in f: 48

sujit@cyclone:scripts$ ./count_pages.py p
com.somesite.www:http/
#-records in p: 1
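
One thing worth noting about the listings above: the row keys are Nutch's domain-reversed form of each URL (host segments reversed, then the protocol, then the path). Nutch computes these internally; the rough sketch below is not Nutch's actual code, just an illustration to help read the keys:

from urlparse import urlparse

def reverse_url(url):
  # approximate the row key format seen above: reversed host, protocol, then path
  u = urlparse(url)
  host = ".".join(reversed(u.hostname.split(".")))
  port = (":%d" % u.port) if u.port else ""
  return "%s:%s%s%s" % (host, u.scheme, port, u.path or "/")

print reverse_url("http://www.somesite.com/FAQ.htm")
# com.somesite.www:http/FAQ.htm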

Iterating

We then iterate through the generate, fetch, parse and updatedb subcommands two more times to crawl the site to a depth of 3. Shown below are the nutch console logs for the second iteration.

sujit@cyclone:local$ bin/nutch generate
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: done
GeneratorJob: generated batch id: 1325709400-776802111
sujit@cyclone:local$ bin/nutch fetch 1325709400-776802111
FetcherJob: starting
FetcherJob : timelimit set for : -1
FetcherJob: threads: 10
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob: batchId: 1325709400-776802111
Using queue mode : byHost
Fetcher: threads: 10
fetching http://www.somesite.com/somesite.htm
QueueFeeder finished: total 47 records. Hit by time limit :0
-activeThreads=10, spinWaiting=9, fetchQueues= 1, fetchQueues.totalSize=46
fetching http://www.somesite.com/Somedisease-Surgeon.htm
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=45
fetching http://www.somesite.com/paratiroide/index.html
fetching http://www.somesite.com/diagnosis.htm
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=43
fetching http://www.somesite.com/somesite-adenoma.htm
fetching http://www.somesite.com/age.htm
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=41
fetching http://www.somesite.com/FHH.htm
-activeThreads=10, spinWaiting=9, fetchQueues= 1, fetchQueues.totalSize=40
fetching http://www.somesite.com/treatment-surgery.htm
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=39
fetching http://www.somesite.com/who's_eligible.htm
fetching http://www.somesite.com/somesite-disease.htm
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=37
fetching http://www.somesite.com/FAQ.htm
fetching http://www.somesite.com/finding-somesite.htm
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=35
fetching http://www.somesite.com/hypersomesiteism-diagnosis.htm
-activeThreads=10, spinWaiting=9, fetchQueues= 1, fetchQueues.totalSize=34
fetching http://www.somesite.com/index.htm
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=33
fetching http://www.somesite.com/somesite-pictures.htm
fetching http://www.somesite.com/Somedisease-Surgeon-Map.htm
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=31
fetching http://www.somesite.com/mini-surgery.htm
fetching http://www.somesite.com/about-Somedisease.htm
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=29
fetching http://www.somesite.com/disclaimer.htm
-activeThreads=10, spinWaiting=9, fetchQueues= 1, fetchQueues.totalSize=28
fetching http://www.somesite.com/somesite-function.htm
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=27
fetching http://www.somesite.com/paratiroide
fetching http://www.somesite.com/low-vitamin-d.htm
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=25
fetching http://www.somesite.com/somesite-symptoms-cartoon.htm
fetching http://www.somesite.com/sestamibi.htm
-activeThreads=10, spinWaiting=9, fetchQueues= 1, fetchQueues.totalSize=23
fetching http://www.somesite.com/osteoporosis.htm
-activeThreads=10, spinWaiting=9, fetchQueues= 1, fetchQueues.totalSize=22
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=22
fetching http://www.somesite.com/surgery_cure_rates.htm
fetching http://www.somesite.com/low-calcium.htm
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=20
fetching http://www.somesite.com/Sensipar-high-calcium.htm
fetching http://www.somesite.com/Dr.Norman.htm
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=18
fetching http://www.somesite.com/somesite-anatomy.htm
fetching http://www.somesite.com/somesite-surgery.htm
-activeThreads=10, spinWaiting=9, fetchQueues= 1, fetchQueues.totalSize=16
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=16
fetching http://www.somesite.com/hyposomesiteism.htm
fetching http://www.somesite.com/endocrinology.htm
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=14
fetching http://www.somesite.com/somesite-cancer.htm
fetching http://www.somesite.com/testimonials.htm
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=12
fetching http://www.somesite.com/hypersomesiteism-videos.htm
fetching http://www.somesite.com/high-calcium.htm
-activeThreads=10, spinWaiting=9, fetchQueues= 1, fetchQueues.totalSize=10
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=10
fetching http://www.somesite.com/osteoporosis2.htm
fetching http://www.somesite.com/MEN-Syndrome.htm
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=8
fetching http://www.somesite.com/causes.htm
fetching http://www.somesite.com/MIRP-Surgery.htm
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=6
fetching http://www.somesite.com/Re-Operation.htm
fetching http://www.somesite.com/pregnancy.htm
-activeThreads=10, spinWaiting=9, fetchQueues= 1, fetchQueues.totalSize=4
* queue: http://www.somesite.com
  maxThreads    = 1
  inProgress    = 1
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1325709723044
  now           = 1325709724390
  0. http://www.somesite.com/somesite-symptoms.htm
  1. http://www.somesite.com/10_somesite_rules.htm
  2. http://www.somesite.com/MIRP-publications.htm
  3. http://www.somesite.com/contents.htm
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=4
* queue: http://www.somesite.com
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1325709738263
  now           = 1325709734391
  0. http://www.somesite.com/somesite-symptoms.htm
  1. http://www.somesite.com/10_somesite_rules.htm
  2. http://www.somesite.com/MIRP-publications.htm
  3. http://www.somesite.com/contents.htm
fetching http://www.somesite.com/somesite-symptoms.htm
fetching http://www.somesite.com/10_somesite_rules.htm
-activeThreads=10, spinWaiting=10, fetchQueues= 1, fetchQueues.totalSize=2
* queue: http://www.somesite.com
  maxThreads    = 1
  inProgress    = 0
  crawlDelay    = 5000
  minCrawlDelay = 0
  nextFetchTime = 1325709748664
  now           = 1325709744393
  0. http://www.somesite.com/MIRP-publications.htm
  1. http://www.somesite.com/contents.htm
fetching http://www.somesite.com/MIRP-publications.htm
fetching http://www.somesite.com/contents.htm
-finishing thread FetcherThread5, activeThreads=9
-finishing thread FetcherThread3, activeThreads=8
-finishing thread FetcherThread2, activeThreads=7
-finishing thread FetcherThread1, activeThreads=6
-finishing thread FetcherThread8, activeThreads=5
-finishing thread FetcherThread7, activeThreads=2
-finishing thread FetcherThread9, activeThreads=3
-finishing thread FetcherThread6, activeThreads=4
-finishing thread FetcherThread4, activeThreads=1
-finishing thread FetcherThread0, activeThreads=0
-activeThreads=0, spinWaiting=0, fetchQueues= 0, fetchQueues.totalSize=0
-activeThreads=0
FetcherJob: done
sujit@cyclone:local$ bin/nutch parse 1325709400-776802111
ParserJob: starting
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: batchId: 1325709400-776802111
Parsing http://www.somesite.com/endocrinology.htm
Parsing http://www.somesite.com/causes.htm
Parsing http://www.somesite.com/MEN-Syndrome.htm
Parsing http://www.somesite.com/Somedisease-Surgeon.htm
Parsing http://www.somesite.com/Sensipar-high-calcium.htm
Parsing http://www.somesite.com/who's_eligible.htm
Parsing http://www.somesite.com/somesite-cancer.htm
Parsing http://www.somesite.com/mini-surgery.htm
Parsing http://www.somesite.com/surgery_cure_rates.htm
Skipping http://www.somesite.com/; different batch id
Parsing http://www.somesite.com/age.htm
Parsing http://www.somesite.com/index.htm
Parsing http://www.somesite.com/10_somesite_rules.htm
Parsing http://www.somesite.com/somesite-function.htm
Parsing http://www.somesite.com/contents.htm
Parsing http://www.somesite.com/FAQ.htm
Parsing http://www.somesite.com/paratiroide
Parsing http://www.somesite.com/MIRP-publications.htm
Parsing http://www.somesite.com/Dr.Norman.htm
Parsing http://www.somesite.com/MIRP-Surgery.htm
Parsing http://www.somesite.com/somesite-adenoma.htm
Parsing http://www.somesite.com/somesite-disease.htm
Parsing http://www.somesite.com/FHH.htm
Parsing http://www.somesite.com/treatment-surgery.htm
Parsing http://www.somesite.com/somesite-surgery.htm
Parsing http://www.somesite.com/hypersomesiteism-diagnosis.htm
Parsing http://www.somesite.com/Somedisease-Surgeon-Map.htm
Parsing http://www.somesite.com/hyposomesiteism.htm
Parsing http://www.somesite.com/somesite-symptoms.htm
Skipping http://www.somesite.com/paratiroide/; different batch id
Parsing http://www.somesite.com/somesite-symptoms-cartoon.htm
Parsing http://www.somesite.com/somesite-pictures.htm
Parsing http://www.somesite.com/osteoporosis.htm
Parsing http://www.somesite.com/somesite.htm
Parsing http://www.somesite.com/about-Somedisease.htm
Parsing http://www.somesite.com/Re-Operation.htm
Parsing http://www.somesite.com/diagnosis.htm
Parsing http://www.somesite.com/pregnancy.htm
Parsing http://www.somesite.com/high-calcium.htm
Parsing http://www.somesite.com/somesite-anatomy.htm
Parsing http://www.somesite.com/low-vitamin-d.htm
Parsing http://www.somesite.com/testimonials.htm
Parsing http://www.somesite.com/low-calcium.htm
Parsing http://www.somesite.com/finding-somesite.htm
Parsing http://www.somesite.com/hypersomesiteism-videos.htm
Parsing http://www.somesite.com/paratiroide/index.html
Parsing http://www.somesite.com/sestamibi.htm
Parsing http://www.somesite.com/osteoporosis2.htm
Parsing http://www.somesite.com/disclaimer.htm
ParserJob: success
sujit@cyclone:local$ bin/nutch updatedb
DbUpdaterJob: starting
DbUpdaterJob: done

As you can see, nutch is now fetching and parsing all the URLs that the updatedb subcommand placed on its fetch list. I recorded the number of records in the "f" and "p" column families after each iteration, as shown below:

  • after second iteration: count(f) = 149, count(p) = 47
  • after third iteration: count(f) = 627, count(p) = 108
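
Rather than typing the four subcommands by hand each time, the whole cycle can be scripted. The sketch below (assuming Python 2.7 for subprocess.check_output, that it is run from runtime/local, and that the batch ID line appears on stdout - you may need to redirect stderr depending on your log4j setup) captures the batch ID from generate and feeds it to fetch and parse:

import re
import subprocess

# run a few generate / fetch / parse / updatedb cycles; no error handling here
for i in range(3):
  out = subprocess.check_output(["bin/nutch", "generate"])
  batch_id = re.search(r"generated batch id: (\S+)", out).group(1)
  subprocess.check_call(["bin/nutch", "fetch", batch_id])
  subprocess.check_call(["bin/nutch", "parse", batch_id])
  subprocess.check_call(["bin/nutch", "updatedb"])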

Nutch SolrIndex

It is possible (and probably recommended, for incremental indexing) to publish the parsed records to Solr by batch ID after each iteration, but I decided to do this after my entire crawl was done. I already had a Solr server listening on port 8983, running with Nutch's schema.xml. Here's the nutch subcommand I used.

sujit@cyclone:local$ bin/nutch solrindex http://127.0.0.1:8983/solr/ -all
SolrIndexerJob: starting
SolrIndexerJob: done.

Navigating to http://localhost:8983/solr/admin and issuing a *:* query (match all records) gives me back 108 documents, which matches the number of records in the "p" column family after the third crawl iteration.
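
The same count can be verified from a script as well. Here is a small sketch that hits Solr's select handler (this assumes the JSON response writer via wt=json, which Solr 3.5 supports):

import json
import urllib

# ask Solr how many documents match *:* without fetching any rows
url = "http://localhost:8983/solr/select?q=*:*&rows=0&wt=json"
resp = json.load(urllib.urlopen(url))
print resp["response"]["numFound"]   # 108 after the third iteration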

Pycassa Scripts

Here are the Python scripts, display_webpage.py and count_pages.py, that I promised earlier. display_webpage.py reads all the records in the webpage keyspace and dumps them to the console in an easy-to-read, JSON-like format.

#!/usr/bin/python

import pycassa
from pycassa.pool import ConnectionPool
from pycassa.util import OrderedDict
import sys

def print_map(level, d):
  # recursively print nested OrderedDicts (super columns) in a JSON-like format
  for key in d.keys():
    value = d[key]
    if type(value) == type(OrderedDict()):
      print indent(level), key, ": {"
      print_map(level+1, value)
      print indent(level), "}"
    else:
      print indent(level), key, ":", quote(value), ","
    
def quote(s):
  # wrap the value in double quotes unless it is already quoted
  if not (s.startswith("\"") and s.endswith("\"")):
    return "".join(["\"", s, "\""])
  return s

def indent(level):
  return ("." * level * 2)

def main():
  print "webpage: {"
  level = 1
  pool = ConnectionPool("webpage")
  f = pycassa.ColumnFamily(pool, "f")
  for fk, fv in f.get_range(start="", finish=""):
    print indent(level), "key:", quote(fk), ","
    print indent(level), "f: {"
    if type(fv) == type(OrderedDict()):
      print_map(level+1, fv)
    else:
      print indent(level+1), fk, ":", quote(fv), ","
    print indent(level), "},"
    p = pycassa.ColumnFamily(pool, "p")
    print indent(level), "p: {"
    for pk, pv in p.get_range(start=fk, finish=fk):
      if type(pv) == type(OrderedDict()):
        print_map(level+1, pv)
      else:
        print indent(level+1), pk, ":", quote(pv), ","
    print indent(level), "},"
    sc = pycassa.ColumnFamily(pool, "sc")
    print indent(level), "sc: {"
    for sck, scv in sc.get_range(start=fk, finish=fk):
      if type(scv) == type(OrderedDict()):
        print_map(level+1, scv)
      else:
        print indent(level+1), sck, ":", quote(scv), ","
    print indent(level), "}"
  print "}"

if __name__ == "__main__":
  main()

The count_pages.py script lists all the records (by key) in the named column family and prints a count, so you will have to pass in "f" or "p" as a parameter.

#!/usr/bin/python

import pycassa
from pycassa.pool import ConnectionPool
from pycassa.util import OrderedDict
import sys

def main():
  if len(sys.argv) != 2:
    print "Usage: %s col_fam_name" % (sys.argv[0])
    sys.exit(-1)
  col_fam = sys.argv[1]
  pool = ConnectionPool("webpage")
  cf = pycassa.ColumnFamily(pool, col_fam)
  res = cf.get_range(start="", finish="")
  count = 0
  for k, v in res:
    print k
    count = count + 1
  print "#-records in %s: %d\n" % (col_fam, count)

if __name__ == "__main__":
  main()

I am still learning Pycassa and Cassandra, so there is nothing fancy here - both scripts are fairly basic usages of the Pycassa API (more details can be found on the Pycassa Tutorial if you are interested).

Problems to Fix

While I am still convinced that the Nutch/GORA with Cassandra approach would be a good fit for my pipeline application, I did find a number of things that I don't quite like. Yeah, I know, picky, picky... :-).

First, I would like to restrict the crawl to the domain provided via the seed URL. I noticed that at the end of the third iteration, a lot of the records were for pages outside the somesite.com domain I specified in the seed URL - normally these would be the cue to stop crawling further. This works fine for discovery-style crawls, but for focused single-site crawls (which are the kind of web crawls I envision doing more often), it would be good to restrict the crawl to the specified domain, so that successive iterations converge. This can be done by ensuring that anything in the webpage["sc"]["ol"] column gets promoted to the fetch list only if it matches the domain of its row key.
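
The check described above is simple enough to sketch. Something along these lines (a hypothetical helper, not actual Nutch code - the real fix would live in a Nutch filter or in the updatedb logic) would drop any outlink whose host differs from that of the page it was found on:

from urlparse import urlparse

def same_host(page_url, outlink_url):
  # keep an outlink only if it points back into the same host as its parent page
  return urlparse(page_url).hostname == urlparse(outlink_url).hostname

print same_host("http://www.somesite.com/FAQ.htm",
                "http://www.somesite.com/contents.htm")   # True - keep
print same_host("http://www.somesite.com/FAQ.htm",
                "http://www.othersite.com/index.htm")      # False - drop (hypothetical URL)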

Second, the u_idx parameter I specified in the seed URL as metadata only gets written into the webpage record for that page. My intent in providing this metadata is to have it propagate through to all its children. I believe this can be done using a combination of fetch and parse filters.

Finally, I see that my u_idx metadata parameter doesn't make it to Solr at all (not even for the single record where it was added by Nutch). Since the whole point of adding metadata along with the seed URLs is to be able to search the index for those pages using the metadata, this kind of defeats the purpose. I think the field just needs to be added to the solrindex-mapping.xml file in Nutch's runtime/local/conf directory.

My next post will probably focus on solutions to these problems...

Update 2012-02-22 - In response to some folks who were blindly copy-pasting commands and configuration from this post and running them against the same site I used, I have tried to anonymize the seed URL so that the site doesn't become a designated target for everyone's crawling experiments. Here is a perspective from the other side of the fence. Not that I am against self-education or anything, and often one does need to go crawl a site, but we should be cognizant of the fact that crawling is essentially parasitic in nature: it takes away bandwidth that small sites reserve for real visitors, and unlike the major search engines, we offer nothing in return. So anyway, that's why the seed URLs are now unusable... 'nuff said.

6 comments (moderated to prevent spam):

Julien Nioche said...

Good to see people using Nutch-Gora!

These issues are commonly asked on the mailing list

1. See property db.ignore.external.links

2&3 Google for the urlmeta plugin

Cheers

Julien

Sujit Pal said...

Thanks for the pointers Julien. I did not know the first property (now I know where to look for the others as well). I adapted some code based on the meta_url example in the Writing Plugins wiki page, and wrote about it here.

Anonymous said...

Thank you very much for helping me learn so many new things.

Sujit Pal said...

You are welcome, glad it helped.

rajnikant said...

Hi

I am crawling a complete website. i have done all configuration for nutch and solr.

Crawling is start but after one day it stop and shows an exception

com.mysql.jdbc.exceptions.jdbc4.MySQLDataException:

mysql buffersize outofrange.
how i can increase mysqlcachebuffer size for crawling a complete(large) website

Thanks
Rajni kant

Sujit Pal said...

Hi Rajnikant, when I last looked at GORA, the MySQL module was not very stable and there were explicit warnings to not use it. Not sure if things have changed since. My use of MySQL is fairly limited (I use it as a repository for application data, almost never CLOB/BLOB) and hasn't exposed this kind of error, so I can't say for sure, but a quick Google search for "increase mysql cache size" yielded this document from the MySQL ref manual, maybe this helps?