Sunday, February 05, 2012

Nutch/GORA - Indexing Sections and Embedded Assets

Background

Content often contains additional embedded information such as references to images and image groups (slideshows), and in these cases it is nice to be able to "search" these assets by matching queries against content inherited from their containers. Additionally, if the content is nicely sectioned, it is possible to split it up and offer "search within a document" functionality. This post discusses one possible implementation using Nutch/GORA that provides these two features.

So, here are my "business" requirements. These are basically the things that our current pipeline (at work) does for this content type.

  • Image references are embedded in XML. These references contain pointers to image files delivered separately. The reference can contain an embedded element pointing to another file, from which additional content may be extracted for the image.
  • Image references may be grouped together in a single section in the XML. In that case, the group should be treated as an image group or slideshow.
  • The first image reference in a document (which can be either the first standalone image or the first image in the first slideshow) should be used to provide a thumbnail for the container document in the search results.
  • The XML document is composed of multiple sections. Section results should be available in the index to allow users to find matching sections within a document.

Our current pipeline (at work) already supports both these features by splitting each document up front (before ingestion) into the document itself plus multiple subdocuments, all in a common XML format. But this makes delta indexing (updates, and to a lesser extent deletes) much harder (which is probably why we don't support it). That pipeline is also not as tightly integrated with Nutch as the one I am trying to build, so each class of content (web crawls, feeds, CMS, provider content, etc.) can (and does) have a different flow.

Implementation

The solution implemented here is in two parts.

  1. The custom XML processor (called by the custom Provider XML Parser plugin) for this content type is extended to parse the images, slideshows and sections out of the XML and record them as (structured) JSON strings in the WebPage metadata.
  2. After SolrIndexerJob runs, an additional job reads this metadata out of each WebPage, parses the JSON strings back into Collection objects, and writes a subpage record to Solr for each element in each Collection.

The first part requires some changes to the Prov1XmlProcessor class. Although there is quite a lot of change, it's not very interesting (unless you have the same content source), since the changes are very specific to the XML we are parsing. Here is the updated code for Prov1XmlProcessor.java.

// Source: src/plugin/mycompany/src/java/com/mycompany/nutch/parse/xml/Prov1XmlProcessor.java
package com.mycompany.nutch.parse.xml;

import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.commons.lang.StringUtils;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.input.SAXBuilder;
import org.xml.sax.InputSource;

public class Prov1XmlProcessor implements IProviderXmlProcessor {

  private static final String IMAGE_TEXT_URL_PREFIX =
    "http://localhost:8080/provider/prov1";
  
  private class Section {
    public int group;
    public int ordinal;
    public String title;
    public String content;
  };
  
  private class Image extends Section {
    public String name;
  };
  
  private Comparator<Section> groupOrdinalComparator = 
  new Comparator<Section>() {
    public int compare(Section sec1, Section sec2) {
      if (sec1.group == sec2.group) {
        if (sec1.ordinal == sec2.ordinal) {
          return 0;
        } else {
          return sec1.ordinal > sec2.ordinal ? 1 : -1;
        }
      } else {
        return sec1.group > sec2.group ? 1 : -1;
      }
    }
  };

  @SuppressWarnings("unchecked")
  @Override
  public Map<String,String> parse(String content) throws Exception {
    Map<String,String> parsedFields = new HashMap<String,String>();
    SAXBuilder builder = new SAXBuilder();
    Document doc = builder.build(new InputSource(
      new ByteArrayInputStream(content.getBytes())));
    Element root = doc.getRootElement();
    parsedFields.put(ProviderXmlFields.u_disp.name(), "M"); // appear on SERP
    parsedFields.put(ProviderXmlFields.u_lang.name(), 
      root.getAttributeValue("language"));
    parsedFields.put(ProviderXmlFields.title.name(), 
      root.getAttributeValue("title"));
    parsedFields.put(ProviderXmlFields.u_category.name(), 
      root.getAttributeValue("subContent"));
    parsedFields.put(ProviderXmlFields.u_contentid.name(), 
      root.getAttributeValue("genContentID"));
    Element versionInfo = root.getChild("versionInfo");
    if (versionInfo != null) {
      parsedFields.put(ProviderXmlFields.u_reviewdate.name(), 
        ProviderXmlParserUtils.convertToIso8601(
        versionInfo.getAttributeValue("reviewDate")));
      parsedFields.put(ProviderXmlFields.u_reviewers.name(), 
        versionInfo.getAttributeValue("reviewedBy"));
    }
    parsedFields.put(ProviderXmlFields.content.name(), 
      ProviderXmlParserUtils.getTextContent(root));
    // extract sections and images
    List<Section> sections = new ArrayList<Section>();
    List<Image> images = new ArrayList<Image>();
    List<List<Image>> slideshows = new ArrayList<List<Image>>();
    List<Element> textContents = root.getChildren("textContent");
    for (Element textContent : textContents) {
      Section section = parseSection(textContent);
      if ("visHeader".equals(section.title)) {
        // this represents a slideshow, build and populate
        int group = Integer.valueOf(textContent.getAttributeValue("group"));
        slideshows.add(parseSlideshow(group, textContent));
        continue;
      }
      sections.add(section);
      images.addAll(parseImages(section, textContent));
    }
    boolean hasThumbnail = false;
    if (sections.size() > 0) {
      Collections.sort(sections, groupOrdinalComparator);
      parsedFields.put(ProviderXmlFields.u_sections.name(),
        ProviderXmlParserUtils.convertToJson(sections));
    }
    if (images.size() > 0) {
      parsedFields.put(ProviderXmlFields.u_images.name(), 
        ProviderXmlParserUtils.convertToJson(images));
      // get thumbnail from first image in document
      parsedFields.put(ProviderXmlFields.u_thumbnail.name(), 
        images.get(0).name);
      hasThumbnail = true;
    }
    if (slideshows.size() > 0) {
      parsedFields.put(ProviderXmlFields.u_slideshows.name(), 
        ProviderXmlParserUtils.convertToJson(slideshows));
      if (! hasThumbnail) {
        // if no thumbnail from standalone images, get thumbnail from 
        // the first image in the first non-empty slideshow
        slideshowLoop:
        for (List<Image> slideshow : slideshows) {
          for (Image image : slideshow) {
            if (image != null) {
              parsedFields.put(ProviderXmlFields.u_thumbnail.name(), 
                image.name);
              hasThumbnail = true;
              // labeled break so later slideshows do not overwrite this
              break slideshowLoop;
            }
          }
        }
      }
    }
    return parsedFields;
  }

  private Section parseSection(Element textContent) {
    Section section = new Section();
    section.group = Integer.valueOf(
      textContent.getAttributeValue("group"));
    section.ordinal = Integer.valueOf(
      textContent.getAttributeValue("ordinal"));
    section.title = textContent.getAttributeValue("title");
    section.content = ProviderXmlParserUtils.getTextContent(
      textContent);
    return section;
  }

  @SuppressWarnings("unchecked")
  private List<Image> parseImages(Section section, 
      Element textContent) throws Exception {
    List<Image> images = new ArrayList<Image>();
    List<Element> visualContents = 
      textContent.getChildren("visualContent");
    for (Element visualContent : visualContents) {
      Image image = new Image();
      image.group = Integer.valueOf(
        visualContent.getAttributeValue("group"));
      image.ordinal = Integer.valueOf(
        visualContent.getAttributeValue("ordinal"));
      image.title = visualContent.getAttributeValue("alt");
      image.name = 
        visualContent.getAttributeValue("genContentID") + 
        "t." + visualContent.getAttributeValue("mediaType");
      Element visualLink = visualContent.getChild("visualLink");
      if (visualLink != null) {
        image.content = fetchDescription(
          visualLink.getAttributeValue("projectTypeID"),
          visualLink.getAttributeValue("genContentID"));
      }
      images.add(image);
    }
    Collections.sort(images, groupOrdinalComparator);
    return images;
  }

  @SuppressWarnings("unchecked")
  private List<Image> parseSlideshow(int group, 
      Element textContent) throws Exception {
    List<Image> images = new ArrayList<Image>();
    List<Element> visualContents = 
      textContent.getChildren("visualContent");
    for (Element visualContent : visualContents) {
      Image image = new Image();
      image.group = group;
      image.ordinal = Integer.valueOf(
        visualContent.getAttributeValue("ordinal"));
      image.title = visualContent.getAttributeValue("alt");
      image.name = 
        visualContent.getAttributeValue("genContentID") + 
        "t." + visualContent.getAttributeValue("mediaType");
      Element visualLink = visualContent.getChild("visualLink");
      if (visualLink != null) {
        image.content = fetchDescription(
          visualLink.getAttributeValue("projectTypeID"),
          visualLink.getAttributeValue("genContentID"));
      }
      images.add(image);
    }
    Collections.sort(images, groupOrdinalComparator);
    return images;
  }

  private String fetchDescription(String projectTypeId, 
      String imageContentId) throws Exception {
    String text = ProviderXmlParserUtils.readStringFromUrl(
      StringUtils.join(new String[] {
      IMAGE_TEXT_URL_PREFIX, projectTypeId, imageContentId
      }, "__") + ".xml");
    if (StringUtils.isEmpty(text)) {
      return text;
    }
    SAXBuilder builder = new SAXBuilder();
    Document doc = builder.build(new InputSource(
      new ByteArrayInputStream(text.getBytes())));
    Element root = doc.getRootElement();
    return ProviderXmlParserUtils.getTextContent(root);
  }
}

With the changes to the parser, we can now extract the section, image and slideshow content into data structures (a List of Sections or Images, and a List of Lists of Images for slideshows) and write them out as JSON strings into the WebPage metadata. So we still have a 1-document to 1-WebPage correspondence at this point.
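
The ProviderXmlParserUtils.convertToJson() helper referenced above is not shown in this post. Just to make the handoff concrete, here is a minimal sketch of what such a helper could look like, assuming it simply delegates to Jackson (the same library SolrSubpageIndexerJob below uses to parse the strings back). The sketch is an assumption, not the actual utility code; the real class also contains getTextContent(), convertToIso8601() and readStringFromUrl(), which are not sketched here.

// Illustrative sketch only -- not the actual ProviderXmlParserUtils code.
// Assumes the helper delegates to Jackson's ObjectMapper.
package com.mycompany.nutch.parse.xml;

import org.codehaus.jackson.map.ObjectMapper;

public class ProviderXmlParserUtils {

  private static final ObjectMapper MAPPER = new ObjectMapper();

  // Serializes the public fields (group, ordinal, title, content, name) of
  // the Section/Image beans into a JSON array string, for example:
  // [{"group":1,"ordinal":1,"title":"Overview","content":"..."}, ...]
  public static String convertToJson(Object sectionsOrImages) throws Exception {
    return MAPPER.writeValueAsString(sectionsOrImages);
  }
}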

However, the search layer wants these documents explicitly split out. Nutch does not have any functionality that supports converting a single WebPage document into multiple NutchDocuments (for indexing). The closest is an IndexingFilter plugin, but an indexing filter takes a NutchDocument, adds/deletes/updates fields within it, and returns either the same or a new NutchDocument; in other words, it can only do a 1:1 mapping.

So I decided to do this in a separate stage after (/bin/nutch) solrindex. Like many Nutch tools, it is modeled as a Hadoop Map-Reduce job. The Mapper extends GoraMapper and reads WebPages from Cassandra. For each WebPage, the metadata is scanned for the JSON fields; if they exist, each JSON string is parsed back into the appropriate Collection, and a new NutchDocument is written out for each element in the Collection. The reducer writes these NutchDocuments to Solr in batches.

As before, some of the code in the mapper is application specific: we look for specific metadata values for each document and use them to generate subpage fields to populate into the NutchDocument. Once you have figured out what your section/image/slideshow records should look like (based on search client requirements), the rest should be pretty generic. Here's the code for SolrSubpageIndexerJob.

// Source: src/java/com/mycompany/nutch/subpageindexer/SolrSubpageIndexerJob.java
package com.mycompany.nutch.subpageindexer;

import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Date;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

import org.apache.avro.util.Utf8;
import org.apache.commons.codec.digest.DigestUtils;
import org.apache.commons.lang.StringUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.gora.mapreduce.GoraMapper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.indexer.IndexerJob;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.indexer.solr.SolrConstants;
import org.apache.nutch.metadata.Nutch;
import org.apache.nutch.storage.Mark;
import org.apache.nutch.storage.StorageUtils;
import org.apache.nutch.storage.WebPage;
import org.apache.nutch.util.Bytes;
import org.apache.nutch.util.NutchConfiguration;
import org.apache.nutch.util.NutchJob;
import org.apache.nutch.util.TableUtil;
import org.apache.nutch.util.ToolUtil;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.util.DateUtil;
import org.codehaus.jackson.map.ObjectMapper;
import org.codehaus.jackson.type.TypeReference;

/**
 * Map Reduce job to read the database, parse sections and 
 * images out, and write them out to SOLR as separate sub-pages.
 */
public class SolrSubpageIndexerJob extends IndexerJob {

  private static Log LOG = LogFactory.getLog(SolrSubpageIndexerJob.class);
  
  private static Utf8 U_SECTIONS = new Utf8("u_sections");
  private static Utf8 U_IMAGES = new Utf8("u_images");
  private static Utf8 U_SLIDESHOWS = new Utf8("u_slideshows");
  
  private static final Collection<WebPage.Field> FIELDS = 
    new HashSet<WebPage.Field>();
  
  static {
    FIELDS.addAll(Arrays.asList(WebPage.Field.values()));
  }
  
  public static class SolrSubpageIndexerJobMapper 
      extends GoraMapper<String,WebPage,Text,NutchDocument> {
    
    private Utf8 batchId;
    
    @Override
    public void setup(Context ctx) throws IOException {
      Configuration conf = ctx.getConfiguration();
      batchId = new Utf8(conf.get(Nutch.ARG_BATCH));
    }
    
    @Override
    public void map(String key, WebPage page, Context ctx)
        throws IOException, InterruptedException {
      // check that this page has been parsed and is ready for indexing
      String url = TableUtil.unreverseUrl(key);
      Utf8 mark = Mark.PARSE_MARK.checkMark(page);
      if (! NutchJob.shouldProcess(mark, batchId)) {
        LOG.info("Skipping " + 
          TableUtil.unreverseUrl(key) + "; different batch id");
        return;
      }
      Map<Utf8,ByteBuffer> metadata = page.getMetadata();
      ObjectMapper mapper = new ObjectMapper();
      if (metadata.get(U_SECTIONS) != null) {
        String sectionJson = Bytes.toString(Bytes.toBytes(
          metadata.get(U_SECTIONS)));
        List<Map<String,Object>> sections = mapper.readValue(
          sectionJson, 
          new TypeReference<List<Map<String,Object>>>() {});
        for (Map<String,Object> section : sections) {
          NutchDocument doc = new NutchDocument();
          Integer group = (Integer) section.get("group");
          Integer ordinal = (Integer) section.get("ordinal");
          String title = (String) section.get("title");
          String content = (String) section.get("content");
          String sid = StringUtils.join(new String[] {
            String.valueOf(group), String.valueOf(ordinal)
          }, ".");
          String newKey = TableUtil.reverseUrl(StringUtils.join(
            new String[] {url, sid}, "-"));
          populateCommonFields(doc, page, newKey, title, content);
          doc.add("u_idx", "section");
          doc.add("u_disp", "S"); // section - dont show on serp
          doc.add("s_parent", key);
          doc.add("s_sortorder", sid);
          doc.add("s_sid", sid);
          doc.add("title", title);
          doc.add("content", content);
          ctx.write(new Text(newKey), doc);
        }
      }
      if (metadata.get(U_IMAGES) != null) {
        String imageJson = Bytes.toString(Bytes.toBytes(
          metadata.get(U_IMAGES)));
        List<Map<String,Object>> images = mapper.readValue(
          imageJson, 
          new TypeReference<List<Map<String,Object>>>() {});
        for (Map<String,Object> image : images) {
          NutchDocument doc = new NutchDocument();
          int group = (Integer) image.get("group");
          int ordinal = (Integer) image.get("ordinal");
          String title = (String) image.get("title");
          String content = (String) image.get("content");
          String name = (String) image.get("name");
          String newKey = TableUtil.reverseUrl(StringUtils.join(
            new String[] {url, name}, "-"));
          populateCommonFields(doc, page, newKey, title, content);
          doc.add("u_idx", "image");
          doc.add("u_disp", "M"); // treated as main for search
          doc.add("s_parent", key);
          doc.add("s_sid", name);
          doc.add("s_sortorder", StringUtils.join(new String[] {
            String.valueOf(group), String.valueOf(ordinal)
          }, "."));
          doc.add("title", title);
          doc.add("s_content", content); // for search AJAX component
          doc.add("content", content);
          ctx.write(new Text(newKey), doc);
        }
      }
      if (metadata.get(U_SLIDESHOWS) != null) {
        String slideshowJson = Bytes.toString(Bytes.toBytes(
          metadata.get(U_SLIDESHOWS)));
        List<List<Map<String,Object>>> slideshows = 
          mapper.readValue(slideshowJson, 
          new TypeReference<List<List<Map<String,Object>>>>() {});
        int sortOrder = 0;
        for (List<Map<String,Object>> slideshow : slideshows) {
          if (slideshow.size() > 0) {
            // metadata is from the first image in slideshow
            // content is the JSON for the slideshow - application
            // may parse and use JSON for rendering
            Map<String,Object> image = slideshow.get(0);
            NutchDocument doc = new NutchDocument();
            String title = (String) image.get("title");
            String content = (String) image.get("content");
            String name = (String) image.get("name");
            String newKey = TableUtil.reverseUrl(StringUtils.join(
              new String[] {url, name}, "-"));
            populateCommonFields(doc, page, newKey, title, content);
            doc.add("u_idx", "slideshow");
            doc.add("u_disp", "M"); // treated as main for SERP
            doc.add("s_parent", key);
            doc.add("s_sid", name);
            doc.add("s_sortorder", String.valueOf(sortOrder));
            doc.add("title", title);
            String json = mapper.writeValueAsString(slideshow);
            doc.add("content_s", json); // for search AJAX component
            doc.add("content", json);
            ctx.write(new Text(newKey), doc);
          }
          sortOrder++;
        }
      }
    }

    private void populateCommonFields(NutchDocument doc, 
        WebPage page, String key, String title, String content) {
      if (page.isReadable(WebPage.Field.BASE_URL.getIndex())) {
        doc.add("url", TableUtil.toString(page.getBaseUrl()));
        doc.add("id", key);
        doc.add("boost", String.valueOf(0.0F));
        doc.add("digest", DigestUtils.md5Hex(title+content));
        doc.add("tstamp", DateUtil.getThreadLocalDateFormat().
          format(new Date(page.getFetchTime())));
      }
    }
  }
  
  public static class SolrSubpageIndexerJobReducer
      extends Reducer<Text,NutchDocument,Text,NutchDocument> {
   
    private int commitSize;
    private SolrServer server;
    private List<SolrInputDocument> sdocs = 
      new ArrayList<SolrInputDocument>();
    
    @Override
    public void setup(Context ctx) throws IOException {
      Configuration conf = ctx.getConfiguration();
      this.server = new CommonsHttpSolrServer(
        conf.get(Nutch.ARG_SOLR));
      this.commitSize = conf.getInt(
        SolrConstants.COMMIT_SIZE, 1000);
    }
    
    @Override
    public void reduce(Text key, Iterable<NutchDocument> values,
        Context ctx) throws IOException, InterruptedException {
      for (NutchDocument doc : values) {
        SolrInputDocument sdoc = new SolrInputDocument();
        for (String fieldname : doc.getFieldNames()) {
          sdoc.addField(fieldname, doc.getFieldValue(fieldname));
        }
        sdocs.add(sdoc);
        if (sdocs.size() >= commitSize) {
          try {
            server.add(sdocs);
          } catch (SolrServerException e) {
            throw new IOException(e);
          }
          sdocs.clear();
        }
      }
    }
    
    @Override
    public void cleanup(Context ctx) throws IOException {
      try {
        if (sdocs.size() > 0) {
          try {
            server.add(sdocs);
          } catch (SolrServerException e) {
            throw new IOException(e);
          }
          sdocs.clear();
        }
        server.commit();
      } catch (SolrServerException e) {
        throw new IOException(e);
      }
    }
  }
  
  @Override
  public Map<String,Object> run(Map<String,Object> args) throws Exception {
    String solrUrl = (String) args.get(Nutch.ARG_SOLR);
    if (StringUtils.isNotEmpty(solrUrl)) {
      getConf().set(Nutch.ARG_SOLR, solrUrl);
    }
    String batchId = (String) args.get(Nutch.ARG_BATCH);
    if (StringUtils.isNotEmpty(batchId)) {
      getConf().set(Nutch.ARG_BATCH, batchId);
    }
    currentJob = new NutchJob(getConf(), "solr-subpage-index");
    StorageUtils.initMapperJob(currentJob, FIELDS, Text.class, 
      NutchDocument.class, SolrSubpageIndexerJobMapper.class);
    currentJob.setMapOutputKeyClass(Text.class);
    currentJob.setMapOutputValueClass(NutchDocument.class);
    currentJob.setReducerClass(SolrSubpageIndexerJobReducer.class);
    currentJob.setNumReduceTasks(5);
    currentJob.waitForCompletion(true);
    ToolUtil.recordJobStatus(null, currentJob, results);
    return results;
  }

  @Override
  public int run(String[] args) throws Exception {
    if (args.length < 2) {
      System.err.println("Usage: SolrSubpageIndexerJob <solr url> " +
        "(<batch_id> | -all)");
      return -1;
    }
    LOG.info("SolrSubpageIndexerJob: starting");
    run(ToolUtil.toArgMap(
      Nutch.ARG_SOLR, args[0],
      Nutch.ARG_BATCH, args[1]));
    LOG.info("SolrSubpageIndexerJob: success");
    return 0;
  }

  public static void main(String[] args) throws Exception {
    final int res = ToolRunner.run(NutchConfiguration.create(), 
      new SolrSubpageIndexerJob(), args);
    System.exit(res);
  }
}

Configuration-wise, the new fields need to be added to the Solr schema.xml file:

<schema name="nutch" version="1.4">
<!-- Source: conf/schema.xml -->
  ...
  <fields>
    <!-- user defined fields -->
    <field name="u_idx" type="string" stored="true" indexed="true"/>
    <field name="u_contentid" type="string" stored="true" indexed="true"/>
    <field name="u_category" type="string" stored="true" indexed="true"/>
    <field name="u_lang" type="string" stored="true" indexed="true"/>
    <field name="u_reviewdate" type="string" stored="true" indexed="false"/>
    <field name="u_reviewers" type="string" stored="true" indexed="false"/>
    <field name="u_thumbnail" type="string" stored="true" indexed="false"/>
    <field name="u_disp" type="string" stored="true" indexed="true"/>
    <!-- user defined subpage fields for main (ajax) -->
    <field name="u_sections" type="string" stored="true" indexed="false"/>
    <field name="u_images" type="string" stored="true" indexed="false"/>
    <field name="u_slideshows" type="string" stored="true" indexed="false"/>
    <!-- user defined subpage fields for sub -->
    <field name="s_parent" type="string" stored="true" indexed="true"/>
    <field name="s_sid" type="string" stored="true" indexed="true"/>
    <field name="s_sortorder" type="float" stored="true" indexed="true"/>
    <field name="content_s" type="string" stored="true" indexed="false"/>
  </fields>
  <uniqueKey>id</uniqueKey>
  <defaultSearchField>content</defaultSearchField>
  <solrQueryParser defaultOperator="OR"/>
</schema>

and to the solrindex-mapping.xml file:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Source: conf/solrindex-mapping.xml -->
<mapping>
  <fields>
    <field dest="content" source="content"/>
    <field dest="site" source="site"/>
    <field dest="title" source="title"/>
    <field dest="host" source="host"/>
    <field dest="segment" source="segment"/>
    <field dest="boost" source="boost"/>
    <field dest="digest" source="digest"/>
    <field dest="tstamp" source="tstamp"/>
    <!-- user defined -->
    <field dest="u_idx" source="u_idx"/>
    <field dest="u_contentid" source="u_contentid"/>
    <field dest="u_category" source="u_category"/>
    <field dest="u_lang" source="u_lang"/>
    <field dest="u_reviewdate" source="u_reviewdate"/>
    <field dest="u_reviewers" source="u_reviewers"/>
    <field dest="u_thumbnail" source="u_thumbnail"/>
    <field dest="u_disp" source="u_disp"/>
    <!-- user-defined subpage fields for main -->
    <field dest="u_sections" source="u_sections"/>
    <field dest="u_images" source="u_images"/>
    <field dest="u_slideshows" source="u_slideshows"/>
    <!-- user-defined subpage fields for sub -->
    <field dest="s_parent" source="s_parent"/>
    <field dest="s_sid" source="s_sid"/>
    <field dest="s_sortorder" source="s_sortorder"/>
    <field dest="s_content" source="s_content"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</mapping>

Running

Running the code is just a matter of tacking on an extra step after the Nutch subcommands generate, fetch, parse, updatedb and solrindex, like so:

sujit@cyclone:local$ bin/nutch \
  com.mycompany.nutch.subpageindexer.SolrSubpageIndexerJob \
  http://localhost:8983/solr/ -all

As you can see, the new subcommand (called by class name, since there is no alias for it in /bin/nutch) takes the same parameters as SolrIndexerJob.

Once you run it, you should see many more records in your index, and you can search for your subpages with standard Lucene filters from the Solr admin page to verify that everything worked correctly.
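
For example, here is a quick programmatic check with SolrJ. This snippet is not part of the pipeline; the Solr URL and the filter value are just illustrations based on the schema fields defined above.

// Illustrative verification snippet, not part of the indexing pipeline.
package com.mycompany.nutch.subpageindexer;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class VerifySubpages {

  public static void main(String[] args) throws Exception {
    SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr/");
    // equivalent to .../solr/select?q=*:*&fq=u_idx:section from the admin page
    SolrQuery query = new SolrQuery("*:*");
    query.addFilterQuery("u_idx:section"); // or u_idx:image, u_idx:slideshow
    query.addField("id");
    query.addField("s_parent");
    query.addField("title");
    QueryResponse response = server.query(query);
    System.out.println("Section subpages found: " + 
      response.getResults().getNumFound());
  }
}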

11 comments (moderated to prevent spam):

Ryan said...

Sujit, I cannot find the source code for org.apache.nutch.util.ToolUtil anywhere. The Nutch 2.0 download I found was from Oct. 2010 and doesn't use that class, and I can't seem to find a newer build of 2.0.
Any tips on where to find this class??

I need to create new sub-documents and modifying your code seems like a good start.

Sujit Pal said...

Hi Ryan, I found it in the NutchGora branch, from the Nutch svn repository.

Ryan said...

Got it. I just kept missing the 'workspace' directory last night and it was driving me crazy. Thanks.

Anonymous said...

Hi,

I started implementation of similar approach based on IndexerJob. I have problem with running:
java.lang.RuntimeException: java.lang.ClassNotFoundException: aa.bb.MyMapper

main in IndexerJob class starts, but Mapper class cannot be found...

Do you have any ideas?

Thank you...

Jaroslav

Sujit Pal said...

Hi Jaroslav, sounds like a classpath problem. I try to bypass these issues by doing the following: (1) do my development inside the Nutch source tree, i.e., after I download it, I create a directory under src/java, (2) use "ant runtime" to compile my code and deploy it to the runtime/local subdirectory, and (3) run my jobs from runtime/local. I am guessing you are probably not doing this and are attempting to deploy manually?

Jorge Luis Betancourt González said...

Hi Sujit:

I'm trying to use a very similar approach to index each page of a PDF file as a new document in a specific Solr core. This approach comes from the need to provide not only the document where the search criteria match, but also the specific page.

Could the approach you follow be implemented in the current Nutch 1.6 branch? Any advice at all?

Sujit Pal said...

Hi Jorge, I think it should be possible to do what you want using this approach. The only change is that you will have to read your NutchDocument from a sequence file instead of a NoSQL database. One (space-saving) suggestion would be to save, during parsing, an array of character offsets marking where each page begins in the PDF's text content, and use these offsets to split the document into subsections when writing to Solr.

Jorge Luis Betancourt González said...

Would it be possible, instead of getting the content from an extra database or file, to get the data from the URL document stored in the Nutch segments (or even better, from a higher level)? The mechanism I'm thinking of is: store each PDF page in the NutchDocument metadata (using a parser plugin) and then write each of these metadata fields out as a new NutchDocument, eliminating the use of an external storage layer. What do you think of this?

Sujit Pal said...

Yes, good idea, that would work perfectly well too. You could do the splitting in the Mapper.

Anonymous said...

Hello Sujit! I would like to ask if it is possible to pass into the arg map a BATCH id that reflects the real batch id collected after crawling (because I think the parameter "-all" means that every record from HBase will be indexed again, am I right?)

Jan

Sujit Pal said...

Hi Jan, yes, you are right about the -all behavior, and passing in the batch id instead should work to restrict processing by batch id, similar to the other Nutch jobs.