Background
Content often contains additional embedded information such as references to images and image groups (slideshows), and in these cases it's nice to be able to "search" these by matching queries against content inherited from their containers. Additionally, if the content is nicely sectioned, it is possible to split it up and provide "search within a document" functionality. This post describes one possible implementation using Nutch/GORA that provides these two features.
So, here are my "business" requirements. These are basically the things that our current pipeline (at work) does for this content type.
- Image references are embedded in XML (a hypothetical sketch of the layout follows this list). These references contain pointers to image files delivered separately. The reference can contain an embedded element pointing to another file, from which additional content may be extracted for the image.
- Image references may be grouped together in a single section in the XML. In that case, the group should be treated as an image group or slideshow.
- The first image reference in a document (can be the first standalone image, or the first image in the first slideshow) should be used to provide a thumbnail for the container document in the search results.
- The XML document is composed of multiple sections. Section results should be available in the index to allow users to find matching sections within a document.
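Here is that hypothetical sketch of the provider XML. The element and attribute names are the ones the parser code below looks for; the root element name and all of the values are made up for illustration:

<document genContentID="12345" title="Sample Document" language="en" subContent="overview">
  <versionInfo reviewDate="2012-01-15" reviewedBy="J. Doe"/>
  <!-- a plain section -->
  <textContent group="1" ordinal="1" title="Overview">
    Some section text...
    <!-- a standalone image reference; visualLink points to a separate
         file from which extra description text is pulled -->
    <visualContent group="1" ordinal="1" alt="Figure 1"
        genContentID="67890" mediaType="jpg">
      <visualLink projectTypeID="42" genContentID="67891"/>
    </visualContent>
  </textContent>
  <!-- a slideshow: a section whose title attribute is "visHeader" -->
  <textContent group="2" ordinal="1" title="visHeader">
    <visualContent ordinal="1" alt="Slide 1" genContentID="67892" mediaType="jpg"/>
    <visualContent ordinal="2" alt="Slide 2" genContentID="67893" mediaType="jpg"/>
  </textContent>
</document>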
Our current pipeline (at work) already supports both these features by splitting each document up front (before ingestion) into itself plus multiple subdocuments, all in a common XML format. But this makes delta indexing (updates, and to a lesser extent deletes) much harder, which is probably why we don't support it. That pipeline is also not as tightly integrated with Nutch as the one I am trying to build, so each class of content (web crawls, feeds, CMS, provider content, etc.) can (and does) have a different flow.
Implementation
The solution implemented here is in two parts.
- The custom XML Processor (called by the custom Provider XML Parser plugin) for the content type is extended to parse the images, slideshows and sections out of the XML and record them as (structured) JSON strings in the WebPage metadata.
- After SolrIndexerJob is run, an additional job is run that reads this metadata out of each WebPage, parses the JSON strings back into Collection objects, and then writes out a subpage record to Solr for each element in the Collection.
The first part requires some changes to the Prov1XmlProcessor class. Although there is quite a lot of change, it's not very interesting (unless you have the same content source), since the changes are very specific to the XML we are parsing. Here is the updated code for Prov1XmlProcessor.java.
// Source: src/plugin/mycompany/src/java/com/mycompany/nutch/parse/xml/Prov1XmlProcessor.java
package com.mycompany.nutch.parse.xml;
import java.io.ByteArrayInputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.commons.lang.StringUtils;
import org.jdom.Document;
import org.jdom.Element;
import org.jdom.input.SAXBuilder;
import org.xml.sax.InputSource;
public class Prov1XmlProcessor implements IProviderXmlProcessor {
private static final String IMAGE_TEXT_URL_PREFIX =
"http://localhost:8080/provider/prov1";
private class Section {
public int group;
public int ordinal;
public String title;
public String content;
};
private class Image extends Section {
public String name;
};
private Comparator<Section> groupOrdinalComparator =
new Comparator<Section>() {
public int compare(Section sec1, Section sec2) {
if (sec1.group == sec2.group) {
if (sec1.ordinal == sec2.ordinal) {
return 0;
} else {
return sec1.ordinal > sec2.ordinal ? 1 : -1;
}
} else {
return sec1.group > sec2.group ? 1 : -1;
}
}
};
@SuppressWarnings("unchecked")
@Override
public Map<String,String> parse(String content) throws Exception {
Map<String,String> parsedFields = new HashMap<String,String>();
SAXBuilder builder = new SAXBuilder();
Document doc = builder.build(new InputSource(
new ByteArrayInputStream(content.getBytes())));
Element root = doc.getRootElement();
parsedFields.put(ProviderXmlFields.u_disp.name(), "M"); // appear on SERP
parsedFields.put(ProviderXmlFields.u_lang.name(),
root.getAttributeValue("language"));
parsedFields.put(ProviderXmlFields.title.name(),
root.getAttributeValue("title"));
parsedFields.put(ProviderXmlFields.u_category.name(),
root.getAttributeValue("subContent"));
parsedFields.put(ProviderXmlFields.u_contentid.name(),
root.getAttributeValue("genContentID"));
Element versionInfo = root.getChild("versionInfo");
if (versionInfo != null) {
parsedFields.put(ProviderXmlFields.u_reviewdate.name(),
ProviderXmlParserUtils.convertToIso8601(
versionInfo.getAttributeValue("reviewDate")));
parsedFields.put(ProviderXmlFields.u_reviewers.name(),
versionInfo.getAttributeValue("reviewedBy"));
}
parsedFields.put(ProviderXmlFields.content.name(),
ProviderXmlParserUtils.getTextContent(root));
// extract sections and images
List<Section> sections = new ArrayList<Section>();
List<Image> images = new ArrayList<Image>();
List<List<Image>> slideshows = new ArrayList<List<Image>>();
List<Element> textContents = root.getChildren("textContent");
for (Element textContent : textContents) {
Section section = parseSection(textContent);
if ("visHeader".equals(section.title)) {
// this represents a slideshow, build and populate
int group = Integer.valueOf(textContent.getAttributeValue("group"));
slideshows.add(parseSlideshow(group, textContent));
continue;
}
sections.add(section);
images.addAll(parseImages(section, textContent));
}
boolean hasThumbnail = false;
if (sections.size() > 0) {
Collections.sort(sections, groupOrdinalComparator);
parsedFields.put(ProviderXmlFields.u_sections.name(),
ProviderXmlParserUtils.convertToJson(sections));
}
if (images.size() > 0) {
parsedFields.put(ProviderXmlFields.u_images.name(),
ProviderXmlParserUtils.convertToJson(images));
// get thumbnail from first image in document
parsedFields.put(ProviderXmlFields.u_thumbnail.name(),
images.get(0).name);
hasThumbnail = true;
}
if (slideshows.size() > 0) {
parsedFields.put(ProviderXmlFields.u_slideshows.name(),
ProviderXmlParserUtils.convertToJson(slideshows));
if (! hasThumbnail) {
// if no thumbnail from images, get thumbnail from
// first image in slideshows
for (List<Image> slideshow : slideshows) {
for (Image image : slideshow) {
if (image != null) {
parsedFields.put(ProviderXmlFields.u_thumbnail.name(),
image.name);
hasThumbnail = true;
break;
}
}
if (hasThumbnail) {
// stop at the first image of the first non-empty slideshow
break;
}
}
}
}
return parsedFields;
}
private Section parseSection(Element textContent) {
Section section = new Section();
section.group = Integer.valueOf(
textContent.getAttributeValue("group"));
section.ordinal = Integer.valueOf(
textContent.getAttributeValue("ordinal"));
section.title = textContent.getAttributeValue("title");
section.content = ProviderXmlParserUtils.getTextContent(
textContent);
return section;
}
@SuppressWarnings("unchecked")
private List<Image> parseImages(Section section,
Element textContent) throws Exception {
List<Image> images = new ArrayList<Image>();
List<Element> visualContents =
textContent.getChildren("visualContent");
for (Element visualContent : visualContents) {
Image image = new Image();
image.group = Integer.valueOf(
visualContent.getAttributeValue("group"));
image.ordinal = Integer.valueOf(
visualContent.getAttributeValue("ordinal"));
image.title = visualContent.getAttributeValue("alt");
image.name =
visualContent.getAttributeValue("genContentID") +
"t." + visualContent.getAttributeValue("mediaType");
Element visualLink = visualContent.getChild("visualLink");
if (visualLink != null) {
image.content = fetchDescription(
visualLink.getAttributeValue("projectTypeID"),
visualLink.getAttributeValue("genContentID"));
}
images.add(image);
}
Collections.sort(images, groupOrdinalComparator);
return images;
}
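// Like parseImages() above, except that every image in a slideshow is
// assigned the enclosing slideshow's group, so the slideshow sorts as
// a single unit.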
@SuppressWarnings("unchecked")
private List<Image> parseSlideshow(int group,
Element textContent) throws Exception {
List<Image> images = new ArrayList<Image>();
List<Element> visualContents =
textContent.getChildren("visualContent");
for (Element visualContent : visualContents) {
Image image = new Image();
image.group = group;
image.ordinal = Integer.valueOf(
visualContent.getAttributeValue("ordinal"));
image.title = visualContent.getAttributeValue("alt");
image.name =
visualContent.getAttributeValue("genContentID") +
"t." + visualContent.getAttributeValue("mediaType");
Element visualLink = visualContent.getChild("visualLink");
if (visualLink != null) {
image.content = fetchDescription(
visualLink.getAttributeValue("projectTypeID"),
visualLink.getAttributeValue("genContentID"));
}
images.add(image);
}
Collections.sort(images, groupOrdinalComparator);
return images;
}
private String fetchDescription(String projectTypeId,
String imageContentId) throws Exception {
String text = ProviderXmlParserUtils.readStringFromUrl(
StringUtils.join(new String[] {
IMAGE_TEXT_URL_PREFIX, projectTypeId, imageContentId
}, "__") + ".xml");
if (StringUtils.isEmpty(text)) {
return text;
}
SAXBuilder builder = new SAXBuilder();
Document doc = builder.build(new InputSource(
new ByteArrayInputStream(text.getBytes())));
Element root = doc.getRootElement();
return ProviderXmlParserUtils.getTextContent(root);
}
}
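The ProviderXmlParserUtils helper class referenced above is not shown in this post. For reference, convertToJson() and readStringFromUrl() are essentially thin wrappers over Jackson and commons-io; a minimal sketch of assumed implementations (getTextContent() and convertToIso8601() are omitted) might look like this:

// Sketch only (not the actual class): assumed implementations of two
// ProviderXmlParserUtils helpers used by Prov1XmlProcessor above.
package com.mycompany.nutch.parse.xml;

import java.io.InputStream;
import java.net.URL;

import org.apache.commons.io.IOUtils;
import org.codehaus.jackson.map.ObjectMapper;

public class ProviderXmlParserUtils {

  private static final ObjectMapper MAPPER = new ObjectMapper();

  // serialize sections/images/slideshows into a JSON string for
  // storage in the WebPage metadata
  public static String convertToJson(Object obj) throws Exception {
    return MAPPER.writeValueAsString(obj);
  }

  // pull the companion XML file for an image over HTTP and return
  // its contents as a string
  public static String readStringFromUrl(String url) throws Exception {
    InputStream istream = new URL(url).openStream();
    try {
      return IOUtils.toString(istream, "UTF-8");
    } finally {
      istream.close();
    }
  }
}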
With the changes to the parser, we can now extract the section, image and slideshow content into data structures (a List of Maps, or a List of Lists of Maps for slideshows) and write them out as JSON strings in the WebPage metadata. So far we still have a 1-document to 1-WebPage correspondence.
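For completeness, the glue that copies these parsed fields into the WebPage lives in the Provider XML Parser plugin, which is not shown in this post. Assuming the Gora-generated WebPage API in the NutchGora branch (putToMetadata), it does something roughly like the following for each u_* field; the helper method is purely illustrative:

// Sketch (assumed plugin code): copy a parsed field into WebPage
// metadata so SolrSubpageIndexerJob can read it back later, e.g.
// putMetadata(page, parsedFields, ProviderXmlFields.u_sections.name());
private void putMetadata(WebPage page, Map<String,String> parsedFields,
    String fieldName) throws Exception {
  String json = parsedFields.get(fieldName);
  if (json != null) {
    page.putToMetadata(new Utf8(fieldName),
      ByteBuffer.wrap(json.getBytes("UTF-8")));
  }
}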
However, the search layer wants these documents explicitly split out. Nutch does not have any functionality that supports converting a single WebPage into multiple NutchDocuments for indexing. The closest is an IndexingFilter plugin, but that takes a NutchDocument and either adds/deletes/updates fields within the same NutchDocument and returns it, or returns a new NutchDocument; i.e., you can only do 1:1.
So I decided to do this in a separate stage after (bin/nutch) solrindex. This is modeled, as many Nutch tools are, as a Hadoop Map-Reduce job. The Mapper extends GoraMapper and reads WebPages from Cassandra. For each WebPage, the metadata is scanned for the JSON fields. If they exist, the JSON strings are parsed back into the appropriate Collections, and a new NutchDocument is written out for each element in the Collection. The reducer writes these NutchDocuments in batches to Solr.
As before, some of the code in the mapper is application-specific. We look for specific metadata values in each document and use them to generate subpage fields to populate into the NutchDocument. Once you have figured out what your section/image/slideshow looks like (based on search client requirements), this should be pretty generic. Here's the code for the SolrSubpageIndexerJob.
// Source: src/java/com/mycompany/nutch/subpageindexer/SolrSubpageIndexerJob.java
package com.mycompany.nutch.subpageindexer;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.Date;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import org.apache.avro.util.Utf8;
import org.apache.commons.codec.digest.DigestUtils;
import org.apache.commons.lang.StringUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.apache.gora.mapreduce.GoraMapper;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.indexer.IndexerJob;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.indexer.solr.SolrConstants;
import org.apache.nutch.metadata.Nutch;
import org.apache.nutch.storage.Mark;
import org.apache.nutch.storage.StorageUtils;
import org.apache.nutch.storage.WebPage;
import org.apache.nutch.util.Bytes;
import org.apache.nutch.util.NutchConfiguration;
import org.apache.nutch.util.NutchJob;
import org.apache.nutch.util.TableUtil;
import org.apache.nutch.util.ToolUtil;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.util.DateUtil;
import org.codehaus.jackson.map.ObjectMapper;
import org.codehaus.jackson.type.TypeReference;
/**
* Map Reduce job to read the database, parse sections and
* images out, and write them out to SOLR as separate sub-pages.
*/
public class SolrSubpageIndexerJob extends IndexerJob {
private static Log LOG = LogFactory.getLog(SolrSubpageIndexerJob.class);
private static Utf8 U_SECTIONS = new Utf8("u_sections");
private static Utf8 U_IMAGES = new Utf8("u_images");
private static Utf8 U_SLIDESHOWS = new Utf8("u_slideshows");
private static final Collection<WebPage.Field> FIELDS =
new HashSet<WebPage.Field>();
static {
FIELDS.addAll(Arrays.asList(WebPage.Field.values()));
}
public static class SolrSubpageIndexerJobMapper
extends GoraMapper<String,WebPage,Text,NutchDocument> {
private Utf8 batchId;
@Override
public void setup(Context ctx) throws IOException {
Configuration conf = ctx.getConfiguration();
batchId = new Utf8(conf.get(Nutch.ARG_BATCH));
}
@Override
public void map(String key, WebPage page, Context ctx)
throws IOException, InterruptedException {
// check to see if this page is parsed and ready for indexing
String url = TableUtil.unreverseUrl(key);
Utf8 mark = Mark.PARSE_MARK.checkMark(page);
if (! NutchJob.shouldProcess(mark, batchId)) {
LOG.info("Skipping " +
TableUtil.unreverseUrl(key) + "; different batch id");
return;
}
Map<Utf8,ByteBuffer> metadata = page.getMetadata();
ObjectMapper mapper = new ObjectMapper();
if (metadata.get(U_SECTIONS) != null) {
String sectionJson = Bytes.toString(Bytes.toBytes(
metadata.get(U_SECTIONS)));
List<Map<String,Object>> sections = mapper.readValue(
sectionJson,
new TypeReference<List<Map<String,Object>>>() {});
for (Map<String,Object> section : sections) {
NutchDocument doc = new NutchDocument();
Integer group = (Integer) section.get("group");
Integer ordinal = (Integer) section.get("ordinal");
String title = (String) section.get("title");
String content = (String) section.get("content");
String sid = StringUtils.join(new String[] {
String.valueOf(group), String.valueOf(ordinal)
}, ".");
String newKey = TableUtil.reverseUrl(StringUtils.join(
new String[] {url, sid}, "-"));
populateCommonFields(doc, page, newKey, title, content);
doc.add("u_idx", "section");
doc.add("u_disp", "S"); // section - dont show on serp
doc.add("s_parent", key);
doc.add("s_sortorder", sid);
doc.add("s_sid", sid);
doc.add("title", title);
doc.add("content", content);
ctx.write(new Text(newKey), doc);
}
}
if (metadata.get(U_IMAGES) != null) {
String imageJson = Bytes.toString(Bytes.toBytes(
metadata.get(U_IMAGES)));
List<Map<String,Object>> images = mapper.readValue(
imageJson,
new TypeReference<List<Map<String,Object>>>() {});
for (Map<String,Object> image : images) {
NutchDocument doc = new NutchDocument();
int group = (Integer) image.get("group");
int ordinal = (Integer) image.get("ordinal");
String title = (String) image.get("title");
String content = (String) image.get("content");
String name = (String) image.get("name");
String newKey = TableUtil.reverseUrl(StringUtils.join(
new String[] {url, name}, "-"));
populateCommonFields(doc, page, newKey, title, content);
doc.add("u_idx", "image");
doc.add("u_disp", "M"); // treated as main for search
doc.add("s_parent", key);
doc.add("s_sid", name);
doc.add("s_sortorder", StringUtils.join(new String[] {
String.valueOf(group), String.valueOf(ordinal)
}, "."));
doc.add("title", title);
doc.add("s_content", content); // for search AJAX component
doc.add("content", content);
ctx.write(new Text(newKey), doc);
}
}
if (metadata.get(U_SLIDESHOWS) != null) {
String slideshowJson = Bytes.toString(Bytes.toBytes(
metadata.get(U_SLIDESHOWS)));
List<List<Map<String,Object>>> slideshows =
mapper.readValue(slideshowJson,
new TypeReference<List<List<Map<String,Object>>>>() {});
int sortOrder = 0;
for (List<Map<String,Object>> slideshow : slideshows) {
if (slideshow.size() > 0) {
// metadata is from the first image in slideshow
// content is the JSON for the slideshow - application
// may parse and use JSON for rendering
Map<String,Object> image = slideshow.get(0);
NutchDocument doc = new NutchDocument();
String title = (String) image.get("title");
String content = (String) image.get("content");
String name = (String) image.get("name");
String newKey = TableUtil.reverseUrl(StringUtils.join(
new String[] {url, name}, "-"));
populateCommonFields(doc, page, newKey, title, content);
doc.add("u_idx", "slideshow");
doc.add("u_disp", "M"); // treated as main for SERP
doc.add("s_parent", key);
doc.add("s_sid", name);
doc.add("s_sortorder", String.valueOf(sortOrder));
doc.add("title", title);
String json = mapper.writeValueAsString(slideshow);
doc.add("content_s", json); // for search AJAX component
doc.add("content", json);
ctx.write(new Text(newKey), doc);
}
sortOrder++;
}
}
}
private void populateCommonFields(NutchDocument doc,
WebPage page, String key, String title, String content) {
if (page.isReadable(WebPage.Field.BASE_URL.getIndex())) {
doc.add("url", TableUtil.toString(page.getBaseUrl()));
doc.add("id", key);
doc.add("boost", String.valueOf(0.0F));
doc.add("digest", DigestUtils.md5Hex(title+content));
doc.add("tstamp", DateUtil.getThreadLocalDateFormat().
format(new Date(page.getFetchTime())));
}
}
}
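// The reducer batches the subpage documents and sends them to Solr in
// chunks of commitSize documents, with a single commit at the end in
// cleanup().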
public static class SolrSubpageIndexerJobReducer
extends Reducer<Text,NutchDocument,Text,NutchDocument> {
private int commitSize;
private SolrServer server;
private List<SolrInputDocument> sdocs =
new ArrayList<SolrInputDocument>();
@Override
public void setup(Context ctx) throws IOException {
Configuration conf = ctx.getConfiguration();
this.server = new CommonsHttpSolrServer(
conf.get(Nutch.ARG_SOLR));
this.commitSize = conf.getInt(
SolrConstants.COMMIT_SIZE, 1000);
}
@Override
public void reduce(Text key, Iterable<NutchDocument> values,
Context ctx) throws IOException, InterruptedException {
for (NutchDocument doc : values) {
SolrInputDocument sdoc = new SolrInputDocument();
for (String fieldname : doc.getFieldNames()) {
sdoc.addField(fieldname, doc.getFieldValue(fieldname));
}
sdocs.add(sdoc);
if (sdocs.size() >= commitSize) {
try {
server.add(sdocs);
} catch (SolrServerException e) {
throw new IOException(e);
}
sdocs.clear();
}
}
}
@Override
public void cleanup(Context ctx) throws IOException {
try {
if (sdocs.size() > 0) {
try {
server.add(sdocs);
} catch (SolrServerException e) {
throw new IOException(e);
}
sdocs.clear();
}
server.commit();
} catch (SolrServerException e) {
throw new IOException(e);
}
}
}
@Override
public Map<String,Object> run(Map<String,Object> args) throws Exception {
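// currentJob and results are protected fields inherited from the
// NutchTool base class (via IndexerJob).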
String solrUrl = (String) args.get(Nutch.ARG_SOLR);
if (StringUtils.isNotEmpty(solrUrl)) {
getConf().set(Nutch.ARG_SOLR, solrUrl);
}
String batchId = (String) args.get(Nutch.ARG_BATCH);
if (StringUtils.isNotEmpty(batchId)) {
getConf().set(Nutch.ARG_BATCH, batchId);
}
currentJob = new NutchJob(getConf(), "solr-subpage-index");
StorageUtils.initMapperJob(currentJob, FIELDS, Text.class,
NutchDocument.class, SolrSubpageIndexerJobMapper.class);
currentJob.setMapOutputKeyClass(Text.class);
currentJob.setMapOutputValueClass(NutchDocument.class);
currentJob.setReducerClass(SolrSubpageIndexerJobReducer.class);
currentJob.setNumReduceTasks(5);
currentJob.waitForCompletion(true);
ToolUtil.recordJobStatus(null, currentJob, results);
return results;
}
@Override
public int run(String[] args) throws Exception {
if (args.length < 2) {
System.err.println("Usage: SolrSubpageIndexerJob <solr url> " +
"(<batch_id> | -all)");
return -1;
}
LOG.info("SolrSubpageIndexerJob: starting");
run(ToolUtil.toArgMap(
Nutch.ARG_SOLR, args[0],
Nutch.ARG_BATCH, args[1]));
LOG.info("SolrSubpageIndexerJob: success");
return 0;
}
public static void main(String[] args) throws Exception {
final int res = ToolRunner.run(NutchConfiguration.create(),
new SolrSubpageIndexerJob(), args);
System.exit(res);
}
}
Configuration-wise, the new fields need to be added to the Solr schema.xml file:
<schema name="nutch" version="1.4">
<!-- Source: conf/schema.xml -->
...
<fields>
<!-- user defined fields -->
<field name="u_idx" type="string" stored="true" indexed="true"/>
<field name="u_contentid" type="string" stored="true" indexed="true"/>
<field name="u_category" type="string" stored="true" indexed="true"/>
<field name="u_lang" type="string" stored="true" indexed="true"/>
<field name="u_reviewdate" type="string" stored="true" indexed="false"/>
<field name="u_reviewers" type="string" stored="true" indexed="false"/>
<field name="u_thumbnail" type="string" stored="true" indexed="false"/>
<field name="u_disp" type="string" stored="true" indexed="true"/>
<!-- user defined subpage fields for main (ajax) -->
<field name="u_sections" type="string" stored="true" indexed="false"/>
<field name="u_images" type="string" stored="true" indexed="false"/>
<field name="u_slideshows" type="string" stored="true" indexed="false"/>
<!-- user defined subpage fields for sub -->
<field name="s_parent" type="string" stored="true" indexed="true"/>
<field name="s_sid" type="string" stored="true" indexed="true"/>
<field name="s_sortorder" type="float" stored="true" indexed="true"/>
<field name="content_s" type="string" stored="true" indexed="false"/>
</fields>
<uniqueKey>id</uniqueKey>
<defaultSearchField>content</defaultSearchField>
<solrQueryParser defaultOperator="OR"/>
</schema>
and to the solrindex-mapping.xml file:
<?xml version="1.0" encoding="UTF-8"?>
<!-- Source: conf/solrindex-mapping.xml -->
<mapping>
<fields>
<field dest="content" source="content"/>
<field dest="site" source="site"/>
<field dest="title" source="title"/>
<field dest="host" source="host"/>
<field dest="segment" source="segment"/>
<field dest="boost" source="boost"/>
<field dest="digest" source="digest"/>
<field dest="tstamp" source="tstamp"/>
<!-- user defined -->
<field dest="u_idx" source="u_idx"/>
<field dest="u_contentid" source="u_contentid"/>
<field dest="u_category" source="u_category"/>
<field dest="u_lang" source="u_lang"/>
<field dest="u_reviewdate" source="u_reviewdate"/>
<field dest="u_reviewers" source="u_reviewers"/>
<field dest="u_thumbnail" source="u_thumbnail"/>
<field dest="u_disp" source="u_disp"/>
<!-- user-defined subpage fields for main -->
<field dest="u_sections" source="u_sections"/>
<field dest="u_images" source="u_images"/>
<field dest="u_slideshows" source="u_slideshows"/>
<!-- user-defined subpage fields for sub -->
<field dest="s_parent" source="s_parent"/>
<field dest="s_sid" source="s_sid"/>
<field dest="s_sortorder" source="s_sortorder"/>
<field dest="s_content" source="s_content"/>
</fields>
<uniqueKey>id</uniqueKey>
</mapping>
Running
Running the code is just a matter of tacking on an extra step after the Nutch subcommands generate, fetch, parse, updatedb and solrindex, like so:
sujit@cyclone:local$ bin/nutch \
com.mycompany.nutch.subpageindexer.SolrSubpageIndexerJob \
http://localhost:8983/solr/ -all
As you can see, the new subcommand (invoked by its class name, since there is no alias for it within bin/nutch) takes parameters identical to those of SolrIndexerJob.
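If you would rather have an alias, bin/nutch is just a shell script that maps command names to classes, so you could add an entry for the new job there. A hypothetical sketch (the exact structure of the script differs between Nutch versions, and the alias name is made up):

# hypothetical addition to the command dispatch in bin/nutch
elif [ "$COMMAND" = "solrsubpageindex" ] ; then
  CLASS=com.mycompany.nutch.subpageindexer.SolrSubpageIndexerJob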
Once you run it, you should see many more records in your index, and you can search for the subpages with standard Lucene queries from the Solr admin page to verify that everything worked correctly.
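For example, from the Solr admin query form, queries along the following lines should bring back the new subpage records. The field names are the ones defined above; the s_parent value is the reversed-URL row key of the container document, shown here with a made-up URL:

u_idx:section
u_idx:image AND u_disp:M
s_parent:"com.mycompany.www:http/content/12345"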
Comments

Ryan: Sujit, I cannot find the source code for org.apache.nutch.util.ToolUtil anywhere. The Nutch 2.0 download I found was from Oct 2010 and doesn't use that class, and I can't seem to find a newer build of 2.0. Any tips on where to find this class? I need to create new sub-documents, and modifying your code seems like a good start.

Sujit: Hi Ryan, I found it in the NutchGora branch, from the Nutch svn repository.

Ryan: Got it. I just kept missing the 'workspace' directory last night and it was driving me crazy. Thanks.
ReplyDeleteHi,
ReplyDeleteI started implementation of similar approach based on IndexerJob. I have problem with running:
java.lang.RuntimeException: java.lang.ClassNotFoundException: aa.bb.MyMapper
main in IndexerJob class starts, but Mapper class cannot be found...
Dou you have any ideas?
Thank you...
Jaroslav
Sujit: Hi Jaroslav, sounds like a classpath problem. I try to bypass these issues by doing the following: (1) do my development inside the Nutch source tree, i.e., after I download it, I create a directory under src/java; (2) use "ant runtime" to compile my code and deploy it to the runtime/local subdirectory; and (3) run my jobs from runtime/local. I am guessing you are probably not doing this and attempting to deploy manually?
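For reference, the workflow described in this reply looks roughly like this (a sketch only; the checkout path and package directory are illustrative):

# develop inside the Nutch source tree (NutchGora branch checkout)
cd /path/to/nutch
mkdir -p src/java/com/mycompany/nutch/subpageindexer
# compile and deploy the custom code along with Nutch to runtime/local
ant runtime
# run jobs from the local runtime
cd runtime/local
bin/nutch com.mycompany.nutch.subpageindexer.SolrSubpageIndexerJob \
  http://localhost:8983/solr/ -all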
ReplyDeleteHi Sujit:
ReplyDeleteI'm trying to use a very similar approach to index each page of a PDF file as a new document in an specific solr core. This approach is based on the need that provide not only the document where a search criteria is found but also the page.
The approach you follow could be implement in the current nutch 1.6 branch? Any advice at all?
Sujit: Hi Jorge, I think it should be possible to do what you want using this approach. The only change is that you will have to read your NutchDocument from a sequence file instead of a NoSQL database. One (space-saving) suggestion may be to save an array of character offsets into each page of the PDF document's text content during parsing, and use these offsets to split the document into subsections while writing to Solr.
Jorge: Would it be possible, instead of getting the content from an extra database or file, to get the data from the URL document stored in the Nutch segments (or even better, from a higher level)? The mechanism I'm thinking of is: store each PDF page in the NutchDocument metadata (using a parser plugin) and then store each of these metadata fields as a new NutchDocument, eliminating the use of an external storage layer. What do you think of this?

Sujit: Yes, good idea, this would work perfectly nicely too. You could do the splitting in the Mapper.
ReplyDeleteHello Sujit! I would like to ask you if it is possible to put to arg map BATCH id which reflects real batch id collected after crawling (because I think that parameter "-all" indicates that every record from hbase will be indexed again, am I right?)
ReplyDeleteJan
Hi Jan, yes, you are right about the -all behavior, and passing in the batch id instead should work to restrict by batch id similar to the other processes.
ReplyDelete