Regular readers may imagine that I am making up for lost time with this mid-week post. Actually, it's just that I have a prior engagement over the weekend which will probably prevent me from posting then, and besides, I am done with this stuff, so hopefully it helps someone that much sooner :-).
Background
The idea is that, given a machine-readable set of post data, perhaps an RSS/Atom feed or a dump from some other CMS, we should be able to import the posts into a local Drupal installation. I had toyed with using the Feeds module earlier, but while it works beautifully (with the PHP memory_limit upped to 64M) with an Atom feed from my blog, making it work with a non-standard XML feed would require custom subclasses of one or more of the FeedFetcher, FeedParser or FeedProcessor classes. Check out these screencasts if you are interested in setting up your import this way. A more general solution would be to point Drupal to a Java proxy web application which converts incoming custom formats into some sort of "common" uber-format, and then have custom subclasses of the Feed components on the Drupal end (via a custom module) that parse and process incoming nodes using a set of shared conventions. However, a colleague suggested using Drupal's XMLRPC service, and that turns out to be much simpler and probably just as effective, so that's what I ended up doing.
For a proof of concept, I decided to use as input the RSS feed from my blog, containing the 25 most recent articles, and see if I could import them into Drupal over XMLRPC using a Java client. Along with the title and body, I also decided to import the original URL and pubDate fields into custom CCK fields, and the category tags into a custom taxonomy. Here is what I had to do.
Create a custom type in Drupal
If you are aggregating different types of feeds into Drupal, then it is likely that each feed will have some fields that are unique to that type of feed. So for each feed, I envision having to create a custom content type. Here is the content type (blogger_story) I created in Drupal for my blogger feed.
As you can see, it's basically the Story type, with two CCK fields (field_origurl and field_pubdate), as well as a Taxonomy field (after Title). The Taxonomy field is called Blogger_Tags and is attached to the BloggerStory type, as shown below:
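For reference, this is roughly the shape of the node struct that the Java client will eventually send over XMLRPC for this content type - a sketch only, using placeholder values; the actual construction (including the CCK and taxonomy nesting) is done by the feed client further down.

// Sketch of the XMLRPC node struct for a blogger_story (placeholder values).
Map<String,Object> storyObj = new HashMap<String,Object>();
storyObj.put("type", "blogger_story");
storyObj.put("title", "Some post title");
storyObj.put("body", "Post body...");
// CCK fields: field_name => list of {"value" => ...} structs
storyObj.put("field_origurl",
    Arrays.asList(Collections.singletonMap("value", "http://example.com/post")));
storyObj.put("field_pubdate",
    Arrays.asList(Collections.singletonMap("value", "2009-04-01T10:00:00")));
// Taxonomy: vocabulary id => list of term ids
storyObj.put("taxonomy",
    Collections.singletonMap("1", Arrays.asList(12, 15)));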
Loosen up some permissions
I allowed anonymous users to create the blogger_story content type in order to bypass Drupal authentication. This may or may not be acceptable for your setup. I do log in from the Java client, but my tests showed that passing in the resulting session id did not seem to make a difference - it kept giving me "Access Denied" until I allowed anonymous users to create content. It's possible that I am missing some parameter setting here, though.
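If you would rather not loosen the permission, one thing that might be worth trying - this is only a guess on my part, not something I have verified - is to hand the sessid back to Drupal as a session cookie on subsequent requests, since Drupal reads the session from a cookie rather than from the XMLRPC payload. Here is a minimal sketch of a hypothetical addition to the feed client class shown further down, assuming a Commons HttpClient based transport and that you know your Drupal session cookie name.

// Hypothetical addition to BloggerFeedClient (unverified): after userLogin()
// returns the sessid, send it back as a session cookie on the HttpClient
// behind the XMLRPC transport. The cookie name ("PHPSESSID") and host are
// assumptions - check your Drupal settings for the real session cookie name.
public void useSession(String host, String sessionId) {
  org.apache.commons.httpclient.HttpClient httpClient =
      new org.apache.commons.httpclient.HttpClient();
  httpClient.getState().addCookie(new org.apache.commons.httpclient.Cookie(
      host, "PHPSESSID", sessionId, "/", null, false));
  org.apache.xmlrpc.client.XmlRpcCommonsTransportFactory transportFactory =
      new org.apache.xmlrpc.client.XmlRpcCommonsTransportFactory(client);
  transportFactory.setHttpClient(httpClient);
  client.setTransportFactory(transportFactory);
}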
Create a Java bean to hold the blog data
I created a simple JavaBean representing the BloggerStory content to hold the blog data as it is parsed off the XML, and used it to populate the XMLRPC parameters. Most of it is boilerplate, but I include it for completeness below.
// Source: src/main/java/com/mycompany/myapp/blogger/DrupalBloggerStory.java
package com.mycompany.myapp.blogger;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
/**
* Standard Drupal Story with two CCK fields, and a multi-select
* taxonomy field.
*/
public class DrupalBloggerStory {
public static final int BLOGGER_TAGS_VOCABULARY_ID = 1;
public static final String TAGS_ATTR_NAME = "domain";
public static final String TAGS_ATTR_VALUE =
"http://www.blogger.com/atom/ns#";
private static final SimpleDateFormat PUBDATE_INPUT_FORMAT =
new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss Z");
private static final SimpleDateFormat PUBDATE_OUTPUT_FORMAT =
new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss");
private String title;
private List<String> tags = new ArrayList<String>();
private String body;
private String originalUrl;
private Date pubDate;
public String getType() {
return "blogger_story";
}
public String getTitle() {
return title;
}
public void setTitle(String title) {
this.title = title;
}
public List<String> getTags() {
return tags;
}
public void setTags(List<String> tags) {
this.tags = tags;
}
public String getBody() {
return body;
}
public void setBody(String body) {
this.body = body;
}
public String getOriginalUrl() {
return originalUrl;
}
public void setOriginalUrl(String originalUrl) {
this.originalUrl = originalUrl;
}
/** Convert to Drupal's date format */
public String getPubDate() {
return PUBDATE_OUTPUT_FORMAT.format(pubDate);
}
/** Convert from RSS feed pubDate */
public void setPubDate(String pubDate) throws ParseException {
this.pubDate = PUBDATE_INPUT_FORMAT.parse(pubDate);
}
}
Build the Feed Client
The feed client is a thin wrapper over an Apache XMLRPC client, exposing Java methods that mirror the corresponding Drupal XMLRPC service methods. It also handles the details of populating CCK and Taxonomy (multi-select only) fields. Here is the code for the feed client.
// Source: src/main/java/com/mycompany/myapp/blogger/BloggerFeedClient.java
package com.mycompany.myapp.blogger;
import java.net.URL;
import java.util.Arrays;
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import org.apache.xmlrpc.XmlRpcException;
import org.apache.xmlrpc.client.XmlRpcClient;
import org.apache.xmlrpc.client.XmlRpcClientConfigImpl;
import org.apache.xmlrpc.client.XmlRpcCommonsTransportFactory;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import com.mycompany.myapp.xmlrpc.CustomXmlRpcCommonsTransportFactory;
/**
* Client to connect to the Drupal XMLRPC service. Exposes
* the required services as client side Java method calls.
*/
public class BloggerFeedClient {
private final Logger logger = LoggerFactory.getLogger(getClass());
private XmlRpcClient client;
public BloggerFeedClient(String serviceUrl) throws Exception {
XmlRpcClientConfigImpl config = new XmlRpcClientConfigImpl();
config.setServerURL(new URL(serviceUrl));
this.client = new XmlRpcClient();
// client.setTransportFactory(new XmlRpcCommonsTransportFactory(client));
// logging transport - see my previous post for details
client.setTransportFactory(new CustomXmlRpcCommonsTransportFactory(client));
config.setEnabledForExceptions(true);
config.setEnabledForExtensions(true);
client.setConfig(config);
}
@SuppressWarnings("unchecked")
public String userLogin(String user, String pass) throws XmlRpcException {
Map<String,Object> result =
(Map<String,Object>) client.execute("user.login",
new String[] {user, pass});
return (result == null ? null : (String) result.get("sessid"));
}
public void taxonomySaveTerms(int vocabularyId, Collection<String> terms)
throws XmlRpcException {
for (String term : terms) {
Map<String,Object> termObj = new HashMap<String,Object>();
termObj.put("vid", vocabularyId);
termObj.put("name", term);
int status = (Integer) client.execute(
"taxonomy.saveTerm", new Object[] {termObj});
logger.info("Added term:[" + term + "] " +
(status == 0 ? "Ok" : "Failed"));
}
}
/**
* "Implementation" (in the Drupal sense) of the node.save XMLRPC
* method for DrupalBloggerStory.
*/
public String bloggerStorySave(DrupalBloggerStory story,
Map<String,Integer> termTidMap) throws XmlRpcException {
Map<String,Object> storyObj = new HashMap<String,Object>();
storyObj.put("type", story.getType()); // mandatory
storyObj.put("title", story.getTitle());
storyObj.put("body", story.getBody());
storyObj.put("field_origurl", mkCck(story.getOriginalUrl()));
storyObj.put("field_pubdate", mkCck(story.getPubDate()));
storyObj.put("uid", 1); // admin
Map<String,List<String>> tags = new HashMap<String,List<String>>();
tags.put(String.valueOf(
DrupalBloggerStory.BLOGGER_TAGS_VOCABULARY_ID), story.getTags());
storyObj.put("taxonomy", mkTaxonomy(termTidMap, tags));
String nodeId = (String) client.execute(
"node.save", new Object[] {storyObj});
return String.valueOf(nodeId);
}
/**
* CCK fields are stored as field_${field_name}[0]['value']
* in the node object, so that's what we build here for the
* XMLRPC payload. I have seen other formats too, so check in
* the page source for the node edit form.
*/
@SuppressWarnings("unchecked")
private List<Map<String,String>> mkCck(String value) {
Map<String,String> struct = new HashMap<String,String>();
struct.put("value", value);
return Arrays.asList(new Map[] {struct});
}
/**
* During editing forms, multi-select taxonomy entries are
* stored as:
* node->taxonomy["$vid"][$tid1, $tid2, ...]
* Tag fields are stored differently:
* node->taxonomy["tags"]["$vid"][$tid1, $tid2, ...]
* The entire thing is stored differently on node_load(), ie
* when loading a node from the db.
* node->taxonomy[$tid][stdClass::term_data]
* We just handle the multi-select taxonomy field case here.
*/
private Map<String,List<Integer>> mkTaxonomy(
Map<String,Integer> termTidMap,
Map<String,List<String>> tags) {
Map<String,List<Integer>> taxonomyValue =
new HashMap<String,List<Integer>>();
for (String vid : tags.keySet()) {
List<Integer> tids = new ArrayList<Integer>();
for (String tag : tags.get(vid)) {
tids.add(termTidMap.get(tag));
}
taxonomyValue.put(vid, tids);
}
return taxonomyValue;
}
}
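Before wiring it into the importers, here is roughly how the client is meant to be used - a minimal sketch with placeholder URL, credentials, dates and tids; the real call sites are in the two importer classes that follow.

// Minimal usage sketch for BloggerFeedClient (placeholder values throughout).
BloggerFeedClient feedClient =
    new BloggerFeedClient("http://localhost/drupal/xmlrpc.php");
feedClient.userLogin("import_user", "import_pass");

// First pass: save the taxonomy terms.
feedClient.taxonomySaveTerms(
    DrupalBloggerStory.BLOGGER_TAGS_VOCABULARY_ID,
    Arrays.asList("drupal", "java", "xmlrpc"));

// Second pass: save a story, supplying the term name to tid mapping
// (read from the term_data table, see the importer below).
Map<String,Integer> termTidMap = new HashMap<String,Integer>();
termTidMap.put("drupal", 12);
DrupalBloggerStory story = new DrupalBloggerStory();
story.setTitle("Hello from the importer");
story.setBody("Body of the post");
story.setOriginalUrl("http://sujitpal.blogspot.com/2009/04/some-post.html");
story.setPubDate("Wed, 01 Apr 2009 10:00:00 +0000");
story.getTags().add("drupal");
String nodeId = feedClient.bloggerStorySave(story, termTidMap);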
First pass: import vocabulary terms
To load terms we use the Drupal XMLRPC service taxonomy.saveTerm. I was doing this inline with the post import initially, but noticed that the service does not skip inserting terms which have already been added to the term_data (and term_hierarchy) tables. So I decided to do a first pass to extract all the category tags from the posts, remove duplicates, sort them alphabetically, and then shove them into Drupal. Here's the taxonomy import code.
// Source: src/main/java/com/mycompany/myapp/blogger/BloggerFeedCategoryTaxonomyImporter.java
package com.mycompany.myapp.blogger;
import java.io.File;
import java.io.FileInputStream;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
/**
* Parses the XML file and extracts a non-duplicate list of
* categories. We have to separate this out into its own process
* for two reasons:
* 1) the Drupal taxonomy.saveTerm does not seem to be smart
* enough to recognize duplicate terms.
* 2) the taxonomy.saveTerm does not return a tid value, instead
* it returns a 0 or 1 signifying success or failure, so
* it's not possible to get the tid value for the term
* inserted. We have to do a separate database call into
* the Drupal database to get this mapping when importing
* nodes.
*/
public class BloggerFeedCategoryTaxonomyImporter {
private final Logger logger = LoggerFactory.getLogger(getClass());
private BloggerFeedClient bloggerFeedClient;
public BloggerFeedCategoryTaxonomyImporter(String serviceUrl,
String drupalUser, String drupalPass) throws Exception {
bloggerFeedClient = new BloggerFeedClient(serviceUrl);
bloggerFeedClient.userLogin(drupalUser, drupalPass);
}
public void importTerms(String inputFile) throws Exception {
Set<String> terms = parseTerms(inputFile);
List<String> termsAsList = new ArrayList<String>(terms);
Collections.sort(termsAsList); // alphabetically
logger.debug("Inserting terms: " + termsAsList);
bloggerFeedClient.taxonomySaveTerms(
DrupalBloggerStory.BLOGGER_TAGS_VOCABULARY_ID, termsAsList);
}
private Set<String> parseTerms(String inputFile) throws Exception {
Set<String> terms = new HashSet<String>();
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLStreamReader parser = factory.createXMLStreamReader(
new FileInputStream(new File(inputFile)));
boolean inItem = false;
for (;;) {
int evt = parser.next();
if (evt == XMLStreamConstants.END_DOCUMENT) {
break;
}
switch (evt) {
case XMLStreamConstants.START_ELEMENT: {
String tag = parser.getName().getLocalPart();
if ("item".equals(tag)) {
inItem = true;
}
if (inItem) {
if ("category".equals(tag)) {
int nAttrs = parser.getAttributeCount();
for (int i = 0; i < nAttrs; i++) {
String attrName = parser.getAttributeName(i).getLocalPart();
String attrValue = parser.getAttributeValue(i);
if (DrupalBloggerStory.TAGS_ATTR_NAME.equals(attrName) &&
DrupalBloggerStory.TAGS_ATTR_VALUE.equals(attrValue)) {
terms.add(parser.getElementText());
}
}
}
}
break;
}
case XMLStreamConstants.END_ELEMENT: {
String tag = parser.getName().getLocalPart();
if ("item".equals(tag) && inItem) {
inItem = false;
}
break;
}
default:
break;
}
}
parser.close();
return terms;
}
}
We run the importer using the JUnit snippet below. This results in 41 categories being written to the term_data and term_hierarchy tables in the Drupal database.
@Test
public void testImportTaxonomyTerms() throws Exception {
BloggerFeedCategoryTaxonomyImporter importer =
new BloggerFeedCategoryTaxonomyImporter(
DRUPAL_XMLRPC_SERVICE_URL, DRUPAL_IMPORT_USER, DRUPAL_IMPORT_PASS);
importer.importTerms(IMPORT_XML_FILENAME);
}
Second pass: import the posts
We now create a slightly different parser to extract the various fields we need to populate the BloggerStory data from my blog's RSS XML file. Note that I could have just used the ROME FeedFetcher to parse the RSS into a SyndFeed, but the objective here is to be able to parse and load any XML feed, so I built the parser by hand. Here it is:
// Source: src/main/java/com/mycompany/myapp/blogger/BloggerFeedImporter.java
package com.mycompany.myapp.blogger;
import java.io.File;
import java.io.FileInputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.HashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
/**
* Parses a blog feed into a list of story beans and imports them into
* Drupal.
*/
public class BloggerFeedImporter {
private final Logger logger = LoggerFactory.getLogger(getClass());
private Map<String,Integer> bloggerTagsMap;
private BloggerFeedClient bloggerFeedClient;
public BloggerFeedImporter(
String serviceUrl, String drupalUser, String drupalPass,
String dbUrl, String dbUser, String dbPass) throws Exception {
bloggerTagsMap = loadTaxonomy(
DrupalBloggerStory.BLOGGER_TAGS_VOCABULARY_ID,
dbUrl, dbUser, dbPass);
logger.debug("bloggerTagsMap=" + bloggerTagsMap);
bloggerFeedClient = new BloggerFeedClient(serviceUrl);
bloggerFeedClient.userLogin(drupalUser, drupalPass);
}
public void importBlogs(String inputFile) throws Exception {
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLStreamReader parser = factory.createXMLStreamReader(
new FileInputStream(new File(inputFile)));
DrupalBloggerStory story = null;
boolean inItem = false;
for (;;) {
int evt = parser.next();
if (evt == XMLStreamConstants.END_DOCUMENT) {
break;
}
switch (evt) {
case XMLStreamConstants.START_ELEMENT: {
String tag = parser.getName().getLocalPart();
if ("item".equals(tag)) {
story = new DrupalBloggerStory();
inItem = true;
}
if (inItem) {
if ("pubDate".equals(tag)) {
story.setPubDate(parser.getElementText());
} else if ("category".equals(tag)) {
int nAttrs = parser.getAttributeCount();
for (int i = 0; i < nAttrs; i++) {
String attrName = parser.getAttributeName(i).getLocalPart();
String attrValue = parser.getAttributeValue(i);
if (DrupalBloggerStory.TAGS_ATTR_NAME.equals(attrName) &&
DrupalBloggerStory.TAGS_ATTR_VALUE.equals(attrValue)) {
story.getTags().add(parser.getElementText());
}
}
} else if ("title".equals(tag)) {
story.setTitle(parser.getElementText());
} else if ("description".equals(tag)) {
story.setBody(parser.getElementText());
} else if ("link".equals(tag)) {
story.setOriginalUrl(parser.getElementText());
}
}
break;
}
case XMLStreamConstants.END_ELEMENT: {
String tag = parser.getName().getLocalPart();
if ("item".equals(tag) && inItem) {
String nodeId = bloggerFeedClient.bloggerStorySave(
story, bloggerTagsMap);
logger.info("Saving blogger_story:[" + nodeId + "]: " +
story.getTitle());
inItem = false;
}
break;
}
default:
break;
}
}
parser.close();
}
private Map<String,Integer> loadTaxonomy(int vocabularyId,
String dbUrl, String dbUser, String dbPass) throws Exception {
Class.forName("com.mysql.jdbc.Driver").newInstance();
Connection conn = DriverManager.getConnection(
dbUrl, dbUser, dbPass);
Map<String,Integer> termTidMap = new HashMap<String,Integer>();
PreparedStatement ps = null;
ResultSet rs = null;
try {
ps = conn.prepareStatement(
"select name, tid from term_data where vid = ?");
ps.setInt(1, DrupalBloggerStory.BLOGGER_TAGS_VOCABULARY_ID);
rs = ps.executeQuery();
while (rs.next()) {
termTidMap.put(rs.getString(1), rs.getInt(2));
}
return termTidMap;
} finally {
if (rs != null) { try { rs.close(); } catch (SQLException e) {}}
if (ps != null) { try { ps.close(); } catch (SQLException e) {}}
}
}
}
Notice that we loaded up a map of term name to term id (tid) from the term_data table. This is because we only have the term names from our parsed content, but the node needs to be populated with a map of vocabulary id to a list of term ids (not term names). Running this code using the JUnit snippet below:
@Test
public void testImportBloggerStories() throws Exception {
BloggerFeedImporter importer =
new BloggerFeedImporter(DRUPAL_XMLRPC_SERVICE_URL,
DRUPAL_IMPORT_USER, DRUPAL_IMPORT_PASS,
DB_URL, DB_USER, DB_PASS);
importer.importBlogs(IMPORT_XML_FILENAME);
}
...results in the blog posts in the feed being imported into Drupal. Here is a screenshot of the preview page for one of the posts (the original is here). As you can see, all the fields and taxonomy entries came through correctly.
Using Drupal's XMLRPC interface to import data appears to be a fairly popular approach, going by the number of example clients in Python, PHP and C# available on the web. I haven't seen one in Java though, so hopefully this post fills that gap. Note that this example may not cover everything you need - you may have to import users, for example, or populate image or node reference fields, which are stored in a slightly different structure from the CCK fields; for those, look for hints in the node edit form. But hopefully this is a useful starting point.
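As an illustration of that kind of difference, a node reference field would probably be built like the sketch below. This is a guess based on how Drupal 6 nodereference fields typically appear (keyed by nid rather than value), so do check your own edit form before relying on it; the field name field_related_story is made up for the example.

// Hypothetical helper: node reference CCK fields are usually keyed by "nid"
// rather than "value" - verify against your node edit form.
private List<Map<String,Object>> mkNodeRef(int referencedNid) {
  Map<String,Object> struct = new HashMap<String,Object>();
  struct.put("nid", referencedNid);
  return Arrays.asList(struct);
}
// ...used like the other CCK fields when building the node struct:
// storyObj.put("field_related_story", mkNodeRef(123));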