Saturday, July 26, 2008

An RSS Feed Client in Java

In his article, "Demystifying RESTful Data Coupling", Steve Vinoski says:
Developers who favor technologies that promote interface specialization typically raise two specific objections to the uniform-interface constraint designed into the Representational State Transfer (REST) architectural style. One is that different resources should each have specific interfaces and methods that more accurately reflect their precise functionality. The other objection to the concept of a uniform interface is that it merely shifts all coupling issues and other problems to the data exchanged between client and server.

We have faced similar concerns from clients of our RSS 2.0 based REST API. The concerns are easier to address because our XML format is a well-known standard, and we can point clients to several existing RSS feed parsers, such as Mark Pilgrim's Python Universal Feed Parser, the ROME Fetcher, or the Jakarta FeedParser, to name a few. In addition, because of the popularity of RSS, almost all major programming languages have built-in support or contributed modules for parsing the various flavors of RSS, so clients can usually find an off-the-shelf parser or toolkit that works well with their programming language of choice.

However, thinking through Steve Vinoski's comment a little more with reference to my particular context, I came up with the idea of using the ROME SyndFeed object as a Data Transfer Object (DTO). Since ROME is a popular project, its data structures are well documented, both on its own website and in various books such as Dave Johnson's "RSS and Atom in Action", so client programmers can use publicly available documentation to figure out how to convert the SyndFeed into objects consumable by their application.

What makes the task easier is that ROME already has a Fetcher module, which takes care of various nuances such as handling the HTTP headers involved in fetching RSS feeds, local caching, and so on. While the generally available 0.9 release (at the time of this writing) does not support connection and read timeouts on the underlying HTTP client, the version in CVS (and presumably releases after 0.9) does, so I used that.

So what we provide is a "client library" consisting of a single class:

// ApiClient.java
package com.healthline.feeds.client;

import java.io.IOException;
import java.io.UnsupportedEncodingException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLEncoder;
import java.util.Map;
import java.util.UUID;

import org.apache.commons.lang.StringUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

import com.sun.syndication.feed.synd.SyndFeed;
import com.sun.syndication.fetcher.FetcherException;
import com.sun.syndication.fetcher.impl.HashMapFeedInfoCache;
import com.sun.syndication.fetcher.impl.HttpClientFeedFetcher;
import com.sun.syndication.io.FeedException;

/**
 * Client for API. Based on the ROME FeedFetcher project.
 * Provides a single execute() method to call the RSS-based web service.
 * The response is RSS 2.0 XML, which is converted into a SyndFeed object and 
 * returned to the caller to parse as needed.
 */
public class ApiClient {

  private final Log log = LogFactory.getLog(getClass());
  
  private URL serviceUrl;

  private HttpClientFeedFetcher fetcher = null;
  
  /**
   * Constructs an ApiClient instance.
   * @param serviceUrl the location of the service.
   * @param useLocalCache true if you want to cache responses locally.
   * @param connectTimeout the connection timeout (ms) for the network connection.
   * @param readTimeout the read timeout (ms) for the network connection.
   */
  public ApiClient(URL serviceUrl, boolean useLocalCache, int connectTimeout, 
      int readTimeout) {
    super();
    this.serviceUrl = serviceUrl;
    fetcher = new HttpClientFeedFetcher();
    fetcher.setUserAgent("MyApiClientFetcher-1.0");
    fetcher.setConnectTimeout(connectTimeout);
    fetcher.setReadTimeout(readTimeout);
    if (useLocalCache) {
      fetcher.setFeedInfoCache(HashMapFeedInfoCache.getInstance());
    }
  }
  
  /**
   * Executes a service request and returns a ROME SyndFeed object.
   *
   * @param methodName the methodName to execute.
   * @param params a Map of name value pairs.
   * @return a SyndFeed object.
   */
  public SyndFeed execute(String methodName, Map<String,String> params) {
    URL feedUrl = buildUrl(methodName, params);
    SyndFeed feed = null;
    try {
      feed = fetcher.retrieveFeed(feedUrl);
    } catch (FetcherException e) {
      throw new RuntimeException("Failed to fetch URL:[" + 
        feedUrl.toExternalForm() + "]. HTTP Response code:[" + 
        e.getResponseCode() + "]", e);
    } catch (FeedException e) {
      throw new RuntimeException("Failed to parse response for URL:[" + 
        feedUrl.toString() + "]", e);
    } catch (IOException e) {
      throw new RuntimeException("IO Error fetching URL:[" + 
        feedUrl.toString() + "]", e);
    }
    return feed;
  }

  /**
   * Convenience method to build up the request URL from the method name and
   * the Map of query parameters.
   * @param methodName the method name to execute.
   * @param params the Map of name value pairs of parameters.
   * @return the request URL.
   */
  private URL buildUrl(String methodName, Map<String,String> params) {
    StringBuilder urlBuilder = new StringBuilder(serviceUrl.toString());
    urlBuilder.append("/").append(methodName);
    int numParams = 0;
    for (String paramName : params.keySet()) {
      String paramValue = params.get(paramName);
      if (StringUtils.isBlank(paramValue)) {
        continue;
      }
      try {
        paramValue = URLEncoder.encode(paramValue, "UTF-8");
      } catch (UnsupportedEncodingException e) {
        // will never happen, but just in case it does, we throw the error up
        throw new RuntimeException(e);
      }
      urlBuilder.append(numParams == 0 ? "?" : "&").
      append(paramName).
      append("=").
      append(paramValue);
      numParams++;
    }
    try {
      if (log.isDebugEnabled()) {
        log.debug("Requesting:[" + urlBuilder.toString() + "]");
      }
      return new URL(urlBuilder.toString());
    } catch (MalformedURLException e) {
      throw new RuntimeException("Malformed URL:[" + urlBuilder.toString() + "]", e);
    }
  }
}

All the client has to do is instantiate this class with the appropriate parameters and then execute the service command. This is completely generic, by the way, and not tied to our API service in any way. As an example, I tried hitting the RSS feed for the National Public Radio (NPR) Top Stories page with the test code shown further down.

Based on our original requirement, the objective is to convert the SyndFeed object returned from the call to ApiClient.execute() to an appropriate user object. We call our user object SearchResult, and it is a POJO as shown below:

// SearchResult.java
public class SearchResult {

  private String title;
  private String url;
  private String summary;
  // auto-generated getters and setters removed for brevity
  ...
}
// NprApiClient.java
import java.net.URL;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import com.healthline.feeds.client.ApiClient;
import com.sun.syndication.feed.synd.SyndCategory;
import com.sun.syndication.feed.synd.SyndEntry;
import com.sun.syndication.feed.synd.SyndFeed;

public class NprApiClient {

  private static final String SERVICE_URL = "http://www.npr.org/rss";
  private static final boolean USE_CACHE = true;
  private static final int DEFAULT_CONN_TIMEOUT = 5000;
  private static final int DEFAULT_READ_TIMEOUT = 1000;
  
  private ApiClient apiClient;
  
  public NprApiClient() throws Exception {
    apiClient = new ApiClient(new URL(SERVICE_URL), USE_CACHE, DEFAULT_CONN_TIMEOUT, 
      DEFAULT_READ_TIMEOUT);
  }
  
  @SuppressWarnings("unchecked")
  public List<SearchResult> getTopStories() {
    Map<String,String> args = new HashMap<String,String>();
    args.put("id", "1001");
    SyndFeed feed = apiClient.execute("rss.php", args);
    List<SyndEntry> entries = feed.getEntries();
    List<SearchResult> results = new ArrayList<SearchResult>();
    for (SyndEntry entry : entries) {
      SearchResult result = new SearchResult();
      result.setTitle(entry.getTitle());
      result.setUrl(entry.getLink());
      result.setSummary(entry.getDescription().getValue());
      results.add(result);
    }
    return results;
  }
  
  public static void main(String[] args) {
    try {
      NprApiClient client = new NprApiClient();
      List<SearchResult> results = client.getTopStories();
      for (SearchResult result : results) {
        System.out.println(result.getTitle());
        System.out.println("URL:" + result.getUrl());
        System.out.println(result.getSummary());
        System.out.println("--");
      }
    } catch (Exception e) {
      System.err.println(e.getMessage());
      throw new RuntimeException(e);
    }
  }
}
Here are the (partial) results from the run. I have truncated the output after the first few stories, but you get the idea.

Housing Bill Clears Senate, Awaits Bush's Signature
URL:http://www.npr.org/templates/story/story.php?storyId=92964747&ft=1&f=1001
The Senate met in a rare Saturday session and gave final congressional approval to a wide-ranging 
housing bill.  The bill aims to bolster the sagging housing market and includes measures aimed at 
shoring up Fannie Mae and Freddie Mac. The president says he'll sign it when it reaches his desk, 
early next week.
--
What's The Deal With The XM-Sirius Merger?
URL:http://www.npr.org/templates/story/story.php?storyId=92960423&ft=1&f=1001
The FCC has approved the merger of XM and Sirius satellite radio after 17 months of behind-
the-scenes negotiations. While some critics have said the merger represents a monopoly, it 
appears that the two weak companies may be combining to form one weak company.
--
Military Tribunals Begin At Guantanamo
URL:http://www.npr.org/templates/story/story.php?storyId=92960420&ft=1&f=1001
The first war crimes trials since World War II started this week at Guantanamo Bay. Andrew 
McBride, a former Justice Department official, discusses the trials, as well as how Guantanamo's 
war crimes compare with those of 1945.
--
...

Although the above code is good enough for a standard RSS feed parsing client, I was not able to get results out of our custom tags (for the RSS-2.0 based API I spoke about earlier). I plan to investigate this, since we use a variety of open-source RSS custom modules (such as Amazon's OpenSearch) as well as our own home-grown custom module to satisfy several data requirements that cannot be accommodated by standard RSS 2.0. Because of this, it is important for our clients to be able to parse out our custom module and its contents from the SyndFeed object.

I will investigate this on my own and write about it in a future post. From what I see so far, the ROME Fetcher is not passing the custom module information through in the SyndFeed object it parses out of the XML. It is possible that I am just missing some configuration piece that would enable it. In the meantime, if you happen to know how to do this, I would really appreciate you letting me know.
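For reference, here is roughly what the retrieval side would look like once a custom module parser is registered with ROME (typically via a rome.properties file on the classpath). This is only a sketch, assuming the OpenSearchModule interface from the ROME modules subproject exposes a getTotalResults() method; check the module's Javadocs for the exact names in the version you use.

// Hypothetical sketch: pulling a custom module out of the SyndFeed
// returned by ApiClient.execute(). Assumes the OpenSearch module parser
// is on the classpath and registered in rome.properties.
import com.sun.syndication.feed.module.Module;
import com.sun.syndication.feed.module.opensearch.OpenSearchModule;
import com.sun.syndication.feed.synd.SyndFeed;

public class ModuleExtractor {

  // the namespace URI must match the one declared in the feed XML
  private static final String OPENSEARCH_URI = "http://a9.com/-/spec/opensearch/1.1/";

  public void printTotalResults(SyndFeed feed) {
    // getModule() returns null if no parser handled the namespace,
    // which is the symptom I am currently seeing
    Module module = feed.getModule(OPENSEARCH_URI);
    if (module != null) {
      OpenSearchModule osModule = (OpenSearchModule) module;
      System.out.println("totalResults:" + osModule.getTotalResults());
    }
  }
}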

Saturday, July 19, 2008

Automating Documentation

Lately I have been doing quite a bit of documentation of other people's code, on the premise (with apologies to George Bernard Shaw) that "those who can, code; those who can't, document" :-).

The documentation is primarily for home-grown application-level frameworks based on Spring, which allow us to plug in custom behavior through implementations of predefined classes at hook points in the standard strategy code. The hooks are exposed as bean properties of the strategy bean, and the properties default to classes that define the standard behavior. As you can imagine, this can quickly lead to XML hell, unless you know what the hooks allow and what custom classes already exist to modify a certain behavior. So documenting the custom classes and where they fit in can help new developers get up to speed on the framework more quickly.
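To make the hook idea concrete, here is a minimal sketch of what such a strategy class might look like. The class and property names are made up for illustration, not taken from our actual framework; the point is simply that the hook is a bean property with a sensible default that can be overridden in the Spring XML.

import java.util.ArrayList;
import java.util.List;

// Hypothetical strategy bean that exposes a hook as a bean property.
public class SearchStrategy {

  /** Hook interface: custom behavior is plugged in via implementations of this. */
  public interface ResultFilter {
    List<String> filter(List<String> results);
  }

  /** Default implementation, used when nothing is overridden in the XML. */
  public static class DefaultResultFilter implements ResultFilter {
    public List<String> filter(List<String> results) {
      return results;
    }
  }

  private ResultFilter resultFilter = new DefaultResultFilter();

  // Spring injects a custom implementation here, for example with
  // <property name="resultFilter" ref="myCustomFilter"/>
  public void setResultFilter(ResultFilter resultFilter) {
    this.resultFilter = resultFilter;
  }

  public List<String> search(String query) {
    List<String> rawResults = new ArrayList<String>();
    // ... standard search logic omitted ...
    return resultFilter.filter(rawResults);
  }
}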

Yet another reason to document is for non-coders to look at and suggest improvements. Due to the nature of our business, a lot of people outside the programming group have a very solid understanding of our technology, and are therefore in a great position to suggest improvements we (coders) haven't thought of. However, they don't write code anymore, so they cannot see in-line comments in code.

So while I firmly maintain (like most programmers) that the best place to write documentation is in-line in the code itself, there are enough reasons to spend the effort documenting the code for people who are yet to look at it, or who never will. However, rather than writing documentation separately from the code, my preferred approach is to generate the documentation from the in-line code comments.

This approach has (at least) three important advantages. One, it encourages programmers to write better in-line documentation in their classes. Two, it allows the documentation to keep up with a rapidly evolving code base without getting stale. Three, it eliminates the drudgery of writing documentation, one reason why, in any project, documentation never keeps up with code, unless you have a dedicated documentation writer for your project.

So, given an applicationContext.xml file, it's easy to pull out the names of the beans that are of a certain class, like so:

    // MyStrategyClass is a placeholder for the framework class whose beans
    // we want to document
    ApplicationContext context = 
      new ClassPathXmlApplicationContext("classpath:applicationContext.xml");
    Set<String> beanNames = new HashSet<String>();
    beanNames.addAll(Arrays.asList(context.getBeanNamesForType(MyStrategyClass.class)));

Given these bean names, we can use a standard XML parser toolkit such as JDOM to parse out the beans whose names are in our list of bean names.

    private static final Namespace BEANS_NS = 
      Namespace.getNamespace("http://www.springframework.org/schema/beans");
    ...
    List<Element> beanElements = root.getChildren("bean", BEANS_NS);
    for (Element beanElement : beanElements) {
      String id = beanElement.getAttributeValue("id");
      if (beanNames.contains(id)) {
        documentBean(beanElement);
      }
    }

The documentBean() method goes through each of the properties in the bean and attempts to document them as well. If a property has a ref attribute, it grabs the bean definition for that ref and calls documentBean() on the referenced element recursively. I do not show the code for my documentBean() method since it is very implementation specific (I write to a wiki format; others may write to DocBook XML or HTML, etc).
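That said, the recursive shape of the method is fairly generic, so here is a rough sketch of it with the output writing left abstract. This is not my production code; the helper methods writeLine() and findBeanElement() are made up for illustration.

  // Sketch of the recursion: document a bean element, then follow any
  // property refs and document the referenced beans too.
  private void documentBean(Element beanElement) {
    writeLine("Bean: " + beanElement.getAttributeValue("id") +
      " (" + beanElement.getAttributeValue("class") + ")");
    List<Element> propertyElements = beanElement.getChildren("property", BEANS_NS);
    for (Element propertyElement : propertyElements) {
      String propertyName = propertyElement.getAttributeValue("name");
      String ref = propertyElement.getAttributeValue("ref");
      if (ref != null) {
        writeLine("  property: " + propertyName + " -> " + ref);
        // look up the referenced bean definition and recurse into it
        Element refBeanElement = findBeanElement(ref);
        if (refBeanElement != null) {
          documentBean(refBeanElement);
        }
      } else {
        writeLine("  property: " + propertyName + " = " +
          propertyElement.getAttributeValue("value"));
      }
    }
  }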

The code above exposes the structure, but does not yet describe what each bean does. For that, I rely on the class-level Javadocs for each class. To extract them, I use the QDox library, with the following code. QDox parses the Java source files, so you will need the sources available under a defined SOURCE_PATH.

  private static final String SOURCE_PATH = "src/main/java/";
  ...
  private String getClassLevelJavadoc(String className) throws Exception {
    File javaFile = new File(SOURCE_PATH + StringUtils.replace(className, ".", "/") + ".java");
    JavaDocBuilder builder = new JavaDocBuilder();
    JavaSource source = builder.addSource(javaFile);
    JavaClass[] javaClasses = source.getClasses();
    JavaClass mainClass = null;
    for (JavaClass javaClass : javaClasses) {
      if (javaClass.getFullyQualifiedName().equals(className)) {
        mainClass = javaClass;
        break;
      }
    }
    String comment = mainClass.getComment();
    // post-process the comment (implementation specific)
    // ...
    return comment;
  }

You may need to post-process the comments returned if you are writing to a wiki. For example, my post-processing code replaces a single newline with a space, but two newlines with a single one, and escapes WikiWords.
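As an illustration, the post-processing could look something like the sketch below. This is a simplified version, not my exact code, and the WikiWord escaping convention shown (prefixing with an exclamation mark) is specific to some wiki engines.

  // Simplified sketch of wiki-oriented post-processing of a Javadoc comment.
  private String postProcess(String comment) {
    // protect paragraph breaks, collapse single newlines to spaces,
    // then restore the paragraph breaks as single newlines
    String processed = comment.replaceAll("\n\n", "<PARA>");
    processed = processed.replaceAll("\n", " ");
    processed = processed.replaceAll("<PARA>", "\n");
    // escape WikiWords (CamelCase words) so the wiki does not auto-link them
    processed = processed.replaceAll("\\b([A-Z][a-z]+[A-Z]\\w*)", "!$1");
    return processed;
  }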

QDox is not limited to pulling out only the class-level Javadocs. It parses the source file into a bean that allows you to access both class and method level tags by name, as well as a host of other things. However, in my case, my documentation needs are satisfied by a well-written class level Javadoc comment, so that's what I used.
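For completeness, accessing method-level comments and tags looks roughly like the fragment below, continuing from the mainClass variable in the code above. This is a hedged sketch based on my reading of the QDox model classes, and the mycompany.hook tag name is made up; check the QDox Javadocs for the exact method names in the version you use.

  // Sketch: walk the methods of the class located earlier and pull out
  // their comments and a custom doclet tag, if present
  for (JavaMethod method : mainClass.getMethods()) {
    DocletTag customTag = method.getTagByName("mycompany.hook");
    if (customTag != null) {
      System.out.println(method.getName() + ": " + customTag.getValue());
    } else if (method.getComment() != null) {
      System.out.println(method.getName() + ": " + method.getComment());
    }
  }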

A better known approach in the Java world to do this sort of thing is to use XDoclet. In SQLUnit, one of my open source projects, I used it to parse out custom @sqlunit.xxx class and method level Javadoc tags and convert them to DocBook XML snippets, which I then imported into the source for my User Guide. While QDox solves a similar problem, it is simpler to use in my opinion, since you don't have to write XML converters.

Saturday, July 12, 2008

According to Google...

Here is a somewhat silly but useful little script I wrote. We often have to do a discovery crawl of client sites, using seed URLs they give us. One such time, we ended up with abnormally few records in the index, so naturally the question arose as to whether something was wrong with the crawl.

During the debugging process, one of our crawl engineers sent out an email with numbers pulled from Google's site search. So if you wanted to know how many pages were indexed by Google for a site (say foo.bar.com), you would enter the query "site:foo.bar.com" in the search box, and the number you are looking for would be available on the right hand side of the blue title bar of the results.

Results 1 - 10 of about 1001 from foo.bar.com (0.15 seconds)
                        ^^^^

His email started with the phrase, "According to Google...", which is the inspiration for the name of the script and this blog post. I thought of writing this script with the idea that we could tack it on at the end of the crawl as a quick check to verify that our crawler crawled the "correct" number of pages. Obviously the number returned by Google is an approximation, since the number of pages could have changed between their crawl and ours, so we only want to verify that we are within a reasonable margin, say 10%, of the Google numbers.

The script can be run from the command line with a list of seed URLs as arguments. Here is an example of calling it and the resulting output.

sujit@sirocco:~$ ./acc2google.py \
  sujitpal.blogspot.com \
  www.geocities.com/sujitpal \
  foo.bar.baz
According to Google...
  #-pages for sujitpal.blogspot.com : 323
  #-pages for www.geocities.com/sujitpal : 71
  #-pages for foo.bar.baz : 0
  ----------------------------------------
  Total pages: 394

The last URL is bogus, but the script will correctly report 0 pages for it. The script can be modified to extract the seed URLs from your configuration quite easily, but the location of the information would be crawler specific. Here is the script.

#!/usr/bin/python
# acc2google.py
# Reads a set of seed lists from a list of sites in the command line params
# and returns the approximate number of pages that are indexed for each site
# by Google
#
import httplib
import locale
import re
import string
import sys
import urllib

def main():
  numargs = len(sys.argv)
  if (numargs < 2):
    print " ".join([sys.argv[0], "site ..."])
    sys.exit(-1)
  print "According to Google..."
  totalIndexed = 0
  for i in range(1, numargs):
    site = sys.argv[i]
    query = "site:" + urllib.quote_plus(site) 
    try:
      conn = httplib.HTTPConnection("www.google.com")
      conn.request("GET", "".join(["/search?hl=en&q=", query, "&btnG=Google+Search"]))
      response = conn.getresponse()
    except:
      continue
    m = re.search("Results <b>\\d+</b> - <b>\\d+</b> " +
      "of about <b>([0-9][0-9,]*)</b>", response.read())
    if (not m):
      numIndexed = "0"
    else:
      numIndexed = m.group(1)
    print "  " + " ".join(["#-pages for", site, ":", numIndexed])
    totalIndexed = totalIndexed + int(string.replace(numIndexed, ",", ""))
  print "  ----------------------------------------"
  locale.setlocale(locale.LC_NUMERIC, '')
  print "  " + " ".join(["Total pages:", locale.format("%.*f", (0, totalIndexed), True)]) 
               
if __name__ == "__main__":
  main()

As you can see, there is not much to the script, but it can be very useful as an early warning system. There are also situations where you want to do a custom discovery crawl with a large number of seed URLs for a handpicked group of public sites, and it's useful to know how many records to expect in the index. In that case, it's easier to run this script once than to do site searches for each of the individual seed URLs.

Wednesday, July 09, 2008

Yahoo WebSearch API Javascript client using Dojo

In my last post, I described a Javascript client to display results from Google's JSON search service. In that post, I used a PHP proxy to get around Javascript's Same Origin Policy. A cleaner remoting approach, called JSONP or Padded JSON, proposed by Bob Ippolito and supported by most JSON web services, relies on the server being able to emit a JSON response wrapped in a client-specified callback function.

To request padded JSON, the client populates an optional query parameter with the name of a Javascript callback function, and implements that callback function. The implementation typically parses the JSON response and builds HTML that is assigned to the innerHTML of a div tag on the page displayed in the browser.

So when the query is sent to the server, the JSON response is wrapped inside the specified callback function name. For example, a query to the Yahoo WebSearchService API would look something like this:

http://search.yahooapis.com/WebSearchService/V1/webSearch?query=foo&\
  callback=handleResponse&\
  appid=get-your-own-yahoo-id-and-stick-it-in-here

And the server will return a JSON response wrapped within the callback, which is executed as a Javascript function call.

  handleResponse(json_response_string);

So if we define a function named handleResponse on the page, it will be invoked with the JSON response as its argument when the returned script executes.

I think this approach is quite beautiful (in the Beautiful Code sense) - not only does it exploit the macro expansion feature in interpreted languages in a clever yet intuitive way, it enables true serverless operation by getting around Javascript's Same Origin Policy.

Setting up the client to do dynamic calls is a bit of a pain with this approach, though. Since we don't know the search term until it is entered, using plain Javascript involves manipulating the DOM tree to insert the call into an html/head/script element. However, there are a lot of Javascript frameworks around which make light work of this. One such framework is Dojo, which comes with both JSON and UI components.

In this post, I describe a client that I built using Dojo to run against Yahoo's WebSearch API to display search results from my blog. Dojo has a fairly steep learning curve, but it is very well-documented, and the resulting code is very easy to read and maintain. Here is the code (really an HTML page containing Javascript code):

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>My Blog Search Widget</title>
    <meta http-equiv="content-type" content="text/html; charset=UTF-8" />
    <style type="text/css">
      @import "http://o.aolcdn.com/dojo/1.0.0/dojo/resources/dojo.css";
      @import "http://o.aolcdn.com/dojo/1.0.0/dijit/themes/tundra/tundra.css";
    </style>
    <script type="text/javascript" 
      src="http://o.aolcdn.com/dojo/1.0.0/dojo/dojo.xd.js" 
      djConfig="parseOnLoad: true"></script>
    <script type="text/javascript">
      dojo.require("dijit.form.Button");
      dojo.require("dojo.io.script");
    </script>
    <script type="text/javascript">
function handleResponse(data, ioArgs) {
  var html = '<b>Results ' +
    data.ResultSet.firstResultPosition + 
    '-' +
    data.ResultSet.totalResultsReturned +
    ' for term ' +
    dojo.byId('q').value + 
    ' of about ' +
    data.ResultSet.totalResultsAvailable +
    '</b><br/><br/>';
  dojo.forEach(data.ResultSet.Result, function(result) {
    html += '<b><a href=\"' + 
      result.Url + 
      '">' +
      result.Title + 
      '</a></b><br/>' +
      result.Summary + 
      '<br/><b>' +
      result.DisplayUrl +
      '</b><br/><br/>';
  }); 
  dojo.byId("results").innerHTML = html;
}
    </script>
  </head>
  <body class="tundra">
    <p>
    <b>Enter your query:</b>
    <input type="text" id="q" name="q"/>
    <button dojoType="dijit.form.Button" id="searchButton">Search!
      <script type="dojo/method" event="onClick">
        dojo.io.script.get({
          url: 'http://search.yahooapis.com/WebSearchService/V1/webSearch',
          content: {
            appid: 'get-your-own-appid-and-stick-it-in-here',
            query: dojo.byId('q').value,
            site: 'sujitpal.blogspot.com',
            output: 'json'
          },
          // name of the query parameter the service expects for the JSONP
          // callback; Dojo generates the callback and routes the response
          // to the load handler below
          callbackParamName: 'callback',
          load: handleResponse
        });
      </script>
    </button>
    </p>
    <hr/>
    <div id="results"></div>
  </body>
</html>

And here is the obligatory screenshot: