Saturday, September 16, 2006

Search and Replace with UrlRewriteFilter

In a previous post, I wrote about a generic JUnit test for testing rewrite rules for Paul Tuckey's UrlRewriteFilter. In the course of using the filter, I found that while it is very easy to take a URL, tear it up into pieces, and rearrange these pieces into a new URL, it is very hard to do a simple search and replace on one or more of these pieces. This article discusses the approach I took to do this.

Imagine that we have a website of science fiction book reviews, where users can browse our reviews by specifying the author name and book name in the URL. We assume also that we are running our website in a J2EE Servlet-based environment where we can use the UrlRewriteFilter. Users looking for Isaac Asimov's Foundation Trilogy would probably look at the following URLs.

/asimov/foundation.html
/asimov/foundation_and_empire.html
/asimov/second_foundation.html

Now imagine that we want to change the above URLs to be hyphenated instead of underscore-separated, because apparently the Googlebot likes hyphens better than underscores. Matt Cutts explains why in a blog article. So our new URLs would look like this:

/asimov/foundation.html
/asimov/foundation-and-empire.html
/asimov/second-foundation.html

Sounds simple, right? Just do a s/_/-/g on the incoming URL. However, it is not as simple as it sounds. UrlRewriteFilter relies on regular expressions to split up the pieces of the incoming URL, and on backreferences to rearrange these pieces into the rewritten URL. However, because UrlRewriteFilter has no support for inline replacement of backreferences (although it would seem like a useful thing to have), there is nothing you can do with the backreferences once you have matched and split the incoming URL. So our initial rule:

  <rule>
     <from>/(\w+)/(\w+(_\w+)*)\.html</from>
     <to>/$1/$2.html</to>
  </rule> 

matches the incoming URLs, but the URL rewrite has no visible effect. What we really want is to take the second backreference ($2) and pass it through our s/_/-/g substitution. We can achieve the same effect by passing it through an external class. The methods of the external class look a lot like those of a Servlet. I call these FLets since they run within the context of a Filter. Here is the source for RegexReplaceFLet:

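package org.urlrewrite;

import javax.servlet.ServletConfig;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

/**
 * Reads the "source" request attribute, applies a regex replacement to it,
 * and stores the result in the "target" request attribute.
 */
public class RegexReplaceFLet {

    private String searchFor;    // regular expression to search for
    private String replaceWith;  // replacement text

    /** Reads the init-params declared inside the rule's run element. */
    public void init(ServletConfig config) {
        this.searchFor = config.getInitParameter("searchFor");
        this.replaceWith = config.getInitParameter("replaceWith");
    }

    /** Invoked when the enclosing rule matches the incoming URL. */
    public void run(HttpServletRequest request, HttpServletResponse response) {
        String source = (String) request.getAttribute("source");
        if (source == null) {
            return; // nothing to rewrite; avoids a NullPointerException
        }
        String target = source.replaceAll(searchFor, replaceWith);
        request.setAttribute("target", target);
    }
}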

Our rule will initialize our FLet by calling its init() method, which reads the init-params. Then the run() method is invoked, which reads the "source" request attribute containing our $2 backreference, does the regexp replace on it, and populates the "target" request attribute.
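To try the substitution without deploying a filter, the matching and replacement can be sketched in plain Java. This is only an approximation of what the rule plus the FLet do together: the class name SlugRewriteSketch and its rewrite() helper are made up for illustration, and the sketch assumes the pattern must match the whole URL, which may differ from how UrlRewriteFilter applies the "from" expression.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch; not part of UrlRewriteFilter.
final class SlugRewriteSketch {

    // Same regular expression as the <from> element of the rule.
    private static final Pattern FROM =
            Pattern.compile("/(\\w+)/(\\w+(_\\w+)*)\\.html");

    // Mimics: set source=$2, run the replace, emit /$1/<target>.html
    static String rewrite(String url, String searchFor, String replaceWith) {
        Matcher m = FROM.matcher(url);
        if (!m.matches()) {
            return url; // the rule does not apply, leave the URL alone
        }
        // Like the FLet, this treats searchFor as a regular expression.
        String target = m.group(2).replaceAll(searchFor, replaceWith);
        return "/" + m.group(1) + "/" + target + ".html";
    }

    public static void main(String[] args) {
        System.out.println(rewrite("/asimov/foundation_and_empire.html", "_", "-"));
        System.out.println(rewrite("/asimov/second_foundation.html", "_", "-"));
    }
}
```

Note that String.replaceAll() interprets its first argument as a regular expression and gives special meaning to "$" and "\" in the second, so init-param values containing those characters would need escaping.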

The next step is getting to the request attribute. This is not documented in the manual, but a quick look at the UrlRewriteFilter sources told me that the request attribute "foo" can be accessed from within the rule using %{attribute:foo}. So our new rule looks like this:

  <rule>
    <from>/(\w+)/(\w+(_\w+)*)\.html</from>
    <set name="source">$2</set>
    <run class="org.urlrewrite.RegexReplaceFLet" neweachtime="true">
      <init-param>
        <param-name>searchFor</param-name>
        <param-value>_</param-value>
      </init-param>
      <init-param>
        <param-name>replaceWith</param-name>
        <param-value>-</param-value>
      </init-param>
    </run>
  </rule>
  <rule>
    <from>/(\w+)/(\w+(_\w+)*)\.html</from>
    <to>/$1/%{attribute:target}.html</to>
  </rule> 

Notice that there are two rules. The first rule has no "to" element. The second rule simply matches on the same "from" value and outputs the result. Why do we have two rules? Because a "to" element placed in the first rule does not see the value of the "target" attribute set by RegexReplaceFLet. This could be a design quirk of rules engines in general, which usually precalculate some things for performance. However, spreading the rewrite logic between two rules solves the problem.

I hope this article has helped. It does appear that this kind of problem is quite rare, however, since I could not find any discussion of it on the Internet. Presumably, UrlRewriteFilter is more often used for rearranging parts of the URL, not replacing parts of them. However, this ability makes the UrlRewriteFilter much more powerful and versatile than it already is. The idea presented in this article can also be extended to do much fancier rewriting, such as database lookups to get the id from a name embedded in the URL and pass the id to the rewritten URL.
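As a rough illustration of the lookup idea, here is a minimal sketch in plain Java. Everything in it is hypothetical: the in-memory map stands in for a real database query, the ids are invented, and the /review?id=... and /not-found targets are placeholders. In an actual FLet, the lookup would happen inside run(), with the id stored in a request attribute for a second rule to pick up.

```java
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of a name-to-id lookup; not part of UrlRewriteFilter.
final class BookIdLookupSketch {

    // Stand-in for a database table keyed by "author/book" (invented ids).
    private static final Map<String, Integer> BOOK_IDS = Map.of(
            "asimov/foundation", 101,
            "asimov/foundation-and-empire", 102,
            "asimov/second-foundation", 103);

    // Returns the id for the given author and book, if one is known.
    static Optional<Integer> lookupId(String author, String book) {
        return Optional.ofNullable(BOOK_IDS.get(author + "/" + book));
    }

    // Builds the rewritten URL, falling back to a placeholder when unknown.
    static String rewrite(String author, String book) {
        return lookupId(author, book)
                .map(id -> "/review?id=" + id)
                .orElse("/not-found");
    }

    public static void main(String[] args) {
        System.out.println(rewrite("asimov", "second-foundation"));
        System.out.println(rewrite("asimov", "i-robot"));
    }
}
```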

4 comments (moderated to prevent spam):

Anonymous said...

brilliant - lifesaver :)

Sujit Pal said...

Thanks! Glad it helped :-).

Anonymous said...

Nowadays it's a lot easier, you could easily do a replace: ${replace:$2:_:-}

See also: http://urlrewritefilter.googlecode.com/svn/trunk/src/doc/manual/4.0/index.html#functions

Sujit Pal said...

Nice, thanks, good to see UrlFilter being actively improved. I don't do any web development at work anymore, but the folks who do prefer Apache rewrites, so we have completely phased the UrlFilter stuff out in favor of that.