Saturday, September 23, 2006

Moving from Ant to Maven2

My build system at work is Ant. I haven't used Maven2 in a real-world (commercial) environment before, but I have used it enough in my personal projects to convince me that this would be a good move, one that would eliminate quite a bit of the build maintenance drudgery we currently take for granted.

The main benefits of Maven2 as a build system are a standardized directory structure and standardized targets (goals, in Maven parlance), no more JAR files in CVS, automatic generation of project descriptor files for various IDEs (Eclipse, IDEA) from the POM, support for building a project documentation site, the APT wiki-style documentation language which is automatically translated to HTML for the project website, built-in support for generating various kinds of project reports, and many others. Disadvantages include limited support for build targets that deviate from the norm, limited support for multi-environment deployment, and a steep learning curve (which I am not totally over yet).

Anyway, the first application to be built with Maven2 would be a small webapp that depends, for most of its functionality, on Java code written in another web application, which is built with Ant. Both systems are expected to have pretty rapid code churn, so it is important that the new webapp be able to pick up the latest JAR files from the older application as they are built.

Fortunately, the older webapp is being continuously integrated using Anthill. So it should be possible to get the latest JAR file from Anthill, since Anthill is configured to rebuild every half an hour if it detects changes in the version control system. This article describes the infrastructure I had to set up to get this working.

Creating a local repository

Since the JAR file from the older webapp is an internal JAR file, we cannot upload it to the ibiblio or codehaus repositories, so we need our own repository. For this I chose maven-proxy, a simple web application which exposes a file-based repository over HTTP and forwards any requests it cannot serve from its local repository to upstream servers such as ibiblio or codehaus.

I put maven-proxy on the same host, and in the same Tomcat instance on which Anthill was running. The local repository was set up to be a sub-directory maven-repository under the tomcat user's home directory.

The maven-proxy distribution comes as a WAR (Web ARchive) file, which one needs to drop into the webapps directory. However, when it is first exploded, the web.xml is incomplete and the maven-proxy.properties file does not exist, so it will throw errors and refuse to function. You need to supply the maven-proxy.properties file (see the Configuration page for examples) and point the "maven-proxy.properties" context parameter at the location of your properties file, as shown below:

  <context-param>
    <param-name>maven-proxy.properties</param-name>
    <param-value>/opt/tomcat/webapps/maven-proxy/WEB-INF/classes/maven-proxy.properties</param-value>
    <description>Controls where maven-proxy grabs its properties from</description>
  </context-param>
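
For the record, the maven-proxy.properties file itself is only a few lines long. The sketch below shows its rough shape; I am recalling the key names from the Configuration page, so verify them against the maven-proxy documentation rather than copying this verbatim:

  # Directory where maven-proxy stores and serves artifacts (our local repository)
  repo.local.store=/home/tomcat/maven-repository

  # Upstream repositories to consult when an artifact is not found locally
  repo.list=ibiblio
  repo.ibiblio.url=http://www.ibiblio.org/maven2/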

Once this is done, you can access the repository with a URL like this:

http://build.mycompany.com:8080/maven-proxy

Under maven-repository are two subdirectories, internal and external. The internal subdirectory is for JAR files generated by modules written within the company, and the external subdirectory is for JAR files written by third parties that may not be available at ibiblio for whatever reason. I had thought of adding a third subdirectory, central, which would serve as a cache for the ibiblio and codehaus artifacts, but that did not work out: I would have had to declare this repository as a mirror in settings.xml, which is not kept in the project CVS, so it would not be suitable for team work. So JAR files such as commons-lang would still be pulled directly from ibiblio and its mirrors.
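
For reference, making the proxy a full mirror of central would have required something like the following in each developer's ~/.m2/settings.xml. This is only a sketch of the approach I decided against; the central subdirectory and its URL here are hypothetical:

  <settings>
    <mirrors>
      <mirror>
        <id>central-proxy</id>
        <name>Local maven-proxy mirror of central</name>
        <url>http://build.mycompany.com:8080/maven-proxy/repository/central</url>
        <mirrorOf>central</mirrorOf>
      </mirror>
    </mirrors>
  </settings>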

Getting Anthill to write to the repository

The second part was to add functionality to the build.xml file so that it copies the JAR file it builds into the local repository directory. Recall that I set up the repository on the build machine itself, owned by the same tomcat user that Anthill runs under, so there are no permission issues.

To do this, I added an antcall to the target that generates the JAR file:

  <target name="jar" depends="compile,build.properties">
    ...
    <antcall target="copy-jar-file-to-maven-repository" />
  </target>

The called target first checks whether the build is being run by Anthill, in which case a deployDir property is set to tell Anthill where to write the JAR after it is done building. So we test for the property, and if we are running on Anthill, we also copy the JAR to the maven-repository/internal directory.

  <!-- Check if we are running on AntHill -->
  <target name="check-if-anthill">
    <condition property="is.anthill" value="true">
      <not>
        <equals arg1="${deployDir}" arg2=""/>
      </not>
    </condition>
  </target>

  <!-- If we are running on AntHill, then execute this target to copy jar
       to the internal maven repository -->
  <target name="copy-jar-file-to-maven-repository" depends="check-if-anthill" if="is.anthill">
    <property name="mavenRepo" value="${tomcatHome}/maven-repository/internal" />
    <property name="jarLocation" value="${mavenRepo}/com/mycompany/oldapp/SNAPSHOT" />
    <mkdir dir="${jarLocation}" />
    <copy file="${build}/${jar.name}" tofile="${jarLocation}/oldapp-SNAPSHOT.jar" />
    <exec dir="${jarLocation}" executable="/usr/bin/md5sum" os="Linux" output="${jarLocation}/oldapp-SNAPSHOT.jar.md5">
      <arg value="oldapp-SNAPSHOT.jar" />
    </exec>
    <copy file="${basedir}/bin/pom-template.xml" tofile="${jarLocation}/oldapp-SNAPSHOT.pom" />
    <exec dir="${jarLocation}" executable="/usr/bin/md5sum" os="Linux" output="${jarLocation}/oldapp-SNAPSHOT.pom.md5">
      <arg value="oldapp-SNAPSHOT.pom" />
    </exec>
  </target>

Setting up other JARs into the repository

Now came the one-time (or at least relatively infrequent) task of installing JAR files that do not live in open repositories like ibiblio, such as activation.jar (which Sun does not permit to be hosted there), as well as commercial third-party JARs from our vendors. I used a very minimal POM template file and a shell script to do this relatively painlessly. They are shown below.

<!-- A minimal POM template -->
<project>
  <modelVersion>4.0.0</modelVersion>
  <groupId>_groupId_</groupId>
  <artifactId>_artifactId_</artifactId>
  <packaging>jar</packaging>
  <version>_version_</version>
  <dependencies />
</project>
#!/bin/bash
EXTERNAL=/home/tomcat/maven-repository/external
JARHOME=/opt/tomcat/webapps/oldapp/WEB-INF/lib

# $1 = name of jar file (with jar ext)
# $2 = groupId
# $3 = artifactId
# $4 = version
function buildjar {
  todir=`echo $2 | sed -e 's/\./\//g'`
  echo "Building jar $1 to repo/$todir/$3/$3-$4.jar"
  cp $JARHOME/$1 $3-$4.jar
  md5sum $3-$4.jar > $3-$4.jar.md5
  cat pom_template.xml | sed s/_groupId_/$2/ | sed s/_artifactId_/$3/ | sed s/_version_/$4/ > $3-$4.pom
  md5sum $3-$4.pom > $3-$4.pom.md5
  mkdir -p $EXTERNAL/$todir/$3/$4
  cp $3-$4.* $EXTERNAL/$todir/$3/$4
  rm $3-$4.*
}

buildjar "activation.jar" "activation" "activation" "1.1"
...

Looking up the local repository from the POM

Finally, the moment of truth: getting all the JARs from our local repository. Actually, it's not that momentous; I had been trying out the POM after every little change, so I was fairly confident that it would all work this time. Basically, I added the following two repositories to my pom.xml file:

  <!-- Repository points to local proxy -->
  <repositories>
    <repository>
      <id>external</id>
      <name>Repository for non-company JARs not in ibiblio</name>
      <url>http://build.mycompany.com:8081/maven-proxy/repository/external</url>
    </repository>
    <repository>
      <id>internal</id>
      <name>Company JAR repository</name>
      <url>http://build.mycompany.com:8081/maven-proxy/repository/internal</url>
      <layout>default</layout>
    </repository>
  </repositories>

and ran the command:

$ mvn clean compile test

which went through fine. The only other thing I wanted to do was to make my repository a copy of ibiblio, so that each time I get a new JAR I can write it to my local repository and never have to go looking at ibiblio again, unless it is specified as a SNAPSHOT or I change the version. But that requires me to declare this repository as a mirror, and that declaration can only happen in the settings.xml file, which lives in each developer's personal ~/.m2 directory. We may do that yet, but so far this setup seems to be holding up well.
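
For completeness, the new webapp's dependency on the Anthill-built JAR looks roughly like the sketch below. The coordinates simply mirror the path the Ant target above writes to (com/mycompany/oldapp/SNAPSHOT); note that the bare SNAPSHOT version is unusual, since Maven2 conventionally uses versions like 1.0-SNAPSHOT.

  <dependencies>
    <dependency>
      <groupId>com.mycompany</groupId>
      <artifactId>oldapp</artifactId>
      <version>SNAPSHOT</version>
    </dependency>
  </dependencies>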

When people think of moving to Maven2, they are often overwhelmed by the sheer learning curve and complexity of the product, and by the amount of work it would take to convert their existing Ant builds to Maven2. Confronted with a problem such as mine, the first thought is usually to convert the existing code to Maven2 and then make it part of a multi-module build. A quicker alternative is to move ahead with the approach described above.

Saturday, September 16, 2006

Search and Replace with UrlRewriteFilter

In a previous post, I wrote about a generic JUnit test for testing rewrite rules for Paul Tuckey's UrlRewriteFilter. In the course of using the filter, I found that while it is very easy to take a URL, tear it up into pieces, and rearrange those pieces into a new URL, it is surprisingly hard to do a simple search and replace on one or more of those pieces. This article discusses the approach I took to do this.

Imagine that we have a website of science fiction book reviews, where users can browse our reviews by specifying the author name and book name in the URL. We also assume that we are running our website in a J2EE servlet environment where we can use the UrlRewriteFilter. Users looking for Isaac Asimov's Foundation trilogy would probably look at the following URLs:

/asimov/foundation.html
/asimov/foundation_and_empire.html
/asimov/second_foundation.html

Now imagine that you want to change the above URLs to be hyphenated instead of underscore-separated, because apparently the Googlebot likes hyphens better than underscores. Matt Cutts has a blog article here that explains why. So our new URLs would look like this:

/asimov/foundation.html
/asimov/foundation-and-empire.html
/asimov/second-foundation.html

Sounds simple, right? Just do a s/_/-/g on the incoming URL. However, it is not as simple as it sounds. UrlRewriteFilter relies on regular expressions to split the incoming URL into pieces, and on backreferences to rearrange those pieces into the rewritten URL. But because UrlRewriteFilter has no support for inline replacement within backreferences (although it would seem like a useful thing to have), there is nothing you can do with the backreferences once you have matched and split the incoming URL. So our initial rule:

  <rule>
     <from>/(\w+)/(\w+(_\w+)*)\.html</from>
     <to>/$1/$2.html</to>
  </rule> 

matches the incoming URLs, but the rewrite has no visible effect. What we really want is to take the second backreference ($2) and pass it through our s/_/-/g substitution. We can achieve the same effect by passing it through an external class. The methods of the external class look a lot like those of a servlet; I call these FLets since they run within the context of a Filter. Here is the source for RegexReplaceFLet:

package org.urlrewrite;

import javax.servlet.ServletConfig;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class RegexReplaceFLet {

    private String searchFor;
    private String replaceWith;
    
    public void init(ServletConfig config) {
        this.searchFor = config.getInitParameter("searchFor");
        this.replaceWith = config.getInitParameter("replaceWith");
    }
    
    public void run(HttpServletRequest request, HttpServletResponse response) {
        String source = (String) request.getAttribute("source");
        String target = source.replaceAll(searchFor, replaceWith);
        request.setAttribute("target", target);
    }
}

Our rule initializes the FLet by calling its init() method, passing in the configured init-params. The run() method is then invoked; it reads the "source" request attribute, which contains our $2 backreference, performs the regex replacement on it, and populates the "target" request attribute.

The next step is getting at the request attribute. This is not documented in the manual, but a quick look at the UrlRewriteFilter sources told me that the request attribute "foo" can be accessed from within a rule using %{attribute:foo}. So our new rule looks like this:

  <rule>
    <from>/(\w+)/(\w+(_\w+)*)\.html</from>
    <set name="source">$2</set>
    <run class="org.urlrewrite.RegexReplaceFLet" neweachtime="true">
      <init-param>
        <param-name>searchFor</param-name>
        <param-value>_</param-value>
      </init-param>
      <init-param>
        <param-name>replaceWith</param-name>
        <param-value>-</param-value>
      </init-param>
    </run>
  </rule>
  <rule>
    <from>/(\w+)/(\w+(_\w+)*)\.html</from>
    <to>/$1/%{attribute:target}.html</to>
  </rule> 

Notice that there are two rules. The first rule has no "to" element. The second rule simply matches on the same "from" value and outputs the result. Why do we have two rules? Because the "to" element in the first rule does not see the attribute value of target that was set by RegexReplaceFLet. This could be a design quirk of rules engines in general, which usually precalculate some things for performance. However, spreading the rewrite logic between two rules solves the problem.

I hope this article has helped. This kind of problem appears to be quite rare, since I could not find any discussion of it on the Internet. Presumably UrlRewriteFilter is more often used to rearrange parts of a URL than to replace parts of them, but this technique makes an already powerful and versatile filter even more so. The idea presented here can also be extended to do much fancier rewriting, such as a database lookup to map a name embedded in the URL to an id, which is then passed to the rewritten URL.
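
As a sketch of what such an extension might look like, here is a hypothetical FLet in the same mould as RegexReplaceFLet that looks up an id for a name via JDBC. The DataSource JNDI name, table and column names are all made up for illustration:

package org.urlrewrite;

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

import javax.naming.InitialContext;
import javax.servlet.ServletConfig;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.sql.DataSource;

public class IdLookupFLet {

    private DataSource dataSource;

    public void init(ServletConfig config) {
        try {
            // JNDI name is made up for illustration; it could also come from an init-param
            InitialContext ctx = new InitialContext();
            this.dataSource = (DataSource) ctx.lookup("java:comp/env/jdbc/reviewsDb");
        } catch (Exception e) {
            throw new RuntimeException("Could not look up DataSource", e);
        }
    }

    public void run(HttpServletRequest request, HttpServletResponse response) {
        // "source" is set from a backreference in the rule, just as with RegexReplaceFLet
        String name = (String) request.getAttribute("source");
        Connection conn = null;
        try {
            conn = dataSource.getConnection();
            PreparedStatement ps = conn.prepareStatement(
                "select book_id from books where book_name = ?");
            ps.setString(1, name);
            ResultSet rs = ps.executeQuery();
            if (rs.next()) {
                request.setAttribute("target", String.valueOf(rs.getLong(1)));
            }
            rs.close();
            ps.close();
        } catch (Exception e) {
            throw new RuntimeException("Lookup failed for: " + name, e);
        } finally {
            if (conn != null) {
                try { conn.close(); } catch (Exception e) { /* ignore */ }
            }
        }
    }
}

The rule wiring is the same as before: set the "source" attribute from a backreference, run the FLet, then reference %{attribute:target} in a second rule.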

Saturday, September 09, 2006

WebApp remote control with JMX and Spring

Many web applications have an admin or tools component that allows the application administrator to get information about the application internals and to update configuration properties, changing application behavior without having to take the application down or modify code. Recently I needed to write such a tool.

The tool would provide a form allowing a super-user/admin type to update the properties of a Configuration bean. This bean controls the behavior of a front-end controller, so changing the Configuration bean's properties automatically changes the behavior of the controller. I had originally envisioned the tool as a standard Spring FormController with a set of JSP pages, but the more I thought about it, the more it seemed like JMX (Java Management Extensions) would be the appropriate choice here.

I had never actually used JMX before this, but I had read about it in the JMX book by Lindfors, Fleury and the JBoss Group, and I remembered that the two JMX implementations covered in the book came with HTTP-based JMX consoles. My plan was to hook up the JMX console to my application so it could be used to update the Configuration bean's properties.

To check the feasibility of this approach, I decided to do a small proof of concept: a rather lame calculator web application. It has a single page which is accessible via an HTTP GET request. The GET request provides two integer parameters, which the calculator combines using the configured operator, displaying the result on the web page. The actual operator that the calculator uses (add, subtract, multiply or divide) is provided by the CalculatorConfig bean, a managed bean that should be visible to the JMX console. The web application is a Spring application, and Spring provides very good integration with JMX (as it does with a variety of other J2EE technologies). For the JMX implementation, I chose Sun's JMX-RI reference implementation, mainly because its HtmlAdaptor (aka the HTML Console) is all Java (no JSP) and comes pre-packaged in the jmxtools.jar file, so it's easy to embed.

The calculator webapp contains all the usual suspects. The incoming request is handled by the CalculatorController (an implementation of the Spring Controller interface), which calls the CalculatorService with the operands passed in as request parameters. The CalculatorService is injected with the CalculatorConfig, which implements the standard MBean interface CalculatorConfigMBean. The MBean suffix is needed if you are using standard MBeans; it is the only way the MBean server figures out whether a bean is manageable. I could have dispensed with the interface and made CalculatorConfig a managed bean by renaming it with an MBean suffix, but I figured that in a real application I may not have the luxury of changing class names. The Spring configuration for the webapp side is shown below:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE beans PUBLIC "-//SPRING//DTD BEAN//EN" "http://www.springframework.org/dtd/spring-beans.dtd" >
<beans>

  <bean id="calculatorConfig" class="my.lame.calculator.beans.CalculatorConfig">
    <property name="operation" value="+" />
  </bean>

  <bean id="calculatorService" class="my.lame.calculator.services.CalculatorService">
    <property name="calculatorConfig" ref="calculatorConfig" />
  </bean>

  <bean id="calculatorController" class="my.lame.calculator.controllers.CalculatorController">
    <property name="calculatorService" ref="calculatorService" />
  </bean>

  <bean id="simpleUrlHandlerMapping" class="org.springframework.web.servlet.handler.SimpleUrlHandlerMapping">
    <property name="mappings">
      <props>
        <prop key="calculator.do">calculatorController</prop>
      </props>
    </property>
  </bean>

  <bean id="viewResolver" class="org.springframework.web.servlet.view.InternalResourceViewResolver">
    <property name="prefix" value="/" />
    <property name="suffix" value=".jsp" />
    <property name="viewClass" value="org.springframework.web.servlet.view.JstlView" />
  </bean>
  ...
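
For context, the managed bean pair itself can be as simple as the minimal sketch below (two source files, shown together in one listing for brevity). This is just the shape implied by the MBean-suffix convention and the operation property above, not the actual source:

package my.lame.calculator.beans;

// CalculatorConfigMBean.java: the standard MBean interface. The MBean server
// treats CalculatorConfig as manageable because this interface name carries
// the MBean suffix.
public interface CalculatorConfigMBean {
    String getOperation();
    void setOperation(String operation);
}

// CalculatorConfig.java: the managed bean, injected with its initial
// operation ("+") from the Spring configuration above.
public class CalculatorConfig implements CalculatorConfigMBean {

    private String operation;

    public String getOperation() {
        return operation;
    }

    public void setOperation(String operation) {
        this.operation = operation;
    }
}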

The JMX side involves setting up an MBeanServer, registering our CalculatorConfig bean (which is an MBean since it implements a standard MBean interface) and the HtmlAdaptor from JMX-RI (which is also an MBean), and starting the HtmlAdaptor. This creates a JMX console listening on port 8082 (which you can change by setting the port property on the HtmlAdaptor bean). Both the HtmlAdaptor and the CalculatorConfig MBeans are declared as regular beans in the Spring applicationContext.xml file. The MBeanServerFactoryBean creates the embedded MBeanServer, and the two MBeans are registered with the server using the MBeanExporter bean. The embedded MBeanServer approach is optional: since most application servers now have JMX consoles built in, we could register our MBeans with the application server directly as well. However, embedding the MBeanServer within our application is more generic and guaranteed to work with any application server. Here is the JMX side of the Spring configuration:

  ...
  <bean id="mbeanServer" class="org.springframework.jmx.support.MBeanServerFactoryBean" />

  <bean id="htmlAdaptor" class="com.sun.jdmk.comm.HtmlAdaptorServer" init-method="start" />

  <bean id="exporter" class="org.springframework.jmx.export.MBeanExporter">
    <property name="beans">
     <map>
       <entry key="adaptor:name=htmlAdaptor" value-ref="htmlAdaptor" />
       <entry key="bean:name=calculatorConfigBean" value-ref="calculatorConfig" />
     </map>
    </property>
    <property name="server" ref="mbeanServer" />
  </bean>

</beans>

The MBeanServer setup was a little complicated. I got as far as setting up the MBeanServer and the MBeanExporter with the Spring reference documentation, but I had to look at the BaseAgent.java code in the JMX Tutorial (provided with the JMX-RI source download) to figure out how to register the HtmlAdaptor MBean supplied with JMX-RI with the MBeanServer and call start() on it via the bean's init-method attribute. Another thing to note is that we are currently running the MBeanServer and our HtmlAdaptor client within the same JVM as the managed web application, but it is possible to have the MBeanServer expose a serviceUrl which can be accessed by an external JMX client, as explained in this Java.net article by Lu Jian.

The figures below show the CalculatorConfig MBean in the JMX console, and the result of the HTTP GET request on the actual webapp. The operator is set to "+", and the result of the HTTP GET request is: "Result of 1 + 2 = 3"

Now I change the operator to "*" in the JMX Console and click Apply. The same HTTP GET request now produces a different result: "Result of 1 * 2 = 2":

The code for the CalculatorController, CalculatorService and CalculatorConfig beans is trivial, and I am not going to include it here. Basically, the CalculatorController parses the operands out of the HTTP GET request and passes them to the CalculatorService, which pulls the operator from CalculatorConfig, applies it to the operands and returns a result, which the controller then forwards to a JSP view.

I thought it was pretty cool that all we had to do was add two additional JMX JAR files (jmxri.jar and jmxtools.jar) to our classpath, add a few more lines of configuration to Spring's application context, and make the bean we wanted to manage implement an interface that is marked as manageable (by adding the MBean suffix to its interface name). And the end result was that we could control the configuration of our application. Granted, the application does not do much, so you could argue that it was probably more work to read up on and experiment with JMX than to slap on a single JSP page to update the configuration, but now that I know how to use it, there will be no learning curve in the future, and it's definitely less work to add bean references to the MBeanExporter in the Spring configuration than it is to build individual JSP pages.

Another complaint may be that the JMX interface cannot be customized, to which I would argue that this is a backend GUI which exposes manageable beans at a fairly low level, and eye candy is unlikely to be mandatory for the user working at this level. What this user wants is to get the job done quickly and efficiently, and the JMX console GUI is more than adequate for that purpose. However, if corporate needs dictate, it is still cheaper and more efficient to build your own corporate HtmlAdaptor MBean, which you can drop into any web application on demand, than to build custom admin pages for each application.

Saturday, September 02, 2006

ETL Case Study using Kettle

ETL (Extract, Transform, Load) has traditionally been the domain of data warehousing practitioners, but it can be applied to any process or set of processes that load data into databases. Data is the lifeblood of any organization. However, data by itself is not too interesting - what is interesting is the information that the data can be processed into. Many enterprise systems dedicate a significant chunk of their functionality and resources to developing programs and scripts that transform and migrate data from one form to another, so that downstream modules can present it in a manner more intuitive to their clients.

Writing data transformation routines may be pervasive, but the actual transformation code is generally not very challenging. More challenging is splitting up the transformation into multiple threads and running them in parallel, since ETL jobs usually work with large data sets, and we want the job to complete in a reasonable time. Business application developers generally don't do multithreaded programming too well, mainly because they don't do it often enough. Furthermore, the transformation business logic is inside the application code, which means it cannot be sanity checked by the business person whose needs drove the transformation in the first place.

I heard about Kettle, an open source ETL tool, from a colleague at a previous job, where he was using it to automate data transformations that pushed denormalized versions of data from backend databases out to frontend databases. Unfortunately, I never got a chance to use it at work, but it stayed on my ToDo list as something I wanted to learn. Kettle started as a project by a single developer, but has since been acquired by Pentaho, who sell and support a suite of open source Business Intelligence tools, of which Kettle is one, under a dual open-source/commercial license similar to MySQL's.

Early in my career, I worked for the MIS group of a factory that manufactured switchboards. It occurred to me that one of the processes for generating monthly factory-wide input costs would be a good candidate for converting to Kettle in order to understand its functionality. Part of the factory's input costs for the month was the total dollar amount paid out to workers, governed by each worker's hourly rate and the number of hours worked. The number of hours was derived from the times recorded when the worker signed in and out of the factory. The values are reported by department. The figure below shows the flow.

To replicate the process, I created a flat file for 5 employees in 2 departments (Electrical and Mechanical) which contained in and out times for these employees over a 30-day period. The original databases involved were dBase-IV and Informix, with migration scripts written in Clipper and Informix-4GL; the ones in my case study are PostgreSQL and MySQL. A data flow diagram for the Kettle-based solution is shown below:

The input file dump looks like this:

1000015 I 2006-08-01 08:07:00 1154444820
2000024 I 2006-08-01 08:09:00 1154444940
1000015 O 2006-08-01 16:05:00 1154473500
2000024 O 2006-08-01 16:08:00 1154473680
...

The tables involved in the HRDB PostgreSQL database look like this:

hrdb=> \d employee
              Table "public.employee"
    Column     |          Type          | Modifiers
---------------+------------------------+-----------
 emp_id        | integer                |
 dept_id       | integer                |
 emp_name      | character varying(255) |
 rate_per_hour | numeric(8,2)           |

hrdb=> \d timecard
       Table "public.timecard"
  Column  |     Type      | Modifiers
----------+---------------+-----------
 emp_id   | integer       | not null
 tc_date  | character(10) | not null
 time_in  | integer       |
 time_out | integer       |

And the MySQL table that is populated as a result of the transformations looks like this:

mysql> desc input_cost;
+------------+---------------+------+-----+---------+-------+
| Field      | Type          | Null | Key | Default | Extra |
+------------+---------------+------+-----+---------+-------+
| dept_id    | int(11)       |      |     | 0       |       |
| input_cost | decimal(12,2) | YES  |     | NULL    |       |
+------------+---------------+------+-----+---------+-------+

Kettle comes with four main components - Spoon, Pan, Chef and Kitchen. Spoon is a GUI editor for building data transformations. Pan is a command line tool for running a transformation created with Spoon. Chef is a GUI for building up jobs, which are a set of transformations that should work together, and Kitchen is again a command line tool to run jobs built with Chef.

I balked initially at having to use a GUI to design transformations. I would have preferred a scripting language or some sort of XML configuration, but I guess developers have traditionally not been the target market for ETL tools. The objective of using Kettle is, after all, to avoid programming for data transformations, and to a certain extent scripting is programming. Anyway, using Spoon was pretty straightforward, and I was able to generate three transformations which, applied to my flat file dump in sequence, produce two rows in the CostingDB MySQL table.

Each Spoon transformation produces a .ktr XML file as output. It can also write the transformation metadata to a database repository (the recommended option). The first transformation reads the flat file, choosing rows with the "I" flag set (for incoming timecard entries), and inserts them into the HRDB.timecard table. The second transformation reads the flat file a second time, this time choosing rows with the "O" flag set (for outgoing timecard entries), and updates the time_out column in the timecard table. The reason we have two separate transformations, instead of two streams coming out of a single filter, is that the two streams would run in separate threads and there is no guarantee that an insert would complete before the corresponding update is applied.

The third transformation reads the HRDB.timecard table, calculates the hours worked per employee over the given time period, aggregates the worked hours per employee, applies the employee's per-hour rate from the HRDB.employee table to get the dollar amount to be paid out, then groups and aggregates the dollar values by department, and finally inserts the two resulting rows into the MySQL CostingDB table.
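
Just to make the logic concrete, the third transformation computes roughly what the following query would compute against the HRDB schema shown above. This is illustrative only; the actual work is done by Spoon steps, and the arithmetic assumes time_in and time_out hold epoch seconds, as in the flat file dump:

-- Approximate SQL equivalent of the third transformation
SELECT e.dept_id,
       SUM((t.time_out - t.time_in) / 3600.0 * e.rate_per_hour) AS input_cost
  FROM timecard t
  JOIN employee e ON e.emp_id = t.emp_id
 GROUP BY e.dept_id;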

You can run the transformations individually through Spoon using the "Run" icon. Alternatively, you can run them through the Pan tool. Here is a Pan script that runs the entire transformation:

#!/bin/bash
KETTLE_HOME=/path/to/your/kettle/installation/here
cd $KETTLE_HOME
./pan.sh -file=usr/costing/extract_in_times.ktr 
./pan.sh -file=usr/costing/extract_out_times.ktr
./pan.sh -file=usr/costing/aggregate_worked_hrs.ktr
cd -

Alternatively, you could use the Chef GUI tool to build up this job graphically. Chef offers some other features, such as modules which do FTP, send email and so on. The job is shown graphically below, along with the generated .kjb file.

Finally, you can run more than one job, schedule them, and so on using Kitchen. Frankly, I don't see much reason to use Chef and Kitchen, since you can just write Pan scripts and schedule them via cron, but I guess the Kettle team put them in there for completeness.
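
For example, a single crontab entry pointing at the Pan script above would cover the scheduling; the path and schedule below are placeholders:

# Run the costing transformations at 2am on the 1st of every month (illustrative)
0 2 1 * * /home/kettle/bin/run_costing_transformations.sh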

My conclusion is that Spoon is quite capable and provides very powerful plumbing for designing and running ETL jobs. I still don't like the fact that the only programming interface is a GUI, but I don't have any concrete suggestions for a scripting interface either. For those whose needs are not met by the standard components provided by Spoon, there is the Injector component, which can be backed by user-supplied Java code, so Kettle also provides a hook for extensibility.