Saturday, January 19, 2008

Externalizing Database Access with a ResultSet Iterator

I recently needed to structure some code such that a query against a database table and a sequential read from a (structured) flat file provided the same interface to client code, something like this:

1
2
3
4
public interface ISource {
  ...
  public SomeObject getNextDocument();
}

This API is a very natural fit for the flat file, since the implementation can hold a reference to the current position in the file advance it with each invocation of getNextDocument(). However, when the ISource implementation happens to be a SQL query running against a database table, things get a little hairy.

The classic JDBC data access pattern requires us to do all our processing within the ResultSet processing loop, and/or accumulate a Collection of objects for later use, something like this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
List<SomeObject> myObjects = new ArrayList<SomeObject>();
Connection conn = dataSource.getConnection();
PreparedStatement ps = conn.prepareStatement("select * from mytable");
ResultSet rs = ps.executeQuery();
while (rs.next()) {
  SomeObject myObject = doSomethingWithResultSet(rs);
  myObjects.add(myObject);
}
rs.close();
ps.close();

The basic usage pattern for the Spring JdbcTemplate is the queryForList() method, which simply allows accumulation of results for later use. The result of this call is a List of Maps, where each Map represents a row of database results.

1
2
3
4
5
6
7
8
List<SomeObject> myObjects = new ArrayList<SomeObject>();
List<Map<String,Object>> rows = jdbcTemplate.queryForList("select * from mytable");
for (Map<String,Object> row : rows) {
  SomeObject myObject = new SomeObject();
  myObject.setMyField(row.get("my_field"));
  ...
  myObjects.add(myObject);
}

JdbcTemplate also allows processing the individual rows inline, using a RowMapper anonymous inner class, something like this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
List<SomeObject> myObjects = jdbcTemplate.query(
  "select * from mytable",
  new RowMapper() {
    public Object mapRow(ResultSet rs, int rowNum) throws SQLException {
      SomeObject myObject = new SomeObject();
      myObject.setMyField(rs.getString("my_field"));
      ...
      return myObject;
    }
  });

None of these solutions were what I was after, however. What I needed was a way to tell a method to return me the "next row from the database" and hold a reference internally until I invoke the method again. Iterators came to mind almost instantly, but my first attempt was this rather pathetic one, of holding a private reference to the Iterator of the List object generated from a call to jdbcTemplate.queryForList().

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
public class MyFirstPatheticAttempt implements ISource {
  private String sql; // injected from configuration
  private Iterator it;
  ...
  public open() {
    List<Map<String,Object>> rows = jdbcTemplate.queryForList(sql);
    it = rows.iterator();
  }

  public SomeObject getNextObject() {
    if (it.hasNext()) {
      Map<String,Object> row = it.next();
      SomeObject myObject = new SomeObject();
      myObject.setMyField(row.get("my_field"));
      ...
      return myObject;
    }
    return null;
  }
  ...
}

Sure, it satisfied the ISource interface, but it felt horribly wrong. Looking up the Iterator Pattern on wikipedia led me to the Generator Pattern, which was basically what I was after. Turns out that Java does not have the "yield return" construct that is required to support this pattern, but C# and Python does, so perhaps its just a matter of time before it pops up here. In any case, it wasn't available now, so that did not solve my problem, so I kept looking.

I finally came across Ryan Brase's article "Deciding between iterators and lists for returned values in Java" on TechRepublic which had an example of what I needed to do (Listing 2 in the article). Unlike his example, however, my code would not know the SQL query it would be executing, so it would need to return the entire record in some way, without having to explicitly list them in the code.

Fortunately, the folks at the Apache commons-dbutils project have built a ResultSetIterator which is an Iterator implementation whose constructor takes a reference to a ResultSet, and which I gratefully re-used.

So here is what I finally came up with. Although it uses plain old JDBC, it is fairly clean and compact, and does the job of allowing the database access to be externalized into a method call.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
public class MyFinalAttempt implements ISource {
  private String sql; // injected from configuration
  private Iterator it;
  ...
  public open() {
    Connection conn = ds.getConnection();
    PreparedStatement ps = conn.prepareStatement(sql);
    ResultSet rs = ps.executeQuery();
    it = new ResultSetIterator(rs);
  }

  public SomeObject getNextObject() {
    if (it.hasNext()) {
      Object[] row = it.next();
      SomeObject myObject = new SomeObject();
      myObject.setMyField(row[0]);
      ...
      return myObject;
    }
    return null;
  }
  ...
}

The only thing I did not quite like was that the ResultSetIterator.next() call returned an Object[] row. I would have liked it to return a Map<String,Object> instead, although that shortcoming can be worked around fairly easily in application code. The other thing I wanted to do was to use Spring JdbcTemplate, since that way a lot of resource management issues are taken care of automatically, but I could not figure out how to get a handle on the ResultSet.

If anyone has figured out how to get this pattern going using Spring, I would really like to know. Also, not sure how common such a use-case is, but if you have faced it and solved it, or you think my approach is lame and you have a better one, I would appreciate hearing from you.

Update Jan 26 2008

When I was trying to figure out various approaches to solve this problem, I came across a Java based "yield return" solution on Chaotic Java. It provides an abstract Yielder class which implements the Iterable interface. However, it requires bytecode instrumentation using ASM. The resulting code is quite elegant, something like this, although I could not make it work.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
  public Iterable<SomeObject> getSomeObjects() {
    Connection conn = ds.getConnection();
    PreparedStatement ps = conn.prepareStatement(sql);
    final ResultSet rs = ps.executeQuery();
    return new Yielder<SomeObject>() {
      protected void yieldNextCore() {
        while (rs.next()) {
          SomeObject obj = new SomeObject();
          obj.setName(rs.getString(1));
          ...
          yieldReturn(obj);
        }
        yieldBreak();
      }
    }
  }

  @Test
  public void processObjects() {
    for (SomeObject obj : getSomeObjects()) {
      // do something with obj
    }
  }

To run it, a -javaagent parameter set to the path to the yielder.jar file needs to be passed to the Java command. I was setting this in the configuration/property for the maven-surefire-plugin and calling this through a JUnit test. I was using asm-all-3.3.0 (downloaded from the yielder SVN repository). From my tests, I see that the first call to getSomeObject() results in all the rows being iterated. Obviously I am doing something wrong, would appreciate any pointers from readers if you have gotten this to work.

Saturday, January 12, 2008

Viruses and the Visitor Pattern

On my commute to work earlier this week, I came across a copy of New York Times that another passenger had read and discarded on the seat across from me, and having nothing better to do, started to read it. In the Science section, I came across Natalie Angier's article "Tiny Specs of Misery, both Vile and Useful", where a section caught my eye. Talking about how viruses attack human cells, it says:

All viruses have at their core compact genetic instructions for making more viruses, some of the booklets written in DNA, others in the related nucleic language of RNA. Our cells have the means to read either code, whether they ought to or not. Encasing the terse viral genomes are capsids, protective coats constructed of interlocking protein modules and decorated with some sort of docking device, a pleat of just the right shape to infiltrate a particular cell. Rhinoviruses dock onto receptors projecting from the cells of our nasal passages, while hepatitis viruses are shaped to exploit portholes on liver cells.

What impressed me was the clarity of the analogy, and how closely it resembles the Visitor Pattern. An example of (programming) art imitating life, I guess.

To express this analogy in concrete terms, consider an interface called ICell to model the target human cells, and another called IVirus to model the viruses. The IVirus.canVisit() method is driven by the similarity in shape between the capsids and the docking ports on the target cells. The ICell.accept() method describes transformations to the cell when attacked by an instance of IVirus.

1
2
3
4
5
6
7
8
9
// ICell.java
public interface ICell {
  public void accept(IVirus virus);
}

// IVirus.java
public interface IVirus {
  public boolean canVisit(ICell cell);
}

Assume that NasalPassageCell.java and LiverCell.java are two implementations of the ICell interface, shown below:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
// NasalPassageCell.java
public class NasalPassageCell implements ICell {
  public void accept(IVirus virus) {
    produceCommonCold(virus);
  }
  ...
}

// LiverCell.java
public class LiverCell implements ICell {
  public void accept(IVirus virus) {
    produceStomachFlu(virus);
  }
  ...
}

The above is obviously an oversimplification, since a particular cell implementation can react in different ways to different IVirus implementations, so perhaps that behavior can be modeled by delegating to private accept() methods, or using an else-if condition.

Similarly, assume that RhinoVirus.java and HepatitisVirus.java are two implementations of the IVirus interface. As before, a virus can probably visit different kinds of cells, so the canVisit() method is likely to be more complex than that shown below.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
// RhinoVirus.java
public class RhinoVirus implements IVirus {
  public boolean canVisit(ICell cell) {
    return (cell instanceof NasalPassageCell);
  }
  ...
}

// HepatitisVirus.java
public class HepatitisVirus implements IVirus {
  public boolean canVisit(ICell cell) {
    return (cell instanceof LiverCell);
  }
  ...
}

The coupling between the two classes of objects is effected using a CellVisitor class, shown below. A CellVisitor would actually be a virus (since it visits cells), so this logic could probably be put into a BaseVirus superclass instead of in a separate class. The CellVisitor class will define how the source object (IVirus) will traverse a collection of target objects (ICell), and is likely to be application-specific. In our case, it would probably look like this:

1
2
3
4
5
6
7
8
9
public class CellVisitor {
  public void processCells(IVirus virus, Collection<ICell> cells) {
    for (ICell cell : cells) {
      if (virus.canVisit(cell)) {
        cell.accept(virus);
      }
    }
  }
}

This pattern is quite powerful, both for viruses and code. As time passes, cells grow resistant to certain viruses, and viruses mutate into new viruses which attempt to overcome the cell's newly developed resistivity. However, as long as cells and viruses continue to conform to the same interface definition, the CellVisitor.processCells() method will continue to work, although it will produce different results as the inputs change.

In case of code, this pattern provides a way to loosely couple two classes of objects. As long as the interface remains unchanged, new implementations can be quickly added without concern that the behavior will change.

Tuesday, January 01, 2008

Personality Tests

Happy New Year. In 2006, I ended the year with a blog that I called a retrospective, ie looking back on the year past. In the interests of doing something different this year, I will end with an introspective. Some weeks ago, on a relatively slow afternoon, I found a link in a Splee's blog that led me to a couple of rather interesting personality tests, the results of which appear below:

You are PHP.  You enjoy the World Wide Web.  You are constantly changing the way you do things, and this tends to confuse people who work with you.
Which Programming Language are You?
You are Amiga OS. Ahead of your time.  You keep a lot of balls in the air.  If only your parents had given you more opportunities to suceed.
Which OS are You?

I don't normally do these tests, having about the same respect for their effectiveness as I would have in the prognostications of a Magic 8-ball. However, knowing what kind of programming language or operating system I resembled had a certain nerdy appeal. Besides, a fair indication of how good they were would be how easy or natural it felt working in these two environments, so it would be fairly easy to test how good the tests were, so I went for it.

While PHP would not be the first language I would identify with, I did find it very easy to work with both times I tried it, thanks in part to the great online documentation and examples. As for AmigaOS, I did not know anything about it, except for the fact that an OS once existed by that name. Reading about it, however, I think I would have liked working on it, and possibly even identifying with it, it being so far ahead of its time and all :-). Unlike the AmigaOS however, my parents actually went out of their way to give me opportunities for success, the fact that I am not as successful as the test would like me to be is probably solely my fault :-).

Or maybe I am just a gullible person, and in reality, these are just generic profiles which my mind coerces into the right shape to fit my self-image.

Be that as it may, regardless of my (or your) opinions on the predictive powers of a personality test, the more interesting question was, how do they work? Turns out that there is some math behind at least some of the tests. I came across some published material on this here, here and here.

Based on this, a possible algorithm may go like this. To train the personality analyzer, a large set of questions are asked of a set of control subjects, and their answers are correlated to their actual personality (determined by manual psychiatric evaluation). Each test would have a small and finite set of personality types that can be provided as a result (such as the 16 different personality types proposed in the Myers-Briggs classification). Each of the multiple-choice answers (given by the control group) to the questions are assigned a weight for each personality, which represents the computed probability of that choice being made by the member of that personality type. In pseudo-code:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
for control_subject in control_subjects:
  for question in questions:
    answer = question.selected_answer
    personality_type = control_subject.personality_type
    total_answered[question, answer, personality_type]++
for control_subject in control_subjects:
  personality_type = control_subject.personality_type
  count_of_personality_type[personality_type]++
for question in questions:
  for answer in answers:
    for personality_type in personality_types:
      prob[question, answer, personality_type] = 
        total_answered[question, answer, personality_type] / 
        count_of_personality_type[personality_type]

At the end of this, each answer to each question will have a score representing the probability of a particular personality type selecting that particular answer to a question.

A prospective test-taker is assigned a random subset of the question set. The sum of the weights of his/her answers are computed for each personality type, and the type corresponding to the highest sum is offered as a result. In pseudo-code, again:

1
2
3
4
5
for question in question:
  answer = question.selected_answer
  for personality_type in personality_types:
    score[personality_type] += prob[question, answer, personality_type]
return (max(score[personality_type])).personality_type

While on the subject of Statistics, I was in India recently and took some time out to meet my folks. One of them is a brilliant but rather acerbic old gentleman who started out as a physicist (as did I, hence the connection) and later turned his talents to mechanical engineering. I happened to mention that I used Statistics at work, and gave him an example of one such case. The conversation then veered to how statistics can help if all the variables in the problem are not fully defined, and he happened to comment that for problems where all variables are known, vector algebra could help, and in cases where the resulting matrix model is computationally intensive, he suggested using sets. I had recently discovered this (the hard way, by building both implementations) for myself, but I was surprised by his remark where I had not even defined a problem, leading me to think that this could be some kind of heuristic. Anyone know what I should be looking for if I want to know more about the reasoning behind such a heuristic?

On another completely different note, I indulged in an accidental bit of tourism, visiting Mount Abu for a day trip. I was really quite blown away by the Jain temples there. The ceilings and columns in the temples are intricately carved out of marble, and are built over a period of about 500 years starting in the 11th century. The ceilings are inverted semi-spheroids with lotuses and such hanging off the top, all carved out of one solid block of marble. According to the guide, one of the temples was built by the craftsmen themselves, in their spare time, using marble that was not perfect enough for the main temples and therefore discarded. Like most ancient structures, it is as much an engineering feat as a work of art, so in that sense, this could probably be considered an early example of skunkworks development.