Salmon Run: May 2006

Friday, May 26, 2006

Java Data Structure: A Generic Tree

A few weeks ago, I was talking to another Java programmer who was once a C++ programmer. The conversation gravitated towards data structures, and about how C++ programmers need to consider the best data structure for their particular application and then figure out the best and most optimal way to build it, while Java programmers typically just choose the most optimal data structure for their application from the standard java.util library. In my opinion, this is both a good thing and a bad thing for Java programmers. Its good because you dont have to worry about implementing a Vector for each project (or remember which third party library to include or purchase), and its bad, because it makes you lazy. So lazy, in fact, that when the time comes to implement a data structure that is not available in the Java standard library, or cannot be synthesized by combining Maps and Lists, Java programmers are often at a loss for how to proceed.

This conversation turned out to be prescient, since a few days later, I found myself confronted with programming a business object that just had to be mapped to a Tree data structure. It was a Task object, which can contain other Tasks, which can contain other Tasks, and so on. Since there is no built in Tree data structure in the java.utils package, I had to create my own. This article explains the design and provides the code for a generic Tree structure.

The basic Tree Data Structure

First off, the Tree class itself should be a container class, and should contain elements of type Node. The Node element is a thin wrapper over the actual data content, to which we will assign the generic type T. So our Tree<T> will contain a collection of Node<T> objects, where T is the actual data that we want to represent in our tree structure. The Node is represented as an element of type T, and a List<T> of children of type Node<T>. We use a List object to hold all the children, because we are dealing with an N-ary Tree, ie, we cannot tell in advance how many children a particular Node will have. Here is a diagram showing the logical tree and the corresponding data structure.

The code for the Tree and Node objects follow:

/**
 * Represents a Tree of Objects of generic type T. The Tree is represented as
 * a single rootElement which points to a List<Node<T>> of children. There is
 * no restriction on the number of children that a particular node may have.
 * This Tree provides a method to serialize the Tree into a List by doing a
 * pre-order traversal. It has several methods to allow easy updation of Nodes
 * in the Tree.
 */
public class Tree<T> {
 
    private Node<T> rootElement;
     
    /**
     * Default ctor.
     */
    public Tree() {
        super();
    }
 
    /**
     * Return the root Node of the tree.
     * @return the root element.
     */
    public Node<T> getRootElement() {
        return this.rootElement;
    }
 
    /**
     * Set the root Element for the tree.
     * @param rootElement the root element to set.
     */
    public void setRootElement(Node<T> rootElement) {
        this.rootElement = rootElement;
    }
     
    /**
     * Returns the Tree<T> as a List of Node<T> objects. The elements of the
     * List are generated from a pre-order traversal of the tree.
     * @return a List<Node<T>>.
     */
    public List<Node<T>> toList() {
        List<Node<T>> list = new ArrayList<Node<T>>();
        walk(rootElement, list);
        return list;
    }
     
    /**
     * Returns a String representation of the Tree. The elements are generated
     * from a pre-order traversal of the Tree.
     * @return the String representation of the Tree.
     */
    public String toString() {
        return toList().toString();
    }
     
    /**
     * Walks the Tree in pre-order style. This is a recursive method, and is
     * called from the toList() method with the root element as the first
     * argument. It appends to the second argument, which is passed by reference     * as it recurses down the tree.
     * @param element the starting element.
     * @param list the output of the walk.
     */
    private void walk(Node<T> element, List<Node<T>> list) {
        list.add(element);
        for (Node<T> data : element.getChildren()) {
            walk(data, list);
        }
    }
}

/**
 * Represents a node of the Tree<T> class. The Node<T> is also a container, and
 * can be thought of as instrumentation to determine the location of the type T
 * in the Tree<T>.
 */
public class Node<T> {
 
    public T data;
    public List<Node<T>> children;
 
    /**
     * Default ctor.
     */
    public Node() {
        super();
    }
 
    /**
     * Convenience ctor to create a Node<T> with an instance of T.
     * @param data an instance of T.
     */
    public Node(T data) {
        this();
        setData(data);
    }
     
    /**
     * Return the children of Node<T>. The Tree<T> is represented by a single
     * root Node<T> whose children are represented by a List<Node<T>>. Each of
     * these Node<T> elements in the List can have children. The getChildren()
     * method will return the children of a Node<T>.
     * @return the children of Node<T>
     */
    public List<Node<T>> getChildren() {
        if (this.children == null) {
            return new ArrayList<Node<T>>();
        }
        return this.children;
    }
 
    /**
     * Sets the children of a Node<T> object. See docs for getChildren() for
     * more information.
     * @param children the List<Node<T>> to set.
     */
    public void setChildren(List<Node<T>> children) {
        this.children = children;
    }
 
    /**
     * Returns the number of immediate children of this Node<T>.
     * @return the number of immediate children.
     */
    public int getNumberOfChildren() {
        if (children == null) {
            return 0;
        }
        return children.size();
    }
     
    /**
     * Adds a child to the list of children for this Node<T>. The addition of
     * the first child will create a new List<Node<T>>.
     * @param child a Node<T> object to set.
     */
    public void addChild(Node<T> child) {
        if (children == null) {
            children = new ArrayList<Node<T>>();
        }
        children.add(child);
    }
     
    /**
     * Inserts a Node<T> at the specified position in the child list. Will     * throw an ArrayIndexOutOfBoundsException if the index does not exist.
     * @param index the position to insert at.
     * @param child the Node<T> object to insert.
     * @throws IndexOutOfBoundsException if thrown.
     */
    public void insertChildAt(int index, Node<T> child) throws IndexOutOfBoundsException {
        if (index == getNumberOfChildren()) {
            // this is really an append
            addChild(child);
            return;
        } else {
            children.get(index); //just to throw the exception, and stop here
            children.add(index, child);
        }
    }
     
    /**
     * Remove the Node<T> element at index index of the List<Node<T>>.
     * @param index the index of the element to delete.
     * @throws IndexOutOfBoundsException if thrown.
     */
    public void removeChildAt(int index) throws IndexOutOfBoundsException {
        children.remove(index);
    }
 
    public T getData() {
        return this.data;
    }
 
    public void setData(T data) {
        this.data = data;
    }
     
    public String toString() {
        StringBuilder sb = new StringBuilder();
        sb.append("{").append(getData().toString()).append(",[");
        int i = 0;
        for (Node<T> e : getChildren()) {
            if (i > 0) {
                sb.append(",");
            }
            sb.append(e.getData().toString());
            i++;
        }
        sb.append("]").append("}");
        return sb.toString();
    }
}

Reading and Writing Trees from Database

The Tree and Node classes are not very useful by themselves, since what I am after is the ability to store and retrieve my Tasks from a database. I will use Hibernate for my ORM layer. Since Hibernate, to the best of my knowledge, does not work with generic Java code, we create fully specified subclasses out of our generic classes above, and the Task POJO itself. The Task database table itself follows the standard pattern for representing hierarchical data in relational tables, with an embedded parent id that points back to the row's logical parent. The diagram below illustrates this pattern.

And here is the code for the subclasses of the generic Tree and Node classes, the POJO representing the Task object, the Hibernate XML mapping, the abstract base Hibernate DAO class, and the DAO classes for TaskTree and Task.

/**
 * Non-generic subclass of Tree<Task>
 */
public class TaskTree extends Tree<Task> {
    public TaskTree() {
        super();
    }
}

/**
 * Non-generic subclass of Node<T>
 */
public class TaskNode extends Node<Task> {
    public TaskNode() {
        super();
    }
}

/**
 * Simple POJO with one to one mapping to database table Task.
 */
public class Task implements Serializable {
    private Long id;
    private Long parentId;
    private String taskName;
    private String taskDescription;
    private String allocatedTo;
    private Integer estimatedHours;
    private Integer actualHours;
    private String status;
 
    public Task() {
        super();
    }
     
    public Task(String taskName) {
        this();
        setTaskName(taskName);
    }
     
    public Integer getActualHours() {
        return this.actualHours;
    }
 
    public void setActualHours(Integer actualHours) {
        this.actualHours = actualHours;
    }
 
    public String getAllocatedTo() {
        return this.allocatedTo;
    }
 
    public void setAllocatedTo(String allocatedTo) {
        this.allocatedTo = allocatedTo;
    }
 
    public Integer getEstimatedHours() {
        return this.estimatedHours;
    }
 
    public void setEstimatedHours(Integer estimatedHours) {
        this.estimatedHours = estimatedHours;
    }
 
    public Long getId() {
        return this.id;
    }
 
    public void setId(Long id) {
        this.id = id;
    }
 
    public Long getParentId() {
        return this.parentId;
    }
 
    public void setParentId(Long parentId) {
        this.parentId = parentId;
    }
 
    public String getStatus() {
        return this.status;
    }
 
    public void setStatus(String status) {
        this.status = status;
    }
 
    public String getTaskDescription() {
        return this.taskDescription;
    }
 
    public void setTaskDescription(String taskDescription) {
        this.taskDescription = taskDescription;
    }
 
    public String getTaskName() {
        return this.taskName;
    }
 
    public void setTaskName(String taskName) {
        this.taskName = taskName;
    }
     
    public String toString() {
        return ReflectionToStringBuilder.reflectionToString(this, ToStringStyle.DEFAULT_STYLE);
    }
}

The Hibernate mapping file for the Task object looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE hibernate-mapping PUBLIC "-//Hibernate/Hibernate Mapping DTD 3.0//EN" "http://hibernate.sourceforge.net/hibernate-mapping-3.0.dtd" >
<hibernate-mapping>
                                                                                
  <class name="com.mycompany.beans.Task" table="Task" lazy="false">
                                                                                
    <id name="id" type="java.lang.Long" unsaved-value="null">
      <generator class="native" />
    </id>
                                                                                
    <property name="parentId" type="java.lang.Long">
      <column name="parentId" />
    </property>
                                                                                
    <property name="taskName" type="java.lang.String">
      <column name="taskName" />
    </property>
                                                                                
    <property name="taskDescription" type="java.lang.String">
      <column name="taskDescription" />
    </property>
                                                                                
    <property name="allocatedTo" type="java.lang.String">
      <column name="allocatedTo" />
    </property>
                                                                                
    <property name="estimatedHours" type="java.lang.Integer">
      <column name="estimatedHours" />
    </property>
                                                                                
    <property name="actualHours" type="java.lang.Integer">
      <column name="actualHours" />
    </property>
                                                                                
    <property name="status" type="java.lang.String">
      <column name="status" />
    </property>
                                                                                
  </class>
</hibernate-mapping>

Finally, in keeping with our thinking that the Tree object should be a collection and concern itself only with navigation, and the Node object should reference the actual object, we create two separate DAO (Data Access Objects). The client will normally reference the TreeDao, and the TreeDao will internally reference the TaskDao as needed. Here is the code for the TreeDao and TaskDao classes. We have also refactored the common code into an abstract class AbstractHibernateDao (which extends Spring's HibernateDaoSupport).

/**
 * A base DAO object that needs to be extended by all our individual DAOs. This
 * class provides the base functionality to interact with the Hibernate ORM.
 * @author Sujit Pal (sujit.pal@comcast.net)
 * @version $Revision$
 */
public abstract class AbstractHibernateDao extends HibernateDaoSupport {
 
    public AbstractHibernateDao() {
        super();
    }
 
    public Serializable save(Object entity) {
        return getHibernateTemplate().save(entity);
    }
     
    public void saveOrUpdate(Object entity) {
        getHibernateTemplate().saveOrUpdate(entity);
    }
     
    public void update(Object entity) {
        getHibernateTemplate().update(entity);
    }
     
    public void remove(Object entity) {
        getHibernateTemplate().delete(entity);
    }
     
    public Object get(Class clazz, Serializable id) {
        return getHibernateTemplate().get(clazz, id);
    }
     
    public Object load(Class clazz, Serializable id) {
        return getHibernateTemplate().load(clazz, id);
    }
     
    public List find(final String hqlQuery, final String[] params, final Object[] paramValues, final Type[] paramTypes, final Integer start, final Integer count) {
        return (List) getHibernateTemplate().execute(new HibernateCallback() {
            public Object doInHibernate(Session session) throws HibernateException, SQLException {
                Query query = session.createQuery(hqlQuery);
                int paramsLength = (params == null ? 0 : params.length);
                if (params != null) {
                    if (paramValues == null || paramValues.length != params.length) {
                        throw new IllegalArgumentException("ParamValues not completely specified for all params");
                    }
                    if (paramValues == null || paramTypes.length != params.length) {
                        throw new IllegalArgumentException("ParamTypes not completely specified for all params");
                    }
                }
                for (int i = 0; i < paramsLength; i++) {
                    query.setParameter(params[i], paramValues[i], paramTypes[i]);
                }
                if (start == null) {
                    query.setFirstResult(0);
                } else {
                    query.setFirstResult(start.intValue());
                }
                if (count == null) {
                    query.setMaxResults(0);
                } else {
                    query.setMaxResults(count.intValue());
                }
                return query.list();
            }
        }, true);
    }
}

/**
 * Dao for the Task POJO.
 */
public class TaskDao extends AbstractHibernateDao {
     
    private final static Logger log = Logger.getLogger(TaskDao.class);
 
    public TaskDao() {
        super();
    }
     
    /**
     * Remove a Task object from the database.
     * @param task the Task to remove.
     */
    public void remove(Task task) {
        super.remove(task);
    }
     
    /**
     * Saves a Task to the database if it is a new one, else updates it.
     * @param task the Task to insert or update.
     * @return the task id.
     */
    public Long saveOrUpdate(Task task) {
        if (task.getId() == null) {
            return (Long) save(task);
        } else {
            update(task);
            return task.getId();
        }
    }
     
    /**
     * Returns the Task pointed to by the task id. Returns a NULL if the
     * Task is not found in the database.
     * @param taskId the task id to look up.
     * @return the Task object corresponding to the id.
     */
    public Task get(Long taskId) {
        return (Task) super.get(Task.class, taskId);
    }
     
    /**
     * Returns all the children of the Task which has the specified parent id.
     * @param parentId the id of the parent Task.
     * @return a List of Tasks which are children of the Task specified.
     */
    @SuppressWarnings("unchecked")
    public List<Task> findByParentId(Long parentId) {
        return super.find("from Task where parentId = :parentId",
            new String[] {"parentId"},
            new Object[] {parentId},
            new Type[] {Hibernate.LONG},
            new Integer(0), new Integer(1));
    }
     
    /**
     * Returns the first task with the specified name. This will typically never     * be used directly from the application, mainly for use in tests.
     * @param taskName the name of the task to return.
     * @return the first task with the specified name.
     */
    @SuppressWarnings("unchecked")
    public Task findByName(String taskName) {
        List<Task> tasks =  super.find("from Task where taskName = :taskName",
            new String[] {"taskName"},
            new Object[] {taskName},
            new Type[] {Hibernate.STRING},
            new Integer(0), new Integer(1));
        return tasks.get(0);
    }
}

/**
 * Data Access Object for the TaskTree object. This is not a true Dao, since it
 * delegates to the TaskDao object for any database operations.
 */
public class TaskTreeDao {
     
    private static final Logger log = Logger.getLogger(TaskTreeDao.class);
     
    private TaskDao taskDao;
 
    public TaskTreeDao() {
        super();
    }
     
    public void setTaskDao(TaskDao taskDao) {
        this.taskDao = taskDao;
    }

    /**
     * Saves a TaskTree object.
     * @param taskTree a TaskTree object.
     */     
    public void saveOrUpdate(TaskTree taskTree) {
        List<Node<Task>> tasks = taskTree.toList();
        // save the tree in reverse order, starting from the leaf nodes
        // and going up to the root of the tree.
        int numberOfNodes = tasks.size();
        for (int i = numberOfNodes - 1; i >= 0; i--) {
            Node<Task> taskElement = tasks.get(i);
            Task task = taskElement.getData();
            taskDao.saveOrUpdate(task);
            Long parentId = task.getId();
            for (Iterator<Node<Task>> it = taskElement.getChildren().iterator(); it.hasNext();) {
                Node<Task> childElement = it.next();
                Task childTask = childElement.getData();
                childTask.setParentId(parentId);
                taskDao.saveOrUpdate(childTask);
            }
        }
    }
     
    /**
     * Gets a single TaskTree by id. This is the head of a recursive method call
     * to getRecursive().
     * @param id the id of the TaskTree.
     * @return a TaskTree for that id.
     */
    public TaskTree get(Long id) {
        TaskTree taskTree = new TaskTree();
        Node<Task> rootElement = new Node<Task>(taskDao.get(id));
        getRecursive(rootElement, taskTree);
        taskTree.setRootElement(rootElement);
        return taskTree;
    }
     
    /**
     * Recursively descends the Tree and populates the TaskTree object.
     * @param taskElement the root Node.
     * @param tree the TaskTree object to populate.
     */
    private void getRecursive(Node<Task> taskElement, TaskTree tree) {
        List<Task> children = taskDao.findByParentId(taskElement.getData().getId());
        List<Node<Task>> childElements = new ArrayList<Node<Task>>();
        for (Iterator<Task> it = children.iterator(); it.hasNext();) {
            Task childTask = it.next();
            Node<Task> childElement = new Node<Task>(childTask);
            childElements.add(childElement);
            getRecursive(childElement, tree);
        }
        taskElement.setChildren(childElements);
    }
}

Usage

Using the above classes, if we have built up a TaskTree object, we can save it using the following call:

1
2
3

    TaskTree taskTree = buildUpOurTaskTreeSomehow();
    TaskTreeDao taskTreeDao = getReferenceToATaskTreeDao();
    taskTreeDao.saveOrUpdate(taskTree);

and retrieve it back from the database using this call:

1
2
3

    TaskNode rootElement = getOurRootNodeSomehow();
    TaskTreeDao taskTreeDao = getReferenceToATaskTreeDao();
    TaskTree taskTree = taskTreeDao.get(rootElement.getId());

So, although Java allows the Java programmer to be blissfully unaware of the complexity of building data structures by hand, sometimes it becomes necessary to do so, simply because the data structures provided by Java is not sufficient for the task at hand. There are a number of good books on the subjects, as mentioned in my last blog post, and if you don't want to spend any money, check out The Particle's Guide to Java Data Structures. I've only just skimmed it, and it could do with improvement, but hey, its free.

Saturday, May 20, 2006

Obligatory JavaOne 2006 Post

I attended the JavaOne 2006 conference and expo on Thursday last week. It was a full 5 days long, which was a break from the usual 3.5-4 days from earlier years. One of the advantages of being a local attendee is that you can go over to the conference whenever you have some free time, and the disadvantage is that you end up not having too much free time. Which was just as well, come to think about it, since the two sessions I attended had enough information content to make me feel as if I was drinking from a fire hose. I can only conclude that the people who go to as many sessions as they can for the full 5 days either have lots of time to research all the new Java offerings from Sun and their partners, or that they drop a lot of packets - I know I would do the latter in a similar situation.

The two sessions I ended up going to were J2EE Performance Profiling and Monitoring (TS-1549) and Troubleshooting in a Production Environment (TS-1669). I had actually signed on for a session on the AJAX DOJO Toolkit (TS-3577), but I changed my mind at the last moment and decided to go to the Profiling and Monitoring session. One thing I noted (from my own experience at JavaOne and descriptions of the recent Ajaxian conference from descriptions by colleagues who attended) was the huge interest in AJAX and all the different frameworks that have come up around it - there were easily 1000+ people lining up to attend any session even remotely AJAX related.

The proliferation of AJAX frameworks reminds me of a similar situation that happened with MVC frameworks a few years ago. My personal favorite (and incidentally the only one I have worked with, not counting Javascript code consisting of raw XmlHttpRequest calls) is DWR (Direct Web Remoting), although lots of people I know swear by the Prototype library. It would be interesting to see which AJAX frameworks finally emerge as the winners after the inevitable shakeout that will follow in the coming months - for now, I will attempt to only apply what I know about AJAX, rather than chase down the AJAX framework du jour.

The Profiling and Monitoring session turned out to be about JFluid, the new profiler plugin available in NetBeans and about Project Glassfish, an open source application server project sponsored by Sun. The profiler plugin is pretty slick, and it allows you to quickly identify hot spots in your J2EE application. To the best of my knowledge, however, its only available in NetBeans, although it is possible that the Eclipse folks may decide that they like it enough to write a plugin for it, similar to the Matisse Swing designer plugin.

Project Glassfish turned out to be something of a disappointment for me, however. All I was looking for was a standard (and extendable) JMX console application that could hook into any J2EE compliant application server. What I got was yet another application server, which seems to be based on the same microkernel architecture pioneered by JBoss and which is more recently used in Apache Geronimo. From my user-level perspective, the world needs another buggy (at first) application server like it needs a hole in its head, but I guess writing full blown application servers is more fun than writing useful little generic apps.

On the troubleshooting session, the speaker Alexandre Rafalovitch, the author of the View from the trenches blog was his usual raconturing self. I had attended his session two years earlier when he was with BEA Weblogic, and as usual there were lots of useful information to be had from his session. In the previous session he spoke of reading thread dumps, and pointed out the useful thread dump analyzer tool Samurai, which I still use. Just in passing, a colleague came up with a text based way to track dumps which I post here.

1	$ sort thread_dump.txt \| uniq -c \| sort -n

and then open the thread dump in an editor looking for the top patterns returned by the above command.

In this session, he covered logging tools that can listen on events raised by multiple application servers, such as Splunk and Apache Chainsaw, the unix utility lsof to check on files and sockets held on to by specified processes, and the network sniffer tool Ethereal.

I went back home and looked at Apache Chainsaw, it is a GUI which listens on a specific port (4445 by default) on your machine, and you can configure your log4j.properties in your application(s) to write to a SocketAppender at that port. I used this information in Bertrand's Weblog to set up Apache Chainsaw, and the information is valid for Apache Chainsaw v2 as well. Its quite slick, the different levels of log messages (DEBUG, INFO, WARN, ERROR) are color coded and easy to read. It would be especially useful if you were looking at logs in a clustered or multi-machine environment.

I also looked at Ethereal, and I had to also install the ethereal-gnome RPM on my Fedora Core 2 laptop in order to get the GUI, the ethereal RPM contains only tethereal, the text based version of the tool. It is a network sniffer and can sniff and report on an impressive number of protocols spanning the entire network stack.

I have been using the lsof utility before I heard of it in Java One, and the man page was enough to get me started, so I wont talk about it much here. Basically, it lists all open files (and sockets, since everything is a file in Unix) held open by a process, so I have found it useful in the past to see which process is holding on to a port, and kill it if needed.

On the expo front, I got to see many vendors whose products I actually use, which is heartening to see, because I have been using (at least partial) open source products which get sold under commercial licenses (thereby getting some degree of support but keeping licensing costs down). The vendors I actually got to talking with were Terracotta and MainSoft. I also kind of looked over someone's shoulder for a NetBeans demo at the Sun booth.

Terracotta offers an interesting clustering solution. Their product sits between the JVMs in a clustered application and the rest of the application. In an IoC environment such as Spring, wwitching between using the single user mode (for development) and the clustered mode (for production) is a matter of a single change, repointing the wiring from the local bean to the clustered bean. Unlike other clustering solutions, which rely on a distributed cache, Terracotta relies on bytecode instrumentation, enhancing application classes to add event generators which talk to listeners in the Terracotta layer. I haven't actually used the product yet, but I would definitely like to evaluate it.

MainSoft offers a product that will convert the DLLs created by your C#/.NET application into JAR files that can be served in a Java environment. This would be useful for people who want to serve their .NET apps from webservers other than Microsoft IIS, at the same time not having to retrain their .NET developers. My use case was slightly different, and the product will not address my needs, but it was interesting talking to them. Incidentally, it also seems to reflect the greater interest and emphasis on Java/.NET interoperability.

The NetBeans demo that I listened in on impressed me enough so I came back home and downloaded the NetBeans 5.5beta version. I am an Eclipse/MyEclipse user at home and an IntelliJ IDEA user at work. My initial reaction is that NetBeans 5.5 is quite feature rich, and on-par (and in some cases slightly better) compared to Eclipse in terms of features supported. The GUI looks much cleaner and more sophisticated, but is just a tad slower than Eclipse when it comes to doing autocompletes and such. There are certain operations in NetBeans which you just cannot seem to do without a mouse, which was a bummer for me. But overall, it looks quite clean and feature rich. For someone looking to start out with a free Java IDE, both NetBeans and Eclipse look equally promising. Of course, if you are already using Eclipse with MyEclipse, you may consider the tradeoff between the slight loss in functionality against the recurring annual subscription fee.

Yet another trend this year was the appearance of a shelf-ful of books dealing with Data Structures and Algorithms in the JavaOne bookstore this year. I think this is because Java is beginning to tackle harder and harder problems, and knowing about Data Structures and existing algorithms helps to not reinvent wheels to solve problems which have already been solved. Two books I liked were Data Structures and Algorithm Analysis in Java by Mark A Weiss and Data Structures and Algorithms in Java by Michael T Goodrich.

Saturday, May 13, 2006

Using HttpClient to PURGE Squid entries

This article describes a hack to send HTTP PURGE requests to a Squid server using the Apache Commons HttpClient library. Squid is a web proxy cache which sits between the webserver and the client, intercepting HTTP GET requests and serving them out of the cache if available, or passing the request through to the webserver, populating the cache, and serving it from the cache if not. This configuration is known as a reverse-proxy configuration, probably to distinguish it from the caching proxies that ISPs put in front of their gateways to speed up customer's HTTP requests.

Squid allows you to purge entries from its cache by using a PURGE request. You can configure Squid to accept PURGE requests only from localhost (or from a set of specified internal hosts), and provides a squidclient command to do the purging. This is explained in detail in the FAQ entry here.

Since my objective was to send the PURGE request to Squid from within a Java program, using the squidclient command was not the most optimal option. Looking around the web, I came across a newsgroup post which showed a Perl script to do the same thing, which just sent this standard HTTP request over a socket to the Squid port.

1 2	PURGE http://my.squid.host:port/junk HTTP/1.0 Accept: /

Obviously, I could do something similar in Java as well. But I was also using HttpClient in this project to send HTTP GET requests, so I thought that it would be more maintainable and unified if I could somehow use HttpClient instead of doing direct socket calls for the PURGE. However, the PURGE request neither part of the standard HTTP 1.1 protocol, nor does it make sense in the context of a standard webserver, so it is not supported by HttpClient out of the box.

Adding support for a PURGE method was quite trivial, however. All I had to do was create a new PurgeMethod.java class, using the source code for the GetMethod.java class as a template, and HttpClient was able to serve HTTP PURGE requests to Squid. Here is the code for the PurgeMethod.java class.

// PurgeMethod.java
package my.company.com.httpclient.methods;
 
import org.apache.commons.httpclient.HttpMethodBase;
import org.apache.log4j.Logger;
 
/**
 * Specialized method to send a HTTP PURGE request to the specified URL. This
 * class implements the HttpMethod interface from the commons HttpClient
 * package.
 */
public class PurgeMethod extends HttpMethodBase {
 
    public PurgeMethod() {
        super();
        setFollowRedirects(true);
    }
 
    public PurgeMethod(String url) {
        super(url);
        setFollowRedirects(true);
    }
 
    public String getName() {
        return "PURGE";
    }
}

The calling code is very similar to the standard calling code for sending HTTP GET requests using HttpClient, which is detailed in the HttpClient Tutorial here. A stripped down version (without timeout settings and retries) is shown below:

    HttpClient client = new HttpClient();
    HttpMethod method = new PurgeMethod(url);
    try {
        int status = client.executeMethod(method);
        if (status != HttpStatus.SC_OK && status != HttpStatus.SC_NOTFOUND) {
            throw new Exception("HTTP PURGE failed for: " + url + "(" + status + ")");
        }
        return; // response body does not make any sense here
    } finally {
        method.releaseConnection();
    }

To test this, I aimed the code at a running Squid installation which was set up to reverse proxy to a Resin server. My test consisted of sending a PURGE request followed by two GET requests in succession, while watching the Squid access.log from another terminal. My expectation is that I will see a PURGE request, then a GET request which will not find the page in the cache (a TCP_MISS:DIRECT) followed by a GET request which will find the page in cache (a TCP_HIT:NONE). Here is a snippet from the Squid access.log.

1
2
3

10.16.181.34 - - [11/May/2006:15:51:08 -0700] "PURGE http://my.company.com/myapp/mypage.html HTTP/1.1" 200 122 TCP_MISS:NONE
10.16.181.34 - - [11/May/2006:15:51:09 -0700] "GET http://my.company.com/myapp/mypage.html HTTP/1.1" 200 30351 TCP_MISS:DIRECT
10.16.181.34 - - [11/May/2006:15:52:13 -0700] "GET http://my.company.com/myapp/mypage.html HTTP/1.1" 200 29043 TCP_HIT:NONE

As you can see, the actual code involved in adding the PURGE method functionality to HttpClient is trivial. However, this is one of the two ways I can think of to purge Squid entries in pure Java in a platform independent way. The other pure Java way is to use Java Sockets. I think that this approach is cleaner in the sense that it re-invents fewer wheels and piggybacks on standard open source libraries which provides the base plumbing functionality. Also I would like to commend the developers of the HttpClient library for making the framework so easy to extend, which I believe is a hallmark of great framework software.

Saturday, May 06, 2006

Performance Tuning Thoughts

I recently attended the MySQL Users Conference at Santa Clara, CA. One of the tutorials I signed up for was Mark Matthew's talk on J2EE Performance Tuning. Of course, this was a MySQL conference, and the speaker happened to be the author of the original JDBC driver for MySQL, so understandably there was a lot of emphasis on the new performance monitoring aspects of MySQL/Connector-J version 5, the upcoming JDBC driver for MySQL version 5.x databases.

But the nice part of the talk was that it got me thinking again about how to monitor and address performance related issues in a holistic manner. In previous lives, I have been a developer and part time system administrator for console-based Unix systems, and more recently, an Informix DBA. In both these times, I have had to address performance issues, and I have done so in a non-holistic (for want of a better word) manner. For example, my first reaction to a performance issue as an Informix DBA would have been to check the database read-write statistics, looking for hot spots, and trying to address problems by splitting the reads and writes on different disks. The next place I would look for is to check cardinality of data in the tables, looking for missing indexes. Since Informix, to the best of my knowledge, did not log slow queries, analyzing the queries meant that I would have to scan the entire codebase looking for them, so that was something I would do after the other approaches did not deliver the required functionality.

As a developer in a J2EE environment, I still have to address performance issues, but the focus is developer oriented. Typically, I measure wall-clock times of various methods and find methods which take the longest times and see if there is SQL or code that can be optimized. I measure front end responses with the Apache Bench tool, which allows you to set the number of clients, and the number of requests each client will make, and returns (among other things), the requests per second the page could serve, and the average, minimum and maximum processing times. Although MySQL logs slow queries in the slow query log, it gets used only when I am reacting to a performance problem, not when I am being proactive about ensuring my code is performant, because getting the slow query log needs DBA involvement (on MySQL version 4.1). The important thing to note in all these cases is that the performance measurements are developer-centric.

Occasionally, when reacting to performance problems, I would also look at application server (Resin) thread dumps, and try to find and fix code bottlenecks by tracing the dump back to the offending code. Although we don't run Resin with the stock JVM settings (these settings are determined by another group, based on the machine capacity on which Resin will be running), the only JVM settings I have ever actually changed myself are the minimum and maximum heap sizes.

What the talk did for me was to highlight that a J2EE application is really a layered cake of potentially non-performant hardware and software. At the very bottom there is the CPU and the RAM, followed by the operating system, followed by the database, followed by the application server, followed by the application code. The operating system, the database and the application server could potentially be non-performant because they have been improperly tuned for the application.

Fortunately, however, one does not have to start from scratch when trying to optimize for performance. It really boils down to choosing the right sized components as a starting point. Based on the projected demands on your application, you can usually choose the appropriate number and type of CPUs, and the amount of memory on your system, and the size of your swap space (among other things) on a Linux (Unix) based operating system. Databases generally offer configuration profiles (for example, the tiny, small, large and huge memory models of MySQL) to suit the particular application and hardware. Java based application servers allow you to tune the starting and maximum heap sizes, the type of garbage collector you want to use, and the generation sizes within your heap to optimize garbage collection. So really, the starting point of delivering optimal performance is to set up the optimal capacity.

Still, a person who needs to diagnose and fix a problem with a J2EE application will need to be familiar with and be able to tweak all these subsystems. Because performance metrics are heavily application dependent, this person will also need to be familiar with the application itself. Finding such a person in an organization of even moderate size is next to impossible. Bringing together groups of people to do a performance audit or diagnose and fix performance issues is a possibility, but since fixing a performance problem involves an iterative cycle of observation, tweak and more observation, this is often a time consuming operation, and often not acceptable to a business, which is losing money every minute with the non-performing application.

There was a time when I would sneer at the practice of "throwing more hardware" at a problem to fix performance issues, but the more I think about the expenses in continuing to operate a non-performant application, and the logistics to try to fix this in the time provided, the more I lean towards this option as the simplest and most cost effective way to deliver performance. By that, I do not imply the sloppy and non-performant code is ok. If there are indexes that need to be applied to the database, or the SQL needs to be re-written, or the components appropriately sized, then these should by done first. However, if your application is serving 1000 pages per second, and it starts keeling over when it is required to serve 2000 pages per second, it is possible that a few days of performance tuning will allow you to scale to the new level, perhaps more. But if it did, then it is more than possible that your application was not performant to begin with, and that should have been addressed before the application was deployed.

What I mean by "throwing more hardware" is the ability to scale out using clustering technologies. Putting the application behind a webserver and setting up reverse proxies to multiple underlying application servers, all running the same application, is one way. While each application server will contain the exact same copy of the application, they will be serving different slices of the application. The slicing will be set up in the reverse proxy configuration of the webserver. Alternatively, each application server serves the full application, but through multiple webservers behind a hardware load balancer. A hybrid of these two approaches is also a possibility.

On the database side, the scale out can be achieved by clustering multiple database masters (the read-any, write-all approach) or database replicants (one master, multiple slaves). I personally prefer the multiple master clustering scenario, since the application does not need to be changed at all to accomodate the change. The application thinks it is talking to a single database. On the other hand, in a replicated setup, you will have to have separate configurations to read from the slaves and to read and write from and to the master. There is also a replication latency which you will have to account for if your application has a scenario where it writes and reads back within a very short time. Most databases (including MySQL) offer the ability to do replication. C-JDBC is an open source initiative to achieve multiple master clustering, but its performance leaves a lot to be desired. I am told that m/cluster from Continuent, a commercial offering based on C-JDBC is much better in that regard.

Another important component is the application cache. Most J2EE applications that serve dynamic (ie generated from a database) and have significant traffic tend to use caching of some kind. When we attempt to serve the traffic with multiple machines, the cache has to be shared between the application server JVMs. There are a variety of distributed caches available in the market. The best known of these is Coherence from Tangosol, but there are others, such as JBoss Cache and SwarmCache. These caches either replicate or distribute - replicated caches replicate the cached contents to all the caches in the cluster, while distributed caches store it in only one place, but are able to pull it from the right place when they are requested to.

I still think that performance reviews of application code and load testing on hardware comparable to actual production hardware have lots of value, but neither of these approaches are simple to set up. The new MySQL JDBC driver provides a lot more performance metrics (including a "local" list of slow queries), so a disciplined J2EE developer using MySQL will find this very helpful in finding and fixing code and SQL performance bottlenecks before pushing the code out of development.

So I think, my personal performance tuning mantra boils down to these two simple commandments:

Find and fix bottlenecks in SQL and code in development.
Simulate load using the Apache Bench tool.
Design for clustering.

Salmon Run

Friday, May 26, 2006

Java Data Structure: A Generic Tree

Saturday, May 20, 2006

Obligatory JavaOne 2006 Post

Saturday, May 13, 2006

Using HttpClient to PURGE Squid entries

Saturday, May 06, 2006

Performance Tuning Thoughts

Posts

Labels

Blogs I Read

About me

My Nerd Rating

Visitor Map

Contact Me