Saturday, May 24, 2008

Parsing OWL XML with StAX

I needed some data to build an ontology, so I downloaded a sample wine ontology from the W3C. The file is in OWL (Web Ontology Language) XML format, so I needed to parse it. Ordinarily I would have used JDOM, but a colleague had recently said good things about XML pull-parsing, so I decided to try it out with StAX, which is built into Java 6.

My objective was to parse the XML file into a simple database structure that I could use later. The database structure is shown below. The central table is the entity table, which represents a node of the ontology. The relations table links nodes via relationships. The attributes table contains non-base properties of an entity. The distinction between attributes and relationships is kind of gray, but in general I consider an attribute to be a property that we can look an entity up by, such as a name. The attribute_type and relation_type tables are an attempt to normalize repetitive attribute and relation names out of the tables.

I would like to point out that the parser is not a "standard" OWL parser. The tag names to extract information from were determined by eyeballing the wine.rdf file and figuring out which tags would yield interesting information. So it will need to change in order to parse some other OWL file. However, if you are just looking for pointers on how to go about doing something similar, you may find the post useful.

Since this is the first time I have used a pull-parsing library, I would like to share my initial reactions to this strategy. At first sight, there does not seem to be much difference between SAX (push-parsing) and StAX. With SAX, you intercept the parser lifecycle, adding hooks into the startElement() and endElement() methods to do custom processing. With StAX, you ask the parser for the next event yourself and respond to the START_ELEMENT and END_ELEMENT events it returns. The difference becomes apparent once you start working with it, however. Because your code drives the event loop, you can delegate processing to sub-methods by passing around the parser reference, which makes your code a bit cleaner.
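To make the delegation point concrete, here is a minimal, self-contained sketch. It is not part of the loader: the class name PullDelegationSketch, the cellar/wine/color tags and the sample document are all made up for illustration. The main loop pulls events and, when it sees a wine start tag, hands the same parser to a sub-method that consumes events up to the matching end tag.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class; the tag names are invented for this sketch.
class PullDelegationSketch {

  /** Main pull loop: our code asks the parser for each event. */
  static List<String> parse(String xml) {
    List<String> found = new ArrayList<String>();
    try {
      XMLStreamReader parser = XMLInputFactory.newInstance()
          .createXMLStreamReader(new StringReader(xml));
      try {
        while (parser.hasNext()) {
          int event = parser.next();
          if (event == XMLStreamConstants.START_ELEMENT
              && "wine".equals(parser.getLocalName())) {
            // Delegate: the sub-method reads up to the matching </wine>
            // and then control returns to this loop.
            found.add(readWine(parser));
          }
        }
      } finally {
        parser.close();
      }
    } catch (XMLStreamException e) {
      throw new IllegalArgumentException("Malformed XML", e);
    }
    return found;
  }

  /** Reads one wine element; the parser is positioned on its start tag. */
  static String readWine(XMLStreamReader parser) throws XMLStreamException {
    String name = parser.getAttributeValue(null, "name");
    String color = null;
    while (parser.hasNext()) {
      int event = parser.next();
      if (event == XMLStreamConstants.START_ELEMENT
          && "color".equals(parser.getLocalName())) {
        color = parser.getElementText(); // leaves the parser on </color>
      } else if (event == XMLStreamConstants.END_ELEMENT
          && "wine".equals(parser.getLocalName())) {
        break; // done with this wine
      }
    }
    return name + ":" + color;
  }

  public static void main(String[] args) {
    String xml = "<cellar>"
        + "<wine name=\"Merlot\"><color>Red</color></wine>"
        + "<wine name=\"Riesling\"><color>White</color></wine>"
        + "</cellar>";
    System.out.println(parse(xml)); // prints [Merlot:Red, Riesling:White]
  }
}
```

With SAX, the equivalent of readWine() would have to be spread across callbacks that share state through fields; here the state lives in local variables of the sub-method.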

The code for the parser is shown below. It saves the extracted data into a local MySQL database. Callers would instantiate the OwlDbLoader, set the path to the OWL file and DataSource, and call the parseAndLoadData() method.

package com.mycompany.myapp.ontology.loaders;

import java.io.FileInputStream;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

import javax.sql.DataSource;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

import org.apache.commons.collections15.Closure;
import org.apache.commons.lang.StringUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.springframework.dao.IncorrectResultSizeDataAccessException;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.core.PreparedStatementCreator;
import org.springframework.jdbc.support.GeneratedKeyHolder;
import org.springframework.jdbc.support.KeyHolder;

import com.mycompany.myapp.ontology.Attribute;
import com.mycompany.myapp.ontology.Entity;

/**
 * Parses OWL files representing external ontologies and loads them
 * into a local database.
 */
public class OwlDbLoader {
  
  private final Log log = LogFactory.getLog(getClass());
  
  private final static String RDF_URI = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
  
  private JdbcTemplate jdbcTemplate;
  private String owlFileLocation;
  
  private static String parentTagName = null;
  
  public void setDataSource(DataSource dataSource) {
    this.jdbcTemplate = new JdbcTemplate(dataSource);
  }
  
  public void setOwlFileLocation(String owlFileLocation) {
    this.owlFileLocation = owlFileLocation;
  }

  /**
   * These parsing rules were devised by physically looking at the OWL file
   * and figuring out what goes where. This should by no means be considered
   * a generalized way to parse OWL files.
   * 
   * Parsing rules:
   * 
   * owl:Class@rdf:ID = entity (1), type=Wine
   * optional:
   *   owl:Class/rdfs:subClassOf@rdf:resource = entity (2), type=Wine
   *   (2) -- parent --> (1)
   * if owl:Class/rdfs:subClassOf has no attributes, ignore
   * if no owl:Class/rdfs:subClassOf entity, ignore it
   * owl:Class/owl:Restriction/owl:onProperty@rdf:resource related to
   *   owl:Class/owl:Restriction/owl:hasValue@rdf:resource
   *  
   * Region@rdf:ID = entity, type=Region
   * optional:
   *   Region/locatedIn@rdf:resource=entity (2), type=Region
   *   (2) -- parent -- (1)
   * owl:Class/rdfs:subClassOf/owl:Restriction - ignore
   * 
   * WineBody@rdf:ID = entity, type=WineBody
   * WineColor@rdf:ID = entity, type=WineColor
   * WineFlavor@rdf:ID = entity, type=WineFlavor
   * WineSugar@rdf:ID = entity, type=WineSugar
   * Winery@rdf:ID = entity, type=Winery
   * WineGrape@rdf:ID = entity, type=WineGrape
   * 
   * Else if no namespace, this must be a wine itself, capture as entity:
   * ?@rdf:ID = entity, type=Wine
   *   all subtags are relations:
   *     tagname = relation_name
   *     tag@rdf:resource = target entity
   */
  public void parseAndLoadData() throws Exception {
    XMLInputFactory factory = XMLInputFactory.newInstance();
    XMLStreamReader parser = factory.createXMLStreamReader(
      new FileInputStream(owlFileLocation));
    int depth = 0;
    for (;;) {
      int event = parser.next();
      if (event == XMLStreamConstants.END_DOCUMENT) {
        break;
      }
      switch (event) {
        case XMLStreamConstants.START_ELEMENT:
          depth++;
          String tagName = formatTag(parser.getName());
          if (tagName.equals("owl:Class")) {
            processTag(parser, new Closure<XMLStreamReader>() {
              public void execute(XMLStreamReader parser) {
                // relations are not being persisted because value of child
                // entity cannot be persisted.
                String tagName = formatTag(parser.getName());
                if (tagName.equals("owl:Class")) {
                  String name = parser.getAttributeValue(RDF_URI, "ID");
                  if (name != null) {
                    Entity classEntity = new Entity();
                    parentTagName = name;
                    classEntity.setName(parentTagName);
                    classEntity.addAttribute(new Attribute("Type", "Class"));
                    saveEntity(classEntity);
                  }
                } else if (tagName.equals("rdfs:subClassOf")) {
                  String name = parser.getAttributeValue(RDF_URI, "resource");
                  if (name != null) {
                    Entity superclassEntity = new Entity();
                    if (name.startsWith("http://")) {
                      superclassEntity.setName(name.substring(name.lastIndexOf('#') + 1));
                      superclassEntity.addAttribute(new Attribute("Type", 
                        name.substring(name.lastIndexOf('/') + 1, 
                        name.lastIndexOf('#')) + ":Class"));
                    } else if (name.startsWith("#")) {
                      superclassEntity.setName(name.substring(1));
                      superclassEntity.addAttribute(new Attribute("Type", "Class"));
                    } else {
                      superclassEntity.setName(name);
                      superclassEntity.addAttribute(new Attribute("Type", "Class"));
                    }
                    saveEntity(superclassEntity);
                    saveRelation(parentTagName, superclassEntity.getName(), "parentOf");
                    parentTagName = null;
                  }
                }
              }
            });
          } else if (tagName.equals("Region")) {
            processTag(parser, new Closure<XMLStreamReader>() {
              public void execute(XMLStreamReader parser) {
                String tagName = formatTag(parser.getName());
                if (tagName.equals("Region")) {
                  Entity classEntity = new Entity();
                  parentTagName = parser.getAttributeValue(RDF_URI, "ID");
                  classEntity.setName(parentTagName);
                  classEntity.addAttribute(new Attribute("Type", "Region"));
                  saveEntity(classEntity);
                } else if (tagName.equals("locatedIn")) {
                  Entity superclassEntity = new Entity();
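                  String locationEntityName = parser.getAttributeValue(RDF_URI, "resource");
                  if (locationEntityName == null) {
                    // no rdf:resource on this tag, so there is nothing to link to
                    return;
                  }
                  if (locationEntityName.startsWith("#")) {
                    locationEntityName = locationEntityName.substring(1);
                  }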
                  superclassEntity.setName(locationEntityName);
                  superclassEntity.addAttribute(new Attribute("Type", "Region"));
                  saveEntity(superclassEntity);
                  saveRelation(parentTagName, locationEntityName, "locatedIn");
                  parentTagName = null;
                }
              }
            });
          } else if (tagName.equals("WineBody") || 
              tagName.equals("WineColor") ||
              tagName.equals("WineFlavor") ||
              tagName.equals("WineSugar") ||
              tagName.equals("WineGrape")) {
            processTag(parser, new Closure<XMLStreamReader>() {
              public void execute(XMLStreamReader parser) {
                Entity entity = new Entity();
                String name = parser.getAttributeValue(RDF_URI, "ID");
                if (name != null) {
                  entity.setName(name);
                  String tagName = parser.getLocalName();
                  Attribute attribute = null;
                  if (tagName.equals("WineBody")) {
                    attribute = new Attribute("Type", "Body");
                  } else if (tagName.equals("WineColor")) {
                    attribute = new Attribute("Type", "Color");
                  } else if (tagName.equals("WineFlavor")) {
                    attribute = new Attribute("Type", "Flavor");
                  } else if (tagName.equals("WineSugar")) {
                    attribute = new Attribute("Type", "Sugar");
                  } else if (tagName.equals("WineGrape")) {
                    attribute = new Attribute("Type", "Grape");
                  }
                  entity.addAttribute(attribute);
                  saveEntity(entity);
                }
              }
            });
          } else if (tagName.equals("vin:Winery")) {
            processTag(parser, new Closure<XMLStreamReader>() {
              public void execute(XMLStreamReader parser) {
                String wineryName = parser.getAttributeValue(RDF_URI, "about");
                if (wineryName == null) {
                  // nested tag without rdf:about, nothing to save
                  return;
                }
                if (wineryName.startsWith("#")) {
                  wineryName = wineryName.substring(1);
                }
                Entity entity = new Entity();
                entity.setName(wineryName);
                entity.addAttribute(new Attribute("Type", "Winery"));
                saveEntity(entity);
              }
            });
          } else if (! tagName.startsWith("owl:")) {
            long parentEntityId = getEntityIdFromDb(tagName);
            if (parentEntityId != -1) {
              processTag(parser, new Closure<XMLStreamReader>() {
                public void execute(XMLStreamReader parser) {
                  String tagName = formatTag(parser.getName());
                  String id = parser.getAttributeValue(RDF_URI, "ID");
                  if (StringUtils.isNotBlank(id)) {
                    // this is the entity
                    Entity entity = new Entity();
                    entity.setName(id);
                    entity.addAttribute(new Attribute("Type", "Wine"));
                    parentTagName = entity.getName();
                    saveEntity(entity);
                  } else {
                    // these are the relations
                    String relationName = tagName;
                    String targetEntityName = parser.getAttributeValue(RDF_URI, "resource");
                    if (targetEntityName != null && targetEntityName.startsWith("#")) {
                      targetEntityName = targetEntityName.substring(1);
                    }
                    if (targetEntityName != null) {
                      saveRelation(parentTagName, targetEntityName, relationName);
                    }
                  }
                }
              });
            }
          }
          break;
        case XMLStreamConstants.END_ELEMENT:
          depth--;
          break;
        default:
          break;
      }
    }
    // close the parser only after the whole document has been read
    parser.close();
  }

  /**
   * A tag processor template method which takes as input a closure that is
   * responsible for extracting the information from the tag and saving it
   * to the database. The closure is called inside the START_ELEMENT case
   * of the template code.
   * @param parser a reference to our StAX XMLStreamReader.
   * @param tagProcessor a reference to the Closure to process the tag.
   * @throws Exception if one is thrown.
   */
  private void processTag(XMLStreamReader parser, Closure<XMLStreamReader> tagProcessor) 
      throws Exception {
    int depth = 0;
    int event = parser.getEventType();
    String startTag = formatTag(parser.getName());
    FOR_LOOP:
    for (;;) {
      switch(event) {
        case XMLStreamConstants.START_ELEMENT:
          String tagName = formatTag(parser.getName());
          tagProcessor.execute(parser);
          depth++;
          break;
        case XMLStreamConstants.END_ELEMENT:
          tagName = formatTag(parser.getName());
          depth--;
          if (tagName.equals(startTag) && depth == 0) {
            break FOR_LOOP;
          }
          break;
        default:
          break;
      }
      event = parser.next();
    }
  }
  
  // ====================== DB load/save methods =========================

  /**
   * Saves an entity to the database. Takes care of setting attribute_types and
   * attribute objects linked to the entity.
   * @param entity the Entity to save.
   */
  private void saveEntity(final Entity entity) {
    // if entity already exists, don't save
    long entityId = getEntityIdFromDb(entity.getName());
    if (entityId == -1L) {
      log.debug("Saving entity:" + entity.getName());
      // insert the entity
      KeyHolder entityKeyHolder = new GeneratedKeyHolder();
      jdbcTemplate.update(new PreparedStatementCreator() {
        public PreparedStatement createPreparedStatement(Connection conn)
        throws SQLException {
          PreparedStatement ps = conn.prepareStatement(
            "insert into entities(name) values (?)", 
            Statement.RETURN_GENERATED_KEYS);
          ps.setString(1, entity.getName());
          return ps;
        }
      }, entityKeyHolder);
      entityId = entityKeyHolder.getKey().longValue();
      List<Attribute> attributes = entity.getAttributes();
      for (Attribute attribute : attributes) {
        saveAttribute(entityId, attribute);
      }
      // finally, always save the "english name" of the entity as an attribute
      saveAttribute(entityId, new Attribute("EnglishName", getEnglishName(entity.getName())));
    }
  }

  /**
   * Saves an entity attribute to the database and links the attribute to the
   * specified entity id.
   * @param entityId the entity id.
   * @param attribute the Attribute object to save.
   */
  private void saveAttribute(long entityId, Attribute attribute) {
    // check to see if the attribute is defined, if not define it
    long attributeId = 0L;
    try {
      attributeId = jdbcTemplate.queryForLong(
        "select id from attribute_types where attr_name = ?", 
        new String[] {attribute.getName()});
    } catch (IncorrectResultSizeDataAccessException e) {
      KeyHolder keyholder = new GeneratedKeyHolder();
      final String attributeName = attribute.getName();
      jdbcTemplate.update(new PreparedStatementCreator() {
        public PreparedStatement createPreparedStatement(Connection conn)
        throws SQLException {
          PreparedStatement ps = conn.prepareStatement(
            "insert into attribute_types(attr_name) values (?)",
            Statement.RETURN_GENERATED_KEYS);
          ps.setString(1, attributeName);
          return ps;
        }
      }, keyholder);
      attributeId = keyholder.getKey().longValue();
    }
    jdbcTemplate.update(
      "insert into attributes(entity_id, attr_id, value) values (?,?,?)",
      new Object[] {entityId, attributeId, attribute.getValue()});
  }

  /**
   * Saves the relation into the database. Both entities must exist if the
   * relation is to be saved. Takes care of updating relation_types as well.
   * @param sourceEntityName the name of the source entity.
   * @param targetEntityName the name of the target entity.
   * @param relationName the name of the relation.
   */
  private void saveRelation(final String sourceEntityName, final String targetEntityName, 
      final String relationName) {
    // get the entity ids for source and target
    long sourceEntityId = getEntityIdFromDb(sourceEntityName);
    long targetEntityId = getEntityIdFromDb(targetEntityName);
    if (sourceEntityId == -1L || targetEntityId == -1L) {
      log.error("Cannot save relation: " + relationName + "(" + 
        sourceEntityName + "," + targetEntityName + ")"); 
      return;
    }
    log.debug("Saving relation: " + relationName + "(" + 
      sourceEntityName + "," + targetEntityName + ")");
    // get the relation id
    long relationTypeId = 0L;
    try {
      relationTypeId = jdbcTemplate.queryForLong(
        "select id from relation_types where type_name = ?", 
        new String[] {relationName});
    } catch (IncorrectResultSizeDataAccessException e) {
      KeyHolder keyholder = new GeneratedKeyHolder();
      jdbcTemplate.update(new PreparedStatementCreator() {
        public PreparedStatement createPreparedStatement(Connection conn) 
            throws SQLException {
          PreparedStatement ps = conn.prepareStatement(
            "insert into relation_types(type_name) values (?)", 
            Statement.RETURN_GENERATED_KEYS);
          ps.setString(1, relationName);
          return ps;
        }
      }, keyholder);
      relationTypeId = keyholder.getKey().longValue();
    }
    // save it
    jdbcTemplate.update(
      "insert into relations(src_entity_id, trg_entity_id, relation_id) values (?, ?, ?)", 
      new Long[] {sourceEntityId, targetEntityId, relationTypeId});
  }

  /**
   * Looks up the database to get the entity id given the name of the entity.
   * If the entity is not found, it returns -1.
   * @param entityName the name of the entity.
   * @return the entity id, or -1 if the entity is not found.
   */
  private long getEntityIdFromDb(String entityName) {
    try {
      long sourceEntityId = jdbcTemplate.queryForLong(
        "select id from entities where name = ?", 
        new String[] {entityName});
      return sourceEntityId;
    } catch (IncorrectResultSizeDataAccessException e) {
      return -1L;
    }
  }

  // ======== String manipulation methods ========
  
  /**
   * Format the XML tag. Takes as input the QName of the tag, and formats
   * it to a namespace:tagname format.
   * @param qname the QName for the tag.
   * @return the formatted QName for the tag.
   */
  private String formatTag(QName qname) {
    String prefix = qname.getPrefix();
    String suffix = qname.getLocalPart();
    if (StringUtils.isBlank(prefix)) {
      return suffix;
    } else {
      return StringUtils.join(new String[] {prefix, suffix}, ":");
    }
  }

  /**
   * Split up Uppercase Camelcased names (like Java classnames or C++ variable
   * names) into English phrases by splitting wherever there is a transition 
   * from lowercase to uppercase.
   * @param name the input camel cased name.
   * @return the "english" name.
   */
  private String getEnglishName(String name) {
    StringBuilder englishNameBuilder = new StringBuilder();
    char[] namechars = name.toCharArray();
    for (int i = 0; i < namechars.length; i++) {
      if (i > 0 && Character.isUpperCase(namechars[i]) && 
          Character.isLowerCase(namechars[i-1])) {
        englishNameBuilder.append(' ');
      }
      englishNameBuilder.append(namechars[i]);
    }
    return englishNameBuilder.toString();
  }
}
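
The loader imports Entity and Attribute from com.mycompany.myapp.ontology, but those classes are not listed in this post. The following is only a guess at their shape, inferred from the calls the loader makes (setName(), getName(), addAttribute(), getAttributes(), and the two-argument Attribute constructor); the real classes may well carry more fields.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Assumed shape: a name/value pair, as in new Attribute("Type", "Wine").
class Attribute {
  private final String name;
  private final String value;

  public Attribute(String name, String value) {
    this.name = name;
    this.value = value;
  }

  public String getName() {
    return name;
  }

  public String getValue() {
    return value;
  }
}

// Assumed shape: a named node of the ontology with a list of attributes.
class Entity {
  private String name;
  private final List<Attribute> attributes = new ArrayList<Attribute>();

  public String getName() {
    return name;
  }

  public void setName(String name) {
    this.name = name;
  }

  public void addAttribute(Attribute attribute) {
    attributes.add(attribute);
  }

  /** Returns a read-only view so callers cannot bypass addAttribute(). */
  public List<Attribute> getAttributes() {
    return Collections.unmodifiableList(attributes);
  }
}
```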

If you have been reading the code above closely, you will have noticed the calls to the processTag() method in parseAndLoadData(). The parseAndLoadData() method implements an infinite loop in which parser.next() is called repeatedly until the END_DOCUMENT event is encountered. Along the way, certain tags need specific processing as they are encountered. Each tag-specific handler would need its own for(;;) loop, like the one in parseAndLoadData(), that breaks out when the closing tag at the same depth is encountered, so the code becomes repetitive if that loop is copied into every sub-method. The processTag() method therefore implements the loop once as a template, and I pass it a Closure that does the tag-specific work.
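
Stripped of the database calls, the template idea can be sketched in a few lines. This is an illustration rather than the loader's code: it uses java.util.function.Consumer (Java 8 and later) in place of the Commons Collections Closure, and the names SubtreeTemplateSketch, processSubtree() and tagsUnder() are made up for the example. The depth counter is what lets the template return at the end tag that matches the start tag it was called on, even when a tag of the same name is nested inside.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class, not part of the loader.
class SubtreeTemplateSketch {

  /**
   * Template: starting on a START_ELEMENT, calls the handler for every
   * start tag in the subtree and returns on the matching end tag.
   */
  static void processSubtree(XMLStreamReader parser,
      Consumer<XMLStreamReader> handler) throws XMLStreamException {
    int depth = 0;
    int event = parser.getEventType();
    for (;;) {
      if (event == XMLStreamConstants.START_ELEMENT) {
        handler.accept(parser);
        depth++;
      } else if (event == XMLStreamConstants.END_ELEMENT) {
        depth--;
        if (depth == 0) {
          return; // closed the element we started on
        }
      }
      event = parser.next();
    }
  }

  /** Collects the local names of all start tags under each rootTag element. */
  static List<String> tagsUnder(String xml, String rootTag) {
    final List<String> seen = new ArrayList<String>();
    try {
      XMLStreamReader parser = XMLInputFactory.newInstance()
          .createXMLStreamReader(new StringReader(xml));
      try {
        while (parser.hasNext()) {
          if (parser.next() == XMLStreamConstants.START_ELEMENT
              && rootTag.equals(parser.getLocalName())) {
            // the tag-specific work is just this one-line callback
            processSubtree(parser, p -> seen.add(p.getLocalName()));
          }
        }
      } finally {
        parser.close();
      }
    } catch (XMLStreamException e) {
      throw new IllegalArgumentException("Malformed XML", e);
    }
    return seen;
  }

  public static void main(String[] args) {
    String xml = "<rdf><Class><subClassOf/><Class/></Class><Other/></rdf>";
    System.out.println(tagsUnder(xml, "Class")); // prints [Class, subClassOf, Class]
  }
}
```

Note that Other is never passed to the handler, because the template has already returned control to the outer loop by the time it is read.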

Because I use anonymous Closures inlined into the code, parseAndLoadData() looks monolithic. Some people would prefer to declare private Closure implementations and use them here instead. This would make the code superficially cleaner, but because each one implements only a portion of the functionality, readers of the code would bounce between parseAndLoadData(), processTag() and the Closure implementation, and the result is likely to be less readable than the current approach. I prefer it the way I have written it: even though all the code is in one place, having the code inside inner class methods decreases the coupling and makes it more readable than monolithic code. Choose whichever approach works best for you and your coding style.

Overall, I liked StAX. Everything else being equal, I would still prefer DOM (using JDOM) parsing over SAX or StAX. However, when parsing large XML files, DOM is impractical, and StAX is a better alternative, resulting in cleaner and more maintainable code.