Salmon Run: Parsing OWL XML with StAX

I needed some data to build an ontology, so I downloaded a sample wine ontology from W3C. The file is in OWL (Web Ontology Language) XML format, so I needed to parse it. Ordinarily, I would have used JDOM to parse it, but I had recently heard some good things about XML pull-parsing from a colleague, so I decided to use StAX, which is built into Java 6, in order to check out pull-parsing.

My objective was to parse the XML file into a simple database structure, so I could use it later. The database structure is shown below. The central table is the entities table, where each row represents a node of the ontology. The relations table links nodes via relationships. The attributes table contains non-base properties of an entity. The distinction between attributes and relationships is kind of gray, but in general I consider an attribute to be a property that we can look an entity up by, such as a name. The attribute_types and relation_types tables are an attempt to normalize repetitive attribute and relation names out of the tables.
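
As a rough sketch of that structure, the five tables could be declared as below. The table and column names are the ones used by the SQL statements in the loader code further down; the column types, lengths and key constraints are assumptions made for illustration, and the class name SchemaSketch is hypothetical.

```java
/**
 * Hypothetical holder for MySQL-style DDL matching the tables used by the
 * loader. Names come from the loader's SQL; types and keys are assumed.
 */
class SchemaSketch {

  static final String[] DDL = {
    // one row per ontology node
    "create table entities ("
      + "id bigint not null auto_increment primary key, "
      + "name varchar(255) not null)",
    // normalized attribute names, e.g. Type, EnglishName
    "create table attribute_types ("
      + "id bigint not null auto_increment primary key, "
      + "attr_name varchar(64) not null)",
    // attribute values attached to an entity
    "create table attributes ("
      + "entity_id bigint not null, "
      + "attr_id bigint not null, "
      + "value varchar(255))",
    // normalized relation names, e.g. locatedIn
    "create table relation_types ("
      + "id bigint not null auto_increment primary key, "
      + "type_name varchar(64) not null)",
    // directed, typed edge between two entities
    "create table relations ("
      + "src_entity_id bigint not null, "
      + "trg_entity_id bigint not null, "
      + "relation_id bigint not null)"
  };
}
```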

I would like to point out that the parser is not a "standard" OWL parser. The tag names to extract information from were determined by eyeballing the wine.rdf file and figuring out which tags would yield interesting information. So it will need to change in order to parse some other OWL file. However, if you are just looking for pointers on how to go about doing something similar, you may find the post useful.

Since this is the first time I have used a pull-parsing library, I would like to share my initial reactions to this strategy. At first sight, there does not seem to be much difference between SAX (push-parsing) and StAX. With SAX, you intercept the parser lifecycle, adding hooks into the startElement() and endElement() methods to do custom processing, and with StAX, you respond to START_ELEMENT and END_ELEMENT events that you pull from the parser by calling next(). The difference becomes apparent once you start working with it, however. Because your code drives the event loop, you can delegate processing to sub-methods by passing around the parser reference, which makes your code a bit cleaner.
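
To make the delegation point concrete, here is a minimal sketch that is separate from the loader shown later. The class name PullDelegationSketch, the readWine() helper and the wine/color element names are made up for illustration. The outer loop pulls events and, when it sees a wine element, hands the same parser reference to a sub-method that keeps pulling until the matching end tag.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class PullDelegationSketch {

  /** Pulls events in a loop and hands each wine element to a sub-method. */
  static List<String> parse(String xml) {
    List<String> results = new ArrayList<String>();
    try {
      XMLStreamReader parser = XMLInputFactory.newInstance()
        .createXMLStreamReader(new StringReader(xml));
      while (parser.hasNext()) {
        int event = parser.next();
        if (event == XMLStreamConstants.START_ELEMENT
            && "wine".equals(parser.getLocalName())) {
          // the caller decides when to delegate; the sub-method keeps pulling
          results.add(readWine(parser));
        }
      }
      parser.close();
    } catch (XMLStreamException e) {
      throw new IllegalArgumentException("Could not parse XML", e);
    }
    return results;
  }

  /** Consumes events up to the closing wine tag and returns "name:color". */
  private static String readWine(XMLStreamReader parser)
      throws XMLStreamException {
    String name = parser.getAttributeValue(null, "name");
    String color = null;
    while (parser.hasNext()) {
      int event = parser.next();
      if (event == XMLStreamConstants.START_ELEMENT
          && "color".equals(parser.getLocalName())) {
        // getElementText() leaves the parser on the closing color tag
        color = parser.getElementText();
      } else if (event == XMLStreamConstants.END_ELEMENT
          && "wine".equals(parser.getLocalName())) {
        break;
      }
    }
    return name + ":" + color;
  }
}
```

With SAX, the equivalent logic would have to live in the startElement() and endElement() callbacks and track which element it is inside through handler state.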

The code for the parser is shown below. It saves the extracted data into a local MySQL database. Callers would instantiate the OwlDbLoader, set the path to the OWL file and DataSource, and call the parseAndLoadData() method.

package com.mycompany.myapp.ontology.loaders;

import java.io.FileInputStream;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

import javax.sql.DataSource;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

import org.apache.commons.collections15.Closure;
import org.apache.commons.lang.StringUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.springframework.dao.IncorrectResultSizeDataAccessException;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.core.PreparedStatementCreator;
import org.springframework.jdbc.support.GeneratedKeyHolder;
import org.springframework.jdbc.support.KeyHolder;

import com.mycompany.myapp.ontology.Attribute;
import com.mycompany.myapp.ontology.Entity;

/**
 * Parses OWL files representing external ontologies and loads them
 * into a local database.
 */
public class OwlDbLoader {
  
  private final Log log = LogFactory.getLog(getClass());
  
  private final static String RDF_URI = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
  
  private JdbcTemplate jdbcTemplate;
  private String owlFileLocation;
  
  private static String parentTagName = null;
  
  public void setDataSource(DataSource dataSource) {
    this.jdbcTemplate = new JdbcTemplate(dataSource);
  }
  
  public void setOwlFileLocation(String owlFileLocation) {
    this.owlFileLocation = owlFileLocation;
  }

  /**
   * These parsing rules were devised by physically looking at the OWL file
   * and figuring out what goes where. This should by no means be considered
   * a generalized way to parse OWL files.
   * 
   * Parsing rules:
   * 
   * owl:Class@rdf:ID = entity (1), type=Wine
   * optional:
   *   owl:Class/rdfs:subClassOf@rdf:resource = entity (2), type=Wine
   *   (2) -- parent --> (1)
   * if owl:Class/rdfs:subClassOf has no attributes, ignore
   * if no owl:Class/rdfs:subClassOf entity, ignore it
   * owl:Class/owl:Restriction/owl:onProperty@rdf:resource related to
   *   owl:Class/owl:Restriction/owl:hasValue@rdf:resource
   *  
   * Region@rdf:ID = entity, type=Region
   * optional:
   *   Region/locatedIn@rdf:resource=entity (2), type=Region
   *   (2) -- parent --> (1)
   * owl:Class/rdfs:subClassOf/owl:Restriction - ignore
   * 
   * WineBody@rdf:ID = entity, type=WineBody
   * WineColor@rdf:ID = entity, type=WineColor
   * WineFlavor@rdf:ID = entity, type=WineFlavor
   * WineSugar@rdf:ID = entity, type=WineSugar
   * Winery@rdf:ID = entity, type=Winery
   * WineGrape@rdf:ID = entity, type=WineGrape
   * 
   * Else if no namespace, this must be a wine itself, capture as entity:
   * ?@rdf:ID = entity, type=Wine
   *   all subtags are relations:
   *     tagname = relation_name
   *     tag@rdf:resource = target entity
   */
  public void parseAndLoadData() throws Exception {
    XMLInputFactory factory = XMLInputFactory.newInstance();
    XMLStreamReader parser = factory.createXMLStreamReader(
      new FileInputStream(owlFileLocation));
    int depth = 0;
    for (;;) {
      int event = parser.next();
      if (event == XMLStreamConstants.END_DOCUMENT) {
        break;
      }
      switch (event) {
        case XMLStreamConstants.START_ELEMENT:
          depth++;
          String tagName = formatTag(parser.getName());
          if (tagName.equals("owl:Class")) {
            processTag(parser, new Closure<XMLStreamReader>() {
              public void execute(XMLStreamReader parser) {
                // relations are not being persisted because value of child
                // entity cannot be persisted.
                String tagName = formatTag(parser.getName());
                if (tagName.equals("owl:Class")) {
                  String name = parser.getAttributeValue(RDF_URI, "ID");
                  if (name != null) {
                    Entity classEntity = new Entity();
                    parentTagName = name;
                    classEntity.setName(parentTagName);
                    classEntity.addAttribute(new Attribute("Type", "Class"));
                    saveEntity(classEntity);
                  }
                } else if (tagName.equals("rdfs:subClassOf")) {
                  String name = parser.getAttributeValue(RDF_URI, "resource");
                  if (name != null) {
                    Entity superclassEntity = new Entity();
                    if (name.startsWith("http://")) {
                      superclassEntity.setName(name.substring(name.lastIndexOf('#') + 1));
                      superclassEntity.addAttribute(new Attribute("Type", 
                        name.substring(name.lastIndexOf('/') + 1, 
                        name.lastIndexOf('#')) + ":Class"));
                    } else if (name.startsWith("#")) {
                      superclassEntity.setName(name.substring(1));
                      superclassEntity.addAttribute(new Attribute("Type", "Class"));
                    } else {
                      superclassEntity.setName(name);
                      superclassEntity.addAttribute(new Attribute("Type", "Class"));
                    }
                    saveEntity(superclassEntity);
                    saveRelation(parentTagName, superclassEntity.getName(), "parentOf");
                    parentTagName = null;
                  }
                }
              }
            });
          } else if (tagName.equals("Region")) {
            processTag(parser, new Closure<XMLStreamReader>() {
              public void execute(XMLStreamReader parser) {
                String tagName = formatTag(parser.getName());
                if (tagName.equals("Region")) {
                  Entity classEntity = new Entity();
                  parentTagName = parser.getAttributeValue(RDF_URI, "ID");
                  classEntity.setName(parentTagName);
                  classEntity.addAttribute(new Attribute("Type", "Region"));
                  saveEntity(classEntity);
                } else if (tagName.equals("locatedIn")) {
                  Entity superclassEntity = new Entity();
                  String locationEntityName = parser.getAttributeValue(RDF_URI, "resource");
                  if (locationEntityName.startsWith("#")) {
                    locationEntityName = locationEntityName.substring(1);
                  }
                  superclassEntity.setName(locationEntityName);
                  superclassEntity.addAttribute(new Attribute("Type", "Region"));
                  saveEntity(superclassEntity);
                  saveRelation(parentTagName, locationEntityName, "locatedIn");
                  parentTagName = null;
                }
              }
            });
          } else if (tagName.equals("WineBody") || 
              tagName.equals("WineColor") ||
              tagName.equals("WineFlavor") ||
              tagName.equals("WineSugar") ||
              tagName.equals("WineGrape")) {
            processTag(parser, new Closure<XMLStreamReader>() {
              public void execute(XMLStreamReader parser) {
                Entity entity = new Entity();
                String name = parser.getAttributeValue(RDF_URI, "ID");
                if (name != null) {
                  entity.setName(name);
                  String tagName = parser.getLocalName();
                  Attribute attribute = null;
                  if (tagName.equals("WineBody")) {
                    attribute = new Attribute("Type", "Body");
                  } else if (tagName.equals("WineColor")) {
                    attribute = new Attribute("Type", "Color");
                  } else if (tagName.equals("WineFlavor")) {
                    attribute = new Attribute("Type", "Flavor");
                  } else if (tagName.equals("WineSugar")) {
                    attribute = new Attribute("Type", "Sugar");
                  } else if (tagName.equals("WineGrape")) {
                    attribute = new Attribute("Type", "Grape");
                  }
                  entity.addAttribute(attribute);
                  saveEntity(entity);
                }
              }
            });
          } else if (tagName.equals("vin:Winery")) {
            processTag(parser, new Closure<XMLStreamReader>() {
              public void execute(XMLStreamReader parser) {
                String wineryName = parser.getAttributeValue(RDF_URI, "about");
                if (wineryName.startsWith("#")) {
                  wineryName = wineryName.substring(1);
                }
                Entity entity = new Entity();
                entity.setName(wineryName);
                entity.addAttribute(new Attribute("Type", "Winery"));
                saveEntity(entity);
              }
            });
          } else if (! tagName.startsWith("owl:")) {
            long parentEntityId = getEntityIdFromDb(tagName);
            if (parentEntityId != -1) {
              processTag(parser, new Closure<XMLStreamReader>() {
                public void execute(XMLStreamReader parser) {
                  String tagName = formatTag(parser.getName());
                  String id = parser.getAttributeValue(RDF_URI, "ID");
                  if (StringUtils.isNotBlank(id)) {
                    // this is the entity
                    Entity entity = new Entity();
                    entity.setName(id);
                    entity.addAttribute(new Attribute("Type", "Wine"));
                    parentTagName = entity.getName();
                    saveEntity(entity);
                  } else {
                    // these are the relations
                    String relationName = tagName;
                    String targetEntityName = parser.getAttributeValue(RDF_URI, "resource");
                    if (targetEntityName != null && targetEntityName.startsWith("#")) {
                      targetEntityName = targetEntityName.substring(1);
                    }
                    if (targetEntityName != null) {
                      saveRelation(parentTagName, targetEntityName, relationName);
                    }
                  }
                }
              });
            }
          }
          break;
        case XMLStreamConstants.END_ELEMENT:
          depth--;
          break;
        default:
          break;
      }
    }
    // close the parser only after the whole document has been read
    parser.close();
  }

  /**
   * A tag processor template method which takes as input a closure that is
   * responsible for extracting the information from the tag and saving it
   * to the database. The contents of the closure is called inside the
   * START_ELEMENT case of the template code.
   * @param parser a reference to our StAX XMLStreamReader.
   * @param tagProcessor a reference to the Closure to process the tag.
   * @throws Exception if one is thrown.
   */
  private void processTag(XMLStreamReader parser, Closure<XMLStreamReader> tagProcessor) 
      throws Exception {
    int depth = 0;
    int event = parser.getEventType();
    String startTag = formatTag(parser.getName());
    FOR_LOOP:
    for (;;) {
      switch(event) {
        case XMLStreamConstants.START_ELEMENT:
          String tagName = formatTag(parser.getName());
          tagProcessor.execute(parser);
          depth++;
          break;
        case XMLStreamConstants.END_ELEMENT:
          tagName = formatTag(parser.getName());
          depth--;
          if (tagName.equals(startTag) && depth == 0) {
            break FOR_LOOP;
          }
          break;
        default:
          break;
      }
      event = parser.next();
    }
  }
  
  // ====================== DB load/save methods =========================

  /**
   * Saves an entity to the database. Takes care of setting attribute_types and
   * attribute objects linked to the entity.
   * @param entity the Entity to save.
   */
  private void saveEntity(final Entity entity) {
    // if entity already exists, don't save
    long entityId = getEntityIdFromDb(entity.getName());
    if (entityId == -1L) {
      log.debug("Saving entity:" + entity.getName());
      // insert the entity
      KeyHolder entityKeyHolder = new GeneratedKeyHolder();
      jdbcTemplate.update(new PreparedStatementCreator() {
        public PreparedStatement createPreparedStatement(Connection conn)
        throws SQLException {
          PreparedStatement ps = conn.prepareStatement(
            "insert into entities(name) values (?)", 
            Statement.RETURN_GENERATED_KEYS);
          ps.setString(1, entity.getName());
          return ps;
        }
      }, entityKeyHolder);
      entityId = entityKeyHolder.getKey().longValue();
      List<Attribute> attributes = entity.getAttributes();
      for (Attribute attribute : attributes) {
        saveAttribute(entityId, attribute);
      }
      // finally, always save the "english name" of the entity as an attribute
      saveAttribute(entityId, new Attribute("EnglishName", getEnglishName(entity.getName())));
    }
  }

  /**
   * Saves an entity attribute to the database and links the attribute to the
   * specified entity id.
   * @param entityId the entity id.
   * @param attribute the Attribute object to save.
   */
  private void saveAttribute(long entityId, Attribute attribute) {
    // check to see if the attribute is defined, if not define it
    long attributeId = 0L;
    try {
      attributeId = jdbcTemplate.queryForLong(
        "select id from attribute_types where attr_name = ?", 
        new String[] {attribute.getName()});
    } catch (IncorrectResultSizeDataAccessException e) {
      KeyHolder keyholder = new GeneratedKeyHolder();
      final String attributeName = attribute.getName();
      jdbcTemplate.update(new PreparedStatementCreator() {
        public PreparedStatement createPreparedStatement(Connection conn)
        throws SQLException {
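          // request generated keys so keyholder.getKey() below is populated
          PreparedStatement ps = conn.prepareStatement(
            "insert into attribute_types(attr_name) values (?)",
            Statement.RETURN_GENERATED_KEYS);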
          ps.setString(1, attributeName);
          return ps;
        }
      }, keyholder);
      attributeId = keyholder.getKey().longValue();
    }
    jdbcTemplate.update(
      "insert into attributes(entity_id, attr_id, value) values (?,?,?)",
      new Object[] {entityId, attributeId, attribute.getValue()});
  }

  /**
   * Saves the relation into the database. Both entities must exist if the
   * relation is to be saved. Takes care of updating relation_types as well.
   * @param sourceEntityName the name of the source entity.
   * @param targetEntityName the name of the target entity.
   * @param relationName the name of the relation.
   */
  private void saveRelation(final String sourceEntityName, final String targetEntityName, 
      final String relationName) {
    // get the entity ids for source and target
    long sourceEntityId = getEntityIdFromDb(sourceEntityName);
    long targetEntityId = getEntityIdFromDb(targetEntityName);
    if (sourceEntityId == -1L || targetEntityId == -1L) {
      log.error("Cannot save relation: " + relationName + "(" + 
        sourceEntityName + "," + targetEntityName + ")"); 
      return;
    }
    log.debug("Saving relation: " + relationName + "(" + 
      sourceEntityName + "," + targetEntityName + ")");
    // get the relation id
    long relationTypeId = 0L;
    try {
      relationTypeId = jdbcTemplate.queryForLong(
        "select id from relation_types where type_name = ?", 
        new String[] {relationName});
    } catch (IncorrectResultSizeDataAccessException e) {
      KeyHolder keyholder = new GeneratedKeyHolder();
      jdbcTemplate.update(new PreparedStatementCreator() {
        public PreparedStatement createPreparedStatement(Connection conn) 
            throws SQLException {
          PreparedStatement ps = conn.prepareStatement(
            "insert into relation_types(type_name) values (?)", 
            Statement.RETURN_GENERATED_KEYS);
          ps.setString(1, relationName);
          return ps;
        }
      }, keyholder);
      relationTypeId = keyholder.getKey().longValue();
    }
    // save it
    jdbcTemplate.update(
      "insert into relations(src_entity_id, trg_entity_id, relation_id) values (?, ?, ?)", 
      new Long[] {sourceEntityId, targetEntityId, relationTypeId});
  }

  /**
   * Looks up the database to get the entity id given the name of the entity.
   * If the entity is not found, it returns -1.
   * @param entityName the name of the entity.
   * @return the entity id, or -1 if the entity is not found.
   */
  private long getEntityIdFromDb(String entityName) {
    try {
      long sourceEntityId = jdbcTemplate.queryForLong(
        "select id from entities where name = ?", 
        new String[] {entityName});
      return sourceEntityId;
    } catch (IncorrectResultSizeDataAccessException e) {
      return -1L;
    }
  }

  // ======== String manipulation methods ========
  
  /**
   * Format the XML tag. Takes as input the QName of the tag, and formats
   * it to a namespace:tagname format.
   * @param qname the QName for the tag.
   * @return the formatted QName for the tag.
   */
  private String formatTag(QName qname) {
    String prefix = qname.getPrefix();
    String suffix = qname.getLocalPart();
    if (StringUtils.isBlank(prefix)) {
      return suffix;
    } else {
      return StringUtils.join(new String[] {prefix, suffix}, ":");
    }
  }

  /**
   * Split up Uppercase Camelcased names (like Java classnames or C++ variable
   * names) into English phrases by splitting wherever there is a transition 
   * from lowercase to uppercase.
   * @param name the input camel cased name.
   * @return the "english" name.
   */
  private String getEnglishName(String name) {
    StringBuilder englishNameBuilder = new StringBuilder();
    char[] namechars = name.toCharArray();
    for (int i = 0; i < namechars.length; i++) {
      if (i > 0 && Character.isUpperCase(namechars[i]) && 
          Character.isLowerCase(namechars[i-1])) {
        englishNameBuilder.append(' ');
      }
      englishNameBuilder.append(namechars[i]);
    }
    return englishNameBuilder.toString();
  }
}

If you read the code above closely, you will have noticed the calls to the processTag() method in parseAndLoadData(). The parseAndLoadData() method implements an infinite loop in which parser.next() is called repeatedly until the END_DOCUMENT event is encountered. Certain tags need specific processing as they are encountered. Each such handler would have to set up its own for(;;) loop, just like parseAndLoadData(), and break out of it when the closing tag at the same depth is encountered, so the code would become repetitive if it were put into every sub-method. Instead, the processTag() method implements that loop once as a template, to which I pass a Closure that handles each start tag.
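
The template idea can be reduced to a small, self-contained sketch. This is not the loader's code: it swaps the commons-collections Closure for java.util.function.Consumer (so it assumes Java 8 or later), and the class name TagTemplateSketch, the collectRegionTags() helper and the sample element names are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class TagTemplateSketch {

  /**
   * Template: starting at a START_ELEMENT, calls the handler for that
   * element and every element nested inside it, and returns once the
   * matching END_ELEMENT brings the depth back to zero.
   */
  static void processTag(XMLStreamReader parser,
      Consumer<XMLStreamReader> handler) throws XMLStreamException {
    int depth = 0;
    int event = parser.getEventType();
    while (true) {
      if (event == XMLStreamConstants.START_ELEMENT) {
        handler.accept(parser);
        depth++;
      } else if (event == XMLStreamConstants.END_ELEMENT) {
        depth--;
        if (depth == 0) {
          return;
        }
      }
      event = parser.next();
    }
  }

  /** Collects the local names of all elements in each Region subtree. */
  static List<String> collectRegionTags(String xml) {
    final List<String> seen = new ArrayList<String>();
    try {
      XMLStreamReader parser = XMLInputFactory.newInstance()
        .createXMLStreamReader(new StringReader(xml));
      while (parser.hasNext()) {
        if (parser.next() == XMLStreamConstants.START_ELEMENT
            && "Region".equals(parser.getLocalName())) {
          // only the per-tag logic is supplied; the loop lives in processTag
          processTag(parser, p -> seen.add(p.getLocalName()));
        }
      }
      parser.close();
    } catch (XMLStreamException e) {
      throw new IllegalArgumentException("Could not parse XML", e);
    }
    return seen;
  }
}
```

Each caller supplies only the per-tag logic, while the depth counting and the exit condition stay in one place.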

Because I use anonymous Closures inlined into the code, parseAndLoadData() looks monolithic. Some people would prefer to declare private Closure implementations and use them here instead. This would make the code superficially cleaner, but because each Closure implements only a portion of the functionality, readers of the code would bounce between parseAndLoadData(), processTag() and the Closure implementation, and the result is likely to be less readable than the current approach. I prefer it the way I have written it: even though all the code is in one place, having the code inside inner class methods decreases the coupling and makes it more readable than monolithic code. Choose whichever approach works best for you and your coding style.

Overall, I liked StAX. Everything else being equal, I would still prefer DOM (using JDOM) parsing over SAX or StAX. However, when parsing large XML files, DOM is impractical, and StAX is a better alternative, resulting in cleaner and more maintainable code.

Update 2009-04-26: In recent posts, I have been building on code written and described in previous posts, so there were (and rightly so) quite a few requests for the code. So I've created a project on Sourceforge to host the code. You will find the complete source code built so far in the project's SVN repository.

11 comments (moderated to prevent spam):

Anonymous said...: Although you have only a few tables, you should couple this with Hibernate; 12/16/2008 9:10 PM
Sujit Pal said...: My approach when I did this was to get the data loaded quickly. Arguably, once you are used to Hibernate mapping, mapping a schema as simple as this is a matter of an hour's work or less, and it would pay dividends on the retrieval end. I used Hibernate at my previous job, but at my current job, the preferred standard is JDBC, so my Hibernate is kind of rusty at the moment, and just using JDBC seemed to be faster.; 12/19/2008 6:40 PM
anon_anon said...: VTD-XML is the other latest XML processing model that is way more efficient than DOM and SAX

http://vtd-xml.sf.net; 11/17/2009 8:46 PM
Sujit Pal said...: Thanks for the pointer, dontcare, I will check it out.; 11/20/2009 7:02 PM
Sheba Wilfred said...: hi sir ,
I m doin my final yr, currently workg with my project named "RESUME FILTER" using sparql,xpath,xquery to retrieve records and finding the efficient method among these for quick search..We now struct in the middle with the problem of inputing a owl file into a dom parser and to retrieve each attribute and display it in label..(ex. if NAME is defined in owl file , my DOM program must input that owl file and retrieve the NAME tag and display in it a label , enabling the user to fill the resume).Our ultimate aim is to write a DOM program that gets owl file and display all its contents dynamically in the label..I referred your page and ur posts suprise us..But our prbm is, we find it complex to our level and is of high standard..,as we r nt aware of many of the techniques you hav used.. can you help us to solve our prbm plz ? can you suggest any simple DOM program that inputs a owl file and retrieve all its data and display it ..? ? my id is snoffysheba@gmail.com .. we are waitg fr ur rly sir..; 7/01/2011 7:22 AM
Sheba Wilfred said...: Hi sir,

We are in need of your help..Can u guide us for doin our project ...?; 7/01/2011 8:05 AM
Sujit Pal said...: Hi Sheba, I am/was not aware of specialized OWL parsers, but a quick google search of "java owl parser" pointed me to the OWL API Project. If you want something more general purpose, there is a very large number of XML parsers available in Java - my favorite was JDOM but nowadays I also use StAX (which was used in this post and is built into Java 6 and higher) or dom4j as appropriate. Hope these pointers help.; 7/03/2011 11:41 AM
joanna hwa said...: Dear sir,

I really need your help. Do you have java code to do text categorization application? Thanks.; 7/11/2011 7:06 AM
Sujit Pal said...: Hi Joanna, no, I don't have any java code that are generally applicable to any categorization situation, sorry.; 7/29/2011 2:33 PM
Anonymous said...: where can I find the "*com.mycompany.myapp.ontology*" package to be imported?; 2/23/2012 1:03 AM
Sujit Pal said...: Hi, you can find the code for the parser and other related code in the SVN repo for my JTMT project here.; 2/26/2012 1:25 PM

Saturday, May 24, 2008