Saturday, May 24, 2008

Parsing OWL XML with StAX

I needed some data to build an ontology, so I downloaded a sample wine ontology from W3C. The file is in OWL (Web Ontology Language) XML format, so I needed to parse it. Ordinarily, I would have used JDOM to parse it, but I had recently heard some good things about XML pull-parsing from a colleague, so I decided to use StAX, which is built into Java 6, in order to check out pull-parsing.

My objective was to parse the XML file into a simple database structure that I could use later. The database structure is shown below. The central table is the entity table, each row of which represents a node of the ontology. The relations table links nodes via relationships. The attributes table contains non-base properties of an entity. The distinction between attributes and relationships is somewhat gray, but in general I consider an attribute to be a property that we can look an entity up by, such as a name. The attribute_type and relation_type tables are an attempt to normalize repetitive attribute and relation names out of the tables.
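The loader in this post imports Entity and Attribute classes from com.mycompany.myapp.ontology, which are not listed here. As a rough guide, a minimal sketch of what they might look like follows. It is inferred only from the calls the loader makes (setName(), getName(), addAttribute(), getAttributes(), and the two-argument Attribute constructor with getName() and getValue()); everything else, including the OntologyModelSketch driver class, is an assumption made for illustration and may differ from the real classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical stand-in for the Attribute class: a simple name/value pair. */
class Attribute {
  private final String name;
  private final String value;

  Attribute(String name, String value) {
    this.name = name;
    this.value = value;
  }

  String getName() { return name; }
  String getValue() { return value; }
}

/** Hypothetical stand-in for the Entity class: a named node with attributes. */
class Entity {
  private String name;
  private final List<Attribute> attributes = new ArrayList<Attribute>();

  String getName() { return name; }
  void setName(String name) { this.name = name; }

  void addAttribute(Attribute attribute) { attributes.add(attribute); }

  /** Returns a read-only view so callers cannot bypass addAttribute(). */
  List<Attribute> getAttributes() {
    return Collections.unmodifiableList(attributes);
  }
}

/** Illustrative driver (assumed name) showing how a node would be built. */
class OntologyModelSketch {
  public static void main(String[] args) {
    Entity entity = new Entity();
    entity.setName("Merlot");
    entity.addAttribute(new Attribute("Type", "Wine"));
    for (Attribute a : entity.getAttributes()) {
      System.out.println(entity.getName() + ": " + a.getName() + "=" + a.getValue());
    }
  }
}
```

In this sketch the normalization into type tables stays purely on the database side: the Java objects carry plain attribute names, and the save step is what maps a name to a row in the type table.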

I would like to point out that the parser is not a "standard" OWL parser. The tag names to extract information from were determined by eyeballing the wine.rdf file and figuring out which tags would yield interesting information. So it will need to change in order to parse some other OWL file. However, if you are just looking for pointers on how to go about doing something similar, you may find the post useful.

Since this is the first time I have used a pull-parsing library, I would like to share my initial reactions to this strategy. At first sight, there does not seem to be much difference between SAX (push-parsing) and StAX. With SAX, you intercept the parser lifecycle, adding hooks into the startElement() and endElement() methods to do custom processing, and with StAX, you respond to the START_ELEMENT and END_ELEMENT events that you pull from the parser by calling next(). The difference becomes apparent once you start working with it, however. Because your code asks for each event rather than being called back, you can delegate processing to sub-methods by passing around the parser reference, which makes your code a bit cleaner.
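To make the delegation point concrete, here is a small self-contained sketch. The XML document, the tag names, and the StaxDelegationSketch class are all made up for illustration and have nothing to do with wine.rdf. The main loop hands the parser to readWine() when it sees a wine tag, and readWine() keeps pulling events until it reaches the matching end tag before returning control.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class StaxDelegationSketch {

  // Hypothetical sample document, used only for this illustration.
  static final String SAMPLE_XML =
      "<cellar>"
      + "<wine name=\"Merlot\"><grape>MerlotGrape</grape></wine>"
      + "<wine name=\"Riesling\"><grape>RieslingGrape</grape></wine>"
      + "</cellar>";

  /** Top-level loop: pulls events and delegates each wine element. */
  static List<String> parse(String xml) {
    List<String> results = new ArrayList<String>();
    try {
      XMLStreamReader parser = XMLInputFactory.newInstance()
          .createXMLStreamReader(new StringReader(xml));
      try {
        while (parser.hasNext()) {
          int event = parser.next();
          if (event == XMLStreamConstants.START_ELEMENT
              && "wine".equals(parser.getLocalName())) {
            // Hand the same parser to a sub-method; it resumes from here.
            results.add(readWine(parser));
          }
        }
      } finally {
        parser.close();
      }
    } catch (XMLStreamException e) {
      throw new IllegalStateException("Could not parse sample XML", e);
    }
    return results;
  }

  /** Consumes events up to and including the closing wine tag. */
  static String readWine(XMLStreamReader parser) throws XMLStreamException {
    String name = parser.getAttributeValue(null, "name");
    String grape = null;
    while (parser.hasNext()) {
      int event = parser.next();
      if (event == XMLStreamConstants.START_ELEMENT
          && "grape".equals(parser.getLocalName())) {
        grape = parser.getElementText(); // leaves parser on </grape>
      } else if (event == XMLStreamConstants.END_ELEMENT
          && "wine".equals(parser.getLocalName())) {
        break;
      }
    }
    return name + " madeFrom " + grape;
  }

  public static void main(String[] args) {
    for (String line : parse(SAMPLE_XML)) {
      System.out.println(line);
    }
  }
}
```

With SAX, the equivalent of readWine() would have to be spread across callback methods with shared state to remember which element is currently open; here that state is simply the position of the parser.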

The code for the parser is shown below. It saves the extracted data into a local MySQL database. Callers would instantiate the OwlDbLoader, set the path to the OWL file and DataSource, and call the parseAndLoadData() method.

package com.mycompany.myapp.ontology.loaders;

import java.io.FileInputStream;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;

import javax.sql.DataSource;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

import org.apache.commons.collections15.Closure;
import org.apache.commons.lang.StringUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.springframework.dao.IncorrectResultSizeDataAccessException;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.core.PreparedStatementCreator;
import org.springframework.jdbc.support.GeneratedKeyHolder;
import org.springframework.jdbc.support.KeyHolder;

import com.mycompany.myapp.ontology.Attribute;
import com.mycompany.myapp.ontology.Entity;

/**
 * Parses OWL files representing external ontologies and loads them
 * into a local database.
 */
public class OwlDbLoader {
  
  private final Log log = LogFactory.getLog(getClass());
  
  private final static String RDF_URI = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
  
  private JdbcTemplate jdbcTemplate;
  private String owlFileLocation;
  
  private static String parentTagName = null;
  
  public void setDataSource(DataSource dataSource) {
    this.jdbcTemplate = new JdbcTemplate(dataSource);
  }
  
  public void setOwlFileLocation(String owlFileLocation) {
    this.owlFileLocation = owlFileLocation;
  }

  /**
   * These parsing rules were devised by physically looking at the OWL file
   * and figuring out what goes where. This should by no means be considered
   * a generalized way to parse OWL files.
   * 
   * Parsing rules:
   * 
   * owl:Class@rdf:ID = entity (1), type=Wine
   * optional:
   *   owl:Class/rdfs:subClassOf@rdf:resource = entity (2), type=Wine
   *   (2) -- parent --> (1)
   * if owl:Class/rdfs:subClassOf has no attributes, ignore
   * if no owl:Class/rdfs:subClassOf entity, ignore it
   * owl:Class/owl:Restriction/owl:onProperty@rdf:resource related to
   *   owl:Class/owl:Restriction/owl:hasValue@rdf:resource
   *  
   * Region@rdf:ID = entity, type=Region
   * optional:
   *   Region/locatedIn@rdf:resource=entity (2), type=Region
   *   (2) -- parent --> (1)
   * owl:Class/rdfs:subClassOf/owl:Restriction - ignore
   * 
   * WineBody@rdf:ID = entity, type=WineBody
   * WineColor@rdf:ID = entity, type=WineColor
   * WineFlavor@rdf:ID = entity, type=WineFlavor
   * WineSugar@rdf:ID = entity, type=WineSugar
   * Winery@rdf:ID = entity, type=Winery
   * WineGrape@rdf:ID = entity, type=WineGrape
   * 
   * Else if no namespace, this must be a wine itself, capture as entity:
   * ?@rdf:ID = entity, type=Wine
   *   all subtags are relations:
   *     tagname = relation_name
   *     tag@rdf:resource = target entity
   */
  public void parseAndLoadData() throws Exception {
    XMLInputFactory factory = XMLInputFactory.newInstance();
    XMLStreamReader parser = factory.createXMLStreamReader(
      new FileInputStream(owlFileLocation));
    int depth = 0;
    for (;;) {
      int event = parser.next();
      if (event == XMLStreamConstants.END_DOCUMENT) {
        break;
      }
      switch (event) {
        case XMLStreamConstants.START_ELEMENT:
          depth++;
          String tagName = formatTag(parser.getName());
          if (tagName.equals("owl:Class")) {
            processTag(parser, new Closure<XMLStreamReader>() {
              public void execute(XMLStreamReader parser) {
                // relations are not being persisted because value of child
                // entity cannot be persisted.
                String tagName = formatTag(parser.getName());
                if (tagName.equals("owl:Class")) {
                  String name = parser.getAttributeValue(RDF_URI, "ID");
                  if (name != null) {
                    Entity classEntity = new Entity();
                    parentTagName = name;
                    classEntity.setName(parentTagName);
                    classEntity.addAttribute(new Attribute("Type", "Class"));
                    saveEntity(classEntity);
                  }
                } else if (tagName.equals("rdfs:subClassOf")) {
                  String name = parser.getAttributeValue(RDF_URI, "resource");
                  if (name != null) {
                    Entity superclassEntity = new Entity();
                    if (name.startsWith("http://")) {
                      superclassEntity.setName(name.substring(name.lastIndexOf('#') + 1));
                      superclassEntity.addAttribute(new Attribute("Type", 
                        name.substring(name.lastIndexOf('/') + 1, 
                        name.lastIndexOf('#')) + ":Class"));
                    } else if (name.startsWith("#")) {
                      superclassEntity.setName(name.substring(1));
                      superclassEntity.addAttribute(new Attribute("Type", "Class"));
                    } else {
                      superclassEntity.setName(name);
                      superclassEntity.addAttribute(new Attribute("Type", "Class"));
                    }
                    saveEntity(superclassEntity);
                    saveRelation(parentTagName, superclassEntity.getName(), "parentOf");
                    parentTagName = null;
                  }
                }
              }
            });
          } else if (tagName.equals("Region")) {
            processTag(parser, new Closure<XMLStreamReader>() {
              public void execute(XMLStreamReader parser) {
                String tagName = formatTag(parser.getName());
                if (tagName.equals("Region")) {
                  Entity classEntity = new Entity();
                  parentTagName = parser.getAttributeValue(RDF_URI, "ID");
                  classEntity.setName(parentTagName);
                  classEntity.addAttribute(new Attribute("Type", "Region"));
                  saveEntity(classEntity);
                } else if (tagName.equals("locatedIn")) {
                  Entity superclassEntity = new Entity();
                  String locationEntityName = parser.getAttributeValue(RDF_URI, "resource");
                  if (locationEntityName.startsWith("#")) {
                    locationEntityName = locationEntityName.substring(1);
                  }
                  superclassEntity.setName(locationEntityName);
                  superclassEntity.addAttribute(new Attribute("Type", "Region"));
                  saveEntity(superclassEntity);
                  saveRelation(parentTagName, locationEntityName, "locatedIn");
                  parentTagName = null;
                }
              }
            });
          } else if (tagName.equals("WineBody") || 
              tagName.equals("WineColor") ||
              tagName.equals("WineFlavor") ||
              tagName.equals("WineSugar") ||
              tagName.equals("WineGrape")) {
            processTag(parser, new Closure<XMLStreamReader>() {
              public void execute(XMLStreamReader parser) {
                Entity entity = new Entity();
                String name = parser.getAttributeValue(RDF_URI, "ID");
                if (name != null) {
                  entity.setName(name);
                  String tagName = parser.getLocalName();
                  Attribute attribute = null;
                  if (tagName.equals("WineBody")) {
                    attribute = new Attribute("Type", "Body");
                  } else if (tagName.equals("WineColor")) {
                    attribute = new Attribute("Type", "Color");
                  } else if (tagName.equals("WineFlavor")) {
                    attribute = new Attribute("Type", "Flavor");
                  } else if (tagName.equals("WineSugar")) {
                    attribute = new Attribute("Type", "Sugar");
                  } else if (tagName.equals("WineGrape")) {
                    attribute = new Attribute("Type", "Grape");
                  }
                  entity.addAttribute(attribute);
                  saveEntity(entity);
                }
              }
            });
          } else if (tagName.equals("vin:Winery")) {
            processTag(parser, new Closure<XMLStreamReader>() {
              public void execute(XMLStreamReader parser) {
                String wineryName = parser.getAttributeValue(RDF_URI, "about");
                if (wineryName.startsWith("#")) {
                  wineryName = wineryName.substring(1);
                }
                Entity entity = new Entity();
                entity.setName(wineryName);
                entity.addAttribute(new Attribute("Type", "Winery"));
                saveEntity(entity);
              }
            });
          } else if (! tagName.startsWith("owl:")) {
            long parentEntityId = getEntityIdFromDb(tagName);
            if (parentEntityId != -1) {
              processTag(parser, new Closure<XMLStreamReader>() {
                public void execute(XMLStreamReader parser) {
                  String tagName = formatTag(parser.getName());
                  String id = parser.getAttributeValue(RDF_URI, "ID");
                  if (StringUtils.isNotBlank(id)) {
                    // this is the entity
                    Entity entity = new Entity();
                    entity.setName(id);
                    entity.addAttribute(new Attribute("Type", "Wine"));
                    parentTagName = entity.getName();
                    saveEntity(entity);
                  } else {
                    // these are the relations
                    String relationName = tagName;
                    String targetEntityName = parser.getAttributeValue(RDF_URI, "resource");
                    if (targetEntityName != null && targetEntityName.startsWith("#")) {
                      targetEntityName = targetEntityName.substring(1);
                    }
                    if (targetEntityName != null) {
                      saveRelation(parentTagName, targetEntityName, relationName);
                    }
                  }
                }
              });
            }
          }
          break;
        case XMLStreamConstants.END_ELEMENT:
          depth--;
          break;
        default:
          break;
      }
    }
    // close the parser only after the whole document has been consumed
    parser.close();
  }

  /**
   * A tag processor template method which takes as input a closure that is
   * responsible for extracting the information from the tag and saving it
   * to the database. The contents of the closure is called inside the
   * START_ELEMENT case of the template code.
   * @param parser a reference to our StAX XMLStreamReader.
   * @param tagProcessor a reference to the Closure to process the tag.
   * @throws Exception if one is thrown.
   */
  private void processTag(XMLStreamReader parser, Closure<XMLStreamReader> tagProcessor) 
      throws Exception {
    int depth = 0;
    int event = parser.getEventType();
    String startTag = formatTag(parser.getName());
    FOR_LOOP:
    for (;;) {
      switch(event) {
        case XMLStreamConstants.START_ELEMENT:
          String tagName = formatTag(parser.getName());
          tagProcessor.execute(parser);
          depth++;
          break;
        case XMLStreamConstants.END_ELEMENT:
          tagName = formatTag(parser.getName());
          depth--;
          if (tagName.equals(startTag) && depth == 0) {
            break FOR_LOOP;
          }
          break;
        default:
          break;
      }
      event = parser.next();
    }
  }
  
  // ====================== DB load/save methods =========================

  /**
   * Saves an entity to the database. Takes care of setting attribute_types and
   * attribute objects linked to the entity.
   * @param entity the Entity to save.
   */
  private void saveEntity(final Entity entity) {
    // if entity already exists, don't save
    long entityId = getEntityIdFromDb(entity.getName());
    if (entityId == -1L) {
      log.debug("Saving entity:" + entity.getName());
      // insert the entity
      KeyHolder entityKeyHolder = new GeneratedKeyHolder();
      jdbcTemplate.update(new PreparedStatementCreator() {
        public PreparedStatement createPreparedStatement(Connection conn)
        throws SQLException {
          PreparedStatement ps = conn.prepareStatement(
            "insert into entities(name) values (?)", 
            Statement.RETURN_GENERATED_KEYS);
          ps.setString(1, entity.getName());
          return ps;
        }
      }, entityKeyHolder);
      entityId = entityKeyHolder.getKey().longValue();
      List<Attribute> attributes = entity.getAttributes();
      for (Attribute attribute : attributes) {
        saveAttribute(entityId, attribute);
      }
      // finally, always save the "english name" of the entity as an attribute
      saveAttribute(entityId, new Attribute("EnglishName", getEnglishName(entity.getName())));
    }
  }

  /**
   * Saves an entity attribute to the database and links the attribute to the
   * specified entity id.
   * @param entityId the entity id.
   * @param attribute the Attribute object to save.
   */
  private void saveAttribute(long entityId, Attribute attribute) {
    // check to see if the attribute is defined, if not define it
    long attributeId = 0L;
    try {
      attributeId = jdbcTemplate.queryForLong(
        "select id from attribute_types where attr_name = ?", 
        new String[] {attribute.getName()});
    } catch (IncorrectResultSizeDataAccessException e) {
      KeyHolder keyholder = new GeneratedKeyHolder();
      final String attributeName = attribute.getName();
      jdbcTemplate.update(new PreparedStatementCreator() {
        public PreparedStatement createPreparedStatement(Connection conn)
        throws SQLException {
          // request generated keys so that keyholder.getKey() below works
          PreparedStatement ps = conn.prepareStatement(
            "insert into attribute_types(attr_name) values (?)",
            Statement.RETURN_GENERATED_KEYS);
          ps.setString(1, attributeName);
          return ps;
        }
      }, keyholder);
      attributeId = keyholder.getKey().longValue();
    }
    jdbcTemplate.update(
      "insert into attributes(entity_id, attr_id, value) values (?,?,?)",
      new Object[] {entityId, attributeId, attribute.getValue()});
  }

  /**
   * Saves the relation into the database. Both entities must exist if the
   * relation is to be saved. Takes care of updating relation_types as well.
   * @param sourceEntityName the name of the source entity.
   * @param targetEntityName the name of the target entity.
   * @param relationName the name of the relation.
   */
  private void saveRelation(final String sourceEntityName, final String targetEntityName, 
      final String relationName) {
    // get the entity ids for source and target
    long sourceEntityId = getEntityIdFromDb(sourceEntityName);
    long targetEntityId = getEntityIdFromDb(targetEntityName);
    if (sourceEntityId == -1L || targetEntityId == -1L) {
      log.error("Cannot save relation: " + relationName + "(" + 
        sourceEntityName + "," + targetEntityName + ")"); 
      return;
    }
    log.debug("Saving relation: " + relationName + "(" + 
      sourceEntityName + "," + targetEntityName + ")");
    // get the relation id
    long relationTypeId = 0L;
    try {
      relationTypeId = jdbcTemplate.queryForLong(
        "select id from relation_types where type_name = ?", 
        new String[] {relationName});
    } catch (IncorrectResultSizeDataAccessException e) {
      KeyHolder keyholder = new GeneratedKeyHolder();
      jdbcTemplate.update(new PreparedStatementCreator() {
        public PreparedStatement createPreparedStatement(Connection conn) 
            throws SQLException {
          PreparedStatement ps = conn.prepareStatement(
            "insert into relation_types(type_name) values (?)", 
            Statement.RETURN_GENERATED_KEYS);
          ps.setString(1, relationName);
          return ps;
        }
      }, keyholder);
      relationTypeId = keyholder.getKey().longValue();
    }
    // save it
    jdbcTemplate.update(
      "insert into relations(src_entity_id, trg_entity_id, relation_id) values (?, ?, ?)", 
      new Long[] {sourceEntityId, targetEntityId, relationTypeId});
  }

  /**
   * Looks up the database to get the entity id given the name of the entity.
   * If the entity is not found, it returns -1.
   * @param entityName the name of the entity.
   * @return the entity id, or -1 if the entity is not found.
   */
  private long getEntityIdFromDb(String entityName) {
    try {
      long sourceEntityId = jdbcTemplate.queryForLong(
        "select id from entities where name = ?", 
        new String[] {entityName});
      return sourceEntityId;
    } catch (IncorrectResultSizeDataAccessException e) {
      return -1L;
    }
  }

  // ======== String manipulation methods ========
  
  /**
   * Format the XML tag. Takes as input the QName of the tag, and formats
   * it to a namespace:tagname format.
   * @param qname the QName for the tag.
   * @return the formatted QName for the tag.
   */
  private String formatTag(QName qname) {
    String prefix = qname.getPrefix();
    String suffix = qname.getLocalPart();
    if (StringUtils.isBlank(prefix)) {
      return suffix;
    } else {
      return StringUtils.join(new String[] {prefix, suffix}, ":");
    }
  }

  /**
   * Split up Uppercase Camelcased names (like Java classnames or C++ variable
   * names) into English phrases by splitting wherever there is a transition 
   * from lowercase to uppercase.
   * @param name the input camel cased name.
   * @return the "english" name.
   */
  private String getEnglishName(String name) {
    StringBuilder englishNameBuilder = new StringBuilder();
    char[] namechars = name.toCharArray();
    for (int i = 0; i < namechars.length; i++) {
      if (i > 0 && Character.isUpperCase(namechars[i]) && 
          Character.isLowerCase(namechars[i-1])) {
        englishNameBuilder.append(' ');
      }
      englishNameBuilder.append(namechars[i]);
    }
    return englishNameBuilder.toString();
  }
}

If you have been reading the code above closely, you will have noticed the calls to the processTag() method in parseAndLoadData(). The parseAndLoadData() method implements an infinite loop in which parser.next() is called repeatedly until the END_DOCUMENT event is encountered. Certain tags need specific processing as they are encountered. Each such handler would have to set up its own for(;;) loop, as in parseAndLoadData(), and break out of it when the closing tag at the same depth is encountered, so the code becomes repetitive if it is written into every sub-method. The processTag() method factors that loop out into a template to which I pass a Closure.
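For readers who want to see the template idea in isolation, here is a stripped-down sketch. It is not the loader's code: it assumes Java 8 or later and uses java.util.function.Consumer with a lambda in place of the commons-collections Closure, and the sample XML and the TagTemplateSketch and collectTags names are invented for the example. The loop and the depth bookkeeping live in processTag(), and the caller supplies only what to do at each start tag.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class TagTemplateSketch {

  // Hypothetical sample document, used only for this illustration.
  static final String SAMPLE_XML =
      "<root><region><locatedIn/><note>x</note></region><other/></root>";

  /**
   * Template: must be called while the parser sits on a START_ELEMENT.
   * Invokes the handler for that tag and every nested start tag, and
   * returns once the matching end tag has been consumed.
   */
  static void processTag(XMLStreamReader parser,
      Consumer<XMLStreamReader> handler) throws XMLStreamException {
    int depth = 0;
    int event = parser.getEventType();
    while (true) {
      if (event == XMLStreamConstants.START_ELEMENT) {
        handler.accept(parser);
        depth++;
      } else if (event == XMLStreamConstants.END_ELEMENT) {
        depth--;
        if (depth == 0) {
          return; // closing tag of the element we started on
        }
      }
      event = parser.next();
    }
  }

  /** Collects the names of all tags inside (and including) tagOfInterest. */
  static List<String> collectTags(String xml, String tagOfInterest) {
    final List<String> seen = new ArrayList<String>();
    try {
      XMLStreamReader parser = XMLInputFactory.newInstance()
          .createXMLStreamReader(new StringReader(xml));
      try {
        while (parser.hasNext()) {
          if (parser.next() == XMLStreamConstants.START_ELEMENT
              && tagOfInterest.equals(parser.getLocalName())) {
            processTag(parser, p -> seen.add(p.getLocalName()));
          }
        }
      } finally {
        parser.close();
      }
    } catch (XMLStreamException e) {
      throw new IllegalStateException("Could not parse sample XML", e);
    }
    return seen;
  }

  public static void main(String[] args) {
    System.out.println(collectTags(SAMPLE_XML, "region"));
  }
}
```

In well-formed XML the depth counter alone is enough to find the matching end tag, so this sketch drops the extra tag-name comparison that the loader performs.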

Because I use anonymous Closures inlined into the code, the parseAndLoadData() method looks monolithic. Some people would prefer to declare private Closure implementations and use them here instead. This would make the code superficially cleaner, but because each Closure implements only a portion of the functionality, readers of the code would bounce between parseAndLoadData(), processTag() and the Closure implementations, and the result is likely to be less readable than the current approach. I prefer it the way I have written it: even though all the code is in one place, having the code inside inner class methods decreases the coupling and makes it more readable than monolithic code. Choose whichever approach works best for you and your coding style.

Overall, I liked StAX. Everything else being equal, I would still prefer DOM (using JDOM) parsing over SAX or StAX. However, when parsing large XML files, DOM is impractical, and StAX is a better alternative, resulting in cleaner and more maintainable code.

Update 2009-04-26: In recent posts, I have been building on code written and described in previous posts, so there were (and rightly so) quite a few requests for the code. So I've created a project on Sourceforge to host the code. You will find the complete source code built so far in the project's SVN repository.

11 comments (moderated to prevent spam):

Anonymous said...

Although you have only a few tables, you should couple this with Hibernate

Sujit Pal said...

My approach when I did this was to get the data loaded quickly. Arguably, once you are used to Hibernate mapping, mapping a schema as simple as this is a matter of an hour's work or less, and it would pay dividends on the retrieval end. I used Hibernate at my previous job, but at my current job, the preferred standard is JDBC, so my Hibernate is kind of rusty at the moment, and just using JDBC seemed to be faster.

dontcare said...

VTD-XML is the other latest XML processing model that is way more efficient than DOM and SAX

http://vtd-xml.sf.net

Sujit Pal said...

Thanks for the pointer, dontcare, I will check it out.

Sheba Wilfred said...

hi sir ,
I m doin my final yr, currently workg with my project named "RESUME FILTER" using sparql,xpath,xquery to retrieve records and finding the efficient method among these for quick search..We now struct in the middle with the problem of inputing a owl file into a dom parser and to retrieve each attribute and display it in label..(ex. if NAME is defined in owl file , my DOM program must input that owl file and retrieve the NAME tag and display in it a label , enabling the user to fill the resume).Our ultimate aim is to write a DOM program that gets owl file and display all its contents dynamically in the label..I referred your page and ur posts suprise us..But our prbm is, we find it complex to our level and is of high standard..,as we r nt aware of many of the techniques you hav used.. can you help us to solve our prbm plz ? can you suggest any simple DOM program that inputs a owl file and retrieve all its data and display it ..? ? my id is snoffysheba@gmail.com .. we are waitg fr ur rly sir..

Sheba Wilfred said...

Hi sir,

We are in need of your help..Can u guide us for doin our project ...?

Sujit Pal said...

Hi Sheba, I am/was not aware of specialized OWL parsers, but a quick Google search for "java owl parser" pointed me to the OWL API Project. If you want something more general purpose, there is a very large number of XML parsers available in Java - my favorite was JDOM, but nowadays I also use StAX (which was used in this post and is built into Java 6 and higher) or dom4j as appropriate. Hope these pointers help.

joanna hwa said...

Dear sir,

I really need your help. Do you have java code to do text categorization application? Thanks.

Sujit Pal said...

Hi Joanna, no, I don't have any Java code that is generally applicable to any categorization situation, sorry.

Anonymous said...

where can I find the "*com.mycompany.myapp.ontology*" package to be imported?

Sujit Pal said...

Hi, you can find the code for the parser and other related code in the SVN repo for my JTMT project here.