I needed some data to build an ontology, so I downloaded a sample wine ontology from the W3C. The file is in OWL (Web Ontology Language) XML format, so I needed to parse it. Ordinarily I would have used JDOM, but I had recently heard good things about XML pull-parsing from a colleague, so I decided to check out pull-parsing with StAX, which is built into Java 6.
My objective was to parse the XML file into a simple database structure so I could use it later. The database structure is shown below. The central table is the entities table, which represents a node of the ontology. The relations table links nodes via relationships, and the attributes table contains non-base properties of an entity. The distinction between attributes and relationships is somewhat gray, but in general I consider an attribute to be a property that we can look an entity up by, such as a name. The attribute_types and relation_types tables are an attempt to normalize repetitive attribute and relation names out of the main tables.
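For reference, the following is a rough sketch of DDL that would be compatible with the insert and select statements in the loader code further down. The column types and sizes are indicative only, and the driver class, connection URL and credentials are placeholders, so treat this as a sketch rather than the exact schema I used.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class OntologySchemaSketch {
  public static void main(String[] args) throws Exception {
    // placeholder driver, URL and credentials - adjust for your environment
    Class.forName("com.mysql.jdbc.Driver");
    Connection conn = DriverManager.getConnection(
        "jdbc:mysql://localhost:3306/ontology", "root", "secret");
    Statement st = conn.createStatement();
    // one row per ontology node
    st.executeUpdate("create table entities ("
        + "id integer auto_increment primary key, name varchar(255) not null)");
    // normalized attribute names, referenced by attributes.attr_id
    st.executeUpdate("create table attribute_types ("
        + "id integer auto_increment primary key, attr_name varchar(64) not null)");
    // non-base properties of an entity
    st.executeUpdate("create table attributes ("
        + "entity_id integer not null, attr_id integer not null, value varchar(255))");
    // normalized relation names, referenced by relations.relation_id
    st.executeUpdate("create table relation_types ("
        + "id integer auto_increment primary key, type_name varchar(64) not null)");
    // links between two entities via a named relation
    st.executeUpdate("create table relations ("
        + "src_entity_id integer not null, trg_entity_id integer not null, "
        + "relation_id integer not null)");
    st.close();
    conn.close();
  }
}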
I would like to point out that the parser is not a "standard" OWL parser. The tag names to extract information from were determined by eyeballing the wine.rdf file and figuring out which tags would yield interesting information, so the parser will need to change in order to handle other OWL files. However, if you are just looking for pointers on how to go about doing something similar, you may find this post useful.
Since this is the first time I have used a pull-parsing library, I would like to share my initial reactions to this strategy. At first sight, there does not seem to be much difference between SAX (push-parsing) and StAX. With SAX, you intercept the parser lifecycle, adding hooks into the startElement() and endElement() callbacks to do custom processing; with StAX, you respond to START_ELEMENT and END_ELEMENT events that you pull from the parser yourself. The difference becomes apparent once you start working with it, however. Because you are driving the event loop, you can delegate processing to sub-methods by passing around the parser reference, which makes your code a bit cleaner, as the sketch below illustrates.
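Here is a stripped-down sketch of that pull-parsing style (the class and helper method names are made up for illustration; the real loader follows below): the caller owns the event loop and hands the same XMLStreamReader reference to a helper method when it encounters a tag it cares about.

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class PullParsingSketch {

  private static final String RDF_URI = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

  public static void main(String[] args) throws Exception {
    // args[0] is the path to the OWL/RDF file
    XMLStreamReader reader = XMLInputFactory.newInstance()
        .createXMLStreamReader(new FileInputStream(args[0]));
    while (reader.hasNext()) {
      int event = reader.next();
      if (event == XMLStreamConstants.START_ELEMENT
          && "Class".equals(reader.getLocalName())) {
        // delegate: the helper is free to keep pulling events itself
        handleClass(reader);
      }
    }
    reader.close();
  }

  private static void handleClass(XMLStreamReader reader) {
    // a real handler would keep calling reader.next() until it reaches the
    // matching END_ELEMENT, as processTag() does in the loader below
    System.out.println("owl:Class with ID "
        + reader.getAttributeValue(RDF_URI, "ID"));
  }
}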
The code for the parser is shown below. It saves the extracted data into a local MySQL database. Callers instantiate the OwlDbLoader, set the OWL file location and the DataSource, and call the parseAndLoadData() method; a usage sketch follows the listing.
package com.mycompany.myapp.ontology.loaders;
import java.io.FileInputStream;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.List;
import javax.sql.DataSource;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.apache.commons.collections15.Closure;
import org.apache.commons.lang.StringUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.springframework.dao.IncorrectResultSizeDataAccessException;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.core.PreparedStatementCreator;
import org.springframework.jdbc.support.GeneratedKeyHolder;
import org.springframework.jdbc.support.KeyHolder;
import com.mycompany.myapp.ontology.Attribute;
import com.mycompany.myapp.ontology.Entity;
/**
* Parses OWL files representing external ontologies and loads them
* into a local database.
*/
public class OwlDbLoader {
private final Log log = LogFactory.getLog(getClass());
private final static String RDF_URI = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";
private JdbcTemplate jdbcTemplate;
private String owlFileLocation;
private static String parentTagName = null;
public void setDataSource(DataSource dataSource) {
this.jdbcTemplate = new JdbcTemplate(dataSource);
}
public void setOwlFileLocation(String owlFileLocation) {
this.owlFileLocation = owlFileLocation;
}
/**
* These parsing rules were devised by physically looking at the OWL file
* and figuring out what goes where. This should by no means be considered
* a generalized way to parse OWL files.
*
* Parsing rules:
*
* owl:Class@rdf:ID = entity (1), type=Wine
* optional:
* owl:Class/rdfs:subClassOf@rdf:resource = entity (2), type=Wine
* (2) -- parent --> (1)
* if owl:Class/rdfs:subClassOf has no attributes, ignore
* if no owl:Class/rdfs:subClassOf entity, ignore it
* owl:Class/owl:Restriction/owl:onProperty@rdf:resource related to
* owl:Class/owl:Restriction/owl:hasValue@rdf:resource
*
* Region@rdf:ID = entity, type=Region
* optional:
* Region/locatedIn@rdf:resource=entity (2), type=Region
* (2) -- parent --> (1)
* owl:Class/rdfs:subClassOf/owl:Restriction - ignore
*
* WineBody@rdf:ID = entity, type=WineBody
* WineColor@rdf:ID = entity, type=WineColor
* WineFlavor@rdf:ID = entity, type=WineFlavor
* WineSugar@rdf:ID = entity, type=WineSugar
* Winery@rdf:ID = entity, type=Winery
* WineGrape@rdf:ID = entity, type=WineGrape
*
* Else if no namespace, this must be a wine itself, capture as entity:
* ?@rdf:ID = entity, type=Wine
* all subtags are relations:
* tagname = relation_name
* tag@rdf:resource = target entity
*/
public void parseAndLoadData() throws Exception {
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLStreamReader parser = factory.createXMLStreamReader(
new FileInputStream(owlFileLocation));
int depth = 0;
for (;;) {
int event = parser.next();
if (event == XMLStreamConstants.END_DOCUMENT) {
break;
}
switch (event) {
case XMLStreamConstants.START_ELEMENT:
depth++;
String tagName = formatTag(parser.getName());
if (tagName.equals("owl:Class")) {
processTag(parser, new Closure<XMLStreamReader>() {
public void execute(XMLStreamReader parser) {
// relations are not being persisted because value of child
// entity cannot be persisted.
String tagName = formatTag(parser.getName());
if (tagName.equals("owl:Class")) {
String name = parser.getAttributeValue(RDF_URI, "ID");
if (name != null) {
Entity classEntity = new Entity();
parentTagName = name;
classEntity.setName(parentTagName);
classEntity.addAttribute(new Attribute("Type", "Class"));
saveEntity(classEntity);
}
} else if (tagName.equals("rdfs:subClassOf")) {
String name = parser.getAttributeValue(RDF_URI, "resource");
if (name != null) {
Entity superclassEntity = new Entity();
if (name.startsWith("http://")) {
superclassEntity.setName(name.substring(name.lastIndexOf('#') + 1));
superclassEntity.addAttribute(new Attribute("Type",
name.substring(name.lastIndexOf('/') + 1,
name.lastIndexOf('#')) + ":Class"));
} else if (name.startsWith("#")) {
superclassEntity.setName(name.substring(1));
superclassEntity.addAttribute(new Attribute("Type", "Class"));
} else {
superclassEntity.setName(name);
superclassEntity.addAttribute(new Attribute("Type", "Class"));
}
saveEntity(superclassEntity);
saveRelation(parentTagName, superclassEntity.getName(), "parentOf");
parentTagName = null;
}
}
}
});
} else if (tagName.equals("Region")) {
processTag(parser, new Closure<XMLStreamReader>() {
public void execute(XMLStreamReader parser) {
String tagName = formatTag(parser.getName());
if (tagName.equals("Region")) {
Entity classEntity = new Entity();
parentTagName = parser.getAttributeValue(RDF_URI, "ID");
classEntity.setName(parentTagName);
classEntity.addAttribute(new Attribute("Type", "Region"));
saveEntity(classEntity);
} else if (tagName.equals("locatedIn")) {
Entity superclassEntity = new Entity();
String locationEntityName = parser.getAttributeValue(RDF_URI, "resource");
if (locationEntityName.startsWith("#")) {
locationEntityName = locationEntityName.substring(1);
}
superclassEntity.setName(locationEntityName);
superclassEntity.addAttribute(new Attribute("Type", "Region"));
saveEntity(superclassEntity);
saveRelation(parentTagName, locationEntityName, "locatedIn");
parentTagName = null;
}
}
});
} else if (tagName.equals("WineBody") ||
tagName.equals("WineColor") ||
tagName.equals("WineFlavor") ||
tagName.equals("WineSugar") ||
tagName.equals("WineGrape")) {
processTag(parser, new Closure<XMLStreamReader>() {
public void execute(XMLStreamReader parser) {
Entity entity = new Entity();
String name = parser.getAttributeValue(RDF_URI, "ID");
if (name != null) {
entity.setName(name);
String tagName = parser.getLocalName();
Attribute attribute = null;
if (tagName.equals("WineBody")) {
attribute = new Attribute("Type", "Body");
} else if (tagName.equals("WineColor")) {
attribute = new Attribute("Type", "Color");
} else if (tagName.equals("WineFlavor")) {
attribute = new Attribute("Type", "Flavor");
} else if (tagName.equals("WineSugar")) {
attribute = new Attribute("Type", "Sugar");
} else if (tagName.equals("WineGrape")) {
attribute = new Attribute("Type", "Grape");
}
entity.addAttribute(attribute);
saveEntity(entity);
}
}
});
} else if (tagName.equals("vin:Winery")) {
processTag(parser, new Closure<XMLStreamReader>() {
public void execute(XMLStreamReader parser) {
String wineryName = parser.getAttributeValue(RDF_URI, "about");
if (wineryName.startsWith("#")) {
wineryName = wineryName.substring(1);
}
Entity entity = new Entity();
entity.setName(wineryName);
entity.addAttribute(new Attribute("Type", "Winery"));
saveEntity(entity);
}
});
} else if (! tagName.startsWith("owl:")) {
long parentEntityId = getEntityIdFromDb(tagName);
if (parentEntityId != -1) {
processTag(parser, new Closure<XMLStreamReader>() {
public void execute(XMLStreamReader parser) {
String tagName = formatTag(parser.getName());
String id = parser.getAttributeValue(RDF_URI, "ID");
if (StringUtils.isNotBlank(id)) {
// this is the entity
Entity entity = new Entity();
entity.setName(id);
entity.addAttribute(new Attribute("Type", "Wine"));
parentTagName = entity.getName();
saveEntity(entity);
} else {
// these are the relations
String relationName = tagName;
String targetEntityName = parser.getAttributeValue(RDF_URI, "resource");
if (targetEntityName != null && targetEntityName.startsWith("#")) {
targetEntityName = targetEntityName.substring(1);
}
if (targetEntityName != null) {
saveRelation(parentTagName, targetEntityName, relationName);
}
}
}
});
}
}
break;
case XMLStreamConstants.END_ELEMENT:
depth--;
break;
default:
break;
}
}
parser.close();
}
/**
* A tag processor template method which takes as input a closure that is
* responsible for extracting the information from the tag and saving it
* to the database. The closure is invoked inside the
* START_ELEMENT case of the template code.
* @param parser a reference to our StAX XMLStreamReader.
* @param tagProcessor a reference to the Closure to process the tag.
* @throws Exception if one is thrown.
*/
private void processTag(XMLStreamReader parser, Closure<XMLStreamReader> tagProcessor)
throws Exception {
int depth = 0;
int event = parser.getEventType();
String startTag = formatTag(parser.getName());
FOR_LOOP:
for (;;) {
switch(event) {
case XMLStreamConstants.START_ELEMENT:
String tagName = formatTag(parser.getName());
tagProcessor.execute(parser);
depth++;
break;
case XMLStreamConstants.END_ELEMENT:
tagName = formatTag(parser.getName());
depth--;
if (tagName.equals(startTag) && depth == 0) {
break FOR_LOOP;
}
break;
default:
break;
}
event = parser.next();
}
}
// ====================== DB load/save methods =========================
/**
* Saves an entity to the database. Takes care of setting attribute_types and
* attribute objects linked to the entity.
* @param entity the Entity to save.
*/
private void saveEntity(final Entity entity) {
// if entity already exists, don't save
long entityId = getEntityIdFromDb(entity.getName());
if (entityId == -1L) {
log.debug("Saving entity:" + entity.getName());
// insert the entity
KeyHolder entityKeyHolder = new GeneratedKeyHolder();
jdbcTemplate.update(new PreparedStatementCreator() {
public PreparedStatement createPreparedStatement(Connection conn)
throws SQLException {
PreparedStatement ps = conn.prepareStatement(
"insert into entities(name) values (?)",
Statement.RETURN_GENERATED_KEYS);
ps.setString(1, entity.getName());
return ps;
}
}, entityKeyHolder);
entityId = entityKeyHolder.getKey().longValue();
List<Attribute> attributes = entity.getAttributes();
for (Attribute attribute : attributes) {
saveAttribute(entityId, attribute);
}
// finally, always save the "english name" of the entity as an attribute
saveAttribute(entityId, new Attribute("EnglishName", getEnglishName(entity.getName())));
}
}
/**
* Saves an entity attribute to the database and links the attribute to the
* specified entity id.
* @param entityId the entity id.
* @param attribute the Attribute object to save.
*/
private void saveAttribute(long entityId, Attribute attribute) {
// check to see if the attribute is defined, if not define it
long attributeId = 0L;
try {
attributeId = jdbcTemplate.queryForLong(
"select id from attribute_types where attr_name = ?",
new String[] {attribute.getName()});
} catch (IncorrectResultSizeDataAccessException e) {
KeyHolder keyholder = new GeneratedKeyHolder();
final String attributeName = attribute.getName();
jdbcTemplate.update(new PreparedStatementCreator() {
public PreparedStatement createPreparedStatement(Connection conn)
throws SQLException {
PreparedStatement ps = conn.prepareStatement(
"insert into attribute_types(attr_name) values (?)");
ps.setString(1, attributeName);
return ps;
}
}, keyholder);
attributeId = keyholder.getKey().longValue();
}
jdbcTemplate.update(
"insert into attributes(entity_id, attr_id, value) values (?,?,?)",
new Object[] {entityId, attributeId, attribute.getValue()});
}
/**
* Saves the relation into the database. Both entities must exist if the
* relation is to be saved. Takes care of updating relation_types as well.
* @param sourceEntityName the name of the source entity.
* @param targetEntityName the name of the target entity.
* @param relationName the name of the relation.
*/
private void saveRelation(final String sourceEntityName, final String targetEntityName,
final String relationName) {
// get the entity ids for source and target
long sourceEntityId = getEntityIdFromDb(sourceEntityName);
long targetEntityId = getEntityIdFromDb(targetEntityName);
if (sourceEntityId == -1L || targetEntityId == -1L) {
log.error("Cannot save relation: " + relationName + "(" +
sourceEntityName + "," + targetEntityName + ")");
return;
}
log.debug("Saving relation: " + relationName + "(" +
sourceEntityName + "," + targetEntityName + ")");
// get the relation id
long relationTypeId = 0L;
try {
relationTypeId = jdbcTemplate.queryForInt(
"select id from relation_types where type_name = ?",
new String[] {relationName});
} catch (IncorrectResultSizeDataAccessException e) {
KeyHolder keyholder = new GeneratedKeyHolder();
jdbcTemplate.update(new PreparedStatementCreator() {
public PreparedStatement createPreparedStatement(Connection conn)
throws SQLException {
PreparedStatement ps = conn.prepareStatement(
"insert into relation_types(type_name) values (?)",
Statement.RETURN_GENERATED_KEYS);
ps.setString(1, relationName);
return ps;
}
}, keyholder);
relationTypeId = keyholder.getKey().longValue();
}
// save it
jdbcTemplate.update(
"insert into relations(src_entity_id, trg_entity_id, relation_id) values (?, ?, ?)",
new Long[] {sourceEntityId, targetEntityId, relationTypeId});
}
/**
* Looks up the entity id in the database, given the name of the entity.
* If the entity is not found, it returns -1.
* @param entityName the name of the entity.
* @return the entity id, or -1 if the entity is not found.
*/
private long getEntityIdFromDb(String entityName) {
try {
long sourceEntityId = jdbcTemplate.queryForLong(
"select id from entities where name = ?",
new String[] {entityName});
return sourceEntityId;
} catch (IncorrectResultSizeDataAccessException e) {
return -1L;
}
}
// ======== String manipulation methods ========
/**
* Format the XML tag. Takes as input the QName of the tag, and formats
* it to a namespace:tagname format.
* @param qname the QName for the tag.
* @return the formatted QName for the tag.
*/
private String formatTag(QName qname) {
String prefix = qname.getPrefix();
String suffix = qname.getLocalPart();
if (StringUtils.isBlank(prefix)) {
return suffix;
} else {
return StringUtils.join(new String[] {prefix, suffix}, ":");
}
}
/**
* Split up Uppercase Camelcased names (like Java classnames or C++ variable
* names) into English phrases by splitting wherever there is a transition
* from lowercase to uppercase.
* @param name the input camel cased name.
* @return the "english" name.
*/
private String getEnglishName(String name) {
StringBuilder englishNameBuilder = new StringBuilder();
char[] namechars = name.toCharArray();
for (int i = 0; i < namechars.length; i++) {
if (i > 0 && Character.isUpperCase(namechars[i]) &&
Character.isLowerCase(namechars[i-1])) {
englishNameBuilder.append(' ');
}
englishNameBuilder.append(namechars[i]);
}
return englishNameBuilder.toString();
}
}
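For completeness, a caller might look something like the sketch below. The BasicDataSource from commons-dbcp is just one convenient way to get a DataSource (a Spring-configured bean would work equally well), and the JDBC URL, credentials and file location are placeholders.

import org.apache.commons.dbcp.BasicDataSource;
import com.mycompany.myapp.ontology.loaders.OwlDbLoader;

public class LoadWineOntology {
  public static void main(String[] args) throws Exception {
    // placeholder connection settings - adjust for your environment
    BasicDataSource dataSource = new BasicDataSource();
    dataSource.setDriverClassName("com.mysql.jdbc.Driver");
    dataSource.setUrl("jdbc:mysql://localhost:3306/ontology");
    dataSource.setUsername("root");
    dataSource.setPassword("secret");
    OwlDbLoader loader = new OwlDbLoader();
    loader.setDataSource(dataSource);
    loader.setOwlFileLocation("/path/to/wine.rdf");
    loader.parseAndLoadData();
    dataSource.close();
  }
}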
If you have been reading the code above closely, notice the calls to the processTag() method in parseAndLoadData(). parseAndLoadData() runs a loop in which parser.next() is called repeatedly until the END_DOCUMENT event is encountered, and specific processing is done for certain tags as they are encountered. Each of those tag-specific handlers would need to set up the same kind of for (;;) loop as parseAndLoadData() and break out of it when the matching closing tag is reached at the same depth, so putting that loop into every sub-method would be repetitive. Instead, the processTag() method implements the loop once as a template, to which I pass a Closure.
Because I use anonymous Closures inlined into the code, parseAndLoadData() looks monolithic. Some people would prefer to declare private Closure implementations and pass those in instead. That would make parseAndLoadData() superficially cleaner, but because each Closure implements only a portion of the functionality, readers of the code would have to bounce between parseAndLoadData(), processTag() and the Closure implementations, and the result is likely to be less readable than the current approach. I prefer it the way I have written it: even though all the code is in one place, having the per-tag logic inside anonymous inner class methods reduces the coupling and makes it more readable than truly monolithic code. Choose whichever approach works best for you and your coding style.
Overall, I liked StAX. All else being equal, I would still prefer DOM parsing (using JDOM) over SAX or StAX. However, DOM is impractical for large XML files, and in those cases StAX is a better alternative than SAX, resulting in cleaner and more maintainable code.
Update 2009-04-26: In recent posts, I have been building on code written and described in previous posts, so there were (and rightly so) quite a few requests for the code. So I've created a project on Sourceforge to host the code. You will find the complete source code built so far in the project's SVN repository.
Although you have only a few tables, you should couple this with Hibernate
My approach when I did this was to get the data loaded quickly. Arguably, once you are used to Hibernate mapping, mapping a schema as simple as this is a matter of an hour's work or less, and it would pay dividends on the retrieval end. I used Hibernate at my previous job, but at my current job the preferred standard is JDBC, so my Hibernate is kind of rusty at the moment, and just using JDBC seemed to be faster.
VTD-XML is another recent XML processing model that is way more efficient than DOM and SAX: http://vtd-xml.sf.net
Thanks for the pointer, dontcare, I will check it out.
ReplyDeletehi sir ,
ReplyDeleteI m doin my final yr, currently workg with my project named "RESUME FILTER" using sparql,xpath,xquery to retrieve records and finding the efficient method among these for quick search..We now struct in the middle with the problem of inputing a owl file into a dom parser and to retrieve each attribute and display it in label..(ex. if NAME is defined in owl file , my DOM program must input that owl file and retrieve the NAME tag and display in it a label , enabling the user to fill the resume).Our ultimate aim is to write a DOM program that gets owl file and display all its contents dynamically in the label..I referred your page and ur posts suprise us..But our prbm is, we find it complex to our level and is of high standard..,as we r nt aware of many of the techniques you hav used.. can you help us to solve our prbm plz ? can you suggest any simple DOM program that inputs a owl file and retrieve all its data and display it ..? ? my id is snoffysheba@gmail.com .. we are waitg fr ur rly sir..
Hi sir,
We are in need of your help. Can you guide us in doing our project?
Hi Sheba, I was not aware of specialized OWL parsers, but a quick Google search for "java owl parser" pointed me to the OWL API project. If you want something more general purpose, there are a large number of XML parsers available in Java - my favorite was JDOM, but nowadays I also use StAX (which was used in this post and is built into Java 6) or dom4j as appropriate. Hope these pointers help.
Dear sir,
I really need your help. Do you have Java code for a text categorization application? Thanks.
Hi Joanna, no, I don't have any Java code that is generally applicable to any categorization situation, sorry.
ReplyDeletewhere can I find the "*com.mycompany.myapp.ontology*" package to be imported?
ReplyDeleteHi, you can find the code for the parser and other related code in the SVN repo for my JTMT project here.