Salmon Run: November 2011

Friday, November 25, 2011

Homonym Disambiguation using Concept Distance

According to Wikipedia, a homonym is one of a group of words that often share the same spelling and pronunciation but have different meanings. Homonym disambiguation, from my (admittedly limited) point of view, is the process by which a program that extracts concepts from text (such as my TGNI application) infers the correct sense of such a word when it encounters one.

In my world, some examples of homonyms are medical abbreviations which have non-medical analogs (e.g. AA: Aortic Aneurysm and American Airlines), medical words that double as abbreviations (e.g. COLD: the common cold and Chronic Obstructive Lung Disease), and identical abbreviations for different medical concepts (e.g. ARF: Acute Renal Failure, Acute Respiratory Failure, and Activation Resorption Formation). We do have a pretty reliable algorithm to detect and disambiguate homonyms in our in-house mapper, which relies on hand-crafted lists of reinforcer and dereinforcer words, but it is high maintenance, and I wanted to see if I could figure out something simpler.

Of course, disambiguation is a bit of a misnomer, since given enough context, there is no ambiguity. In my case, the context comes from the document itself. So in a document which has one or more occurrences of ARF, the intuition is that there will be other words or phrases that make it clear "which" ARF is being referred to. Further, the concepts representing these words or phrases will occur "closer to" one of the concepts representing ARF - the one for which the closeness is the highest would be the correct sense of the homonym ARF.

To codify this intuition, I came up with a metric that defines the distance of a concept from a set of other concepts in concept (taxonomy) space. Basically it is the inverse of the square root of the sum of the squared path weights, where each path runs from the homonym concept to one of the top concepts in the document. The contribution of each edge on a path is the weight (indicating relevance) assigned to the edge multiplied by the number of occurrences of the top concept in the document. As the path becomes longer, each successive edge is damped by the square of its hop count from the original homonym concept. Here is the formula:

d = 1 / √(Σ_i p_i²)
p_i = Σ_j (r_{j-1,j} × n_i) / j²
where:
d = the distance between homonym concept C_h and the group of top concepts {C_T}.
p_i = the path weight of the path between C_h and the i-th top concept.
r_{j-1,j} = the rank assigned to the edge connecting the (j-1)-th and the j-th node in path p_i.
n_i = the number of occurrences of the i-th top concept in the document.
j = the hop number of the edge along the path, starting at 1 for the edge leaving C_h.

Notice that our weights are more an indicator of likelihood than distance, i.e., higher values indicate lower cost or distance. This is why we need to take the inverse of the root of the summed squares of the path weights to get a distance metric.

The root of the sum of squares came from the notion of Pythagorean distance in concept space, and the damping factor for consecutive edges of a path comes from the behavior of gravitational attraction, which falls off with the square of the distance. Multiplying by the number of occurrences seemed to make sense because a greater number of occurrences of a particular related concept would indicate more weighting towards that concept.
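To make the arithmetic in the formula above concrete, here is a minimal Python sketch of the same calculation. The function names and the sample ranks and occurrence counts are made up for illustration and do not come from the real taxonomy. The explicit guard for the "no paths at all" case is my assumption about what a sensible result would be.

```python
import math


def path_weight(edge_ranks, occurrences):
    """Damped weight p_i of one path: sum over hops j of rank * n_i / j**2."""
    return sum(rank * occurrences / (j ** 2)
               for j, rank in enumerate(edge_ranks, start=1))


def concept_distance(paths):
    """Inverse of the square root of the summed squared path weights.

    paths is a list of (edge_ranks, occurrences) tuples, one per top
    concept that is reachable from the homonym concept.
    """
    sum_sq = sum(path_weight(ranks, n) ** 2 for ranks, n in paths)
    if sum_sq == 0:
        # Assumption: with no usable path, treat the concept as infinitely far.
        return float("inf")
    return 1.0 / math.sqrt(sum_sq)


# Hypothetical input: a 2-hop path to a concept seen 3 times,
# and a 1-hop path to a concept seen twice.
paths = [([10, 4], 3), ([8], 2)]
print(concept_distance(paths))
```

With these numbers the first path contributes 10*3/1 + 4*3/4 = 33 and the second contributes 8*2/1 = 16, so the second hop of the longer path counts for much less than the first.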

Of course, computing the shortest path between any two nodes in a graph is likely to be compute intensive, and worse, it is quite likely that there may be no such path at all. So we limit our algorithm in two ways for reasonable performance. First we consider only the distance from the homonym concepts to the top 3 concepts in the document, and second, we limit the search for the shortest path to a maximum path length of 5. We may need to tinker with these values a bit, but it seems to work fine for my initial tests.
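The bounded search can be sketched without a graph database. The Python below is only an illustration of the idea under my own assumptions: the graph is a toy adjacency dictionary with made-up node names and ranks, and the function names are hypothetical. It finds all minimum-hop paths within a hop limit, then keeps the one whose edge ranks add up to the largest total, which is how the test code further down picks among equally short paths.

```python
def shortest_paths(graph, start, goal, max_hops=5):
    """Return all minimum-hop paths from start to goal as (nodes, ranks)
    tuples, or an empty list if the goal is not reachable within max_hops."""
    frontier = [([start], [])]  # (nodes on the path, ranks of its edges)
    for _ in range(max_hops):
        next_frontier, found = [], []
        for nodes, ranks in frontier:
            for neighbor, rank in graph.get(nodes[-1], []):
                if neighbor in nodes:
                    continue  # do not revisit a node on the same path
                candidate = (nodes + [neighbor], ranks + [rank])
                if neighbor == goal:
                    found.append(candidate)
                else:
                    next_frontier.append(candidate)
        if found:
            return found  # every path with the minimum hop count
        frontier = next_frontier
    return []


def best_path(graph, start, goal, max_hops=5):
    """Among the shortest paths, pick the one with the highest total rank."""
    paths = shortest_paths(graph, start, goal, max_hops)
    if not paths:
        return None
    return max(paths, key=lambda path: sum(path[1]))


# Toy graph with made-up concepts: node -> [(neighbor, rank), ...]
graph = {
    "ARF": [("kidney", 5), ("failure", 2)],
    "kidney": [("dialysis", 7)],
    "failure": [("dialysis", 9)],
}
print(best_path(graph, "ARF", "dialysis"))
print(best_path(graph, "ARF", "dialysis", max_hops=1))
```

This level-by-level expansion keeps whole paths in memory, which is fine for a toy graph but would not scale; a real graph library does this far more efficiently.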

So anyway, to test out my idea, I found two documents, one for Acute Renal Failure and another for Acute Respiratory Failure, concept mapped them, and wrote up a little JUnit test to see if the intuition outlined above is correct. Here it is:

// Source: src/main/java/com/mycompany/tgni/neo4j/ConceptDistanceHomonymDisambiguatorTest.java
package com.mycompany.tgni.neo4j;

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

import org.apache.commons.lang.StringUtils;
import org.junit.AfterClass;
import org.junit.Assert;
import org.junit.BeforeClass;
import org.junit.Test;
import org.neo4j.graphalgo.GraphAlgoFactory;
import org.neo4j.graphalgo.PathFinder;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Path;
import org.neo4j.graphdb.PropertyContainer;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.RelationshipExpander;
import org.neo4j.kernel.Traversal;

/**
 * Tests for calculating shortest path between two nodes
 * for homonym disambiguation.
 */
public class ConceptDistanceHomonymDisambiguatorTest {

  private static final String[][] CONCEPTS_FOR_ACUTE_RENAL_FAILURE_DOC = 
    new String[][] { ... };
  private static final String[][] CONCEPTS_FOR_ACUTE_RESPIRATORY_FAILURE_DOC = 
    new String[][] { ... };

  private static NodeService nodeService;
  
  @BeforeClass
  public static void setupBeforeClass() throws Exception {
    nodeService = NodeServiceFactory.getInstance();
  }
  
  @AfterClass
  public static void teardownAfterClass() throws Exception {
    NodeServiceFactory.destroy();
  }

  @Test
  public void testPathBetweenTwoNodes() throws Exception {
    // Acute renal failure: 3814964
    // Acute respiratory failure: 5355868
    double d11 = findDistance(3814964, CONCEPTS_FOR_ACUTE_RENAL_FAILURE_DOC);
    double d12 = findDistance(5355868, CONCEPTS_FOR_ACUTE_RENAL_FAILURE_DOC);
    double d21 = findDistance(3814964, CONCEPTS_FOR_ACUTE_RESPIRATORY_FAILURE_DOC);
    double d22 = findDistance(5355868, CONCEPTS_FOR_ACUTE_RESPIRATORY_FAILURE_DOC);
    System.out.println("d11=" + d11);
    System.out.println("d12=" + d12);
    Assert.assertTrue(d11 < d12);
    System.out.println("d21=" + d21);
    System.out.println("d22=" + d22);
    Assert.assertTrue(d22 < d21);
  }
  
  private double findDistance(int oid, String[][] docterms) 
      throws Exception {
    double pythDist = 0.0D;
    List<String[]> topTerms = getTopTerms(oid, docterms);
    for (String[] topTerm : topTerms) {
      Integer topTermOid = Integer.valueOf(topTerm[0]);
      Integer occurrences = Integer.valueOf(topTerm[2]);
      Path shortestPath = findShortestPath(oid, topTermOid);
      System.out.println(showPath(shortestPath));
      if (shortestPath == null) continue;
      double distance = 0.0D;
      int hops = 1;
      for (Iterator<PropertyContainer> it = shortestPath.iterator(); 
          it.hasNext(); ) {
        PropertyContainer pc = it.next();
        if (pc instanceof Relationship) {
          Long strength = (Long) ((Relationship) pc).getProperty("mrank");
          distance += (occurrences * strength) / Math.pow(hops, 2);
          hops++;
        }
      }
      pythDist += Math.pow(distance, 2);
    }
    return (1.0D / Math.sqrt(pythDist));
  }

  private List<String[]> getTopTerms(int oid, String[][] docterms) {
    List<String[]> topTerms = new ArrayList<String[]>();
    for (String[] docterm : docterms) {
      topTerms.add(docterm);
    }
    Collections.sort(topTerms, new Comparator<String[]>() {
      @Override
      public int compare(String[] term1, String[] term2) {
        Integer count1 = Integer.valueOf(term1[2]);
        Integer count2 = Integer.valueOf(term2[2]);
        return count2.compareTo(count1);
    }});
    if (topTerms.size() > 3) {
      for (String[] topterm : topTerms.subList(0, 3)) {
        System.out.println(StringUtils.join(topterm, ";"));
      }
      return topTerms.subList(0, 3);
    } else {
      for (String[] topterm : topTerms) {
        System.out.println(StringUtils.join(topterm, ";"));
      }
      return topTerms;
    }
  }

  private Path findShortestPath(int oid1, int oid2) throws Exception {
    long nid1 = nodeService.indexService.getNid(oid1);
    long nid2 = nodeService.indexService.getNid(oid2);
    Node node1 = nodeService.graphService.getNodeById(nid1);
    Node node2 = nodeService.graphService.getNodeById(nid2);
    RelationshipExpander expander = Traversal.expanderForAllTypes();
    PathFinder<Path> finder = GraphAlgoFactory.shortestPath(expander, 5);
    Iterable<Path> paths = finder.findAllPaths(node1, node2);
    // these are the shortest path(s) in terms of number of hops
    // now we need to find the most likely path based on the 
    // sum of the rank of relationships
    Path bestPath = null;
    Long maxStrength = 0L;
    for (Path path : paths) {
      Long strength = 0L;
      for (Iterator<PropertyContainer> it = path.iterator(); it.hasNext(); ) {
        PropertyContainer pc = it.next();
        if (pc instanceof Relationship) {
          strength += (Long) ((Relationship) pc).getProperty("mrank"); 
        }
      }
      if (strength > maxStrength) {
        maxStrength = strength;
        bestPath = path;
      }
    }
    return bestPath;
  }

  private String showPath(Path path) {
    if (path == null) return "NONE";
    StringBuilder buf = new StringBuilder();
    for (Iterator<PropertyContainer> it = path.iterator(); it.hasNext(); ) {
      PropertyContainer pc = it.next();
      if (pc instanceof Node) {
        Node npc = (Node) pc;
        buf.append((String) npc.getProperty("pname")).
          append("(").
          append((Integer) npc.getProperty("oid")).
          append(")");
      } else if (pc instanceof Relationship) {
        Relationship rpc = (Relationship) pc;
        buf.append("--(").
          append(rpc.getType().name()).
          append("[").
          append((Long) rpc.getProperty("mrank")).
          append("])-->");
      }
    }
    return buf.toString();
  }
}

Based on the results of the test, the approach appears to be sound. As the table below shows, each sense of ARF scores a lower distance in the document about that sense than in the other document: Acute Renal Failure scores lower in the Acute Renal Failure document than in the Acute Respiratory Failure document, and vice versa. So the metric provides a good indication of the correct sense of a homonym term from its context. Of course, a lot depends on the quality of the mapping and the taxonomy.

                          Doc (Acute Renal Failure)   Doc (Acute Resp Failure)
------------------------------------------------------------------------------
Acute Renal Failure       0.003245729458981506        0.003333949158063146
Acute Resp. Failure       0.013424935407337267        0.007470294606214862

Conclusion

The nice thing about this approach is that it can be applied to all the types of homonyms we encounter, and needs no extra data to be maintained apart from the taxonomy relations, which is what we are doing already. While I focus mainly on the text mapping here, this approach could be adapted for the search side as well. In case of search, the general approach is to provide the user with a list of possibilities and ask him/her to refine the search, providing results for the "most likely" possibility. If the search application "knew" the user's search history (like Google+ does now), it could provide results for the "most likely" possibility based on the user's search history (using the same Pythagorean distance approach) instead.

Friday, November 04, 2011

Resume Management with XmlResume, Python and OpenOffice

A couple of weeks ago, while aimlessly surfing the web (a relatively rare occurrence for me nowadays, thanks to Google), I came across someone's resume whose format I liked. At the bottom of the page, it said "Generated by XmlResume". That got me curious about what it was and how it could help me, so I decided to check it out.

With XmlResume, you write your resume out once in a standard XML format, and XmlResume can convert this XML into plain text, HTML and PDF. You can also filter out specific sections of the resume by setting an optional target attribute on any of the elements in the XML. So simple and elegant, yet such a powerful idea.

The last time I was looking for a job, the trend was to send out plain text resumes, which you would drop into the body of an email. Before that, it was PDF attachments. Apparently the trend now is to send them out as Microsoft Word attachments. Sadly XmlResume cannot write out MS-Word docs, and even the text format it writes is not exactly what I am used to (it's formatted with margins and newlines to look almost like a Word or PDF doc, requiring extensive reformatting if I decided to send it in the body of an email).

I did take a quick look at the code, but decided it would be too much work to modify it to suit my requirements (MS-Word and email friendly text output). Thinking about this some more, I figured that if I could convert the XML into an OpenOffice text document (ODT), OpenOffice could then convert the ODT into a multitude of formats, including plain text, XHTML, PDF and Microsoft Word formats.

Since XmlResume is a Java application, I initially thought about adding this as an extension to it using the jOpenDocument library, but then found the odfpy Python library. Both of these are wrappers to write your content out into the OpenDocument format (ODF), which is basically just a zipped set of XML files. Since this was something that I would want to just run from the command line, writing the whole thing in Python seemed to be a simpler alternative than messing with Ant targets or shell script wrappers.

So I wrote a little Python script that parses the input XmlResume XML file into a bean using the XML parsing library elementtree, then converts the bean either to a plain text document (initially for testing) using plain file.write() calls or to an OpenOffice text document (.odt) using odfpy. Here it is:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import sys

from elementtree.ElementTree import parse
import getopt
from odf.opendocument import OpenDocumentText
from odf.style import FontFace
from odf.style import ListLevelProperties
from odf.style import ParagraphProperties
from odf.style import Style
from odf.style import TextProperties
from odf.text import List
from odf.text import ListItem
from odf.text import ListLevelStyleBullet
from odf.text import ListStyle
from odf.text import P
from odf.text import Span
import string


class ResumeModel:

  def __init__(self):
    self.name = None
    self.address = None
    self.phone = None
    self.email = None
    self.contacts = []
    self.objective_title = None
    self.objectives = []
    self.skillarea_title = None
    self.skillset_titles = []
    self.skillsets = [[]]
    self.jobs_title = None
    self.job_titles = []
    self.job_employers = []
    self.job_periods = []
    self.job_descriptions = []
    self.job_achievements = [[]]
    self.academics_title = None
    self.academics = []
    self.awards_title = None
    self.awards = []

  def to_string(self):
    print "name=", self.name
    print "address=", self.address
    print "phone=", self.phone
    print "email=", self.email
    for contact in self.contacts:
      print "contact=", contact
    print "objective_title", self.objective_title
    for objective in self.objectives:
      print "objective=", objective
    print "skillarea_title=", self.skillarea_title
    for skillset_title in self.skillset_titles:
      print "skillset_title:", skillset_title
    for skillset in self.skillsets:
      print "skillset=", ",".join(skillset)
    print "jobs_title=", self.jobs_title
    for job_title in self.job_titles:
      print "job_title=", job_title
    for job_employer in self.job_employers:
      print "job_employer=", job_employer
    for job_description in self.job_descriptions:
      print "job_description=", job_description
    for job_period in self.job_periods:
      print "job_period=", job_period
    for job_achievement in self.job_achievements:
      for job_achievement_item in job_achievement:
        print "achievement_item=", job_achievement_item
    print "academics_title", self.academics_title
    for academic in self.academics:
      print "academic=", academic
    print "awards_title=", self.awards_title
    for award in self.awards:
      print "award=", award


class XmlResumeParser():

  def __init__(self, input_file, target):
    self.target = target
    self.input = open(input_file, "r")
    self.root = parse(input_file).getroot()
    self.breadcrumb = []
    self.model = ResumeModel()
    self.skillset_idx = -1
    self.job_idx = -1
    self.degree_idx = -1
    self.award_idx = -1

  def close(self):
    self.input.close()

  def parse(self):
    self.parse_r(self.root)
    
  def parse_r(self, parent):
    if not self.process_target(parent, self.target):
      return
    self.breadcrumb.append(parent.tag)
    self.process_element(parent)
    for child in list(parent):
      self.parse_r(child)
    self.breadcrumb.pop()

  def process_target(self, parent, target):
    target_attr = parent.attrib.get("target")
    if target is None:
      if target_attr is None:
        return True
      else:
        return False
    else:
      if target_attr is None:
        return True
      else:
        if target.find('+') > -1 or target.find(',') > -1:
          op = "+" if target.find('+') > -1 else ","
          target_set = set(target.split(op))
          target_attr_set = set(target_attr.split(op))
          if target.find('+') > -1:
            return True if len(target_set.intersection(target_attr_set)) \
              == len(target_set) else False
          else:
            return True if len(target_set.intersection(target_attr_set)) > 0 \
              else False
        else:
          return True if target_attr == target else False

  def process_element(self, elem):
    key = "/".join(self.breadcrumb)
    tag = elem.tag
    last_tag = self.breadcrumb[-1:][0]
    if key.startswith("resume/header/name/"):
      self.model.name = self.append(self.model.name, elem.text)
    elif key.startswith("resume/header/address/"):
      if tag == "street":
        self.model.address = elem.text
      elif tag == "city" or tag == "state":
        self.model.address = self.append(self.model.address, elem.text, ", ")
      elif tag == "zip":
        self.model.address = self.append(self.model.address, elem.text, " ")
    elif key.startswith("resume/header/contact/"):
      if tag == "phone":
        self.model.phone = "PHONE: " + elem.text
      elif tag == "email":
        self.model.email = "EMAIL: " + elem.text
      else:
        self.model.contacts.append(string.upper(elem.tag) + ": " + elem.text)
    elif key == "resume/objective":
      self.model.objective_title = self.get_title(elem)
    elif key.startswith("resume/objective/"):
      self.model.objectives.append(elem.text)
    elif key == "resume/skillarea":
      self.model.skillarea_title = self.get_title(elem)
    elif key == "resume/skillarea/skillset":
      self.skillset_idx = self.skillset_idx + 1
      self.model.skillset_titles.append(self.get_title(elem))
      self.model.skillsets.append([])
    elif key == "resume/skillarea/skillset/skill":
      if elem.attrib.get("level") != None:
        self.model.skillsets[self.skillset_idx].append(elem.text +
          " (" + elem.attrib.get("level") + ")")
      else:
        self.model.skillsets[self.skillset_idx].append(elem.text)
    elif key == "resume/history":
      self.model.jobs_title = self.get_title(elem)
    elif key == "resume/history/job":
      self.job_idx = self.job_idx + 1
      self.model.job_achievements.append([])
    elif key.startswith("resume/history/job/"):
      if tag == "jobtitle":
        self.model.job_titles.append(elem.text)
      elif tag == "employer":
        self.model.job_employers.append(elem.text)
      elif tag == "from":
        if len(list(elem)) == 1:
          date_from = self.format_date(list(elem)[0])
          self.model.job_employers[self.job_idx] = \
            self.model.job_employers[self.job_idx] + " (" + date_from
      elif tag == "to":
        if len(list(elem)) == 1:
          date_to = self.format_date(list(elem)[0])
          self.model.job_employers[self.job_idx] = \
            self.model.job_employers[self.job_idx] + " - " + date_to + ")"
      elif tag == "description":
        self.model.job_descriptions.append(elem.text)
      elif tag == "achievement":
        self.model.job_achievements[self.job_idx].append(elem.text)
    elif key == "resume/academics":
      self.model.academics_title = self.get_title(elem)
    elif key == "resume/academics/degrees/degree":
      self.degree_idx = self.degree_idx + 1
      self.model.academics.append([])
    elif key.startswith("resume/academics/degrees/degree/"):
      if tag == "level":
        self.model.academics[self.degree_idx] = elem.text
      elif tag == "major":
        self.model.academics[self.degree_idx] = \
          self.model.academics[self.degree_idx] + ", " + elem.text
      elif tag == "institution":
        self.model.academics[self.degree_idx] = \
          self.model.academics[self.degree_idx] + " from " + elem.text
      elif tag == "from":
        # fixed misplaced parenthesis; append the formatted date, not elem.text
        if len(list(elem)) == 1:
          from_date = self.format_date(list(elem)[0])
          self.model.academics[self.degree_idx] = \
            self.model.academics[self.degree_idx] + " (" + from_date
      elif tag == "to":
        if len(list(elem)) == 1:
          to_date = self.format_date(list(elem)[0])
          self.model.academics[self.degree_idx] = \
            self.model.academics[self.degree_idx] + " - " + to_date + ")"
    elif key == "resume/awards":
      self.model.awards_title = self.get_title(elem)
    elif key == "resume/awards/award":
      self.award_idx = self.award_idx + 1
      self.model.awards.append([])
    elif key.startswith("resume/awards/award/"):
      if tag == "title":
        self.model.awards[self.award_idx] = elem.text
      elif tag == "organization":
        self.model.awards[self.award_idx] = \
          self.model.awards[self.award_idx] + " from " + elem.text
      elif tag == "date":
        award_date = self.format_date(elem)
        self.model.awards[self.award_idx] = \
          self.model.awards[self.award_idx] + " (" + award_date + ")"

  def format_date(self, elem):
    if elem.tag != "date":
      return elem.tag
    dmy = ["", "", ""]
    for child in list(elem):
      if child.tag == "day":
        dmy[0] = child.text
      elif child.tag == "month":
        dmy[1] = child.text
      elif child.tag == "year":
        dmy[2] = child.text
      else:
        continue
    filtered_dmy = filter(lambda e : len(e) > 0, dmy)
    if len(filtered_dmy) > 0:
      return " ".join(filtered_dmy)

  def get_title(self, elem):
    title = elem.attrib.get("title")
    if title is None:
      return string.upper(elem.tag)
    else:
      return title

  def append(self, buf, str, sep=" "):
    if buf == None:
      buf = str
    else:
      buf = buf + sep + str
    return buf


class TextResumeWriter():

  def __init__(self, filename):
    self.file = open(filename, 'w')

  def write(self, model):
    self.writeln(model.name)
    self.writeln(model.address)
    self.writeln(", ".join([model.phone, model.email]))
    self.writeln(", ".join(model.contacts))
    self.writeln("-" * 80)
    self.writeln(model.objective_title)
    self.writeln()
    self.writeln("\n".join(model.objectives))
    self.writeln("-" * 80)
    self.writeln(model.skillarea_title)
    self.writeln()
    for i in range(0, len(model.skillset_titles)):
      self.writeln(model.skillset_titles[i] + ": " + ",".join(model.skillsets[i]))
    self.writeln("-" * 80)
    self.writeln(model.jobs_title)
    for i in range(0, len(model.job_titles)):
      self.writeln()
      self.writeln(model.job_titles[i])
      self.writeln(model.job_employers[i])
      self.writeln(model.job_descriptions[i])
      for achievement in model.job_achievements[i]:
        self.writeln("* " + achievement)
    self.writeln("-" * 80)
    self.writeln(model.academics_title)
    self.writeln()
    for academic in model.academics:
      self.writeln("* " + academic)
    self.writeln("-" * 80)
    self.writeln(model.awards_title)
    self.writeln()
    for award in model.awards:
      self.writeln("* " + award)
      
  def writeln(self, s=None):
    if s != None:
      self.file.write(s)
    self.file.write("\n")
    
  def close(self):
    self.file.close()


class OdfResumeWriter():

  def __init__(self, filename):
    self.filename = filename
    self.doc = OpenDocumentText()
    # font
    self.doc.fontfacedecls.addElement((FontFace(name="Arial", \
      fontfamily="Arial", fontsize="10", fontpitch="variable", \
      fontfamilygeneric="swiss")))
    # styles
    style_standard = Style(name="Standard", family="paragraph", \
      attributes={"class":"text"})
    style_standard.addElement(ParagraphProperties(punctuationwrap="hanging", \
      writingmode="page", linebreak="strict"))
    style_standard.addElement(TextProperties(fontname="Arial", \
      fontsize="10pt", fontsizecomplex="10pt", fontsizeasian="10pt"))
    self.doc.styles.addElement(style_standard)
    # automatic styles
    style_normal = Style(name="ResumeText", parentstylename="Standard", \
        family="paragraph")
    self.doc.automaticstyles.addElement(style_normal)

    style_bold_text = Style(name="ResumeBoldText", parentstylename="Standard", \
        family="text")
    style_bold_text.addElement(TextProperties(fontweight="bold", \
      fontweightasian="bold", fontweightcomplex="bold"))
    self.doc.automaticstyles.addElement(style_bold_text)

    # name must match the stylename used by the List elements in write()
    style_list_text = ListStyle(name="ResumeTextList")
    style_list_bullet = ListLevelStyleBullet(level="1", \
      stylename="ResumeListTextBullet", numsuffix=".", bulletchar=u'\u2022')
    style_list_bullet.addElement(ListLevelProperties(spacebefore="0.1in", \
      minlabelwidth="0.2in"))
    style_list_text.addElement(style_list_bullet)
    self.doc.automaticstyles.addElement(style_list_text)

    style_bold_para = Style(name="ResumeH2", parentstylename="Standard", \
      family="paragraph")
    style_bold_para.addElement(TextProperties(fontweight="bold", \
      fontweightasian="bold", fontweightcomplex="bold"))
    self.doc.automaticstyles.addElement(style_bold_para)

    style_bold_center = Style(name="ResumeH1", parentstylename="Standard", \
        family="paragraph")
    style_bold_center.addElement(TextProperties(fontweight="bold", \
      fontweightasian="bold", fontweightcomplex="bold"))
    style_bold_center.addElement(ParagraphProperties(textalign="center"))
    self.doc.automaticstyles.addElement(style_bold_center)

  def write(self, model):
    self.doc.text.addElement(P(text=model.name, stylename="ResumeH1"))
    self.doc.text.addElement(P(text=model.address, stylename="ResumeH1"))
    self.doc.text.addElement(P(text=", ".join([model.phone, model.email]), \
      stylename="ResumeH1"))
    for contact in model.contacts:
      self.doc.text.addElement(P(text=contact, stylename="ResumeH1"))
    self.nl()
    self.doc.text.addElement(P(text=model.objective_title, \
      stylename="ResumeH1"))
    self.nl()
    for objective in model.objectives:
      self.doc.text.addElement(P(text=objective, stylename="ResumeText"))
    self.nl()
    self.doc.text.addElement(P(text=model.skillarea_title, \
      stylename="ResumeH1"))
    self.nl()
    for i in range(0, len(model.skillset_titles)):
      skillset_line = P(text="")
      skillset_line.addElement(Span(text=model.skillset_titles[i], \
        stylename="ResumeBoldText"))
      skillset_line.addElement(Span(text=": ", stylename="ResumeBoldText"))
      skillset_line.addText(", ".join(model.skillsets[i]))
      self.doc.text.addElement(skillset_line)
    self.nl()
    self.doc.text.addElement(P(text=model.jobs_title, stylename="ResumeH1"))
    for i in range(0, len(model.job_titles)):
      self.nl()
      self.doc.text.addElement(P(text=model.job_titles[i], \
        stylename="ResumeH2"))
      self.doc.text.addElement(P(text=model.job_employers[i], \
        stylename="ResumeH2"))
      self.doc.text.addElement(P(text=model.job_descriptions[i], \
        stylename="ResumeText"))
      achievements_list = List(stylename="ResumeTextList")
      for achievement in model.job_achievements[i]:
        achievements_listitem = ListItem()
        achievements_listitem.addElement(P(text=achievement, \
          stylename="ResumeText"))
        achievements_list.addElement(achievements_listitem)
      self.doc.text.addElement(achievements_list)
    self.nl()
    self.doc.text.addElement(P(text=model.academics_title, \
      stylename="ResumeH1"))
    academics_list = List(stylename="ResumeTextList")
    for academic in model.academics:
      academics_listitem = ListItem()
      academics_listitem.addElement(P(text=academic, stylename="ResumeText"))
      academics_list.addElement(academics_listitem)
    self.doc.text.addElement(academics_list)
    self.nl()
    self.doc.text.addElement(P(text=model.awards_title, stylename="ResumeH1"))
    awards_list = List(stylename="ResumeTextList")
    for award in model.awards:
      awards_listitem = ListItem()
      awards_listitem.addElement(P(text=award, stylename="ResumeText"))
      awards_list.addElement(awards_listitem)
    self.doc.text.addElement(awards_list)
    self.nl()

  def nl(self):
    self.doc.text.addElement(P(text="\n", stylename="ResumeText"))

  def close(self):
    self.doc.save(self.filename)


def usage(msg=None):
  if msg:
    print "ERROR: %s" % (msg)
  print "Usage: %s -i input.xml -o output_file [-t target]" % (sys.argv[0])
  print "OPTIONS:"
  print "-i | --input  : input resume.xml file"
  print "-o | --output : output file name. Suffix dictates output format"
  print "              : supported formats (txt, odt)"
  print "-t | --target : filters elements for target if specified"
  print "              : (optional, default is None)"
  print "-h | --help   : print this message"
  sys.exit(2)

def get_writer(output):
  output_format = output.split(".")[-1:][0]
  if output_format == "txt":
    return TextResumeWriter(output)
  elif output_format == "odt":
    return OdfResumeWriter(output)
  else:
    return None

def main():
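  try:
    # long options that take a value need a trailing "="
    (opts, args) = getopt.getopt(sys.argv[1:], "i:o:t:h",
      ["input=", "output=", "target=", "help"])
  except getopt.GetoptError as e:
    usage(str(e))
  if len(opts) == 0:
    usage()
  # initialize so the mandatory-argument check below cannot raise NameError
  input = None
  output = None
  target = None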
  for opt in opts:
    (key, value) = opt
    if key in ("-h", "--help"):
      usage()
    elif key in ("-i", "--input"):
      input = value
    elif key in ("-o", "--output"):
      output = value
    elif key in ("-t", "--target"):
      target = value
  if input is None or output is None:
    usage("Input and Output are mandatory")
  writer = get_writer(output)
  if writer is None:
    usage("Unsupported output format")
  parser = XmlResumeParser(input, target)
  parser.parse()
  writer.write(parser.model)
  parser.close()
  writer.close()

if __name__ == "__main__":
  main()

You call this from the command line using something like this:

sujit@cyclone:resume$ ./genresume.py --input your_resume.xml \
    --output your_resume.[txt|odt] \
    [--target="target1+target2+...|target1,target2,..."]

Specifying an output file with suffix .txt will create a text version of the resume (suitable for dropping into the body of an email as mentioned above), and specifying an .odt suffix will create an OpenOffice text document. I had initially meant for the text version to go away once I was done, but then found that OpenOffice does not do the ODT to text conversion correctly (it misses the bullets in list items).

The behavior of the target attribute is similar to that in XmlResume. Multiple targets can be specified, separated by plus or comma. If the separator is plus, all targets must be declared in the element for it to pass through the filter (AND filtering). If the separator is comma, any one of the targets needs to be declared in the XmlResume element for it to pass through the filter (OR filtering). In addition, elements with no target attribute are always passed through the filter.
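The filtering rules described above can be condensed into a small stand-alone function. This is a sketch of my reading of those rules rather than a drop-in replacement for the script's process_target method; the function name and the sample target names are hypothetical, and like the script it assumes the element's target attribute uses the same separator as the requested target.

```python
def passes_filter(requested, declared):
    """Decide whether an element survives the target filter.

    requested is the --target value from the command line (or None),
    declared is the element's target attribute (or None).
    """
    if declared is None:
        return True  # untargeted elements always pass
    if requested is None:
        return False  # targeted element, but no target was requested
    if "+" in requested:
        # AND filtering: every requested target must be declared
        return set(requested.split("+")) <= set(declared.split("+"))
    if "," in requested:
        # OR filtering: at least one requested target must be declared
        return bool(set(requested.split(",")) & set(declared.split(",")))
    return requested == declared


print(passes_filter("web+mgmt", "web+mgmt+java"))
print(passes_filter("web,mgmt", "java"))
```

The first call asks for both web and mgmt on an element that declares both, so it passes; the second asks for either web or mgmt on an element that declares neither, so it is filtered out.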

One caveat - this is not a generic solution. That is, if you were planning on running this script against your own XmlResume XML resume, it very likely won't work the way you'd expect. While I have tried to model my own resume on others that I have seen in my industry (thereby making it somewhat standards-compliant), it is quite possible that your resume contains extra information or elements that I don't need and haven't handled. But if you know a bit of Python, it should be fairly easy to modify this script to come up with something that works for you.

For reference (to match up with the parsing code above), here is a RELAX-NG like definition of the portion of the XmlResume schema that I have used in my resume.

resume {
  header {
    name { firstname, surname },
    address { street, city, state, zip },
    contact { phone, email, * }
  },
  objective {
    @title, para+
  },
  skillarea {
    @title,
    skillset { 
      @title, 
      skill { @level }+ 
    }+
  },
  history {
    @title,
    job {
      jobtitle, employer, period {
        from { date { year, month, day } },
        to { present | date { year, month, day } }
      },
      description,
      achievements { achievement+ }
    }+
  },
  academics {
    @title,
    degrees {
      degree { level, major, institution }+
    }
  },
  awards {
    @title,
    award { title, organization, date { year } }+
  }
}

In terms of programming challenges, my elementtree knowledge was quite rusty and I hadn't used odfpy before this, but there are enough examples on the web to get you started on either one. I initially tried a pure event-based parsing approach, since I needed to allow filtering on any element that had the target attribute, but then settled on a hybrid approach where I parse all elements in a generic way but save the text and attributes of some of them into a model bean, which is then written out in a specific format by the text and ODT writers. This approach makes the code easier to maintain and extend (at least for me).
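To make the hybrid idea concrete, here is a minimal sketch using the standard library ElementTree. It is not the parser from the script: the build_model function, the CAPTURED_TAGS set and the dict-of-lists model are all hypothetical stand-ins, and the keep callback stands in for whatever target filtering is applied.

```python
import xml.etree.ElementTree as ET

# Hypothetical choice of tags whose text gets saved into the model.
CAPTURED_TAGS = {"firstname", "surname", "jobtitle", "employer"}


def build_model(xml_text, keep=lambda elem: True):
    """Walk every element generically, saving only selected tags.

    keep: callback deciding whether an element (and its subtree)
          passes the filter; by default everything passes.
    Returns a dict mapping tag name to a list of text values.
    """
    model = {}

    def visit(elem):
        if not keep(elem):
            return  # filtered out: skip the element and its whole subtree
        if elem.tag in CAPTURED_TAGS:
            model.setdefault(elem.tag, []).append((elem.text or "").strip())
        for child in elem:
            visit(child)

    visit(ET.fromstring(xml_text))
    return model
```

Because the walk is generic, the filter applies uniformly to any element carrying a target attribute, while only the handful of captured tags need special handling. The writers then only ever see the model, not the XML.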

Just writing out the text into the ODT using odfpy was fairly simple, but it took me a while to get the formatting (font size, bolding, centering, etc) right. The API documentation supplied with the odfpy distribution is not very useful. I ultimately wrote out an unformatted document, manually applied formatting to it, unzipped the resulting ODT file and pored through styles.xml and content.xml to find the correct parameters to pass to the various odfpy functions. Once you know what to pass it, though, it works like a charm.
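Since an ODT file is a zip archive, the inspection step can also be done from Python instead of unzipping by hand. The sketch below uses only the standard zipfile module; the helper names are hypothetical, and the file name in the usage comment is just a placeholder.

```python
import zipfile


def list_odt_members(odt_path):
    """Return the names of all files packed inside an ODT document."""
    with zipfile.ZipFile(odt_path) as odt:
        return odt.namelist()


def read_odt_member(odt_path, member):
    """Return the XML text of one member, e.g. content.xml or styles.xml."""
    with zipfile.ZipFile(odt_path) as odt:
        return odt.read(member).decode("utf-8")


# Example usage (placeholder file name):
#   print(list_odt_members("formatted_resume.odt"))
#   print(read_odt_member("formatted_resume.odt", "content.xml"))
```

Comparing the output for the unformatted and the manually formatted document shows which style names and properties the formatting added.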

I believe the solution I have now works better for me than just using XmlResume. First, like XmlResume, it allows me to maintain a single XML file for my resume, with information targeted to different job groups or industries I might be interested in. Second, by using OpenOffice as an intermediate output, it gives me the option of automatically writing it out into multiple formats, some of which are either not possible (MS Word) or difficult (PDF) with XmlResume. Third, as a nice side effect, it also supports an email-friendly text format. Fourth, it offers a simple command line interface without any additional effort. The only downside is the need to modify the code if and when I decide to add more elements to my source XML file or modify the output format.

Update 2011-11-16: Found Serna Free, a nice XML editor from Syntext, while looking for something to view larger and more complex compacted XML files at work. These files were malformed, so I could not use either Firefox/Chrome or my Python xmlcat script on them, but Serna opened them without problems. It also provides a stylesheet for XmlResume. Binaries are available for (at least) Linux and Mac OS X. Just putting it here in case it's useful; I will probably continue to hand-edit mine using vim.

