Wednesday, July 01, 2009

Python script to read Maven XML

One thing that I've always liked about the US is its public library system - knowledge is effectively free as long as you spend the effort to acquire it. My impression was based on the Denver Public Library, where there was (at least, as far as I can remember) an entire aisle devoted to computer books. Fast forward a decade and a thousand miles westward, and my local library has half a rack of computer books, mostly targeted at home PC users (you know, Microsoft, for Dummies, etc.). This is in spite of the fact that I alone have donated over 60 books on Java, Python, Perl, Linux, etc. over the last five years or so - apparently, these are not suitable for the library, given that they are considered "manuals" (one possible translation: too technical for the library staff to decide whether they are suitable or not). And we wonder why our kids don't measure up to other states, let alone other countries, in science, math and computing.

But enough ranting... I was actually quite pleasantly surprised last week to find a book in there that I wanted to read - The Productive Programmer by Neal Ford. I am told that I am fairly productive, so it was more curiosity than an urge for self-improvement that drove me to read it - ok, so there was a bit of that as well. The book contains the usual suspects of course, such as knowing your IDE, using the command line effectively, scripting and (programming language) polyglotism, which most of us are aware of. But I learned quite a few new things from the book as well. Overall, a very useful set of tips and techniques - read it if you get a chance.

Why do I even bring this up? Well, a commenter pointed out that my code was calling classes in the commons-math library that do not (yet) exist in the GA distribution, and asked me to put all my JAR files someplace in the jtmt project's SVN repository. Now, I understand and empathize with his problem, but I had been resisting even creating a public project because I am too lazy/time-challenged to maintain it, and this request meant work that I had been hoping to avoid. However, as the book says, script the stuff that is mundane and boring, and who knows, you may actually learn something new and interesting. That is pretty much what happened, which is why I bring it up.

I figured that instead of manually building up the lib directory, it would be fun to write a Python script to parse the dependencies out of the project's pom.xml and automatically copy the JAR files over from my local Maven repository. I use Python quite a bit for scripting work, but so far, the only time I had used it to process XML was when I built a client that talked to Blogger. Once I found that GData had an API that wrapped all the messy XML stuff, I dumped all the XML code (which I was struggling through anyway, because of insufficient understanding/documentation) and just used the API instead.

As Nelson Minar mentions in his blog post XPath, XML, Python, and as Uche Ogbuji notes in his follow-up, Python, like Java, offers multiple (too many?) ways of handling XML and there is no clear winner in all of this. Each seems to be optimized for a part of the solution, so it becomes hard to decide which one to learn. I decided to go with elementtree, because of good reviews from both bloggers, and because I read it was used to power the GData API. It's been a while since the two posts were written, and elementtree seems to have improved its XPath support since then, so that's even better news for me.
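To give a feel for the XPath subset, here is a minimal sketch. It assumes Python 3, where ElementTree ships in the standard library as xml.etree.ElementTree (not the standalone elementtree package used in the script further down), and it runs against a made-up, namespace-free XML snippet that exists only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free snippet used only for illustration.
XML = """
<project>
  <dependencies>
    <dependency><artifactId>commons-math</artifactId><scope>compile</scope></dependency>
    <dependency><artifactId>junit</artifactId><scope>test</scope></dependency>
  </dependencies>
</project>
"""

root = ET.fromstring(XML)

# ".//tag" finds matching elements at any depth below the root.
all_deps = [d.findtext("artifactId") for d in root.findall(".//dependency")]

# "[child='text']" keeps only elements that have a child with exactly that text.
test_deps = [d.findtext("artifactId")
             for d in root.findall(".//dependency[scope='test']")]

print(all_deps)
print(test_deps)
```

This is only a subset of XPath, but it covers the "jump down to a tag and read its children" style of access that a script like this needs.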

So, anyway, here's the code. Most of it is basically a shell script in Python and uses shutil to do the heavy lifting. The XML parsing is in the parse() method - we use XPath to jump down to the dependency tags, then create a map of each dependency's child tags and their content. Not much, but it's a fairly standard use-case. I poked around a bit more in the elementtree documentation, and it seems to be a nice, clean API - looks like it could be for Python what JDOM was to Java.

#!/usr/bin/python
# Source: lib/pom2lib.py
# Simple python script to read the pom.xml and copy over JAR files from
# my local maven repository to a target directory. With -r, it copies
# these files from the target directory back into a local maven repository
# on a target machine.
# Usage Examples:
# 1) to copy from local M2_REPO to the jtmt lib directory:
# ./pom2lib.py -s /home/sujit/src/jtmt/pom.xml /home/sujit/src/jtmt/lib
# 2) to copy from jtmt lib directory to the local M2_REPO:
# ./pom2lib.py -r /home/sujit/src/jtmt/pom.xml /home/sujit/src/jtmt/lib
# Remember to set the M2_REPO appropriately for your system.
#
import sys
import getopt
import os.path
import os
from elementtree import ElementTree
import string
import shutil

POM_NS = "{http://maven.apache.org/POM/4.0.0}"
M2_REPO = "/home/sujit/.m2/repository"

def usage(error=None):
  if (error != None):
    print "ERROR: %s" % (error)
  print "Usage:"
  print "pom2lib.py [-s|-r|-h] pom_file lib_dir"
  print "where:"
  print "-s|--store - copy jar files from your m2 repo to target"
  print "-r|--retrieve - copy jar files from target to your m2 repo"
  print "-h|--help - print this message"
  print "pom_file - the full path to the pom.xml file"
  print "lib_dir - the full path to the jar directory"
  sys.exit(-1)

def contains(opts, patterns):
  return len(filter(lambda x: x[0] in patterns, opts)) == 1

def buildPath(props):
  """
  Return a pair containing the absolute file names of the jar file and
  the corresponding src-jar file in the user's M2_REPO. There is no check
  at this stage to verify that the src-jar exists (it may not in some cases)
  """
  groupId = props["%sgroupId" % (POM_NS)]
  artifactId = props["%sartifactId" % (POM_NS)]
  version = props["%sversion" % (POM_NS)]
  jarpath = os.path.join(M2_REPO,
    string.replace(groupId, ".", os.sep),
    artifactId,
    version,
    "".join([artifactId, "-", version, ".jar"]))
  srcJarpath = os.path.join(M2_REPO,
    string.replace(groupId, ".", os.sep),
    artifactId,
    version,
    "".join([artifactId, "-", version, "-sources.jar"]))
  return (jarpath, srcJarpath)
  
def parse(pom):
  """
  Parses the POM to get a list of file path pairs returned by buildPath()
  """
  paths = []
  tree = ElementTree.parse(pom)
  for dependency in tree.findall("//%sdependency" % (POM_NS)):
    props = {}
    for element in dependency:
      props[element.tag] = element.text
    paths.append(buildPath(props))
  return paths

def copyToLib(pathpairs, libdir):
  """
  Copies the jar file and the source jar file (if it exists) into the
  libdir.
  """
  if (not os.path.exists(libdir)):
    print "mkdir -p %s" % (libdir)
    os.makedirs(libdir)
  for (jar, srcjar) in pathpairs:
    print "cp %s %s" % (jar, libdir)
    shutil.copy(jar, libdir)
    if (os.path.exists(srcjar)):
      print "cp %s %s" % (srcjar, libdir)
      shutil.copy(srcjar, libdir)

def copyFromLib(libdir, pathpairs):
  """
  Copy the jars from libdir to the local M2_REPO of a target machine.
  """
  for (jarpath, srcJarpath) in pathpairs:
    # the jar and its source jar live in the same repository directory
    targetdir = os.path.dirname(jarpath)
    for path in (jarpath, srcJarpath):
      src = os.path.join(libdir, os.path.basename(path))
      if (os.path.exists(src)):
        # create the target directory on demand, even if only the
        # source jar is present in libdir
        if (not os.path.exists(targetdir)):
          print "mkdir -p %s" % (targetdir)
          os.makedirs(targetdir)
        print "cp %s %s" % (src, targetdir)
        shutil.copy(src, targetdir)

def main():
  """
  Input validation and dispatch to the appropriate method.
  """
  try:
    (opts, args) = getopt.getopt(sys.argv[1:], "srh",
      ["store", "retrieve", "help"])
  except getopt.GetoptError:
    usage()
  if (contains(opts, ("-h", "--help"))):
    usage()
  if (len(args) != 2):
    usage("Lib directory and/or POM path must be specified")
  if (not os.path.exists(args[0])):
    usage("POM file not found: %s" % (args[0]))
  if (contains(opts, ("-s", "--store"))):
    copyToLib(parse(args[0]), args[1])
  if (contains(opts, ("-r", "--retrieve"))):
    if (not os.path.exists(args[1])):
      usage("Lib directory not found: %s" % (args[1]))
    copyFromLib(args[1], parse(args[0]))
  
if __name__ == "__main__":
  main()
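
For anyone reading this on Python 3, the parse() and buildPath() steps above can be sketched roughly as follows. This is not the script I ran: it assumes the standard library's xml.etree.ElementTree instead of the standalone elementtree package, the function names parse_dependencies and jar_path are made up for the sketch, and the sample POM and its coordinates are purely illustrative.

```python
import os
import xml.etree.ElementTree as ET

# Maven POM namespace, same value as POM_NS in the script above.
NS = {"m": "http://maven.apache.org/POM/4.0.0"}

def parse_dependencies(pom_xml):
    """Return a (groupId, artifactId, version) tuple per <dependency>."""
    root = ET.fromstring(pom_xml)
    deps = []
    for dep in root.findall(".//m:dependency", NS):
        deps.append((
            dep.findtext("m:groupId", namespaces=NS),
            dep.findtext("m:artifactId", namespaces=NS),
            # None if the POM leaves the version to a parent POM
            dep.findtext("m:version", namespaces=NS),
        ))
    return deps

def jar_path(repo, group_id, artifact_id, version):
    """Build the jar location using the local repository layout:
    dots in the groupId become directories, then artifactId/version."""
    return os.path.join(repo, *group_id.split("."), artifact_id, version,
                        "%s-%s.jar" % (artifact_id, version))

# Illustrative POM fragment; the coordinates are an assumption, not
# copied from the jtmt project.
SAMPLE_POM = """
<project xmlns="http://maven.apache.org/POM/4.0.0">
  <dependencies>
    <dependency>
      <groupId>org.apache.commons</groupId>
      <artifactId>commons-math</artifactId>
      <version>2.0</version>
    </dependency>
  </dependencies>
</project>
"""

deps = parse_dependencies(SAMPLE_POM)
paths = [jar_path("/tmp/m2repo", g, a, v) for (g, a, v) in deps if v is not None]
print(paths)
```

Passing a prefix map to findall() and findtext() avoids pasting the "{namespace}" string in front of every tag name, which is the part of the original script I find least readable. Dependencies without an explicit version are simply skipped here, since their path cannot be built from the POM alone.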

It took me about an hour to actually write this script. It would have taken me about 10-15 minutes to do the copying manually. However, now that I am committed to maintaining the JAR files, I will run the script every time I add/delete a JAR from my POM, so the effort is probably well worth it. The nice byproduct is that I ended up learning how to use the elementtree API, giving me a way to do XML processing in Python.

2 comments (moderated to prevent spam):

kenliu said...

This approach to copying dependencies will not work if some of the dependencies declared in the POM have transitive dependencies. You really want to instead parse the output of the dependency plugin to get a list of all of the dependencies including the transitive ones.

Sujit Pal said...

Hi KenLiu, good suggestion, I wrote something similar to convert from POM to a shell script, and was thrilled when someone suggested the maven assembly plugin :-). But your suggestion is a good one...I will check it out.