Salmon Run: Python SQL Runner

Recently, I have been working quite a bit with our taxonomy database, using the data in it as input to some data mining programs I am working on. Initially, I built in the database call into the programs themselves, but this approach turned out to be inconvenient for a number of reasons. One of these reasons is that the program is harder to restart in case of failures. Having these programs work with input files instead of a database query also results in cleaner and more extensible design, since now they can work with inputs other than SQL.

The databases I work with are MySQL and Oracle. The last time I needed this sort of stuff, I would whip up a Python script on-demand using either the MySQLdb or cx-oracle packages. One crashed hard disk and a couple of OS re-installs later, these scripts no longer work because the packages need to be re-installed as well. I have mostly gotten by since the crash using the MyEclipse Database Explorer and cut-n-paste, since the datasets I was working with were smaller, but the size of the data (~ 100K plus rows) I am dealing with now are likely to cause Eclipse to clutch its throat and do a crash-n-burn, taking my other unsaved code along with it, so I needed something that I could use from the command line.

This time around, I planned to do things a bit smarter than I did in the past, so I thought of building a generic script that would dump out rows given any SQL, something similar to my fledgling attempt many, many years ago. Funny how things come full circle from time to time. This time around, I also wanted to make sure its usage survives disk crashes and OS re-installs, so rather than have a script that requires extra dependencies, I wanted to use the JDBC drivers that I was going to install anyway. So my choices boiled down to either Jython or Scala.

I ended up writing it using Jython because my other scripts are all written in Python, and because writing database code in Python is slightly less verbose than in Java (or Scala, since that would have been the same thing). The name of the script is db2csv.py, and the code is shown below:

#!/usr/bin/env /opt/jython-2.2.1/jython
import sys
import getopt
import os.path
import traceback
from com.ziclix.python.sql import zxJDBC

class SqlRunner:
  """
  Class to run the SQL and print it out into the output file as a CSV
  """
  def __init__(self, dbconfig, sqlfile, outputfile, sepchar):
    """
    @param dbconfig the database configuration file name
    @param sqlfile the name of the file containing the SQL to be executed
    @param outputfile the name of the file where output will be written
    @param sepchar the separator character (or string) to use
    """
    self.dbconfig = dbconfig
    self.sqlfile = sqlfile
    self.outputfile = outputfile
    self.sepchar = sepchar

  def getDbProps(self):
    """
    Return the database properties as a map of name value pairs.
    @return a map of name value pairs
    """
    if (not os.path.exists(self.dbconfig)):
      raise Exception("File not found: %s" % (self.dbconfig))
    props = {}
    pfile = open(self.dbconfig, 'rb')
    for pline in pfile:
      (name, value) = pline[:-1].split("=")
      props[name] = value
    pfile.close()
    return props

  def getSql(self):
    """
    Validate the sql file name, and parse out the SQL to be run. The
    method will skip SQL single-line comments. Blocks enclosed by free
    standing multi-line comments are also skipped.
    @return the SQL as a string
    """
    if (not os.path.exists(self.sqlfile)):
      raise Exception("File not found: %s" % (self.sqlfile))
    sql = []
    sfile = open(self.sqlfile, 'rb')
    incomment = False
    for sline in sfile:
      sline = sline.rstrip('\n')
      if (sline.startswith("--") or len(sline.rstrip()) == 0):
        # SQL Comment line, skip
        continue
      if (sline.rstrip() == "/*"):
        # start of SQL comment block
        incomment = True
        continue
      if (sline.rstrip() == "*/"):
        # end of SQL comment block
        incomment = False
        continue
      if (not incomment):
        sql.append(sline)
    sfile.close()
    return " ".join(sql)

  def runSql(self):
    """
    Runs the SQL and prints it out into the specified output file as a CSV
    file delimited by sepchar.
    """
    props = self.getDbProps()
    sql = self.getSql()
    print "Running SQL: %s" % (sql)
    ofile = open(self.outputfile, 'wb')
    db = zxJDBC.connect(props["url"], props["user"], props["password"],
      props["driver"])
    cur = db.cursor(True)
    cur.execute(sql)
    # print the header
    meta = cur.description
    print "Writing output to: %s" % (self.outputfile)
    ofile.write(self.sepchar.join(map(lambda x: x[0], meta)) + "\n")
    for row in cur.fetchall():
      strrow = map(lambda x: str(x), row)
      ofile.write(self.sepchar.join(strrow) + "\n")
    ofile.close()
    cur.close()
    db.close()
    
def usage(error=""):
  """
  Print the usage information. If an error message is supplied, print that
  on top of the usage information.
  """
  if (len(str(error)) > 0):
    print "ERROR: %s" % (error)
    print "STACK TRACE:"
    traceback.print_exc()
  print "USAGE:"
  print "%s -d dbconfig -q queryfile -s sepchar -o outputfile" % (sys.argv[0])
  print "OR: %s -h" % (sys.argv[0])
  print "OPTIONS:"
  print "--dbconfig | -d  : database configuration file"
  print "  configuration file must be in properties format, with the following"
  print "  keys defined: driver, url, user and password"
  print "--queryfile | -q : name of file containing SQL to be run"
  print "--outputfile | -o: name of file where results should be written"
  print "--sep | -s       : the separator character to use in output"
  print "--help | -h      : print this information"
  sys.exit(2)

def extractOptions(argv):
  """
  Extract command line options and return a tuple
  @param argv the sys.argv object
  @return a tuple containing the information for running the SQL
  """
  try:
    (opts, args) = getopt.getopt(argv[1:], "d:q:s:o:h",
      ["dbconfig=", "queryfile=", "sep=", "outputfile=", "help"])
  except getopt.GetoptError:
    usage()
  if (len(filter(lambda x: x[0] in ("-h", "--help"), opts)) == 1):
    usage()
  if (len(opts) != 4):
    usage()
  for opt in opts:
    (key, value) = opt
    if (key in ("-d", "--dbconfig")):
      dbconfig = value
    elif (key in ("-q", "--queryfile")):
      sqlfile = value
    elif (key in ("-o", "--outputfile")):
      outputfile = value
    elif (key in ("-s", "--sep")):
      sepchar = value
    else:
      usage()
  return (dbconfig, sqlfile, outputfile, sepchar)
  
def main():
  """
  This is how we are called
  """
  (dbconfig, sqlfile, outputfile, sepchar) = extractOptions(sys.argv)
  sqlrunner = SqlRunner(dbconfig, sqlfile, outputfile, sepchar)
  try:
    sqlrunner.runSql()
  except Exception, e:
    usage(e)

if __name__ == "__main__":
  main()

As you can see, there is nothing really new here, you can get most of this from the relevant section of the Jython User Guide. I did start to use a bit of Python lambdas, which ironically, I learned on my recent foray into Scala-land.

In the spirit of not having to install anything but the base language, the zJDBC package comes standard with the Jython version (2.2.1) that I am using. Since I already use Java I will install it anyway, as well JDBC drivers for the different databases that I will talk to.

You can call the script using command line parameters as shown in the help output below:

sujit@sirocco:~$ ./db2csv.py -h
USAGE:
./db2csv.py -d dbconfig -q queryfile -s sepchar -o outputfile
OR: ./db2csv.py -h
OPTIONS:
--dbconfig | -d  : database configuration file
  configuration file must be in properties format, with the following
  keys defined: driver, url, user and password
--queryfile | -q : name of file containing SQL to be run
--outputfile | -o: name of file where results should be written
--sep | -s       : the separator character to use in output
--help | -h      : print this information

I had described in a previous post how one can add JAR files to the Jython classpath by appending the path names to sys.path in the code, but the approach didn't work for me here, perhaps because the appropriate Driver JAR class is being loaded explicitly by the code here. So I fell back to having it be loaded from the Java CLASSPATH instead. It's not such a bad thing, though - with this approach, one can use the script as is, for any database, as long as the JDBC driver exists on the CLASSPATH when the script is invoked. And if the driver is not in the CLASSPATH, the script tells you:

sujit@sirocco:~$ ./db2csv.py -d /path/to/config/file -q /path/to/sql/file.sql \
  -o /path/to/report/file -s "|"
Running SQL: select foo from bar where baz = 'baz'
ERROR: driver [com.mysql.jdbc.Driver] not found
STACK TRACE:
Traceback (most recent call last):
  File "./db2csv.py", line 156, in main
    sqlrunner.runSql()
  File "./db2csv.py", line 83, in runSql
    db = zxJDBC.connect(props["url"], props["user"], props["password"],
DatabaseError: driver [com.mysql.jdbc.Driver] not found
USAGE:
./db2csv.py -d dbconfig -q queryfile -s sepchar -o outputfile
OR: ./db2csv.py -h
OPTIONS:
--dbconfig | -d  : database configuration file
  configuration file must be in properties format, with the following
  keys defined: driver, url, user and password
--queryfile | -q : name of file containing SQL to be run
--outputfile | -o: name of file where results should be written
--sep | -s       : the separator character to use in output
--help | -h      : print this information

I also had to increase the JVM heap size because Jython was running out of heap space when running the queries. I did this directly in the Jython script, by adding -Xmx2048m to the java call.

In spite of the simplicity of the script, I am already finding it immensely useful. If you have reached here looking for a similar script, I hope you will find it useful as well.

2 comments:

Anonymous5/14/2009 5:03 PM
you forgot that inline comments can be well inline.
I added
sline = sline.partition('--')[0]
after

sline = sline.rstrip('\n')

of course this breaks the line if '--' is actually in a part of your sql. (I had one in some text i was inserting) but i was layz and just able to change the text.
--AaronS
Sujit Pal5/16/2009 10:24 AM
@AaronS: Cool, I didn't think of that, I usually don't do inline comments on my SQL. Either they are really simple and they don't need commenting, or they are really complex and I put the comments up at the top or at the beginning of each select for multi-select (eg nested or union type) statements. Thanks for the fix!

Comments are moderated to prevent spam.

Saturday, February 21, 2009

Python SQL Runner

2 comments: