Recently, I have been working quite a bit with our taxonomy database, using the data in it as input to some data mining programs I am working on. Initially, I built in the database call into the programs themselves, but this approach turned out to be inconvenient for a number of reasons. One of these reasons is that the program is harder to restart in case of failures. Having these programs work with input files instead of a database query also results in cleaner and more extensible design, since now they can work with inputs other than SQL.
The databases I work with are MySQL and Oracle. The last time I needed this sort of stuff, I would whip up a Python script on-demand using either the MySQLdb or cx-oracle packages. One crashed hard disk and a couple of OS re-installs later, these scripts no longer work because the packages need to be re-installed as well. I have mostly gotten by since the crash using the MyEclipse Database Explorer and cut-n-paste, since the datasets I was working with were smaller, but the size of the data (~ 100K plus rows) I am dealing with now are likely to cause Eclipse to clutch its throat and do a crash-n-burn, taking my other unsaved code along with it, so I needed something that I could use from the command line.
This time around, I planned to do things a bit smarter than I did in the past, so I thought of building a generic script that would dump out rows given any SQL, something similar to my fledgling attempt many, many years ago. Funny how things come full circle from time to time. This time around, I also wanted to make sure its usage survives disk crashes and OS re-installs, so rather than have a script that requires extra dependencies, I wanted to use the JDBC drivers that I was going to install anyway. So my choices boiled down to either Jython or Scala.
I ended up writing it using Jython because my other scripts are all written in Python, and because writing database code in Python is slightly less verbose than in Java (or Scala, since that would have been the same thing). The name of the script is db2csv.py, and the code is shown below:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 | #!/usr/bin/env /opt/jython-2.2.1/jython
import sys
import getopt
import os.path
import traceback
from com.ziclix.python.sql import zxJDBC
class SqlRunner:
"""
Class to run the SQL and print it out into the output file as a CSV
"""
def __init__(self, dbconfig, sqlfile, outputfile, sepchar):
"""
@param dbconfig the database configuration file name
@param sqlfile the name of the file containing the SQL to be executed
@param outputfile the name of the file where output will be written
@param sepchar the separator character (or string) to use
"""
self.dbconfig = dbconfig
self.sqlfile = sqlfile
self.outputfile = outputfile
self.sepchar = sepchar
def getDbProps(self):
"""
Return the database properties as a map of name value pairs.
@return a map of name value pairs
"""
if (not os.path.exists(self.dbconfig)):
raise Exception("File not found: %s" % (self.dbconfig))
props = {}
pfile = open(self.dbconfig, 'rb')
for pline in pfile:
(name, value) = pline[:-1].split("=")
props[name] = value
pfile.close()
return props
def getSql(self):
"""
Validate the sql file name, and parse out the SQL to be run. The
method will skip SQL single-line comments. Blocks enclosed by free
standing multi-line comments are also skipped.
@return the SQL as a string
"""
if (not os.path.exists(self.sqlfile)):
raise Exception("File not found: %s" % (self.sqlfile))
sql = []
sfile = open(self.sqlfile, 'rb')
incomment = False
for sline in sfile:
sline = sline.rstrip('\n')
if (sline.startswith("--") or len(sline.rstrip()) == 0):
# SQL Comment line, skip
continue
if (sline.rstrip() == "/*"):
# start of SQL comment block
incomment = True
continue
if (sline.rstrip() == "*/"):
# end of SQL comment block
incomment = False
continue
if (not incomment):
sql.append(sline)
sfile.close()
return " ".join(sql)
def runSql(self):
"""
Runs the SQL and prints it out into the specified output file as a CSV
file delimited by sepchar.
"""
props = self.getDbProps()
sql = self.getSql()
print "Running SQL: %s" % (sql)
ofile = open(self.outputfile, 'wb')
db = zxJDBC.connect(props["url"], props["user"], props["password"],
props["driver"])
cur = db.cursor(True)
cur.execute(sql)
# print the header
meta = cur.description
print "Writing output to: %s" % (self.outputfile)
ofile.write(self.sepchar.join(map(lambda x: x[0], meta)) + "\n")
for row in cur.fetchall():
strrow = map(lambda x: str(x), row)
ofile.write(self.sepchar.join(strrow) + "\n")
ofile.close()
cur.close()
db.close()
def usage(error=""):
"""
Print the usage information. If an error message is supplied, print that
on top of the usage information.
"""
if (len(str(error)) > 0):
print "ERROR: %s" % (error)
print "STACK TRACE:"
traceback.print_exc()
print "USAGE:"
print "%s -d dbconfig -q queryfile -s sepchar -o outputfile" % (sys.argv[0])
print "OR: %s -h" % (sys.argv[0])
print "OPTIONS:"
print "--dbconfig | -d : database configuration file"
print " configuration file must be in properties format, with the following"
print " keys defined: driver, url, user and password"
print "--queryfile | -q : name of file containing SQL to be run"
print "--outputfile | -o: name of file where results should be written"
print "--sep | -s : the separator character to use in output"
print "--help | -h : print this information"
sys.exit(2)
def extractOptions(argv):
"""
Extract command line options and return a tuple
@param argv the sys.argv object
@return a tuple containing the information for running the SQL
"""
try:
(opts, args) = getopt.getopt(argv[1:], "d:q:s:o:h",
["dbconfig=", "queryfile=", "sep=", "outputfile=", "help"])
except getopt.GetoptError:
usage()
if (len(filter(lambda x: x[0] in ("-h", "--help"), opts)) == 1):
usage()
if (len(opts) != 4):
usage()
for opt in opts:
(key, value) = opt
if (key in ("-d", "--dbconfig")):
dbconfig = value
elif (key in ("-q", "--queryfile")):
sqlfile = value
elif (key in ("-o", "--outputfile")):
outputfile = value
elif (key in ("-s", "--sep")):
sepchar = value
else:
usage()
return (dbconfig, sqlfile, outputfile, sepchar)
def main():
"""
This is how we are called
"""
(dbconfig, sqlfile, outputfile, sepchar) = extractOptions(sys.argv)
sqlrunner = SqlRunner(dbconfig, sqlfile, outputfile, sepchar)
try:
sqlrunner.runSql()
except Exception, e:
usage(e)
if __name__ == "__main__":
main()
|
As you can see, there is nothing really new here, you can get most of this from the relevant section of the Jython User Guide. I did start to use a bit of Python lambdas, which ironically, I learned on my recent foray into Scala-land.
In the spirit of not having to install anything but the base language, the zJDBC package comes standard with the Jython version (2.2.1) that I am using. Since I already use Java I will install it anyway, as well JDBC drivers for the different databases that I will talk to.
You can call the script using command line parameters as shown in the help output below:1 2 3 4 5 6 7 8 9 10 11 12 | sujit@sirocco:~$ ./db2csv.py -h
USAGE:
./db2csv.py -d dbconfig -q queryfile -s sepchar -o outputfile
OR: ./db2csv.py -h
OPTIONS:
--dbconfig | -d : database configuration file
configuration file must be in properties format, with the following
keys defined: driver, url, user and password
--queryfile | -q : name of file containing SQL to be run
--outputfile | -o: name of file where results should be written
--sep | -s : the separator character to use in output
--help | -h : print this information
|
I had described in a previous post how one can add JAR files to the Jython classpath by appending the path names to sys.path in the code, but the approach didn't work for me here, perhaps because the appropriate Driver JAR class is being loaded explicitly by the code here. So I fell back to having it be loaded from the Java CLASSPATH instead. It's not such a bad thing, though - with this approach, one can use the script as is, for any database, as long as the JDBC driver exists on the CLASSPATH when the script is invoked. And if the driver is not in the CLASSPATH, the script tells you:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 | sujit@sirocco:~$ ./db2csv.py -d /path/to/config/file -q /path/to/sql/file.sql \
-o /path/to/report/file -s "|"
Running SQL: select foo from bar where baz = 'baz'
ERROR: driver [com.mysql.jdbc.Driver] not found
STACK TRACE:
Traceback (most recent call last):
File "./db2csv.py", line 156, in main
sqlrunner.runSql()
File "./db2csv.py", line 83, in runSql
db = zxJDBC.connect(props["url"], props["user"], props["password"],
DatabaseError: driver [com.mysql.jdbc.Driver] not found
USAGE:
./db2csv.py -d dbconfig -q queryfile -s sepchar -o outputfile
OR: ./db2csv.py -h
OPTIONS:
--dbconfig | -d : database configuration file
configuration file must be in properties format, with the following
keys defined: driver, url, user and password
--queryfile | -q : name of file containing SQL to be run
--outputfile | -o: name of file where results should be written
--sep | -s : the separator character to use in output
--help | -h : print this information
|
I also had to increase the JVM heap size because Jython was running out of heap space when running the queries. I did this directly in the Jython script, by adding -Xmx2048m to the java call.
In spite of the simplicity of the script, I am already finding it immensely useful. If you have reached here looking for a similar script, I hope you will find it useful as well.
you forgot that inline comments can be well inline.
ReplyDeleteI added
sline = sline.partition('--')[0]
after
sline = sline.rstrip('\n')
of course this breaks the line if '--' is actually in a part of your sql. (I had one in some text i was inserting) but i was layz and just able to change the text.
--AaronS
@AaronS: Cool, I didn't think of that, I usually don't do inline comments on my SQL. Either they are really simple and they don't need commenting, or they are really complex and I put the comments up at the top or at the beginning of each select for multi-select (eg nested or union type) statements. Thanks for the fix!
ReplyDelete