Monday, May 30, 2011

Python JSON Prettifier

I have written earlier about my XML Viewer utility. Over the last few years, I have found good use for it, mainly for making sense of large compressed XML files. On a recent project that I got pulled into, we needed to make sense of (you guessed it) large compressed JSON files.

I was not working with these files directly myself, but after seeing one colleague peering through the files in a text editor, and another writing horrendously complex regular expressions to extract control data (i.e., how many records and so on) from them, I realized that it might be useful to build something along the lines of my XML viewer, but for JSON. Because, you know, I may be the one doing these things next :-).

Looking around, I found the JSONFormat online service, which worked beautifully for the medium-sized JSON file I threw at it, but I figured it would be nice to have something I could run from the command line (without an internet connection when I am on my commute, for example). Besides, with a command-line tool I could use other Unix tools like grep and wc to get quick stats on the files without any cut-n-paste madness.

My JSON library of choice for Java work is the Jackson JSON processor, but since this was going to be a script, I wanted to use Python to build it. I initially tried jsonlib, because it promised pretty-printing support. Unfortunately, the input file had characters encoded in ISO-8859-1 (copy-pasting data from MS-Office will do that to you), and several frustrating hours later, I ditched jsonlib in favor of the built-in json (formerly simplejson) library, for reasons completely unrelated to the capabilities of either one.

I finally figured out how to deal with the ISO-8859-1 encoding using the codecs library (also built into Python 2.6), but the built-in pretty printing did not work quite the way I had hoped (i.e., like jsonformat.com), so I figured that I could easily write my own. After all, a JSON object is just an arbitrary structure composed of nested lists and dictionaries.
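
To make those two steps concrete, here is a minimal sketch. It is written for Python 3 rather than the Python 2.6 used in this post, and the sample record is made up rather than taken from the project files. It decodes ISO-8859-1 bytes with the codecs library and then runs the built-in pretty printer (json.dumps with an indent). Note that by default the built-in printer escapes non-ASCII characters as \uXXXX sequences.

```python
import codecs
import json

# Hypothetical sample: JSON bytes as they would sit in an ISO-8859-1 file.
raw = u'{"name": "caf\u00e9", "tags": ["a", "b"]}'.encode("iso-8859-1")

# Decode with the correct encoding first, so json.loads sees proper Unicode text.
text = codecs.decode(raw, "iso-8859-1")
obj = json.loads(text)

# Built-in pretty printing; sort_keys just makes the key order predictable.
pretty = json.dumps(obj, indent=2, sort_keys=True)
print(pretty)
```

Decoding the same bytes as UTF-8 would raise a UnicodeDecodeError, because the single byte 0xE9 is not a valid UTF-8 sequence on its own. That is the kind of failure the user-supplied encoding is meant to avoid.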

So my approach was to use the codecs library to read the file as text using the appropriate (user-supplied) encoding, and then parse that text into a Python object with json.loads. Strings in this Python object are stored as Unicode. My code then recurses through the object, putting in the necessary newlines and indents, and writes the strings out as plain old ASCII, replacing each non-ASCII character with its XML character reference. Here is the code.

#! /usr/bin/python
# -*- coding: utf-8 -*-
#
# Converts the specified JSON file with the specified encoding
# (default encoding is utf8 if not specified) into a formatted 
# version with the specified indent (2 if not specified) and 
# writes to STDOUT.
#
# Usage: jsoncat.py inputfile.json [input_encoding] [indent]
#
import sys
import json
import codecs

def usage():
  print "Usage: %s input.json [input_encoding] [indent]" % sys.argv[0]
  print "input.json - the input file to read"
  print "input_encoding - optional, default utf-8"
  print "indent - optional, default 2"
  sys.exit(-1)

def prettyPrint(obj, indent, depth, suffix=False):
  if isinstance(obj, list):
    sys.stdout.write("%s[\n" % str(" " * indent * depth))
    sz = len(obj)
    i = 1
    for row in obj:
      prettyPrint(row, indent, depth + 1, i < sz)
      i = i + 1
    ct = "," if suffix else ""
    sys.stdout.write("%s]%s\n" % (str(" " * indent * depth), ct))
  elif isinstance(obj, dict):
    sys.stdout.write("%s{\n" % str(" " * indent * depth))
    sz = len(obj.keys())
    i = 1
    for key in obj.keys():
      sys.stdout.write("%s%s :" % (str(" " * indent * (depth + 1)), qq(key)))
      val = obj[key]
      if isinstance(val, list) or isinstance(val, dict):
        prettyPrint(val, indent, depth + 1, i < sz)
      else:
        prettyPrint(val, 1, 1, i < sz)
      i = i + 1
    ct = "," if suffix else ""
    sys.stdout.write("%s}%s\n" % (str(" " * indent * depth), ct))
  else:
    ct = "," if suffix else ""
    sys.stdout.write("%s%s%s\n" % (str(" " * indent * depth), qq(obj), ct))

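def qq(obj):
  # json.dumps escapes embedded quotes and backslashes, and renders
  # True/False/None as true/false/null, which repr() does not
  if isinstance(obj, unicode):
    # keep non-ASCII characters intact, then replace them with XML references
    return json.dumps(obj, ensure_ascii=False).encode('ascii', 'xmlcharrefreplace')
  else:
    return json.dumps(obj)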

def main():
  if len(sys.argv) < 2 or len(sys.argv) > 4:
    usage()
  # both arguments are optional; indent arrives as a string, so convert it
  encoding = sys.argv[2] if len(sys.argv) > 2 else "utf-8"
  indent = int(sys.argv[3]) if len(sys.argv) > 3 else 2
  infile = codecs.open(sys.argv[1], "r", encoding)
  content = infile.read()
  infile.close()
  jsobj = json.loads(content)
  prettyPrint(jsobj, indent, 0)
  
if __name__ == "__main__":
  main()
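
The qq() helper above leans on Python's xmlcharrefreplace error handler. Here is a small sketch of that one step in isolation, written for Python 3. The to_ascii name is made up for illustration and is not part of the script.

```python
def to_ascii(text):
    # Hypothetical helper: encode to ASCII, replacing every character outside
    # ASCII with a numeric XML character reference, then return a str again.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(to_ascii(u"caf\u00e9"))    # caf&#233;
print(to_ascii(u"10 \u20ac"))    # 10 &#8364;
```

One thing to keep in mind is that the replacement is one-way as far as JSON is concerned. A JSON parser reading the formatted output sees the literal text &#233; rather than the original character, so the output is meant for eyeballing and grepping, not for feeding back into an application.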

To use it, you supply a JSON file, an optional encoding, and an optional indentation size. If no encoding is specified, the script assumes UTF-8. If no indent is specified, the script assumes 2 characters. Because the arguments are positional, specifying an indent means specifying the encoding too. Output is written to the console, which you can capture into a file using standard Unix redirection, or pipe through various Unix filters if it makes sense.
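
The script above targets Python 2, where the print statement and the unicode type still exist. For readers on Python 3, the same recursive idea might be sketched roughly as follows. This is not a port of the script. The names ascii_quote and format_json are made up, the layout differs slightly (a nested list or dictionary opens on the same line as its key), and the function returns a list of lines instead of writing to STDOUT.

```python
import json

def ascii_quote(value):
    """Render a scalar as JSON text, with non-ASCII characters as XML references."""
    if isinstance(value, str):
        # Keep non-ASCII characters intact so the encode step can replace them.
        quoted = json.dumps(value, ensure_ascii=False)
        return quoted.encode("ascii", "xmlcharrefreplace").decode("ascii")
    return json.dumps(value)  # numbers, true, false, null

def format_json(obj, indent=2, depth=0):
    """Recursively format obj and return the output as a list of lines."""
    pad = " " * (indent * depth)
    if isinstance(obj, dict):
        # Keys are assumed to be strings, as they are after json.loads.
        items = [(ascii_quote(k) + ": ", v) for k, v in obj.items()]
        opener, closer = "{", "}"
    elif isinstance(obj, list):
        items = [("", v) for v in obj]
        opener, closer = "[", "]"
    else:
        return [pad + ascii_quote(obj)]
    if not items:
        return [pad + opener + closer]
    inner = " " * (indent * (depth + 1))
    lines = [pad + opener]
    for i, (prefix, val) in enumerate(items):
        child = format_json(val, indent, depth + 1)
        # Re-indent the child's first line and put the key (if any) in front.
        child[0] = inner + prefix + child[0].lstrip()
        if i < len(items) - 1:
            child[-1] += ","  # comma after every entry except the last
        lines.extend(child)
    lines.append(pad + closer)
    return lines

sample = {"name": "caf\u00e9", "tags": ["a", 1, None], "empty": {}}
print("\n".join(format_json(sample)))
```

Returning lines rather than printing them is a choice made for this sketch. It makes the formatter easy to test and leaves it to the caller to decide where the output goes.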
