Friday, July 21, 2006

Python XML Viewer for Linux

I recently needed to view an XML file I had generated, to verify that the code worked right. Normally I would just view the file with Firefox using the file:// protocol, and Firefox would show me the document tree with its default XML rendering. But this file was quite large (about 800MB) and Firefox was not able to complete loading the file. It did not crash, but the keyboard and mouse became unresponsive and I had to manually kill Firefox from the command prompt.

So I figured that since Firefox is a general-purpose browser, asking it to render such large files was probably expecting too much, and I would have better luck with software optimized to render only XML - in other words, an XML viewer. I did a quick search for "XMLViewer" on Google, but came up empty in terms of software I could actually use. There appear to be many more XML viewers for Windows than for Linux. The only ones I found for Linux were KXMLViewer and gxmlviewer.

The screenshots for KXMLViewer look nice but the functionality was not exactly what I wanted. Gxmlviewer was tested on RedHat 7.1 and has probably not been updated since. I was unable to either install the RPM or build from source using the downloads on Fedora Core 3. I could probably have done it if I had tried a little harder, but I decided to pass.

Having failed to find a tool that did what I wanted, I started wondering if I should just build it myself. Since I wanted to parse large XML files, a SAX parser seemed the obvious choice. As a matter of fact, one of the reasons Firefox was hanging was that it was trying to slurp in all 800MB of data before rendering it. It needs to do that because it gives you additional controls to expand and collapse elements. That functionality would be a nice-to-have for me, but I definitely did not need it. All I wanted was something that would format the XML (which was written for compactness) into something I could read. A SAX handler that formats its input into a nicely indented document tree is practically one of the first examples you encounter when you read about SAX parsing, so building such a tool would be trivial.
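To make the streaming idea concrete, here is a minimal sketch (Python 3, and not the tool itself) of a SAX handler that writes each element name as soon as its opening tag is seen, keeping only a depth counter in memory. The names OutlineHandler and print_outline are made up for this illustration.

```python
import sys
import xml.sax
from xml.sax.handler import ContentHandler


class OutlineHandler(ContentHandler):
    """Hypothetical handler: writes element names indented by nesting depth."""

    def __init__(self, out, indent=2):
        super().__init__()
        self.out = out
        self.indent = indent
        self.depth = 0

    def startElement(self, name, attrs):
        # Called as soon as the parser sees an opening tag, so output can
        # start before the rest of the document has been read.
        self.out.write(" " * (self.indent * self.depth) + name + "\n")
        self.depth += 1

    def endElement(self, name):
        # Called on the matching closing tag; only the counter changes.
        self.depth -= 1


def print_outline(xml_bytes, out=None, indent=2):
    """Parse xml_bytes and write an indented outline of element names."""
    if out is None:
        out = sys.stdout
    xml.sax.parseString(xml_bytes, OutlineHandler(out, indent))


if __name__ == "__main__":
    print_outline(b"<catalog><book id='1'><title>SAX</title></book><book id='2'/></catalog>")
```

Because nothing is stored except the current depth, the handler's own state does not grow with the size of the input, which is the property that matters for an 800MB file.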

Navigation, i.e. going up or down one or more pages, can be handled simply by piping the output through the Linux utility less. The [SPACE] key moves the formatted output forward a page, and [CTRL-B] moves it back. To call the tool, specify the file name and the tabstop indent.

$ xmlcat.py my_really_large_xml_file.xml 4 | less

I wrote xmlcat.py in Python. Here is the code.

#!/usr/bin/python
# A simple SAX Parser to view large XML files as a nicely formatted XML
# document tree. Pipe the output through less and move forward and backward
# using [SPACE] and [CTRL-B] respectively. Standard less keyboard commands
# will also work.
#
# The print_function import lets the same code run under Python 2.6+ and 3.
from __future__ import print_function

import sys
from xml.sax import make_parser
from xml.sax.handler import ContentHandler

class PrettyPrintingContentHandler(ContentHandler):
    """ Subclass of the SAX ContentHandler to print document tree """

    def __init__(self, indent):
        """ Ctor; indent is the number of spaces per nesting level """
        ContentHandler.__init__(self)
        self.indent = indent
        self.level = 0
        self.chars = ''

    def startElement(self, name, attrs):
        """ Set the level and print opening tag with attributes """
        # Print any text that precedes this child element (mixed content)
        self.printChars()
        self.level = self.level + 1
        attrString = ""
        for qname in attrs.getQNames():
            attrString = attrString + " " + qname + "=\"" + attrs.getValueByQName(qname) + "\""
        print(self.tab(self.level) + "<" + name.rstrip() + attrString + ">")

    def endElement(self, name):
        """ Print the characters and the closing tag """
        self.printChars()
        print(self.tab(self.level) + "</" + name.rstrip() + ">")
        self.level = self.level - 1

    def characters(self, c):
        """ Accumulate characters; the parser may deliver text in chunks """
        self.chars = self.chars + c

    def printChars(self):
        """ Print accumulated text one level deeper, ignore whitespace """
        text = self.chars.strip()
        if (len(text) > 0):
            print(self.tab(self.level + 1) + text)
        self.chars = ''

    def tab(self, n):
        """ Return the indentation for an element at level n (root is 1) """
        return " " * (self.indent * (n - 1))

def usage():
    """ Print the usage """
    print("Usage: xmlcat.py xml_file indent | less")
    print("Use [SPACE] and [CTRL-B] to move forward and backward")
    sys.exit(1)

def main():
    """ Check the arguments, instantiate the parser and parse """
    if (len(sys.argv) != 3):
        usage()
    path = sys.argv[1]
    try:
        indent = int(sys.argv[2])
    except ValueError:
        usage()
    parser = make_parser()
    prettyPrintingContentHandler = PrettyPrintingContentHandler(indent)
    parser.setContentHandler(prettyPrintingContentHandler)
    parser.parse(path)

if __name__ == "__main__":
    main()

As you can see, the code is quite trivial and based in large part on this DevShed article, but I am including it here anyway. It took me all of half an hour to write and test, so it's not rocket science, but it may save you half an hour if you are looking for a similar tool and stumble upon this page.
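One caveat worth spelling out: the SAX parser hands the handler text and attribute values with entity references already resolved, so an `&amp;` in the file is printed as a bare `&`. That is fine for reading, but the output is then not guaranteed to be well-formed XML. If that matters, a sketch along the following lines (Python 3) could be used when building the output strings. The helpers format_start_tag and format_text are hypothetical names for this illustration and are not part of xmlcat.py; escape and quoteattr come from the standard xml.sax.saxutils module.

```python
from xml.sax.saxutils import escape, quoteattr


def format_start_tag(name, attrs):
    """Build an opening tag from a name and a dict of attribute values."""
    parts = [name]
    for key, value in attrs.items():
        # quoteattr escapes the value and picks a suitable quote character.
        parts.append(key + "=" + quoteattr(value))
    return "<" + " ".join(parts) + ">"


def format_text(text):
    """Trim the text and escape &, < and > so it can be written as XML."""
    return escape(text.strip())


if __name__ == "__main__":
    print(format_start_tag("a", {"href": "x?y=1&z=2"}))
    print(format_text("  AT&T <rocks>  "))
```

For simply paging through a file to understand its structure, the unescaped output is arguably easier on the eyes, so this is an optional refinement rather than a fix.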

12 comments:

  1. I was just looking for a way to peek into a 900MB XML file, so this came in very handy. However, I had to add the following to make it work properly with UTF-8:

    add
    import codecs

    and then after main() add this line:

    sys.stdout = codecs.getwriter('utf8')(sys.stdout)

    Just in case somebody runs into the same problem.

    Thanks,

    Ctop

  2. Thank you for the patch, Anonymous, much appreciated. Having to read UTF-8 is a common requirement, so this is very useful.

  3. thank you for this tool !

  4. You're welcome, Socrates. (I clicked on your profile link and I thought I recognized the pic, so then I looked up the wikipedia page for Socrates where the name is provided in Greek, and from then on, it was just manual pattern matching).

  5. Hmmm. I like the tool. Very nice.

    Had you hit the repositories with 'xml editor' you might have stumbled across Conglomerate XML Editor. It still has bugs but as a viewer it works quite well.

  6. Hi JohnMc, thanks for the link -- I did not know about Conglomerate, this was /exactly/ the sort of thing I was looking for some 3-4 years ago, when I wrote a very large DocBook manual for SQLUnit using vim :-). Another (non-free) xml alternative I have used and liked is oXygen.

  7. So simple, so awesome!

  8. Thanks I need it.

  9. hello friend.
    good stuff.
    i transformed the output so it can resemble html.

    best wishes,
    alex

  10. Thanks Alex, and that's very cool. In my case, I already had the browser plugin (I think I was using either Firefox or Chrome at the time) to display XML, but the problem for me was giant XML files which would cause the browser to either hang or crash. That's why I built this, so I could just scroll through the file until I understood its structure well enough to write a parser for it.
