Sunday, February 08, 2009

Syntax Coloring and Pygments Lexer for Lucli

I have recently come to the realization that I have been doing you, my readers, a disservice. Over the last year, I have posted entries that mostly consist of gobs and gobs of code, served without any syntax coloration. My first experience with syntax coloration was with the vim editor, and it happened accidentally, with an operating system upgrade. I remember how I could suddenly see the code so much more clearly, and how my productivity shot up over the next few weeks. So syntax coloration is a big deal, and I should have addressed it sooner.

Looking around at other people's blogs, I noticed that most of them used Alex Gorbatchev's SyntaxHighlighter. SyntaxHighlighter is written in JavaScript and dynamically colorizes your code using CSS. I also liked how easy it was to create a colorizer (called a "brush") for a new language. I tried to set it up for my blog with the approach described here, but could not make it work to my satisfaction - specifically, the line numbers appeared but there was no syntax coloration - and yes, I did include the links to the various brushes.

I looked around for alternatives and found the Pygments project. Pygments is very comprehensive and, at least at first glance, slightly more complex than SyntaxHighlighter. It does have an AJAX/JavaScript mode thanks to some folks at Objectgraph. However, I chose to go the command line route, where you pre-build the code into HTML and paste it into the post. The advantage of this approach is that you can build custom lexers for your own content; the disadvantage is that your HTML becomes very hard to read and maintain - if you want to change some code, your best bet is to copy it from the rendered version and rerun it through your colorizing process.

Pygments can colorize code and render it into a variety of output formats - I am only interested in the HTML formatter. To colorize a block of code, it uses language-specific lexers, which emit keyword, operator, and other tokens. The formatter takes these tokens and wraps them in <span> tags. A CSS file dictates the styling of the classes named in the span tags, which produces the syntax coloring. So for a person in my situation, the main concerns are the availability of the lexers I need, and an API with which I can build a custom lexer.
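
To make the pipeline concrete, here is a stdlib-only toy sketch - this is not Pygments' actual API, and the rules and class names below are made up for illustration, though they mirror the short class names (.k = Keyword, .s = String, .c = Comment) in the generated stylesheet. A regex tokenizer emits (class, text) pairs, and a formatter wraps each piece in a <span> for the stylesheet to color:

```python
import html
import re

# Toy token rules (made up for illustration): each pattern maps to a
# short CSS class name, mirroring Pygments' generated stylesheet
# (.k = Keyword, .s = String, .c = Comment).
RULES = [
    ('c', r'#[^\n]*'),                    # comment
    ('s', r'"[^"]*"'),                    # string
    ('k', r'\b(?:def|return|import)\b'),  # keyword
]

def tokenize(code):
    """Yield (css_class, text) pairs; unmatched text gets class None."""
    pos = 0
    while pos < len(code):
        for css_class, pattern in RULES:
            m = re.match(pattern, code[pos:])
            if m:
                yield css_class, m.group(0)
                pos += m.end()
                break
        else:
            yield None, code[pos]
            pos += 1

def format_html(code):
    """Wrap each token in a <span> that the stylesheet classes color."""
    pieces = []
    for css_class, text in tokenize(code):
        text = html.escape(text)
        if css_class:
            pieces.append('<span class="%s">%s</span>' % (css_class, text))
        else:
            pieces.append(text)
    return '<pre>' + ''.join(pieces) + '</pre>'

print(format_html('def f(): return "hi"  # demo'))
```

The real HtmlFormatter works the same way in spirit, except that it maps Pygments' rich token hierarchy onto those class names instead of using a flat rule list.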

To be fair, Pygments has a huge number of lexers available, many more than the brushes in the SyntaxHighlighter project, so you may actually never have to build one yourself. However, I wanted to learn how to do this, and output from the Lucli console (see my previous blog) seemed to be a good candidate, so this post is about my experience building that lexer.

I first generated the CSS to put into the header. You can do this by invoking the pygmentize script, as shown below:

prompt$ pygmentize -S emacs -f html >/tmp/styles.css

The resulting CSS file is put into the template file in the <b:skin> block, along with the other CSS declarations. I also added the .linenos class to make the line number margin gray. Here is what my CSS block looks like.

.linenos {background-color: #cccccc }
.hll { background-color: #ffffcc }
.c { color: #008800; font-style: italic } /* Comment */
.err { border: 1px solid #FF0000 } /* Error */
.k { color: #AA22FF; font-weight: bold } /* Keyword */
.o { color: #666666 } /* Operator */
.cm { color: #008800; font-style: italic } /* Comment.Multiline */
.cp { color: #008800 } /* Comment.Preproc */
.c1 { color: #008800; font-style: italic } /* Comment.Single */
.cs { color: #008800; font-weight: bold } /* Comment.Special */
.gd { color: #A00000 } /* Generic.Deleted */
.ge { font-style: italic } /* Generic.Emph */
.gr { color: #FF0000 } /* Generic.Error */
.gh { color: #000080; font-weight: bold } /* Generic.Heading */
.gi { color: #00A000 } /* Generic.Inserted */
.go { color: #808080 } /* Generic.Output */
.gp { color: #000080; font-weight: bold } /* Generic.Prompt */
.gs { font-weight: bold } /* Generic.Strong */
.gu { color: #800080; font-weight: bold } /* Generic.Subheading */
.gt { color: #0040D0 } /* Generic.Traceback */
.kc { color: #AA22FF; font-weight: bold } /* Keyword.Constant */
.kd { color: #AA22FF; font-weight: bold } /* Keyword.Declaration */
.kn { color: #AA22FF; font-weight: bold } /* Keyword.Namespace */
.kp { color: #AA22FF } /* Keyword.Pseudo */
.kr { color: #AA22FF; font-weight: bold } /* Keyword.Reserved */
.kt { color: #00BB00; font-weight: bold } /* Keyword.Type */
.m { color: #666666 } /* Literal.Number */
.s { color: #BB4444 } /* Literal.String */
.na { color: #BB4444 } /* Name.Attribute */
.nb { color: #AA22FF } /* Name.Builtin */
.nc { color: #0000FF } /* Name.Class */
.no { color: #880000 } /* Name.Constant */
.nd { color: #AA22FF } /* Name.Decorator */
.ni { color: #999999; font-weight: bold } /* Name.Entity */
.ne { color: #D2413A; font-weight: bold } /* Name.Exception */
.nf { color: #00A000 } /* Name.Function */
.nl { color: #A0A000 } /* Name.Label */
.nn { color: #0000FF; font-weight: bold } /* Name.Namespace */
.nt { color: #008000; font-weight: bold } /* Name.Tag */
.nv { color: #B8860B } /* Name.Variable */
.ow { color: #AA22FF; font-weight: bold } /* Operator.Word */
.w { color: #bbbbbb } /* Text.Whitespace */
.mf { color: #666666 } /* Literal.Number.Float */
.mh { color: #666666 } /* Literal.Number.Hex */
.mi { color: #666666 } /* Literal.Number.Integer */
.mo { color: #666666 } /* Literal.Number.Oct */
.sb { color: #BB4444 } /* Literal.String.Backtick */
.sc { color: #BB4444 } /* Literal.String.Char */
.sd { color: #BB4444; font-style: italic } /* Literal.String.Doc */
.s2 { color: #BB4444 } /* Literal.String.Double */
.se { color: #BB6622; font-weight: bold } /* Literal.String.Escape */
.sh { color: #BB4444 } /* Literal.String.Heredoc */
.si { color: #BB6688; font-weight: bold } /* Literal.String.Interpol */
.sx { color: #008000 } /* Literal.String.Other */
.sr { color: #BB6688 } /* Literal.String.Regex */
.s1 { color: #BB4444 } /* Literal.String.Single */
.ss { color: #B8860B } /* Literal.String.Symbol */
.bp { color: #AA22FF } /* Name.Builtin.Pseudo */
.vc { color: #B8860B } /* Name.Variable.Class */
.vg { color: #B8860B } /* Name.Variable.Global */
.vi { color: #B8860B } /* Name.Variable.Instance */
.il { color: #666666 } /* Literal.Number.Integer.Long */
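
As an aside, the same stylesheet can also be generated from Python instead of shelling out to pygmentize - a sketch using HtmlFormatter's get_style_defs method (with an empty selector prefix, which should yield the bare class selectors shown above; worth verifying against your installed Pygments version):

```python
from pygments.formatters import HtmlFormatter

# Emit the emacs-style rules; an empty selector prefix produces bare
# class selectors such as .k and .c, matching the block above.
css = HtmlFormatter(style='emacs').get_style_defs('')

# Prepend the extra rule for the gray line-number margin.
css = '.linenos {background-color: #cccccc }\n' + css

with open('/tmp/styles.css', 'w') as f:
    f.write(css)
```

This is handy if you want stylesheet generation to live in the same script as the colorizing step.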

For the Lucli command line, I wanted the lucli> prompt and the available commands to be highlighted in a different color, and some way to distinguish the Lucli output from the user input, perhaps by bolding the user input. The LucliLexer class is quite simple: it extends the RegexLexer class and sets up the regular expressions that will be matched to the appropriate tokens. Here is the code.

#!/usr/bin/python
from pygments.lexer import RegexLexer, bygroups
from pygments.token import *

# All my custom Lexers here
class LucliLexer(RegexLexer):
  """
  Simple Lucli command line lexer based on RegexLexer. We just define various
  kinds of tokens for coloration in the tokens dictionary below.
  """
  name = 'Lucli'
  aliases = ['lucli']
  filenames = ['*.lucli']

  tokens = {
    'root' : [
      # our startup header
      (r'Lucene CLI.*$', Generic.Heading),
      # the prompt
      (r'lucli> ', Generic.Prompt),
      # keywords that appear by themselves in a single line
      (r'(analyzer|help|info|list|orient|quit|similarity|terms)$', Keyword),
      # keywords followed by arguments in single line
      (r'(analyzer|count|explain|get|index|info|list|optimize|'
       r'orient|quit|remove|search|similarity|terms|tokens)(\s+)(.*?)$',
       bygroups(Keyword, Text, Generic.Strong)),
      # rest of the text
      (r'.*\n', Text)
    ]
  }
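
Under the hood, RegexLexer tries each pattern in the current state at the current input position and emits the token of the first match. The matching loop can be sketched in plain Python - a simplification of what Pygments actually does (real lexers also support state transitions and bygroups, and use token objects rather than strings):

```python
import re

def run_rules(text, rules):
    """Minimal RegexLexer-style matching loop: at each position, try the
    rules in order and emit (token, matched_text) for the first match;
    fall back to a one-character Text token otherwise."""
    pos = 0
    tokens = []
    while pos < len(text):
        for pattern, token in rules:
            m = re.compile(pattern, re.MULTILINE).match(text, pos)
            if m and m.group(0):
                tokens.append((token, m.group(0)))
                pos = m.end()
                break
        else:
            tokens.append(('Text', text[pos]))
            pos += 1
    return tokens

# Rules loosely mirroring the root state above; token names are plain
# strings here, where Pygments uses objects like Generic.Prompt.
LUCLI_RULES = [
    (r'lucli> ', 'Prompt'),
    (r'(?:help|info|list|quit)$', 'Keyword'),
    (r'.+\n', 'Text'),
]

print(run_rules('lucli> help\n', LUCLI_RULES))
```

Pygments compiles the patterns once up front and, per its documentation, applies re.MULTILINE by default, which is why the $ anchors in the lexer above match at end-of-line rather than end-of-input.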

For my own convenience, I created another Python script to replace the pygmentize script. I plan to call this script on source files I want to colorize (with my defaults, such as line numbering and the style set to emacs mode). Unlike pygmentize, it allows me to work with my own custom lexers. Right now, it takes in a block of text and uses the first line (where I usually put the full path name of the file) to figure out the lexer to call. If it cannot figure it out, it asks the user. In the future, I plan on adding the file name as another way to figure out the correct lexer to use, but that's not done yet.

#!/usr/bin/python
from pygments import highlight
from pygments.lexers import *
from pygments.formatters import HtmlFormatter
from mypygmentlexers import *
import re

LEXER_MAPPINGS = {
  "java" : JavaLexer(),
  "xml" : XmlLexer(),
  "scala" : ScalaLexer(),
  "python" : PythonLexer(),
  "py" : PythonLexer(),
  "bash" : BashLexer(),
  "sh" : BashLexer(),
  "javascript" : JavascriptLexer(),
  "css" : CssLexer(),
  "jsp" : JspLexer(),
  "lucli" : LucliLexer(),
  "text" : TextLexer()
}

def askLexer(text):
  """
  Display the text on console and ask the user what it is. Must be one
  of the patterns in LEXER_MAPPINGS.
  """
  print '==== text ===='
  print text
  print '==== /text ===='
  while 1:
    ctype = raw_input("Specify type (" + ",".join(LEXER_MAPPINGS.keys()) + "): ")
    try:
      lexer = LEXER_MAPPINGS[ctype]
      break
    except KeyError:
      print 'Sorry, invalid type, try again'
  return lexer

def guessLexer(text):
  """
  Uses the file name metadata or shebang info on the first line, if it
  exists, and tries to "guess" the lexer required for colorizing.
  """
  firstline = text[0:text.find("\n")]
  # Note: inside a character class, | is a literal character, so [/|.]
  # would also match a pipe; [/.] matches the path or extension separator.
  match = re.search(r'[/.]([a-zA-Z]+)$', firstline)
  if match:
    guess = match.group(1)
    try:
      return LEXER_MAPPINGS[guess]
    except KeyError:
      return askLexer(text)
  else:
    return askLexer(text)

def colorize(text, lexer):
  """
  Calls the pygments API to colorize text with appropriate defaults (for
  my use) for inclusion in an HTML page.
  """
  formatter = HtmlFormatter(linenos=True)
  return highlight(text, lexer, formatter)

def fileToString(filename):
  """
  Convenience method to read a file into a String.
  """
  infile = open(filename, 'rb')
  contents = infile.read()
  infile.close()
  return contents

def writeOutputFile(filename, coloredCode):
  """
  Convenience method to write the colored code into the named output file
  with the appropriate surrounding html markup and style sheet.
  """
  outfile = open(filename, 'wb')
  outfile.write("<html>\n<head>\n<style>\n")
  outfile.write(fileToString("/tmp/styles.css"))
  outfile.write("</style>\n</head>\n<body>\n")
  outfile.write(coloredCode)
  outfile.write("</body>\n</html>\n")
  outfile.close()

def testSingleInput(infile, outfile):
  """
  Test method to convert a named input file into a colorized version in
  the named output file. Can be applied to any kind of file.
  """
  print("Processing " + infile + " -> " + outfile)
  code = fileToString(infile)
  writeOutputFile(outfile, colorize(code, guessLexer(code)))

def main():
  testSingleInput("/tmp/test.lucli", "/tmp/test1.html")

if __name__ == "__main__":
  main()

Running it from the command line (without arguments) produces the following output. The input file (hardcoded in the test method) is a typical Lucli session. The output file is the rendered output. For testing convenience, I build full HTML files (with stylesheet) that I can look at in my browser. The snippet below is the colorized body, since the CSS is already in my template.

Lucene CLI. Please specify index. Type 'help' for instructions.
lucli> help
 analyzer: Set/Unset custom analyzer, default is StandardAnalyzer. Ex: analyzer [analyzer_class]
 count: Return number of results from search. Ex: count query
 explain: Generates explanation for the query. Ex: explain query
 get: Return the record at the specified position. Ex: rec 0
 help: Display command help
 index: Choose new lucene index. Ex: index my_index...
 info: Display info about current Lucene index.
 list: List all or named fields in index. Ex: list f1 f2...
 optimize: Optimize current index
 orient: Set result display orientation, default horizontal. Ex: orient [vertical|horizontal]
 quit: Quit/Exit Lucli
 remove: Remove record at specified position. Ex: remove 0
 search: Search current index. Ex: search query
 similarity: Set/Unset custom similarity, default is DefaultSimilarity
 terms: Show first 100 terms in this index. Can filter by field name if supplied. Ex: terms [field]
 tokens: Returns top 10 tokens for each document (Verbose)
lucli> index src/test/resources/movieindex1   
Index has 99998 documents 
All Fields: [released, body, title]
Indexed Fields: [released, body, title]
lucli> search +title:"happy days" +body:"fonz"
13 total matching documents
-------------------- Result-#:      0, DocId:  18851 --------------------
score                  :  6.010070
released         (I-S-): 1974
title            (ITS-): Happy Days  - Because It's There  #11.1
-------------------- Result-#:      1, DocId:  18855 --------------------
score                  :  6.010070
released         (I-S-): 1974
title            (ITS-): Happy Days  - Cruisin'  #2.16
-------------------- Result-#:      2, DocId:  18858 --------------------
score                  :  5.372923
released         (I-S-): 1974
title            (ITS-): Happy Days  - Fonzie Moves In  #3.1
-------------------- Result-#:      3, DocId:  18863 --------------------
score                  :  5.372923
released         (I-S-): 1974
title            (ITS-): Happy Days  - Hardware Jungle  #1.5
-------------------- Result-#:      4, DocId:  18883 --------------------
score                  :  5.372923
released         (I-S-): 1974
title            (ITS-): Happy Days  - Tall Story  #8.17
lucli> quit

Anyway, that's pretty much it for this week. I think that many Python programmers may already be using Pygments, but it is usable for non-Python programmers as well. Pygments has support for an enormous number of languages, so chances are that you will not have to write a line of Python code to use it - and even if you do, the API looks more complex than it really is, so I expect that you will be pleasantly surprised at how easy it is to build your own custom lexer.

You may have noticed that the salmon background in my <pre>..</pre> blocks has disappeared from my previous blogs. I had to do this in order to make the colorizing work. The HTML formatter wraps the code in a pre block, which is in turn wrapped in a table/td block. If I override the behavior of my pre tag, that takes precedence over the behavior specified by Pygments, and I end up with salmon-colored margins and code background, which looks pretty horrible with the colorized code in it. I do have plans to replace all my pre blocks with colorized versions throughout the blog, but that is a slightly larger job than I am willing to tackle right now.
