Monday, February 16, 2009

Blog Beautification with Blogger GData API

Last week, I wrote about how I had to disable the salmon/orange background in my <pre> tags so that the output of the Pygments syntax colorizer rendered nicely. That left all but a couple of my posts displaying code on a plain white background, which was not what I intended when I set out on my blog beautification project. I don't have that many posts (about 150 over the last 3 years), but it is still too many to go back and convert manually (and retain my sanity). So I figured this would be a good chance to check out the Google Blogger API.

Since Pygments is a Python project, I decided to write my conversion script in Python. The script works in three phases. Phase 1 is the download phase, where all the posts are downloaded at once to local disk. Phase 2 extracts the <pre> blocks, colorizes them with Pygments, writes out an HTML page for manual review, and writes the colorized version of each post back to local disk. Phase 3 uploads the colorized posts back to Blogger. I realize it's conceptually nicer to do all of this in one fell swoop, but that would not have worked for me, as I elaborate below.

Here is the full Python code. I started out not knowing about the existence of the Python GData Blogger API, so my first version used httplib for network IO and libxml2 for XML processing. I chose libxml2 for its XSLT support, but the API is a bit of a mess, so if I were to do XML processing in the future, I would probably choose minidom or elementtree instead. The Python gdata module uses httplib and elementtree under the covers, and (obviously) results in much shorter application code.

#!/usr/bin/python
# Source: bloggerclient.py
from MyPygmentLexers import *
import gdata
from gdata import service
from gdata import atom
from pygments.formatters import HtmlFormatter
from pygments import highlight
from pygments.lexers import *
from xml.sax.saxutils import unescape
import os
import os.path
import re
import shutil
import sys

# Completely dependent on my setup. Change this stuff to suit your own.
DATA_DIR = "/path/to/temporary/data/directory"
BLOGGER_EMAIL = "your_email@your_domain.com"
BLOGGER_PASSWD = "your-blogger-password"
PUBDATE_MIN = "2005-01-01T00:00:00"
PUBDATE_MAX = "2009-12-31T00:00:00"

# A map of strings to the lexer instances I have needed to colorize my posts so
# far; you may need your own set if your language preferences are different
LEXER_MAPPINGS = {
  "java" : JavaLexer(),
  "xml" : XmlLexer(),
# TODO: Change back when bug 392 is fixed:
# http://dev.pocoo.org/projects/pygments/ticket/392
#  "scala" : ScalaLexer(),
  "scala" : JavaLexer(),
  "python" : PythonLexer(),
  "py" : PythonLexer(),
  "jython" : PythonLexer(),
  "php" : PhpLexer(),
  "lisp" : CommonLispLexer(),
  "cs" : CSharpLexer(),
  "unix" : UnixConsoleLexer(),
  "bash" : BashLexer(),
  "sh" : BashLexer(),
  "mysql" : MySqlLexer(),
  "javascript" : JavascriptLexer(),
  "css" : CssLexer(),
  "jsp" : JspLexer(),
  "html" : HtmlLexer(),
  "properties" : IniLexer(),
  "diff" : DiffLexer(),
  "gp" : GnuplotLexer(),
  "rb" : RubyLexer(),
  "lucli" : LucliLexer(),
  "text" : TextLexer(),
  "none" : None
}
LEXER_NAMES = sorted(LEXER_MAPPINGS.keys(), key=str)
HTML_FORMATTER = HtmlFormatter(style='emacs', linenos='table')

### ============= GData ================

def authenticate():
  """
  Uses the ClientLogin approach to log into Blogger. Once this is called
  subsequent requests are automatically authenticated.
  @return a reference to the blogger service
  """
  blogger = service.GDataService(BLOGGER_EMAIL, BLOGGER_PASSWD)
  blogger.source = 'your-useragent-here-0.1'
  blogger.service = 'blogger'
  blogger.account_type = 'GOOGLE'
  blogger.server = 'www.blogger.com'
  blogger.ProgrammaticLogin()
  return blogger

def getBlogIds(blogger, userId='default'):
  """
  Retrieves the blog metadata from blogger and returns the blogIds
  for the given userId.
  @param blogger the reference to an authenticated service
  @param userId default value 'default' if not supplied
  @return a List of blogIds
  """
  query = service.Query()
  query.feed = '/feeds/%s/blogs' % (userId)
  feed = blogger.Get(query.ToUri())
  blogIds = []
  for entry in feed.entry:
    blogIds.append(entry.id.text.split("-")[-1])
  return blogIds

def getBlogEntries(blogger, blogId, pubMin=PUBDATE_MIN, pubMax=PUBDATE_MAX):
  """
  Returns all posts published between pubMin and pubMax for the specified
  blogId.
  @param blogger the reference to the authenticated service
  @param blogId the id of the blog
  @param pubMin the minimum publish date to retrieve
  @param pubMax the maximum publish date to retrieve
  @return a List of entry objects
  """
  query = service.Query()
  query.feed = '/feeds/%s/posts/default' % (blogId)
  query.published_min = pubMin
  query.published_max = pubMax
  query.max_results = 1000
  feed = blogger.Get(query.ToUri())
  entries = []
  for entry in feed.entry:
    entries.append(entry)
  return entries

def minAfter(ts):
  """
  Returns a time stamp which represents one minute after specified timestamp.
  A timestamp looks like this: 2005-01-01T00:00:00
  @param ts the specified timestamp.
  @return the timestamp representing a minute after ts.
  """
  parts = ts.split(":")
  nextMin = str(int(parts[1]) + 1).zfill(2)
  parts[1] = nextMin
  print ":".join(parts)
  return ":".join(parts)

### ============== Pygments ==================

def askLexer(text):
  """
  Display the text on console and ask the user what it is. Must be one
  of the patterns in LEXER_MAPPINGS.
  @param text the text to display to user
  @return the Lexer instance from LEXER_MAPPINGS
  """
  print '==== text ===='
  print text
  print '==== /text ===='
  while 1:
    ctype = raw_input("Specify type (" + ", ".join(LEXER_NAMES) + "): ")
    try:
      lexer = LEXER_MAPPINGS[ctype]
      break
    except KeyError:
      print 'Sorry, invalid type, try again'
  return lexer

def guessLexer(text):
  """
  Uses the file name metadata or shebang info on the first line, if present,
  to "guess" the lexer required for colorizing.
  @param text the text to analyze
  @return the Lexer instance from LEXER_MAPPINGS
  """
  firstline = text[0:text.find("\n")]
  match = re.search(r'[/|.]([a-zA-Z]+)$', firstline)
  if (match):
    guess = match.group(1)
    try:
      return LEXER_MAPPINGS[guess]
    except KeyError:
      return askLexer(text)
  else:
    return askLexer(text)

def colorize(text, lexer):
  """
  Calls the pygments API to colorize text with appropriate defaults (for my
  use) for inclusion in an HTML page.
  @param text the input text
  @param lexer the Lexer to use
  @return the colorized text
  """
  if (lexer == None):
    return "\n".join(["<pre class=\"hll\">", text, "</pre>"])
  else:
    return highlight(text, lexer, HTML_FORMATTER)

### ====================== Local IO =================

def createPreview(downloadFilename, previewFilename):
  """
  Reads the download file body and colorizes it as applicable. Writes it out
  to an HTML file (wrapped in a stylesheet, etc) so it can be previewed.
  Returns True if there was text in the download file that could be colorized,
  False otherwise.
  @param downloadFilename the name of the download file.
  @param previewFilename the name of the preview file.
  @return True if processing happened, else False.
  """
  previewFile = open(previewFilename, 'wb')
  previewFile.write("""
<html><head><style>
.linenos {background-color: #cccccc }
.hll { background-color: #ffffcc }
.c { color: #008800; font-style: italic } /* Comment */
.err { border: 1px solid #FF0000 } /* Error */
.k { color: #AA22FF; font-weight: bold } /* Keyword */
.o { color: #666666 } /* Operator */
.cm { color: #008800; font-style: italic } /* Comment.Multiline */
.cp { color: #008800 } /* Comment.Preproc */
.c1 { color: #008800; font-style: italic } /* Comment.Single */
.cs { color: #008800; font-weight: bold } /* Comment.Special */
.gd { color: #A00000 } /* Generic.Deleted */
.ge { font-style: italic } /* Generic.Emph */
.gr { color: #FF0000 } /* Generic.Error */
.gh { color: #000080; font-weight: bold } /* Generic.Heading */
.gi { color: #00A000 } /* Generic.Inserted */
.go { color: #808080 } /* Generic.Output */
.gp { color: #000080; font-weight: bold } /* Generic.Prompt */
.gs { font-weight: bold } /* Generic.Strong */
.gu { color: #800080; font-weight: bold } /* Generic.Subheading */
.gt { color: #0040D0 } /* Generic.Traceback */
.kc { color: #AA22FF; font-weight: bold } /* Keyword.Constant */
.kd { color: #AA22FF; font-weight: bold } /* Keyword.Declaration */
.kn { color: #AA22FF; font-weight: bold } /* Keyword.Namespace */
.kp { color: #AA22FF } /* Keyword.Pseudo */
.kr { color: #AA22FF; font-weight: bold } /* Keyword.Reserved */
.kt { color: #00BB00; font-weight: bold } /* Keyword.Type */
.m { color: #666666 } /* Literal.Number */
.s { color: #BB4444 } /* Literal.String */
.na { color: #BB4444 } /* Name.Attribute */
.nb { color: #AA22FF } /* Name.Builtin */
.nc { color: #0000FF } /* Name.Class */
.no { color: #880000 } /* Name.Constant */
.nd { color: #AA22FF } /* Name.Decorator */
.ni { color: #999999; font-weight: bold } /* Name.Entity */
.ne { color: #D2413A; font-weight: bold } /* Name.Exception */
.nf { color: #00A000 } /* Name.Function */
.nl { color: #A0A000 } /* Name.Label */
.nn { color: #0000FF; font-weight: bold } /* Name.Namespace */
.nt { color: #008000; font-weight: bold } /* Name.Tag */
.nv { color: #B8860B } /* Name.Variable */
.ow { color: #AA22FF; font-weight: bold } /* Operator.Word */
.w { color: #bbbbbb } /* Text.Whitespace */
.mf { color: #666666 } /* Literal.Number.Float */
.mh { color: #666666 } /* Literal.Number.Hex */
.mi { color: #666666 } /* Literal.Number.Integer */
.mo { color: #666666 } /* Literal.Number.Oct */
.sb { color: #BB4444 } /* Literal.String.Backtick */
.sc { color: #BB4444 } /* Literal.String.Char */
.sd { color: #BB4444; font-style: italic } /* Literal.String.Doc */
.s2 { color: #BB4444 } /* Literal.String.Double */
.se { color: #BB6622; font-weight: bold } /* Literal.String.Escape */
.sh { color: #BB4444 } /* Literal.String.Heredoc */
.si { color: #BB6688; font-weight: bold } /* Literal.String.Interpol */
.sx { color: #008000 } /* Literal.String.Other */
.sr { color: #BB6688 } /* Literal.String.Regex */
.s1 { color: #BB4444 } /* Literal.String.Single */
.ss { color: #B8860B } /* Literal.String.Symbol */
.bp { color: #AA22FF } /* Name.Builtin.Pseudo */
.vc { color: #B8860B } /* Name.Variable.Class */
.vg { color: #B8860B } /* Name.Variable.Global */
.vi { color: #B8860B } /* Name.Variable.Instance */
.il { color: #666666 } /* Literal.Number.Integer.Long */
</style></head><body>
    """)
  downloadFile = open(downloadFilename, 'rb')
  pres = []
  inPreBlock = False
  processed = False
  for line in downloadFile:
    line = line[:-1]
    if (line == "<pre>"):
      inPreBlock = True
      processed = True
    elif (line == "</pre>"):
      pretext = unescape("\n".join(pres))
      colorized = colorize(pretext, guessLexer(pretext))
      previewFile.write(colorized + "\n")
      pres = []
      inPreBlock = False
    else:
      if (inPreBlock):
        pres.append(line)
      else:
        previewFile.write(line + "\n")
  previewFile.write("""
</body></html>
    """)
  downloadFile.close()
  previewFile.close()
  return processed

def createUpload(previewFilename, uploadFilename):
  """
  Replace the body of the inputXml file with the body of the preview HTML
  file and produce the outputXml file suitable for uploading to Blogger site.
  @param previewFilename the name of the preview file.
  @param uploadFilename the name of the upload file.
  """
  previewFile = open(previewFilename, 'rb')
  uploadFile = open(uploadFilename, 'wb')
  inBody = False
  for line in previewFile.readlines():
    line = line[:-1]
    if (line.endswith("<body>")):
      inBody = True
    elif (line.startswith("</body>")):
      inBody = False
    else:
      if (inBody):
        uploadFile.write(line + "\n")
  previewFile.close()
  uploadFile.close()

def getCatalogDirs():
  """
  Returns all directories under DATA_DIR in which a catalog.txt file is found.
  @return a List of directory names.
  """
  catalogDirs = []
  def callback(arg, directory, files):
    for file in files:
      if (file == arg):
        catalogDirs.append(directory)
  os.path.walk(DATA_DIR, callback, "catalog.txt")
  return catalogDirs

### ============== called from command line ========================

def usage():
  """ Prints the usage """
  print "Usage: %s download|upload|preview|clean" % (sys.argv[0])
  print "\tdownload -- download remote blog(s) into local directory"
  print "\tupload -- upload colorized post(s) back to blog"
  print "\tpreview -- build html for local preview before upload"
  print "\tclean -- clean up the data directory"
  sys.exit(-1)

def clean():
  """ Clean up the data directory for a new run """
  if (os.path.exists(DATA_DIR)):
    yorn = raw_input("Deleting directory: %s. Proceed (y/n)? " % (DATA_DIR))
    if (yorn == 'y'):
      print "Deleting directory: %s" % (DATA_DIR)
      shutil.rmtree(DATA_DIR)

def download(blogger):
  """
  Downloads one or more blogs for the specified user. The posts are stored
  under ${DATA_DIR}/downloads/${blogId} as text files, one per post, each
  containing the body (the content of the atom entry) of a single post.
  @param blogger the reference to the authenticated blogger service
  """
  downloadDir = os.sep.join([DATA_DIR, "downloads"])
  if (not os.path.exists(downloadDir)):
    os.makedirs(downloadDir)
  blogIds = getBlogIds(blogger)
  for blogId in blogIds:
    downloadBlogDir = os.sep.join([downloadDir, blogId])
    if (not os.path.exists(downloadBlogDir)):
      os.makedirs(downloadBlogDir)
    catalog = open(os.sep.join([downloadBlogDir, "catalog.txt"]), 'wb')
    blogEntries = getBlogEntries(blogger, blogId)
    for blogEntry in blogEntries:
      id = blogEntry.id.text.split("-")[-1]
      title = blogEntry.title.text
      published = blogEntry.published.text
      publishUrl = blogEntry.GetEditLink().href
      catalog.write("|".join([id, published, publishUrl, title, "\n"]))
      print ">>> Retrieving [%s] to %s.txt" % (title, id)
      pfile = open(os.sep.join([downloadBlogDir, id + ".txt"]), 'wb')
      pfile.write(blogEntry.content.text)
      pfile.close()
    catalog.close()

def preview():
  """
  Runs through the downloaded post files, extracts and colorizes the body, then
  wraps it in an HTML template for local viewing in a browser. Since this has
  a manual component (askLexer), the method checks to see if a preview file has
  already been created (so this can be run multiple times without deleting work
  done previously). If the preview generation is not successful (due to a code
  bug somewhere), then you need to manually delete the preview file.
  """
  catalogDirs = getCatalogDirs()
  for catalogDir in catalogDirs:
    blogId = catalogDir.split("/")[-1]
    catalog = open(os.path.join(catalogDir, "catalog.txt"), 'rb')
    for catline in catalog:
      catline = catline[:-1]
      id = catline.split("|")[0]
      previewDir = os.sep.join([DATA_DIR, "preview", blogId])
      if (not os.path.exists(previewDir)):
        os.makedirs(previewDir)
      uploadDir = os.sep.join([DATA_DIR, "uploads", blogId])
      if (not os.path.exists(uploadDir)):
        os.makedirs(uploadDir)
      downloadFile = os.path.join(catalogDir, id + ".txt")
      previewFile = os.path.join(previewDir, id + ".html")
      uploadFile = os.path.join(uploadDir, id + ".txt")
      if (os.path.exists(downloadFile) and os.path.exists(previewFile)):
        print ">>> Skipping file: %s, already processed" % (downloadFile)
        continue
      else:
        print ">>> Processing file: %s" % (downloadFile)
      processed = createPreview(downloadFile, previewFile)
      if (processed):
        createUpload(previewFile, uploadFile)
    catalog.close()
    
def upload(blogger):
  """
  Runs through the catalog file and extracts the data from the upload text
  file. We then get the Blogger entry, update the body with the text from the
  upload file, and HTTP PUT it back to Blogger.

  NOTE: This does not work at the moment, I have a question posted on
  the gdata-python-client-library-contributors list:
  http://groups.google.com/group/gdata-python-client-library-contributors/browse_thread/thread/8a7f6f94873921f1
  I finally used the Java client to do the upload.

  @param blogger a reference to the authenticated blogger service.
  """
  catalogDirs = getCatalogDirs()
  for catalogDir in catalogDirs:
    blogId = catalogDir.split("/")[-1]
    catalog = open(os.path.join(catalogDir, "catalog.txt"), 'rb')
    for catline in catalog:
      (id, pubdate, pubUrl, title, junk) = catline.split("|")
      # get data to upload
      uploadFilename = os.sep.join([DATA_DIR, "uploads", blogId, id + ".txt"])
      uploadFile = open(uploadFilename, 'rb')
      uploadData = uploadFile.read()[:-1]
      uploadFile.close()
      # retrieve entry
      entries = getBlogEntries(blogger, blogId, pubdate, minAfter(pubdate))
      if (len(entries) != 1):
        print "Too few or too many entries found for date range, upload skipped"
        return
      entry = entries[0]
      entry.content = atom.Content("html", uploadData)
      print entry
      print ">>> Uploading file: %s.txt" % (id)
      response = blogger.Put(entry, pubUrl)
      print response

def main():
  if (len(sys.argv) < 2):
    usage()
  if (not os.path.exists(DATA_DIR)):
    os.makedirs(DATA_DIR)
  if (sys.argv[1] == 'download' or sys.argv[1] == 'upload'):
    blogger = authenticate()
    if (sys.argv[1] == 'download'):
      download(blogger)
    else:
      upload(blogger)
  elif (sys.argv[1] == 'preview' or sys.argv[1] == 'clean'):
    if (sys.argv[1] == 'preview'):
      preview()
    else:
      clean()
  else:
      usage()

if __name__ == "__main__":
  main()

The code above also uses some custom Pygments lexer classes I wrote, one for Lucli (described here) and another really simple one for Unix console commands. These classes are shown below:

#!/usr/bin/python
# Source: MyPygmentLexers.py
from pygments.lexer import RegexLexer, bygroups
from pygments.token import *

# All my custom Lexers here
class LucliLexer(RegexLexer):
  """
  Simple Lucli command line lexer based on RegexLexer. We just define various
  kinds of tokens for coloration in the tokens dictionary below.
  """
  name = 'Lucli'
  aliases = ['lucli']
  filenames = ['*.lucli']

  tokens = {
    'root' : [
      # our startup header
      (r'Lucene CLI.*$', Generic.Heading),
      # the prompt
      (r'lucli> ', Generic.Prompt),
      # keywords that appear by themselves in a single line
      (r'(analyzer|help|info|list|orient|quit|similarity|terms)$', Keyword),
      # keywords followed by arguments in single line
      (r'(analyzer|count|explain|get|index|info|list|optimize|'
       r'orient|quit|remove|search|similarity|terms|tokens)(\s+)(.*?)$',
       bygroups(Keyword, Text, Generic.Strong)),
      # rest of the text
      (r'.*\n', Text)
    ]
  }

class UnixConsoleLexer(RegexLexer):
  name = "UnixConsole"
  aliases = ['UnixConsole']
  filenames = ['*.sh']

  tokens = {
    'root' : [
      (r'.*?\$', Generic.Prompt),
      (r'.*\n', Text)
    ]
  }

To use the bloggerclient.py script, first update the block of globals beginning with DATA_DIR with your own values. The script takes a few action parameters, similar to targets in an Ant script. To download all the posts, invoke the following command. This downloads the posts, one per file, and also writes a catalog file (catalog.txt) into your ${DATA_DIR}/downloads/${blogId} directory.

prompt$ ./bloggerclient.py download
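
For reference, each line that download() writes to catalog.txt consists of pipe-separated fields with a trailing pipe; the placeholders below just show the layout, not real values:

<postId>|<published-timestamp>|<post-edit-url>|<post-title>|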

To process the downloaded posts, invoke bloggerclient.py with preview. This attempts to colorize freestanding <pre> blocks in the input text with Pygments, using LEXER_MAPPINGS to fire the appropriate Lexer for a given block of code. It tries to figure out the programming language from the first line of the code (where I usually have a Source: comment); otherwise, it displays the block and asks you to choose the appropriate Lexer. The colorized code is written to an HTML file with the appropriate stylesheet inlined so you can look at it before deciding to upload, and is also written out separately so it's ready for upload.

prompt$ ./bloggerclient.py preview
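
The colorize() function above is just a thin wrapper over the Pygments highlight API. Here is a minimal standalone sketch of that step (not part of bloggerclient.py), using a small Python snippet as a stand-in for the escaped <pre> contents:

#!/usr/bin/python
# Sketch of the colorizing step: unescape the <pre> contents, pick a lexer,
# and render HTML styled by the classes in the preview stylesheet above.
from pygments import highlight
from pygments.lexers import PythonLexer
from pygments.formatters import HtmlFormatter
from xml.sax.saxutils import unescape

pretext = unescape("if x &lt; 10:\n    print x")
print highlight(pretext, PythonLexer(), HtmlFormatter(style='emacs', linenos='table'))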

During the preview process, I discovered a bug in the Scala Lexer which causes it to hang indefinitely, presumably inside a regex lookup. I opened a bug for it. A quick workaround is to use the Java Lexer instead; it does most of what the Scala Lexer needs to do.

Finally, to upload the posts, the idea was to invoke bloggerclient.py with the upload option. However, I could not get that to work. I suspect it's either a bug in the GData module, since other people have noticed it too, or something to do with my version of httplib, since I could not get my original httplib-based version to HTTP PUT to Blogger either. I have posted my problem on the gdata-python-client-library-contributors list; we'll see what comes of that.

Since I just wanted to be done with this stuff, and because I already had the colorized versions of the posts on local disk at this stage, I decided to use the Java GData API to upload, which happily succeeded. Here is the code, written out in the form of JUnit tests so it can be run easily from the command line using mvn test.

// Source: src/main/java/com/mycompany/blogger/client/GDataBloggerUploadTest.java
package com.mycompany.blogger.client;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.net.URL;
import java.text.SimpleDateFormat;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.commons.io.FileUtils;
import org.apache.commons.lang.StringUtils;
import org.junit.Test;

import com.google.gdata.client.GoogleService;
import com.google.gdata.client.Query;
import com.google.gdata.data.Content;
import com.google.gdata.data.DateTime;
import com.google.gdata.data.Entry;
import com.google.gdata.data.Feed;
import com.google.gdata.data.HtmlTextConstruct;
import com.google.gdata.data.TextContent;

/**
 * Simple test case to upload locally updated blogger pages back to Blogger.
 */
public class GDataBloggerUploadTest {

  private static final String BLOGGER_EMAIL = "your_email@company.com";
  private static final String BLOGGER_PASSWD = "your_blogger_password";
  private static final String DOWNLOAD_DIR = "/path/to/download/dir";
  private static final String UPLOAD_DIR = "/path/to/upload/dir";
  private static final String FEED_URL = "http://blogger.feed.url/";
  private static final String BLOG_ID = "your_blog_id";
  
  private static final SimpleDateFormat TS_FORMATTER = new SimpleDateFormat(
      "yyyy-MM-dd'T'HH:mm:ss");
  
//  @Test
  public void testUploadByPubdate() throws Exception {
    GoogleService service = new GoogleService("blogger", "salmonrun-bloggerclient-j-0.1");
    // login
    service.setUserCredentials(BLOGGER_EMAIL, BLOGGER_PASSWD);
    // read catalog file
    BufferedReader catalogReader = new BufferedReader(new FileReader(
      DOWNLOAD_DIR + "/catalog.txt"));
    String catalogLine;
    // read through the catalog file for metadata
    while ((catalogLine = catalogReader.readLine()) != null) {
      String[] cols = StringUtils.split(catalogLine, "|");
      String id = cols[0];
      String pubDate = cols[1];
      String pubUrl = cols[2];
      String title = cols[3];
      // check to see if the file needs to be uploaded (if not available,
      // then it does not need to be uploaded).
      File uploadFile = new File(UPLOAD_DIR + "/" + id + ".txt");
      if (! uploadFile.exists()) {
        System.out.println("Skipping post (" + id + "): " + title + ", no changes");
        continue;
      }
      System.out.println("Uploading post (" + id + "): " + title);
      // suck out all the data into a data buffer
      BufferedReader uploadReader = new BufferedReader(new FileReader(
        UPLOAD_DIR + "/" + id + ".txt"));
      StringBuilder uploadDataBuffer = new StringBuilder();
      String uploadLine;
      while ((uploadLine = uploadReader.readLine()) != null) {
        uploadDataBuffer.append(uploadLine).append("\n");
      }
      uploadReader.close();
      // retrieve the post
      long pubMinAsLong = TS_FORMATTER.parse(pubDate).getTime();
      DateTime pubMin = new DateTime(pubMinAsLong);
      DateTime pubMax = new DateTime(pubMinAsLong + 3600000L); // 1 hour after
      URL feedUrl = new URL(FEED_URL);
      Query query = new Query(feedUrl);
      query.setPublishedMin(pubMin);
      query.setPublishedMax(pubMax);
      Feed result = service.query(query, Feed.class);
      List<Entry> entries = result.getEntries();
      if (entries.size() != 1) {
        System.out.println("Invalid number of entries: " + entries.size() + ", skip: " + id);
        continue;
      }
      Entry entry = entries.get(0);
      // then stick the updated content into the post
      entry.setContent(new TextContent(
        new HtmlTextConstruct(uploadDataBuffer.toString())));
      // then upload
      service.update(new URL(pubUrl), entry);
      // rename them so they are not picked up next time round
      uploadFile.renameTo(new File(UPLOAD_DIR + "/" + id + ".uploaded"));
    }
    catalogReader.close();
  }
  
//  @Test
  public void testUploadAll() throws Exception {
    GoogleService service = new GoogleService("blogger", "salmonrun-bloggerclient-j-0.1");
    // login
    service.setUserCredentials(BLOGGER_EMAIL, BLOGGER_PASSWD);
    // read catalog file
    BufferedReader catalogReader = new BufferedReader(new FileReader(
      DOWNLOAD_DIR + "/catalog.txt"));
    String catalogLine;
    // read through the catalog file for metadata, and build a set of 
    // entries to upload
    Set<String> ids = new HashSet<String>();
    while ((catalogLine = catalogReader.readLine()) != null) {
      String[] cols = StringUtils.split(catalogLine, "|");
      String id = cols[0];
      // check to see if the file needs to be uploaded (if not available,
      // then it does not need to be uploaded).
      File uploadFile = new File(UPLOAD_DIR + "/" + id + ".txt");
      if (! uploadFile.exists()) {
        continue;
      }
      ids.add("tag:blogger.com,1999:blog-" + BLOG_ID + ".post-" + id);
    }
    catalogReader.close();
    System.out.println("#-entries to upload: " + ids.size());
    // now get all the posts
    URL feedUrl = new URL(FEED_URL);
    Query query = new Query(feedUrl);
    query.setPublishedMin(new DateTime(TS_FORMATTER.parse("2005-01-01T00:00:00")));
    query.setPublishedMax(new DateTime(TS_FORMATTER.parse("2009-12-31T00:00:00")));
    query.setMaxResults(1000); // I just have about 150, so this will cover everything
    Feed result = service.query(query, Feed.class);
    List<Entry> entries = result.getEntries();
    for (Entry entry : entries) {
      String id = entry.getId();
      if (! ids.contains(id)) {
        continue;
      }
      String title = entry.getTitle().getPlainText();
      // get contents to update
      String fn = id.substring(id.lastIndexOf('-') + 1);
      System.out.println(">>> Uploading entry (" + id + "): [" + title + "] from file: " + 
        fn + ".txt");
      File uploadFile = new File(UPLOAD_DIR, fn + ".txt");
      if (! uploadFile.exists()) {
        System.out.println("Upload file does not exist: " + uploadFile.toString());
        continue;
      }
      String contents = FileUtils.readFileToString(uploadFile, "UTF-8");
      if (StringUtils.trim(contents).length() == 0) {
        System.out.println("Zero bytes for " + fn + ", skipping");
        continue;
      }
      // then stick the updated content into the post
      entry.setContent(new TextContent(
        new HtmlTextConstruct(contents)));
      String publishUrl = entry.getEditLink().getHref();
      // then upload
      service.update(new URL(publishUrl), entry);
    }
  }
  
  @Test
  public void testFindEmptyBlogs() throws Exception {
    GoogleService service = new GoogleService("blogger", "salmonrun-bloggerclient-j-0.1");
    // login
    service.setUserCredentials(BLOGGER_EMAIL, BLOGGER_PASSWD);
    // get all posts
    URL feedUrl = new URL(FEED_URL);
    Query query = new Query(feedUrl);
    query.setPublishedMin(new DateTime(TS_FORMATTER.parse("2005-01-01T00:00:00")));
    query.setPublishedMax(new DateTime(TS_FORMATTER.parse("2009-12-31T00:00:00")));
    query.setMaxResults(1000); // I just have about 150, so this will cover everything
    Feed result = service.query(query, Feed.class);
    List<Entry> entries = result.getEntries();
    for (Entry entry : entries) {
      String id = entry.getId();
      String title = entry.getTitle().getPlainText();
      String content = ((TextContent) entry.getContent()).getContent().getPlainText();
      if (StringUtils.trim(content).length() == 0) {
        String postId = id.substring(id.lastIndexOf('-') + 1);
        System.out.println(postId + " (" + title + ")");
      }
    }
  }
}

The testUploadByPubdate() method tries the same approach as the Python upload() function, retrieving each post by published date and trying to update it. However, I found that some posts could not be retrieved using this strategy. I then tried the second approach, shown in testUploadAll(), which first downloads all the posts, then runs through them, applying updates to the ones that have not been updated already. This resulted in several posts just disappearing; apparently the upload did not go through completely, so I had to repeat those. The third test method, testFindEmptyBlogs(), was used to figure out which ones to send for reprocessing.
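
Note that the @Test annotations on the two upload methods are commented out in the listing above, so you would uncomment whichever one you want before a run. To run just this test class from the command line, the standard Maven Surefire flag can be used (this is stock Maven, nothing specific to the code above):

prompt$ mvn test -Dtest=GDataBloggerUploadTest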

Anyway, the Blog Beautification Project is over, at least for now. Hopefully the next time round it won't be so invasive. I hope you found the results visually appealing and this post itself interesting, at least as a case study of using the Blogger API.

In retrospect, the time I took to write the Python version using httplib and libxml2, convert it to use the gdata module, and finally write a Java version of the upload was probably about the same as (or more than) it would have taken to do the colorization manually, but it was much more fun. I haven't written much Python code lately, so it was a nice change to be able to use it again.
