Last week, I wrote about how I had to disable the salmon/orange background in my <pre> tags so that the output of the Pygments syntax colorizer rendered nicely. That left all but a couple of my posts displaying code on a plain white background, which was not what I intended when I set out on my blog beautification project. I don't have that many posts (about 150 over the last 3 years), but it is still too many to go back and convert manually (and still retain my sanity). So I figured that this would be a good chance to check out the Google Blogger API.
Since Pygments is a Python project, I decided to write my conversion script in Python. The script works in three phases. Phase 1 is the download phase, where all the posts are downloaded in one shot to local disk. Phase 2 extracts the <pre> blocks, colorizes them with Pygments, writes out an HTML page for manual review, and writes the colorized version of each post back to local disk. Phase 3 uploads the colorized posts back to Blogger. I realize it is conceptually nicer to be able to do all of this in one fell swoop, but that would not have worked for me, as I elaborate below.
Here is the full Python code. I started out not knowing about the existence of the Python GData Blogger API, so my first version used httplib for network IO and libxml2 for XML processing. I chose libxml2 for its XSLT support, but the API is a bit of a mess, so if I were to do XML processing in the future, I would probably choose minidom or elementtree instead. The Python gdata module uses httplib and elementtree under the covers, and (obviously) results in much shorter application code.
#!/usr/bin/python
# Source: bloggerclient.py
from MyPygmentLexers import *
import gdata
from gdata import service
from gdata import atom
from pygments.formatters import HtmlFormatter
from pygments import highlight
from pygments.lexers import *
from xml.sax.saxutils import unescape
import os
import os.path
import re
import shutil
import sys
# Completely dependent on my setup. Change this stuff to suit your own.
DATA_DIR = "/path/to/temporary/data/directory"
BLOGGER_EMAIL = "your_email@your_domain.com"
BLOGGER_PASSWD = "your-blogger-password"
PUBDATE_MIN = "2005-01-01T00:00:00"
PUBDATE_MAX = "2009-12-31T00:00:00"
# A map of string to lexer instances I need to colorize my blogs so far, you
# may need your own set if your language preferences are different
LEXER_MAPPINGS = {
"java" : JavaLexer(),
"xml" : XmlLexer(),
# TODO: Change back when bug 392 is fixed:
# http://dev.pocoo.org/projects/pygments/ticket/392
# "scala" : ScalaLexer(),
"scala" : JavaLexer(),
"python" : PythonLexer(),
"py" : PythonLexer(),
"jython" : PythonLexer(),
"php" : PhpLexer(),
"lisp" : CommonLispLexer(),
"cs" : CSharpLexer(),
"unix" : UnixConsoleLexer(),
"bash" : BashLexer(),
"sh" : BashLexer(),
"mysql" : MySqlLexer(),
"javascript" : JavascriptLexer(),
"css" : CssLexer(),
"jsp" : JspLexer(),
"html" : HtmlLexer(),
"properties" : IniLexer(),
"diff" : DiffLexer(),
"gp" : GnuplotLexer(),
"rb" : RubyLexer(),
"lucli" : LucliLexer(),
"text" : TextLexer(),
"none" : None
}
LEXER_NAMES = sorted(LEXER_MAPPINGS.keys(), key=str)
HTML_FORMATTER = HtmlFormatter(style='emacs', linenos='table')
### ============= GData ================
def authenticate():
"""
Uses the ClientLogin approach to log into Blogger. Once this is called
subsequent requests are automatically authenticated.
@return a reference to the blogger service
"""
blogger = service.GDataService(BLOGGER_EMAIL, BLOGGER_PASSWD)
blogger.source = 'your-useragent-here-0.1'
blogger.service = 'blogger'
blogger.account_type = 'GOOGLE'
blogger.server = 'www.blogger.com'
blogger.ProgrammaticLogin()
return blogger
def getBlogIds(blogger, userId='default'):
"""
Retrieves the blog metadata from blogger and returns the blogIds
for the given userId.
@param blogger the reference to an authenticated service
@param userId default value 'default' if not supplied
@return a List of blogIds
"""
query = service.Query()
query.feed = '/feeds/%s/blogs' % (userId)
feed = blogger.Get(query.ToUri())
blogIds = []
for entry in feed.entry:
blogIds.append(entry.id.text.split("-")[-1])
return blogIds
def getBlogEntries(blogger, blogId, pubMin=PUBDATE_MIN, pubMax=PUBDATE_MAX):
"""
Returns all posts from PUBLISHED_DATE_MIN to PUBLISHED_DATE_MAX for the
specified blogId.
@param blogger the reference to the authenticated service
@param blogId the id of the blog
@param pubMin the minimum publish date to retrieve
@param pubMax the maximum publish date to retrieve
@return a List of entry objects
"""
query = service.Query()
query.feed = '/feeds/%s/posts/default' % (blogId)
query.published_min = pubMin
query.published_max = pubMax
query.max_results = 1000
feed = blogger.Get(query.ToUri())
entries = []
for entry in feed.entry:
entries.append(entry)
return entries
def minAfter(ts):
"""
Returns a time stamp which represents one minute after specified timestamp.
A timestamp looks like this: 2005-01-01T00:00:00
@param ts the specified timestamp.
@return the timestamp representing a minute after ts.
"""
parts = ts.split(":")
nextMin = str(int(parts[1]) + 1).zfill(2)
parts[1] = nextMin
print ":".join(parts)
return ":".join(parts)
### ============== Pygments ==================
def askLexer(text):
"""
Display the text on console and ask the user what it is. Must be one
of the patterns in LEXER_MAPPINGS.
@param text the text to display to user
@return the Lexer instance from LEXER_MAPPINGS
"""
print '==== text ===='
print text
print '==== /text ===='
while 1:
ctype = raw_input("Specify type (" + ", ".join(LEXER_NAMES) + "): ")
try:
lexer = LEXER_MAPPINGS[ctype]
break
except KeyError:
print 'Sorry, invalid type, try again'
return lexer
def guessLexer(text):
"""
Uses the file name metadata or shebang info on first line if it exists
and try to "guess" the lexer that is required for colorizing.
@param text the text to analyze
@return the Lexer instance from LEXER_MAPPINGS
"""
firstline = text[0:text.find("\n")]
match = re.search(r'[/|.]([a-zA-Z]+)$', firstline)
if (match):
guess = match.group(1)
try:
return LEXER_MAPPINGS[guess]
except KeyError:
return askLexer(text)
else:
return askLexer(text)
def colorize(text, lexer):
"""
Calls the pygments API to colorize text with appropriate defaults (for my
use) for inclusion a HTML page.
@param text the input text
@lexer the Lexer to use
@return the colorized text
"""
if (lexer == None):
return "\n".join(["<pre class=\"hll\">", text, "</pre>"])
else:
return highlight(text, lexer, HTML_FORMATTER)
### ====================== Local IO =================
def createPreview(downloadFilename, previewFilename):
"""
Reads the download file body and colorizes it as applicable. Writes it out
to a HTML file (wrapped in a stylesheet, etc) so it can be previewed. If
there is text in the downloadFile that can be colorized, then returns true
else returns false.
@param downloadFilename the name of the download file.
@param previewFilename the name of the preview file.
@return True if processing happened, else False.
"""
previewFile = open(previewFilename, 'wb')
previewFile.write("""
<html><head><style>
.linenos {background-color: #cccccc }
.hll { background-color: #ffffcc }
.c { color: #008800; font-style: italic } /* Comment */
.err { border: 1px solid #FF0000 } /* Error */
.k { color: #AA22FF; font-weight: bold } /* Keyword */
.o { color: #666666 } /* Operator */
.cm { color: #008800; font-style: italic } /* Comment.Multiline */
.cp { color: #008800 } /* Comment.Preproc */
.c1 { color: #008800; font-style: italic } /* Comment.Single */
.cs { color: #008800; font-weight: bold } /* Comment.Special */
.gd { color: #A00000 } /* Generic.Deleted */
.ge { font-style: italic } /* Generic.Emph */
.gr { color: #FF0000 } /* Generic.Error */
.gh { color: #000080; font-weight: bold } /* Generic.Heading */
.gi { color: #00A000 } /* Generic.Inserted */
.go { color: #808080 } /* Generic.Output */
.gp { color: #000080; font-weight: bold } /* Generic.Prompt */
.gs { font-weight: bold } /* Generic.Strong */
.gu { color: #800080; font-weight: bold } /* Generic.Subheading */
.gt { color: #0040D0 } /* Generic.Traceback */
.kc { color: #AA22FF; font-weight: bold } /* Keyword.Constant */
.kd { color: #AA22FF; font-weight: bold } /* Keyword.Declaration */
.kn { color: #AA22FF; font-weight: bold } /* Keyword.Namespace */
.kp { color: #AA22FF } /* Keyword.Pseudo */
.kr { color: #AA22FF; font-weight: bold } /* Keyword.Reserved */
.kt { color: #00BB00; font-weight: bold } /* Keyword.Type */
.m { color: #666666 } /* Literal.Number */
.s { color: #BB4444 } /* Literal.String */
.na { color: #BB4444 } /* Name.Attribute */
.nb { color: #AA22FF } /* Name.Builtin */
.nc { color: #0000FF } /* Name.Class */
.no { color: #880000 } /* Name.Constant */
.nd { color: #AA22FF } /* Name.Decorator */
.ni { color: #999999; font-weight: bold } /* Name.Entity */
.ne { color: #D2413A; font-weight: bold } /* Name.Exception */
.nf { color: #00A000 } /* Name.Function */
.nl { color: #A0A000 } /* Name.Label */
.nn { color: #0000FF; font-weight: bold } /* Name.Namespace */
.nt { color: #008000; font-weight: bold } /* Name.Tag */
.nv { color: #B8860B } /* Name.Variable */
.ow { color: #AA22FF; font-weight: bold } /* Operator.Word */
.w { color: #bbbbbb } /* Text.Whitespace */
.mf { color: #666666 } /* Literal.Number.Float */
.mh { color: #666666 } /* Literal.Number.Hex */
.mi { color: #666666 } /* Literal.Number.Integer */
.mo { color: #666666 } /* Literal.Number.Oct */
.sb { color: #BB4444 } /* Literal.String.Backtick */
.sc { color: #BB4444 } /* Literal.String.Char */
.sd { color: #BB4444; font-style: italic } /* Literal.String.Doc */
.s2 { color: #BB4444 } /* Literal.String.Double */
.se { color: #BB6622; font-weight: bold } /* Literal.String.Escape */
.sh { color: #BB4444 } /* Literal.String.Heredoc */
.si { color: #BB6688; font-weight: bold } /* Literal.String.Interpol */
.sx { color: #008000 } /* Literal.String.Other */
.sr { color: #BB6688 } /* Literal.String.Regex */
.s1 { color: #BB4444 } /* Literal.String.Single */
.ss { color: #B8860B } /* Literal.String.Symbol */
.bp { color: #AA22FF } /* Name.Builtin.Pseudo */
.vc { color: #B8860B } /* Name.Variable.Class */
.vg { color: #B8860B } /* Name.Variable.Global */
.vi { color: #B8860B } /* Name.Variable.Instance */
.il { color: #666666 } /* Literal.Number.Integer.Long */
</style></head><body>
""")
downloadFile = open(downloadFilename, 'rb')
pres = []
inPreBlock = False
processed = False
for line in downloadFile:
line = line[:-1]
if (line == "<pre>"):
inPreBlock = True
processed = True
elif (line == "</pre>"):
pretext = unescape("\n".join(pres))
colorized = colorize(pretext, guessLexer(pretext))
previewFile.write(colorized + "\n")
pres = []
inPreBlock = False
else:
if (inPreBlock):
pres.append(line)
else:
previewFile.write(line + "\n")
previewFile.write("""
</body></html>
""")
downloadFile.close()
previewFile.close()
return processed
def createUpload(previewFilename, uploadFilename):
"""
Replace the body of the inputXml file with the body of the preview HTML
file and produce the outputXml file suitable for uploading to Blogger site.
@param previewFilename the name of the preview file.
@param uploadFilename the name of the upload file.
"""
previewFile = open(previewFilename, 'rb')
uploadFile = open(uploadFilename, 'wb')
inBody = False
for line in previewFile.readlines():
line = line[:-1]
if (line.endswith("<body>")):
inBody = True
elif (line.startswith("</body>")):
inBody = False
else:
if (inBody):
uploadFile.write(line + "\n")
previewFile.close()
uploadFile.close()
def getCatalogDirs():
"""
Returns all directories under DATA_DIR where catalog.txt directory is found
@return a List of directory names.
"""
catalogDirs = []
def callback(arg, directory, files):
for file in files:
if (file == arg):
catalogDirs.append(directory)
os.path.walk(DATA_DIR, callback, "catalog.txt")
return catalogDirs
### ============== called from command line ========================
def usage():
""" Prints the usage """
print "Usage: %s download|upload|preview|clean" % (sys.argv[0])
print "\tdownload -- download remote blog(s) into local directory"
print "\tupload -- upload colorized post(s) back to blog"
print "\tpreview -- build html for local preview before upload"
print "\tclean -- clean up the data directory"
sys.exit(-1)
def clean():
""" Clean up the data directory for a new run """
if (os.path.exists(DATA_DIR)):
yorn = raw_input("Deleting directory: %s. Proceed (y/n)? " % (DATA_DIR))
if (yorn == 'y'):
print "Deleting directory: %s" % (DATA_DIR)
shutil.rmtree(DATA_DIR)
def download(blogger):
"""
Downloads one or more blogs for the specified user. The posts are stored
under ${DATA_DIR}/downloads/${blogId} as XML files. Each XML file contains
the full atom entry element for a single post.
@param blogger the reference to the authenticated blogger service
"""
downloadDir = os.sep.join([DATA_DIR, "downloads"])
if (not os.path.exists(downloadDir)):
os.makedirs(downloadDir)
blogIds = getBlogIds(blogger)
for blogId in blogIds:
downloadBlogDir = os.sep.join([downloadDir, blogId])
if (not os.path.exists(downloadBlogDir)):
os.makedirs(downloadBlogDir)
catalog = open(os.sep.join([downloadBlogDir, "catalog.txt"]), 'wb')
blogEntries = getBlogEntries(blogger, blogId)
for blogEntry in blogEntries:
id = blogEntry.id.text.split("-")[-1]
title = blogEntry.title.text
published = blogEntry.published.text
publishUrl = blogEntry.GetEditLink().href
catalog.write("|".join([id, published, publishUrl, title, "\n"]))
print ">>> Retrieving [%s] to %s.txt" % (title, id)
pfile = open(os.sep.join([downloadBlogDir, id + ".txt"]), 'wb')
pfile.write(blogEntry.content.text)
pfile.close()
catalog.close()
def preview():
"""
Runs through the downloaded XML files, extracts and colorizes the body, then
wraps it into a HTML template for local viewing on a browser. Since this has
a manual component (askLexer), the method checks to see if a preview file has
already been created (so this can be run multiple times without deleting work
done previously). If the preview generation is not successful (due to a code
bug somewhere), then you need to manually delete the preview file.
"""
catalogDirs = getCatalogDirs()
for catalogDir in catalogDirs:
blogId = catalogDir.split("/")[-1]
catalog = open(os.path.join(catalogDir, "catalog.txt"), 'rb')
for catline in catalog:
catline = catline[:-1]
id = catline.split("|")[0]
previewDir = os.sep.join([DATA_DIR, "preview", blogId])
if (not os.path.exists(previewDir)):
os.makedirs(previewDir)
uploadDir = os.sep.join([DATA_DIR, "uploads", blogId])
if (not os.path.exists(uploadDir)):
os.makedirs(uploadDir)
downloadFile = os.path.join(catalogDir, id + ".txt")
previewFile = os.path.join(previewDir, id + ".html")
uploadFile = os.path.join(uploadDir, id + ".txt")
if (os.path.exists(downloadFile) and os.path.exists(previewFile)):
print ">>> Skipping file: %s, already processed" % (downloadFile)
continue
else:
print ">>> Processing file: %s" % (downloadFile)
processed = createPreview(downloadFile, previewFile)
if (processed):
createUpload(previewFile, uploadFile)
catalog.close()
def upload(blogger):
"""
Runs through the catalog file and extract data from the upload text
file. We then get the blogger entry and update the body with the text
from the upload file, then HTTP PUT it back to blogger.
NOTE: This does not work at the moment, I have a question posted on
the gdata-python-client-library-contributors list:
http://groups.google.com/group/gdata-python-client-library-contributors/browse_thread/thread/8a7f6f94873921f1
I finally used the Java client to do the upload.
@param blogger a reference to the authenticated blogger service.
"""
catalogDirs = getCatalogDirs()
for catalogDir in catalogDirs:
blogId = catalogDir.split("/")[-1]
catalog = open(os.path.join(catalogDir, "catalog.txt"), 'rb')
for catline in catalog:
(id, pubdate, pubUrl, title, junk) = catline.split("|")
# get data to upload
uploadFilename = os.sep.join([DATA_DIR, "uploads", blogId, id + ".txt"])
uploadFile = open(uploadFilename, 'rb')
uploadData = uploadFile.read()[:-1]
uploadFile.close()
# retrieve entry
entries = getBlogEntries(blogger, blogId, pubdate, minAfter(pubdate))
if (len(entries) != 1):
print "Too few or too many entries found for date range, upload skipped"
return
entry = entries[0]
entry.content = atom.Content("html", uploadData)
print entry
print ">>> Uploading file: %s.txt" % (id)
response = blogger.Put(entry, pubUrl)
print response
def main():
if (len(sys.argv) < 2):
usage()
if (not os.path.exists(DATA_DIR)):
os.makedirs(DATA_DIR)
if (sys.argv[1] == 'download' or sys.argv[1] == 'upload'):
blogger = authenticate()
if (sys.argv[1] == 'download'):
download(blogger)
else:
upload(blogger)
elif (sys.argv[1] == 'preview' or sys.argv[1] == 'clean'):
if (sys.argv[1] == 'preview'):
preview()
else:
clean()
else:
usage()
if __name__ == "__main__":
main()
The code above also uses some custom Pygments lexer classes I wrote, one for Lucli (described here) and another really simple one for Unix console commands. Both classes are shown below:
#!/usr/bin/python
# Source: MyPygmentLexers.py
from pygments.lexer import RegexLexer, bygroups
from pygments.token import *
# All my custom Lexers here
class LucliLexer(RegexLexer):
"""
Simple Lucli command line lexer based on RegexLexer. We just define various
kinds of tokens for coloration in the tokens dictionary below.
"""
name = 'Lucli'
aliases = ['lucli']
filenames = ['*.lucli']
tokens = {
'root' : [
# our startup header
(r'Lucene CLI.*$', Generic.Heading),
# the prompt
(r'lucli> ', Generic.Prompt),
# keywords that appear by themselves in a single line
(r'(analyzer|help|info|list|orient|quit|similarity|terms)$', Keyword),
# keywords followed by arguments in single line
(r'(analyzer|count|explain|get|index|info|list|optimize|'
r'orient|quit|remove|search|similarity|terms|tokens)(\s+)(.*?)$',
bygroups(Keyword, Text, Generic.Strong)),
# rest of the text
(r'.*\n', Text)
]
}
class UnixConsoleLexer(RegexLexer):
name = "UnixConsole"
aliases = ['UnixConsole']
filenames = ['*.sh']
tokens = {
'root' : [
(r'.*?\$', Generic.Prompt),
(r'.*\n', Text)
]
}
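If you want to sanity-check one of these custom lexers outside of the blog workflow, a short script along the following lines works. This is a minimal sketch, assuming MyPygmentLexers.py is on the Python path; the sample input and the noclasses formatter option are just illustrative.
#!/usr/bin/python
# Quick check of a custom lexer; the sample input below is made up.
from pygments import highlight
from pygments.formatters import HtmlFormatter
from MyPygmentLexers import UnixConsoleLexer

sample = "prompt$ ls -l\ntotal 0\n"
# noclasses=True inlines the styles, so the HTML can be eyeballed without a stylesheet
print(highlight(sample, UnixConsoleLexer(), HtmlFormatter(noclasses=True)))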
To use the bloggerclient.py script, first update the block of globals (DATA_DIR, BLOGGER_EMAIL, and so on) with your own values. The script takes a few action parameters, similar to an Ant script. To download all the posts, invoke the following command. This will download the posts, one per file, and write a catalog file (catalog.txt) into your ${DATA_DIR}/downloads/${blogId} directory.
prompt$ ./bloggerclient.py download
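The catalog file is what the later phases key off. Each line is pipe-delimited: post id, published timestamp, the post's edit URL, and the title, with a trailing pipe (see the download() function above). The values below are made up, just to show the shape of a line:
1234567890123456|2009-01-15T08:30:00.000-08:00|http://www.blogger.com/feeds/.../1234567890123456|My Post Title|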
To process the downloaded posts, invoke bloggerclient.py with preview. This will attempt to colorize freestanding <pre> blocks in the input text with Pygments, using LEXER_MAPPINGS to pick the appropriate Lexer for a given block of code. It tries to figure out the programming language from the first line of the code (where I usually have a Source: comment); otherwise, it displays the block and asks you to choose the appropriate Lexer. The colorized code is written to an HTML file with the stylesheet inlined so you can look at it before deciding to upload it, and is also written out in a form that is ready for upload.
prompt$ ./bloggerclient.py preview
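The language guess is done by guessLexer() above, which just looks for an extension-like suffix at the end of the first line. For example, with a made-up first line:
import re
firstline = "# Source: bloggerclient.py"
match = re.search(r'[/|.]([a-zA-Z]+)$', firstline)
print(match.group(1))  # prints "py", which LEXER_MAPPINGS resolves to PythonLexer()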
During the preview process, I discovered a bug in the Scala Lexer that causes it to hang indefinitely, presumably inside a regex lookup. I opened a bug for this. A quick workaround, however, is to use the Java Lexer instead; it does most of what the Scala Lexer needs to do.
Finally, the idea was to upload the posts by invoking bloggerclient.py with the upload option. However, I could not get that to work. I suspect it is either a bug in the GData module, since other people have noticed it too, or something to do with my version of httplib, since I could not get my original httplib-based version to HTTP PUT to Blogger either. I have posted my problem on the gdata-python-client-library-contributors list; we'll see what comes of that.
Since I just wanted to be done with this stuff, and because I already had the colorized versions of the posts on local disk at this stage, I decided to use the Java GData API to upload, which happily succeeded. Here is the code, written out in the form of JUnit tests so it can be run easily from the command line using mvn test.
// Source: src/main/java/com/mycompany/blogger/client/GDataBloggerUploadTest.java
package com.mycompany.blogger.client;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.net.URL;
import java.text.SimpleDateFormat;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.commons.io.FileUtils;
import org.apache.commons.lang.StringUtils;
import org.junit.Test;
import com.google.gdata.client.GoogleService;
import com.google.gdata.client.Query;
import com.google.gdata.data.Content;
import com.google.gdata.data.DateTime;
import com.google.gdata.data.Entry;
import com.google.gdata.data.Feed;
import com.google.gdata.data.HtmlTextConstruct;
import com.google.gdata.data.TextContent;
/**
* Simple test case to upload locally updated blogger pages back to Blogger.
*/
public class GDataBloggerUploadTest {
private static final String BLOGGER_EMAIL = "your_email@company.com";
private static final String BLOGGER_PASSWD = "your_blogger_password";
private static final String DOWNLOAD_DIR = "/path/to/download/dir";
private static final String UPLOAD_DIR = "/path/to/upload/dir";
private static final String FEED_URL = "http://blogger.feed.url/";
private static final String BLOG_ID = "your_blog_id";
private static final SimpleDateFormat TS_FORMATTER = new SimpleDateFormat(
"yyyy-MM-dd'T'HH:mm:ss");
// @Test
public void testUploadByPubdate() throws Exception {
GoogleService service = new GoogleService("blogger", "salmonrun-bloggerclient-j-0.1");
// login
service.setUserCredentials(BLOGGER_EMAIL, BLOGGER_PASSWD);
// read catalog file
BufferedReader catalogReader = new BufferedReader(new FileReader(
DOWNLOAD_DIR + "/catalog.txt"));
String catalogLine;
// read through the catalog file for metadata
while ((catalogLine = catalogReader.readLine()) != null) {
String[] cols = StringUtils.split(catalogLine, "|");
String id = cols[0];
String pubDate = cols[1];
String pubUrl = cols[2];
String title = cols[3];
// check to see if the file needs to be uploaded (if not available,
// then it does not need to be uploaded).
File uploadFile = new File(UPLOAD_DIR + "/" + id + ".txt");
if (! uploadFile.exists()) {
System.out.println("Skipping post (" + id + "): " + title + ", no changes");
continue;
}
System.out.println("Uploading post (" + id + "): " + title);
// suck out all the data into a data buffer
BufferedReader uploadReader = new BufferedReader(new FileReader(
UPLOAD_DIR + "/" + id + ".txt"));
StringBuilder uploadDataBuffer = new StringBuilder();
String uploadLine;
while ((uploadLine = uploadReader.readLine()) != null) {
uploadDataBuffer.append(uploadLine).append("\n");
}
uploadReader.close();
// retrieve the post
long pubMinAsLong = TS_FORMATTER.parse(pubDate).getTime();
DateTime pubMin = new DateTime(pubMinAsLong);
DateTime pubMax = new DateTime(pubMinAsLong + 3600000L); // 1 hour after
URL feedUrl = new URL(FEED_URL);
Query query = new Query(feedUrl);
query.setPublishedMin(pubMin);
query.setPublishedMax(pubMax);
Feed result = service.query(query, Feed.class);
List<Entry> entries = result.getEntries();
if (entries.size() != 1) {
System.out.println("Invalid number of entries: " + entries.size() + ", skip: " + id);
continue;
}
Entry entry = entries.get(0);
// then stick the updated content into the post
entry.setContent(new TextContent(
new HtmlTextConstruct(uploadDataBuffer.toString())));
// then upload
service.update(new URL(pubUrl), entry);
// rename them so they are not picked up next time round
uploadFile.renameTo(new File(UPLOAD_DIR + "/" + id + ".uploaded"));
}
catalogReader.close();
}
// @Test
public void testUploadAll() throws Exception {
GoogleService service = new GoogleService("blogger", "salmonrun-bloggerclient-j-0.1");
// login
service.setUserCredentials(BLOGGER_EMAIL, BLOGGER_PASSWD);
// read catalog file
BufferedReader catalogReader = new BufferedReader(new FileReader(
DOWNLOAD_DIR + "/catalog.txt"));
String catalogLine;
// read through the catalog file for metadata, and build a set of
// entries to upload
Set<String> ids = new HashSet<String>();
while ((catalogLine = catalogReader.readLine()) != null) {
String[] cols = StringUtils.split(catalogLine, "|");
String id = cols[0];
// check to see if the file needs to be uploaded (if not available,
// then it does not need to be uploaded).
File uploadFile = new File(UPLOAD_DIR + "/" + id + ".txt");
if (! uploadFile.exists()) {
continue;
}
ids.add("tag:blogger.com,1999:blog-" + BLOG_ID + ".post-" + id);
}
catalogReader.close();
System.out.println("#-entries to upload: " + ids.size());
// now get all the posts
URL feedUrl = new URL(FEED_URL);
Query query = new Query(feedUrl);
query.setPublishedMin(new DateTime(TS_FORMATTER.parse("2005-01-01T00:00:00")));
query.setPublishedMax(new DateTime(TS_FORMATTER.parse("2009-12-31T00:00:00")));
query.setMaxResults(1000); // I just have about 150, so this will cover everything
Feed result = service.query(query, Feed.class);
List<Entry> entries = result.getEntries();
for (Entry entry : entries) {
String id = entry.getId();
if (! ids.contains(id)) {
continue;
}
String title = entry.getTitle().getPlainText();
// get contents to update
String fn = id.substring(id.lastIndexOf('-') + 1);
System.out.println(">>> Uploading entry (" + id + "): [" + title + "] from file: " +
fn + ".txt");
File uploadFile = new File(UPLOAD_DIR, fn + ".txt");
if (! uploadFile.exists()) {
System.out.println("Upload file does not exist: " + uploadFile.toString());
continue;
}
String contents = FileUtils.readFileToString(uploadFile, "UTF-8");
if (StringUtils.trim(contents).length() == 0) {
System.out.println("Zero bytes for " + fn + ", skipping");
continue;
}
// then stick the updated content into the post
entry.setContent(new TextContent(
new HtmlTextConstruct(contents)));
String publishUrl = entry.getEditLink().getHref();
// then upload
service.update(new URL(publishUrl), entry);
}
}
@Test
public void testFindEmptyBlogs() throws Exception {
GoogleService service = new GoogleService("blogger", "salmonrun-bloggerclient-j-0.1");
// login
service.setUserCredentials(BLOGGER_EMAIL, BLOGGER_PASSWD);
// get all posts
URL feedUrl = new URL(FEED_URL);
Query query = new Query(feedUrl);
query.setPublishedMin(new DateTime(TS_FORMATTER.parse("2005-01-01T00:00:00")));
query.setPublishedMax(new DateTime(TS_FORMATTER.parse("2009-12-31T00:00:00")));
query.setMaxResults(1000); // I just have about 150, so this will cover everything
Feed result = service.query(query, Feed.class);
List<Entry> entries = result.getEntries();
for (Entry entry : entries) {
String id = entry.getId();
String title = entry.getTitle().getPlainText();
String content = ((TextContent) entry.getContent()).getContent().getPlainText();
if (StringUtils.trim(content).length() == 0) {
String postId = id.substring(id.lastIndexOf('-') + 1);
System.out.println(postId + " (" + title + ")");
}
}
}
}
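I left the @Test annotations on the two upload methods commented out; uncomment the one you want to run and invoke mvn test. With a standard Surefire setup you should also be able to restrict the run to just this class, something like:
prompt$ mvn test -Dtest=GDataBloggerUploadTest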
The testUploadByPubdate() method tries the same approach as the Python upload() function, retrieving each post by published date and trying to update it. However, I found that some posts could not be retrieved using this strategy. I then tried the second approach, shown in testUploadAll(), which first downloads all the posts, then runs through them, applying updates to the ones that have not been updated already. This resulted in several posts just disappearing; apparently the upload did not go through completely, so I had to repeat it for those posts. The third test method, testFindEmptyBlogs(), was written to figure out which ones to send back for reprocessing.
Anyway, the Blog Beautification Project is over, at least for now. Hopefully the next time round it won't be so invasive. I hope you found the results visually appealing and this post itself interesting, at least as a case study of using the Blogger API.
In retrospect, the time I took to write the Python version using httplib and libxml2, convert it to the gdata module, and finally write a Java version of the upload was probably about the same as, or more than, it would have taken me to do the colorization manually, but it was much more fun. I haven't written much Python code lately, so it was a nice change to be able to use it again.