Sunday, April 01, 2007

Building a Tag Cloud with Python

I first saw a Tag Cloud on CNET News about two years ago, when I worked at CNET. My initial reaction was "Wow! This is soooo cool!". I was curious about how they did it, but since I was not part of the news team, I had no idea where to look for the code. I asked around and learned that it was a feed from the data warehouse guys, and that all the News site did was pop this component into its pages. Not knowing much about the data warehousing team at that point, and not being curious enough to start, I mentally filed this away as a possible "to-do" for the future.

Since then, I have seen Tag Clouds on various web sites, and I have always found them quite useful as a consolidated indicator of the content that a site offers. Apparently they are more ubiquitous than I thought, since Jeffrey Zeldman thinks that Tag Clouds are the new mullets, because they seem to be everywhere. More recently, however, some of us got to talking about link clouds in general, so I thought I would revisit this subject.

This time around, I found Pete Freitag's How to make a Tag Cloud article, which details one very simple way of building a tag cloud. I used these directions to write a Python script that takes a pipe-delimited text file containing the tags on my blog and their frequency and produces the HTML for a tag cloud component that I can then insert into my page. I had initially hoped to experiment a little with the presentation, but I ended up quite happy with what I got. So there is nothing really new here, except that it's a Python implementation instead of a CFML implementation.
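To make the input format concrete: each line of the tag file holds a tag and its count, separated by a pipe. Here is a small sketch of parsing such lines. The sample tags and counts are made up for illustration, and parse_tag_lines is a hypothetical helper name, not something from Pete Freitag's article.

```python
# Hypothetical helper: turn "tag|count" lines into (tag, count) tuples.
def parse_tag_lines(lines):
    tags = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # split on the last pipe only, so the count is always the final field
        tag, count = line.rsplit("|", 1)
        tags.append((tag, int(count)))
    return tags

# Made-up sample data in the pipe-delimited format.
sample = ["python|12", "lucene|7", "web services|3"]
print(parse_tag_lines(sample))
```

The same function works on an open file object, since iterating over a file yields its lines.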

Here is the Python code I wrote. The code is commented and should not be too hard to follow. As mentioned earlier, it uses Pete Freitag's algorithm, which is nicely described in his blog article, so refer to it if you don't understand something.

#!/usr/bin/python
import string

def main():
  # get the list of tags and their frequency from input file
  taglist = getTagListSortedByFrequency('tags.txt')
  # find max and min frequency
  ranges = getRanges(taglist)
  # write out results to output, tags are written out alphabetically
  # with size indicating the relative frequency of their occurrence
  writeCloud(taglist, ranges, 'tags.html')

def getTagListSortedByFrequency(inputfile):
  taglist = []
  with open(inputfile, 'r') as inputf:
    for line in inputf:
      # strip the newline; skip blank lines instead of stopping at them
      line = line.strip()
      if not line:
        continue
      (tag, count) = line.split("|")
      taglist.append((tag, int(count)))
  # sort taglist by count (ascending)
  taglist.sort(key=lambda t: t[1])
  return taglist

def getRanges(taglist):
  mincount = taglist[0][1]
  maxcount = taglist[len(taglist) - 1][1]
  distrib = (maxcount - mincount) // 4  # integer division; 0 if max - min < 4
  index = mincount
  ranges = []
  while (index <= maxcount):
    bounds = (index, index + distrib)  # named bounds to avoid shadowing range()
    index = index + distrib
    ranges.append(bounds)
  return ranges

def writeCloud(taglist, ranges, outputfile):
  outputf = open(outputfile, 'w')
  outputf.write("<style type=\"text/css\">\n")
  outputf.write(".smallestTag {font-size: xx-small;}\n")
  outputf.write(".smallTag {font-size: small;}\n")
  outputf.write(".mediumTag {font-size: medium;}\n")
  outputf.write(".largeTag {font-size: large;}\n")
  outputf.write(".largestTag {font-size: xx-large;}\n")
  outputf.write("</style>\n")
  rangeStyle = ["smallestTag", "smallTag", "mediumTag", "largeTag", "largestTag"]
  # re-sort the tags alphabetically
  taglist.sort(key=lambda t: t[0])
  for tag in taglist:
    # link each tag to a site-restricted Google search
    url = "http://www.google.com/search?q=" + tag[0].replace(' ', '+') + "+site%3Asujitpal.blogspot.com"
    rangeIndex = 0
    # note: assumes ranges has no more entries than rangeStyle
    for bounds in ranges:
      # emit the tag with the style of the first range containing its count
      if (tag[1] >= bounds[0] and tag[1] <= bounds[1]):
        outputf.write("<span class=\"" + rangeStyle[rangeIndex] + "\"><a href=\"" + url + "\">" + tag[0] + "</a></span> ")
        break
      rangeIndex = rangeIndex + 1
  outputf.close()

if __name__ == "__main__":
  main()
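
The range-building step is the fiddly part of the script. With integer division, a spread between the smallest and largest count that is not a multiple of 4 can yield more ranges than there are styles, and a spread under 4 gives a step of 0, so the loop never ends. One way to sidestep both, offered here as a sketch rather than as what the script above does, is to scale each count directly to a style index. The name style_index and the linear scaling are my assumptions; Pete Freitag's article may bucket differently.

```python
STYLES = ["smallestTag", "smallTag", "mediumTag", "largeTag", "largestTag"]

def style_index(count, mincount, maxcount, buckets=len(STYLES)):
    """Map a count to an index in 0..buckets-1 by linear scaling."""
    if maxcount == mincount:
        return buckets // 2  # all counts equal: use the middle size
    scaled = (count - mincount) * buckets // (maxcount - mincount)
    # the largest count scales to `buckets`, so clamp it to the last index
    return min(scaled, buckets - 1)

# Made-up sample data for illustration.
counts = [("python", 12), ("lucene", 7), ("web services", 3)]
lo = min(c for _, c in counts)
hi = max(c for _, c in counts)
for tag, c in counts:
    print(tag, STYLES[style_index(c, lo, hi)])
```

Because the index is computed per tag, there is no list of ranges to pad or trim when the number of styles changes.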

Here is the tag cloud for my blog. I was tempted to dump the HTML for it, but realized that it would contain a huge number of links, which would probably not be very useful. I really just want to show what it looks like:

It's possible that Tag Clouds are no longer cool among the web cognoscenti, but I am guessing that there are still quite a few people like me who like them for their information content. It would be simple enough for me to take the above cloud and drop it into my blog template, but I don't want to have to regenerate it each time I add a post. If Blogger provided a Tag Cloud widget that regenerated each time the tags were updated, I am sure it would be very well received (hint, hint, nudge, nudge). Tag Clouds also encourage readers to click on links they had not originally intended to visit when they came to a page, so I am guessing it would increase the page turns on Blogger too.

11 comments (moderated to prevent spam):

Anonymous said...

This is a great script, but if the difference between maxcount and mincount (line 30) is less than 4, it ends up in an infinite loop since distrib rounds to 0.

I added line 31+32:

if (distrib == 0):
  distrib = 1

which fixes the problem.

Sujit Pal said...

Hi Derick,

Thanks very much for the fix. BTW, I liked the comment in your blog about a "real python programmer" :-).

-sujit

KC Leong said...

This tagcloud algorithm works great, but the ranges get messed up for some min/max combinations. For example, with min=1 and max=7 the ranges list has 7 entries (too many).

The ranges list must not be longer than 5 or writeCloud goes wrong. I've added the following adjustments so the ranges are correct.

distrib = (maxcount - mincount) / float(parts-1);
distrib = int(round(distrib))

This one is more exact, fixes the case: min=1, max=7.

while (index < maxcount):

The original one adds +1 to max count.

if len(ranges) == 1:
  ranges = [(-1, -1), (-1, -1), ranges[0], (-1, -1), (-1, -1)]
elif len(ranges) == 2:
  ranges = [(-1, -1), (-1, -1), ranges[0], ranges[1], (-1, -1)]
elif len(ranges) == 3:
  ranges = [(-1, -1), ranges[0], ranges[1], ranges[2], (-1, -1)]
elif len(ranges) == 4:
  ranges = [(-1, -1), ranges[0], ranges[1], ranges[2], ranges[3]]

This fills out the blanks if ranges list is smaller than 5.

Anonymous said...

This looks really cool, but I cannot get it going. Could you please let me know what format tags.txt should be in?
Is it a list of tuples, etc.?

Sujit Pal said...

@KC Leong: Thank you for the fix...my initial setup was to break them up into 5 discrete sets and size them accordingly, that design decision must have leaked through here :-).

@Anonymous: tags.txt is a plain text file with one pipe-separated tagname|count pair per line. You should incorporate Derick's and KC Leong's changes too.

Jumand said...

An improvement (I think) to KC Leong's fix for filling in blank ranges; it handles situations where you may want more/less than 5 ranges.

if len(ranges) < parts:
  missing = parts - len(ranges)
  front = int(round(missing / 2.0))
  end = int(missing - front)
  for i in range(0,front):
    ranges.insert(0,(-1,-1))
  if end > 0:
    for i in range(0,end):
      ranges.append((-1,-1))

Sujit Pal said...

Thanks AJ...I don't have a need for this right now, but this is good to know if I rebuild the code.

Anonymous said...

Can you post the final code with the above mentioned changes? Thanks

Sujit Pal said...

Yes, I will, as soon as I have some time...

Uday said...

Thanks Sujit, your code was very helpful.

Sujit Pal said...

Thanks Uday, and you are welcome.