Couple of weeks ago, I mentioned how nice it would be if I had a program that suggested "tags" (called Labels in Blogger) for my blog posts, using my existing tags and the content of the post as input. The motivation was to clean up my rather defocused, but nevertheless useful Tag Cloud on the left rail.
I initially thought of rolling up the tags into a set of 10 or so categories, but I like the ability to find an obscure post by clicking on the particular technology, something I would miss with this approach. So the solution I came up with was to keep both, but roll up the obscure tags into larger category tags. In addition, because I often refer to the same thing in slightly different ways, synonymy had to be considered, both for the base tag as well as when rolling it up into one or more categories. So my "dictionary" (full file available here) of tags looks something like this.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 | # Source: src/main/resources/blog_dict.txt
# Format:
# Label:Synonyms:Categories
acegi::java,security
actor:actors:concurrent-programming,patterns
actorfoundry::java,concurrent-programming,patterns,actor
ajax::javascript,web-development
algorithms::
annotations::java,programming
ant::java,programming
apache-cxf::java,webservices,soap
apache-solr::java,search,lucene
atom::xml,webservices
...
|
The fields are separated by the colon character. Label is the only field that is required, synonyms and categories are optional. The original list of labels (the first column) was built using a download of my blogs from Blogger and a few simple Unix commands. The second and third columns are built by hand, and are of course, rather subjective and context-dependent.
The usage of this tool depends a lot on the peculiarities of my blogging habits. Because I write my blog offline and copy-paste it into the Blogger editor tool for upload, I have access to the original text to pass into the script. The script parses the blog into an inverted index of words to positions, then loops through the labels and their synonyms in the dictionary to find the occurrences of these words or phrases. The output is a list of labels sorted in descending order by their frequency.
As before, the logic for parsing the body of the blog post into an inverted index depends on some home-grown markup that I use when writing my blogs. For one, I like to decide the title beforehand, and put it at the first line of my post in HTML <title> tags. I also put code and data inside <pre> tags. So the parsing logic uses the <title> tag marker to count any dictionary phrases TITLE_BOOST times, based on the assumption that the title describes the content quite well. Further, the parser skips (usually multi-line) content between the <pre> tags.
The script is written in Python and initially creates two data structures, the Dictionary and the InvertedIndex. The Dictionary provides methods to return the list of labels, and the synonyms and categories for a given label. The inverted index provides a method to return the occurrence count for a given word or phrase. Here is the script:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 | #!/usr/bin/python
import sys
import getopt
import os.path
import re
### Configuration ###
# path to dict file
DEFAULT_DICT_PATH = "/home/sujit/bin/blog_dict.txt"
TITLE_BOOST = 5 # score boost for title occurrences
COUNT_CUTOFF = 4 # min count to report a term
### Configuration ###
class InvertedIndex:
"""
Class to convert the input file into an inverted index of words
and their corresponding word positions. The class uses this data
structure to return a count of occurrences of single or multi-
word phrases in the document.
"""
def __init__(self, inpath):
self.index = {}
self.wordSplitter = re.compile(r'[-_]')
infile = open(inpath, 'r')
regex = re.compile(r'[;,\.-_\" ]|<[^>].*>')
pos = 0
ignorable = False # words are not counted while this is True
while (True):
line = infile.readline()
if (not line):
break
# code blocks start with <pre> and end with </pre>,
# which we ignore.
if (line.startswith("<pre>")):
ignorable = True
if (ignorable and line.startswith("</pre>")):
ignorable = False
if (ignorable):
continue
line = line[:-1].lower()
words = regex.split(line)
# make sure phrase matches spanning paragraphs are not considered
pos = pos + 1
boost = 1
if line.startswith("<title>") and
line.endswith("</title>"):
# word matches in title are scored TITLE_BOOST times normal
# this is done by considering the title line TITLE_BOOST
# times, that way, neighboring words are still neighboring
# words, but they just "occur" TITLE_BOOST times in the title.
boost = TITLE_BOOST
for i in range(0, boost):
for word in words:
try:
positions = self.index[word]
except KeyError:
positions = set()
pos = pos + 1
positions.add(pos)
self.index[word] = positions
infile.close()
def getCount(self, term):
words = self.wordSplitter.split(term.lower())
prevPositions = set()
newPositions = set()
for word in words:
try:
positions = self.index[word]
if (len(prevPositions) == 0):
# the previous positions set is empty, which could only
# happen if this is the first word in our term. So there
# is nothing to filter against, so the current positions
# becomes the basis for the filtering. If it exits at this
# point, because this is a single word phrase, then this
# will report success or failure based on the length of
# positions collection.
prevPositions.update(positions)
else:
# we have a previous position set, so we compare each
# entry in our current positions to see if it is 1 ahead
# of an entry in our previous positions, The intersection
# will form the basis of the previous positions for the
# next word in the phrase.
newPositions = positions.intersection(
map(lambda x: x+1, prevPositions))
prevPositions.clear()
prevPositions.update(newPositions)
newPositions.clear()
except KeyError:
return 0
return len(prevPositions)
class Dictionary():
"""
Class to encapsulate the creation of a data structure from the
dictionary file. Format of the dictionary file is as follows:
label:label_synonyms:label_categories
where:
label = the word or phrase that represents a label. If the label
is a multi-word label, it should be hyphenated.
label_synonyms = a comma-separated list of synonyms for the label.
For example, webservice may be used instead of web-service.
As in the label field, multi-word synonyms should be
hyphenated.
label_categories = a comma-separated list of categories the label
should roll up to. For example, cx_oracle could roll up to
databases, oracle, python and scripting.
"""
def __init__(self, dictpath):
self.cats = {}
self.syns = {}
dictfile = open(dictpath, 'r')
while (True):
line = dictfile.readline()
if (not line):
break
if (line.startswith("#")):
# comment line, skip
continue
# strip out the line terminator, lowercase and replace all
# whitespace with hyphen.
line = line[:-1].lower().replace(" ", "-")
(label, synonyms, categories) = line.split(":")
# empty comma-separated synonyms or categories are stored in the
# cats and syns dictionaries as a empty element, the filter
# removes it so an empty string maps to an empty list
self.cats[label] = filter(
lambda x: len(x.strip()) > 0, categories.split(","))
self.syns[label] = filter(
lambda x: len(x.strip()) > 0, synonyms.split(","))
dictfile.close
def labels(self):
"""
Return the list of all labels in the dictionary.
@return the list of all labels.
"""
return self.cats.keys()
def synonyms(self, label):
"""
Return the synonyms for the specified label. If no synonyms
exist, returns an empty List.
@param label the label to look up in the dictionary.
@return a List of synonyms mapped to the label
"""
try:
return self.syns[label]
except KeyError:
return []
def categories(self, label):
"""
Return the categories for the specified label, If no categories
exist, returns an empty List.
@param label the label to look up in the dictionary.
@return a List of categories mapped to the label.
"""
try:
return self.cats[label]
except KeyError:
return []
def usage(message=""):
if (len(message) > 0):
print "Error: %s" % (message)
print "Usage: %s --help|([--dict=dict_file] --file=input_file)" %
(sys.argv[0])
print "-d|--dict: full path to dictionary file"
print "-f|--file: full path to input file to be analyzed"
print "-i|--interactive: prompt for each label and create a label string"
print "-h|--help: print this information"
sys.exit(2)
def validate():
"""
Parses the command line parameters and extracts the relevant info
from it. Returns a triple of (dictpath, filepath, interactive).
@return a triple extracted from the command line parameters.
"""
(opts, args) = getopt.getopt(sys.argv[1:], "d:f:ih",
["dict=", "file=", "interactive", "help"])
dictpath = None
filepath = None
interactive = False
for option, argval in opts:
if (option in ("-h", "--help")):
usage()
if (option in ("-d", "--dict")):
dictpath = argval
if (not os.path.exists(dictpath)):
usage("Dictionary File [%s] does not exist" % (dictpath))
if (option in ("-f", "--file")):
filepath = argval
if (not os.path.exists(filepath)):
usage("Input File [%s] does not exist" % filepath)
if (option in ("-i", "--interactive")):
interactive = True
if (dictpath == None):
dictpath = DEFAULT_DICT_PATH
if (filepath == None):
usage("Input file must be specified")
return (dictpath, filepath, interactive)
def addCount(countmap, term, count):
"""
Adds the count to the term count mapping. If no term count mapping
exists, then one is created and the count updated.
@param countmap the map of term to term counts to update.
@param term the term to update for.
@param count the term count to update.
"""
try:
origCount = countmap[term]
except KeyError:
origCount = 0
countmap[term] = origCount + count
def main():
(dictpath, inpath, interactive) = validate()
dictionary = Dictionary(dictpath)
invertedIndex = InvertedIndex(inpath)
occurrences = {}
# grab the basic occurrence counts for the labels and its synonyms
for term in dictionary.labels():
termCount = invertedIndex.getCount(term)
addCount(occurrences, term, termCount)
for synonym in dictionary.synonyms(term):
# if synonyms exist, also look for occurrences of the synonyms and
# add it to the occurrence count for the label
addCount(occurrences, term, invertedIndex.getCount(synonym))
for category in dictionary.categories(term):
# add the updated term count (for base label and all its synonyms)
# to the category occurrences.
addCount(occurrences, category, occurrences[term])
terms = occurrences.keys()
# filter out terms whose counts are below our cutoff
terms = filter(lambda x: occurrences[x] > COUNT_CUTOFF, terms)
# sort the remaining (filtered) terms by their count descending
terms.sort(lambda x, y: occurrences[y] - occurrences[x])
if (interactive):
labels = []
for term in terms:
yorn = raw_input("%s (%d) - include[Y/n]? " %
(term, occurrences[term]))
if (yorn == 'n' or yorn == 'N'):
continue
labels.append(term)
print "Labels: %s" % (",".join(labels))
else:
for term in terms:
print "%s (%d)" % (term, occurrences[term])
print "Labels: %s" % (",".join(terms))
if __name__ == "__main__":
main()
|
The choice of the default dictionary file and the title boost and minimum word occurrence score cutoffs is dependent on a particular setup, so they are in the Configuration block at the beginning of the script. The code is pretty heavily commented, so I won't bore you trying to describe it here. The interactive mode returns the labels in descending order one by one, and you can opt to keep it or not keep it, based on your judgement.
I ran the script on this post and here is what I got. To the suggested labels, I also added data-structure and algorithms, since apart from the user defined InvertedIndex and Dictionary structures, this also has the very useful Python set structure.
1 2 3 4 | sujit@sirocco:$ ./tagsuggest.py --file=/path/to/my/post.txt
python (6)
scripting (6)
Labels: python,scripting
|
Overall, I think this script is going to be pretty useful for me. The quality of its suggestions is dependent on the content of the dictionary, and is not 100% accurate. Fortunately, since it is interactive, it does not have to be - it just has to be better than my own memory, so it "knows" about tags that I have used but no longer remember.
Be the first to comment. Comments are moderated to prevent spam.
Post a Comment