The small script I describe here came about as a result of a suggestion from a colleague. Some time ago I had built a Lucene analyzer that converted British spellings to American spellings (for example, "colour" to "color"), based on a set of prefix, suffix and infix regular expressions. Unfortunately, the same regex that converts "colour" correctly also converts "four" to "for". Since our search is backed by a taxonomy, we can treat the synonyms defined in it as a controlled vocabulary, so my colleague suggested running the transformer against all the words in the (possibly multi-word) synonyms, and then, for the words matching a regex, checking against a dictionary that the original and transformed words mean the same thing.
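The real transformer lives inside the (Java) Lucene analyzer, but a minimal Python sketch of the kind of suffix rule involved shows the problem; the britishToAmerican name and the single rule here are made up for illustration:

import re

# one hypothetical suffix rule: rewrite a British "-our" ending to "-or"
OUR_SUFFIX = re.compile(r"our$")

def britishToAmerican(word):
    return OUR_SUFFIX.sub("or", word)

print britishToAmerican("colour")   # color - correct
print britishToAmerican("four")     # for - a false positive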
When I built the Lucene analyzer, I was not as handy with NLTK as I am now, so this time around I almost immediately thought of NLTK's Wordnet interface. The idea is to pass the two words to Wordnet. Each word can map to one or more synsets (depending on its senses and parts of speech (POS)). We can conclude that the two are the same word if some pair of their synsets has a path_similarity of 1 (path_similarity varies from 0 to 1, 1 being identical). Here is the code:
from __future__ import division

from nltk.corpus import wordnet as wn
import sys

def similarity(w1, w2, sim=wn.path_similarity):
    """Return the maximum similarity between any synset pair for the two words."""
    synsets1 = wn.synsets(w1)
    synsets2 = wn.synsets(w2)
    sim_scores = []
    for synset1 in synsets1:
        for synset2 in synsets2:
            score = sim(synset1, synset2)
            # path_similarity returns None for pairs with no connecting
            # path (for example, across parts of speech); skip those
            if score is not None:
                sim_scores.append(score)
    if len(sim_scores) == 0:
        return 0
    else:
        return max(sim_scores)

def main():
    # input file: one tab-separated (original, transformed) pair per line
    f = open(sys.argv[1], 'rb')
    for line in f:
        (word1, word2) = line.strip().split("\t")
        if similarity(word1, word2) != 1.0:
            print word1
    f.close()

if __name__ == "__main__":
    main()
The similarity() function computes the similarity as the maximum path_similarity over all pairs of synsets for the two words. The main() function just reads a file of tab-separated word pairs and prints the words that don't convert to an equivalent word. For example, the following input file:
favour	favor
favourite	favorite
four	for
colour	color
results in "four" being printed to the console, indicating that "four" should be treated as an exclusion for the analyzer.
I was also using the analyzer to normalize some Greek and Latin plurals to their corresponding singulars. These words are quite common in (English) medical text, and normal stemming does not handle them correctly, so I used a set of (suffix) rules to do the conversion in the same analyzer. As with the British-to-American words, there are exceptions here too, which can be handled using the same code. It turns out that the plural and singular words map to the same node in Wordnet, so the task is even simpler.
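To give a flavor of the rules, here is another made-up sketch (not the actual analyzer code; note that rule order matters, since more than one suffix can match):

import re

# hypothetical suffix rules for Latin/Greek plurals; first match wins
PLURAL_RULES = [
    (re.compile(r"i$"), "us"),     # humeri -> humerus
    (re.compile(r"ora$"), "ur"),   # femora -> femur
    (re.compile(r"a$"), "um"),     # media -> medium, but also insomnia -> insomnium (wrong!)
]

def singularize(word):
    for pattern, repl in PLURAL_RULES:
        if pattern.search(word):
            return pattern.sub(repl, word)
    return word

In any case, an input file like this: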
humeri	humerus
femora	femur
insomnia	insomnium
media	medium
results in "insomnia" being printed on the console, once more indicating that "insomnia" should be treated as an exclusion.
And that's all I have for today. It just goes to show how simple some problems can become when you have the right tools.