In a previous post I described how I used Scrapy to download a set of 2000+ (anonymized) clinical notes, then converted them to a JSON format, one document per file, as shown below:
{
  "category": "Neurosurgery",
  "description": "Donec vitae sapien ut libero venenatis faucibus. Nullam quis
    ante. Etiam sit amet orci eget eros faucibus tincidunt.",
  "title": "Donec vitae sapien ut libero",
  "text": "Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean
    commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et
    magnis dis parturient montes, nascetur ridiculus mus. Donec quam felis,
    ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa
    quis enim. Donec pede justo, fringilla vel, aliquet nec, vulputate eget,
    arcu. In enim justo, rhoncus ut, imperdiet a, venenatis vitae, justo.",
  "sample": "Wound Care",
  "keywords": [
    "dor fundoplication",
    "lysis of adhesions",
    "red rubber catheter"
  ]
}
In this post, I use this data with KEA, a keyword extraction algorithm from the University of Waikato (the same folks who gave us Weka), to extract keywords from the text. KEA can work with or without an additional Controlled Vocabulary (CV) - I used the approach without a CV, also known as "free indexing".
In this approach, KEA expects its data organized into a directory structure like the one shown below. The training set is a set of .txt files containing the document text, each paired with a .key file containing the manually assigned keywords, one keyword per line. The test set is the set of .txt files containing the bodies of the documents for which keywords are to be generated. KEA writes the extracted keywords into .key files in the test directory. In order to measure performance, I also keep the original keywords for the test documents in a test/keys directory.
.
+-- test
|   |-- 0000.txt
|   |-- 0001.txt
|   |-- ...
|   +-- keys
|       |-- 0000.key
|       |-- 0001.key
|       +-- ...
+-- train
    |-- 0002.key
    |-- 0002.txt
    |-- 0010.key
    |-- 0010.txt
    +-- ...
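For example, the .key file for the document shown at the top of this post would contain its three keywords, one per line:

dor fundoplication
lysis of adhesions
red rubber catheter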
Here is some simple Python code to transform the JSON files into the structure KEA expects. I initially used a 75/25 train/test split, but later reran with different splits; the split is controlled by the threshold (0.75 here) used to set the train variable.
# Source: kea_preprocess.py
import json
import os
import random
import shutil

JSONS_DIR = "/path/to/jsons"
KEA_TRAIN_DIR = "/path/to/kea/train/dir"
KEA_TEST_DIR = "/path/to/kea/test/dir"

# start from a clean slate
shutil.rmtree(KEA_TRAIN_DIR, ignore_errors=True)
shutil.rmtree(KEA_TEST_DIR, ignore_errors=True)
os.mkdir(KEA_TRAIN_DIR)
os.mkdir(KEA_TEST_DIR)
os.mkdir(os.path.join(KEA_TEST_DIR, "keys"))

for filename in os.listdir(JSONS_DIR):
    print "Converting %s..." % (filename)
    fjson = open(os.path.join(JSONS_DIR, filename), 'rb')
    data = json.load(fjson)
    fjson.close()
    basename = os.path.splitext(filename)[0]
    # randomly assign the document to train (75%) or test (25%)
    train = random.uniform(0, 1) < 0.75
    # write the document text
    txtdir = KEA_TRAIN_DIR if train else KEA_TEST_DIR
    ftxt = open(os.path.join(txtdir, basename + ".txt"), 'wb')
    ftxt.write(data["text"].encode("utf-8"))
    ftxt.close()
    # write keywords: alongside the text for training documents,
    # into test/keys for test documents (so KEA cannot see them)
    keydir = KEA_TRAIN_DIR if train else os.path.join(KEA_TEST_DIR, "keys")
    fkey = open(os.path.join(keydir, basename + ".key"), 'wb')
    for keyword in data["keywords"]:
        fkey.write("%s\n" % (keyword.encode("utf-8")))
    fkey.close()
KEA's API is pretty simple. To train KEA, instantiate a KEAModelBuilder, set its parameters, then build and save the model. To extract keywords, instantiate a KEAKeyphraseExtractor, set its parameters, and extract. Here is some Scala code that does just this, modeled on the TestKea.java program supplied with the KEA source download. We train a KEA model on the contents of the train directory, then use the model to extract keywords into the test directory. The algorithm is also quite fast - training and extraction on my dataset of 2000+ documents typically took 30-40s. The only thing I had to hack was the location of the English stopwords file: KEA expects to find it at data/stopwords relative to the working directory, as in the source distribution, so I just created a temporary symlink to make it work.
// Source: src/main/scala/com/mycompany/scalcium/keyextract/KeaClient.scala
package com.mycompany.scalcium.keyextract

import kea.main.KEAModelBuilder
import kea.main.KEAKeyphraseExtractor
import kea.stemmers.PorterStemmer
import kea.stopwords.StopwordsEnglish

object KeaClient extends App {
  val trainDir = "/path/to/kea/train/dir"
  val testDir = "/path/to/kea/test/dir"
  val modelFile = "/path/to/kea/model"

  val kc = new KeaClient()
  kc.train(trainDir, modelFile)
  kc.test(modelFile, testDir)
}

class KeaClient {

  def train(trainDir: String, modelFilePath: String): Unit = {
    val modelBuilder = new KEAModelBuilder()
    modelBuilder.setDirName(trainDir)
    modelBuilder.setModelName(modelFilePath)
    modelBuilder.setVocabulary("none")   // free indexing, no controlled vocabulary
    modelBuilder.setEncoding("UTF-8")
    modelBuilder.setDocumentLanguage("en")
    modelBuilder.setStemmer(new PorterStemmer())
    modelBuilder.setStopwords(new StopwordsEnglish())
    modelBuilder.setMaxPhraseLength(5)
    modelBuilder.setMinPhraseLength(1)
    modelBuilder.setMinNumOccur(2)
    modelBuilder.buildModel(modelBuilder.collectStems())
    modelBuilder.saveModel()
  }

  def test(modelFilePath: String, testDir: String): Unit = {
    val keyExtractor = new KEAKeyphraseExtractor()
    keyExtractor.setDirName(testDir)
    keyExtractor.setModelName(modelFilePath)
    keyExtractor.setVocabulary("none")
    keyExtractor.setEncoding("UTF-8")
    keyExtractor.setDocumentLanguage("en")
    keyExtractor.setStemmer(new PorterStemmer())
    keyExtractor.setNumPhrases(10)       // generate at most 10 keywords per document
    keyExtractor.setBuildGlobal(true)
    keyExtractor.loadModel()
    keyExtractor.extractKeyphrases(keyExtractor.collectStems())
  }
}
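The stopwords symlink mentioned above is a one-time setup step. A minimal sketch of it is below; the KEA path is a placeholder for wherever you unpacked the distribution:

# one-time hack: expose KEA's data/stopwords under the working directory
import os
if not os.path.exists("data"):
    os.symlink("/path/to/kea/data", "data")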
In order to measure KEA's accuracy on my dataset, I calculated the similarity of the KEA-generated keywords against the originally assigned keywords in the test set. Since I tell KEA to generate at most 10 keywords (and the original documents have anywhere up to 40 keywords), my similarity is simply the number of KEA keywords that appear in the original set, divided by 10.
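As a quick worked example (with made-up keyword sets, not data from the corpus):

expected = set(["wound care", "lysis of adhesions", "red rubber catheter"])
predicted = set(["wound care", "lysis of adhesions", "sutures", "debridement",
                 "dressing", "catheter", "incision", "drainage", "anesthesia",
                 "closure"])  # KEA's 10 predictions
print len(expected.intersection(predicted)) / 10.0  # 2 matches => 0.2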
The distribution of matches across the corpus is shown below on the left. I then performed the same exercise with a 10/90 train/test split. The reasoning is that in most cases I won't have the luxury of manually assigned keywords to train KEA, so a small sample would have to be tagged by hand, and I wanted to see if there was any significant difference in accuracy between a 75/25 split and a 10/90 split. As you can see from the distribution on the right, it seems largely unchanged - the stronger tendency towards a normal shape can simply be attributed to the larger test sample in the 10/90 case. The code to generate these distributions is shown below:
# Source: kea_measure.py
from __future__ import division
import os
import matplotlib.pyplot as plt
import numpy as np

EXPECTED_DIR = "/path/to/kea/test/keys"
PREDICTED_DIR = "/path/to/kea/test"

def proportion_matched(set1, set2):
    # KEA generates at most 10 keywords per document
    return len(set1.intersection(set2)) / 10

fout = open("/tmp/kea_stats.csv", 'wb')
for filename in os.listdir(EXPECTED_DIR):
    fexp = open(os.path.join(EXPECTED_DIR, filename), 'rb')
    expected_keywords = set([x.strip().lower() for x in fexp.readlines()])
    fexp.close()
    fpred = open(os.path.join(PREDICTED_DIR, filename), 'rb')
    predicted_keywords = set([x.strip().lower() for x in fpred.readlines()])
    fpred.close()
    sim = proportion_matched(expected_keywords, predicted_keywords)
    fout.write("%6.4f\n" % (sim))
fout.close()

# draw histogram
data = np.loadtxt("/tmp/kea_stats.csv")
plt.hist(data)
plt.show()

# mean and the mean +/- 1.96*sd band (covers ~95% of scores if normal)
mean = np.mean(data)
std = np.std(data)
# Header: MEAN, CF_DN, CF_UP
print "%6.4f\t%6.4f\t%6.4f" % (mean, mean - 1.96 * std, mean + 1.96 * std)
In order to find a possible sweet spot in the split size where KEA performs optimally, I varied the split between 10/90 and 90/10 and computed the mean and the 95% band at each step. I then charted this aggregated data as follows:
# Source: kea_graph.py
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("kea_means.csv", delimiter="\t")
print data.head()
data.plot()
plt.show()
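The kea_means.csv file itself came from rerunning the whole preprocess/train/extract/measure pipeline at each split size. The scripts above are not parameterized for this, so the driver below is purely a sketch: the command lines, the split values, the assumption that kea_preprocess.py accepts the train fraction as an argument, and the assumption that kea_measure.py prints its MEAN/CF_DN/CF_UP line without blocking on the histogram are all hypothetical wiring, not part of the original scripts.

# Source: kea_sweep.py (hypothetical driver, see caveats above)
import subprocess

fout = open("kea_means.csv", "wb")
fout.write("MEAN\tCF_DN\tCF_UP\n")
for split in ["0.1", "0.25", "0.5", "0.75", "0.9"]:
    # regenerate the train/test directories at this split size
    subprocess.check_call(["python", "kea_preprocess.py", split])
    # retrain KEA and extract keywords into the test directory
    subprocess.check_call(["sbt",
        "run-main com.mycompany.scalcium.keyextract.KeaClient"])
    # capture the single stats line printed by kea_measure.py
    fout.write(subprocess.check_output(["python", "kea_measure.py"]))
fout.close()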
As you can see from the chart below, KEA's performance seems quite robust against training size - the mean similarity (blue line) stays between 0.3 and 0.4 regardless of the split. The upper and lower bounds of the 95% band are similarly stable.
I eyeballed a few .key files in order to do a sanity check, and the generated keywords appeared surprisingly good. I then built tag clouds of the original and generated keywords in my 90% test set for comparison.
# Source: kea_tagcloud.py
import os
import wordcloud

ORIG_KEYDIR = "/path/to/kea/test/keys"
GEN_KEYDIR = "/path/to/kea/test"
OUTPUT_DIR = "/path/to/kea/output/graphs"

def get_keyword_freqs(dirname):
    # count how often each keyword occurs across the .key files
    freqs = dict()
    for keyfile in [f for f in os.listdir(dirname) if f.endswith(".key")]:
        fkey = open(os.path.join(dirname, keyfile), 'rb')
        for line in fkey:
            keyword = line.strip().lower()
            freqs[keyword] = freqs.get(keyword, 0) + 1
        fkey.close()
    # drop keywords that occur only once
    return [(key, value) for key, value in freqs.items() if value >= 2]

tag_counts = get_keyword_freqs(ORIG_KEYDIR)
elements = wordcloud.fit_words(tag_counts)
wordcloud.draw(elements, os.path.join(OUTPUT_DIR, "kea_origtags.png"))

tag_counts = get_keyword_freqs(GEN_KEYDIR)
elements = wordcloud.fit_words(tag_counts)
wordcloud.draw(elements, os.path.join(OUTPUT_DIR, "kea_gentags.png"))
I used Andreas Mueller's wordcloud project, described in his blog post, to generate the tag clouds. I tried pytagcloud initially, but couldn't make it work because of dependency issues. The tag clouds for the original and generated keywords are shown below, left and right respectively. As you can see, while the right-hand cloud is not identical to the left-hand one, it has quite a few good keywords in it.
I was curious about the KEA algorithm, so I went in and peeked at the code - it decomposes the input text into n-grams, then computes features (TFxIDF and the relative position of first occurrence) on each n-gram. It then uses the manually assigned keywords to label these n-grams as keyphrase or not, and trains Weka's Naive Bayes classifier on them. The classifier is then run against unseen documents to predict keyphrases, which are further filtered by probability to produce the final keyword recommendations. The KEA algorithm is described in depth in this paper (PDF); a less formal description is available on KEA's description page. A terser KEA HOWTO than this post can be found here.
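To make that concrete, here is a rough Python sketch of the candidate and feature computation. This is not KEA's actual code (KEA works on stemmed, stopword-filtered phrases in Java); it just illustrates the two classic KEA features:

import math

def candidate_phrases(tokens, max_len=5):
    # every contiguous run of up to max_len words is a candidate
    for i in range(len(tokens)):
        for n in range(1, max_len + 1):
            if i + n <= len(tokens):
                yield " ".join(tokens[i:i + n])

def kea_features(phrase, tokens, doc_freq, num_docs):
    text = " ".join(tokens)
    # feature 1: TF x IDF of the phrase
    tf = text.count(phrase) / float(len(tokens))
    idf = math.log(float(num_docs) / (doc_freq.get(phrase, 0) + 1))
    # feature 2: relative position of first occurrence
    first_occ = text.find(phrase) / float(len(text))
    return (tf * idf, first_occ)

Each candidate's feature vector is labeled positive if the phrase matches one of the manually assigned keywords, and the labeled vectors are what train the Naive Bayes model.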
I hope you found this post useful. KEA seems to be a handy tool for keyword extraction when you have (or can build) a set of keyword-tagged documents for training.
Very good post, thanks. I wonder if you ever had a look or played with MAUI, which is supposed to be an evolution of KEA? Thanks, Enzo
Thanks Enzo. I have looked at Maui, but for some reason I never used it. Can't remember why exactly, but one reason could be that Maui is a supervised learning approach (you need a training set) while KEA was unsupervised.