Friday, August 29, 2014

Topic Modeling with gensim


Over the past couple of weeks, I've been trying different ways to gain insight into my little corpus of 2000+ Patient Health Records (PHRs). Topic modeling is one way to do it, and I've been meaning to learn gensim, a Python library for topic modeling, so I decided to use gensim to do some topic modeling on this dataset. I have used Mahout for topic modeling before, but my data is quite small and doesn't need the complexity of Map-Reduce (besides, the objective is to learn gensim :-)). As a bonus, gensim also offers a wrapper for Mallet, another popular Java topic modeling library.

This post describes my experiment. Ultimately, the insights I got were not particularly interesting, but the exercise got me familiar with gensim. I hope you find it useful.

The source format of the dataset is a collection of JSON files. My first step is to pre-process the data so that the text portion is written out into a collection of text files. In hindsight this step seems redundant, since it could be merged with the next one, but keeping it separate makes things a bit clearer.

# Source: gensim_preprocess.py
import json
import os

JSONS_DIR = "/path/to/jsons/dir"
TEXTS_DIR = "/path/to/texts/dir"

for fn in os.listdir(JSONS_DIR):
    print "Converting JSON: %s" % (fn)
    fjson = open(os.path.join(JSONS_DIR, fn), 'rb')
    data = json.load(fjson)
    fjson.close()
    tfn = os.path.splitext(fn)[0] + ".txt"
    ftext = open(os.path.join(TEXTS_DIR, tfn), 'wb')
    ftext.write(data["text"].encode("utf-8"))
    ftext.close()

The next step is to convert the texts to a format that gensim can use - namely a Bag of Words (BoW) representation. Gensim expects to be fed a corpus data structure, basically a list of sparse vectors. The sparse vector elements consist of (id, score) pairs, where the id is a numeric ID that is mapped to the term via a dictionary. Gensim's author has taken pains to keep its memory requirements down by using a streaming approach to build corpora and dictionaries. The iter_docs() function below implements this streaming approach.
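
To make the (id, score) pairs concrete, here is a tiny toy example, separate from the PHR pipeline (the words and documents below are made up):

# Toy illustration of gensim's BoW representation (made-up mini-corpus)
import gensim

toy_docs = [["heart", "rate", "normal"], ["heart", "pain", "pain"]]
toy_dict = gensim.corpora.Dictionary(toy_docs)
print toy_dict.token2id                            # e.g. {"heart": 0, "normal": 1, ...}
print toy_dict.doc2bow(["heart", "pain", "pain"])  # e.g. [(0, 1), (3, 2)]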

In the code below, I read the text of each file, pass the words through gensim's tokenizer and filter out stopwords (from NLTK's English stopword list) via the custom MyCorpus class. These words are used to create a dictionary and a BoW corpus, both of which are serialized to files for use in the next step.

# Source: bow_model.py
import logging
import os
import nltk
import gensim

def iter_docs(topdir, stoplist):
    for fn in os.listdir(topdir):
        fin = open(os.path.join(topdir, fn), 'rb')
        text = fin.read()
        fin.close()
        yield (x for x in 
            gensim.utils.tokenize(text, lowercase=True, deacc=True, 
                                  errors="ignore")
            if x not in stoplist)

class MyCorpus(object):

    def __init__(self, topdir, stoplist):
        self.topdir = topdir
        self.stoplist = stoplist
        self.dictionary = gensim.corpora.Dictionary(iter_docs(topdir, stoplist))
        
    def __iter__(self):
        for tokens in iter_docs(self.topdir, self.stoplist):
            yield self.dictionary.doc2bow(tokens)


logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', 
                    level=logging.INFO)

TEXTS_DIR = "/path/to/texts/dir"
MODELS_DIR = "/path/to/models/dir"

stoplist = set(nltk.corpus.stopwords.words("english"))
corpus = MyCorpus(TEXTS_DIR, stoplist)

corpus.dictionary.save(os.path.join(MODELS_DIR, "mtsamples.dict"))
gensim.corpora.MmCorpus.serialize(os.path.join(MODELS_DIR, "mtsamples.mm"), 
                                  corpus)

I didn't know (and didn't have an opinion about) how many topics this corpus should yield, so I decided to compute this by reducing the features to two dimensions, then clustering the points for different values of K (the number of clusters) to find an optimum value. Gensim offers various transforms that allow us to project the vectors in a corpus to a different coordinate space. One such transform is the Latent Semantic Indexing (LSI) transform, which we use here to project the original data to 2D.

# Source: lsi_model.py
import logging
import os
import gensim

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', 
                    level=logging.INFO)

MODELS_DIR = "/path/to/models/dir"

dictionary = gensim.corpora.Dictionary.load(os.path.join(MODELS_DIR, 
                                            "mtsamples.dict"))
corpus = gensim.corpora.MmCorpus(os.path.join(MODELS_DIR, "mtsamples.mm"))

tfidf = gensim.models.TfidfModel(corpus, normalize=True)
corpus_tfidf = tfidf[corpus]

# project to 2 dimensions for visualization
lsi = gensim.models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)

# write out coordinates to file
fcoords = open(os.path.join(MODELS_DIR, "coords.csv"), 'wb')
for vector in lsi[corpus]:
    if len(vector) != 2:
        continue
    fcoords.write("%6.4f\t%6.4f\n" % (vector[0][1], vector[1][1]))
fcoords.close()

Next I clustered the points in the reduced 2D LSI space using KMeans, varying the number of clusters (K) from 1 to 10. The objective function used is the Inertia of the clustering, defined as the sum of squared distances of each point to its cluster centroid. This value is provided directly by Scikit-Learn's KMeans implementation. Other popular objective functions include Distortion (Inertia divided by the number of points) and the Percentage of Variance Explained, as described on this StackOverflow page.
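
As an aside, the Distortion and Percentage of Variance Explained variants mentioned above are easy to derive from the same inertia value; a minimal sketch (assuming the data matrix X and a fitted KMeans object kmeans as in the listing below):

# Derived metrics from a fitted KMeans model (sketch; X and kmeans assumed)
import numpy as np

distortion = kmeans.inertia_ / X.shape[0]       # inertia per point
total_ss = ((X - X.mean(axis=0)) ** 2).sum()    # total sum of squares
pct_var_explained = 1.0 - kmeans.inertia_ / total_ss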

# Source: num_topics.py
import os
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

MODELS_DIR = "/path/to/models/dir"
MAX_K = 10

X = np.loadtxt(os.path.join(MODELS_DIR, "coords.csv"), delimiter="\t")
ks = range(1, MAX_K + 1)

inertias = np.zeros(MAX_K)
diff = np.zeros(MAX_K)
diff2 = np.zeros(MAX_K)
diff3 = np.zeros(MAX_K)
for k in ks:
    kmeans = KMeans(k).fit(X)
    inertias[k - 1] = kmeans.inertia_
    # first difference    
    if k > 1:
        diff[k - 1] = inertias[k - 1] - inertias[k - 2]
    # second difference
    if k > 2:
        diff2[k - 1] = diff[k - 1] - diff[k - 2]
    # third difference
    if k > 3:
        diff3[k - 1] = diff2[k - 1] - diff2[k - 2]

elbow = np.argmin(diff3[3:]) + 3

plt.plot(ks, inertias, "b*-")
plt.plot(ks[elbow], inertias[elbow], marker='o', markersize=12,
         markeredgewidth=2, markeredgecolor='r', markerfacecolor=None)
plt.ylabel("Inertia")
plt.xlabel("K")
plt.show()

I plotted the Inertias for different values of K, then used Vincent Granville's approach of calculating the third differential to find an elbow point. The elbow point here occurs at K=5 and is marked with a red dot in the graph below.

I then re-ran the KMeans algorithm with K=5 and generated the clusters.

# Source: viz_topics_scatter.py
import os
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

MODELS_DIR = "/path/to/models/dir"
NUM_TOPICS = 5

X = np.loadtxt(os.path.join(MODELS_DIR, "coords.csv"), delimiter="\t")
kmeans = KMeans(NUM_TOPICS).fit(X)
y = kmeans.labels_

colors = ["b", "g", "r", "m", "c"]
for i in range(X.shape[0]):
    plt.scatter(X[i][0], X[i][1], c=colors[y[i]], s=10)    
plt.show()

which gives me clusters that look like this.


I then ran the full LDA transform against the BoW corpus, with the number of topics set to 5. As with LSI, I load up the corpus and dictionary from files, then apply the transform to project the documents into the LDA topic space. Notice that LDA and LSI are conceptually similar in gensim - both are transforms that map one vector space to another.

# Source: lda_model.py
import logging
import os
import gensim

MODELS_DIR = "/path/to/models/dir"
NUM_TOPICS = 5

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', 
                    level=logging.INFO)

dictionary = gensim.corpora.Dictionary.load(os.path.join(MODELS_DIR, 
                                            "mtsamples.dict"))
corpus = gensim.corpora.MmCorpus(os.path.join(MODELS_DIR, "mtsamples.mm"))

# Project to LDA space
lda = gensim.models.LdaModel(corpus, id2word=dictionary, num_topics=NUM_TOPICS)
lda.print_topics(NUM_TOPICS)
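
# (Hypothetical follow-up, not in the original script.) Like LSI, the trained
# LDA model acts as a transform, so a single document's topic mixture can be
# read off directly, e.g.:
#   doc_topics = lda[corpus[0]]   # list of (topic_id, probability) pairs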

The results are as follows. As you can see, each topic is made up of a mixture of terms. The top 10 terms from each topic are shown in the output below and cover between 69% and 80% of the probability space for each topic.

2014-08-27 20:59:04,850 : INFO : topic #0 (0.200): 0.015*patient + \
      0.014*right + 0.009*left + 0.009*procedure + 0.006*well + \
      0.005*placed + 0.005*history + 0.004*skin + 0.004*normal + \
      0.004*anesthesia
2014-08-27 20:59:04,852 : INFO : topic #1 (0.200): 0.015*patient + \
      0.009*placed + 0.008*right + 0.007*normal + 0.007*left + \
      0.006*procedure + 0.005*using + 0.004*anterior + 0.004*history + \
      0.004*skin
2014-08-27 20:59:04,854 : INFO : topic #2 (0.200): 0.012*left + \
      0.010*patient + 0.010*right + 0.007*history + 0.005*well + \
      0.005*normal + 0.004*pain + 0.004*also + 0.003*without + \
      0.003*disease
2014-08-27 20:59:04,856 : INFO : topic #3 (0.200): 0.019*patient + \
      0.011*left + 0.011*normal + 0.010*right + 0.006*well + \
      0.006*procedure + 0.005*pain + 0.004*placed + 0.004*history + \
      0.004*using
2014-08-27 20:59:04,858 : INFO : topic #4 (0.200): 0.022*patient + \
      0.013*history + 0.007*mg + 0.006*normal + 0.005*pain + \
      0.005*also + 0.005*left + 0.004*right + 0.004*year + 0.004*well

and here is the same information shown graphically as word clouds (generated using the code below).

import os
import wordcloud

MODELS_DIR = "models"

final_topics = open(os.path.join(MODELS_DIR, "final_topics.txt"), 'rb')
curr_topic = 0
for line in final_topics:
    line = line.strip()[line.rindex(":") + 2:]
    scores = [float(x.split("*")[0]) for x in line.split(" + ")]
    words = [x.split("*")[1] for x in line.split(" + ")]
    freqs = []
    for word, score in zip(words, scores):
        freqs.append((word, score))
    elements = wordcloud.fit_words(freqs, width=120, height=120)
    wordcloud.draw(elements, "gs_topic_%d.png" % (curr_topic),
                   width=120, height=120)
    curr_topic += 1
final_topics.close()

As I mentioned earlier, the only insight I get from the above is that the data seems to be predominantly about patients, history and procedures. But we already knew that :-).

Some ideas for improvement in the future. One could be to get really aggressive about removing noise words, going well beyond the NLTK English stopwords; that way we could remove terms such as "without" and "year" from the input, possibly resulting in a cleaner set of topics (a rough sketch of this idea is shown below). Another could be to check higher values of K, perhaps against a higher dimensional space, since we are computing the elbow analytically anyway.
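
For the first idea, the change is confined to the stoplist passed into MyCorpus; a minimal sketch (the extra noise words below are illustrative only, not a vetted list):

# Sketch: extend the NLTK English stoplist with domain noise words before
# building the corpus (the additional terms here are illustrative only)
import nltk

stoplist = set(nltk.corpus.stopwords.words("english"))
stoplist.update(["without", "year", "also", "well", "using"])
# corpus = MyCorpus(TEXTS_DIR, stoplist)   # then rebuild the BoW corpus as before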

Saturday, August 23, 2014

Keyword Extraction with KEA


In a previous post I described how I used Scrapy to download a set of 2000+ (anonymized) clinical notes, then converted them to a JSON format, one document per file, as shown below:

{
  category: "Neurosurgery",
  description: "Donec vitae sapien ut libero venenatis faucibus. Nullam quis 
    ante. Etiam sit amet orci eget eros faucibus tincidunt.",
  title: "Donec vitae sapien ut libero",
  text: "Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Aenean 
    commodo ligula eget dolor. Aenean massa. Cum sociis natoque penatibus et 
    magnis dis parturient montes, nascetur ridiculus mus. Donec quam felis, 
    ultricies nec, pellentesque eu, pretium quis, sem. Nulla consequat massa 
    quis enim. Donec pede justo, fringilla vel, aliquet nec, vulputate eget, 
    arcu. In enim justo, rhoncus ut, imperdiet a, venenatis vitae, justo.",
  sample: "Wound Care",
  keywords: [
    "dor fundoplication", 
    "lysis of adhesions", 
    "red rubber catheter"
  ]
}

In this post, I use this data with KEA, a keyword extraction algorithm from the University of Waikato (the same folks who gave us Weka), to extract keywords from the text. KEA can work with or without an additional Controlled Vocabulary (CV); I used the approach where no CV is used, also known as "free indexing".

In this approach, KEA expects its data to be organized in a directory structure similar to the one shown below. The training set is a set of .txt files containing the document text, each paired with a .key file containing the manually assigned keywords, one keyword per line. The test set is the set of .txt files containing the bodies of the documents for which keywords are to be generated; KEA writes the extracted keywords into .key files in the test directory. In order to measure performance, I also keep a test/keys directory with the original keywords.

.
+-- test
|   |-- 0000.txt
|   |-- 0001.txt
|   |-- ...
|   +-- keys
|       |-- 0000.key
|       |-- 0001.key
|       +-- ...
+-- train
    |-- 0002.key
    |-- 0002.txt
    |-- 0010.key
    |-- 0010.txt
    +-- ...

Here is some simple Python code to transform the JSON files into the structure KEA expects. I initially used a 75/25 train/test split, but later reran with different splits; the split is controlled by the threshold (0.75 here) used to build the train variable.

# Source: kea_preprocess.py
import json
import os
import random
import shutil

JSONS_DIR = "/path/to/jsons"
KEA_TRAIN_DIR = "/path/to/kea/train/dir"
KEA_TEST_DIR = "/path/to/kea/test/dir"

shutil.rmtree(KEA_TRAIN_DIR)
shutil.rmtree(KEA_TEST_DIR)
os.mkdir(KEA_TRAIN_DIR)
os.mkdir(KEA_TEST_DIR)
os.mkdir(os.path.join(KEA_TEST_DIR, "keys"))

for filename in os.listdir(JSONS_DIR):
    print "Converting %s..." % (filename)
    fjson = open(os.path.join(JSONS_DIR, filename), 'rb')
    data = json.load(fjson)
    fjson.close()
    basename = os.path.splitext(filename)[0]
    # do a 75/25 train/test split (controlled by the 0.75 threshold below)
    train = random.uniform(0, 1) < 0.75
    txtdir = KEA_TRAIN_DIR if train else KEA_TEST_DIR
    ftxt = open(os.path.join(txtdir, basename + ".txt"), 'wb')
    ftxt.write(data["text"].encode("utf-8"))
    ftxt.close()
    # write keywords
    keydir = KEA_TRAIN_DIR if train else os.path.join(KEA_TEST_DIR, "keys")    
    fkey = open(os.path.join(keydir, basename + ".key"), 'wb')     
    keywords = data["keywords"]
    for keyword in keywords:
        fkey.write("%s\n" % (keyword.encode("utf-8")))
    fkey.close()

KEA's API is pretty simple. To train KEA, instantiate a KEAModelBuilder, set its parameters, then build and save the model. To extract keywords, instantiate a KEAKeyphraseExtractor, set its parameters and extract the keywords. Here is some Scala code to do just this, modeled on the TestKea.java program supplied with the KEA source download. We train a KEA model on the contents of the train directory, then use the model to extract keywords into the test directory. The algorithm is also quite fast - training and extraction on my dataset of 2000+ documents typically finished in 30-40s. The only thing I had to hack was the location of the English stopwords file: KEA expects it at an offset of data/stopwords as in the source distribution, so I just created a temporary symlink to make it work.

// Source: src/main/scala/com/mycompany/scalcium/keyextract/KeaClient.scala
package com.mycompany.scalcium.keyextract

import java.io.File
import kea.main.KEAModelBuilder
import kea.stemmers.PorterStemmer
import kea.stopwords.StopwordsEnglish
import kea.main.KEAKeyphraseExtractor

object KeaClient extends App {

  val trainDir = "/path/to/kea/train/dir"
  val testDir = "/path/to/kea/test/dir"
  val modelFile = "/path/to/kea/model"
  val valKeysDir = "/path/to/kea/test/dir/" + "keys"

  val kc = new KeaClient()
  kc.train(trainDir, modelFile)
  kc.test(modelFile, testDir)
}

class KeaClient {

  def train(trainDir: String, modelFilePath: String): Unit = {
    val modelBuilder = new KEAModelBuilder()
    modelBuilder.setDirName(trainDir)
    modelBuilder.setModelName(modelFilePath)
    modelBuilder.setVocabulary("none")
    modelBuilder.setEncoding("UTF-8")
    modelBuilder.setDocumentLanguage("en")
    modelBuilder.setStemmer(new PorterStemmer())
    modelBuilder.setStopwords(new StopwordsEnglish())
    modelBuilder.setMaxPhraseLength(5)
    modelBuilder.setMinPhraseLength(1)
    modelBuilder.setMinNumOccur(2)
    modelBuilder.buildModel(modelBuilder.collectStems())
    modelBuilder.saveModel()
  }
  
  def test(modelFilePath: String, testDir: String): Unit = {
    val keyExtractor = new KEAKeyphraseExtractor()
    keyExtractor.setDirName(testDir)
    keyExtractor.setModelName(modelFilePath)
    keyExtractor.setVocabulary("none")
    keyExtractor.setEncoding("UTF-8")
    keyExtractor.setDocumentLanguage("en")
    keyExtractor.setStemmer(new PorterStemmer())
    keyExtractor.setNumPhrases(10)
    keyExtractor.setBuildGlobal(true)
    keyExtractor.loadModel()
    keyExtractor.extractKeyphrases(keyExtractor.collectStems())
  }
}

In order to measure KEA's accuracy on my dataset, I calculated the similarity of the KEA-generated keywords against the originally assigned keywords in the test set. Since I tell KEA to generate at most 10 keywords (and the original documents have anywhere up to 40 keywords), my similarity measure is just the number of KEA keywords that match the original set, divided by 10. The distribution of matches across the corpus is shown below on the left.

I then performed the same exercise with a 10/90 train/test split. The reasoning is that in most cases I won't have the luxury of manually assigned keywords to train KEA with, so a small sample would have to be manually tagged, and I wanted to see if there was any significant difference in accuracy between a 75/25 split and a 10/90 split. As you can see from the distribution on the right, it seems largely unchanged; the stronger tendency towards a normal shape can simply be attributed to the larger test sample in the 10/90 case. The code to generate these distributions is shown below:

# Source: kea_measure.py
from __future__ import division

import os
import matplotlib.pyplot as plt
import numpy as np

EXPECTED_DIR = "/path/to/kea/test/keys"
PREDICTED_DIR = "/path/to/kea/test"

def proportion_matched(set1, set2):
    return len(set1.intersection(set2)) / 10
    
fout = open("/tmp/kea_stats.csv", 'wb')
for filename in os.listdir(EXPECTED_DIR):
    fexp = open(os.path.join(EXPECTED_DIR, filename), 'rb')
    expected_keywords = set([x.strip().lower() for x in fexp.readlines()])
    fexp.close()
    fpred = open(os.path.join(PREDICTED_DIR, filename), 'rb')
    predicted_keywords = set([x.strip().lower() for x in fpred.readlines()])
    fpred.close()
    sim = proportion_matched(expected_keywords, predicted_keywords)
    fout.write("%6.4f\n" % (sim))
fout.close()

# draw histogram
data = np.loadtxt("/tmp/kea_stats.csv")
plt.hist(data)
plt.show()

# calculate mean and 95% confidence interval
mean = np.mean(data)
std = np.std(data)
# Header MEAN, CF_DN, CF_UP
print "%6.4f\t%6.4f\t%6.4f" % (mean, mean - 1.96 * std, mean + 1.96 * std)

In order to find a possible sweet spot in the split size where KEA performs optimally, I varied the split between 10/90 and 90/10 and computed the mean and the 95% confidence interval at each step. I then charted this aggregated data as follows:

# Source: kea_graph.py
import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("kea_means.csv", delimiter="\t")
print data.head()
data.plot()
plt.show()

As you can see from the chart below, KEA's performance seems quite robust to training size - the mean accuracy (blue line) stays between 0.3 and 0.4 (that is, 3 to 4 of the 10 generated keywords match) regardless of the split. The upper and lower bounds of the 95% confidence interval are similarly stable.


I eyeballed a few .key files in order to do a sanity check, and the generated keywords appeared surprisingly good. I then built tag clouds of the original and generated keywords in my 90% test set for comparison.

# Source: kea_tagcloud.py
import os
import wordcloud

ORIG_KEYDIR = "/path/to/kea/test/keys"
GEN_KEYDIR = "/path/to/kea/test"
OUTPUT_DIR = "/path/to/kea/output/graphs"

def get_keyword_freqs(dirname):
    freqs = dict()
    for keyfile in [f for f in os.listdir(dirname) if f.endswith(".key")]:
        fkey = open(os.path.join(dirname, keyfile), 'rb')
        for line in fkey:
            keyword = line.strip().lower()
            count = 0
            if freqs.has_key(keyword):
                count = freqs[keyword]
            freqs[keyword] = count + 1
        fkey.close()
    freqlist = []
    for key, value in freqs.items():
        if value < 2: 
            continue
        freqlist.append((key, value))
    return freqlist
        
tag_counts = get_keyword_freqs(ORIG_KEYDIR)
elements = wordcloud.fit_words(tag_counts)
wordcloud.draw(elements, os.path.join(OUTPUT_DIR, "kea_origtags.png"))

tag_counts = get_keyword_freqs(GEN_KEYDIR)
elements = wordcloud.fit_words(tag_counts)
wordcloud.draw(elements, os.path.join(OUTPUT_DIR, "kea_gentags.png"))

I used Andreas Mueller's wordcloud project, described in this blog post, to generate the tag clouds. I tried pytagcloud initially, but couldn't make it work because of dependency issues. The tag clouds for the original and generated keywords are shown below left and right respectively. As you can see, while the RHS cloud is not identical to the LHS one, it has quite a few good keywords in it.

I was curious about the KEA algorithm, so I went in and peeked at the code. It decomposes the input text into n-grams, then computes custom features for each n-gram. It then uses the manually assigned keywords to label these n-grams as keyphrase or not, and trains a Weka Naive Bayes classifier. The classifier is then run against unseen documents to predict keywords, which are further filtered by probability to produce the final keyword recommendations. The KEA algorithm is described in depth in this paper (PDF). A less formal description is available on KEA's description page. A slightly terser KEA HOWTO than this post can be found here. A rough sketch of the idea is shown below.
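
To make that description concrete, here is my own toy re-telling of the concept in Python - NOT KEA's actual code. sklearn's GaussianNB stands in for Weka's Naive Bayes, the feature set is reduced to the two classic KEA features (TF x IDF and first-occurrence position), and doc_freq, a dict of candidate document frequencies, is assumed to be precomputed.

# Toy sketch of the KEA idea described above - NOT the actual KEA implementation
import math
from sklearn.naive_bayes import GaussianNB

def candidates(tokens, max_n=3):
    # candidate keyphrases are all 1- to max_n-grams of the document
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def phrase_features(phrase, tokens, doc_freq, num_docs):
    # two KEA-style features: TF*IDF and relative position of first occurrence
    text = " ".join(tokens)
    tf = text.count(phrase) / float(len(tokens))
    idf = math.log(float(num_docs) / (1.0 + doc_freq.get(phrase, 0)))
    first_pos = text.find(phrase) / float(len(text))
    return [tf * idf, first_pos]

def train(docs, keyphrases, doc_freq):
    # docs: list of token lists; keyphrases: parallel list of keyword sets
    X, y = [], []
    for tokens, keys in zip(docs, keyphrases):
        for cand in set(candidates(tokens)):
            X.append(phrase_features(cand, tokens, doc_freq, len(docs)))
            y.append(1 if cand in keys else 0)
    return GaussianNB().fit(X, y)

def extract(clf, tokens, doc_freq, num_docs, top_n=10):
    # rank an unseen document's candidates by P(keyphrase), keep the top N
    cands = sorted(set(candidates(tokens)))
    feats = [phrase_features(c, tokens, doc_freq, num_docs) for c in cands]
    probs = clf.predict_proba(feats)[:, 1]
    ranked = sorted(zip(cands, probs), key=lambda x: -x[1])
    return [c for c, p in ranked[:top_n]]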

I hope you found this post useful. KEA seems to be a useful tool for keyword extraction when you have (or can build) some keyword tagged documents for training.

Saturday, August 16, 2014

Context-less POS tagging with NLTK


Last week I was handed an interesting task - flagging non-nouns among single-word synonyms for concepts in our Medical Ontology. Now, the Part of Speech (POS) tag is not really an intrinsic attribute of a word - rather, the POS is determined by the context in which the word appears. For example, consider the two sentences "He had an infectious laugh", where laugh is a Noun, and "He will laugh at the joke", where the same word is now a Verb.

However, a word with multiple senses (usually) has one dominant sense, so in the absence of any context, we could call that the word's POS. In order to calculate the dominant POS for words, I decided to use the words in the Brown Corpus, a manually tagged text corpus of about 500 articles. The Brown Corpus is available (after nltk.download()) as part of the Natural Language Toolkit (NLTK).

The idea is to create a frequency distribution of (word, tag) pairs using all the words and their tags in the Brown Corpus. Then for each word, we normalize the frequencies across the different POS tags to derive the probability of the word having each of these tags. The output is a flat file. Here's the code to build this data.

# Source: dict_build.py
import nltk
from nltk.corpus import brown

DELIM = "_|_"
NORMED_TAGS = ["NN", "VB", "JJ", "RB", "DT", "IN", "OT"]
POSTAGS = {
    "NN" : "noun",
    "VB" : "verb",
    "JJ" : "adjective",
    "RB" : "adverb",
    "DT" : "determiner",
    "IN" : "preposition",
    "OT" : "other"
}

def normalize_brown_postags():
    brown_tags = open("../data/brown_tags.csv", 'rb')
    tag_map = dict()
    for line in brown_tags:
        line = line.strip()
        if len(line) == 0 or line.startswith("#"):
            continue
        tag_name, tag_description = line.split("\t")[0:2]
        tag_desc_words = set(nltk.word_tokenize(tag_description.lower()))
        is_tagged = False
        for normed_tag in NORMED_TAGS[:-1]:
            desc_pattern = POSTAGS[normed_tag]
            if desc_pattern in tag_desc_words:
                tag_map[tag_name] = normed_tag
                is_tagged = True
        if not is_tagged:
            tag_map[tag_name] = "OT"
    brown_tags.close()
    return tag_map

def retag_brown_words(tag_map):
    wordpos_fd = nltk.FreqDist()
    for word, tag in brown.tagged_words():
        if tag_map.has_key(tag):
            normed_pos = tag_map[tag]
            retagged_word = DELIM.join([word.lower(), normed_pos])
            wordpos_fd.inc(retagged_word)  
    return wordpos_fd
    
def compose_record(word, wordpos_fd):
    freqs = []
    for tag in NORMED_TAGS:
        wordpos = DELIM.join([word, tag])
        freqs.append(wordpos_fd[wordpos])
    sum_freqs = float(sum(freqs))
    nf = [float(f) / sum_freqs for f in freqs]
    return "%s\t%s\n" % (word, "\t".join(["%5.3f" % (x) for x in nf]))

# main
tag_map = normalize_brown_postags()
wordpos_fd = retag_brown_words(tag_map)
already_seen_words = set()
brown_dict = open("../data/brown_dict.csv", 'wb')
brown_dict.write("#WORD\t%s\n" % ("\t".join(NORMED_TAGS)))
for wordpos in wordpos_fd.keys():
    word, tag = wordpos.split(DELIM)
    if word in already_seen_words:
        continue
    brown_dict.write(compose_record(word, wordpos_fd))
    already_seen_words.add(word)
brown_dict.close()

I got the Brown tag descriptions from this page; I copied the table into OpenOffice Calc and wrote it out as the CSV file (brown_tags.csv) referenced above. I reduced the large set of Brown tags to a small set of 7 tags - Nouns (NN), Verbs (VB), Adjectives (JJ), Adverbs (RB), Determiners (DT), Prepositions (IN), and Other (OT). I had originally reduced it to just 4 tags (similar to Wordnet) - NN, VB, JJ and RB - but extended it to account for the phrase sequences (see below). The output of the code above is a file that looks like this:

#WORD      NN      VB      JJ      RB      DT      IN      OT
as      0.000   0.000   0.000   0.000   0.000   0.017   0.983
his     0.000   0.000   0.000   0.000   0.995   0.000   0.005
be      0.000   1.000   0.000   0.000   0.000   0.000   0.000
on      0.000   0.000   0.000   0.083   0.000   0.917   0.000
at      0.000   0.000   0.000   0.000   0.000   1.000   0.000
...

The data is consumed from Java, both for maintainability (we are a Java shop) and because part of the work involves checking against a middleware component that is Java-only, but here is some Python test code that illustrates how this information can be used.

# Source: predict.py
from __future__ import division
import numpy as np
import nltk

NORMTAGS = ["NN", "VB", "JJ", "RB", "DT", "IN", "OT"]

def load_word_dict(dict_file):
    word_dict = {}
    fdict = open(dict_file, "rb")
    for line in fdict:
        line = line.strip()
        if len(line) == 0 or line.startswith("#"):
            continue
        cols = line.split("\t")
        word = cols[0]
        probs = [float(x) for x in cols[1:]]
        word_dict[word] = probs
    fdict.close()
    return word_dict

def assert_true(fn, message):
    if fn != True:
        print "Assert failed:", message

def predict_pos(word, word_dict):
    if word_dict.has_key(word):
        probs = np.array(word_dict[word])
        return NORMTAGS[np.argmax(probs)]
    else:
        return "OT"
        
def predict_if_noun(word, word_dict):
    return predict_pos(word, word_dict) == "NN"

# test cases for individual words
word_dict = load_word_dict("../data/brown_dict.csv")
assert_true(predict_if_noun("hypothalamus", word_dict), 
            "Hypothalamus == NOUN!")
assert_true(not predict_if_noun("intermediate", word_dict), 
            "Intermediate != NOUN!")
assert_true(predict_if_noun("laugh", word_dict), "Laugh ~= NOUN!")

Having done this, I got to thinking that perhaps we could extend the idea to multi-word phrases as well, i.e., we could flag noun phrases instead of nouns. Like their single-word counterparts, these phrases are also synonyms of concepts, and provide no context to decide what kind of phrases they are.

A phrase is a sequence of POS tags. I decided to use the Penn Treebank (PTB) corpus for this. The full PTB is not free, but NLTK provides a small subset for educational purposes. The idea here is to scan the chunked sentences, filter out the noun phrases, and create a frequency distribution of the tag sequences that make up noun phrases.

Since not all English words appear in the Brown Corpus, we also create a transition matrix that predicts the next POS tag given the current one. That way we can assign a likely POS tag to words that are not in our dictionary. As before, I reduce the PTB tags down to our set of 7 tags. The code for generating the POS sequences and the POS transition matrix is shown below:

# Source:phrase_seqs.py
from __future__ import division
import nltk
import numpy as np

from nltk.corpus import treebank_chunk

NORMTAGS = ["NN", "VB", "JJ", "RB", "DT", "IN", "OT"]
POSTAGS = {
    "NN" : "noun",
    "VB" : "verb",
    "JJ" : "adjective",
    "RB" : "adverb",
    "DT" : "determiner",
    "IN" : "preposition",
    "OT" : "other"
}

def normalize_ptb_tags():
    tag_map = {}
    ptb_tags = open("../data/ptb_tags.csv", 'rb')
    for line in ptb_tags:
        line = line.strip()
        if len(line) == 0 or line.startswith("#"):
            continue
        tag, desc = line.split("\t")
        desc_words = nltk.word_tokenize(desc.lower().replace("-", " "))
        is_tagged = False
        for key in NORMTAGS[:-1]:
            postag_desc = POSTAGS[key]
            if postag_desc in desc_words:
                tag_map[tag] = key
                is_tagged = True
        if not is_tagged:
            tag_map[tag] = "OT"
    ptb_tags.close()
    return tag_map
    
def get_chunks(tree, phrase_type, tags):
    try:
        tree.node
    except AttributeError:
        return 
    else:
        if tree.node == phrase_type:
            tags.append(tree)
        else:
            for child in tree:
                get_chunks(child, phrase_type, tags)

def index_of(tag):
    if tag == "START":
        return 0
    elif tag == "END":
        return len(NORMTAGS) + 1
    else:
        return NORMTAGS.index(tag) + 1
        
def update_trans_freqs(trans_freqs, tag_seq):
    tags = ["START"]
    tags.extend(tag_seq.split(" "))
    tags.append("END")
    bigrams = nltk.bigrams(tags)
    for bigram in bigrams:
        row = index_of(bigram[0])
        col = index_of(bigram[1])
        trans_freqs[row, col] += 1
    
# generate phrases as a sequence of (normalized) POS tags and
# transition probabilities across POS tags.
tag_map = normalize_ptb_tags()
np_fd = nltk.FreqDist()
trans_freqs = np.zeros((len(NORMTAGS) + 2, len(NORMTAGS) + 2))
for tree in treebank_chunk.chunked_sents():
    chunks = []
    get_chunks(tree, "NP", chunks)
    for chunk in chunks:
        tagged_poss = [tagged_word[1] for tagged_word in chunk]
        normed_tags = []
        for tagged_pos in tagged_poss:
            try:
                normed_tags.append(tag_map[tagged_pos])
            except KeyError:
                normed_tags.append("OT")
        np_fd.inc(" ".join(normed_tags))
        
fout = open("../data/np_tags.csv", 'wb')
for tag_seq in np_fd.keys():
    fout.write("%s\t%d\n" % (tag_seq, np_fd[tag_seq]))
    update_trans_freqs(trans_freqs, tag_seq)
fout.close()
# scale each row by its L2 norm (not a true probability normalization, but
# the per-row argmax used downstream is unaffected)
trans_probs = trans_freqs / np.linalg.norm(trans_freqs, axis=1)[:, np.newaxis]
trans_probs[~np.isfinite(trans_probs)] = 0.0
np.savetxt("../data/pos_trans.csv", trans_probs, fmt="%7.5f", delimiter="\t")

Thanks to this presentation and this project for teaching me how to parse an NLTK tree. The output of the code above is the following two files: the first is a matrix of row-normalized inter-POS transition scores, used to figure out a likely POS tag for an unknown word given the previous word's tag.

# START      NN      VB      JJ      RB      DT      IN      OT     END
0.00000 0.43902 0.06724 0.21358 0.04351 0.78312 0.02769 0.37574 0.00000 # START
0.00000 0.71857 0.07099 0.13159 0.00346 0.00693 0.00000 0.36708 0.57139 # NN
0.00000 0.83146 0.03079 0.18477 0.15397 0.01540 0.00000 0.27715 0.41573 # VB
0.00000 0.94283 0.06168 0.20266 0.00441 0.01322 0.00000 0.22029 0.13217 # JJ
0.00000 0.08762 0.04381 0.65716 0.04381 0.00000 0.00000 0.04381 0.74478 # RB
0.00000 0.79556 0.14763 0.50030 0.04921 0.03281 0.00000 0.29526 0.06561 # DT
0.00000 0.68599 0.00000 0.34300 0.00000 0.17150 0.00000 0.34300 0.51450 # IN
0.00000 0.66386 0.11246 0.33374 0.02177 0.06893 0.01814 0.59131 0.28296 # OT
0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 # END

and the second is a list of the POS tag sequences for noun phrases in the limited PTB corpus, along with their counts.

DT JJ NN        1162
OT NN           1097
DT NN NN        875
DT              830
NN NN NN        547
JJ              359
...

Once again, the consumer of this data is the Java layer, but we run the following tests (reusing word_dict, predict_pos and NORMTAGS from the listing above) for our own satisfaction.

def load_phrase_tags(phrase_tag_file):
    phrase_tags = set()
    ftags = open(phrase_tag_file, 'rb')
    for line in ftags:
        line = line.strip()
        if len(line) == 0 or line.startswith("#"):
            continue
        phrase, count = line.split("\t")
        phrase_tags.add(phrase)
    ftags.close()
    return phrase_tags

def assert_true(fn, message):
    if fn != True:
        print "Assert failed:", message

def tag_to_index(tag):
    if tag == "START":
        return 0
    elif tag == "END":
        return len(NORMTAGS) + 1
    else:
        return NORMTAGS.index(tag) + 1

def index_to_tag(index):
    if index == 0: 
        return "START"
    elif index == len(NORMTAGS) + 1:
        return "END"
    else:
        return NORMTAGS[index - 1]

def predict_likely_pos(prev_tag, trans_probs):
    row = tag_to_index(prev_tag)
    probs = trans_probs[row, :]
    return index_to_tag(np.argmax(probs))

def predict_if_noun_phrase(phrase, trans_probs, phrase_tags):
    words = nltk.word_tokenize(phrase)
    tags = []
    for word in words:
        if word_dict.has_key(word):
            tags.append(predict_pos(word, word_dict))
        else:
            prev_tag = "START" if len(tags) == 0 else tags[-1]
            tags.append(predict_likely_pos(prev_tag, trans_probs))
    return " ".join(tags) in phrase_tags

# test cases for phrases
phrase_tags = load_phrase_tags("../data/np_tags.csv")
trans_probs = np.loadtxt("../data/pos_trans.csv", delimiter="\t")
assert_true(predict_if_noun_phrase("time flies", trans_probs, phrase_tags), 
            "time flies == NP!")
assert_true(not predict_if_noun_phrase(
            "were spoken", trans_probs, phrase_tags), 
            "were spoken == VP!")

The phrase matching seems a bit too loose at the moment; it flags almost any phrase you throw at it as a noun phrase. One possibility is to manually combine the tag sequences into a smaller number of regex rules, with a cutoff mandating a minimum frequency for a tag sequence to be considered (a sketch of the cutoff idea is shown below). Alternatively, we could also extract Verb, Adjective and Adverb phrase sequences and attempt to match against all of them, choosing the one with the highest frequency.
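
A minimal sketch of the frequency-cutoff idea, reusing the np_tags.csv file generated earlier (MIN_COUNT is an assumed knob, not a tuned value):

# Sketch: only accept tag sequences seen at least MIN_COUNT times in the
# PTB noun-phrase chunks as valid NP patterns (MIN_COUNT is an assumed value)
MIN_COUNT = 10

def load_phrase_tags_with_cutoff(phrase_tag_file, min_count=MIN_COUNT):
    phrase_tags = set()
    ftags = open(phrase_tag_file, 'rb')
    for line in ftags:
        line = line.strip()
        if len(line) == 0 or line.startswith("#"):
            continue
        tag_seq, count = line.split("\t")
        if int(count) >= min_count:
            phrase_tags.add(tag_seq)
    ftags.close()
    return phrase_tags

# phrase_tags = load_phrase_tags_with_cutoff("../data/np_tags.csv")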

One other problem with this approach is that the data sets we have considered are a bit on the small side, and they don't match our domain. An option is to POS tag and chunk a large amount of relevant text with something like OpenNLP, then use the above approach to find free-standing noun phrases.