Salmon Run: Language Model to detect Medical Sentences using NLTK

Saturday, April 20, 2013

Language Model to detect Medical Sentences using NLTK

I've been thinking of ways of singling out medical sentences in a body of mixed text for special processing, and one of the approaches I thought of was to train a trigram (backoff) language model using some medical text, then use the model to detect if a sentence is medical or non-medical. The joint probability that the model assigns to the words of a sentence should be higher for medical sentences than for non-medical ones.

I initially looked at NLTK's NgramModel, but could not make it work, because the Lidstone probability distribution I was passing to it as an estimator expected a minimum number of bins to be configured. I could not reproduce the error with small amounts of data, so I was unable to submit a bug report. I also found that the NgramModel can't be pickled (because of a probability distribution function object in it), which made it even less interesting.

In any case, you can find this (non-working) code in my GitHub here. It crashes with a "ValueError - A Lidstone probability distribution must have at least one bin" error message during the testing phase. Unfortunately I can't share the data for licensing reasons. But if you have a reasonably large set of XML files (I had about 3,500) to feed the code, it should fail at around the same place. [Update: I found a publicly available XML sample and I have asked about this on the nltk-users mailing list - you can follow the discussion here, if you'd like.]
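On the pickling point above: as far as I can tell this is a general limitation of Python's pickle module rather than anything specific to NLTK. The sketch below is only an illustration of that limitation. The dictionaries and the lambda are hypothetical stand-ins, not NLTK's actual internals.

```python
import pickle
from collections import Counter

counts = Counter("the ear the inner ear".split())

# A model that carries a function object (here a lambda standing in
# for an estimator) cannot be serialized by the standard pickler.
model_with_function = {"counts": counts, "estimator": lambda fd: fd}
try:
    pickle.dumps(model_with_function)
    function_model_pickled = True
except (pickle.PicklingError, AttributeError, TypeError):
    function_model_pickled = False

# A model that stores only data (counts plus a numeric parameter)
# survives a dump/load round trip.
model_data_only = {"counts": counts, "gamma": 0.2}
restored = pickle.loads(pickle.dumps(model_data_only))

print(function_model_pickled, restored == model_data_only)
```

This is one reason to keep only counts and plain numbers inside a model that needs to be saved to disk.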

However, it turns out that a trigram language model is quite simple to build, especially using NLTK's building blocks. My language model first tries to report the trigram probability, falls back to the corresponding bigram and then unigram probabilities, and uses a Laplace smoothed estimate at the unigram level so that unknown words still get a non-zero probability. Probabilities at lower n-grams are discounted by a (heuristically chosen) value alpha, and the final result is normalized by the number of words in the sentence (to remove the effect of long sentences). Because this is a proof of concept to test the validity of the idea more than anything else, I decided to skip the calculation of alpha and simply fixed it at 0.4.
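To make the backoff scheme concrete, here is a minimal sketch of it using only the standard library. The class name BackoffModel, the helper ngrams() and the two-sentence toy corpus are made up for illustration, and whitespace splitting stands in for a real tokenizer. A seen n-gram is scored as its count divided by the count of the words that precede its last word, which is the usual conditional estimate.

```python
import math
from collections import Counter

def ngrams(words, n):
    """Return the n-grams (as tuples) found in a list of words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

class BackoffModel:
    """Toy trigram -> bigram -> unigram backoff model (illustrative only)."""

    def __init__(self, sentences, alpha=0.4):
        self.alpha = alpha
        # counts[n] maps each n-gram tuple to its frequency, for n = 1, 2, 3
        self.counts = {n: Counter() for n in (1, 2, 3)}
        for words in sentences:
            for n in (1, 2, 3):
                self.counts[n].update(ngrams(words, n))
        self.n_words = sum(self.counts[1].values())  # total tokens
        self.vocab = len(self.counts[1])             # distinct tokens

    def prob(self, ngram):
        n = len(ngram)
        if n == 1:
            # Laplace (add-one) estimate, non-zero even for unseen words
            return (self.counts[1][ngram] + 1) / (self.n_words + self.vocab)
        freq = self.counts[n][ngram]
        if freq == 0:
            # unseen: drop the oldest word and discount by alpha
            return self.alpha * self.prob(ngram[1:])
        # seen: relative frequency given the preceding words
        return freq / self.counts[n - 1][ngram[:-1]]

    def score(self, words):
        """Sum of trigram log probabilities divided by sentence length."""
        if not words:
            return float("-inf")
        total = sum(math.log(self.prob(t)) for t in ngrams(words, 3))
        return total / len(words)

corpus = [s.split() for s in (
    "hearing loss occurs in the ear",
    "hearing loss affects the inner ear",
)]
model = BackoffModel(corpus)
in_domain = model.score("hearing loss occurs in the inner ear".split())
out_domain = model.score("the court granted a hearing today".split())
print(in_domain, out_domain)
```

On this toy corpus the sentence that reuses training n-grams gets the higher (less negative) score, which is the effect the approach relies on.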

Here is the code for the home grown language model described above (also available in my GitHub). The train() function reads in a bunch of medical XML files and parses out the sentences. These sentences are then used to instantiate the LangModel class, which is then pickled. The test() function then unpickles the model and uses it to sum the log probabilities of each sentence's trigrams, finally normalizing the sum by the length of the sentence.

from __future__ import division

import math
import os.path

import cPickle
import glob
import nltk
from nltk.corpus.reader import XMLCorpusReader

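class LangModel:
  def __init__(self, order, alpha, sentences):
    self.order = order
    self.alpha = alpha
    # every model of order n > 1 owns a backoff model of order n - 1
    if order > 1:
      self.backoff = LangModel(order - 1, alpha, sentences)
      self.lexicon = None
    else:
      self.backoff = None
      self.n = 0
    # frequency of each n-gram (tuple of words) of this order
    self.ngramFD = nltk.FreqDist()
    lexicon = set()
    for sentence in sentences:
      words = nltk.word_tokenize(sentence)
      wordNGrams = nltk.ngrams(words, order)
      for wordNGram in wordNGrams:
        # FreqDist.inc() is the NLTK 2.x API; on NLTK 3 use
        # self.ngramFD[wordNGram] += 1 instead
        self.ngramFD.inc(wordNGram)
        if order == 1:
          # track total token count (n) and vocabulary for smoothing
          lexicon.add(wordNGram)
          self.n += 1
    # vocabulary size, only meaningful for the unigram model
    self.v = len(lexicon)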

  def logprob(self, ngram):
    return math.log(self.prob(ngram))
  
  def prob(self, ngram):
    if self.backoff is not None:
      freq = self.ngramFD[ngram]
      if freq == 0:
        # unseen n-gram: back off to the shorter n-gram, discounted by alpha
        return self.alpha * self.backoff.prob(ngram[1:])
      else:
        # seen n-gram: divide by the frequency of its context, i.e. the
        # words that precede the last word of the n-gram
        return freq / self.backoff.ngramFD[ngram[:-1]]
    else:
      # laplace smoothing to handle unknown unigrams
      return ((self.ngramFD[ngram] + 1) / (self.n + self.v))

def train():
  # skip training if a pickled model already exists
  if os.path.isfile("lm.bin"):
    return
  files = glob.glob("data/*.xml")
  sentences = []
  for i, path in enumerate(files):
    if i > 0 and i % 500 == 0:
      print("%d/%d files loaded, #-sentences: %d" %
        (i, len(files), len(sentences)))
    # XMLCorpusReader takes the corpus root and the file id separately
    root, fileid = os.path.split(path)
    reader = XMLCorpusReader(root, fileid)
    sentences.extend(nltk.sent_tokenize(" ".join(reader.words())))
  # trigram model with a backoff discount (alpha) of 0.4
  lm = LangModel(3, 0.4, sentences)
  with open("lm.bin", "wb") as fout:
    cPickle.dump(lm, fout)

def test():
  with open("lm.bin", 'rb') as fin:
    lm1 = cPickle.load(fin)
  with open("sentences.test", 'rb') as testFile:
    for line in testFile:
      sentence = line.strip()
      words = nltk.word_tokenize(sentence)
      if len(words) == 0:
        # skip blank lines to avoid dividing by zero below
        continue
      print "SENTENCE:", sentence,
      wordTrigrams = nltk.trigrams(words)
      slogprob = 0
      for wordTrigram in wordTrigrams:
        slogprob += lm1.logprob(wordTrigram)
      # normalize by sentence length to remove the effect of long sentences
      print "(", slogprob / len(words), ")"

def main():
  train()
  test()

if __name__ == "__main__":
  main()

And here are the language model's predictions for a set of test sentences I pulled off the Internet (mainly Wikipedia).

1. In biology, immunity is the state of having sufficient biological defences to avoid infection, disease, or other unwanted biological invasion. (-6.53506411778)
2. Naturally acquired immunity occurs through contact with a disease causing agent, when the contact was not deliberate, whereas artificially acquired immunity develops only through deliberate actions such as vaccination. (-7.90563670519)
3. Immunity from prosecution occurs when a prosecutor grants immunity, usually to a witness in exchange for testimony or production of other evidence. (-8.40420096533)
4. Transactional immunity (colloquially known as "blanket" or "total" immunity) completely protects the witness from future prosecution for crimes related to his or her testimony. (-8.60917860675)
5. Hearing loss is being partly or totally unable to hear sound in one or both ears. (-1.61661138183)
6. Conductive hearing loss (CHL) occurs because of a mechanical problem in the outer or middle ear. (-1.98718543565)
7. Sensorineural hearing loss (SNHL) occurs when the tiny hair cells (nerve endings) that detect sound in the ear are injured, diseased, do not work correctly, or have died. (-2.5566194904)
8. This type of hearing loss often cannot be reversed. (-2.72710898378)
9. In law, a hearing is a proceeding before a court or other decision-making body or officer, such as a government agency. (-5.87112753897)
10. Within some criminal justice systems, a preliminary hearing (evidentiary hearing) is a proceeding, after a criminal complaint has been filed by the prosecutor, to determine whether there is enough evidence to require a trial. (-7.44050739024)

As you can see, sentences that are obviously medical tend to have a higher normalized log probability (the value at the end of the sentence) than sentences that are not. Sentences #1 and #2 are right on the border, with normalized log probabilities comparable to those of the non-medical sentences. Depending on the results of more tests, this model may or may not be good enough. Alternatively, it may be more effective to reframe the problem as one where we have to classify a sentence as belonging to one of multiple genres, and each genre has its own language model.
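The multi-genre idea could look something like the sketch below: score the sentence under every genre's model and pick the genre with the highest normalized log probability. This is only a rough illustration. The UnigramModel class, the classify() function and the tiny training sentences are all invented for the example, and an add-one smoothed unigram model stands in for the trigram backoff model to keep it short.

```python
import math
from collections import Counter

class UnigramModel:
    """Add-one smoothed unigram model, a stand-in for a full n-gram model."""

    def __init__(self, sentences):
        self.counts = Counter(w for s in sentences for w in s.lower().split())
        self.total = sum(self.counts.values())
        self.vocab = len(self.counts)

    def score(self, sentence):
        """Log probability of the sentence, normalized by its length."""
        words = sentence.lower().split()
        if not words:
            return float("-inf")
        logprob = sum(
            math.log((self.counts[w] + 1) / (self.total + self.vocab))
            for w in words)
        return logprob / len(words)

def classify(sentence, models):
    """Return the genre whose model gives the sentence the highest score."""
    return max(models, key=lambda genre: models[genre].score(sentence))

# hypothetical toy training data, one small corpus per genre
models = {
    "medical": UnigramModel([
        "hearing loss occurs in the inner ear",
        "immunity protects against infection and disease",
    ]),
    "legal": UnigramModel([
        "a hearing is a proceeding before a court",
        "the prosecutor grants immunity to a witness",
    ]),
}

medical_guess = classify("conductive hearing loss affects the middle ear", models)
legal_guess = classify("the court held a preliminary hearing", models)
print(medical_guess, legal_guess)
```

Compared with a single model and a fixed cutoff, this only needs the scores to be comparable across models, at the cost of having to collect training text for every genre.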

Anyway, that's all I have for today. Hope you found it interesting.