Friday, April 04, 2014

More about Parsing Drug Dosage Phrases


In my previous post, I described using a Finite State Machine (FSM) implementation to parse Drug Dosage phrases into their constituent parts. While the results weren't too bad, one thing struck me as a bit sketchy: I had to build up the state diagram by manually eyeballing some phrases and their parses (from Erin Rhode's Perl program). In this post, I attempt to improve upon that solution.

My first attempt is to use LingPipe's chunkers to build dictionary- and regex-driven Named Entity Recognizers (NERs). A Dictionary Chunker is populated with terms from the drugs.dict, frequencies.dict, routes.dict and units.dict files - the chunker tags words from these files as DRUG, FREQ, ROUTE and UNIT respectively. Similarly, the patterns in num_patterns.dict drive a Regular Expression Chunker - matching spans are tagged as NUM. Notice that we lose the distinction between DOSAGE, REFILL and QTY that we had previously - but these can be recreated from the tags if needed, by looking for patterns such as DOSAGE ::= NUM+ UNIT. The annotations from the Dictionary NER and the Regex NER are stacked, giving the results shown below - words annotated as O are those that neither NER could tag.
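The DOSAGE recreation mentioned above can be sketched as a small post-processing pass over the word/tag pairs. This is a hypothetical standalone sketch, not part of the original code: it collapses each run of one or more NUM tags followed by a UNIT tag into a single DOSAGE chunk, per the pattern DOSAGE ::= NUM+ UNIT.

```scala
// Hypothetical helper, not part of the original code: recover DOSAGE
// chunks from the flat word/tag pairs produced by the stacked NERs.
object DosageRecovery {
  def recoverDosage(tagged: List[(String, String)]): List[(String, String)] = {
    def loop(rest: List[(String, String)],
             acc: List[(String, String)]): List[(String, String)] =
      rest match {
        case Nil => acc.reverse
        case _ =>
          // peel off a (possibly empty) run of NUM tags
          val (nums, afterNums) = rest.span(_._2 == "NUM")
          afterNums match {
            case (unitWord, "UNIT") :: tail if nums.nonEmpty =>
              // NUM+ UNIT found: collapse into a single DOSAGE chunk
              val phrase = (nums.map(_._1) :+ unitWord).mkString(" ")
              loop(tail, (phrase, "DOSAGE") :: acc)
            case _ =>
              // no match here, keep the head pair and move on
              loop(rest.tail, rest.head :: acc)
          }
      }
    loop(tagged, Nil)
  }
}
```

So a tagging like aspirin/DRUG 81/NUM mg/UNIT one/NUM qd/FREQ would come back with "81 mg" collapsed into a single DOSAGE chunk.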

hydrocortizone cream, apply to rash bid
hydrocortizone/DRUG cream/DRUG ,/O apply/ROUTE to/O rash/O bid/FREQ

albuterol inhaler one to two puffs bid
albuterol/DRUG inhaler/DRUG one/NUM to/O two/NUM puffs/UNIT bid/FREQ

Enteric coated aspirin 81 mg tablets one qd
Enteric/DRUG coated/DRUG aspirin/DRUG 81/NUM mg/UNIT tablets/UNIT one/NUM qd/FREQ

Vitamin B12 1000 mcg IM
Vitamin/DRUG B12/DRUG 1000/NUM mcg/UNIT IM/ROUTE

atenolol 50 mg tabs one qd, #100, one year
atenolol/DRUG 50/NUM mg/UNIT tabs/UNIT one/NUM qd/FREQ ,/O #100,/NUM one/O year/NUM

The code to build the NERs using the LingPipe API is shown below. We follow a strategy similar to the one for the FSMs: build generic NER implementations, then use them from application code.

// Source: src/main/scala/com/mycompany/scalcium/utils/NER.scala
package com.mycompany.scalcium.utils

import java.io.File
import java.util.regex.Pattern

import scala.collection.JavaConversions.asScalaIterator
import scala.collection.mutable.ArrayBuffer
import scala.io.Source

import com.aliasi.chunk.CharLmHmmChunker
import com.aliasi.chunk.Chunk
import com.aliasi.chunk.HmmChunker
import com.aliasi.chunk.RegExChunker
import com.aliasi.dict.DictionaryEntry
import com.aliasi.dict.ExactDictionaryChunker
import com.aliasi.dict.MapDictionary
import com.aliasi.hmm.HmmCharLmEstimator
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory
import com.aliasi.util.AbstractExternalizable

trait NER {
  
  def chunk(s: String): List[Chunk]
  
  def tag(s: String): List[(String,String)] = {
    val chunks = chunk(s)
    var curr = 0
    val tags = new ArrayBuffer[(String,String)]
    chunks.foreach(chunk => {
      val start = chunk.start
      val end = chunk.end
      val ctype = chunk.`type`
      if (curr < start) {
        val prevtext = s.substring(curr, start)
        prevtext.split(" ")
          .foreach(word => tags += ((word, "O")))
      }
      val chunktext = s.substring(start, end)
      chunktext.split(" ")
        .foreach(word => tags += ((word, ctype)))
      curr = end
    })
    if (curr < s.length()) {
      val lasttext = s.substring(curr, s.length())
      lasttext.split(" ")
        .foreach(word => tags += ((word, "O")))
    }
    tags.filter(wt => wt._1.length() > 0)
      .toList
  }
  
  def merge(taggedWords: List[List[(String,String)]]): 
      List[(String,String)] = {
    val lengths = taggedWords.map(tag => tag.size)
    val maxlen = lengths.max
    val longest = lengths.zipWithIndex
      .sortBy(li => -li._1) // sort descending so head is the longest tagging
      .head._2
    val words = taggedWords(longest).map(
      taggedWord => taggedWord._1)
    val mergedTags = (0 until maxlen).map(i => {
      val tags = taggedWords.map(taggedWord => 
        if (taggedWord.size > i) taggedWord(i)._2 else "O")
        .filter(tag => !"O".equals(tag))
      if (tags.isEmpty) "O"
      else tags.head // TODO: revisit use Bayes Net for disambig
    })
    .toList
    words.zip(mergedTags)
  }
}

/**
 * Dictionary based NER. Uses a set of files, each containing
 * terms that belong to a specified class.
 */
class DictNER(val data: Map[String,File]) extends NER {
  
  val dict = new MapDictionary[String]()
  data.foreach(entityData => {
    val entityName = entityData._1
    Source.fromFile(entityData._2).getLines()
      .foreach(line => dict.addEntry(
        new DictionaryEntry[String](line, entityName, 1.0D)))
  })
  val chunker = new ExactDictionaryChunker(dict, 
    IndoEuropeanTokenizerFactory.INSTANCE, false, false)

  override def chunk(s: String): List[Chunk] = {
    val chunking = chunker.chunk(s)
    chunking.chunkSet().iterator().toList
  }
}

/**
 * Regex based NER. Uses a set of files, each containing
 * regular expressions representing a specified class.
 */
class RegexNER(val data: Map[String,File]) extends NER {
  val chunkers = data.map(entityData => {
    val entityName = entityData._1
    Source.fromFile(entityData._2).getLines()
      .map(line => new RegExChunker(
        Pattern.compile(line), entityName, 1.0D))
  })
  .flatten
  .toList
  
  override def chunk(s: String): List[Chunk] = {
    chunkers.map(chunker => {
      val chunking = chunker.chunk(s)
      chunking.chunkSet().iterator()
    })
    .flatten
    .toList
    .sortBy(chunk => chunk.start)
  }
}

/**
 * Model based NER. A single multiclass HMM Language
 * Model is constructed out of the training data, and
 * used to predict classes for new words.
 */
class ModelNER(val modelFile: File) extends NER {
  val chunker = if (modelFile != null) 
    AbstractExternalizable
      .readObject(modelFile)
      .asInstanceOf[HmmChunker]
    else null
    
  override def chunk(s: String): List[Chunk] = {
    if (chunker == null) List.empty
    else chunker.chunk(s)
      .chunkSet()
      .iterator()
      .toList
  }

  def train(taggedFile: File, modelFile: File,
      ngramSize: Int, numChars: Int,
      lambda: Double): Unit = {
    val factory = IndoEuropeanTokenizerFactory.INSTANCE
    val estimator = new HmmCharLmEstimator(
      ngramSize, numChars, lambda)
    val chunker = new CharLmHmmChunker(factory, estimator)
    Source.fromFile(taggedFile)
      .getLines()
      .foreach(taggedWords => {
        taggedWords.split(" ")
        .foreach(taggedWord => {
          val slashAt = taggedWord.lastIndexOf('/')
          val word = taggedWord.substring(0, slashAt)
          val tag = taggedWord.substring(slashAt + 1)
          chunker.trainDictionary(word, tag)
      })
    })
    AbstractExternalizable.compileTo(chunker, modelFile)
  }
}
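To illustrate the merge() semantics above without the LingPipe dependency, here is a hypothetical standalone sketch (MergeSketch is not part of the original code) that mirrors its behavior: at each word position, the first non-O tag across the stacked taggings wins, defaulting to O.

```scala
// Hypothetical sketch mirroring NER.merge(): stack several taggings of
// the same sentence and keep the first non-O tag at each position.
object MergeSketch {
  def merge(taggings: List[List[(String, String)]]): List[(String, String)] = {
    // take the words from the longest tagging
    val longest = taggings.maxBy(_.size)
    val words = longest.map(_._1)
    words.zipWithIndex.map { case (word, i) =>
      // collect the tags proposed at position i, keep the first non-O one
      val tag = taggings
        .flatMap(t => if (t.size > i) Some(t(i)._2) else None)
        .find(_ != "O")
        .getOrElse("O")
      (word, tag)
    }
  }
}
```

For example, merging a dictionary tagging of "atenolol 50 mg" (which knows DRUG and UNIT but not NUM) with a regex tagging (which knows only NUM) yields the fully tagged sequence.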

After building the client that uses the DictNER and RegexNER classes to parse the Drug Dosage phrases, my second attempt at improvement was to generalize the approach to handle unseen words. For this, I wrapped LingPipe's HMM-based Chunker in a ModelNER. The ModelNER is trained on data generated by merging the annotations from the DictNER and RegexNER - the code for that can be seen in the train() method of DrugDosageNER.scala below:

// Source: src/main/scala/com/mycompany/scalcium/drugdosage/DrugDosageNER.scala
package com.mycompany.scalcium.drugdosage

import java.io.File
import java.io.FileWriter
import java.io.PrintWriter

import scala.Array.canBuildFrom
import scala.collection.TraversableOnce.flattenTraversableOnce
import scala.collection.mutable.ArrayBuffer
import scala.io.Source

import com.mycompany.scalcium.utils.DictNER
import com.mycompany.scalcium.utils.ModelNER
import com.mycompany.scalcium.utils.RegexNER

class DrugDosageNER(val modelFile: File,
    val debug: Boolean = false) {

  val tmpdir = new File("/tmp")
  val modelNER = new ModelNER(modelFile)
  
  def train(drugFile: File, freqFile: File, routeFile: File,
      unitsFile: File, numPatternsFile: File,
      inputFile: File, modelFile: File,
      ngramSize: Int, numChars: Int, 
      lambda: Double): Unit = {
    // use the dict NER and regex NER to build training
    // set out of rules.
    val dictNER = new DictNER(Map(
      ("DRUG", drugFile),
      ("FREQ", freqFile),
      ("ROUTE", routeFile),
      ("UNIT", unitsFile)))
    val regexNER = new RegexNER(Map(
      ("NUM", numPatternsFile)))
    val trainWriter = new PrintWriter(
      new FileWriter(new File(tmpdir, "model.train")))
    Source.fromFile(inputFile)
      .getLines()
      .foreach(line => {
        val dicttags = dictNER.tag(line)
        val regextags = regexNER.tag(line)
        val mergedtags = dictNER.merge(List(dicttags, regextags))
        trainWriter.println(mergedtags.map(wordTag => 
          wordTag._1 + "/" + wordTag._2).mkString(" "))
    })
    trainWriter.flush()
    trainWriter.close()
    // use the bootstrapped training set to train the
    // model NER
    modelNER.train(new File(tmpdir, "model.train"), 
      modelFile, ngramSize, numChars, lambda)
  }
  
  def evaluate(nfolds: Int, ntest: Int, 
      datafile: File, ngramSize: Int, numChars: Int, 
      lambda: Double): Double = {
    val accuracies = ArrayBuffer[Double]()
    (0 until nfolds).foreach(cv => {
      // get random list of rows that will be our test case
      val testrows = scala.collection.mutable.Set[Int]()
      val random = scala.util.Random
      do {
        testrows += (random.nextDouble * 100).toInt
      } while (testrows.size < ntest)
      // partition input dataset into train and test
      // we use the model.train file from the previous test
      val evaltrain = new PrintWriter(
        new FileWriter(new File(tmpdir, "eval.train")))
      val evaltest = new PrintWriter(
        new FileWriter(new File(tmpdir, "eval.test")))
      var curr = 0
      Source.fromFile(datafile)
        .getLines()
        .foreach(line => {
        if (testrows.contains(curr)) evaltest.println(line)
        else evaltrain.println(line)
        curr += 1
      })
      evaltrain.flush()
      evaltrain.close()
      evaltest.flush()
      evaltest.close()
      // now we use evaltrain to train the model
      val modelNER = new ModelNER(null)
      modelNER.train(new File(tmpdir, "eval.train"), 
        new File(tmpdir, "eval.bin"), 
        ngramSize, numChars, lambda)
      // now test against evaltest with the model
      val trainedModelNER = new ModelNER(
        new File(tmpdir, "eval.bin"))
      val results = Source.fromFile(
        new File(tmpdir, "eval.test"))
        .getLines()
        .map(line => {
           val words = line.split(" ").map(wordTag => 
             wordTag.substring(0, wordTag.lastIndexOf('/')))
           .mkString(" ")
           val rtags = line.split(" ").map(wordTag => 
             wordTag.substring(wordTag.lastIndexOf('/') + 1))
             .toList
           val ptags = trainedModelNER.tag(words)
             .map(wordTag => wordTag._2)
           (0 until List(rtags.size, ptags.size).min)
             .map(i => if (rtags(i).equals(ptags(i))) 1 else 0)
      })
      .flatten
      .toList
      val accuracy = results.sum.toDouble / results.size
      if (debug) 
        Console.println("CV-# %d: accuracy = %f"
        .format(cv, accuracy))
      accuracies += accuracy
    })
    accuracies.sum / accuracies.size
  }
  
  def parse(s: String): List[(String,String)] = modelNER.tag(s)
}

Treating the training data generated by the DictNER and RegexNER as the "gold set", a 10-fold cross-validation with a 70/30 train/test split gave an accuracy of 80.15% for the ModelNER-based DrugDosageNER client. Here is how it annotated the first 5 phrases in the input.

hydrocortizone cream, apply to rash bid
hydrocortizone/DRUG cream/DRUG ,/ROUTE apply/ROUTE to/ROUTE rash/DRUG bid/FREQ

albuterol inhaler one to two puffs bid
albuterol/DRUG inhaler/DRUG one/NUM to/NUM two/NUM puffs/UNIT bid/FREQ

Enteric coated aspirin 81 mg tablets one qd
Enteric/DRUG coated/DRUG aspirin/DRUG 81/NUM mg/UNIT tablets/UNIT one/NUM qd/FREQ

Vitamin B12 1000 mcg IM
Vitamin/DRUG B12/DRUG 1000/NUM mcg/UNIT IM/ROUTE

atenolol 50 mg tabs one qd, #100, one year
atenolol/DRUG 50/NUM mg/UNIT tabs/UNIT one/NUM qd/FREQ ,/NUM #100,/NUM one/NUM year/NUM

Code to call and evaluate the DrugDosageNER can be found in the JUnit class DrugDosageNERTest.scala shown below:

// Source: src/test/scala/com/mycompany/scalcium/drugdosage/DrugDosageNERTest.scala
package com.mycompany.scalcium.drugdosage

import java.io.File

import scala.io.Source

import org.junit.Assert
import org.junit.Ignore
import org.junit.Test

class DrugDosageNERTest {

  val datadir = "/path/to/data/dir"

  @Test
  def testTrainTest(): Unit = {
    val ddNER = new DrugDosageNER(null, true)
    val inputfile = new File(datadir, "input.txt")
    val modelfile = new File(datadir, "model.bin")
    ddNER.train(
      new File(datadir, "drugs.dict"),
      new File(datadir, "frequencies.dict"),
      new File(datadir, "routes.dict"),
      new File(datadir, "units.dict"),
      new File(datadir, "num_patterns.dict"),
      inputfile, modelfile, 8, 256, 8.0D)
    Assert.assertTrue(modelfile.exists())
    // now instantiate the NER with modelfile
    val trainedDDNER = new DrugDosageNER(modelfile, true)
    Source.fromFile(inputfile).getLines()
      .foreach(line => {
        val taggedline = trainedDDNER.parse(line)
          .map(taggedWord => 
            taggedWord._1 + "/" + taggedWord._2)
          .mkString(" ")
        Console.println(line)
        Console.println(taggedline)
        Console.println()
    })
  }
  
  @Test
  def testEvaluate(): Unit = {
    val ddNER = new DrugDosageNER(null, true)
    // we use model.train that was generated for internal
    // use during the training phase - this contains the
    // tags from the dict and regex NERs
    val accuracy = ddNER.evaluate(10, 30, 
      new File(datadir, "model.train"), 8, 256, 8.0)
    Console.println("Overall accuracy = " + accuracy)
    Assert.assertTrue(accuracy > 0.5)
  }
}

My third attempt at improvement is based on the realization that the raw annotation bigram frequencies of the "gold set" generated by merging the annotations of the DictNER and RegexNER could be indicative of the transition probabilities of the (new) state diagram. In my previous FSM implementation, if the FSM could transition to multiple states, I would just arbitrarily choose the first one - now I can use the transition with the highest probability. Here are the transition frequencies and the associated probabilities (computed manually). The JUnit class that exercises the resulting Probabilistic FSM, shown later, also contains the code to compute these frequencies.

Source   Target   Frequency   P(target|source)
------   ------   ---------   ----------------
DRUG     DRUG        34            0.25
DRUG     FREQ         1            0.01
DRUG     NUM         91            0.67
DRUG     O            6            0.04
DRUG     ROUTE        3            0.02
FREQ     FREQ         3            0.09
FREQ     NUM          4            0.12
FREQ     O           27            0.79
NUM      FREQ        14            0.12
NUM      NUM          2            0.02
NUM      O           14            0.12
NUM      ROUTE        5            0.04
NUM      UNIT        85            0.71
O        DRUG         4            0.06
O        FREQ         4            0.06
O        NUM         20            0.32
O        O           29            0.63
O        ROUTE        4            0.06
O        UNIT         1            0.02
ROUTE    FREQ        59            0.89
ROUTE    O            3            0.05
ROUTE    ROUTE        4            0.06
UNIT     FREQ        15            0.16
UNIT     NUM         15            0.16
UNIT     O            4            0.04
UNIT     ROUTE       51            0.54
UNIT     UNIT         9            0.10
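The P(target|source) column can be derived mechanically from the Frequency column: total the outgoing counts for each source state, then divide each count by its source's total. A minimal sketch (TransitionProbs is a hypothetical helper, not part of the original code):

```scala
// Hypothetical sketch: convert raw bigram counts (source, target) -> n
// into conditional transition probabilities P(target | source).
object TransitionProbs {
  def probs(freqs: Map[(String, String), Int]): Map[(String, String), Double] = {
    // total outgoing frequency for each source state
    val totals = freqs.toList
      .groupBy { case ((src, _), _) => src }
      .map { case (src, entries) => (src, entries.map(_._2).sum) }
    // divide each count by its source's total
    freqs.map { case ((src, tgt), n) => ((src, tgt), n.toDouble / totals(src)) }
  }
}
```

Applied to the DRUG row of the table (34 + 1 + 91 + 6 + 3 = 135 outgoing transitions), this reproduces the 91/135 = 0.67 figure for DRUG to NUM.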

For this, I extended the FSM implementation from the previous post into a probabilistic version, PFSM. I also made some changes to FSM itself, so, to avoid confusion in case you are following along, I show the FSM class below as well:

// Source: src/main/scala/com/mycompany/scalcium/utils/FSM.scala
package com.mycompany.scalcium.utils

import scala.collection.mutable.ArrayBuffer

trait Guard[T] {
  def accept(token: T): Boolean
}

trait Action[T] {
  def perform(currState: String, token: T): Unit
}

class FSM[T](val action: Action[T],
    val debug: Boolean = false) {

  val states = ArrayBuffer[String]()
  val transitions = scala.collection.mutable.Map[
    String,ArrayBuffer[(String,Guard[T])]]()
  var currState: String = "START"
  
  def addState(state: String): Unit = {
    states += state
  }
  
  def addTransition(from: String, to: String,
      guard: Guard[T]): Unit = {
    val translist = transitions.getOrElse(from, ArrayBuffer()) 
    translist += ((to, guard))
    transitions(from) = translist
  } 
  
  def transition(token: T): Unit = {
    val tgas = transitions.getOrElse(currState, List())
      .filter(tga => tga._2.accept(token))
    if (tgas.size == 1) {
      // no ambiguity, just take the path specified
      val tga = tgas.head
      if (debug)
        Console.println("%s -> %s".format(currState, tga._1))
      currState = tga._1
      action.perform(currState, token)
    } else {
      if (tgas.isEmpty) 
        action.perform(currState, token) 
      else {
        currState = tgas.head._1
        action.perform(currState, token)
      }
    }
  }
  
  def run(tokens: List[T]): Unit = tokens.foreach(transition(_))
}

class PFSM[T](action: Action[T], debug: Boolean = false) 
    extends FSM[T](action, debug) {

  val tprobs = scala.collection.mutable.Map[
    ((String,String)),Double]()

  def addTransition(from: String, to: String,
      tprob: Double, guard: Guard[T]): Unit = {
    super.addTransition(from, to, guard)
    tprobs((from, to)) = tprob
  }

  override def transition(token: T): Unit = {
    val tgas = transitions.getOrElse(currState, List())
      .filter(tga => tga._2.accept(token))
    if (tgas.size == 1) {
      // no ambiguity, just take the path specified
      val tga = tgas.head
      if (debug)
        Console.println("%s -> %s".format(currState, tga._1))
      currState = tga._1
      action.perform(currState, token)
    } else {
      // choose the most probable transition based
      // on tprobs. Break ties by choosing head as before
      if (tgas.isEmpty) 
        action.perform(currState, token) 
      else {
        val bestTga = tgas
          .map(tga => (tga, tprobs((currState, tga._1))))
          .sortWith((a, b) => a._2 > b._2)
          .head._1
        currState = bestTga._1
        action.perform(currState, token)
      }
    }
  }
}

The PFSM is built by the DrugDosagePFSM client using the transition probabilities in the table above. The parse() method does the annotation.

// Source: src/main/scala/com/mycompany/scalcium/drugdosage/DrugDosagePFSM.scala
package com.mycompany.scalcium.drugdosage

import com.mycompany.scalcium.utils.PFSM
import java.io.File

class DrugDosagePFSM(val drugFile: File, 
    val freqFile: File, val routeFile: File,
    val unitsFile: File, val numPatternsFile: File,
    val debug: Boolean = false) {

  def parse(s: String): List[(String,String)] = {
    val collector = new CollectAction(debug)
    val fsm = buildFSM(collector, debug)
    fsm.run(s.toLowerCase()
        .replaceAll("[,;]", " ")
        .replaceAll("\\s+", " ")
        .split(" ")
        .toList)
    collector.stab.toList
  }
  
  def buildFSM(collector: CollectAction, 
      debug: Boolean): PFSM[String] = {
    val pfsm = new PFSM[String](collector, debug)
    
    pfsm.addState("START")
    pfsm.addState("DRUG")
    pfsm.addState("FREQ")
    pfsm.addState("NUM")
    pfsm.addState("ROUTE")
    pfsm.addState("UNIT")
    pfsm.addState("END")
    
    val noGuard = new BoolGuard(false)
    val drugGuard = new DictGuard(drugFile)
    val freqGuard = new DictGuard(freqFile)
    val routeGuard = new DictGuard(routeFile)
    val unitsGuard = new DictGuard(unitsFile)
    val numGuard = new RegexGuard(numPatternsFile)
    
    pfsm.addTransition("START", "DRUG", 1.0, drugGuard)
    
    pfsm.addTransition("DRUG", "FREQ", 0.01, freqGuard)
    pfsm.addTransition("DRUG", "NUM", 0.67, numGuard)
    pfsm.addTransition("DRUG", "ROUTE", 0.02, routeGuard)

    pfsm.addTransition("FREQ", "NUM", 0.12, numGuard)
    
    pfsm.addTransition("NUM", "FREQ", 0.12, freqGuard)
    pfsm.addTransition("NUM", "ROUTE", 0.04, routeGuard)
    pfsm.addTransition("NUM", "UNIT", 0.71, unitsGuard)
    
    pfsm.addTransition("ROUTE", "FREQ", 0.89, freqGuard)

    pfsm.addTransition("UNIT", "FREQ", 0.16, freqGuard)
    pfsm.addTransition("UNIT", "NUM", 0.16, numGuard)
    pfsm.addTransition("UNIT", "ROUTE", 0.54, routeGuard)

    pfsm.addTransition("FREQ", "END", 0.25, noGuard)
    pfsm.addTransition("NUM", "END", 0.25, noGuard)
    pfsm.addTransition("UNIT", "END", 0.25, noGuard)
    pfsm.addTransition("ROUTE", "END", 0.25, noGuard)
    
    pfsm
  }
}

The JUnit class below contains code to compute the raw transition frequencies from the "gold set" annotations, to process all the phrases in the input.txt file, and to evaluate the DrugDosagePFSM against this "gold set".

// Source: src/test/scala/com/mycompany/scalcium/drugdosage/DrugDosagePFSMTest.scala
package com.mycompany.scalcium.drugdosage

import java.io.File
import java.io.FileWriter
import java.io.PrintWriter

import scala.Array.canBuildFrom
import scala.collection.mutable.ArrayBuffer
import scala.io.Source

import org.junit.Ignore
import org.junit.Test

import com.mycompany.scalcium.utils.DictNER
import com.mycompany.scalcium.utils.NGram
import com.mycompany.scalcium.utils.RegexNER

class DrugDosagePFSMTest {

  val datadir = "/path/to/data/dir"
  val tmpdir = "/tmp"
    
  @Test
  def testComputeTransitionFrequencies(): Unit = {
    val dictNER = new DictNER(Map(
      ("DRUG", new File(datadir, "drugs.dict")),
      ("FREQ", new File(datadir, "frequencies.dict")),
      ("ROUTE", new File(datadir, "routes.dict")),
      ("UNIT", new File(datadir, "units.dict"))))
    val regexNER = new RegexNER(Map(
      ("NUM", new File(datadir, "num_patterns.dict"))    
    ))
    val transitions = ArrayBuffer[(String,String)]()
    Source.fromFile(new File(datadir, "input.txt"))
      .getLines()
      .foreach(line => {
        val dictTags = dictNER.tag(line)
        val regexTags = regexNER.tag(line)
        val mergedTags = dictNER.merge(List(dictTags, regexTags))
        val tagBigrams = NGram.bigrams(mergedTags.map(_._2))
          .map(bigram => transitions += 
            ((bigram(0).asInstanceOf[String], 
             bigram(1).asInstanceOf[String])))
    })
    val transitionFreqs = transitions
      .groupBy(pair => pair._1 + " -> " + pair._2)
      .map(pair => (pair._1, pair._2.size))
      .toList
      .sortBy(pair => pair._1)
    Console.println(transitionFreqs.mkString("\n"))
  }  

  @Test
  def testParse(): Unit = {
    val ddPFSM = new DrugDosagePFSM(
        new File(datadir, "drugs.dict"),
        new File(datadir, "frequencies.dict"),
        new File(datadir, "routes.dict"),
        new File(datadir, "units.dict"),
        new File(datadir, "num_patterns.dict"),
        false)
    val writer = new PrintWriter(new FileWriter(
      new File(datadir, "pfsm_output.txt")))
    Source.fromFile(new File(datadir, "input.txt"))
      .getLines()
      .foreach(line => {
         val stab = ddPFSM.parse(line)
         writer.println(line)
         writer.println(stab.map(st => st._2 + "/" + st._1)
           .mkString(" "))
         writer.println()
    })
    writer.flush()
    writer.close()
  }

  @Test
  def testEvaluateAccuracy(): Unit = {
    val accuracies = ArrayBuffer[Double]()
    (0 until 10).foreach(cv => {
      // get random list of rows that will be our test case
      val testrows = scala.collection.mutable.Set[Int]()
      val random = scala.util.Random
      do {
        testrows += (random.nextDouble * 100).toInt
      } while (testrows.size < 30)
      // test random 30% data of input. There
      // is no training involved here, so we just test
      // our parser against the "gold" set from model.train
      // generated by rule-based dict and regex NERs.
      val inputfile = new File(datadir, "model.train")
      var curr = 0
      var results = ArrayBuffer[Int]()
      Source.fromFile(inputfile)
        .getLines()
        .foreach(line => {
        if (testrows.contains(curr)) {
          val words = line.split(" ").map(wordTag => 
            wordTag.substring(0, wordTag.lastIndexOf('/')))
            .mkString(" ")
          val rtags = line.split(" ").map(wordTag => 
            wordTag.substring(wordTag.lastIndexOf('/') + 1))
            .toList
          val ddPFSM = new DrugDosagePFSM(
            new File(datadir, "drugs.dict"),
            new File(datadir, "frequencies.dict"),
            new File(datadir, "routes.dict"),
            new File(datadir, "units.dict"),
            new File(datadir, "num_patterns.dict"),
            false)
          val ptags = ddPFSM.parse(words)
            .map(wordTag => wordTag._2)
          val result = (0 until List(rtags.size, ptags.size).min)
            .map(i => if (rtags(i).equals(ptags(i))) 1 else 0)
          results ++= result
        }
        curr += 1
      })
      val accuracy = results.sum.toDouble / results.size
      Console.println("CV #%d: accuracy=%f".format(cv, accuracy))
      accuracies += accuracy
    })
    Console.println("Overall accuracy=%f".format(
      accuracies.sum / accuracies.size))
  }
}

Annotations produced by the DrugDosagePFSM class on the first 5 lines of the input are shown below.

hydrocortizone cream, apply to rash bid
hydrocortizone/DRUG cream/DRUG apply/ROUTE to/ROUTE rash/ROUTE bid/FREQ

albuterol inhaler one to two puffs bid
albuterol/DRUG inhaler/DRUG one/NUM to/NUM two/NUM puffs/UNIT bid/FREQ

Enteric coated aspirin 81 mg tablets one qd
enteric/DRUG coated/DRUG aspirin/DRUG 81/NUM mg/UNIT tablets/UNIT one/NUM qd/FREQ

Vitamin B12 1000 mcg IM
vitamin/DRUG b12/DRUG 1000/NUM mcg/UNIT im/ROUTE

atenolol 50 mg tabs one qd, #100, one year
atenolol/DRUG 50/NUM mg/UNIT tabs/UNIT one/NUM qd/FREQ #100/FREQ one/NUM year/NUM

Evaluating the DrugDosagePFSM with the same 10-fold procedure against random 70/30 splits of the "gold set" (no training is involved here, since the transition probabilities are fixed) yields a slightly higher overall accuracy of 83.49%.

That's all I have for today. This post has been a bit code-heavy, but hopefully it made sense and you had fun reading it. Next week I promise to talk about something else :-).

13 comments (moderated to prevent spam):

Unknown said...

so good, thank you

Sujit Pal said...

Thanks for the kind words, Sibel.

JohnT said...

I agree, I continue to be very impressed by the quality of your thought, code, and writing. You are definitely high on my list of favorite bloggers. Thanks for the amazing work!

Sujit Pal said...

Thanks, JohnT, glad you enjoyed it.

Anonymous said...

Thanks for your sharing , You are my favorite blogger also and I keep refresh you blog every week.
Could you share the NGram class also?

Anonymous said...

There is a little bugs.
val ptags = ddPFSM.parse(words)
.map(wordTag => wordTag._2) should update to
val ptags = ddPFSM.parse(words)
.map(wordTag => wordTag._1)

otherwise the output of accuracy is 0

Sujit Pal said...

Thanks for the kind words, Xiangtao. Regarding the bug, I looked again at the code you pointed out and I don't think it's a bug. The code compares the tags in wordTag, which is really a pair (word, tag), hence wordTag._2.

I built the NGram object earlier as part of some other post, so I didn't include it - sorry about the oversight. Here it is:

---
package com.mycompany.scalcium.utils

import scala.collection.mutable.ArrayBuffer

object NGram {

  def bigrams(tokens: List[Any]): List[List[Any]] = ngrams(tokens, 2)

  def trigrams(tokens: List[Any]): List[List[Any]] = ngrams(tokens, 3)

  def ngrams(tokens: List[Any], n: Int): List[List[Any]] = {
    val nwords = tokens.size
    val ngrams = new ArrayBuffer[List[Any]]()
    for (i <- 0 to (nwords - n)) {
      ngrams += tokens.slice(i, i + n)
    }
    ngrams.toList
  }
}

Xiangtao Wang said...

Thanks Sujit, I run your code , then get accuracy 0 and the wordTag is a pair (tag,word) .
---------------
println(ddPFSM.parse(words))
output: List((DRUG,cimetidine), (NUM,20), (UNIT,mg), (ROUTE,po), (FREQ,qd))

I got accuracy 83% after I updated to wordTag._1

Sujit Pal said...

Hi Xiangtao, I reran the code locally (specifically DrugDosagePFSMTest#testEvaluateAccuracy, after generating model.train from DrugDosageNERTest). Here is my output of ddPFSM.parse(words):

List((fosamax,DRUG), (70,NUM), (mg,UNIT), (po,ROUTE), (qwk,FREQ))

And I get around 80% accuracy as expected since my wordTag is (word, tag) and yours appears to be (tag, word) from your output, and the accuracy is measured by tag equivalence. I am guessing that somewhere your version of the code is setting it differently, so I think we are all good.

Xiangtao Wang said...

Hi Sujit, Based on the diagram in your previous post, the drug has 3 path. but only exist one path after filter token by guard.
. so I think there is no such case which has multiple path can be used probabilistic bigram.
so I try to debug the code . it never go inside below "else"
------------------
else {
val bestTga = tgas
.map(tga => (tga, tprobs((currState, tga._1))))
.sortWith((a, b) => a._2 > b._2)
.head._1
currState = bestTga._1
action.perform(currState, token)
}

Please try to print any thing inside the "else" , Thanks, kindly tell me if I misunderstand.

Sujit Pal said...

The diagram is not directly applicable to the PFSM since I am looking at slightly different entities here, but even so, the PFSM has 3 possibilities following DRUG, they are FREQ, NUM and ROUTE (lines 44-46 in DrugDosagePFSM). You are right, it never goes into the else block for the test cases, but I enabled debug and I confirmed that DRUG goes to NUM 270/296 times, to FREQ 16/296 times and ROUTE 10/296 times. It just so happens that there is no ambiguity in the entity detected. I think this could be an artifact of the data since the dictionaries track the test data very closely. It is possible that in some cases, a token can be matched to multiple entities, and in that case, the "most probable one" based on the PFSM definition would win.

Xiangtao Wang said...

got it, I learn lots of ideas from your blog. Thanks again.

Sujit Pal said...

You're welcome, this stuff is me learning as well :-).