In my previous post, I described using a Finite State Machine (FSM) implementation to parse Drug Dosage phrases into their constituent parts. While the results weren't too bad, the thing that struck me as being a bit sketchy was that I had to build up the state diagram by manually eyeballing some phrases and their parses (from Erin Rhode's Perl program). In this post, I attempt to improve upon that solution.
My first attempt uses LingPipe's chunkers to build dictionary and regex driven Named Entity Recognizers (NERs). A Dictionary Chunker is populated with terms from the drugs.dict, frequencies.dict, routes.dict and units.dict files - the chunker tags words from these files as DRUG, FREQ, ROUTE and UNIT respectively. Similarly, the patterns in num_patterns.dict drive a Regular Expression Chunker - the matching patterns are tagged as NUM. Notice that we lose the distinction of DOSAGE, REFILL and QTY that we had previously - but these can be recreated from the tags if needed by looking for patterns such as DOSAGE ::= NUM+ UNIT (a sketch of this kind of recovery appears after the output below). The annotations from the Dictionary NER and the Regex NER are stacked, giving us the results shown below - words annotated as O are ones that could not be tagged by either NER.
hydrocortizone cream, apply to rash bid
hydrocortizone/DRUG cream/DRUG ,/O apply/ROUTE to/O rash/O bid/FREQ

albuterol inhaler one to two puffs bid
albuterol/DRUG inhaler/DRUG one/NUM to/O two/NUM puffs/UNIT bid/FREQ

Enteric coated aspirin 81 mg tablets one qd
Enteric/DRUG coated/DRUG aspirin/DRUG 81/NUM mg/UNIT tablets/UNIT one/NUM qd/FREQ

Vitamin B12 1000 mcg IM
Vitamin/DRUG B12/DRUG 1000/NUM mcg/UNIT IM/ROUTE

atenolol 50 mg tabs one qd, #100, one year
atenolol/DRUG 50/NUM mg/UNIT tabs/UNIT one/NUM qd/FREQ ,/O #100,/NUM one/O year/NUM
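As an aside, here is a minimal sketch of how a composite annotation like DOSAGE could be recovered from the flat tags. The TagPatterns object and its collapseDosage method are hypothetical helpers I am using for illustration; they are not part of the code shown later.

// Hypothetical sketch: recover DOSAGE spans from flat NER tags by
// relabeling each maximal run of NUM tags that is immediately
// followed by a UNIT tag, i.e. DOSAGE ::= NUM+ UNIT.
object TagPatterns {
  def collapseDosage(tagged: List[(String,String)]): List[(String,String)] =
    tagged match {
      case Nil => Nil
      case _ =>
        // take a maximal run of NUM tags at the front
        val (nums, rest) = tagged.span(wt => "NUM".equals(wt._2))
        rest match {
          // NUM+ UNIT => relabel the whole span as DOSAGE
          case (word, "UNIT") :: tail if nums.nonEmpty =>
            (nums :+ ((word, "UNIT"))).map(wt => (wt._1, "DOSAGE")) :::
              collapseDosage(tail)
          case head :: tail => nums ::: head :: collapseDosage(tail)
          case Nil => nums
        }
    }
}

// e.g. collapseDosage(List(("81","NUM"), ("mg","UNIT"), ("tablets","UNIT")))
// yields List((81,DOSAGE), (mg,DOSAGE), (tablets,UNIT))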
The code to build the NERs using the LingPipe API is shown below. We follow a strategy similar to the one for FSMs - building generic NER implementations and then using them from application code.
// Source: src/main/scala/com/mycompany/scalcium/utils/NER.scala
package com.mycompany.scalcium.utils

import java.io.File
import java.util.regex.Pattern

import scala.collection.JavaConversions.asScalaIterator
import scala.collection.mutable.ArrayBuffer
import scala.io.Source

import com.aliasi.chunk.CharLmHmmChunker
import com.aliasi.chunk.Chunk
import com.aliasi.chunk.HmmChunker
import com.aliasi.chunk.RegExChunker
import com.aliasi.dict.DictionaryEntry
import com.aliasi.dict.ExactDictionaryChunker
import com.aliasi.dict.MapDictionary
import com.aliasi.hmm.HmmCharLmEstimator
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory
import com.aliasi.util.AbstractExternalizable

trait NER {

  def chunk(s: String): List[Chunk]

  def tag(s: String): List[(String,String)] = {
    val chunks = chunk(s)
    var curr = 0
    val tags = new ArrayBuffer[(String,String)]
    chunks.foreach(chunk => {
      val start = chunk.start
      val end = chunk.end
      val ctype = chunk.`type`
      if (curr < start) {
        // words between chunks are tagged "O"
        val prevtext = s.substring(curr, start)
        prevtext.split(" ")
          .foreach(word => tags += ((word, "O")))
      }
      val chunktext = s.substring(start, end)
      chunktext.split(" ")
        .foreach(word => tags += ((word, ctype)))
      curr = end
    })
    if (curr < s.length()) {
      val lasttext = s.substring(curr, s.length())
      lasttext.split(" ")
        .foreach(word => tags += ((word, "O")))
    }
    tags.filter(wt => wt._1.length() > 0)
      .toList
  }

  def merge(taggedWords: List[List[(String,String)]]):
      List[(String,String)] = {
    val lengths = taggedWords.map(tag => tag.size)
    val maxlen = lengths.max
    // index of the tagging with the most tokens
    // (sort lengths descending, take the first)
    val longest = lengths.zipWithIndex
      .sortBy(li => -li._1)
      .head._2
    val words = taggedWords(longest).map(
      taggedWord => taggedWord._1)
    val mergedTags = (0 until maxlen).map(i => {
      val tags = taggedWords.map(taggedWord =>
        if (taggedWord.size > i) taggedWord(i)._2 else "O")
        .filter(tag => !"O".equals(tag))
      if (tags.isEmpty) "O"
      else tags.head // TODO: revisit, use Bayes Net for disambig
    })
    .toList
    words.zip(mergedTags)
  }
}

/**
 * Dictionary based NER. Uses a set of files, each containing
 * terms that belong to a specified class.
 */
class DictNER(val data: Map[String,File]) extends NER {

  val dict = new MapDictionary[String]()
  data.foreach(entityData => {
    val entityName = entityData._1
    Source.fromFile(entityData._2).getLines()
      .foreach(line => dict.addEntry(
        new DictionaryEntry[String](line, entityName, 1.0D)))
  })
  val chunker = new ExactDictionaryChunker(dict,
    IndoEuropeanTokenizerFactory.INSTANCE, false, false)

  override def chunk(s: String): List[Chunk] = {
    val chunking = chunker.chunk(s)
    chunking.chunkSet().iterator().toList
  }
}

/**
 * Regex based NER. Uses a set of files, each containing
 * regular expressions representing a specified class.
 */
class RegexNER(val data: Map[String,File]) extends NER {

  val chunkers = data.map(entityData => {
    val entityName = entityData._1
    Source.fromFile(entityData._2).getLines()
      .map(line => new RegExChunker(
        Pattern.compile(line), entityName, 1.0D))
  })
  .flatten
  .toList

  override def chunk(s: String): List[Chunk] = {
    chunkers.map(chunker => {
      val chunking = chunker.chunk(s)
      chunking.chunkSet().iterator()
    })
    .flatten
    .toList
    .sortBy(chunk => chunk.start)
  }
}

/**
 * Model based NER. A single multiclass HMM Language
 * Model is constructed out of the training data, and
 * used to predict classes for new words.
 */
class ModelNER(val modelFile: File) extends NER {

  val chunker =
    if (modelFile != null)
      AbstractExternalizable.readObject(modelFile)
        .asInstanceOf[HmmChunker]
    else null

  override def chunk(s: String): List[Chunk] = {
    if (chunker == null) List.empty
    else chunker.chunk(s)
      .chunkSet()
      .iterator()
      .toList
  }

  def train(taggedFile: File, modelFile: File,
      ngramSize: Int, numChars: Int,
      lambda: Double): Unit = {
    val factory = IndoEuropeanTokenizerFactory.INSTANCE
    val estimator = new HmmCharLmEstimator(
      ngramSize, numChars, lambda)
    val chunker = new CharLmHmmChunker(factory, estimator)
    Source.fromFile(taggedFile)
      .getLines()
      .foreach(taggedWords => {
        taggedWords.split(" ")
          .foreach(taggedWord => {
            val slashAt = taggedWord.lastIndexOf('/')
            val word = taggedWord.substring(0, slashAt)
            val tag = taggedWord.substring(slashAt + 1)
            chunker.trainDictionary(word, tag)
          })
      })
    AbstractExternalizable.compileTo(chunker, modelFile)
  }
}
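To make the intended usage concrete, here is a minimal sketch (not from the application code that follows) of stacking the two rule-based NERs over a single phrase. The file names are the dictionaries described earlier; the data directory path is a placeholder.

// Sketch: stack a dictionary NER and a regex NER over one phrase,
// then merge their annotations (paths are placeholders).
import java.io.File
import com.mycompany.scalcium.utils.{DictNER, RegexNER}

object StackedNERExample extends App {
  val datadir = new File("/path/to/data/dir")
  val dictNER = new DictNER(Map(
    ("DRUG", new File(datadir, "drugs.dict")),
    ("FREQ", new File(datadir, "frequencies.dict")),
    ("ROUTE", new File(datadir, "routes.dict")),
    ("UNIT", new File(datadir, "units.dict"))))
  val regexNER = new RegexNER(Map(
    ("NUM", new File(datadir, "num_patterns.dict"))))
  val phrase = "atenolol 50 mg tabs one qd"
  // stack the two annotation sets and merge them
  val merged = dictNER.merge(List(
    dictNER.tag(phrase), regexNER.tag(phrase)))
  // e.g. atenolol/DRUG 50/NUM mg/UNIT tabs/UNIT one/NUM qd/FREQ
  println(merged.map(wt => wt._1 + "/" + wt._2).mkString(" "))
}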
Having built the client that uses the DictNER and RegexNER classes to parse the Drug Dosage phrases, I made a second attempt at improvement: generalizing the approach to handle unseen words. For this, I wrapped LingPipe's HMM based Chunker in a ModelNER. The ModelNER is trained with data generated by merging the annotations from DictNER and RegexNER - the code for that can be seen in the train() method of DrugDosageNER.scala, shown below:
// Source: src/main/scala/com/mycompany/scalcium/drugdosage/DrugDosageNER.scala
package com.mycompany.scalcium.drugdosage

import java.io.File
import java.io.FileWriter
import java.io.PrintWriter

import scala.Array.canBuildFrom
import scala.collection.TraversableOnce.flattenTraversableOnce
import scala.collection.mutable.ArrayBuffer
import scala.io.Source

import com.mycompany.scalcium.utils.DictNER
import com.mycompany.scalcium.utils.ModelNER
import com.mycompany.scalcium.utils.RegexNER

class DrugDosageNER(val modelFile: File,
    val debug: Boolean = false) {

  val tmpdir = new File("/tmp")
  val modelNER = new ModelNER(modelFile)

  def train(drugFile: File, freqFile: File, routeFile: File,
      unitsFile: File, numPatternsFile: File,
      inputFile: File, modelFile: File,
      ngramSize: Int, numChars: Int,
      lambda: Double): Unit = {
    // use the dict NER and regex NER to build training
    // set out of rules.
    val dictNER = new DictNER(Map(
      ("DRUG", drugFile),
      ("FREQ", freqFile),
      ("ROUTE", routeFile),
      ("UNIT", unitsFile)))
    val regexNER = new RegexNER(Map(
      ("NUM", numPatternsFile)))
    val trainWriter = new PrintWriter(
      new FileWriter(new File(tmpdir, "model.train")))
    Source.fromFile(inputFile)
      .getLines()
      .foreach(line => {
        val dicttags = dictNER.tag(line)
        val regextags = regexNER.tag(line)
        val mergedtags = dictNER.merge(List(dicttags, regextags))
        trainWriter.println(mergedtags.map(wordTag =>
          wordTag._1 + "/" + wordTag._2).mkString(" "))
      })
    trainWriter.flush()
    trainWriter.close()
    // use the bootstrapped training set to train the
    // model NER
    modelNER.train(new File(tmpdir, "model.train"),
      modelFile, ngramSize, numChars, lambda)
  }

  def evaluate(nfolds: Int, ntest: Int,
      datafile: File, ngramSize: Int, numChars: Int,
      lambda: Double): Double = {
    val accuracies = ArrayBuffer[Double]()
    (0 until nfolds).foreach(cv => {
      // get random list of rows that will be our test case
      // (assumes a dataset of about 100 rows)
      val testrows = scala.collection.mutable.Set[Int]()
      val random = scala.util.Random
      do {
        testrows += (random.nextDouble * 100).toInt
      } while (testrows.size < ntest)
      // partition input dataset into train and test;
      // we use the model.train file from the previous test
      val evaltrain = new PrintWriter(
        new FileWriter(new File(tmpdir, "eval.train")))
      val evaltest = new PrintWriter(
        new FileWriter(new File(tmpdir, "eval.test")))
      var curr = 0
      Source.fromFile(datafile)
        .getLines()
        .foreach(line => {
          if (testrows.contains(curr)) evaltest.println(line)
          else evaltrain.println(line)
          curr += 1
        })
      evaltrain.flush()
      evaltrain.close()
      evaltest.flush()
      evaltest.close()
      // now we use evaltrain to train the model
      val modelNER = new ModelNER(null)
      modelNER.train(new File(tmpdir, "eval.train"),
        new File(tmpdir, "eval.bin"),
        ngramSize, numChars, lambda)
      // now test against evaltest with the model
      val trainedModelNER = new ModelNER(
        new File(tmpdir, "eval.bin"))
      val results = Source.fromFile(
        new File(tmpdir, "eval.test"))
        .getLines()
        .map(line => {
          val words = line.split(" ").map(wordTag =>
            wordTag.substring(0, wordTag.lastIndexOf('/')))
            .mkString(" ")
          val rtags = line.split(" ").map(wordTag =>
            wordTag.substring(wordTag.lastIndexOf('/') + 1))
            .toList
          val ptags = trainedModelNER.tag(words)
            .map(wordTag => wordTag._2)
          (0 until List(rtags.size, ptags.size).min)
            .map(i => if (rtags(i).equals(ptags(i))) 1 else 0)
        })
        .flatten
        .toList
      val accuracy = results.sum.toDouble / results.size
      if (debug)
        Console.println("CV-# %d: accuracy = %f"
          .format(cv, accuracy))
      accuracies += accuracy
    })
    accuracies.sum / accuracies.size
  }

  def parse(s: String): List[(String,String)] = modelNER.tag(s)
}
Treating the training data generated by the DictNER and RegexNER as the "gold set", 10 rounds of evaluation on random 70/30 train/test splits gave an average accuracy of 80.15% for the ModelNER based DrugDosageNER client. Here is how it annotated the first 5 phrases in the input.
hydrocortizone cream, apply to rash bid
hydrocortizone/DRUG cream/DRUG ,/ROUTE apply/ROUTE to/ROUTE rash/DRUG bid/FREQ

albuterol inhaler one to two puffs bid
albuterol/DRUG inhaler/DRUG one/NUM to/NUM two/NUM puffs/UNIT bid/FREQ

Enteric coated aspirin 81 mg tablets one qd
Enteric/DRUG coated/DRUG aspirin/DRUG 81/NUM mg/UNIT tablets/UNIT one/NUM qd/FREQ

Vitamin B12 1000 mcg IM
Vitamin/DRUG B12/DRUG 1000/NUM mcg/UNIT IM/ROUTE

atenolol 50 mg tabs one qd, #100, one year
atenolol/DRUG 50/NUM mg/UNIT tabs/UNIT one/NUM qd/FREQ ,/NUM #100,/NUM one/NUM year/NUM
Code to call and evaluate the DrugDosageNER can be found in the JUnit class DrugDosageNERTest.scala shown below:
// Source: src/test/scala/com/mycompany/scalcium/drugdosage/DrugDosageNERTest.scala
package com.mycompany.scalcium.drugdosage

import java.io.File

import scala.io.Source

import org.junit.Assert
import org.junit.Ignore
import org.junit.Test

class DrugDosageNERTest {

  val datadir = "/path/to/data/dir"

  @Test
  def testTrainTest(): Unit = {
    val ddNER = new DrugDosageNER(null, true)
    val inputfile = new File(datadir, "input.txt")
    val modelfile = new File(datadir, "model.bin")
    ddNER.train(
      new File(datadir, "drugs.dict"),
      new File(datadir, "frequencies.dict"),
      new File(datadir, "routes.dict"),
      new File(datadir, "units.dict"),
      new File(datadir, "num_patterns.dict"),
      inputfile, modelfile, 8, 256, 8.0D)
    Assert.assertTrue(modelfile.exists())
    // now instantiate the NER with modelfile
    val trainedDDNER = new DrugDosageNER(modelfile, true)
    Source.fromFile(inputfile).getLines()
      .foreach(line => {
        val taggedline = trainedDDNER.parse(line)
          .map(taggedWord =>
            taggedWord._1 + "/" + taggedWord._2)
          .mkString(" ")
        Console.println(line)
        Console.println(taggedline)
        Console.println()
      })
  }

  @Test
  def testEvaluate(): Unit = {
    val ddNER = new DrugDosageNER(null, true)
    // we use model.train that was generated for internal
    // use during the training phase - this contains the
    // tags from the dict and regex NERs
    val accuracy = ddNER.evaluate(10, 30,
      new File(datadir, "model.train"), 8, 256, 8.0)
    Console.println("Overall accuracy = " + accuracy)
    Assert.assertTrue(accuracy > 0.5)
  }
}
My third attempt at improvement is based on the realization that the raw annotation bigram frequencies of the "gold set" generated by merging the annotations of DictNER and RegexNER could be indicative of the transition probabilities of the (new) state diagram. In my previous FSM implementation, if the FSM could transition to multiple states, I would just arbitrarily choose the first one - now I can use the transition with the highest probability. Here are the transition frequencies and associated probabilities, computed manually by normalizing each source state's outgoing frequencies (a sketch of this computation follows the table). The JUnit class shown later contains the code to compute the raw frequencies and to run the resulting Probabilistic FSM.
Source | Target | Frequency | P(target given source) |
---|---|---|---|
DRUG | DRUG | 34 | 0.25 |
DRUG | FREQ | 1 | 0.01 |
DRUG | NUM | 91 | 0.67 |
DRUG | O | 6 | 0.04 |
DRUG | ROUTE | 3 | 0.02 |
FREQ | FREQ | 3 | 0.09 |
FREQ | NUM | 4 | 0.12 |
FREQ | O | 27 | 0.79 |
NUM | FREQ | 14 | 0.12 |
NUM | NUM | 2 | 0.02 |
NUM | O | 14 | 0.12 |
NUM | ROUTE | 5 | 0.04 |
NUM | UNIT | 85 | 0.71 |
O | DRUG | 4 | 0.06 |
O | FREQ | 4 | 0.06 |
O | NUM | 20 | 0.32 |
O | O | 29 | 0.47 |
O | ROUTE | 4 | 0.06 |
O | UNIT | 1 | 0.02 |
ROUTE | FREQ | 59 | 0.89 |
ROUTE | O | 3 | 0.05 |
ROUTE | ROUTE | 4 | 0.06 |
UNIT | FREQ | 15 | 0.16 |
UNIT | NUM | 15 | 0.16 |
UNIT | O | 4 | 0.04 |
UNIT | ROUTE | 51 | 0.54 |
UNIT | UNIT | 9 | 0.10 |
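The normalization itself is mechanical: each probability is just the frequency divided by the total outgoing frequency of its source state. Here is a minimal sketch (not part of the original code) of that computation; the counts map is assumed to hold the frequencies from the table.

// Sketch: P(target | source) = freq(source, target) /
// total outgoing frequency of source.
def transitionProbs(counts: Map[(String,String),Int]):
    Map[(String,String),Double] = {
  // total outgoing frequency per source state
  val totals = counts.groupBy(_._1._1)
    .mapValues(_.values.sum)
  counts.map { case ((src, tgt), freq) =>
    ((src, tgt), freq.toDouble / totals(src))
  }
}

// e.g. transitionProbs(Map(("DRUG","NUM") -> 91, ("DRUG","DRUG") -> 34))
// gives DRUG -> NUM = 0.728 and DRUG -> DRUG = 0.272 over just these edges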
For this, I extended the FSM implementation from the previous post into a probabilistic version, PFSM. I also made some changes to FSM itself, so to avoid confusion in case you are following along, I show the FSM class below as well:
// Source: src/main/scala/com/mycompany/scalcium/utils/FSM.scala
package com.mycompany.scalcium.utils

import scala.collection.mutable.ArrayBuffer

trait Guard[T] {
  def accept(token: T): Boolean
}

trait Action[T] {
  def perform(currState: String, token: T): Unit
}

class FSM[T](val action: Action[T],
    val debug: Boolean = false) {

  val states = ArrayBuffer[String]()
  val transitions = scala.collection.mutable.Map[
    String,ArrayBuffer[(String,Guard[T])]]()
  var currState: String = "START"

  def addState(state: String): Unit = {
    states += state
  }

  def addTransition(from: String, to: String,
      guard: Guard[T]): Unit = {
    val translist = transitions.getOrElse(from, ArrayBuffer())
    translist += ((to, guard))
    transitions(from) = translist
  }

  def transition(token: T): Unit = {
    val tgas = transitions.getOrElse(currState, List())
      .filter(tga => tga._2.accept(token))
    if (tgas.size == 1) {
      // no ambiguity, just take the path specified
      val tga = tgas.head
      if (debug)
        Console.println("%s -> %s".format(currState, tga._1))
      currState = tga._1
      action.perform(currState, token)
    } else {
      if (tgas.isEmpty)
        // no transition accepts the token: stay in currState
        action.perform(currState, token)
      else {
        // ambiguous: arbitrarily take the first transition
        currState = tgas.head._1
        action.perform(currState, token)
      }
    }
  }

  def run(tokens: List[T]): Unit = tokens.foreach(transition(_))
}

class PFSM[T](action: Action[T], debug: Boolean = false)
    extends FSM[T](action, debug) {

  val tprobs = scala.collection.mutable.Map[
    (String,String),Double]()

  def addTransition(from: String, to: String,
      tprob: Double, guard: Guard[T]): Unit = {
    super.addTransition(from, to, guard)
    tprobs((from, to)) = tprob
  }

  override def transition(token: T): Unit = {
    val tgas = transitions.getOrElse(currState, List())
      .filter(tga => tga._2.accept(token))
    if (tgas.size == 1) {
      // no ambiguity, just take the path specified
      val tga = tgas.head
      if (debug)
        Console.println("%s -> %s".format(currState, tga._1))
      currState = tga._1
      action.perform(currState, token)
    } else {
      // choose the most probable transition based
      // on tprobs. Break ties by choosing head as before
      if (tgas.isEmpty)
        action.perform(currState, token)
      else {
        val bestTga = tgas
          .map(tga => (tga, tprobs((currState, tga._1))))
          .sortWith((a, b) => a._2 > b._2)
          .head._1
        currState = bestTga._1
        action.perform(currState, token)
      }
    }
  }
}
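As a toy illustration of the probabilistic disambiguation (the states, guard and action here are made up for the example), suppose two transitions out of START both accept a token; the PFSM follows the more probable edge:

// Toy example: both edges out of START accept "x",
// so the PFSM follows the higher-probability one.
val printAction = new Action[String] {
  def perform(currState: String, token: String): Unit =
    Console.println(token + "/" + currState)
}
val anyGuard = new Guard[String] {
  def accept(token: String): Boolean = true
}
val pfsm = new PFSM[String](printAction, true)
List("START", "A", "B").foreach(pfsm.addState(_))
pfsm.addTransition("START", "A", 0.3, anyGuard)
pfsm.addTransition("START", "B", 0.7, anyGuard)
pfsm.run(List("x"))  // ambiguous; prints x/B since 0.7 > 0.3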
The PFSM is built by the DrugDosagePFSM client using the transition probabilities from the table above. The parse() method does the annotation.
// Source: src/main/scala/com/mycompany/scalcium/drugdosage/DrugDosagePFSM.scala
package com.mycompany.scalcium.drugdosage

import java.io.File

import com.mycompany.scalcium.utils.PFSM

class DrugDosagePFSM(val drugFile: File,
    val freqFile: File, val routeFile: File,
    val unitsFile: File, val numPatternsFile: File,
    val debug: Boolean = false) {

  def parse(s: String): List[(String,String)] = {
    val collector = new CollectAction(debug)
    val fsm = buildFSM(collector, debug)
    fsm.run(s.toLowerCase()
      .replaceAll("[,;]", " ")
      .replaceAll("\\s+", " ")
      .split(" ")
      .toList)
    collector.stab.toList
  }

  def buildFSM(collector: CollectAction,
      debug: Boolean): PFSM[String] = {
    val pfsm = new PFSM[String](collector, debug)
    pfsm.addState("START")
    pfsm.addState("DRUG")
    pfsm.addState("FREQ")
    pfsm.addState("NUM")
    pfsm.addState("ROUTE")
    pfsm.addState("UNIT")
    pfsm.addState("END")
    val noGuard = new BoolGuard(false)
    val drugGuard = new DictGuard(drugFile)
    val freqGuard = new DictGuard(freqFile)
    val routeGuard = new DictGuard(routeFile)
    val unitsGuard = new DictGuard(unitsFile)
    val numGuard = new RegexGuard(numPatternsFile)
    pfsm.addTransition("START", "DRUG", 1.0, drugGuard)
    pfsm.addTransition("DRUG", "FREQ", 0.01, freqGuard)
    pfsm.addTransition("DRUG", "NUM", 0.67, numGuard)
    pfsm.addTransition("DRUG", "ROUTE", 0.02, routeGuard)
    pfsm.addTransition("FREQ", "NUM", 0.12, numGuard)
    pfsm.addTransition("NUM", "FREQ", 0.12, freqGuard)
    pfsm.addTransition("NUM", "ROUTE", 0.04, routeGuard)
    pfsm.addTransition("NUM", "UNIT", 0.71, unitsGuard)
    pfsm.addTransition("ROUTE", "FREQ", 0.89, freqGuard)
    pfsm.addTransition("UNIT", "FREQ", 0.16, freqGuard)
    pfsm.addTransition("UNIT", "NUM", 0.16, numGuard)
    pfsm.addTransition("UNIT", "ROUTE", 0.54, routeGuard)
    pfsm.addTransition("FREQ", "END", 0.25, noGuard)
    pfsm.addTransition("NUM", "END", 0.25, noGuard)
    pfsm.addTransition("UNIT", "END", 0.25, noGuard)
    pfsm.addTransition("ROUTE", "END", 0.25, noGuard)
    pfsm
  }
}
The JUnit class below contains code to compute the raw transition frequencies from the "gold set" annotations, to process all the phrases in the input.txt file, and to evaluate the DrugDosagePFSM against this "gold set".
// Source: src/test/scala/com/mycompany/scalcium/drugdosage/DrugDosagePFSMTest.scala
package com.mycompany.scalcium.drugdosage

import java.io.File
import java.io.FileWriter
import java.io.PrintWriter

import scala.Array.canBuildFrom
import scala.collection.mutable.ArrayBuffer
import scala.io.Source

import org.junit.Ignore
import org.junit.Test

import com.mycompany.scalcium.utils.DictNER
import com.mycompany.scalcium.utils.NGram
import com.mycompany.scalcium.utils.RegexNER

class DrugDosagePFSMTest {

  val datadir = "/path/to/data/dir"
  val tmpdir = "/tmp"

  @Test
  def testComputeTransitionFrequencies(): Unit = {
    val dictNER = new DictNER(Map(
      ("DRUG", new File(datadir, "drugs.dict")),
      ("FREQ", new File(datadir, "frequencies.dict")),
      ("ROUTE", new File(datadir, "routes.dict")),
      ("UNIT", new File(datadir, "units.dict"))))
    val regexNER = new RegexNER(Map(
      ("NUM", new File(datadir, "num_patterns.dict"))))
    val transitions = ArrayBuffer[(String,String)]()
    Source.fromFile(new File(datadir, "input.txt"))
      .getLines()
      .foreach(line => {
        val dictTags = dictNER.tag(line)
        val regexTags = regexNER.tag(line)
        val mergedTags = dictNER.merge(List(dictTags, regexTags))
        NGram.bigrams(mergedTags.map(_._2))
          .foreach(bigram => transitions +=
            ((bigram(0).asInstanceOf[String],
              bigram(1).asInstanceOf[String])))
      })
    val transitionFreqs = transitions
      .groupBy(pair => pair._1 + " -> " + pair._2)
      .map(pair => (pair._1, pair._2.size))
      .toList
      .sortBy(pair => pair._1)
    Console.println(transitionFreqs.mkString("\n"))
  }

  @Test
  def testParse(): Unit = {
    val ddPFSM = new DrugDosagePFSM(
      new File(datadir, "drugs.dict"),
      new File(datadir, "frequencies.dict"),
      new File(datadir, "routes.dict"),
      new File(datadir, "units.dict"),
      new File(datadir, "num_patterns.dict"),
      false)
    val writer = new PrintWriter(new FileWriter(
      new File(datadir, "pfsm_output.txt")))
    Source.fromFile(new File(datadir, "input.txt"))
      .getLines()
      .foreach(line => {
        val stab = ddPFSM.parse(line)
        writer.println(line)
        writer.println(stab.map(st => st._2 + "/" + st._1)
          .mkString(" "))
        writer.println()
      })
    writer.flush()
    writer.close()
  }

  @Test
  def testEvaluateAccuracy(): Unit = {
    val accuracies = ArrayBuffer[Double]()
    (0 until 10).foreach(cv => {
      // get random list of rows that will be our test case
      val testrows = scala.collection.mutable.Set[Int]()
      val random = scala.util.Random
      do {
        testrows += (random.nextDouble * 100).toInt
      } while (testrows.size < 30)
      // test against a random 30% of the input. There
      // is no training involved here, so we just test
      // our parser against the "gold" set from model.train
      // generated by the rule-based dict and regex NERs.
      val inputfile = new File(datadir, "model.train")
      var curr = 0
      var results = ArrayBuffer[Int]()
      Source.fromFile(inputfile)
        .getLines()
        .foreach(line => {
          if (testrows.contains(curr)) {
            val words = line.split(" ").map(wordTag =>
              wordTag.substring(0, wordTag.lastIndexOf('/')))
              .mkString(" ")
            val rtags = line.split(" ").map(wordTag =>
              wordTag.substring(wordTag.lastIndexOf('/') + 1))
              .toList
            val ddPFSM = new DrugDosagePFSM(
              new File(datadir, "drugs.dict"),
              new File(datadir, "frequencies.dict"),
              new File(datadir, "routes.dict"),
              new File(datadir, "units.dict"),
              new File(datadir, "num_patterns.dict"),
              false)
            val ptags = ddPFSM.parse(words)
              .map(wordTag => wordTag._2)
            val result = (0 until List(rtags.size, ptags.size).min)
              .map(i => if (rtags(i).equals(ptags(i))) 1 else 0)
            results ++= result
          }
          curr += 1
        })
      val accuracy = results.sum.toDouble / results.size
      Console.println("CV #%d: accuracy=%f".format(cv, accuracy))
      accuracies += accuracy
    })
    Console.println("Overall accuracy=%f".format(
      accuracies.sum / accuracies.size))
  }
}
Annotations produced by the DrugDosagePFSM class on the first 5 lines of the input are shown below.
hydrocortizone cream, apply to rash bid
hydrocortizone/DRUG cream/DRUG apply/ROUTE to/ROUTE rash/ROUTE bid/FREQ

albuterol inhaler one to two puffs bid
albuterol/DRUG inhaler/DRUG one/NUM to/NUM two/NUM puffs/UNIT bid/FREQ

Enteric coated aspirin 81 mg tablets one qd
enteric/DRUG coated/DRUG aspirin/DRUG 81/NUM mg/UNIT tablets/UNIT one/NUM qd/FREQ

Vitamin B12 1000 mcg IM
vitamin/DRUG b12/DRUG 1000/NUM mcg/UNIT im/ROUTE

atenolol 50 mg tabs one qd, #100, one year
atenolol/DRUG 50/NUM mg/UNIT tabs/UNIT one/NUM qd/FREQ #100/FREQ one/NUM year/NUM
Evaluating the DrugDosagePFSM over 10 rounds of random 70/30 train/test splits results in a slightly higher overall accuracy of 83.49%.
That's all I have for today. This post has been a bit code heavy, but hopefully it made sense and you had fun reading it. Next week I promise to talk about something else :-).
13 comments (moderated to prevent spam):
so good, thank you
Thanks for the kind words, Sibel.
I agree, I continue to be very impressed by the quality of your thought, code, and writing. You are definitely high on my list of favorite bloggers. Thanks for the amazing work!
Thanks, JohnT, glad you enjoyed it.
Thanks for your sharing. You are my favorite blogger also, and I keep refreshing your blog every week.
Could you share the NGram class also?
There is a little bug.
val ptags = ddPFSM.parse(words)
.map(wordTag => wordTag._2) should update to
val ptags = ddPFSM.parse(words)
.map(wordTag => wordTag._1)
otherwise the output of accuracy is 0
Thanks for the kind words Xiangtao. Regarding the bug, I looked again at the code you pointed out and I don't think it's a bug. The code compares the tags in wordTag, which is really a pair (word, tag), and hence wordTag._2.
I built the NGram object earlier as part of some other post, so I didn't include it, sorry about the oversight. Here it is:
---
package com.mycompany.scalcium.utils

import scala.collection.mutable.ArrayBuffer

object NGram {

  def bigrams(tokens: List[Any]): List[List[Any]] = ngrams(tokens, 2)

  def trigrams(tokens: List[Any]): List[List[Any]] = ngrams(tokens, 3)

  def ngrams(tokens: List[Any], n: Int): List[List[Any]] = {
    val nwords = tokens.size
    val ngrams = new ArrayBuffer[List[Any]]()
    for (i <- 0 to (nwords - n)) {
      ngrams += tokens.slice(i, i + n)
    }
    ngrams.toList
  }
}
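For example, NGram.bigrams(List("DRUG", "NUM", "UNIT")) returns List(List(DRUG, NUM), List(NUM, UNIT)) - a sliding window of size 2 over the tokens, which is what testComputeTransitionFrequencies uses to collect the tag transitions.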
Thanks Sujit, I ran your code, then got accuracy 0, and the wordTag is a pair (tag, word).
---------------
println(ddPFSM.parse(words))
output: List((DRUG,cimetidine), (NUM,20), (UNIT,mg), (ROUTE,po), (FREQ,qd))
I got 83% accuracy after I updated it to wordTag._1.
Hi Xiangtao, I reran the code locally (specifically DrugDosagePFSMTest#testEvaluateAccuracy, after generating model.train from DrugDosageNERTest). Here is my output of ddPFSM.parse(words):
List((fosamax,DRUG), (70,NUM), (mg,UNIT), (po,ROUTE), (qwk,FREQ))
And I get around 80% accuracy as expected since my wordTag is (word, tag) and yours appears to be (tag, word) from your output, and the accuracy is measured by tag equivalence. I am guessing that somewhere your version of the code is setting it differently, so I think we are all good.
Hi Sujit, based on the diagram in your previous post, DRUG has 3 paths, but only one path exists after the tokens are filtered by the guards. So I think there is no case with multiple paths where the probabilistic bigram can be used. I tried to debug the code; it never goes inside the "else" below:
------------------
else {
  val bestTga = tgas
    .map(tga => (tga, tprobs((currState, tga._1))))
    .sortWith((a, b) => a._2 > b._2)
    .head._1
  currState = bestTga._1
  action.perform(currState, token)
}
Please try to print anything inside the "else". Thanks, and kindly tell me if I misunderstand.
The diagram is not directly applicable to the PFSM since I am looking at slightly different entities here, but even so, the PFSM has 3 possibilities following DRUG: FREQ, NUM and ROUTE (the three addTransition calls out of DRUG in DrugDosagePFSM). You are right, it never goes into the else block for the test cases, but I enabled debug and confirmed that DRUG goes to NUM 270/296 times, to FREQ 16/296 times, and to ROUTE 10/296 times. It just so happens that there is no ambiguity in the entity detected. I think this could be an artifact of the data, since the dictionaries track the test data very closely. It is possible that in some cases a token can be matched to multiple entities, and in that case the "most probable one" based on the PFSM definition would win.
got it, I learn lots of ideas from your blog. Thanks again.
You're welcome, this stuff is me learning as well :-).