I first learned about a (baseline) approach to detecting and dealing with negation in text in the Coursera NLP class taught by Stanford profs Jurafsky and Manning. Essentially it involves ignoring everything in a sentence after certain "negation" words are seen. For example, a naive annotator simply looking for word matches could legitimately conclude that a patient is a smoker based on a doctor's note that says "Patient does not smoke". Using the simple negation rule would ignore everything after the negation keyword "not".
While this is an improvement over not doing it at all, natural language has nuances that are not captured by this simple rule. The Negex algorithm is a bit more complicated, but does a better job of indicating if a clinical condition should be negated in text. It uses around 281 phrases that indicate negation, further classified as CONJ (joins two phrases together), PSEU (Pseudo trigger terms), PREN (Pre-negative trigger terms), POST (Post negative trigger terms), PREP and POSP (Pre and post categories similar to PREN and POST but for a less strict matching mode). The algorithm is described by its authors here.
I started looking at it hoping that I could use it as a library, but the code in the repository looks more like examples that clients could use as inspiration for their own implementations. In fact, Hitex uses a GATE CREOLE resource that wraps the Negex algorithm. Apache cTakes, the other clinical pipeline I am studying, uses a more complex rule based approach using FSMs.
One of the nice things about the Negex algorithm is that it can be generalized (using different sets of trigger terms) to predict whether a condition refers to a historical or current event, a hypothetical or recent event, or if it refers to the patient or someone else. So implementing the algorithm yields benefits greater than just beefing up your negation detection logic. In fact, the ConText project, a spin-off from Negex, does just that (although its code repository seems to contain the same example code that the Negex project contains).
I initially tried reading the original paper on ConText: ConText: An Algorithm for Identifying Contextual Features from Clinical Text, in an attempt to figure out what the algorithm did, but the paper is pretty high level and covers a lot of ground, so I fell back to reading the Java code in the genConText sudirectory of the Negex source code. Unfortunately, Java is not really a good language for describing algorithms because of its verbosity, so I turned to the Python code in the negex.python subdirectory instead, which was far easier to read and understand.
My implementation does not faithfully implement the Negex algorithm description, but it ends up with an identical accuracy achieved on the same data set by the Python reference implementation. Specifically, my algorithm only looks for PREN and POST rules (and optionally PREP and POSP if non-strict mode is selected), but does not use PSEU (Pseudo negation) terms at all. The original algorithm prescribes finding all keywords in the sentence, then discarding any leading PSEU terms, then computing all the negation scopes in the sentence, and finally selecting and working on the negation scope that includes our target phrase. My implementation finds the location of the phrase first, then finds PREN and POST (and optionally PREP and POSP) around it and calculates negation based on the distance between the phrase and trigger term. It also checks for CONJ to split up the sentence (only working on the split containing the target phrase). Here is my code for the Negex algorithm.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 | // Source: src/main/scala/com/mycompany/scalcium/transformers/NegexAnnotator.scala
package com.mycompany.scalcium.transformers
import java.io.File
import java.util.regex.Pattern
import scala.Array.canBuildFrom
import scala.io.Source
case class Offset(val start: Int, val end: Int) {
def isNone = start == -1 && end == -1
def None = Offset(-1,-1)
}
class NegexAnnotator(ruleFile: File,
responses: List[String]) {
val rules = sortRules(ruleFile)
/**
* Predicts sense of the phrase in the sentence
* as one of the responses based on whether the
* negTagger method returns true or false.
* @param sentence the sentence.
* @param phrase the phrase.
* @nonStrict true if non-strict mode.
* @return a response (passed into constructor).
*/
def predict(sentence: String, phrase: String,
nonStrict: Boolean): String =
if (negTagger(sentence, phrase, nonStrict))
responses(0)
else responses(1)
/**
* Parses trigger rules file and converts them
* to a List of (trigger pattern, rule type) pairs
* sorted by descending order of length of
* original trigger string. This method is called
* on construction (should not be called from
* client code).
* @param the trigger rules File.
* @return List of (trigger pattern, rule type)
* sorted by trigger string length.
*/
def sortRules(ruleFile: File): List[(Pattern,String)] = {
Source.fromFile(ruleFile)
.getLines()
// input format: trigger phrase\t\t[TYPE]
.map(line => {
val cols = line.split("\t\t")
(cols(0), cols(1))
})
.toList
// sort by length descending
.sortWith((a,b) => a._1.length > b._1.length)
// replace spaces by \\s+ and convert to pattern
.map(pair => (
Pattern.compile("\\b(" + pair._1
.trim()
.replaceAll("\\s+", "\\\\s+") + ")\\b"),
pair._2))
}
/**
* This is the heart of the algorithm. It normalizes
* the incoming sentence, then finds the character
* offset (start,end) for the phrase. If a CONJ
* trigger is found, it only considers the part of
* the sentence where the phrase was found. It
* looks at the PREN, POST, PREP and POSP (the
* last 2 if tagPossible=true) looking for trigger
* terms within 5 words of the phrase.
* @param sentence the sentence (unnormalized).
* @param phrase the phrase (unnormalized)
* @param tagPossible true if non-strict mode
* annotation required.
*/
def negTagger(sentence: String, phrase: String,
tagPossible: Boolean): Boolean = {
val normSent = sentence.toLowerCase()
.replaceAll("\\s+", " ")
val wordPositions = 0 :: normSent.toCharArray()
.zipWithIndex
.filter(ci => ci._1 == ' ')
.map(ci => ci._2 + 1)
.toList
// tag the phrase
val phrasePattern = Pattern.compile(
"\\b(" +
phrase.replaceAll("\\s+", "\\\\s+") +
")\\b", Pattern.CASE_INSENSITIVE)
val phraseOffset = offset(normSent, phrasePattern)
if (phraseOffset.isNone) return false
// look for CONJ trigger terms
val conjOffsets = offsets(normSent, "[CONJ]", rules)
if (conjOffsets.isEmpty) {
// run through the different rule sets,
// terminating when we find a match
val triggerTypes = if (tagPossible)
List("[PREN]", "[POST]", "[PREP]", "[POSP]")
else List("[PREN]", "[POST]")
isTriggerInScope(normSent, rules,
phraseOffset, wordPositions, triggerTypes)
} else {
// chop off the side of the sentence where
// the phrase does not appear.
val conjOffset = conjOffsets.head
if (conjOffset.end < phraseOffset.start) {
val truncSent = normSent.substring(conjOffset.end + 1)
negTagger(truncSent, phrase, tagPossible)
} else if (phraseOffset.end < conjOffset.start) {
val truncSent = normSent.substring(0, conjOffset.start)
negTagger(truncSent, phrase, tagPossible)
} else {
false
}
}
}
/**
* Returns true if the trigger term is within the
* context of the phrase, ie, within 5 words of
* each other. Recursively checks each rule type
* in the triggerTypes in list.
* @param sentence the normalized sentence.
* @param rules the sorted list of rules.
* @param phraseOffset the phrase offset.
* @param wordPositions the positions of the
* starting character position of each
* word in the normalized sentence.
* @param triggerTypes the trigger types to
* check.
* @return true if trigger is in the context of
* the phrase, false if not.
*/
def isTriggerInScope(
sentence: String,
rules: List[(Pattern,String)],
phraseOffset: Offset,
wordPositions: List[Int],
triggerTypes: List[String]): Boolean = {
if (triggerTypes.isEmpty) false
else {
val currentTriggerType = triggerTypes.head
val triggerOffsets = offsets(sentence,
currentTriggerType, rules)
val selectedTriggerOffset = firstNonOverlappingOffset(
phraseOffset, triggerOffsets)
if (selectedTriggerOffset.isNone)
// try with the next trigger pattern
isTriggerInScope(sentence, rules,
phraseOffset, wordPositions,
triggerTypes.tail)
else {
// check how far the tokens are. If PRE*
// token, then there is no distance limit
// but 5 words is the distance limit for
// POS* rules.
if (currentTriggerType.startsWith("[PRE"))
selectedTriggerOffset.start <
phraseOffset.start
else
wordDistance(phraseOffset,
selectedTriggerOffset,
wordPositions) <= 5 &&
phraseOffset.start <
selectedTriggerOffset.start
}
}
}
/**
* Returns the distance in number of words
* between the phrase and trigger term.
* @param phraseOffset (start,end) for phrase.
* @param triggerOffset (start,end) for trigger.
* @param wordPositions a list of starting
* character positions for each word
* in (normalized) sentence.
* @return number words between phrase and trigger.
*/
def wordDistance(phraseOffset: Offset,
triggerOffset: Offset,
wordPositions: List[Int]): Int = {
if (phraseOffset.start < triggerOffset.start)
wordPositions
.filter(pos => pos > phraseOffset.end &&
pos < triggerOffset.start)
.size
else
wordPositions
.filter(pos => pos > triggerOffset.end &&
pos < phraseOffset.start)
.size
}
/**
* Compute the character offset of the phrase
* in the (normalized) sentence. If there is
* no match, then an Offset(-1,-1) is returned.
* @param sentence the normalized sentence.
* @param pattern the phras
*/
def offset(sentence: String,
pattern: Pattern): Offset = {
val matcher = pattern.matcher(sentence)
if (matcher.find())
Offset(matcher.start(), matcher.end())
else Offset(-1, -1)
}
/**
* Find all offsets for trigger terms for the
* specified rule type. Returns a list of offsets
* for trigger terms that matched.
* @param sentence the normalized sentence.
* @param ruleType the rule type to filter on.
* @param rules the list of sorted rule patterns.
* @return a List of Offsets for matched triggers
* of the specified rule type.
*/
def offsets(sentence: String, ruleType: String,
rules: List[(Pattern,String)]): List[Offset] = {
rules.filter(rule => ruleType.equals(rule._2))
.map(rule => offset(sentence, rule._1))
.filter(offset => (! offset.isNone))
}
/**
* Returns the first trigger term that does not
* overlap with the phrase. May return (-1,-1).
* @param phraseOffset the offset for the phrase.
* @param triggerOffsets a list of Offsets for the
* triggers.
* @return the first non-overlapping offset.
*/
def firstNonOverlappingOffset(phraseOffset: Offset,
triggerOffsets: List[Offset]): Offset = {
val phraseRange = Range(phraseOffset.start, phraseOffset.end)
val nonOverlaps = triggerOffsets
.filter(offset => {
val offsetRange = Range(offset.start, offset.end)
phraseRange.intersect(offsetRange).size == 0
})
if (nonOverlaps.isEmpty) Offset(-1,-1)
else nonOverlaps.head
}
}
|
The code is heavily documented, hopefully its easy to understand. The rules are read in from a file (I used the files provided in the genConText subdirectory of the Negex repository), longest rule first. The incoming sentence and phrases are normalized and the location of the phrase determined (start and end character positions). We first attempt to find occurrences of CONJ trigger terms. If found, we split the sentence, working only on the portion that contains our phrase. Next we try to match trigger terms of type PREN, POST, PREP and POSP (the last two if non-strict matching is turned on). If we find one, we calculate the distance (in words) from the phrase to the trigger term. For PREN (and PREP) there is no distance limit, for POST (and POSP) the maximum allowed distance is 5 words.
As I mentioned earlier, we can reuse the same code with different rule files - negex_triggers.txt for Negation, history_triggers.txt and hypothetical_triggers.txt for Temporality, and experiencer_triggers.txt for Experiencer. Here is some JUnit code that instantiates different types of annotators and evaluates them using test data (also from Negex).
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 | // Source: src/test/scala/com/mycompany/scalcium/transformers/NegexAnnotatorTest.scala
package com.mycompany.scalcium.transformers
import java.io.File
import scala.io.Source
import org.junit.Test
class NegexAnnotatorTest {
val dir = new File("src/main/resources/negex")
val input = new File(dir, "test_input.txt")
@Test
def detailReport(): Unit = {
val negAnnotator = new NegexAnnotator(
new File(dir, "negex_triggers.txt"),
List("Negated", "Affirmed"))
val histAnnotator = new NegexAnnotator(
new File(dir, "history_triggers.txt"),
List("Historical", "Recent"))
val hypAnnotator = new NegexAnnotator(
new File(dir, "hypothetical_triggers.txt"),
List("Not particular", "Recent"))
val expAnnotator = new NegexAnnotator(
new File(dir, "experiencer_triggers.txt"),
List("Other", "Patient"))
var numTested = 0
var numCorrectNeg = 0
var numCorrectHist = 0
var numCorrectHyp = 0
var numCorrectExp = 0
Source.fromFile(input)
.getLines()
.foreach(line => {
val cols = line.split("\t")
val phrase = cols(2)
val sentence = cols(3)
val negActual = cols(4)
val histActual = cols(5)
val hypActual = cols(5)
val expActual = cols(6)
val negPred = negAnnotator.predict(sentence, phrase, false)
val negCorrect = negActual.equals(negPred)
val histPred = histAnnotator.predict(sentence, phrase, false)
val histCorrect = histActual.equals(histPred)
val hypPred = hypAnnotator.predict(sentence, phrase, false)
val hypCorrect = hypActual.equals(hypPred)
val expPred = expAnnotator.predict(sentence, phrase, false)
val expCorrect = expActual.equals(expPred)
numCorrectNeg += (if (negCorrect) 1 else 0)
numCorrectHist += (if (histCorrect) 1 else 0)
numCorrectHyp += (if (hypCorrect)1 else 0)
numCorrectExp += (if (expCorrect) 1 else 0)
numTested += 1
Console.println("%s\t%s\t[%s] %s\t[%s] %s\t[%s] %s\t[%s] %s"
.format(phrase, sentence,
if (negCorrect) "+" else "-", negActual,
if (histCorrect) "+" else "-", histActual,
if (hypCorrect) "+" else "-", hypActual,
if (expCorrect) "+" else "-", expActual))
})
Console.println()
Console.println("Accuracy Scores")
Console.println("Accuracy (Negation) = %8.6f"
.format(numCorrectNeg.toDouble / numTested.toDouble))
Console.println("Accuracy (Historical) = %8.6f"
.format(numCorrectHist.toDouble / numTested.toDouble))
Console.println("Accuracy (Hypothetical) = %8.6f"
.format(numCorrectHyp.toDouble / numTested.toDouble))
Console.println("Accuracy (Experiencer) = %8.6f"
.format(numCorrectExp.toDouble / numTested.toDouble))
}
}
|
The first 20 lines of output from the above JUnit test (post-processed manually to make it suitable for HTML display) is shown below. The green responses indicate that my NegexAnnotator got it right (the actual text indicates the expected response), and red imply that it got it wrong.
Phrase | Sentence | Negation | Historical | Hypothetical | Experiencer |
---|---|---|---|---|---|
lungs were decreased at the bilateral bases | The LUNGS WERE DECREASED AT THE BILATERAL BASES. | Affirmed | Recent | Recent | Patient |
cough | She denies any COUGH or sputum production. | Negated | Recent | Recent | Patient |
right ventricular pressure or volume overload | 04) Flattened interventricular septum, consistent with RIGHT VENTRICULAR PRESSURE OR VOLUME OVERLOAD. | Affirmed | Recent | Recent | Patient |
aortic valve is normal | The AORTIC VALVE IS NORMAL. | Affirmed | Recent | Recent | Patient |
wheezes | No WHEEZES, rales, or rhonchi. | Negated | Recent | Recent | Patient |
Normal left ventricular size and function | FINAL IMPRESSIONS: 01) NORMAL LEFT VENTRICULAR SIZE AND FUNCTION. | Affirmed | Recent | Recent | Patient |
gastric varices | There was no evidence of GASTRIC VARICES, ulcers, or Mallory-Weiss tear. | Negated | Recent | Recent | Patient |
atypical follicular lymphoma | The most likely possibilities include marginal cell lymphoma vs. an ATYPICAL FOLLICULAR LYMPHOMA. | Affirmed | Recent | Recent | Patient |
hemodynamically stable | The patient remained HEMODYNAMICALLY STABLE with the heart rate in 70s. | Affirmed | Recent | Recent | Patient |
ALLERGIES: CODEINE | ALLERGIES: CODEINE. | Affirmed | Recent | Recent | Patient |
pulmonic regurgitation | There is trace PULMONIC REGURGITATION. | Affirmed | Recent | Recent | Patient |
Colitis in the sigmoid colon | COMPLICATIONS: None POSTOPERATIVE DIAGNOSIS(ES): 1) COLITIS IN THE SIGMOID COLON 2) No additional lesions PLAN: 1) Await biopsy results REPEAT EXAM: Colonoscopy in 10 year(s). | Affirmed | Recent | Recent | Patient |
ANEMIA | ANEMIA. | Affirmed | Recent | Recent | Patient |
Thickened aortic valve with mild regurgitation | 03) THICKENED AORTIC VALVE WITH MILD REGURGITATION. | Affirmed | Recent | Recent | Patient |
interatrial baffle | 07) There is an echodensity seen in the atria consistent with the patient''s know INTERATRIAL BAFFLE. | Affirmed | Historical | Historical | Patient |
bowel distention | S_O_H Counters Report Type Record Type Subgroup Classifier 13,MFZKSI+l8xGn DS DS 5004 E_O_H [Report de-identified (Safe-harbor compliant) by De-ID v.6.14.02] **INSTITUTION GENERAL MEDICINE DISCHARGE SUMMARY PATIENT NAME: **NAME[AAA, BBB M] ACCOUNT #: **ID-NUM **ROOM ATTENDING PHYSICIAN: **NAME[ZZZ M YYY] ADMISSION DATE: **DATE[Sep 19 2007] DISCHARGE DATE: **DATE[Sep 27 2007] The patient is a **AGE[in 40s]-year-old with past medical history of partial small bowel obstruction, who is admitted quite frequently for abdominal pain, BOWEL DISTENTION, and partial small bowel obstructions. | Affirmed | Historical | Historical | Patient |
B12 deficiency | She does have B12 DEFICIENCY and she is getting vitamin B12 shots every month. | Affirmed | Recent | Recent | Patient |
unusual headaches | She will alert the physician if any of the following are noted: Nosebleeds, excessive bruising, pink, red, or tea-colored urine, bright red or tarry black stools, UNUSUAL HEADACHES, or stomach pain. | Affirmed | Not particular | Not particular | Patient |
mitral regurgitation | SPECTRAL DOPPLER: There is trace MITRAL REGURGITATION. | Affirmed | Recent | Recent | Patient |
rib fracture | There is a 1cm area of linear fibrosis in the left lower lobe on image 45 of series 4 which may be related to an adjacent old RIB FRACTURE. | Affirmed | Historical | Historical | Patient |
The accuracy scores for the different annotators are shown below. The test data consists of 2,376 phrase/sentence combinations, annotated with the expected value for Negation, Historical/Hypothetical and Experiencer. As you can see, the accuracy for the Negation and Experiencer annotators are quite high, and Historical/Hypothetical isn't too bad either.
1 2 3 4 | Accuracy (Negation) = 0.978956
Accuracy (Historical) = 0.909091
Accuracy (Hypothetical) = 0.889731
Accuracy (Experiencer) = 0.997475
|
Thats all I have for today. Negex seems to be a useful algorithm to know. I liked that the same algorithm could be used to include/exclude clinical concepts based on temporality and experiencer as well, a bit like a Swiss army knife.
2 comments (moderated to prevent spam):
Can you please explain how to find the required phrases from the sentence? Urgent help needed.
Thats where the NER comes in (see your comment on this post). The Negex algorithm allows you to compute negation status of phrases in a sentence. The phrases are the sequences like "cough" or "sputum production" in the sentence: "She denies any cough or sputum production". You need the NER to annotate these sequences in your sentence as the phrases of interest.
Post a Comment