Saturday, April 26, 2014

Scala implementation of the Negex Algorithm


I first learned about a (baseline) approach to detecting and dealing with negation in text in the Coursera NLP class taught by Stanford profs Jurafsky and Manning. Essentially it involves ignoring everything in a sentence after certain "negation" words are seen. For example, a naive annotator simply looking for word matches could legitimately conclude that a patient is a smoker based on a doctor's note that says "Patient does not smoke". Using the simple negation rule would ignore everything after the negation keyword "not".
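To make that baseline concrete, here is a tiny sketch of the idea (the object name and cue list are my own, purely illustrative): every word after the first negation cue gets a NEG_ prefix, so a naive keyword matcher would no longer see "smoke" in the negated note.

```scala
// A minimal sketch of the baseline rule: everything after the first
// negation cue is marked as negated. The cue list here is tiny and
// purely illustrative.
object NaiveNegation {
  val cues = Set("not", "no", "denies", "without")

  def tag(sentence: String): List[String] = {
    val words = sentence.toLowerCase.split("\\s+").toList
    val cueIdx = words.indexWhere(cues.contains)
    if (cueIdx < 0) words
    else words.take(cueIdx + 1) ++
      words.drop(cueIdx + 1).map(w => "NEG_" + w)
  }
}
// NaiveNegation.tag("Patient does not smoke")
// returns List("patient", "does", "not", "NEG_smoke")
```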

While this is an improvement over not doing it at all, natural language has nuances that are not captured by this simple rule. The Negex algorithm is a bit more complicated, but does a better job of indicating whether a clinical condition in text should be negated. It uses a lexicon of 281 trigger phrases, classified as CONJ (conjunctions that join two phrases together), PSEU (pseudo-negation triggers), PREN (pre-negation triggers), POST (post-negation triggers), and PREP and POSP (pre and post categories similar to PREN and POST, but used in the less strict matching mode). The algorithm is described by its authors here.
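For reference, the trigger lists distributed with Negex are plain text files, one trigger per line, with the phrase and its type separated by two tabs. The sample lines below are typical of those lists (abbreviated here for illustration), parsed the same way my annotator does later in this post:

```scala
// Trigger rule format: phrase, two tabs, rule type in brackets.
// The sample phrases below are illustrative, in the style of the
// Negex trigger lists.
object TriggerRuleDemo {
  val sampleRules = List(
    "no evidence of\t\t[PREN]",
    "denies\t\t[PREN]",
    "unlikely\t\t[POST]",
    "but\t\t[CONJ]")

  // split a line into a (trigger phrase, rule type) pair
  def parse(line: String): (String, String) = {
    val cols = line.split("\t\t")
    (cols(0), cols(1))
  }
}
// TriggerRuleDemo.parse("denies\t\t[PREN]") returns ("denies", "[PREN]")
```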

I started looking at it hoping that I could use it as a library, but the code in the repository looks more like examples that clients could use as inspiration for their own implementations. In fact, Hitex uses a GATE CREOLE resource that wraps the Negex algorithm. Apache cTakes, the other clinical pipeline I am studying, uses a more complex rule based approach using FSMs.

One of the nice things about the Negex algorithm is that it can be generalized (using different sets of trigger terms) to predict whether a condition refers to a historical or current event, a hypothetical or recent event, or if it refers to the patient or someone else. So implementing the algorithm yields benefits greater than just beefing up your negation detection logic. In fact, the ConText project, a spin-off from Negex, does just that (although its code repository seems to contain the same example code that the Negex project contains).

I initially tried reading the original paper on ConText: ConText: An Algorithm for Identifying Contextual Features from Clinical Text, in an attempt to figure out what the algorithm did, but the paper is pretty high level and covers a lot of ground, so I fell back to reading the Java code in the genConText subdirectory of the Negex source code. Unfortunately, Java is not really a good language for describing algorithms because of its verbosity, so I turned to the Python code in the negex.python subdirectory instead, which was far easier to read and understand.

My implementation does not faithfully follow the published description of the Negex algorithm, but it achieves the same accuracy as the Python reference implementation on the same data set. Specifically, my algorithm looks only for PREN and POST rules (and optionally PREP and POSP if non-strict mode is selected), and does not use PSEU (pseudo-negation) terms at all. The original algorithm prescribes finding all keywords in the sentence, then discarding any leading PSEU terms, then computing all the negation scopes in the sentence, and finally selecting and working on the negation scope that includes our target phrase. My implementation finds the location of the phrase first, then finds PREN and POST (and optionally PREP and POSP) triggers around it, and decides negation based on the distance between the phrase and the trigger term. It also checks for CONJ triggers to split up the sentence (working only on the split containing the target phrase). Here is my code for the Negex algorithm.

// Source: src/main/scala/com/mycompany/scalcium/transformers/NegexAnnotator.scala
package com.mycompany.scalcium.transformers

import java.io.File
import java.util.regex.Pattern

import scala.Array.canBuildFrom
import scala.io.Source

case class Offset(val start: Int, val end: Int) {
  def isNone = start == -1 && end == -1
  def None = Offset(-1,-1)
}

class NegexAnnotator(ruleFile: File, 
    responses: List[String]) {

  val rules = sortRules(ruleFile)
  
  /**
   * Predicts sense of the phrase in the sentence
   * as one of the responses based on whether the
   * negTagger method returns true or false.
   * @param sentence the sentence.
   * @param phrase the phrase.
   * @param nonStrict true if non-strict mode.
   * @return a response (passed into constructor).
   */
  def predict(sentence: String, phrase: String, 
      nonStrict: Boolean): String = 
    if (negTagger(sentence, phrase, nonStrict)) 
      responses(0)
    else responses(1)

  /**
   * Parses trigger rules file and converts them
   * to a List of (trigger pattern, rule type) pairs
   * sorted by descending order of length of 
   * original trigger string. This method is called
   * on construction (should not be called from 
   * client code).
   * @param ruleFile the trigger rules File.
   * @return List of (trigger pattern, rule type)
   *       sorted by trigger string length.
   */
  def sortRules(ruleFile: File): List[(Pattern,String)] = {
    Source.fromFile(ruleFile)
      .getLines()
      // input format: trigger phrase\t\t[TYPE]
      .map(line => {
        val cols = line.split("\t\t")
        (cols(0), cols(1))
      })
      .toList
      // sort by length descending
      .sortWith((a,b) => a._1.length > b._1.length)
      // replace spaces by \\s+ and convert to pattern
      .map(pair => (
        Pattern.compile("\\b(" + pair._1
          .trim()
          .replaceAll("\\s+", "\\\\s+") + ")\\b"), 
            pair._2))
  }
  
  /**
   * This is the heart of the algorithm. It normalizes
   * the incoming sentence, then finds the character
   * offset (start,end) for the phrase. If a CONJ
   * trigger is found, it only considers the part of
   * the sentence where the phrase was found. It 
   * looks at the PREN, POST, PREP and POSP (the
   * last 2 if tagPossible=true) looking for trigger
   * terms within 5 words of the phrase.
   * @param sentence the sentence (unnormalized).
   * @param phrase the phrase (unnormalized)
   * @param tagPossible true if non-strict mode
   *        annotation required.
   * @return true if the phrase should be negated.
   */
  def negTagger(sentence: String, phrase: String, 
      tagPossible: Boolean): Boolean = {
    val normSent = sentence.toLowerCase()
      .replaceAll("\\s+", " ")
    val wordPositions = 0 :: normSent.toCharArray()
      .zipWithIndex
      .filter(ci => ci._1 == ' ')
      .map(ci => ci._2 + 1)
      .toList
    // tag the phrase
    val phrasePattern = Pattern.compile(
      "\\b(" + 
      phrase.replaceAll("\\s+", "\\\\s+") + 
      ")\\b", Pattern.CASE_INSENSITIVE)
    val phraseOffset = offset(normSent, phrasePattern)
    if (phraseOffset.isNone) return false
    // look for CONJ trigger terms
    val conjOffsets = offsets(normSent, "[CONJ]", rules)
    if (conjOffsets.isEmpty) {
      // run through the different rule sets, 
      // terminating when we find a match
      val triggerTypes = if (tagPossible) 
        List("[PREN]", "[POST]", "[PREP]", "[POSP]")
      else List("[PREN]", "[POST]")
      isTriggerInScope(normSent, rules, 
        phraseOffset, wordPositions, triggerTypes)      
    } else {
      // chop off the side of the sentence where
      // the phrase does not appear.
      val conjOffset = conjOffsets.head
      if (conjOffset.end < phraseOffset.start) {
        val truncSent = normSent.substring(conjOffset.end + 1)
        negTagger(truncSent, phrase, tagPossible)
      } else if (phraseOffset.end < conjOffset.start) {
        val truncSent = normSent.substring(0, conjOffset.start)
        negTagger(truncSent, phrase, tagPossible)
      } else {
        false
      }
    }
  }
  
  /**
   * Returns true if the trigger term is within the
   * context of the phrase, ie, within 5 words of
   * each other. Recursively checks each rule type
   * in the triggerTypes in list.
   * @param sentence the normalized sentence.
   * @param rules the sorted list of rules.
   * @param phraseOffset the phrase offset.
   * @param wordPositions the positions of the
   *        starting character position of each
   *        word in the normalized sentence.
   * @param triggerTypes the trigger types to
   *        check.
   * @return true if trigger is in the context of
   *        the phrase, false if not.
   */
  def isTriggerInScope(
      sentence: String,
      rules: List[(Pattern,String)],
      phraseOffset: Offset,
      wordPositions: List[Int],
      triggerTypes: List[String]): Boolean = {
    if (triggerTypes.isEmpty) false
    else {
      val currentTriggerType = triggerTypes.head
      val triggerOffsets = offsets(sentence, 
        currentTriggerType, rules)
      val selectedTriggerOffset = firstNonOverlappingOffset(
        phraseOffset, triggerOffsets)
      if (selectedTriggerOffset.isNone)
        // try with the next trigger pattern
        isTriggerInScope(sentence, rules, 
          phraseOffset, wordPositions,
          triggerTypes.tail)
      else {
        // check how far the tokens are. If PRE*
        // token, then there is no distance limit
        // but 5 words is the distance limit for 
        // POS* rules.
        if (currentTriggerType.startsWith("[PRE"))
          selectedTriggerOffset.start < 
            phraseOffset.start
        else
          wordDistance(phraseOffset, 
            selectedTriggerOffset, 
            wordPositions) <= 5 &&
            phraseOffset.start < 
            selectedTriggerOffset.start
      }
    }
  }
  
  /**
   * Returns the distance in number of words
   * between the phrase and trigger term.
   * @param phraseOffset (start,end) for phrase.
   * @param triggerOffset (start,end) for trigger.
   * @param wordPositions a list of starting
   *        character positions for each word
   *        in (normalized) sentence.
   * @return number words between phrase and trigger.
   */
  def wordDistance(phraseOffset: Offset, 
      triggerOffset: Offset,
      wordPositions: List[Int]): Int = {
    if (phraseOffset.start < triggerOffset.start)
      wordPositions
        .filter(pos => pos > phraseOffset.end && 
          pos < triggerOffset.start)
        .size
    else
      wordPositions
        .filter(pos => pos > triggerOffset.end && 
          pos < phraseOffset.start)
        .size
  }
  
  /**
   * Compute the character offset of the phrase
   * in the (normalized) sentence. If there is 
   * no match, then an Offset(-1,-1) is returned.
   * @param sentence the normalized sentence.
   * @param pattern the phrase pattern.
   * @return the matched Offset, or Offset(-1,-1)
   *        if there is no match.
   */
  def offset(sentence: String, 
      pattern: Pattern): Offset = {
    val matcher = pattern.matcher(sentence)
    if (matcher.find()) 
      Offset(matcher.start(), matcher.end())
    else Offset(-1, -1)      
  }
  
  /**
   * Find all offsets for trigger terms for the
   * specified rule type. Returns a list of offsets
   * for trigger terms that matched.
   * @param sentence the normalized sentence.
   * @param ruleType the rule type to filter on.
   * @param rules the list of sorted rule patterns.
   * @return a List of Offsets for matched triggers
   *        of the specified rule type.
   */
  def offsets(sentence: String, ruleType: String,
      rules: List[(Pattern,String)]): List[Offset] = {
    rules.filter(rule => ruleType.equals(rule._2))
      .map(rule => offset(sentence, rule._1))
      .filter(offset => (! offset.isNone))
  }
  
  /**
   * Returns the first trigger term that does not
   * overlap with the phrase. May return (-1,-1).
   * @param phraseOffset the offset for the phrase.
   * @param triggerOffsets a list of Offsets for the
   *        triggers.
   * @return the first non-overlapping offset.
   */
  def firstNonOverlappingOffset(phraseOffset: Offset, 
      triggerOffsets: List[Offset]): Offset = {
    val phraseRange = Range(phraseOffset.start, phraseOffset.end)
    val nonOverlaps = triggerOffsets
      .filter(offset => {
        val offsetRange = Range(offset.start, offset.end)  
        phraseRange.intersect(offsetRange).size == 0
      })
    if (nonOverlaps.isEmpty) Offset(-1,-1) 
    else nonOverlaps.head
  }
}

The code is heavily documented, so hopefully it is easy to follow. The rules are read in from a file (I used the files provided in the genConText subdirectory of the Negex repository) and sorted longest rule first. The incoming sentence and phrase are normalized, and the location of the phrase is determined (start and end character positions). We first attempt to find occurrences of CONJ trigger terms. If found, we split the sentence, working only on the portion that contains our phrase. Next we try to match trigger terms of type PREN, POST, PREP and POSP (the last two only if non-strict matching is turned on). If we find one, we calculate the distance (in words) from the phrase to the trigger term. For PREN (and PREP) there is no distance limit; for POST (and POSP) the maximum allowed distance is 5 words.
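One detail worth calling out is the trigger-to-pattern conversion done in sortRules. The self-contained sketch below (my own demo object, not part of the annotator) shows why spaces in a trigger are replaced by \s+ and why the whole phrase is wrapped in word boundaries: the resulting pattern still matches when the sentence has irregular whitespace, but not when the trigger is only part of a longer word.

```scala
import java.util.regex.Pattern

// Mirrors the pattern construction in sortRules: trim the trigger,
// collapse its internal spaces to \s+, and wrap the whole phrase
// in \b word boundaries.
object TriggerRegexDemo {
  def toPattern(trigger: String): Pattern =
    Pattern.compile("\\b(" +
      trigger.trim.replaceAll("\\s+", "\\\\s+") + ")\\b")

  def matches(trigger: String, sentence: String): Boolean =
    toPattern(trigger).matcher(sentence).find()
}
// matches across extra whitespace:
// TriggerRegexDemo.matches("no evidence of", "there is no evidence  of varices")
// returns true
```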

As I mentioned earlier, we can reuse the same code with different rule files - negex_triggers.txt for Negation, history_triggers.txt and hypothetical_triggers.txt for Temporality, and experiencer_triggers.txt for Experiencer. Here is some JUnit code that instantiates different types of annotators and evaluates them using test data (also from Negex).

// Source: src/test/scala/com/mycompany/scalcium/transformers/NegexAnnotatorTest.scala
package com.mycompany.scalcium.transformers

import java.io.File

import scala.io.Source

import org.junit.Test

class NegexAnnotatorTest {

  val dir = new File("src/main/resources/negex")
  val input = new File(dir, "test_input.txt")
  
  @Test
  def detailReport(): Unit = {
    val negAnnotator = new NegexAnnotator(
      new File(dir, "negex_triggers.txt"),
      List("Negated", "Affirmed"))
    val histAnnotator = new NegexAnnotator(
      new File(dir, "history_triggers.txt"),
      List("Historical", "Recent"))
    val hypAnnotator = new NegexAnnotator(
      new File(dir, "hypothetical_triggers.txt"),
      List("Not particular", "Recent"))
    val expAnnotator = new NegexAnnotator(
      new File(dir, "experiencer_triggers.txt"),
      List("Other", "Patient"))
    var numTested = 0
    var numCorrectNeg = 0
    var numCorrectHist = 0
    var numCorrectHyp = 0
    var numCorrectExp = 0
    Source.fromFile(input)
        .getLines()
        .foreach(line => {
      val cols = line.split("\t")
      val phrase = cols(2)
      val sentence = cols(3)
      val negActual = cols(4)
      // the test file has a single temporality column, which is
      // the expected value for both the historical and the
      // hypothetical annotators
      val histActual = cols(5)
      val hypActual = cols(5)
      val expActual = cols(6)
      val negPred = negAnnotator.predict(sentence, phrase, false)
      val negCorrect = negActual.equals(negPred)
      val histPred = histAnnotator.predict(sentence, phrase, false)
      val histCorrect = histActual.equals(histPred) 
      val hypPred = hypAnnotator.predict(sentence, phrase, false)
      val hypCorrect = hypActual.equals(hypPred)
      val expPred = expAnnotator.predict(sentence, phrase, false)
      val expCorrect = expActual.equals(expPred)
      numCorrectNeg += (if (negCorrect) 1 else 0)
      numCorrectHist += (if (histCorrect) 1 else 0)
      numCorrectHyp += (if (hypCorrect) 1 else 0)
      numCorrectExp += (if (expCorrect) 1 else 0)
      numTested += 1
      Console.println("%s\t%s\t[%s] %s\t[%s] %s\t[%s] %s\t[%s] %s"
        .format(phrase, sentence, 
          if (negCorrect) "+" else "-", negActual,
          if (histCorrect) "+" else "-", histActual,
          if (hypCorrect) "+" else "-", hypActual,
          if (expCorrect) "+" else "-", expActual))
    })
    Console.println()
    Console.println("Accuracy Scores")
    Console.println("Accuracy (Negation) = %8.6f"
      .format(numCorrectNeg.toDouble / numTested.toDouble))
    Console.println("Accuracy (Historical) = %8.6f"
      .format(numCorrectHist.toDouble / numTested.toDouble))
    Console.println("Accuracy (Hypothetical) = %8.6f"
      .format(numCorrectHyp.toDouble / numTested.toDouble))
    Console.println("Accuracy (Experiencer) = %8.6f"
      .format(numCorrectExp.toDouble / numTested.toDouble))
  }
}

The first 20 lines of output from the above JUnit test (post-processed manually to make them suitable for HTML display) are shown below. Green responses indicate that my NegexAnnotator got it right (the text shows the expected response), and red indicates that it got it wrong.

Phrase | Sentence | Negation | Historical | Hypothetical | Experiencer
lungs were decreased at the bilateral bases | The LUNGS WERE DECREASED AT THE BILATERAL BASES. | Affirmed | Recent | Recent | Patient
cough | She denies any COUGH or sputum production. | Negated | Recent | Recent | Patient
right ventricular pressure or volume overload | 04) Flattened interventricular septum, consistent with RIGHT VENTRICULAR PRESSURE OR VOLUME OVERLOAD. | Affirmed | Recent | Recent | Patient
aortic valve is normal | The AORTIC VALVE IS NORMAL. | Affirmed | Recent | Recent | Patient
wheezes | No WHEEZES, rales, or rhonchi. | Negated | Recent | Recent | Patient
Normal left ventricular size and function | FINAL IMPRESSIONS: 01) NORMAL LEFT VENTRICULAR SIZE AND FUNCTION. | Affirmed | Recent | Recent | Patient
gastric varices | There was no evidence of GASTRIC VARICES, ulcers, or Mallory-Weiss tear. | Negated | Recent | Recent | Patient
atypical follicular lymphoma | The most likely possibilities include marginal cell lymphoma vs. an ATYPICAL FOLLICULAR LYMPHOMA. | Affirmed | Recent | Recent | Patient
hemodynamically stable | The patient remained HEMODYNAMICALLY STABLE with the heart rate in 70s. | Affirmed | Recent | Recent | Patient
ALLERGIES: CODEINE | ALLERGIES: CODEINE. | Affirmed | Recent | Recent | Patient
pulmonic regurgitation | There is trace PULMONIC REGURGITATION. | Affirmed | Recent | Recent | Patient
Colitis in the sigmoid colon | COMPLICATIONS: None POSTOPERATIVE DIAGNOSIS(ES): 1) COLITIS IN THE SIGMOID COLON 2) No additional lesions PLAN: 1) Await biopsy results REPEAT EXAM: Colonoscopy in 10 year(s). | Affirmed | Recent | Recent | Patient
ANEMIA | ANEMIA. | Affirmed | Recent | Recent | Patient
Thickened aortic valve with mild regurgitation | 03) THICKENED AORTIC VALVE WITH MILD REGURGITATION. | Affirmed | Recent | Recent | Patient
interatrial baffle | 07) There is an echodensity seen in the atria consistent with the patient''s know INTERATRIAL BAFFLE. | Affirmed | Historical | Historical | Patient
bowel distention | S_O_H Counters Report Type Record Type Subgroup Classifier 13,MFZKSI+l8xGn DS DS 5004 E_O_H [Report de-identified (Safe-harbor compliant) by De-ID v.6.14.02] **INSTITUTION GENERAL MEDICINE DISCHARGE SUMMARY PATIENT NAME: **NAME[AAA, BBB M] ACCOUNT #: **ID-NUM **ROOM ATTENDING PHYSICIAN: **NAME[ZZZ M YYY] ADMISSION DATE: **DATE[Sep 19 2007] DISCHARGE DATE: **DATE[Sep 27 2007] The patient is a **AGE[in 40s]-year-old with past medical history of partial small bowel obstruction, who is admitted quite frequently for abdominal pain, BOWEL DISTENTION, and partial small bowel obstructions. | Affirmed | Historical | Historical | Patient
B12 deficiency | She does have B12 DEFICIENCY and she is getting vitamin B12 shots every month. | Affirmed | Recent | Recent | Patient
unusual headaches | She will alert the physician if any of the following are noted: Nosebleeds, excessive bruising, pink, red, or tea-colored urine, bright red or tarry black stools, UNUSUAL HEADACHES, or stomach pain. | Affirmed | Not particular | Not particular | Patient
mitral regurgitation | SPECTRAL DOPPLER: There is trace MITRAL REGURGITATION. | Affirmed | Recent | Recent | Patient
rib fracture | There is a 1cm area of linear fibrosis in the left lower lobe on image 45 of series 4 which may be related to an adjacent old RIB FRACTURE. | Affirmed | Historical | Historical | Patient

The accuracy scores for the different annotators are shown below. The test data consists of 2,376 phrase/sentence combinations, annotated with the expected values for Negation, Historical/Hypothetical and Experiencer. As you can see, the accuracies for the Negation and Experiencer annotators are quite high, and Historical/Hypothetical isn't too bad either.

Accuracy (Negation) = 0.978956
Accuracy (Historical) = 0.909091
Accuracy (Hypothetical) = 0.889731
Accuracy (Experiencer) = 0.997475

That's all I have for today. Negex seems to be a useful algorithm to know. I liked that the same algorithm could be used to include or exclude clinical concepts based on temporality and experiencer as well, a bit like a Swiss army knife.

2 comments (moderated to prevent spam):

Saurabh Garg said...

Can you please explain how to find the required phrases from the sentence? Urgent help needed.

Sujit Pal said...

That's where the NER comes in (see your comment on this post). The Negex algorithm allows you to compute the negation status of phrases in a sentence. The phrases are sequences like "cough" or "sputum production" in the sentence "She denies any cough or sputum production". You need the NER to annotate these sequences in your sentence as the phrases of interest.