Monday, October 03, 2011

Coordinate Expansion with OpenNLP

Background

Despite the fancy name, the problem I am trying to solve here is to eliminate missed opportunities for mapping text such as "hepatitis B and C" against our taxonomy containing concepts "hepatitis B" and "hepatitis C". Thanks to the many kind folks on the LinkedIn Natural Language Processing People group, I now know that this is variously called (or covered under) Distributive Predication, Coordinate Expansion/Construction, and Transformational Grammar. They also provided pointers and links to papers which were quite helpful, which I list in the References below.

My problem is more limited than the work described in the references aim to solve, however. My toy mapping application, TGNI, already uses OpenNLP to break the input text into sentences, so my problem is to detect these patterns in the sentences and expand them. For testing, here are some sentences I pulled off the internet (and in some cases, adapted to fit). Highlighted areas represent the patterns I am interested in expanding.

  1. Viral hepatitis, including hepatitis A, B, and C, are distinct diseases that affect the liver. (webmd.com)
  2. This page contains late breaking information, as well as an archival record of updates on safety and regulatory issues related to Hepatitis A and B, including product approvals, significant labeling changes, safety warnings, notices of upcoming public meetings, and notices about proposed regulatory guidances. (fda.gov)
  3. Lead poisoning can cause an increased risk of brain, lung, stomach and kidney cancer. (cleanupblackwell.com)
  4. Before we look at the difference between diabetes type I and II, let's firstly look at diabetes in general. (medicalnewstoday.com)
  5. Restricting and purging anorexia are two different types of anorexia that people suffer from (anorexiasurvivalguide.com)
  6. Here are some tips on pre and post surgery nutrition. (bestofmotherearth.com)
  7. A computer-based register linked to thyroid diagnostic laboratories was used to continuously identify all new cases of overt hyper- or hypothyroidism in two population cohorts with moderate and mild ID, respectively (Aalborg, n = 310,124; urinary iodine, 45 micro g/liter; and Copenhagen, n = 225,707; urinary iodine, 61 micro g/liter). (nlm.nih.gov)
  8. Medical and assistive devices are taxable for the most part, unconditionally zero-rated in certain cases, and conditionally zero-rated in certain cases. (revenuequebec.ca)
  9. These regions correspond to the least well conserved regions of the whole miRNA/snoRNA molecules. (ploscompbiol.org)
  10. Hetero- and Homogeneous mixtures are alike because they are both compounds, and both made up of different elements. (answers.com)

Identifying Candidate Phrases

Noun Phrases with AND, OR and slash

Based upon some observations extracting noun phrases using OpenNLP, I realized that I could probably get most of the candidate phrases if I considered only the noun phrases containing "AND", "OR" and "/" in them. Checking this is fairly trivial, so I tried that first. This resulted in finding 9 out of the 12 expected phrases from my test set.

Removing noisy Noun Phrases with slash

The noun phrases retrieved also contained a few noise phrases such as "45 micro g/liter" which are not candidates for coordinate expansion, along with "the whole miRNA/snoRNA molecules", which are. For these, I added a rule that posited that there must be a non-zero character overlap between the words preceding and following the "/" character.

I used the Strike-a-Match algorithm to compute string similarity. Some of the code below is directly adapted from the Java code explaining the algorithm.

Expanding dangling prefix words

Looking through the test set, I realized that OpenNLP seems to have a hard time handling dangling prefix words such as hyper- (it treats the "-" as a separate token) and gets all confused, as shown below.

[A/DT computer-based/JJ register/NN ]/NP [linked/VBN ]/VP [to/TO thyroid/VB ]/VP [diagnostic/JJ laboratories/NNS ]/NP [was/VBD used/VBN ]/VP [to/TO continuously/RB identify/VB ]/VP [all/DT new/JJ cases/NNS ]/NP [of/IN ]/PP [overt/JJ hyper/NN ]/NP -/: and/CC [hypothyroidism/NN ]/NP [in/IN ]/PP [two/CD population/NN cohorts/NNS ]/NP [with/IN ]/PP [moderate/JJ and/CC mild/JJ ID/NNP ]/NP ,/, [respectively/RB ]/ADVP [(/-LRB- Aalborg/UH ]/NP ,/, [n/RB ]/ADVP [=/SYM ]/VP [310,124/CD ]/NP ;/: [urinary/JJ iodine/NN ]/NP ,/, [45/CD micro/JJ g/liter/NN ]/NP ;/: and/CC [Copenhagen/NNP ]/NP ,/, [n/RB =/SYM 225,707/CD ]/NP ;/: [urinary/JJ iodine/NN ]/NP ,/, [61/CD micro/JJ g/liter/NN )/-RRB- ]/NP ./.

So I figured that if we expanded such dangling prefixes into the complete word using the token after the conjunction, ie, expanding strings such as "hyper- and hypothyroidism" to "hyperthyroidism and hypothyroidism", then it may make it easier for OpenNLP to tag them correctly.

I used some of the words from the English Prefixes wiki page. When I see one of these words preceding an "AND" or "OR" token, I check the word following the token, split it up and append the second part of the following word to the preceding word. This takes care of 2 more examples in my test set.

Phrase patterns and Parse Trees

I also noticed that some of my patterns seem to be incorrectly categorized (ie not noun phrases), such as these.

[Restricting/VBG and/CC purging/VBG ]/VP [anorexia/DT ]/NP [are/VBP ]/VP [two/CD different/JJ types/NNS ]/NP [of/IN ]/PP [anorexia/NN ]/NP [that/IN ]/PP [people/NNS ]/NP [suffer/VBP ]/VP [from/IN ]/PP ./.

[Before/IN ]/SBAR [we/PRP ]/NP [look/VBP ]/VP [at/IN ]/PP [the/DT difference/NN ]/NP [between/IN ]/PP [diabetes/NNS ]/NP [type-I/JJ ]/ADJP and/CC [II/NNP ]/NP ,/, [let/VB ]/VP ['s/PRP ]/NP [firstly/RB ]/ADVP [look/VBP ]/VP [at/IN ]/PP [diabaetes/NNS ]/NP [in/IN ]/PP [general/JJ ]/ADJP ./.

I suspect that these are training issues with the default models provided with the OpenNLP distribution I am using. Making my own training set is unfortunately not feasible given my resources. At this point, I considered making up rules to extract (VP,NP) and (NP,ADJP,CC,NP) sequences as well as noun phrases, but realized that this may lead to overfitting.

I remembered one comment which said that this problem is trivial if you have a good parse tree, so I decided to see if OpenNLP could give me a parse tree that I could work with for these corner cases (going by my test examples, these account for just 2 of my 12 cases). Here they are, for both the sentences considered above.

(TOP 
  (S 
    (SBAR 
       (IN Before) 
       (S 
         (NP 
           (PRP we)
         ) 
         (VP 
           (VBP look) 
           (PP 
             (IN at) 
             (NP 
               (NP 
                 (DT the) 
                 (NN difference)
               ) 
               (PP 
                 (IN between) 
                 (NP 
                   (NP 
                     (NN diabetes) 
                     (NN type-I)
                   ) 
                   (CC and) 
                   (NP 
                     (NNP II,)
                   )
                 )
               )
             )
           )
         )
       )
     ) 
     (VP 
       (VBD let's) 
       (RB firstly) 
       (VP 
         (VB look) 
         (PP 
           (IN at) 
           (NP 
             (NP 
               (NNS diabaetes)
             ) 
             (PP 
               (IN in) 
               (NP
                 (NN general.)
               )
             )
           )
         )
       )
     )
   )
)
    
(TOP 
  (S 
    (S 
      (VP 
        (VP 
          (VBG Restricting)
        ) 
        (CC and) 
        (VP 
          (VBG purging) 
          (NP 
            (DT anorexia)
          )
        )
      )
    ) 
    (VP 
      (VBP are) 
      (NP 
        (NP 
          (NP 
            (CD two) 
            (JJ different) 
            (NNS types)
          ) 
          (PP 
            (IN of) 
            (NP 
              (NN anorexia)
            )
          )
        ) 
        (PP 
          (IN that) 
          (NP 
            (NNS people)
          )
        )
      ) 
      (VBP suffer)
    ) 
    (. from.)
  )
)
    

I wanted to use the parser as a fallback to the chunker, so I count the number of "AND", "OR" and "/" in the sentence, and if the number of phrases discovered by the chunker is less than this, I invoke the parser and generate the tree. I did consider using only the parser, but there are instances where the chunker is "more correct" than the parser, and the chunker catches most of my phrases anyway, so it makes sense to use the parser to handle the corner cases. The other reason is that the chunker is less resource-intensive than the parser.

Anyway, the fallback check above exposes 6 occurrences where the chunker did not find all the phrases containing "AND", "OR" and "/", 2 of which are correct from my point of view. So we are going to invoke the parser 4 more times than we need to.

Expanding the phrase

I initially thought it would be relatively simple to just pass the retrieved phrases through a chain of regular expression parsers, but then realized that a custom parser would probably be easier (and more general), so I built a token based parser.

Code

Here is the code to do this all, written as a JUnit test. This will become a UIMA Analysis Engine that will be added to the front end aggregate AE in TGNI.

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
// Source: src/test/java/com/mycompany/tgni/uima/annotators/nlp/CoordinateConstructionTest.java
package com.mycompany.tgni.uima.annotators.nlp;

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import opennlp.tools.chunker.Chunker;
import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;
import opennlp.tools.cmdline.parser.ParserTool;
import opennlp.tools.parser.Parse;
import opennlp.tools.parser.Parser;
import opennlp.tools.parser.ParserFactory;
import opennlp.tools.parser.ParserModel;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTagger;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;

import org.apache.commons.io.IOUtils;
import org.apache.commons.lang.ArrayUtils;
import org.apache.commons.lang.StringUtils;
import org.junit.Test;

public class CoordinateConstructionTest {

  private final String[] TEST_SENTENCES = new String[] { ... };
  
  private static final String[] CONJ_WORDS = new String[] {
    "and", "or",
  };
  // from wikipedia page on Prefix_Words
  private static final String[] PREFIX_WORDS = new String[] {
    "pre", "post", "hypo", "hyper", "inter", "intra", 
    "over", "under", "infra", "ultra", "hetero", "homo",
  };
  
  @Test
  public void findCandidatePhrases() throws Exception {
    Tokenizer tokenizer = null;
    POSTagger posTagger = null;
    Chunker chunker = null;
    Parser parser = null;
    InputStream tmis = null;
    InputStream pmis = null;
    InputStream cmis = null;
    InputStream pamis = null;
    Set<String> ccs = new HashSet<String>(Arrays.asList(CONJ_WORDS));
    Set<String> prefixes = new HashSet<String>(Arrays.asList(PREFIX_WORDS));
    try {
      tmis = new FileInputStream("/path/to/opennlp/models/en-token.bin");
      TokenizerModel tm = new TokenizerModel(tmis);
      tokenizer = new TokenizerME(tm);
      pmis = new FileInputStream("/path/to/opennlp/models/en-pos-maxent.bin");
      POSModel pm = new POSModel(pmis);
      posTagger = new POSTaggerME(pm);
      cmis = new FileInputStream("/path/to/opennlp/models/en-chunker.bin");
      ChunkerModel cm = new ChunkerModel(cmis);
      chunker = new ChunkerME(cm);
      pamis = new FileInputStream(
        "/path/to/opennlp/models/en-parser-chunking.bin");
      ParserModel pam = new ParserModel(pamis);
      parser = ParserFactory.create(pam);
      for (String sentence : TEST_SENTENCES) {
        System.out.println("sentence: " + sentence);
        sentence = removeCoordinatePrefixHyphens(sentence);
        int expectedNumPhrases = countConjWords(sentence);
        int actualNumPhrases = 0;
        Span[] tokenSpans = tokenizer.tokenizePos(sentence);
        String[] tokens = new String[tokenSpans.length];
        for (int i = 0; i < tokenSpans.length; i++) {
          tokens[i] = tokenSpans[i].getCoveredText(sentence).toString();
        }
        // preprocess tokens to rewrite prefix words
        preprocessPrefixes(prefixes, ccs, tokens);
        String[] tags = posTagger.tag(tokens);
        Set<String> candidatePhrases = new HashSet<String>();
        // shallow parse using chunker
        Span[] chunks = chunker.chunkAsSpans(tokens, tags);
        System.out.println(showExplanation(tokens, tags, chunks));
        for (Span chunk : chunks) {
          int chunkTokenStart = chunk.getStart();
          int chunkTokenEnd = chunk.getEnd();
          if ("NP".equals(chunk.getType()) &&
              containsToken(tokens, chunkTokenStart, chunkTokenEnd, ccs)) {
            String np = StringUtils.join(
              ArrayUtils.subarray(tokens, chunkTokenStart, chunkTokenEnd),
              " ");
            if (hasOverlapOnSlashSeparator(np)) {
              np = np.replace(" , ", ", ").
                replace(", and ", " and ").
                replace("-", " "); // clean up
              candidatePhrases.add(np);
              expectedNumPhrases--;
            } else {
              actualNumPhrases++;
            }
          }
        }
        // fallback to deep parse using parser
        if (actualNumPhrases < expectedNumPhrases) {
          Parse[] parses = ParserTool.parseLine(sentence, parser, 1);
          for (Parse parse : parses) {
            walkParseTree_r(sentence, parse, candidatePhrases);
          }
        }
        // print candidate phrases
        for (String candidatePhrase : candidatePhrases) {
          System.out.println(".. " + candidatePhrase);
          List<String> expandedPhrases = expandPhrase(candidatePhrase);
          for (String expandedPhrase : expandedPhrases) {
            System.out.println(".... " + expandedPhrase);
          }
        }
        // match phrases against taxonomy
      }
    } finally {
      IOUtils.closeQuietly(tmis);
      IOUtils.closeQuietly(pmis);
      IOUtils.closeQuietly(cmis);
    }
  }

  private List<String> expandPhrase(String candidatePhrase) {
    List<String> expandedPhrases = new ArrayList<String>();
    String[] tokens = StringUtils.split(candidatePhrase, " ");
    int ccpos = 0;
    String cctype = null;
    // first pass, find the position of the conjunction
    for (int i = 0; i < tokens.length; i++) {
      if ("and".equalsIgnoreCase(tokens[i]) ||
          "or".equalsIgnoreCase(tokens[i])) {
        ccpos = i;
        cctype = tokens[i];
        break;
      }
      if (tokens[i].indexOf('/') > -1) {
        ccpos = i;
        cctype = "/";
        break;
      }
    }
    if (ccpos > 0) {
      String phrasePre = "";
      String phrasePost = "";
      List<String> ccwords = new ArrayList<String>();
      if ("/".equals(cctype)) {
        // handles the following cases:
        // xx A/B/.. yy => xx A yy, xx B yy, ...
        ccwords.addAll(
          Arrays.asList(StringUtils.split(tokens[ccpos], "/")));
        phrasePre = StringUtils.join(
          Arrays.asList(
          ArrayUtils.subarray(tokens, 0, ccpos)).iterator(), " ");
        phrasePost = StringUtils.join(
          Arrays.asList(
          ArrayUtils.subarray(tokens, ccpos+1, tokens.length)).iterator(), 
          " ");
      } else {
        if (ccpos > 0 && ccpos < tokens.length - 1) {
          // handles the following cases:
          // xx A (and|or) B C yy => xx A C yy, xx B C yy
          // xx A B (and|or) C yy => xx A C yy, xx B C yy
          ccwords.add(tokens[ccpos - 1]);
          ccwords.add(tokens[ccpos + 1]);
          // look back from ccpos-1 until we stop seeing
          // words with trailing commas
          int currpos = ccpos - 2;
          while (currpos >= 0) {
            if (tokens[currpos].endsWith(",")) {
              ccwords.add(tokens[currpos].substring(
                0, tokens[currpos].length() - 1));
              currpos--;
            } else {
              break;
            }
          }
          if (currpos >= 0) {
            phrasePre = StringUtils.join(
              Arrays.asList(ArrayUtils.subarray(
              tokens, 0, currpos+1)), " ");
          }
          if (ccpos + 2 < tokens.length) {
            phrasePost = StringUtils.join(
              Arrays.asList(ArrayUtils.subarray(
              tokens, ccpos + 2, tokens.length)), " ");
          }
        }
      }
      for (String ccword : ccwords) {
        expandedPhrases.add(StringUtils.join(new String[] {
          phrasePre, ccword, phrasePost}, " "));
      }
    }
    return expandedPhrases;
  }

  private void walkParseTree_r(String sentence, Parse parse,
      Set<String> candidatePhrases) {
    Span span = parse.getSpan();
    try {
      String text = span.getCoveredText(sentence).toString();
      if ("and".equalsIgnoreCase(text) || "or".equalsIgnoreCase(text)) {
        Parse pparse = parse.getCommonParent(parse);
        Span pspan = pparse.getSpan();
        String chunk = pspan.getCoveredText(sentence).toString();
        if (! ("and".equalsIgnoreCase(chunk) ||
            "or".equalsIgnoreCase(chunk))) {
          // remove trailing punctuation
          if (chunk.matches("^.*\\p{Punct}$")) {
            chunk = chunk.substring(0, chunk.length() - 1);
          }
          chunk = chunk.replace("-", " ");
          candidatePhrases.add(chunk);
        }
      }
    } catch (Exception e) {
      // attempt to ignore IllegalArgumentException caused by
      // the span.getCoveredText() attempting to lookup a larger
      // span than the sentence length. Nothing much I can do 
      // here, this is probably an issue with OpenNLP.
    }
    for (Parse child : parse.getChildren()) {
      walkParseTree_r(sentence, child, candidatePhrases);
    }
  }

  private int countConjWords(String sentence) {
    int numConjWords = 0;
    for (String conjWord : CONJ_WORDS) {
      numConjWords += StringUtils.countMatches(sentence, 
        " " + conjWord + " ");
    }
    numConjWords += StringUtils.countMatches(sentence, "/");
    return numConjWords;
  }

  private String removeCoordinatePrefixHyphens(String sentence) {
    return sentence.replaceAll("- and ", "  and ").
      replaceAll("- or ", "  or ");
  }
  
  private void preprocessPrefixes(Set<String> prefixes, 
      Set<String> ccs, String[] tokens) {
    for (int i = 0; i < tokens.length; i++) {
      if (ccs.contains(StringUtils.lowerCase(tokens[i]))) {
        // check before and after
        String preWord = (i > 0) ? 
          StringUtils.lowerCase(tokens[i-1]) : null;
        String postWord = (i < tokens.length - 1) ? 
          StringUtils.lowerCase(tokens[i+1]) : null;
        if (preWord != null && postWord != null) {
          if (preWord.endsWith("-")) 
            preWord = preWord.substring(0, preWord.length() - 1);
          if (prefixes.contains(preWord)) {
            // attempt to split postWord along one of the prefixes
            String bareWord = null;
            for (String prefix : prefixes) {
              if (postWord.startsWith(prefix)) {
                bareWord = postWord.substring(prefix.length());
                break;
              }
            }
            if (StringUtils.isNotEmpty(bareWord)) {
              tokens[i-1] = preWord + bareWord;
            }
          }
        }
      }
    }
  }

  private String showExplanation(String[] tokens, String[] tags, 
      Span[] chunks) {
    StringBuilder buf = new StringBuilder();
    int ci = 0;
    for (int i = 0; i < tokens.length; i++) {
      if (ci < chunks.length && chunks[ci].getStart() == i) {
        buf.append("[");
      }
      buf.append(tokens[i]).append("/").append(tags[i]).append(" ");
      if (ci < chunks.length && chunks[ci].getEnd() - 1 == i) {
        buf.append("]/").append(chunks[ci].getType()).append(" ");
        ci++;
      }
    }
    return buf.toString();
  }

  private boolean containsToken(String[] tokens, int start,
      int end, Set<String> ccs) {
    String[] chunkTokens = (String[]) ArrayUtils.subarray(
      tokens, start, end);
    for (String chunkToken : chunkTokens) {
      if (ccs.contains(StringUtils.lowerCase(chunkToken))) {
        return true;
      }
      if (chunkToken.contains("/")) {
        return true;
      }
    }
    return false;
  }

  private boolean hasOverlapOnSlashSeparator(String np) {
    int slashPos = np.indexOf('/');
    if (slashPos > -1) {
      int start = np.lastIndexOf(' ', slashPos-1) + 1;
      if (start == -1) start = 0;
      int end = np.indexOf(' ', slashPos+1);
      if (end == -1) end = np.length();
      String prevWord = np.substring(start, slashPos);
      String nextWord = np.substring(slashPos+1, end);
      double similarity = computeStringSimilarity(prevWord, nextWord);
      return similarity > 0.0D;
    }
    return true;
  }

  public double computeStringSimilarity(String s1, String s2) {
    ArrayList<String> pairs1 = letterPairs(s1);
    ArrayList<String> pairs2 = letterPairs(s2);
    int intersection = 0;
    int union = pairs1.size() + pairs2.size();
    for (int i = 0; i < pairs1.size(); i++) {
      Object pair1=pairs1.get(i);
      for(int j = 0; j < pairs2.size(); j++) {
        Object pair2=pairs2.get(j);
        if (pair1.equals(pair2)) {
          intersection++;
          pairs2.remove(j);
          break;
        }
      }
    }
    return (2.0 * intersection) / union;
  }

  private ArrayList<String> letterPairs(String str) {
    int numPairs = str.length()-1;
    ArrayList<String> pairs = new ArrayList<String>();
    for (int i = 0; i < numPairs; i++) {
        pairs.add(str.substring(i, i + 2));
    }
    return pairs;
  }
}

which when run, yield results as shown below:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
Viral hepatitis, including hepatitis A, B, and C, are distinct 
diseases that affect the liver.
.. hepatitis A, B and C
.... hepatitis B 
.... hepatitis C 
.... hepatitis A 
This page contains late breaking information, as well as an archival 
record of updates on safety and regulatory issues related to Hepatitis 
A and B, including product approvals, significant labeling changes, 
safety warnings, notices of upcoming public meetings, and notices about 
proposed regulatory guidances.
.. safety and regulatory issues
....  safety issues
....  regulatory issues
.. Hepatitis A and B
.... Hepatitis A 
.... Hepatitis B 
.. upcoming public meetings, and notices
.... upcoming public meetings, 
.... upcoming public notices 
Lead poisoning can cause an increased risk of brain, lung, stomach and 
kidney cancer.
.. brain, lung, stomach and kidney cancer
....  stomach cancer
....  kidney cancer
....  lung cancer
....  brain cancer
Before we look at the difference between diabetes type-I and II, let's 
firstly look at diabaetes in general.
.. diabetes type I and II
.... diabetes type I 
.... diabetes type II 
Restricting and purging anorexia are two different types of anorexia 
that people suffer from.
.. Restricting and purging anorexia
....  Restricting anorexia
....  purging anorexia
Here are some tips on pre and post surgery nutrition.
.. pre and post surgery nutrition
....  pre surgery nutrition
....  post surgery nutrition
A computer-based register linked to thyroid diagnostic laboratories 
was used to continuously identify all new cases of overt hyper- and 
hypothyroidism in two population cohorts with moderate and mild ID, 
respectively (Aalborg, n = 310,124; urinary iodine, 45 micro g/liter; 
and Copenhagen, n = 225,707; urinary iodine, 61 micro g/liter).
.. 310,124; urinary iodine, 45 micro g/liter; and Copenhagen, n
.... 310,124; urinary iodine, 45 micro g and Copenhagen, n
.... 310,124; urinary iodine, 45 micro liter; and Copenhagen, n
.. moderate and mild ID
....  moderate ID
....  mild ID
.. overt hyperthyroidism and hypothyroidism
.... overt hyperthyroidism 
.... overt hypothyroidism 
Medical and assistive devices are taxable for the most part, 
unconditionally zero-rated in certain cases, and conditionally 
zero-rated in certain cases.
.. Medical and assistive devices
....  Medical devices
....  assistive devices
.. Medical and assistive devices are taxable for the most part, 
unconditionally zero rated in certain cases, and conditionally 
zero rated in certain cases
....  Medical devices are taxable for the most part, unconditionally 
zero rated in certain cases, and conditionally zero rated in certain 
cases
....  assistive devices are taxable for the most part, unconditionally 
zero rated in certain cases, and conditionally zero rated in certain 
cases
These regions correspond to the least well conserved regions of the 
whole miRNA/snoRNA molecules.
.. the whole miRNA/snoRNA molecules
.... the whole miRNA molecules
.... the whole snoRNA molecules
Hetero- and Homogeneous mixtures are alike because they are both 
compounds, and both made up of different elements.
.. heterogeneous and Homogeneous mixtures
....  heterogeneous mixtures
....  Homogeneous mixtures

The final step (which I haven't done here) is to take each of these expanded phrases and hit the taxonomy index to map them to concepts. This will also get rid of some of the noisy phrases.

Conclusion

The approach described above solves a very simple subset of the more general problem of coordinate construction/expansion. It does this by identifying noun phrases in text containing trigger words "AND", "OR" and "/", and expanding the phrases into multiple phrases along the trigger words.

Our current (proprietary, and not described here) solution for this problem uses taxonomy lookups rather than NLP to determine the phrase boundaries. While it has some advantages in terms of accuracy (phrases such as "diabetes, lung and heart disease" will get correctly expanded to "diabetes", "lung disease" and "heart disease" rather than "diabetes disease", "lung disease" and "heart disease"), it involves a lot more processing (in user level code).

A possible improvement to the tiered lookup (chunker then parser) approach described above could be to build a custom rule-based chunker that works on the POS tags generated by OpenNLP, similar to NLTK's RegexpChunkParser. I will check out this approach, and if it works out, will replace the tiered lookup code.

References

Here are some reference type articles that were suggested by folks in the LinkedIn discussion, or that I found by following the pointers provided there. All of these attempt to solve the more general problem of coordinate construction, and as such are not directly applicable to my solution, but they did provide useful ideas, so including them here in case you want to read more about this stuff.

  1. Dealing with Conjunctions in a Machine Translation Environment, by Xiuming Huang, Institute of Linguistics, Chinese Academy of Social Sciences, Beijing, China. (PDF)
  2. Lexicase Parsing: A Lexicon-driven approach to Syntactic Analusis, by Stanley Starosta and Hirosato Nomura, COLING'86 Proceedings of the 11th conference on Computational Linguistics, doi:10.3115/991365.991400. (PDF)
  3. Comparing methods for the syntactic simplification of sentences in information extraction by Richard J Evans, Lit Linguistic Computing (2011), doi: 10.1093/llc/fqr034. (PDF)

15 comments (moderated to prevent spam):

nitin said...

Really nice explanation of the idea.

Sujit, I've been following your blog for 3-4 months now and I did wanted to discuss with you some of the problems that I'm facing related to search in my application.

Can you provide me you contact info (email or phone so that I can discuss with you)?

Thanks
Nitin

Sujit Pal said...

Hi Nitin, I would prefer to not make my email address crawlable, so if you send me a comment with contact info, I can use that (I will delete that comment for your privacy of course).

Lance N. said...

Hello-

Are you willing to donate this code or these algorithms to Apache? I'm working on an OpenNLP addition to Solr, and this is a good example.

Sujit Pal said...

Hi Lance, sure. Let me know if I need to do anything to move that along. Just curious, is there a JIRA/WebPage on which your work is described, would be interesting how you plan to use OpenNLP with Solr (I've always thought of OpenNLP as something you would want to use on text.

Jun said...

Hi Pal,
In you blog,
"I remembered one comment which said that this problem is trivial if you have a good parse tree"

How do you use the parse tree to solve this problem?

Sujit Pal said...

Hi Jun, I have described it in the blog immediately after the statement... essentially you convert the sentence into its equivalent parse-tree and look for NP nodes containing a CC, and see if there are two or more NPs hanging off it - these form the parts that you will need to build the expansion off of.

Java Developer said...

what are the jar required to run this class .

Sujit Pal said...

Hi, you will need the commons-io and commons-lang and the opennlp JARs - the ones I used are commons-io-2.0.1.jar, commons-lang-3.0.1.jar, opennlp-maxent-3.0.1-incubating.jar and opennlp-tools-1.5.1-incubating.jar. Additionally this particular class is a JUnit test so you would need JUnit 4.0+.

Anonymous said...

can we find the relation between the noun pharse using the stanford dependency parser java

Sujit Pal said...

Hi Anonymous, thanks, an excellent idea! Even though I haven't tried this (obviously :-)), definitely worth checking out.

Anurag said...

I have started looking into the co-ordination problem using openNLP's chunker and I too did notice the chunker is not so accurate.

You can try a better chunker (in my tests) here: http://cogcomp.cs.illinois.edu/demo/shallowparse/

This may solve most of the problems - including not having to make use of the parser after chunking.

Sujit Pal said...

Thanks Anurag, looks interesting, but the link you provided links to a demo, and it timed out on the first sentence in my test set. Can you point me to where I can get the JAR for this?

Anurag said...

Sujit, It is working for me. You can download the chunker here: http://cogcomp.cs.illinois.edu/page/software

Sujit Pal said...

Thanks, lots of cool things here :-).

Sujit Pal said...

Thanks, lots of cool things here :-).