Background
Despite the fancy name, the problem I am trying to solve here is to eliminate missed opportunities when mapping text such as "hepatitis B and C" against our taxonomy, which contains the concepts "hepatitis B" and "hepatitis C". Thanks to the many kind folks on the LinkedIn Natural Language Processing People group, I now know that this is variously called (or covered under) Distributive Predication, Coordinate Expansion/Construction, and Transformational Grammar. They also provided pointers and links to some quite helpful papers, which I list in the References below.
My problem is more limited than the one the work described in the references aims to solve, however. My toy mapping application, TGNI, already uses OpenNLP to break the input text into sentences, so my problem is to detect these patterns in the sentences and expand them. For testing, here are some sentences I pulled off the internet (and in some cases adapted to fit); each contains one or more of the patterns I am interested in expanding.
- Viral hepatitis, including hepatitis A, B, and C, are distinct diseases that affect the liver. (webmd.com)
- This page contains late breaking information, as well as an archival record of updates on safety and regulatory issues related to Hepatitis A and B, including product approvals, significant labeling changes, safety warnings, notices of upcoming public meetings, and notices about proposed regulatory guidances. (fda.gov)
- Lead poisoning can cause an increased risk of brain, lung, stomach and kidney cancer. (cleanupblackwell.com)
- Before we look at the difference between diabetes type I and II, let's firstly look at diabetes in general. (medicalnewstoday.com)
- Restricting and purging anorexia are two different types of anorexia that people suffer from (anorexiasurvivalguide.com)
- Here are some tips on pre and post surgery nutrition. (bestofmotherearth.com)
- A computer-based register linked to thyroid diagnostic laboratories was used to continuously identify all new cases of overt hyper- or hypothyroidism in two population cohorts with moderate and mild ID, respectively (Aalborg, n = 310,124; urinary iodine, 45 micro g/liter; and Copenhagen, n = 225,707; urinary iodine, 61 micro g/liter). (nlm.nih.gov)
- Medical and assistive devices are taxable for the most part, unconditionally zero-rated in certain cases, and conditionally zero-rated in certain cases. (revenuequebec.ca)
- These regions correspond to the least well conserved regions of the whole miRNA/snoRNA molecules. (ploscompbiol.org)
- Hetero- and Homogeneous mixtures are alike because they are both compounds, and both made up of different elements. (answers.com)
Identifying Candidate Phrases
Noun Phrases with AND, OR and slash
Based on some observations from extracting noun phrases with OpenNLP, I realized that I could probably get most of the candidate phrases if I considered only the noun phrases containing "AND", "OR" or "/". Checking this is fairly trivial, so I tried that first. This found 9 of the 12 expected phrases in my test set.
Removing noisy Noun Phrases with slash
The noun phrases retrieved also contained a few noise phrases, such as "45 micro g/liter", which are not candidates for coordinate expansion, along with phrases such as "the whole miRNA/snoRNA molecules", which are. To separate the two, I added a rule requiring a non-zero character overlap between the words preceding and following the "/" character.
I used the Strike-a-Match algorithm to compute string similarity; some of the code below is adapted directly from the Java code explaining the algorithm.
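To see how the filter behaves, here is a minimal standalone sketch of the letter-pair similarity check (the class name is mine; the full listing below does the same computation in computeStringSimilarity() and hasOverlapOnSlashSeparator()). "miRNA" and "snoRNA" share the pairs "RN" and "NA" and survive, while "g" and "liter" share no pairs and are dropped.

import java.util.ArrayList;
import java.util.List;

public class StrikeAMatchSketch {
  // break a string into its adjacent letter pairs
  static List<String> letterPairs(String s) {
    List<String> pairs = new ArrayList<String>();
    for (int i = 0; i < s.length() - 1; i++) {
      pairs.add(s.substring(i, i + 2));
    }
    return pairs;
  }
  // similarity = 2 * |shared pairs| / (|pairs1| + |pairs2|)
  static double similarity(String s1, String s2) {
    List<String> p1 = letterPairs(s1);
    List<String> p2 = new ArrayList<String>(letterPairs(s2));
    int union = p1.size() + p2.size();
    int intersection = 0;
    for (String pair : p1) {
      if (p2.remove(pair)) intersection++; // count each shared pair once
    }
    return union == 0 ? 0.0 : (2.0 * intersection) / union;
  }
  public static void main(String[] args) {
    System.out.println(similarity("miRNA", "snoRNA")); // ~0.44: keep the phrase
    System.out.println(similarity("g", "liter"));      // 0.0: filter it out
  }
}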
Expanding dangling prefix words
Looking through the test set, I realized that OpenNLP seems to have a hard time handling dangling prefix words such as "hyper-": it treats the "-" as a separate token and gets all confused, as shown below.
[A/DT computer-based/JJ register/NN ]/NP [linked/VBN ]/VP [to/TO thyroid/VB ]/VP [diagnostic/JJ laboratories/NNS ]/NP [was/VBD used/VBN ]/VP [to/TO continuously/RB identify/VB ]/VP [all/DT new/JJ cases/NNS ]/NP [of/IN ]/PP [overt/JJ hyper/NN ]/NP -/: and/CC [hypothyroidism/NN ]/NP [in/IN ]/PP [two/CD population/NN cohorts/NNS ]/NP [with/IN ]/PP [moderate/JJ and/CC mild/JJ ID/NNP ]/NP ,/, [respectively/RB ]/ADVP [(/-LRB- Aalborg/UH ]/NP ,/, [n/RB ]/ADVP [=/SYM ]/VP [310,124/CD ]/NP ;/: [urinary/JJ iodine/NN ]/NP ,/, [45/CD micro/JJ g/liter/NN ]/NP ;/: and/CC [Copenhagen/NNP ]/NP ,/, [n/RB =/SYM 225,707/CD ]/NP ;/: [urinary/JJ iodine/NN ]/NP ,/, [61/CD micro/JJ g/liter/NN )/-RRB- ]/NP ./.
So I figured that if we expanded such dangling prefixes into the complete word using the token after the conjunction, i.e., expanding strings such as "hyper- and hypothyroidism" to "hyperthyroidism and hypothyroidism", it might make it easier for OpenNLP to tag them correctly.
I used some of the prefixes from the Wikipedia page on English prefixes. When I see one of these words preceding an "AND" or "OR" token, I look at the word following the token, split it along the prefix it starts with, and append the remainder to the preceding prefix word. This takes care of 2 more examples in my test set.
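Here is a minimal sketch of the rewrite on a single hard-coded example (the class name is mine; the full version is preprocessPrefixes() in the listing below, which also handles lowercasing, hyphen stripping and boundary checks):

public class PrefixRewriteSketch {
  public static void main(String[] args) {
    // tokens for "overt hyper- and hypothyroidism" after the hyphen
    // has been stripped from the dangling prefix
    String[] tokens = {"overt", "hyper", "and", "hypothyroidism"};
    // the word after the conjunction starts with the known prefix "hypo";
    // graft its remainder onto the dangling prefix before the conjunction
    String suffix = tokens[3].substring("hypo".length()); // "thyroidism"
    tokens[1] = tokens[1] + suffix;                       // "hyperthyroidism"
    System.out.println(String.join(" ", tokens));
    // prints: overt hyperthyroidism and hypothyroidism
  }
}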
Phrase patterns and Parse Trees
I also noticed that some of my patterns seem to be incorrectly categorized (i.e., not as noun phrases), as in these two chunkings:
[Restricting/VBG and/CC purging/VBG ]/VP [anorexia/DT ]/NP [are/VBP ]/VP [two/CD different/JJ types/NNS ]/NP [of/IN ]/PP [anorexia/NN ]/NP [that/IN ]/PP [people/NNS ]/NP [suffer/VBP ]/VP [from/IN ]/PP ./.
[Before/IN ]/SBAR [we/PRP ]/NP [look/VBP ]/VP [at/IN ]/PP [the/DT difference/NN ]/NP [between/IN ]/PP [diabetes/NNS ]/NP [type-I/JJ ]/ADJP and/CC [II/NNP ]/NP ,/, [let/VB ]/VP ['s/PRP ]/NP [firstly/RB ]/ADVP [look/VBP ]/VP [at/IN ]/PP [diabaetes/NNS ]/NP [in/IN ]/PP [general/JJ ]/ADJP ./.
I suspect that these are training issues with the default models provided with the OpenNLP distribution I am using. Making my own training set is unfortunately not feasible given my resources. At this point, I considered making up rules to extract (VP,NP) and (NP,ADJP,CC,NP) sequences as well as noun phrases, but realized that this may lead to overfitting.
I remembered one comment which said that this problem is trivial if you have a good parse tree, so I decided to see if OpenNLP could give me a parse tree I could work with for these corner cases (going by my test set, they account for just 2 of my 12 cases). Here are the parse trees for both sentences considered above.
(TOP (S (SBAR (IN Before) (S (NP (PRP we) ) (VP (VBP look) (PP (IN at) (NP (NP (DT the) (NN difference) ) (PP (IN between) (NP (NP (NN diabetes) (NN type-I) ) (CC and) (NP (NNP II,) ) ) ) ) ) ) ) ) (VP (VBD let's) (RB firstly) (VP (VB look) (PP (IN at) (NP (NP (NNS diabaetes) ) (PP (IN in) (NP (NN general.) ) ) ) ) ) ) ) )
(TOP (S (S (VP (VP (VBG Restricting) ) (CC and) (VP (VBG purging) (NP (DT anorexia) ) ) ) ) (VP (VBP are) (NP (NP (NP (CD two) (JJ different) (NNS types) ) (PP (IN of) (NP (NN anorexia) ) ) ) (PP (IN that) (NP (NNS people) ) ) ) (VBP suffer) ) (. from.) ) )
I wanted to use the parser as a fallback to the chunker, so I count the number of "AND", "OR" and "/" in the sentence, and if the number of phrases discovered by the chunker is less than this, I invoke the parser and generate the tree. I did consider using only the parser, but there are instances where the chunker is "more correct" than the parser, and the chunker catches most of my phrases anyway, so it makes sense to use the parser to handle the corner cases. The other reason is that the chunker is less resource-intensive than the parser.
Anyway, the fallback check above exposes 6 occurrences where the chunker did not find all the phrases containing "AND", "OR" and "/", of which only 2 are genuine misses from my point of view. So we end up invoking the parser 4 more times than we strictly need to.
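Stripped of the OpenNLP plumbing, the fallback decision is just a count comparison. Here is a simplified sketch (class and method names are mine; the full listing below tracks the counts while scanning the chunks):

public class TieredLookupSketch {
  // count the "and"/"or"/"/" triggers in a sentence, as countConjWords()
  // does in the full listing below
  static int countTriggers(String sentence) {
    int n = 0;
    String lower = sentence.toLowerCase();
    for (String cc : new String[] {" and ", " or ", "/"}) {
      int from = 0;
      while ((from = lower.indexOf(cc, from)) != -1) {
        n++;
        from += cc.length();
      }
    }
    return n;
  }
  // invoke the (slower) parser only when the chunker surfaced fewer
  // trigger-bearing phrases than there are triggers in the sentence
  static boolean needsParserFallback(String sentence, int numChunkerPhrases) {
    return numChunkerPhrases < countTriggers(sentence);
  }
  public static void main(String[] args) {
    String s = "Lead poisoning can cause an increased risk of brain, "
        + "lung, stomach and kidney cancer.";
    System.out.println(countTriggers(s));          // 1
    System.out.println(needsParserFallback(s, 0)); // true: fall back to parser
    System.out.println(needsParserFallback(s, 1)); // false: chunker sufficed
  }
}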
Expanding the phrase
I initially thought it would be relatively simple to pass the retrieved phrases through a chain of regular-expression parsers, but then realized that a custom parser would probably be easier (and more general), so I built a token-based parser.
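For a flavor of what the expansion does, here is a minimal standalone sketch handling the "xx A, B and C yy" case (the class name is mine; the full expandPhrase() in the listing below also handles "or" and the "/" separator):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ExpandSketch {
  public static List<String> expand(String phrase) {
    String[] tokens = phrase.split("\\s+");
    List<String> expansions = new ArrayList<String>();
    // find the conjunction
    int cc = -1;
    for (int i = 0; i < tokens.length; i++) {
      if ("and".equalsIgnoreCase(tokens[i])) { cc = i; break; }
    }
    if (cc <= 0 || cc >= tokens.length - 1) return expansions;
    // the coordinated words: one on each side of the conjunction, plus any
    // comma-terminated words looking further back
    List<String> heads = new ArrayList<String>();
    heads.add(tokens[cc - 1]);
    heads.add(tokens[cc + 1]);
    int i = cc - 2;
    while (i >= 0 && tokens[i].endsWith(",")) {
      heads.add(tokens[i].substring(0, tokens[i].length() - 1));
      i--;
    }
    // everything before and after the coordination is shared context
    String pre = String.join(" ", Arrays.copyOfRange(tokens, 0, i + 1));
    String post = String.join(" ", Arrays.copyOfRange(tokens, cc + 2, tokens.length));
    for (String head : heads) {
      expansions.add((pre + " " + head + " " + post).trim().replaceAll("\\s+", " "));
    }
    return expansions;
  }
  public static void main(String[] args) {
    System.out.println(expand("brain, lung, stomach and kidney cancer"));
    // [stomach cancer, kidney cancer, lung cancer, brain cancer]
  }
}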
Code
Here is the code to do all this, written as a JUnit test. It will eventually become a UIMA Analysis Engine added to the front-end aggregate AE in TGNI.
// Source: src/test/java/com/mycompany/tgni/uima/annotators/nlp/CoordinateConstructionTest.java
package com.mycompany.tgni.uima.annotators.nlp;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import opennlp.tools.chunker.Chunker;
import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;
import opennlp.tools.cmdline.parser.ParserTool;
import opennlp.tools.parser.Parse;
import opennlp.tools.parser.Parser;
import opennlp.tools.parser.ParserFactory;
import opennlp.tools.parser.ParserModel;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTagger;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;
import org.apache.commons.io.IOUtils;
import org.apache.commons.lang.ArrayUtils;
import org.apache.commons.lang.StringUtils;
import org.junit.Test;
public class CoordinateConstructionTest {
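  // the test sentences from the Background section above, elided for brevity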
private final String[] TEST_SENTENCES = new String[] { ... };
private static final String[] CONJ_WORDS = new String[] {
"and", "or",
};
// a sample of prefixes, from the Wikipedia page on English prefixes
private static final String[] PREFIX_WORDS = new String[] {
"pre", "post", "hypo", "hyper", "inter", "intra",
"over", "under", "infra", "ultra", "hetero", "homo",
};
@Test
public void findCandidatePhrases() throws Exception {
Tokenizer tokenizer = null;
POSTagger posTagger = null;
Chunker chunker = null;
Parser parser = null;
InputStream tmis = null;
InputStream pmis = null;
InputStream cmis = null;
InputStream pamis = null;
Set<String> ccs = new HashSet<String>(Arrays.asList(CONJ_WORDS));
Set<String> prefixes = new HashSet<String>(Arrays.asList(PREFIX_WORDS));
try {
tmis = new FileInputStream("/path/to/opennlp/models/en-token.bin");
TokenizerModel tm = new TokenizerModel(tmis);
tokenizer = new TokenizerME(tm);
pmis = new FileInputStream("/path/to/opennlp/models/en-pos-maxent.bin");
POSModel pm = new POSModel(pmis);
posTagger = new POSTaggerME(pm);
cmis = new FileInputStream("/path/to/opennlp/models/en-chunker.bin");
ChunkerModel cm = new ChunkerModel(cmis);
chunker = new ChunkerME(cm);
pamis = new FileInputStream(
"/path/to/opennlp/models/en-parser-chunking.bin");
ParserModel pam = new ParserModel(pamis);
parser = ParserFactory.create(pam);
for (String sentence : TEST_SENTENCES) {
System.out.println("sentence: " + sentence);
sentence = removeCoordinatePrefixHyphens(sentence);
int expectedNumPhrases = countConjWords(sentence);
int actualNumPhrases = 0;
Span[] tokenSpans = tokenizer.tokenizePos(sentence);
String[] tokens = new String[tokenSpans.length];
for (int i = 0; i < tokenSpans.length; i++) {
tokens[i] = tokenSpans[i].getCoveredText(sentence).toString();
}
// preprocess tokens to rewrite prefix words
preprocessPrefixes(prefixes, ccs, tokens);
String[] tags = posTagger.tag(tokens);
Set<String> candidatePhrases = new HashSet<String>();
// shallow parse using chunker
Span[] chunks = chunker.chunkAsSpans(tokens, tags);
System.out.println(showExplanation(tokens, tags, chunks));
for (Span chunk : chunks) {
int chunkTokenStart = chunk.getStart();
int chunkTokenEnd = chunk.getEnd();
if ("NP".equals(chunk.getType()) &&
containsToken(tokens, chunkTokenStart, chunkTokenEnd, ccs)) {
String np = StringUtils.join(
ArrayUtils.subarray(tokens, chunkTokenStart, chunkTokenEnd),
" ");
if (hasOverlapOnSlashSeparator(np)) {
np = np.replace(" , ", ", ").
replace(", and ", " and ").
replace("-", " "); // clean up
candidatePhrases.add(np);
expectedNumPhrases--;
} else {
actualNumPhrases++;
}
}
}
// fallback to deep parse using parser
if (actualNumPhrases < expectedNumPhrases) {
Parse[] parses = ParserTool.parseLine(sentence, parser, 1);
for (Parse parse : parses) {
walkParseTree_r(sentence, parse, candidatePhrases);
}
}
// print candidate phrases
for (String candidatePhrase : candidatePhrases) {
System.out.println(".. " + candidatePhrase);
List<String> expandedPhrases = expandPhrase(candidatePhrase);
for (String expandedPhrase : expandedPhrases) {
System.out.println(".... " + expandedPhrase);
}
}
// match phrases against taxonomy
}
} finally {
  IOUtils.closeQuietly(tmis);
  IOUtils.closeQuietly(pmis);
  IOUtils.closeQuietly(cmis);
  IOUtils.closeQuietly(pamis);
}
}
private List<String> expandPhrase(String candidatePhrase) {
List<String> expandedPhrases = new ArrayList<String>();
String[] tokens = StringUtils.split(candidatePhrase, " ");
int ccpos = 0;
String cctype = null;
// first pass, find the position of the conjunction
for (int i = 0; i < tokens.length; i++) {
if ("and".equalsIgnoreCase(tokens[i]) ||
"or".equalsIgnoreCase(tokens[i])) {
ccpos = i;
cctype = tokens[i];
break;
}
if (tokens[i].indexOf('/') > -1) {
ccpos = i;
cctype = "/";
break;
}
}
if (ccpos > 0) {
String phrasePre = "";
String phrasePost = "";
List<String> ccwords = new ArrayList<String>();
if ("/".equals(cctype)) {
// handles the following cases:
// xx A/B/.. yy => xx A yy, xx B yy, ...
ccwords.addAll(
Arrays.asList(StringUtils.split(tokens[ccpos], "/")));
phrasePre = StringUtils.join(
Arrays.asList(
ArrayUtils.subarray(tokens, 0, ccpos)).iterator(), " ");
phrasePost = StringUtils.join(
Arrays.asList(
ArrayUtils.subarray(tokens, ccpos+1, tokens.length)).iterator(),
" ");
} else {
if (ccpos > 0 && ccpos < tokens.length - 1) {
// handles the following cases:
// xx A (and|or) B C yy => xx A C yy, xx B C yy
// xx A B (and|or) C yy => xx A C yy, xx B C yy
ccwords.add(tokens[ccpos - 1]);
ccwords.add(tokens[ccpos + 1]);
// look back from ccpos-1 until we stop seeing
// words with trailing commas
int currpos = ccpos - 2;
while (currpos >= 0) {
if (tokens[currpos].endsWith(",")) {
ccwords.add(tokens[currpos].substring(
0, tokens[currpos].length() - 1));
currpos--;
} else {
break;
}
}
if (currpos >= 0) {
phrasePre = StringUtils.join(
Arrays.asList(ArrayUtils.subarray(
tokens, 0, currpos+1)), " ");
}
if (ccpos + 2 < tokens.length) {
phrasePost = StringUtils.join(
Arrays.asList(ArrayUtils.subarray(
tokens, ccpos + 2, tokens.length)), " ");
}
}
}
for (String ccword : ccwords) {
expandedPhrases.add(StringUtils.join(new String[] {
phrasePre, ccword, phrasePost}, " "));
}
}
return expandedPhrases;
}
private void walkParseTree_r(String sentence, Parse parse,
Set<String> candidatePhrases) {
Span span = parse.getSpan();
try {
String text = span.getCoveredText(sentence).toString();
if ("and".equalsIgnoreCase(text) || "or".equalsIgnoreCase(text)) {
Parse pparse = parse.getCommonParent(parse);
Span pspan = pparse.getSpan();
String chunk = pspan.getCoveredText(sentence).toString();
if (! ("and".equalsIgnoreCase(chunk) ||
"or".equalsIgnoreCase(chunk))) {
// remove trailing punctuation
if (chunk.matches("^.*\\p{Punct}$")) {
chunk = chunk.substring(0, chunk.length() - 1);
}
chunk = chunk.replace("-", " ");
candidatePhrases.add(chunk);
}
}
} catch (Exception e) {
// attempt to ignore IllegalArgumentException caused by
// the span.getCoveredText() attempting to lookup a larger
// span than the sentence length. Nothing much I can do
// here, this is probably an issue with OpenNLP.
}
for (Parse child : parse.getChildren()) {
walkParseTree_r(sentence, child, candidatePhrases);
}
}
private int countConjWords(String sentence) {
int numConjWords = 0;
for (String conjWord : CONJ_WORDS) {
numConjWords += StringUtils.countMatches(sentence,
" " + conjWord + " ");
}
numConjWords += StringUtils.countMatches(sentence, "/");
return numConjWords;
}
private String removeCoordinatePrefixHyphens(String sentence) {
return sentence.replaceAll("- and ", " and ").
replaceAll("- or ", " or ");
}
private void preprocessPrefixes(Set<String> prefixes,
Set<String> ccs, String[] tokens) {
for (int i = 0; i < tokens.length; i++) {
if (ccs.contains(StringUtils.lowerCase(tokens[i]))) {
// check before and after
String preWord = (i > 0) ?
StringUtils.lowerCase(tokens[i-1]) : null;
String postWord = (i < tokens.length - 1) ?
StringUtils.lowerCase(tokens[i+1]) : null;
if (preWord != null && postWord != null) {
if (preWord.endsWith("-"))
preWord = preWord.substring(0, preWord.length() - 1);
if (prefixes.contains(preWord)) {
// attempt to split postWord along one of the prefixes
String bareWord = null;
for (String prefix : prefixes) {
if (postWord.startsWith(prefix)) {
bareWord = postWord.substring(prefix.length());
break;
}
}
if (StringUtils.isNotEmpty(bareWord)) {
tokens[i-1] = preWord + bareWord;
}
}
}
}
}
}
private String showExplanation(String[] tokens, String[] tags,
Span[] chunks) {
StringBuilder buf = new StringBuilder();
int ci = 0;
for (int i = 0; i < tokens.length; i++) {
if (ci < chunks.length && chunks[ci].getStart() == i) {
buf.append("[");
}
buf.append(tokens[i]).append("/").append(tags[i]).append(" ");
if (ci < chunks.length && chunks[ci].getEnd() - 1 == i) {
buf.append("]/").append(chunks[ci].getType()).append(" ");
ci++;
}
}
return buf.toString();
}
private boolean containsToken(String[] tokens, int start,
int end, Set<String> ccs) {
String[] chunkTokens = (String[]) ArrayUtils.subarray(
tokens, start, end);
for (String chunkToken : chunkTokens) {
if (ccs.contains(StringUtils.lowerCase(chunkToken))) {
return true;
}
if (chunkToken.contains("/")) {
return true;
}
}
return false;
}
private boolean hasOverlapOnSlashSeparator(String np) {
  int slashPos = np.indexOf('/');
  if (slashPos > -1) {
    // isolate the words on either side of the slash
    int start = np.lastIndexOf(' ', slashPos - 1) + 1; // -1 (not found) maps to 0
    int end = np.indexOf(' ', slashPos + 1);
    if (end == -1) end = np.length();
    String prevWord = np.substring(start, slashPos);
    String nextWord = np.substring(slashPos + 1, end);
    double similarity = computeStringSimilarity(prevWord, nextWord);
    return similarity > 0.0D;
  }
  // no slash in the phrase, nothing to filter on
  return true;
}
public double computeStringSimilarity(String s1, String s2) {
  ArrayList<String> pairs1 = letterPairs(s1);
  ArrayList<String> pairs2 = letterPairs(s2);
  int intersection = 0;
  int union = pairs1.size() + pairs2.size();
  if (union == 0) return 0.0D; // both strings too short to form pairs
  for (int i = 0; i < pairs1.size(); i++) {
    String pair1 = pairs1.get(i);
    for (int j = 0; j < pairs2.size(); j++) {
      String pair2 = pairs2.get(j);
      if (pair1.equals(pair2)) {
        intersection++;
        pairs2.remove(j); // count each shared pair only once
        break;
      }
    }
  }
  return (2.0 * intersection) / union;
}
private ArrayList<String> letterPairs(String str) {
int numPairs = str.length()-1;
ArrayList<String> pairs = new ArrayList<String>();
for (int i = 0; i < numPairs; i++) {
pairs.add(str.substring(i, i + 2));
}
return pairs;
}
}
which, when run, yields the results shown below:
Viral hepatitis, including hepatitis A, B, and C, are distinct
diseases that affect the liver.
.. hepatitis A, B and C
.... hepatitis B
.... hepatitis C
.... hepatitis A
This page contains late breaking information, as well as an archival
record of updates on safety and regulatory issues related to Hepatitis
A and B, including product approvals, significant labeling changes,
safety warnings, notices of upcoming public meetings, and notices about
proposed regulatory guidances.
.. safety and regulatory issues
.... safety issues
.... regulatory issues
.. Hepatitis A and B
.... Hepatitis A
.... Hepatitis B
.. upcoming public meetings, and notices
.... upcoming public meetings,
.... upcoming public notices
Lead poisoning can cause an increased risk of brain, lung, stomach and
kidney cancer.
.. brain, lung, stomach and kidney cancer
.... stomach cancer
.... kidney cancer
.... lung cancer
.... brain cancer
Before we look at the difference between diabetes type-I and II, let's
firstly look at diabaetes in general.
.. diabetes type I and II
.... diabetes type I
.... diabetes type II
Restricting and purging anorexia are two different types of anorexia
that people suffer from.
.. Restricting and purging anorexia
.... Restricting anorexia
.... purging anorexia
Here are some tips on pre and post surgery nutrition.
.. pre and post surgery nutrition
.... pre surgery nutrition
.... post surgery nutrition
A computer-based register linked to thyroid diagnostic laboratories
was used to continuously identify all new cases of overt hyper- and
hypothyroidism in two population cohorts with moderate and mild ID,
respectively (Aalborg, n = 310,124; urinary iodine, 45 micro g/liter;
and Copenhagen, n = 225,707; urinary iodine, 61 micro g/liter).
.. 310,124; urinary iodine, 45 micro g/liter; and Copenhagen, n
.... 310,124; urinary iodine, 45 micro g and Copenhagen, n
.... 310,124; urinary iodine, 45 micro liter; and Copenhagen, n
.. moderate and mild ID
.... moderate ID
.... mild ID
.. overt hyperthyroidism and hypothyroidism
.... overt hyperthyroidism
.... overt hypothyroidism
Medical and assistive devices are taxable for the most part,
unconditionally zero-rated in certain cases, and conditionally
zero-rated in certain cases.
.. Medical and assistive devices
.... Medical devices
.... assistive devices
.. Medical and assistive devices are taxable for the most part,
unconditionally zero rated in certain cases, and conditionally
zero rated in certain cases
.... Medical devices are taxable for the most part, unconditionally
zero rated in certain cases, and conditionally zero rated in certain
cases
.... assistive devices are taxable for the most part, unconditionally
zero rated in certain cases, and conditionally zero rated in certain
cases
These regions correspond to the least well conserved regions of the
whole miRNA/snoRNA molecules.
.. the whole miRNA/snoRNA molecules
.... the whole miRNA molecules
.... the whole snoRNA molecules
Hetero- and Homogeneous mixtures are alike because they are both
compounds, and both made up of different elements.
.. heterogeneous and Homogeneous mixtures
.... heterogeneous mixtures
.... Homogeneous mixtures
The final step (which I haven't done here) is to take each of these expanded phrases and hit the taxonomy index to map them to concepts. This will also get rid of some of the noisy phrases.
Conclusion
The approach described above solves a very simple subset of the more general problem of coordinate construction/expansion. It does this by identifying noun phrases in text containing trigger words "AND", "OR" and "/", and expanding the phrases into multiple phrases along the trigger words.
Our current (proprietary, and not described here) solution to this problem uses taxonomy lookups rather than NLP to determine phrase boundaries. While it has some accuracy advantages (a phrase such as "diabetes, lung and heart disease" is correctly expanded to "diabetes", "lung disease" and "heart disease" rather than "diabetes disease", "lung disease" and "heart disease"), it involves a lot more processing in user-level code.
A possible improvement to the tiered lookup (chunker, then parser) approach described above would be to build a custom rule-based chunker that works on the POS tags generated by OpenNLP, similar to NLTK's RegexpChunkParser. I will check out this approach, and if it works out, will replace the tiered lookup code.
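Purely as an illustration of the idea (the class name and the rule are mine, not a tested grammar), such a chunker could match regular expressions over the string of POS tags emitted by the tagger:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexpChunkSketch {
  public static void main(String[] args) {
    // POS tags for "brain , lung , stomach and kidney cancer"
    String tags = "NN , NN , NN CC NN NN ";
    // rule: one or more (possibly comma-separated) nouns, a coordinating
    // conjunction, then one or more nouns -- a candidate coordinate NP;
    // a real implementation would map the match back to token offsets
    Pattern rule = Pattern.compile("((NN\\S* |, )+CC (NN\\S* )+)");
    Matcher m = rule.matcher(tags);
    while (m.find()) {
      System.out.println("candidate chunk tags: " + m.group(1));
    }
  }
}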
References
Here are some references that were suggested by folks in the LinkedIn discussion, or that I found by following the pointers provided there. All of them attempt to solve the more general problem of coordinate construction, and as such are not directly applicable to my solution, but they did provide useful ideas, so I am including them here in case you want to read more about this topic.
- Dealing with Conjunctions in a Machine Translation Environment, by Xiuming Huang, Institute of Linguistics, Chinese Academy of Social Sciences, Beijing, China. (PDF)
- Lexicase Parsing: A Lexicon-driven Approach to Syntactic Analysis, by Stanley Starosta and Hirosato Nomura, COLING '86: Proceedings of the 11th Conference on Computational Linguistics, doi:10.3115/991365.991400. (PDF)
- Comparing methods for the syntactic simplification of sentences in information extraction, by Richard J. Evans, Literary and Linguistic Computing (2011), doi:10.1093/llc/fqr034. (PDF)
Comments
Really nice explanation of the idea.
Nitin: Sujit, I've been following your blog for 3-4 months now, and I wanted to discuss some of the problems I am facing related to search in my application. Can you provide me your contact info (email or phone) so that I can discuss them with you? Thanks, Nitin
Sujit: Hi Nitin, I would prefer not to make my email address crawlable, so if you send me a comment with contact info, I can use that (I will delete that comment for your privacy, of course).
Lance: Hello - are you willing to donate this code or these algorithms to Apache? I'm working on an OpenNLP addition to Solr, and this is a good example.
Sujit: Hi Lance, sure. Let me know if I need to do anything to move that along. Just curious, is there a JIRA issue or web page where your work is described? It would be interesting to see how you plan to use OpenNLP with Solr (I've always thought of OpenNLP as something you would want to use on text).
Jun: Hi Pal, in your blog you write, "I remembered one comment which said that this problem is trivial if you have a good parse tree." How do you use the parse tree to solve this problem?
Sujit: Hi Jun, I described it in the post immediately after that statement... essentially you convert the sentence into its equivalent parse tree, look for NP nodes containing a CC, and check whether two or more NPs hang off it - these form the parts from which you build the expansion.
Anonymous: What are the JARs required to run this class?
Sujit: Hi, you will need the commons-io, commons-lang and OpenNLP JARs - the ones I used are commons-io-2.0.1.jar, commons-lang-3.0.1.jar, opennlp-maxent-3.0.1-incubating.jar and opennlp-tools-1.5.1-incubating.jar. Additionally, this particular class is a JUnit test, so you will also need JUnit 4.0+.
Anonymous: Can we find the relation between the noun phrases using the Stanford dependency parser in Java?
Sujit: Hi Anonymous, thanks, an excellent idea! Even though I haven't tried this (obviously :-)), it is definitely worth checking out.
Anurag: I have started looking into the coordination problem using OpenNLP's chunker, and I too noticed that the chunker is not so accurate. You can try a better chunker (in my tests) here: http://cogcomp.cs.illinois.edu/demo/shallowparse/ - this may solve most of the problems, including not having to fall back to the parser after chunking.
Sujit: Thanks Anurag, looks interesting, but the link you provided points to a demo, and it timed out on the first sentence in my test set. Can you point me to where I can get the JAR for this?
Anurag: Sujit, it is working for me. You can download the chunker here: http://cogcomp.cs.illinois.edu/page/software
Sujit: Thanks, lots of cool things here :-).