Friday, October 21, 2011

Computing Document Similarity using Lucene Term Vectors

Someone asked me a question recently about implementing document similarity, and since he was using Lucene, I pointed him to the Lucene Term Vector API. I hadn't used the API myself, but I knew in general how it worked, so when he came back and told me that he could not make it work, I wanted to try it out for myself, to give myself a basis for asking further questions.

I already had a Lucene index (built by SOLR) of about 3000 medical articles for whose content field I had enabled term vectors as part of something I was trying out for highlighting, so I decided to use that. If you want to follow along and have to build your index from scratch, you can either use a field definition in your SOLR schema.xml file similar to this:

   <field name="content" type="text" indexed="true" stored="true" 
       termVectors="true" termPositions="true" termOffsets="true"/>

or enable term vectors in your Lucene indexing code like so:

  doc.add(new Field("content", content, Store.YES, Index.ANALYZED,
    TermVector.WITH_POSITIONS_OFFSETS));

Note that you don't need positions and offsets enabled on the term vectors for what I am doing here, but I had them on in my schema.xml file already, so I decided (to avoid confusion) to show the equivalent Lucene call in the code snippet above.
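
If you are enabling term vectors just for this exercise, term frequencies alone are enough, so the lighter-weight variant (same field setup as above, minus positions and offsets) would be:

  doc.add(new Field("content", content, Store.YES, Index.ANALYZED,
    TermVector.YES));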

Once you have the index, find the list of all the terms in the "content" field across the entire index. These terms represent the entries (dimensions) of the document vector. Then, for each document you want to consider in your similarity computation, extract its term vector. The term vector gives you two parallel arrays: the terms occurring in that document, and the corresponding frequency of each term in that document. Using these three data structures (the index-wide term list and the two per-document arrays), it is easy to construct a (sparse) document vector for each document.

Once you have the two document vectors, the similarity between them can be computed using Cosine Similarity (or some other measure - most of them work off document vectors).
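
For reference, the cosine similarity of two document vectors d1 and d2 is just their dot product divided by the product of their norms - exactly what the getCosineSimilarity() method in the test below computes:

  cosim(d1, d2) = (d1 . d2) / (||d1|| * ||d2||)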

For my test, I chose 4 documents out of my corpus, the titles of which (and their associated Lucene document IDs) are shown below:

Document#0: Mitral valve surgery - minimally invasive (31825)
Document#1: Mitral valve surgery - open (31835)
Document#2: Laryngectomy (31706)
Document#3: Shaken baby syndrome (30)

Notice that the first two are both about surgery and about the same body part, so intuitively I should expect a very high similarity between the documents 0 and 1. Document 2 is still about surgery, so I should expect a good, but smaller similarity value between documents 0 and 2. Document 3 is completely different, so I should expect an even smaller similarity value between documents 0 and 3.

The code for this is shown below (as a JUnit test).

package com.mycompany.termpos;

import java.io.File;
import java.util.HashMap;
import java.util.Map;

import org.apache.commons.math.linear.OpenMapRealVector;
import org.apache.commons.math.linear.RealVectorFormat;
import org.apache.commons.math.linear.SparseRealVector;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermEnum;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.store.FSDirectory;
import org.junit.Assert;
import org.junit.Test;

public class TermVectorBasedSimilarityTest {

  @Test
  public void testSimilarity() throws Exception {
    IndexReader reader = IndexReader.open(
      FSDirectory.open(new File("/path/to/my/index")));
    // first find all terms in the index
    Map<String,Integer> terms = new HashMap<String,Integer>();
    // reader.terms(t) positions the enum AT the first term >= t, so we
    // process the current term before advancing with next()
    TermEnum termEnum = reader.terms(new Term("content", ""));
    int pos = 0;
    do {
      Term term = termEnum.term();
      if (term == null || ! "content".equals(term.field())) {
        break;
      }
      terms.put(term.text(), pos++);
    } while (termEnum.next());
    termEnum.close();
    int[] docIds = new int[] {31825, 31835, 31706, 30};
    DocVector[] docs = new DocVector[docIds.length];
    int i = 0;
    for (int docId : docIds) {
      TermFreqVector[] tfvs = reader.getTermFreqVectors(docId);
      Assert.assertTrue(tfvs.length == 1);
      docs[i] = new DocVector(terms); 
      for (TermFreqVector tfv : tfvs) {
        String[] termTexts = tfv.getTerms();
        int[] termFreqs = tfv.getTermFrequencies();
        Assert.assertEquals(termTexts.length, termFreqs.length);
        for (int j = 0; j < termTexts.length; j++) {
          docs[i].setEntry(termTexts[j], termFreqs[j]);
        }
      }
      docs[i].normalize();
      i++;
    }
    // now get similarity between doc[0] and doc[1]
    double cosim01 = getCosineSimilarity(docs[0], docs[1]);
    System.out.println("cosim(0,1)=" + cosim01);
    // between doc[0] and doc[2]
    double cosim02 = getCosineSimilarity(docs[0], docs[2]);
    System.out.println("cosim(0,2)=" + cosim02);
    // between doc[0] and doc[3]
    double cosim03 = getCosineSimilarity(docs[0], docs[3]);
    System.out.println("cosim(0,3)=" + cosim03);
    reader.close();
  }
  
  private double getCosineSimilarity(DocVector d1, DocVector d2) {
    return (d1.vector.dotProduct(d2.vector)) /
      (d1.vector.getNorm() * d2.vector.getNorm());
  }

  private class DocVector {
    public Map<String,Integer> terms;
    public SparseRealVector vector;
    
    public DocVector(Map<String,Integer> terms) {
      this.terms = terms;
      this.vector = new OpenMapRealVector(terms.size());
    }
    
    public void setEntry(String term, int freq) {
      if (terms.containsKey(term)) {
        int pos = terms.get(term);
        vector.setEntry(pos, (double) freq);
      }
    }
    
    public void normalize() {
      double sum = vector.getL1Norm();
      vector = (SparseRealVector) vector.mapDivide(sum);
    }
    
    public String toString() {
      RealVectorFormat formatter = new RealVectorFormat();
      return formatter.format(vector);
    }
  }
}

The results of running this test are shown below, and they match up with my expectations:

  cosim(0,1)=0.9672563304581603
  cosim(0,2)=0.7496880467405691
  cosim(0,3)=0.23968406019250035

Using Lucene's term vectors to generate document vectors can be useful not only for similarity calculations, but also for other tasks where you need document vectors, such as clustering and classification. Computing document vectors from raw data is typically more involved (it requires development time and can have scalability issues) than loading the data into a Lucene index.

The other scalability issue is the size of the document vectors. My index is relatively tiny (about 3,000 documents), but we are still talking about 10,000 unique terms after stopwording and stemming. However, document vectors are typically quite sparse (my test document vectors have an average density of 0.01), so an OpenMapRealVector is a good choice of data structure, and should scale to much larger sizes as well.

Friday, October 14, 2011

Lucene Wildcard Query and Permuterm Index

At a project I am working on, I needed to allow wildcard queries against the titles of documents stored in a Lucene index. The expected behavior is similar to a database LIKE query. My first thought (and current implementation) was to delegate the lookup to a database using a custom component, then use the results as the basis for a secondary search against Lucene. This approach has some limitations, though, so I decided to see if I could support this natively within Lucene.

It's not like Lucene doesn't provide ways to do this - there is the PrefixQuery, which supports a wildcard at the end, and the WildcardQuery, which supports wildcards (both "*" and "?") at the beginning, middle, and end. However, the performance of a leading wildcard is bad enough that it comes with a warning in the Javadocs. The way both of these work is by enumerating through the terms starting at the prefix, creating a massive BooleanQuery with all matching terms, wrapping it in a ConstantScoreQuery and executing it. So if you have a leading wildcard, the enumeration has to happen across all the terms in the index.

I needed a way to efficiently support the "*" wildcard at the beginning, middle and end of a string, and using the WildcardQuery with its performance caveat was not an option. Looking around, I discovered this rather old (circa 2006) thread on MarkMail that discussed various possibilities. One of these that felt like it would suit my requirements was the use of a permuterm index, which is basically an index of your terms along with all their rotations.

In this post, I am going to describe my implementation (including code) of wildcard query using the permuterm index approach. Obviously the idea is not new, but a working implementation may help other people with similar problems, as well as suggest ideas for improvements, so here goes...

Permuterm Generating Token Filter

First, you need to create a permuterm index for the terms in the fields you want to query using wildcards. In my case it was only the title field, so I built a TokenFilter that places the original term (a word in the title), as well as all its permuterms (rotations), at the same virtual location in the index. Here is the code for the TokenFilter.

// Source: src/main/java/com/mycompany/permuterm/PermutermTokenFilter.java
package com.mycompany.permuterm;

import java.io.IOException;
import java.util.Stack;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.util.AttributeSource;

/**
 * TokenFilter to inject permuterms.
 */
public class PermutermTokenFilter extends TokenFilter {

  private CharTermAttribute termAttr;
  private PositionIncrementAttribute posIncAttr;
  private AttributeSource.State current;
  private TokenStream input;
  private Stack<char[]> permuterms;
  
  public PermutermTokenFilter(TokenStream input) {
    super(input);
    this.termAttr = addAttribute(CharTermAttribute.class);
    this.posIncAttr = addAttribute(PositionIncrementAttribute.class);
    this.input = input;
    this.permuterms = new Stack<char[]>();
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (permuterms.size() > 0) {
      char[] permuterm = permuterms.pop();
      restoreState(current);
      termAttr.copyBuffer(permuterm, 0, permuterm.length);
      posIncAttr.setPositionIncrement(0);
      return true;
    }
    if (! input.incrementToken()) {
      return false;
    }
    if (addPermuterms()) {
      current = captureState();
    }
    return true;
  }

  private boolean addPermuterms() {
    char[] buf = termAttr.buffer();
    char[] obuf = new char[termAttr.length() + 1];
    for (int i = 0; i < obuf.length - 1; i++) {
      obuf[i] = buf[i];
    }
    obuf[obuf.length-1] = '$';
    for (int i = 0; i < obuf.length; i++) {
      char[] permuterm = getPermutermAt(obuf, i);
      permuterms.push(permuterm);
    }
    return true;
  }

  private char[] getPermutermAt(char[] obuf, int pos) {
    char[] pbuf = new char[obuf.length];
    int curr = pos;
    for (int i = 0; i < pbuf.length; i++) {
      pbuf[i] = obuf[curr];
      curr++;
      if (curr == obuf.length) {
        curr = 0;
      }
    }
    return pbuf;
  }

}

The token filter takes each term in the incoming token stream, appends a '$' end-of-word marker, rotates the characters to produce every rotation of the marked term, and places these rotations at the same (virtual) position as the original term. Using this, a query with a wildcard at the beginning or middle can be converted to an equivalent query with a wildcard at the end. So for example, the term "cataract" produces the following terms:

cataract
$cataract
t$catarac
ct$catara
act$catar
ract$cata
aract$cat
taract$ca
ataract$c
cataract$
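
A quick way to verify the filter's behavior is to run a single word through it and print the emitted tokens. Here is a minimal sketch (assuming the same Lucene 3.x token stream APIs used elsewhere in this post):

  TokenStream ts = new PermutermTokenFilter(
    new WhitespaceTokenizer(Version.LUCENE_31, new StringReader("cataract")));
  CharTermAttribute termAttr = ts.addAttribute(CharTermAttribute.class);
  while (ts.incrementToken()) {
    System.out.println(termAttr.toString());
  }
  ts.close();

This should print the original term followed by the rotations listed above.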

Building the Permuterm Index

The token filter can be invoked and terms loaded into an index using code such as the following. I had an old index lying around, so I just used the titles in it to create a new index, with the titles analyzed using a chain containing the permuterm-generating token filter described above.

  @Test
  public void testLoadTitles() throws Exception {
    Analyzer analyzer = new Analyzer() {
      @Override
      public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream tokens = new WhitespaceTokenizer(Version.LUCENE_31, reader);
        tokens = new LowerCaseFilter(Version.LUCENE_31, tokens);
        tokens = new PermutermTokenFilter(tokens);
        return tokens;
      }
    };
    IndexReader reader = IndexReader.open(
      FSDirectory.open(new File("/path/to/my/old/index")));
    IndexWriterConfig iwconf = new IndexWriterConfig(
      Version.LUCENE_31, analyzer);
    iwconf.setOpenMode(OpenMode.CREATE);
    IndexWriter writer = new IndexWriter(
      FSDirectory.open(new File("/path/to/my/new/index")), iwconf);
    int numdocs = reader.maxDoc();
    for (int i = 0; i < numdocs; i++) {
      Document idoc = reader.document(i);
      Document odoc = new Document();
      String title = idoc.get("title");
      odoc.add(new Field("title", title, Store.YES, Index.ANALYZED));
      writer.addDocument(odoc);
    }
    reader.close();
    writer.commit();
    writer.optimize();
    writer.close();
  }

Since this is a test index, it doesn't contain anything beyond what I need. In a real-life situation, you could invoke the custom analyzer using a PerFieldAnalyzerWrapper, or define it in your fieldtype definition in Solr.
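
For example, wrapping the permuterm analyzer so that it applies only to the title field might look something like the sketch below (against the Lucene 3.x PerFieldAnalyzerWrapper API; permutermAnalyzer stands for the anonymous Analyzer defined above):

  PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(
    new StandardAnalyzer(Version.LUCENE_31)); // default for all other fields
  analyzer.addAnalyzer("title", permutermAnalyzer);
  IndexWriterConfig iwconf = new IndexWriterConfig(Version.LUCENE_31, analyzer);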

Wildcard Query using Permuterm Index

Once you have your index, a query with a wildcard at the beginning or middle can be rewritten to a prefix query (i.e., one with a wildcard at the end). For example, a query "*act" would become "act$*", and a query such as "cat*act" would become "act$cat*". Queries which already have a wildcard at the end are modified slightly - "cat*" becomes "$cat*". The "$" marker indicates a word boundary.
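
In isolation, the rewrite is just a rotation of the query string around the wildcard. Here is a minimal standalone sketch of the rule (using a hypothetical helper name, and ignoring the "*X*" case, which the query class below handles separately):

  // X*Y -> Y$X* : returns the prefix part of the rewritten query
  static String permutermPrefix(String query) {
    int star = query.indexOf('*');
    if (star == query.length() - 1) {     // trailing wildcard: cat* -> $cat
      return "$" + query.substring(0, star);
    }
    String x = query.substring(0, star);  // part before the wildcard
    String y = query.substring(star + 1); // part after the wildcard
    return y + "$" + x;                   // *act -> act$, cat*act -> act$cat
  }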

I decided to implement this logic as a custom Query type, using a pattern similar to the WildcardQuery, somewhat unimaginatively named PermutermWildcardQuery. Here it is.

// Source: src/main/java/com/mycompany/permuterm/PermutermWildcardQuery.java
package com.mycompany.permuterm;

import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FilteredTermEnum;
import org.apache.lucene.search.MultiTermQuery;
import org.apache.lucene.search.PrefixTermEnum;
import org.apache.lucene.search.SingleTermEnum;

public class PermutermWildcardQuery extends MultiTermQuery {

  private static final long serialVersionUID = 1245766111925148372L;

  private Term term;
  
  public PermutermWildcardQuery(Term term) {
    this.term = term;
  }
  
  @Override
  protected FilteredTermEnum getEnum(IndexReader reader) throws IOException {
    String field = term.field();
    String text = term.text();
    if (text.indexOf('*') == -1) {
      // no wildcards, use a term query
      return new SingleTermEnum(reader, term);
    } else {
      if (text.charAt(0) == '*' && 
          text.charAt(text.length()-1) == '*') {
        // leading and trailing '*', strip both wildcards and use the
        // remaining text as a prefix query against the rotations
        String ptext = permute(
          text.substring(0, text.length()-1), 0);
        return new PrefixTermEnum(reader, new Term(field, ptext));
      } else if (text.charAt(text.length()-1) == '*') {
        // trailing '*', pad with '$' on front and convert
        // to prefix query
        String ptext = "$" + text.substring(0, text.length()-1);
        return new PrefixTermEnum(reader, new Term(field, ptext));
      } else {
        // leading/within '*', pad with '$" on end and permute
        // to convert to prefix query
        String ptext = permute(text + "$", text.indexOf('*'));
        return new PrefixTermEnum(reader, new Term(field, ptext));
      }
    }
  }

  private String permute(String text, int starAt) {
    char[] tbuf = new char[text.length()];
    int tpos = 0;
    for (int i = starAt + 1; i < tbuf.length; i++) {
      tbuf[tpos++] = text.charAt(i);
    }
    for (int i = 0; i < starAt; i++) {
      tbuf[tpos++] = text.charAt(i);
    }
    return new String(tbuf, 0, tpos);
  }

  @Override
  public String toString(String field) {
    return "PermutermWildcardQuery(" + term + ")";
  }
}

The getEnum() method, which all subclasses of MultiTermQuery are required to implement, contains the logic to detect the different query patterns and handle each of them slightly differently. The FilteredTermEnum that is generated is used by the superclass to construct a BooleanQuery over all the matching terms and to return the results.
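
Since the superclass does the expansion, the usual MultiTermQuery knobs apply. For very broad patterns it may be worth switching to a constant-score rewrite so the expansion cannot hit the BooleanQuery clause limit - a suggestion only, not something I needed for this index:

  PermutermWildcardQuery q = 
    new PermutermWildcardQuery(new Term("title", "*ara*"));
  q.setRewriteMethod(MultiTermQuery.CONSTANT_SCORE_FILTER_REWRITE);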

Running the Query

To run the query with different query terms, I used the following simple JUnit test.

  private static final String[] SEARCH_TERMS = new String[] {
    "cataract", "cat*", "*act", "cat*act", "*ara*"
  };
  @Test
  public void testPermutermQuery() throws Exception {
    IndexSearcher searcher = new IndexSearcher(
      FSDirectory.open(new File("/path/to/new/index")));
    for (String searchTerm : SEARCH_TERMS) {
      PermutermWildcardQuery q = new PermutermWildcardQuery(
        new Term("title", searchTerm));
      TopScoreDocCollector collector = 
        TopScoreDocCollector.create(10, true);
      searcher.search(q, collector);
      int numResults = collector.getTotalHits();
      ScoreDoc[] hits = collector.topDocs().scoreDocs;
      System.out.println("=== " + numResults + 
          " results for: " + searchTerm + 
          " (query=" + q.toString() + ") ===");
      for (int i = 0; i < hits.length; i++) {
        Document doc = searcher.doc(hits[i].doc);
        System.out.println(".. " + doc.get("title"));
      }
    }
    searcher.close();
  }

The results from this run are shown below. As you can see, the results look accurate enough, and they come back quickly as well.

=== 3 results for: cataract (query=PermutermWildcardQuery(title:cataract)) ===
.. Cataract
.. Congenital cataract
.. Cataract removal
=== 17 results for: cat* (query=PermutermWildcardQuery(title:cat*)) ===
.. Catheter-associated UTI
.. Cataract
.. Cat scratch disease
.. Congenital cataract
.. Caterpillars
.. Cataract removal
.. Cardiac catheterization
.. Catecholamines - blood
.. Catecholamines - urine
.. Urine culture - clean catch
=== 8 results for: *act (query=PermutermWildcardQuery(title:*act)) ===
.. Urinary tract infection - children
.. Urinary tract infection - adults
.. Contact dermatitis
.. Cataract
.. Developmental disorders of the female reproductive tract
.. Congenital cataract
.. Cataract removal
.. Biopsy - biliary tract
=== 3 results for: cat*act (query=PermutermWildcardQuery(title:cat*act)) ===
.. Cataract
.. Congenital cataract
.. Cataract removal
=== 44 results for: *ara* (query=PermutermWildcardQuery(title:*ara*)) ===
.. Parapneumonic pulmonary effusion
.. Periodic paralysis with hypokalemia
.. Hypokalemic periodic paralysis
.. Hyperkalemic periodic paralysis
.. Secondary hyperparathyroidism
.. Thyrotoxic periodic paralysis
.. Pseudohypoparathyroidism
.. Primary hyperparathyroidism
.. Hypoparathyroidism
.. Subarachnoid hemorrhage

This implementation does have limitations, of course. For one, multiple wildcards (except one each at the beginning and end) are not supported. I need to figure out what I should do in that case. It is likely that I will end up building my own custom FilteredTermEnum that iterates through the TermEnum using the first part of the pattern (up to the first occurrence of the wildcard), then matches each candidate term against the rest of the pattern using regular expressions. I haven't thought this through completely yet. A lot depends on whether we decide to switch out the current database-backed implementation with this one (in that case I will think about improving it). If I do end up improving it, I will describe it here.

Monday, October 03, 2011

Coordinate Expansion with OpenNLP

Background

Despite the fancy name, the problem I am trying to solve here is to eliminate missed opportunities for mapping text such as "hepatitis B and C" against our taxonomy containing concepts "hepatitis B" and "hepatitis C". Thanks to the many kind folks on the LinkedIn Natural Language Processing People group, I now know that this is variously called (or covered under) Distributive Predication, Coordinate Expansion/Construction, and Transformational Grammar. They also provided pointers and links to papers which were quite helpful, which I list in the References below.

My problem is more limited than what the work described in the references aims to solve, however. My toy mapping application, TGNI, already uses OpenNLP to break the input text into sentences, so my problem is to detect these patterns in the sentences and expand them. For testing, here are some sentences I pulled off the internet (and in some cases, adapted to fit). The highlighted areas represent the patterns I am interested in expanding.

  1. Viral hepatitis, including hepatitis A, B, and C, are distinct diseases that affect the liver. (webmd.com)
  2. This page contains late breaking information, as well as an archival record of updates on safety and regulatory issues related to Hepatitis A and B, including product approvals, significant labeling changes, safety warnings, notices of upcoming public meetings, and notices about proposed regulatory guidances. (fda.gov)
  3. Lead poisoning can cause an increased risk of brain, lung, stomach and kidney cancer. (cleanupblackwell.com)
  4. Before we look at the difference between diabetes type I and II, let's firstly look at diabetes in general. (medicalnewstoday.com)
  5. Restricting and purging anorexia are two different types of anorexia that people suffer from (anorexiasurvivalguide.com)
  6. Here are some tips on pre and post surgery nutrition. (bestofmotherearth.com)
  7. A computer-based register linked to thyroid diagnostic laboratories was used to continuously identify all new cases of overt hyper- or hypothyroidism in two population cohorts with moderate and mild ID, respectively (Aalborg, n = 310,124; urinary iodine, 45 micro g/liter; and Copenhagen, n = 225,707; urinary iodine, 61 micro g/liter). (nlm.nih.gov)
  8. Medical and assistive devices are taxable for the most part, unconditionally zero-rated in certain cases, and conditionally zero-rated in certain cases. (revenuequebec.ca)
  9. These regions correspond to the least well conserved regions of the whole miRNA/snoRNA molecules. (ploscompbiol.org)
  10. Hetero- and Homogeneous mixtures are alike because they are both compounds, and both made up of different elements. (answers.com)

Identifying Candidate Phrases

Noun Phrases with AND, OR and slash

Based upon some observations extracting noun phrases using OpenNLP, I realized that I could probably get most of the candidate phrases if I considered only the noun phrases containing "AND", "OR" and "/" in them. Checking this is fairly trivial, so I tried that first. This resulted in finding 9 out of the 12 expected phrases from my test set.

Removing noisy Noun Phrases with slash

The noun phrases retrieved also contained a few noise phrases, such as "45 micro g/liter", which are not candidates for coordinate expansion, along with "the whole miRNA/snoRNA molecules", which is. For these, I added a rule requiring a non-zero character overlap between the words preceding and following the "/" character.

I used the Strike-a-Match algorithm to compute string similarity. Some of the code below is directly adapted from the Java code explaining the algorithm.
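
For reference, Strike-a-Match is simply the Dice coefficient computed over adjacent letter pairs:

  similarity(s1, s2) = 2 * (number of letter pairs common to s1 and s2) /
                       (number of pairs in s1 + number of pairs in s2)

So "miRNA" and "snoRNA" share the pairs "RN" and "NA" and score above zero, while "g" and "liter" share no pairs and score zero, which is exactly the distinction the slash rule needs.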

Expanding dangling prefix words

Looking through the test set, I realized that OpenNLP seems to have a hard time handling dangling prefix words such as hyper- (it treats the "-" as a separate token) and gets all confused, as shown below.

[A/DT computer-based/JJ register/NN ]/NP [linked/VBN ]/VP [to/TO thyroid/VB ]/VP [diagnostic/JJ laboratories/NNS ]/NP [was/VBD used/VBN ]/VP [to/TO continuously/RB identify/VB ]/VP [all/DT new/JJ cases/NNS ]/NP [of/IN ]/PP [overt/JJ hyper/NN ]/NP -/: and/CC [hypothyroidism/NN ]/NP [in/IN ]/PP [two/CD population/NN cohorts/NNS ]/NP [with/IN ]/PP [moderate/JJ and/CC mild/JJ ID/NNP ]/NP ,/, [respectively/RB ]/ADVP [(/-LRB- Aalborg/UH ]/NP ,/, [n/RB ]/ADVP [=/SYM ]/VP [310,124/CD ]/NP ;/: [urinary/JJ iodine/NN ]/NP ,/, [45/CD micro/JJ g/liter/NN ]/NP ;/: and/CC [Copenhagen/NNP ]/NP ,/, [n/RB =/SYM 225,707/CD ]/NP ;/: [urinary/JJ iodine/NN ]/NP ,/, [61/CD micro/JJ g/liter/NN )/-RRB- ]/NP ./.

So I figured that if we expanded such dangling prefixes into complete words using the token after the conjunction, i.e., expanding strings such as "hyper- and hypothyroidism" to "hyperthyroidism and hypothyroidism", then OpenNLP might find it easier to tag them correctly.

I used some of the prefixes from the English Prefixes wiki page. When I see one of these prefixes preceding an "AND" or "OR" token, I look at the word following the token, split it along its own prefix, and append the remainder to the dangling prefix, as sketched below. This takes care of 2 more examples in my test set.
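
Here is a minimal standalone illustration of that rule (the full version is the preprocessPrefixes() method in the code further down; the prefix list here is truncated):

  // "hyper- and hypothyroidism" -> "hyperthyroidism and hypothyroidism"
  String pre = "hyper";            // dangling prefix, trailing '-' already stripped
  String post = "hypothyroidism";  // word following the conjunction
  for (String prefix : new String[] {"hypo", "hyper", "pre", "post"}) {
    if (post.startsWith(prefix)) {
      pre = pre + post.substring(prefix.length()); // "hyper" + "thyroidism"
      break;
    }
  }
  System.out.println(pre + " and " + post); // hyperthyroidism and hypothyroidism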

Phrase patterns and Parse Trees

I also noticed that some of my patterns seem to be incorrectly categorized (i.e., not as noun phrases), such as these.

[Restricting/VBG and/CC purging/VBG ]/VP [anorexia/DT ]/NP [are/VBP ]/VP [two/CD different/JJ types/NNS ]/NP [of/IN ]/PP [anorexia/NN ]/NP [that/IN ]/PP [people/NNS ]/NP [suffer/VBP ]/VP [from/IN ]/PP ./.

[Before/IN ]/SBAR [we/PRP ]/NP [look/VBP ]/VP [at/IN ]/PP [the/DT difference/NN ]/NP [between/IN ]/PP [diabetes/NNS ]/NP [type-I/JJ ]/ADJP and/CC [II/NNP ]/NP ,/, [let/VB ]/VP ['s/PRP ]/NP [firstly/RB ]/ADVP [look/VBP ]/VP [at/IN ]/PP [diabaetes/NNS ]/NP [in/IN ]/PP [general/JJ ]/ADJP ./.

I suspect that these are training issues with the default models provided with the OpenNLP distribution I am using. Making my own training set is unfortunately not feasible given my resources. At this point, I considered making up rules to extract (VP,NP) and (NP,ADJP,CC,NP) sequences as well as noun phrases, but realized that this may lead to overfitting.

I remembered one comment which said that this problem is trivial if you have a good parse tree, so I decided to see if OpenNLP could give me a parse tree that I could work with for these corner cases (going by my test examples, these account for just 2 of my 12 cases). Here they are, for both the sentences considered above.

(TOP 
  (S 
    (SBAR 
       (IN Before) 
       (S 
         (NP 
           (PRP we)
         ) 
         (VP 
           (VBP look) 
           (PP 
             (IN at) 
             (NP 
               (NP 
                 (DT the) 
                 (NN difference)
               ) 
               (PP 
                 (IN between) 
                 (NP 
                   (NP 
                     (NN diabetes) 
                     (NN type-I)
                   ) 
                   (CC and) 
                   (NP 
                     (NNP II,)
                   )
                 )
               )
             )
           )
         )
       )
     ) 
     (VP 
       (VBD let's) 
       (RB firstly) 
       (VP 
         (VB look) 
         (PP 
           (IN at) 
           (NP 
             (NP 
               (NNS diabaetes)
             ) 
             (PP 
               (IN in) 
               (NP
                 (NN general.)
               )
             )
           )
         )
       )
     )
   )
)
    
(TOP 
  (S 
    (S 
      (VP 
        (VP 
          (VBG Restricting)
        ) 
        (CC and) 
        (VP 
          (VBG purging) 
          (NP 
            (DT anorexia)
          )
        )
      )
    ) 
    (VP 
      (VBP are) 
      (NP 
        (NP 
          (NP 
            (CD two) 
            (JJ different) 
            (NNS types)
          ) 
          (PP 
            (IN of) 
            (NP 
              (NN anorexia)
            )
          )
        ) 
        (PP 
          (IN that) 
          (NP 
            (NNS people)
          )
        )
      ) 
      (VBP suffer)
    ) 
    (. from.)
  )
)
    

I wanted to use the parser as a fallback to the chunker, so I count the number of "AND", "OR" and "/" in the sentence, and if the number of phrases discovered by the chunker is less than this, I invoke the parser and generate the tree. I did consider using only the parser, but there are instances where the chunker is "more correct" than the parser, and the chunker catches most of my phrases anyway, so it makes sense to use the parser to handle the corner cases. The other reason is that the chunker is less resource-intensive than the parser.

Anyway, the fallback check above exposes 6 occurrences where the chunker did not find all the phrases containing "AND", "OR" and "/", 2 of which are correct from my point of view. So we are going to invoke the parser 4 more times than we need to.

Expanding the phrase

I initially thought it would be relatively simple to just pass the retrieved phrases through a chain of regular expression parsers, but then realized that a custom parser would probably be easier (and more general), so I built a token-based parser.

Code

Here is the code to do this all, written as a JUnit test. This will become a UIMA Analysis Engine that will be added to the front-end aggregate AE in TGNI.

// Source: src/test/java/com/mycompany/tgni/uima/annotators/nlp/CoordinateConstructionTest.java
package com.mycompany.tgni.uima.annotators.nlp;

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import opennlp.tools.chunker.Chunker;
import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;
import opennlp.tools.cmdline.parser.ParserTool;
import opennlp.tools.parser.Parse;
import opennlp.tools.parser.Parser;
import opennlp.tools.parser.ParserFactory;
import opennlp.tools.parser.ParserModel;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTagger;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.Tokenizer;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;

import org.apache.commons.io.IOUtils;
import org.apache.commons.lang.ArrayUtils;
import org.apache.commons.lang.StringUtils;
import org.junit.Test;

public class CoordinateConstructionTest {

  private final String[] TEST_SENTENCES = new String[] { ... };
  
  private static final String[] CONJ_WORDS = new String[] {
    "and", "or",
  };
  // from wikipedia page on Prefix_Words
  private static final String[] PREFIX_WORDS = new String[] {
    "pre", "post", "hypo", "hyper", "inter", "intra", 
    "over", "under", "infra", "ultra", "hetero", "homo",
  };
  
  @Test
  public void findCandidatePhrases() throws Exception {
    Tokenizer tokenizer = null;
    POSTagger posTagger = null;
    Chunker chunker = null;
    Parser parser = null;
    InputStream tmis = null;
    InputStream pmis = null;
    InputStream cmis = null;
    InputStream pamis = null;
    Set<String> ccs = new HashSet<String>(Arrays.asList(CONJ_WORDS));
    Set<String> prefixes = new HashSet<String>(Arrays.asList(PREFIX_WORDS));
    try {
      tmis = new FileInputStream("/path/to/opennlp/models/en-token.bin");
      TokenizerModel tm = new TokenizerModel(tmis);
      tokenizer = new TokenizerME(tm);
      pmis = new FileInputStream("/path/to/opennlp/models/en-pos-maxent.bin");
      POSModel pm = new POSModel(pmis);
      posTagger = new POSTaggerME(pm);
      cmis = new FileInputStream("/path/to/opennlp/models/en-chunker.bin");
      ChunkerModel cm = new ChunkerModel(cmis);
      chunker = new ChunkerME(cm);
      pamis = new FileInputStream(
        "/path/to/opennlp/models/en-parser-chunking.bin");
      ParserModel pam = new ParserModel(pamis);
      parser = ParserFactory.create(pam);
      for (String sentence : TEST_SENTENCES) {
        System.out.println("sentence: " + sentence);
        sentence = removeCoordinatePrefixHyphens(sentence);
        int expectedNumPhrases = countConjWords(sentence);
        int actualNumPhrases = 0;
        Span[] tokenSpans = tokenizer.tokenizePos(sentence);
        String[] tokens = new String[tokenSpans.length];
        for (int i = 0; i < tokenSpans.length; i++) {
          tokens[i] = tokenSpans[i].getCoveredText(sentence).toString();
        }
        // preprocess tokens to rewrite prefix words
        preprocessPrefixes(prefixes, ccs, tokens);
        String[] tags = posTagger.tag(tokens);
        Set<String> candidatePhrases = new HashSet<String>();
        // shallow parse using chunker
        Span[] chunks = chunker.chunkAsSpans(tokens, tags);
        System.out.println(showExplanation(tokens, tags, chunks));
        for (Span chunk : chunks) {
          int chunkTokenStart = chunk.getStart();
          int chunkTokenEnd = chunk.getEnd();
          if ("NP".equals(chunk.getType()) &&
              containsToken(tokens, chunkTokenStart, chunkTokenEnd, ccs)) {
            String np = StringUtils.join(
              ArrayUtils.subarray(tokens, chunkTokenStart, chunkTokenEnd),
              " ");
            if (hasOverlapOnSlashSeparator(np)) {
              np = np.replace(" , ", ", ").
                replace(", and ", " and ").
                replace("-", " "); // clean up
              candidatePhrases.add(np);
              expectedNumPhrases--;
            } else {
              actualNumPhrases++;
            }
          }
        }
        // fallback to deep parse using parser
        if (actualNumPhrases < expectedNumPhrases) {
          Parse[] parses = ParserTool.parseLine(sentence, parser, 1);
          for (Parse parse : parses) {
            walkParseTree_r(sentence, parse, candidatePhrases);
          }
        }
        // print candidate phrases
        for (String candidatePhrase : candidatePhrases) {
          System.out.println(".. " + candidatePhrase);
          List<String> expandedPhrases = expandPhrase(candidatePhrase);
          for (String expandedPhrase : expandedPhrases) {
            System.out.println(".... " + expandedPhrase);
          }
        }
        // match phrases against taxonomy
      }
    } finally {
      IOUtils.closeQuietly(tmis);
      IOUtils.closeQuietly(pmis);
      IOUtils.closeQuietly(cmis);
      IOUtils.closeQuietly(pamis);
    }
  }

  private List<String> expandPhrase(String candidatePhrase) {
    List<String> expandedPhrases = new ArrayList<String>();
    String[] tokens = StringUtils.split(candidatePhrase, " ");
    int ccpos = 0;
    String cctype = null;
    // first pass, find the position of the conjunction
    for (int i = 0; i < tokens.length; i++) {
      if ("and".equalsIgnoreCase(tokens[i]) ||
          "or".equalsIgnoreCase(tokens[i])) {
        ccpos = i;
        cctype = tokens[i];
        break;
      }
      if (tokens[i].indexOf('/') > -1) {
        ccpos = i;
        cctype = "/";
        break;
      }
    }
    if (ccpos > 0) {
      String phrasePre = "";
      String phrasePost = "";
      List<String> ccwords = new ArrayList<String>();
      if ("/".equals(cctype)) {
        // handles the following cases:
        // xx A/B/.. yy => xx A yy, xx B yy, ...
        ccwords.addAll(
          Arrays.asList(StringUtils.split(tokens[ccpos], "/")));
        phrasePre = StringUtils.join(
          Arrays.asList(
          ArrayUtils.subarray(tokens, 0, ccpos)).iterator(), " ");
        phrasePost = StringUtils.join(
          Arrays.asList(
          ArrayUtils.subarray(tokens, ccpos+1, tokens.length)).iterator(), 
          " ");
      } else {
        if (ccpos > 0 && ccpos < tokens.length - 1) {
          // handles the following cases:
          // xx A (and|or) B C yy => xx A C yy, xx B C yy
          // xx A B (and|or) C yy => xx A C yy, xx B C yy
          ccwords.add(tokens[ccpos - 1]);
          ccwords.add(tokens[ccpos + 1]);
          // look back from ccpos-1 until we stop seeing
          // words with trailing commas
          int currpos = ccpos - 2;
          while (currpos >= 0) {
            if (tokens[currpos].endsWith(",")) {
              ccwords.add(tokens[currpos].substring(
                0, tokens[currpos].length() - 1));
              currpos--;
            } else {
              break;
            }
          }
          if (currpos >= 0) {
            phrasePre = StringUtils.join(
              Arrays.asList(ArrayUtils.subarray(
              tokens, 0, currpos+1)), " ");
          }
          if (ccpos + 2 < tokens.length) {
            phrasePost = StringUtils.join(
              Arrays.asList(ArrayUtils.subarray(
              tokens, ccpos + 2, tokens.length)), " ");
          }
        }
      }
      for (String ccword : ccwords) {
        expandedPhrases.add(StringUtils.join(new String[] {
          phrasePre, ccword, phrasePost}, " "));
      }
    }
    return expandedPhrases;
  }

  private void walkParseTree_r(String sentence, Parse parse,
      Set<String> candidatePhrases) {
    Span span = parse.getSpan();
    try {
      String text = span.getCoveredText(sentence).toString();
      if ("and".equalsIgnoreCase(text) || "or".equalsIgnoreCase(text)) {
        Parse pparse = parse.getCommonParent(parse);
        Span pspan = pparse.getSpan();
        String chunk = pspan.getCoveredText(sentence).toString();
        if (! ("and".equalsIgnoreCase(chunk) ||
            "or".equalsIgnoreCase(chunk))) {
          // remove trailing punctuation
          if (chunk.matches("^.*\\p{Punct}$")) {
            chunk = chunk.substring(0, chunk.length() - 1);
          }
          chunk = chunk.replace("-", " ");
          candidatePhrases.add(chunk);
        }
      }
    } catch (Exception e) {
      // attempt to ignore IllegalArgumentException caused by
      // the span.getCoveredText() attempting to lookup a larger
      // span than the sentence length. Nothing much I can do 
      // here, this is probably an issue with OpenNLP.
    }
    for (Parse child : parse.getChildren()) {
      walkParseTree_r(sentence, child, candidatePhrases);
    }
  }

  private int countConjWords(String sentence) {
    int numConjWords = 0;
    for (String conjWord : CONJ_WORDS) {
      numConjWords += StringUtils.countMatches(sentence, 
        " " + conjWord + " ");
    }
    numConjWords += StringUtils.countMatches(sentence, "/");
    return numConjWords;
  }

  private String removeCoordinatePrefixHyphens(String sentence) {
    return sentence.replaceAll("- and ", "  and ").
      replaceAll("- or ", "  or ");
  }
  
  private void preprocessPrefixes(Set<String> prefixes, 
      Set<String> ccs, String[] tokens) {
    for (int i = 0; i < tokens.length; i++) {
      if (ccs.contains(StringUtils.lowerCase(tokens[i]))) {
        // check before and after
        String preWord = (i > 0) ? 
          StringUtils.lowerCase(tokens[i-1]) : null;
        String postWord = (i < tokens.length - 1) ? 
          StringUtils.lowerCase(tokens[i+1]) : null;
        if (preWord != null && postWord != null) {
          if (preWord.endsWith("-")) 
            preWord = preWord.substring(0, preWord.length() - 1);
          if (prefixes.contains(preWord)) {
            // attempt to split postWord along one of the prefixes
            String bareWord = null;
            for (String prefix : prefixes) {
              if (postWord.startsWith(prefix)) {
                bareWord = postWord.substring(prefix.length());
                break;
              }
            }
            if (StringUtils.isNotEmpty(bareWord)) {
              tokens[i-1] = preWord + bareWord;
            }
          }
        }
      }
    }
  }

  private String showExplanation(String[] tokens, String[] tags, 
      Span[] chunks) {
    StringBuilder buf = new StringBuilder();
    int ci = 0;
    for (int i = 0; i < tokens.length; i++) {
      if (ci < chunks.length && chunks[ci].getStart() == i) {
        buf.append("[");
      }
      buf.append(tokens[i]).append("/").append(tags[i]).append(" ");
      if (ci < chunks.length && chunks[ci].getEnd() - 1 == i) {
        buf.append("]/").append(chunks[ci].getType()).append(" ");
        ci++;
      }
    }
    return buf.toString();
  }

  private boolean containsToken(String[] tokens, int start,
      int end, Set<String> ccs) {
    String[] chunkTokens = (String[]) ArrayUtils.subarray(
      tokens, start, end);
    for (String chunkToken : chunkTokens) {
      if (ccs.contains(StringUtils.lowerCase(chunkToken))) {
        return true;
      }
      if (chunkToken.contains("/")) {
        return true;
      }
    }
    return false;
  }

  private boolean hasOverlapOnSlashSeparator(String np) {
    int slashPos = np.indexOf('/');
    if (slashPos > -1) {
      // lastIndexOf returns -1 if there is no preceding space, so the +1
      // conveniently yields 0 (start of string) in that case
      int start = np.lastIndexOf(' ', slashPos-1) + 1;
      int end = np.indexOf(' ', slashPos+1);
      if (end == -1) end = np.length();
      String prevWord = np.substring(start, slashPos);
      String nextWord = np.substring(slashPos+1, end);
      double similarity = computeStringSimilarity(prevWord, nextWord);
      return similarity > 0.0D;
    }
    return true;
  }

  public double computeStringSimilarity(String s1, String s2) {
    ArrayList<String> pairs1 = letterPairs(s1);
    ArrayList<String> pairs2 = letterPairs(s2);
    int intersection = 0;
    int union = pairs1.size() + pairs2.size();
    for (int i = 0; i < pairs1.size(); i++) {
      Object pair1=pairs1.get(i);
      for(int j = 0; j < pairs2.size(); j++) {
        Object pair2=pairs2.get(j);
        if (pair1.equals(pair2)) {
          intersection++;
          pairs2.remove(j);
          break;
        }
      }
    }
    return (2.0 * intersection) / union;
  }

  private ArrayList<String> letterPairs(String str) {
    int numPairs = str.length()-1;
    ArrayList<String> pairs = new ArrayList<String>();
    for (int i = 0; i < numPairs; i++) {
        pairs.add(str.substring(i, i + 2));
    }
    return pairs;
  }
}

which, when run, yields the results shown below:

Viral hepatitis, including hepatitis A, B, and C, are distinct 
diseases that affect the liver.
.. hepatitis A, B and C
.... hepatitis B 
.... hepatitis C 
.... hepatitis A 
This page contains late breaking information, as well as an archival 
record of updates on safety and regulatory issues related to Hepatitis 
A and B, including product approvals, significant labeling changes, 
safety warnings, notices of upcoming public meetings, and notices about 
proposed regulatory guidances.
.. safety and regulatory issues
....  safety issues
....  regulatory issues
.. Hepatitis A and B
.... Hepatitis A 
.... Hepatitis B 
.. upcoming public meetings, and notices
.... upcoming public meetings, 
.... upcoming public notices 
Lead poisoning can cause an increased risk of brain, lung, stomach and 
kidney cancer.
.. brain, lung, stomach and kidney cancer
....  stomach cancer
....  kidney cancer
....  lung cancer
....  brain cancer
Before we look at the difference between diabetes type-I and II, let's 
firstly look at diabaetes in general.
.. diabetes type I and II
.... diabetes type I 
.... diabetes type II 
Restricting and purging anorexia are two different types of anorexia 
that people suffer from.
.. Restricting and purging anorexia
....  Restricting anorexia
....  purging anorexia
Here are some tips on pre and post surgery nutrition.
.. pre and post surgery nutrition
....  pre surgery nutrition
....  post surgery nutrition
A computer-based register linked to thyroid diagnostic laboratories 
was used to continuously identify all new cases of overt hyper- and 
hypothyroidism in two population cohorts with moderate and mild ID, 
respectively (Aalborg, n = 310,124; urinary iodine, 45 micro g/liter; 
and Copenhagen, n = 225,707; urinary iodine, 61 micro g/liter).
.. 310,124; urinary iodine, 45 micro g/liter; and Copenhagen, n
.... 310,124; urinary iodine, 45 micro g and Copenhagen, n
.... 310,124; urinary iodine, 45 micro liter; and Copenhagen, n
.. moderate and mild ID
....  moderate ID
....  mild ID
.. overt hyperthyroidism and hypothyroidism
.... overt hyperthyroidism 
.... overt hypothyroidism 
Medical and assistive devices are taxable for the most part, 
unconditionally zero-rated in certain cases, and conditionally 
zero-rated in certain cases.
.. Medical and assistive devices
....  Medical devices
....  assistive devices
.. Medical and assistive devices are taxable for the most part, 
unconditionally zero rated in certain cases, and conditionally 
zero rated in certain cases
....  Medical devices are taxable for the most part, unconditionally 
zero rated in certain cases, and conditionally zero rated in certain 
cases
....  assistive devices are taxable for the most part, unconditionally 
zero rated in certain cases, and conditionally zero rated in certain 
cases
These regions correspond to the least well conserved regions of the 
whole miRNA/snoRNA molecules.
.. the whole miRNA/snoRNA molecules
.... the whole miRNA molecules
.... the whole snoRNA molecules
Hetero- and Homogeneous mixtures are alike because they are both 
compounds, and both made up of different elements.
.. heterogeneous and Homogeneous mixtures
....  heterogeneous mixtures
....  Homogeneous mixtures

The final step (which I haven't done here) is to take each of these expanded phrases and hit the taxonomy index to map them to concepts. This will also get rid of some of the noisy phrases.

Conclusion

The approach described above solves a very simple subset of the more general problem of coordinate construction/expansion. It does this by identifying noun phrases in text containing trigger words "AND", "OR" and "/", and expanding the phrases into multiple phrases along the trigger words.

Our current (proprietary, and not described here) solution for this problem uses taxonomy lookups rather than NLP to determine the phrase boundaries. While it has some advantages in terms of accuracy (phrases such as "diabetes, lung and heart disease" will get correctly expanded to "diabetes", "lung disease" and "heart disease" rather than "diabetes disease", "lung disease" and "heart disease"), it involves a lot more processing (in user level code).

A possible improvement to the tiered lookup (chunker then parser) approach described above could be to build a custom rule-based chunker that works on the POS tags generated by OpenNLP, similar to NLTK's RegexpChunkParser. I will check out this approach, and if it works out, will replace the tiered lookup code.

References

Here are some reference articles that were suggested by folks in the LinkedIn discussion, or that I found by following the pointers provided there. All of these attempt to solve the more general problem of coordinate construction, and as such are not directly applicable to my solution, but they did provide useful ideas, so I am including them here in case you want to read more about this stuff.

  1. Dealing with Conjunctions in a Machine Translation Environment, by Xiuming Huang, Institute of Linguistics, Chinese Academy of Social Sciences, Beijing, China. (PDF)
  2. Lexicase Parsing: A Lexicon-driven Approach to Syntactic Analysis, by Stanley Starosta and Hirosato Nomura, COLING '86: Proceedings of the 11th Conference on Computational Linguistics, doi:10.3115/991365.991400. (PDF)
  3. Comparing methods for the syntactic simplification of sentences in information extraction, by Richard J. Evans, Literary and Linguistic Computing (2011), doi:10.1093/llc/fqr034. (PDF)