Saturday, August 20, 2011

Implementing Concept Subsumption with Bitsets

According to this page, Concept Subsumption is defined as follows:

Given an ontology O and two classes A, B, verify whether the interpretation of A is a subset of the interpretation of B in every model of O.

For example, concept mapping the string "Lung Cancer Symptom" would yield the following concepts - "lung cancer symptoms", "lung", "cancer", "lung cancer" and "cancer symptoms". Using the definition above, we can say that the concept for "lung cancer symptoms" subsumes the other concepts listed, since its interpretation is a subset of the interpretation of each of the other concepts.

However, if you look at the same problem from a pure text processing point of view, it's just an application of the Longest Match or Maximal Munch rule - i.e., map the longest substring(s) that can be mapped to one or more concepts.

My TGNI application reports all concepts found (along with their start and end positions) in a body of text. While some clients may prefer to filter the concepts based on custom (usually semantic) criteria, subsumption is a common enough requirement to be the default expected behavior. So I need to modify TGNI to only return the longest matches (unless all concepts are specifically requested).

A naive implementation of such a filter would be to sort the concepts based on match length (longest first), then for each matched text, check for subsumption against the concepts before it in the list - basically an O(n²) operation. Checking for subsumption involves either checking whether the current matched text is a substring of any of the previous ones, or whether the range defined by its start and end positions is contained in any of the previous ranges.
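For comparison, here is a minimal sketch of that naive filter using range containment checks. It assumes the same Annot holder class used in the JUnit code further down; it is just an illustration, not the actual TGNI code.

// Naive subsumption filter: sort longest first, then drop any annotation
// whose span is contained in an already-kept (longer) annotation.
// Assumes the Annot(start, end, text) holder class from the test below,
// plus java.util.{ArrayList, Collections, Comparator, List}.
public static List<Annot> naiveFilter(List<Annot> annots) {
  List<Annot> sorted = new ArrayList<Annot>(annots);
  Collections.sort(sorted, new Comparator<Annot>() {
    @Override
    public int compare(Annot a1, Annot a2) {
      return (a2.end - a2.start) - (a1.end - a1.start);
    }
  });
  List<Annot> kept = new ArrayList<Annot>();
  for (Annot annot : sorted) {
    boolean subsumed = false;
    for (Annot longer : kept) {            // inner scan makes this O(n^2)
      if (annot.start >= longer.start && annot.end <= longer.end) {
        subsumed = true;                   // contained in a longer match
        break;
      }
    }
    if (!subsumed) {
      kept.add(annot);
    }
  }
  return kept;
}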

Another interesting way, which I thought of recently, would be to compute intersections of bitsets representing the input text and the matched text, and drop a concept if the intersection causes no change in cardinality. It's probably easier to describe with an example:

input   : patient with lung cancer symptoms 111111111111111111111111111111111
(13,32) : lung cancer symptoms              111111111111100000000000000000000
(18,32) : cancer symptoms                   111111111111100000000000000000000
(13,23) : lung cancer                       111111111111100000000000000000000
(18,23) : cancer                            111111111111100000000000000000000
(13,16) : lung                              111111111111100000000000000000000

Notice how the only time the cardinality (number of 1s) of the input bitset changes is when the bitset for "lung cancer symptoms" is ANDed with the input bitset. Using our test described above, the only concept that would survive the subsumption filtering is this one.

Here is some JUnit prototype code I wrote to prove to myself that this will work as described above. Folding this code into the ConceptAnnotator is fairly trivial (it will be called as a post-processing step to remove ConceptAnnotations from the FSIndex after all the concepts are mapped), so I will not describe that code on the blog unless I make some major changes to that class in the future.

// Source: src/test/java/com/mycompany/tgni/uima/annotators/concept/MaximalMunchTest.java
package com.mycompany.tgni.uima.annotators.concept;

import java.util.Arrays;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

import org.apache.lucene.util.OpenBitSet;
import org.junit.Test;

public class MaximalMunchTest {

  private static class Annot {
    public int start;
    public int end;
    public String text;
    public Annot(int start, int end, String text) {
      this.start = start;
      this.end = end;
      this.text = text;
    }
  };

  private static final String INPUT = 
    "lung cancer symptoms and kidney cancer";
  
  // 0         1         2         3
  // 0123456789012345678901234567890123456789
  // lung cancer symptoms and kidney cancer
  public static final Annot[] ANNOTS = new Annot[] {
    new Annot(0, 3, "lung"),
    new Annot(0, 10, "lung cancer"),
    new Annot(0, 19, "lung cancer symptoms"),
    new Annot(5, 10, "cancer"),
    new Annot(5, 19, "cancer symptoms"),
    new Annot(5, 10, "cancer"),
    new Annot(25, 30, "kidney"),
    new Annot(25, 37, "kidney cancer"),
    new Annot(32, 37, "cancer")
  };
  
  @SuppressWarnings("unchecked")
  @Test
  public void testMaximalMunch() throws Exception {
    // bitset over the input text, initially all set
    OpenBitSet tset = new OpenBitSet(INPUT.length());
    tset.set(0, tset.length());
    List<Annot> annots = Arrays.asList(ANNOTS);
    // sort the annotations, longest first
    Collections.sort(annots, new Comparator<Annot>() {
      @Override
      public int compare(Annot annot1, Annot annot2) {
        Integer len1 = annot1.end - annot1.start;
        Integer len2 = annot2.end - annot2.start;
        return len2.compareTo(len1);
      }
    });
    List<Annot> maxMunchAnnots = new ArrayList<Annot>();
    long prevCardinality = tset.cardinality();
    for (Annot annot : annots) {
      // build a mask that is all 1s except over this annotation's span
      OpenBitSet aset = new OpenBitSet(tset.length());
      aset.set(0, tset.length());
      aset.flip(annot.start, annot.end);
      // AND the mask into the running input bitset; if the cardinality
      // does not change, the span was already covered by a longer match
      tset.intersect(aset);
      long cardinality = tset.cardinality();
      if (cardinality == prevCardinality) {
        // complete overlap, skip
        continue;
      }
      maxMunchAnnots.add(annot);
      prevCardinality = cardinality;
    }
    for (Annot annot : maxMunchAnnots) {
      System.out.println("(" + annot.start + "," + 
        annot.end + "): " + annot.text);
    }
  }
}

As expected, the code successfully identifies the longest matched concepts in the list.

    [junit] (0,19): lung cancer symptoms
    [junit] (25,37): kidney cancer
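
For reference, folding this into the ConceptAnnotator amounts to running the same loop over the ConceptAnnotation instances in the CAS and removing the subsumed ones from the indexes, roughly along the lines of the sketch below. This is a hypothetical sketch against the standard UIMA JCas API (assuming a JCas-generated ConceptAnnotation type with the usual getBegin()/getEnd()), not the actual ConceptAnnotator code.

// Hypothetical post-processing sketch, not the actual ConceptAnnotator code.
// Assumes imports for org.apache.uima.jcas.JCas, org.apache.uima.cas.FSIterator,
// org.apache.uima.jcas.tcas.Annotation, org.apache.lucene.util.OpenBitSet
// and java.util.*.
private void removeSubsumedConcepts(JCas jcas) {
  OpenBitSet tset = new OpenBitSet(jcas.getDocumentText().length());
  tset.set(0, tset.length());
  // collect the concept annotations first, so we do not modify the
  // index while iterating over it
  List<ConceptAnnotation> annots = new ArrayList<ConceptAnnotation>();
  FSIterator<Annotation> it =
    jcas.getAnnotationIndex(ConceptAnnotation.type).iterator();
  while (it.hasNext()) {
    annots.add((ConceptAnnotation) it.next());
  }
  // sort the annotations, longest first
  Collections.sort(annots, new Comparator<ConceptAnnotation>() {
    @Override
    public int compare(ConceptAnnotation a1, ConceptAnnotation a2) {
      return (a2.getEnd() - a2.getBegin()) - (a1.getEnd() - a1.getBegin());
    }
  });
  long prevCardinality = tset.cardinality();
  for (ConceptAnnotation annot : annots) {
    OpenBitSet aset = new OpenBitSet(tset.length());
    aset.set(0, tset.length());
    aset.flip(annot.getBegin(), annot.getEnd());
    tset.intersect(aset);
    long cardinality = tset.cardinality();
    if (cardinality == prevCardinality) {
      annot.removeFromIndexes();  // subsumed by a longer match, drop it
      continue;
    }
    prevCardinality = cardinality;
  }
}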

The nice thing about this algorithm is that, apart from the initial sort, it makes a single pass through the concepts - roughly linear in the number of concepts, compared to the O(n²) pairwise checks of the naive approach I described above - at the cost of slightly more memory to hold the bitsets. What do you think? If you have suggestions for a different/better algorithm or data structure to achieve this, I would appreciate you letting me know.

8 comments (moderated to prevent spam):

Anonymous said...

Why is scala.actors.threadpool.Arrays used?

Sujit Pal said...

Sorry, it should have been java.util.Arrays (Eclipse must have put this in based on classpath availability; apparently the asList() method behaves similarly for both) - I will update the code.

Anonymous said...

Hello Sujit,
Brilliant piece of work. I religiously read your blog, lovely explanation and code, keep it up.

I often show your blog to junior developers to explain to them how code should be written. Hope you don't mind my using your blog for my personal gain :-)

I had a small question: if I were to grab all contiguous nouns in a text document, apply the concept you have delineated, and then apply some sort of thresholding, would it give me the theme of the text (along the same lines as OpenCalais)? Just verifying before I jump in and prototype; we also use Solr, so we have access to term vectors.

Ravi Kiran Bhaskar
Principal Software Engineer
The Washington Post
1150 15th Street NW, Washington, DC 20071

Sujit Pal said...

Thanks for the kind words, Ravi. Hopefully your junior developers are not led astray by my crazy talk :-).

To answer your question, I have been thinking of doing something similar, but to improve the quality of TGNI's concept mapping - basically pass the sentences through another filter which only passes through noun-phrases and noun tokens in non-noun phrases.

By theme I am assuming you mean a category? If so, I think your idea of using noun phrases to compute the theme is sound, although I would solve it slightly differently than what you propose (or my understanding of what you propose). What I would do would be to map the noun phrases to concepts in my taxonomy, then use the concepts (or principal concepts) to generate a TD vector for the document, then test it for similarity against centroid vectors (computed from vectors of representative documents for each theme). But that may just be my environment clouding my thinking - I am sure that there are other (better) ways.

Anonymous said...

Hello Sujit,
That is a neat idea; however, the idea of convergence towards a centroid assumes the distribution of terms is within certain boundaries, i.e. medical terminologies/products etc. can be depicted in a finite set (not many variations). I believe for news articles it is hard to converge, as you can refer to something without even mentioning it (blogs or commentaries etc). For example, an article about Stem Cell Research could be talking about "Bone Marrow Transplant", "Stem Cell", etc. (a very finite set); now throw in the US Government stand, GOP opposition, some senator suing Stem Cell Research activists, campaigning stunts etc. into the document text... soon we end up with multiple disparate/unconnected noun phrases that don't converge, and you also cannot assign an ontology category, as it now spans "Health Care", "Public Policy", "US Laws" etc., all of which could potentially be the themes/categories. That is where I am stuck, trying to make sense out of the phrases to find a dominant theme(s) for it.

Ravi Kiran Bhaskar
Principal Software Engineer
Washington Post Digital
1150 15th Street NW, Washington, DC 20071

Sujit Pal said...

Ah, I see, I guess I was right about me getting too comfortable in my space then :-). Now that I understand what themes are, I think you are on the right track - definitely worth exploring at least - the longest noun phrases would tend to reflect the theme(s) a document can represent, and searching by these strings would return "similar" documents along that theme. You may want to take a look at the Carrot API if you haven't already... it does something similar with the results at search time, and I think the Carrot guys are working on integrating with Solr.

Anonymous said...

Hi Sujit,
Thank you for your continued support. BTW, Carrot has been in Solr since 1.4. I haven't had time to play with it; will do in a while. I believe the Carrot integration in Solr is nascent in 1.4; however, beyond 3.1 it has been integrated fully. Still, only 3 algorithms are supported:

org.carrot2.clustering.lingo.LingoClusteringAlgorithm

org.carrot2.clustering.stc.STCClusteringAlgorithm

org.carrot2.clustering.kmeans.BisectingKMeansClusteringAlgorithm

Probably some Mahout integration would need to be done out of the loop.

Ravi Kiran Bhaskar
Principal Software Engineer
Washington Post Digital
1150 15th Street NW, Washington, DC 20071

Sujit Pal said...

Yes, sorry, and thanks for the correction - I guess I remembered seeing a set of messages from the Carrot devs on the Lucene/Solr lists and extrapolated from there. Best of luck with your theme extraction project.