Tuesday, August 23, 2011

Implementing Concordance with Lucene Span Queries

I recently needed to build a Named Entity Recognizer (NER) for our proprietary concept mapping/indexing platform to recognize and extract age group data from our document corpus. The approach I envisioned was to match specific age related patterns in the data and map them into specific age brackets.

I have also been reading the NLTK Book (free online version available here) lately, and came across a concept called concordance, which is basically a list of occurrences of a particular keyword with the context in the document corpus. It occurred to me that running a concordance on the document corpus for selected keywords would help me extract the patterns I needed.

Thinking through this some more, I remembered reading Accessing words around a positional match in Lucene by Grant Ingersoll, where he demonstrates the use of Span Queries to find collocated terms.

Since I already had an index whose body was indexed with term vectors, positions and offsets, I figured it would be easier to adapt Grant's code than set up NLTK to find the concordances for a few key terms. So this is what I did - the JUnit test below shows my version, which generates output very similar to that generated by NLTK's concordance() method.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
// Source: src/test/java/com/mycompany/tgni/utils/ConcordanceGeneratorTest.java
package com.mycompany.tgni.utils;

import java.io.File;
import java.io.FileWriter;
import java.io.PrintWriter;

import org.apache.commons.lang.StringUtils;
import org.apache.commons.lang.math.IntRange;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermVectorMapper;
import org.apache.lucene.index.TermVectorOffsetInfo;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;
import org.apache.lucene.store.FSDirectory;
import org.junit.Test;

public class ConcordanceGeneratorTest {

  private static final int NUM_CONTEXT_CHARS = 25;
  private static final String[] AGE_TERMS = new String[] {
    "aged", "age", "year"
  };
  
  @Test
  public void testGenerateConcordance() throws Exception {
    IndexSearcher searcher = new IndexSearcher(FSDirectory.open(
      new File("/path/to/my/index")));
    IndexReader reader = searcher.getIndexReader();
    PrintWriter writer = new PrintWriter(new FileWriter(
      new File("/tmp/concordance.txt")), true);
    for (String ageTerm : AGE_TERMS) {
      writer.println("==== concordance for term='" + ageTerm + "' ===="); 
      SpanTermQuery spt = new SpanTermQuery(new Term("body", ageTerm));
      Spans spans = spt.getSpans(reader);
      OffsetTermVectorMapper tvm = new OffsetTermVectorMapper();
      while (spans.next()) {
        Document doc = reader.document(spans.doc());
        String body = doc.get("body");
        tvm.start = spans.start();
        tvm.end = spans.end();
        reader.getTermFreqVector(spans.doc(), "body", tvm);
        String conc = StringUtils.substring(body, 
          tvm.range.getMinimumInteger() - NUM_CONTEXT_CHARS, 
          tvm.range.getMaximumInteger() + NUM_CONTEXT_CHARS);
        if (StringUtils.isNotEmpty(conc)) {
          writer.println(StringUtils.join(new String[] {
            "...", conc, "..."
          }));
        }
      }
    }
    searcher.close();
    writer.flush();
    writer.close();
  }
  
  private class OffsetTermVectorMapper extends TermVectorMapper {

    public int start;
    public int end;
    public IntRange range;
    
    @Override
    public void map(String term, int frequency, TermVectorOffsetInfo[] offsets,
        int[] positions) {
      for (int i = 0; i < positions.length; i++) {
        if (positions[i] >= start && positions[i] < end) {
          TermVectorOffsetInfo offset = offsets[i];
          range = new IntRange(offset.getStartOffset(), offset.getEndOffset());
        }
      }
    }

    @Override
    public void setExpectations(String field, int numTerms,
        boolean storeOffsets, boolean storePositions) {
      // NOOP
    }
  }
}

The code scans the index for spans containing the keywords "age", "aged" and "year", finds the character offsets of these spans, then returns substrings consisting of 25 character snippets on either side for context. Here is some sample output (truncated to 20 top results for brevity).

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
==== concordance for term='aged' ====
...y feel trapped within an aged body. Grief, a sense of ...
...ated deaths among people aged 65 years or older and ar...
...tal falls involve people aged 75 years or older-only 4...
...population. Among people aged 65 to 69 years, 1 of eve...
...racture, and among those aged 85 years or older, 1 fal...
...g Medicare beneficiaries aged &gt; or = 65 yearsUnite...
...or, ME) at 68 weeks and aged in our colony room at Ru...
...nalyzed with Age (middle-aged and young)  Treatment (...
...ntly a disease of middle-aged white males, with a medi...
..., but also occurs in the aged. It was the first human ...
...es. Among the population aged five years and above, HI...
... 121 consecutive adults (aged &gt;16 years) who underw...
...c abnormality. In middle-aged and older adults, in who...
...es more common in middle-aged patients withchronic act...
...ctures are mostly middle-aged and actively employed. I...
...d 426 non-twin siblings, aged 12-18 years, was recruit...
...y much involved. Edwina, aged 77, is a young mother an...
...eening for 94% of people aged 55 years and over....
... criteria (), were those aged 18 or older (male or fem...
...g 301 toddlers (children aged 1 to 2 years) with upper...
...
==== concordance for term='age' ====
...ric Depression Scale Old age isn't so bad when you co...
...n individual 65 years of age or older. Among such ind...
... social integration. The age range of this rapidly gr...
...er of people 65 years of age or older is projected to...
...llion people 65 years of age and older. This group wi...
...n older than 85 years of age represents the fastest-g...
...e older than 65 years of age. These statistics are ex...
... individuals 95 years of age and older. African Ameri...
...but only 8% of the older age groups. In 1986, most ol...
...n older than 65 years of age continue to work; 25% ar...
...ans will be 100 years of age or older by the year 205...
...y 2 million will be that age by the year 2080. Falls ...
... individuals 65 years of age and older. Two thirds of...
... in patients 85 years of age and older are caused by ...
...s older than 65 years of age have this fear. Structur...
...rises progressively with age, whereas diastolic press...
...d to half by 75 years of age. Vascular changes may al...
...he case in the geriatric age group. It is well docume...
...s older than 70 years of age do not have chest pain w...
...s older than 65 years of age use about 25% of all pre...
...
==== concordance for term='year' ====
...ntegrity, for at least 1 year after nerve injury. Irre...
... often has occurred by 2 years. Sensory cross-reinnerva...
...ed state for a period of years. Thus, even late reinner...
..., on the order of 2 to 3 years, may be able to restore ...
...ocols were used over the years the data was gathered. E...
...ient is an individual 65 years of age or older. Among s...
...ation spans more than 40 years. The world's geriatric p...
...ng at a rate of 2.5% per yearsignificantly faster tha...
... Americans older than 65 years increased from a little ...
... the number of people 65 years of age or older is proje...
...re 146 million people 65 years of age and older. This g...
...se to 232 million by the year 2020. A decreasing rate ...
...population older than 85 years of age represents the fa...
...cted to continue. By the year 2020, one fifth of the p...
...on will be older than 65 years of age. These statistics...
...3:1 among individuals 95 years of age and older. Africa...
...population older than 65 years. Approximately 12% of th...
...population older than 65 years of age continue to work;...
...on Americans will be 100 years of age or older by the y...
...s of age or older by the year 2050 and that nearly 2 m...
...

As you can see, this list is a good way to find common patterns that need to be extracted from the corpus. All you need is a bit of imagination to think of some good representative terms that cover most patterns you are likely to encounter in the corpus. You also need to scan the list manually to weed it out. It is now relatively simple to craft a number of regular expressions that capture the lower and upper bound (where available) of the date ranges and assign these to predefined age group blocks.

Of course, the downside is that it kind of puts the cart before the horse. The Age-Group NER is part of the indexing pipeline, but we need an index to be built without this filter first in order to get data to build this filter. The right way would probably be to generate the concordance data with something like NLTK. But it is relatively cheap resource-wise to build a plain old Lucene index from your corpus, so perhaps its not quite so bad.

8 comments (moderated to prevent spam):

Anonymous said...

Hi Sujit,

Nice article. Helping me a lot in a personal project where I'm doing some text analysis.

Not being very familiar with low level Lucene apis, I was wondering if you could point me to the proper direction for the following:

I wanted to restrict the span display for a specific document. I looked into the user groups for some idea - but didn't find anything in this regard. The closest idea I got was to use spans.skipTo() to skipto a specific document/article and then look for spans - but I'm not able to put this into proper code. If possible, can you please provide some pseudo-code in this regard?

Thanks
Manas

Sujit Pal said...

Hi Manas, haven't used spans.skipTo() myself, so dont have a code example available offhand. I'll check it out though and if I find a way to use it, I'll post it.

Unknown said...

hi Sujit, very nice and helpful article.i have apply your code to a small pdf document containing these 3 lines of text
Hello World by PDFBox
World is beautiful place
No world is better than home
i search for {"input","better", "beautiful" , "World","PDFBox" };
the result is
==== concordance for term='input' ====
==== concordance for term='better' ====
iful place
No world is better than home

==== concordance for term='beautiful' ====
rld by PDFBox
World is beautiful place
No world is bett
==== concordance for term='World' ====
==== concordance for term='PDFBox' ====
program can not find concordance for 'World' and 'PDFBox'
i applied the same program for different pdf book documents and find that the words start with capital letters are ignore in search

Sujit Pal said...

Thanks Asma. To answer your question, I suspect that your body field is being analyzed with an analyzer that has a lowercase tokenizer in it, so PDFBox is probably getting analyzed down to "pdfbox". Assuming you have Solr, you can see what "PDFBox" in "body" is analyzed to on its analysis page (otherwise you will have to write a little code) , and use the analyzed version in your AGE_TERMS list.

As an aside, in some cases such as "World", lowercasing can be a good thing since it will allow you to match "world" as well. Ideally (and perhaps somewhat unrealistically) your body analyzer (for concordance calculation purposes) should avoid stemming to avoid confusion, maybe just be a whitespace tokenizer and a lowercase filter.

Unknown said...

Thank you so much Sujit for such great article.

I want to achieve the same with Drupal. What kind of changes did you made to Solr schema.xml and solrconfig.xml to achieve this concordance functionality?

I am not familiar with Java.

thanks,
Moh

Sujit Pal said...

Hi Moh, you are welcome and thank you for the compliment. The only change to schema.xml is to set termVectors=true, termPositions=true and termOffsets=true for the field on which you want to build concordances on. The code here is Lucene, and I didn't do this in Solr because it was meant for backend use and I could just read a copy of the production index to get my data. However, if you want to have this available in Solr, you could just create a component that wraps this code.

breandan said...

Update, apparently this is an upcoming feature in Lucene: https://issues.apache.org/jira/browse/LUCENE-5317

Sujit Pal said...

Yay! I met Tim Allison couple of years ago at a Haystack conference and I knew he was working on it, but it is very cool to see it finally make it into Lucene.