Saturday, February 12, 2011

Solr, Porter Stemming and Stemming Exclusions

The Porter Stemmer is somewhat of a gold standard when it comes to stemming for search applications, allowing you to match inflected words in your query against similarly inflected words in your index. For example, a search for "abnormal" would return documents containing "abnormality", "abnormalities" and "abnormally", because all these words are stemmed to "abnorm" at index time, and "abnormal" is also stemmed to "abnorm" at query time. However, as Ted Dziuba points out, when stemming works, it is very, very good, but when it doesn't, the results can be pretty horrible.
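
To see this in action, you can run a few words through Lucene's PorterStemFilter directly. Here is a minimal sketch - the package locations assume the Lucene 4.0-dev module layout used later in this post, and may differ in your version:

// Source: (hypothetical) PorterDemo.java
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.en.PorterStemFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class PorterDemo {
  public static void main(String[] args) throws Exception {
    TokenStream ts = new PorterStemFilter(
      new LowerCaseFilter(Version.LUCENE_40,
        new WhitespaceTokenizer(Version.LUCENE_40,
          new StringReader("abnormal abnormality abnormalities abnormally"))));
    CharTermAttribute term = ts.getAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      // prints "abnorm" for each of the four input words
      System.out.println(term.toString());
    }
    ts.close();
  }
}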

There are other stemmers available, but as a paper by Hull comparing various stemmers over a group of queries (PDF Download) shows, there is no one true stemmer that consistently outperforms the others. Most people end up using one stemmer or another with exclusion sets, or, less commonly, modifying the stemmer rules directly.

In our case, we built a custom analyzer that checks an exclusion set (supplied as a flat file of words that should not be stemmed) and skips Porter stemming for any word found in it. Since Solr expects the analysis components to be supplied as a filter chain, our current approach wouldn't carry over directly. An alternative would have been to build this functionality into a custom token filter that invokes the Porter stemmer only if the token is not found in its exclusion set (subclassing the Porter stem TokenFilter is not possible, since TokenFilter implementations are all final). A sketch of such a filter is shown below.
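
For illustration, here is a rough sketch of what such a filter could look like. The class name is hypothetical, and it wraps the Snowball English stemmer (composition rather than subclassing, since the filter classes are final):

// Hypothetical filter that stems a term only if it is not in the exclusion set
import java.io.IOException;
import java.util.Set;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.tartarus.snowball.ext.EnglishStemmer;

public final class ExclusionAwareStemFilter extends TokenFilter {

  private final CharTermAttribute termAttr = addAttribute(CharTermAttribute.class);
  private final EnglishStemmer stemmer = new EnglishStemmer();
  private final Set<String> exclusions;

  public ExclusionAwareStemFilter(TokenStream input, Set<String> exclusions) {
    super(input);
    this.exclusions = exclusions;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (! input.incrementToken()) {
      return false;
    }
    String term = termAttr.toString();
    if (! exclusions.contains(term)) {
      // not excluded, so replace the term with its stemmed form
      stemmer.setCurrent(term);
      stemmer.stem();
      termAttr.setEmpty().append(stemmer.getCurrent());
    }
    return true;
  }
}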

Solr anticipates this use case and provides the SnowballPorterFilterFactory, which allows you to supply the exclusion set via an init-arg named "protected" (pointing to a file on the CLASSPATH). It does this by inserting a KeywordMarkerFilter in front of the language-specific Snowball filter. When a word is found in the exclusion set, it is marked as a keyword (using the KeywordAttribute). The Snowball filter checks whether the incoming term is marked as a keyword, and if so, does not stem it.
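
In schema.xml, wiring this up looks something like the fieldType declaration below (the fieldType name is illustrative, and protwords.txt is assumed to contain one excluded word per line, in the conf directory alongside schema.xml):

<fieldType name="textStemmed" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory"
            language="English" protected="protwords.txt"/>
  </analyzer>
</fieldType>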

To test the functionality, I wrote a little JUnit class. I know Solr has built-in regression testing, so it doesn't make much sense for me to re-test it, but my original intent was to compare the stemming of our custom analyzer against the one I was trying to set up as my default text analyzer in Solr (using off-the-shelf Solr components as far as possible), so this test was actually part of that effort, and I figured it couldn't hurt to check things out for myself. Here is the relevant snippet of the test case.

// Source: src/test/org/apache/solr/analysis/ext/PorterStemmerWithExclusionsTest.java
package org.apache.solr.analysis.ext;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.FilenameFilter;
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.apache.commons.lang.StringUtils;
import org.apache.commons.lang.math.NumberUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;
import org.apache.lucene.util.Version;
import org.apache.solr.analysis.SnowballPorterFilterFactory;
import org.apache.solr.common.ResourceLoader;
import org.dom4j.Document;
import org.dom4j.Element;
import org.dom4j.io.SAXReader;
import org.junit.Test;

public class PorterStemmerWithExclusionsTest {

  @Test
  public void testPorterExclusion() throws Exception {
   String[] inputTerms = new String[] {
     "mariner", "marin", "marketing", "market"
   };
   // without stemming exclusions
   System.out.println("==== without stemming exclusions =====");
   List<String> protectedWords = new ArrayList<String>();
   Analyzer analyzer0 = getAnalyzer(protectedWords);
   for (String inputTerm : inputTerms) {
     TokenStream input = analyzer0.tokenStream(
      "f", new StringReader(inputTerm));
     input.reset(); // the TokenStream contract requires reset() before consuming
     while (input.incrementToken()) {
       CharTermAttribute termAttribute = 
         input.getAttribute(CharTermAttribute.class);
       String outputTerm = termAttribute.toString();
       boolean isKeyword = 
         input.getAttribute(KeywordAttribute.class).isKeyword();
       System.out.println(inputTerm + "(keyword=" + isKeyword + ") => " + 
         outputTerm);
     }
   }
   // with stemming exclusions
   System.out.println("==== with stemming exclusions =====");
   protectedWords.add("marketing");
   protectedWords.add("mariner");
   Analyzer analyzer1 = getAnalyzer(protectedWords);
   for (String inputTerm : inputTerms) {
     TokenStream input = analyzer1.tokenStream(
       "f", new StringReader(inputTerm));
     input.reset(); // the TokenStream contract requires reset() before consuming
     while (input.incrementToken()) {
       CharTermAttribute termAttribute = 
         input.getAttribute(CharTermAttribute.class);
       String outputTerm = termAttribute.toString();
       boolean isKeyword = 
         input.getAttribute(KeywordAttribute.class).isKeyword();
       System.out.println(inputTerm + "(keyword=" + isKeyword + ") => " + 
         outputTerm);
     }
   }
  }

  private Analyzer getAnalyzer(final List<String> protectedWords) {
    return new Analyzer() {
      @Override 
      public TokenStream tokenStream(String fieldName, Reader reader) {
        TokenStream input = new WhitespaceTokenizer(Version.LUCENE_40, reader);
        input = new LowerCaseFilter(Version.LUCENE_40, input);
        SnowballPorterFilterFactory factory = new SnowballPorterFilterFactory();
        Map<String,String> args = new HashMap<String,String>();
        args.put("luceneMatchVersion", Version.LUCENE_40.name());
        args.put("language", "English");
         if (! protectedWords.isEmpty()) {
           // any non-null filename works here; the mock resource loader
           // below ignores it and returns the protected words directly
           args.put("protected", "not-a-null.txt");
         }
        factory.init(args);
        factory.inform(new LinesMockSolrResourceLoader(protectedWords));
        return factory.create(input);
      }
    };
  }

  private class LinesMockSolrResourceLoader implements ResourceLoader {
    List<String> lines;
    
    public LinesMockSolrResourceLoader(List<String> lines) {
      this.lines = lines;
    }

    @Override
    public List<String> getLines(String resource) throws IOException {
      return lines;
    }

    @Override
    public Object newInstance(String cname, String... subpackages) {
      return null;
    }

    @Override
    public InputStream openResource(String resource) throws IOException {
      return null;
    }
  }
}

The results of the test are shown below. As you can see, by default the Porter stemmer stems both "mariner" and "marin" to "marin", and both "market" and "marketing" down to "market". Once the exclusions are added, the results are in line with user expectations. These examples are taken from Ted Dziuba's post (referenced earlier); read it for the background if you haven't already.

    [junit] ==== without stemming exclusions =====
    [junit] mariner(keyword=false) => marin
    [junit] marin(keyword=false) => marin
    [junit] marketing(keyword=false) => market
    [junit] market(keyword=false) => market
    [junit] ==== with stemming exclusions =====
    [junit] mariner(keyword=true) => mariner
    [junit] marin(keyword=false) => marin
    [junit] marketing(keyword=true) => marketing
    [junit] market(keyword=false) => market

So Solr provides the necessary functionality to override your stemming algorithm with a list of exclusions. Of course, you still need to figure out which words to put in the exclusion set. One approach is to start with an empty set and add words to it by scanning and stemming queries from your search logs, as in the sketch below. You could also start by scanning your document set (or a representative sample of it) to find words that are mis-stemmed, add those to the exclusion set, and then scan search logs periodically to catch new occurrences.
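
Here is a minimal sketch of the first approach. The log location and format (one query per line) are assumptions, and getAnalyzer() is the method from the test class shown earlier:

// Sketch: map each stemmed form back to the original query terms; stems
// with more than one original term are candidates for manual review.
Analyzer analyzer = getAnalyzer(new ArrayList<String>());
Map<String,Set<String>> stemmedTerms = new HashMap<String,Set<String>>();
BufferedReader reader = new BufferedReader(new FileReader(
  new File("/path/to/query.log")));
String query;
while ((query = reader.readLine()) != null) {
  for (String term : query.toLowerCase().split("\\s+")) {
    TokenStream input = analyzer.tokenStream("f", new StringReader(term));
    input.reset();
    while (input.incrementToken()) {
      String stemmed = input.getAttribute(CharTermAttribute.class).toString();
      if (! stemmedTerms.containsKey(stemmed)) {
        stemmedTerms.put(stemmed, new HashSet<String>());
      }
      stemmedTerms.get(stemmed).add(term);
    }
  }
}
reader.close();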

I describe the second approach below. Documents that make up our index come to us in various formats - HTML (for crawled content), XML (from our content providers) and JSON (from our CMS). Basically, what we need to do is extract the text from these documents, feed it word by word into our analyzer, and collect the stemmed forms. We then create a report listing each stemmed form along with the various words that stemmed to it. Here is the code snippet (modelled as a JUnit test in the same class as the one shown above).

  ...
  @Test
  public void testFindCandidatesForExclusion() throws Exception {
    Map<String,Set<String>> stemmedTerms = new HashMap<String,Set<String>>();
    List<String> protectedWords = new ArrayList<String>();
    Analyzer analyzer = getAnalyzer(protectedWords);
    Set<String> stopSet = getStopSet();
    File[] xmls = new File("/path/to/xmlfiles").listFiles(
      new FilenameFilter() {
        @Override public boolean accept(File dir, String name) {
          return name.endsWith(".xml");
        }
      }
    );
    for (File xml : xmls) {
      System.out.println("Processing file: " + xml.getAbsolutePath());
      SAXReader saxReader = new SAXReader();
      saxReader.setValidation(false);
      Document xdoc = saxReader.read(xml);
      StringBuilder buf = new StringBuilder();
      extractTextFromElementAndChildren(xdoc.getRootElement(), buf);
      // break up the input by whitespace and punctuation (inside a
      // character class the | is literal and unnecessary)
      String[] words = buf.toString().split("[\\p{Punct}\\p{Space}]");
      for (String word : words) {
        if (NumberUtils.isNumber(word) || StringUtils.isEmpty(word)) {
          continue;
        }
        word = word.replaceAll("\"", "");
        word = word.replaceAll("[^\\p{ASCII}]", "");
        word = StringUtils.lowerCase(word);
        if (stopSet.contains(word)) {
          continue;
        }
        TokenStream input = analyzer.tokenStream("f", new StringReader(word));
        input.reset(); // honor the TokenStream contract
        while (input.incrementToken()) {
          CharTermAttribute termAttribute = 
            input.getAttribute(CharTermAttribute.class);
          String stemmed = termAttribute.toString();
          Set<String> originalWords = stemmedTerms.containsKey(stemmed) ?
            stemmedTerms.get(stemmed) : new HashSet<String>();
          originalWords.add(word);
          stemmedTerms.put(stemmed, originalWords);
        }
      }
    }
    // write this out
    PrintWriter writer = new PrintWriter(new FileWriter(
      new File("/tmp/stem-results.txt")));
    List<String> stemmedKeys = new ArrayList<String>();
    stemmedKeys.addAll(stemmedTerms.keySet());
    Collections.sort(stemmedKeys);
    for (String stemmedKey : stemmedKeys) {
      Set<String> originalWords = stemmedTerms.get(stemmedKey);
      if (originalWords.size() > 1) {
        writer.println(stemmedKey + " => " + 
          StringUtils.join(originalWords.iterator(), ", "));
      }
    }
    writer.flush();
    writer.close();
  }
  
  private void extractTextFromElementAndChildren(
      Element parent, StringBuilder buf) {
    String text = parent.getTextTrim();
    if (text.length() > 0) {
      buf.append(text).append(text.endsWith(".") ? " " : ". ");
    }
    List<Element> children = parent.elements();
    for (Element child : children) {
      extractTextFromElementAndChildren(child, buf);
    }
  }

  private Set<String> getStopSet() {
    Set<String> stopset = new HashSet<String>();
    try {
      BufferedReader reader = new BufferedReader(new FileReader(
        new File("/path/to/stopwords.txt")));
      String line;
      while ((line = reader.readLine()) != null) {
        if (StringUtils.isEmpty(line) || line.startsWith("#")) {
          continue;
        }
        stopset.add(line);
      }
      reader.close();
      return stopset;
    } catch (Exception e) {
      // if the stopword file is missing or unreadable, fall back to
      // whatever was read so far (possibly an empty set)
      return stopset;
    }
  }
  ...

After the run, I manually went through the report and picked out the ones I thought were mis-stemmed for my context. They are shown below - there are only 16 of them out of the 6,000-odd stems generated from this corpus (of around 100 documents), so the Porter stemmer did the right thing about 99.7% of the time here, which is quite impressive. Some of these are context-dependent: for example, "race" and "racing" mean different things in a health context, but probably not in a sports context.

aerob => aerobics, aerobic
aid => aids, aiding, aid, aided, aides
angl => angles, angling, angled, angle
anim => anim, animals, animation, animal
arm => arms, arm, armed
bitter => bittering, bitterly, bitter
coupl => couplings, coupling, coupled, couple, couples
dead => deadly, dead
depress => depressants, depress, depresses, depressed, depressions, depressing, depressant, depressive, depression
easter => easter, easterly
head => headings, headed, heads, head, heading
mortal => mortalities, mortal, mortality
physic => physical, physics, physically
plagu => plague, plagued
plumb => plumb, plumbing
race => racing, races, race

The process of selecting the mis-stemmed terms is manual and quite painful (somewhat like looking for a needle in a haystack), but I think the report can be whittled down somewhat by calculating the similarity between the meanings of the original words - for example, "depressed" and "depression" are probably close enough in meaning that we wouldn't care about them if they were the only words stemmed to "depress". I haven't tried that yet, but the approach seems feasible based on the paper describing Wordnet::Similarity by Pedersen, Patwardhan and Michelizzi (PDF Download). I will report my findings on this in a future post.
