Saturday, December 01, 2007

Spelling Checker using Lucene

My initial interest in spell checking algorithms started when I had to fix some bugs in the spell checking code at work over a year ago. We use Jazzy, a Java implementation of GNU Aspell, as our spell checking library. Jazzy uses a combination of Metaphone and Levenshtein distance (aka Edit distance) to match a misspelled word to a set of words in its dictionary.

An alternative approach to spell checking is the use of n-grams. Basically you break up your dictionary word into sequences of characters of size n, moving your pointer forward one character at a time, and store it in an index. When the user enters a misspelled word, you do the same thing with his input word, then match the ngrams generated to the ngrams in your dictionary. The idea is that a misspelled word would have only one or two or three characters misspelled, transposed or missing, so by comparing the n-grams, you would get matches to correctly spelled words that are close to it. This is a newer approach to spell checking, which has become feasible with the availability of open source search libraries such as Lucene, since it requires the functionality to tokenize your dictionary words and store them in an index for quick retrieval.

I read about this approach first on Tom White's article "Did you mean Lucene?" on java.net quite some time back, but I never completely understood it then, probably because I was new to Lucene. However, what I found attractive about it was its compactness and performance.

You see, the spell checking on most search sites is surfaced as a one line "Did you mean..." component at the top of your search results. Compared to the actual search results, its pretty much irrelevant, unless its suggestions are way off the mark, at which point it has entertainment value to the user and embarrassment value to the site serving the search results. So you want spell checking code thats quick to maintain and execute, even though its internals are likely to be much more complex than the code serving up the search results.

And besides, as a search company heavily invested in Lucene, it was probably more appropriate for us to be using a Lucene based approach, if one existed and was on par, performance and results-wise, with other approaches.

I recently came across this wiki page on the jakarta-lucene site, where the n-gram approach is more succinctly described. You can download the code implementing this algorithm from the original Bugzilla submission. It is written by Nicolas Maisonneuve (and inspired by David Spencer's code according to the class Javadocs). However, the code is written against the Lucene 1.x API and I was using Lucene 2.2, so it would not compile for me.

However, the algorithm is quite simple and intuitive. Basically, each dictionary word is tokenized and stored in an index with the following fields.

Field Description Example
word The dictionary word cancer
gram3 Multi-field of all 3-grams for the word can, anc, nce, cer
gram4 Multi-field of all 4-grams for the word canc, ance, ncer
start3 The first 3 characters of the word, for edge boosting can
end3 The last 3 characters of the word, for edge boosting cer
start4 The first 4 characters of the word, for edge boosting canc
end4 The last 4 characters of the word, for edge boosting ncer

The search query is a BooleanQuery consisting of all the fields (except word) of the index, OR'd together. If edge boosting is desired (on the assumption that people make fewer mistakes at the start and end of words), then the start* and end* field queries are given a higher boost in the query.

Since I had the algorithm anyway, I decided to rewrite the code using Nicolas's code as a guide. For one thing, quite some time has passed since he wrote it, and now n-gram tokenizers are available from the Lucene sandbox, so I could simply grab that instead of having to write one. Second, I found he had implemented a custom version of Edit Distance, where I preferred using the getLevenshtiensDistance() method from the Jakarta commons-lang project. Third, his method of loading up a dictionary was from an existing index, whereas I wanted something more generic. For these reasons, I built my own Lucene based Spell checker, the code for which is shown below:

I first built an NGramAnalyzer using the NGramTokenizer provided by the Lucene Analyzers sandbox project.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
// NGramAnalyzer.java
package com.mycompany.spellcheck;

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ngram.NGramTokenizer;

public class NGramAnalyzer extends Analyzer {

  private int minGram;
  private int maxGram;
  
  public NGramAnalyzer(int minGram, int maxGram) {
    this.minGram = minGram;
    this.maxGram = maxGram;
  }
  
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new NGramTokenizer(reader, minGram, maxGram);
  }
}

Next I built a SpellChecker which exposes two methods: addToDictionary() and suggestSimilar(). The first allows you to add words to the spelling dictionary (which is a Lucene index with the fields described in the SpellChecker wiki page), and the second returns a List of spelling suggestions for a given word. In addition (unlike Nicolas's code), it allows results that have Soundex equivalence, since sometimes the user's spelling may be wildly off the mark, but he knows what the word sounds like, and we want to catch that too.

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
// LuceneSpellChecker.java
package com.mycompany.spellcheck;

import java.io.File;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.commons.codec.EncoderException;
import org.apache.commons.codec.language.Soundex;
import org.apache.commons.lang.StringUtils;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Field.Index;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.store.FSDirectory;

public class LuceneSpellChecker {
  
  private String spellIndexDir;
  private boolean boostEdges;
  private int editDistanceCutoff = 3;
  private int soundexDiffCutoff = 4;
  
  private IndexWriter spellIndexWriter;
  private IndexSearcher spellIndexSearcher;
  
  private class AnalysisResult {
    public String input;
    public List<String> gram3s = new ArrayList<String>();
    public List<String> gram4s = new ArrayList<String>();
    public String start3 = "";
    public String start4 = "";
    public String end3 = "";
    public String end4 = "";
  }
  
  public void setBoostEdges(boolean boostEdges) {
    this.boostEdges = boostEdges;
  }
  
  public void setEditDistanceCutoff(int editDistanceCutoff) {
    this.editDistanceCutoff = editDistanceCutoff;
  }
  
  public void setSoundexDiffCutoff(int soundexDiffCutoff) {
    this.soundexDiffCutoff = soundexDiffCutoff;
  }
  
  public void setSpellIndexDir(String spellIndexDir) {
    this.spellIndexDir = spellIndexDir;
  }
  
  public void init() throws Exception {
    File spellIndexDirFile = new File(spellIndexDir);
    this.spellIndexWriter = new IndexWriter(
      FSDirectory.getDirectory(spellIndexDirFile), new StandardAnalyzer());
    this.spellIndexSearcher = new IndexSearcher(spellIndexDir);
  }
  
  public void destroy() throws Exception {
    spellIndexWriter.close();
    spellIndexSearcher.close();
  }

  public void flush() throws Exception {
    spellIndexWriter.optimize();
  }
  
  public void addToDictionary(String dictionaryEntry) throws IOException {
    AnalysisResult result = analyze(dictionaryEntry);
    Document doc = new Document();
    doc.add(new Field("word", result.input, Store.YES, Index.TOKENIZED));
    for (String gram3 : result.gram3s) {
      doc.add(new Field("gram3", gram3, Store.YES, Index.TOKENIZED));
    }
    for (String gram4 : result.gram4s) {
      doc.add(new Field("gram4", gram4, Store.YES, Index.TOKENIZED));
    }
    doc.add(new Field("start3", result.start3, Store.YES, Index.TOKENIZED));
    doc.add(new Field("start4", result.start4, Store.YES, Index.TOKENIZED));
    doc.add(new Field("end3", result.end3, Store.YES, Index.TOKENIZED));
    doc.add(new Field("end4", result.end4, Store.YES, Index.TOKENIZED));
    spellIndexWriter.addDocument(doc);
  }
  
  public List<String> suggestSimilar(String input, int maxSuggestions) 
      throws IOException, EncoderException {
    AnalysisResult result = analyze(input);
    BooleanQuery query = new BooleanQuery();
    addGramQuery(query, "gram3", result.gram3s);
    addGramQuery(query, "gram4", result.gram4s);
    addEdgeQuery(query, "start3", result.start3);
    addEdgeQuery(query, "end3", result.end3);
    addEdgeQuery(query, "start4", result.start4);
    addEdgeQuery(query, "end4", result.end4);
    Set<String> words = new HashSet<String>();
    Hits hits = spellIndexSearcher.search(query);
    int numHits = hits.length();
    for (int i = 0; i < numHits; i++) {
      Document doc = hits.doc(i);
      String suggestion = doc.get("word");
      // if the suggestion is the same as the input, ignore it
      if (suggestion.equalsIgnoreCase(input)) {
        continue;
      }
      // if the suggestion is within the specified levenshtein's distance, include it
      if (StringUtils.getLevenshteinDistance(input, suggestion) < editDistanceCutoff) {
        words.add(suggestion);
      }
      // if they sound the same, include it
      Soundex soundex = new Soundex();
      if (soundex.difference(input, suggestion) >= soundexDiffCutoff) {
        words.add(suggestion);
      }
    }
    List<String> wordlist = new ArrayList<String>();
    wordlist.addAll(words);
    return wordlist.subList(0, Math.min(maxSuggestions, wordlist.size()));
  }

  private AnalysisResult analyze(String input) throws IOException {
    AnalysisResult result = new AnalysisResult();
    result.input = input;
    Analyzer analyzer = new NGramAnalyzer(3, 4);
    TokenStream tokenStream = analyzer.tokenStream("dummy", new StringReader(input));
    Token token;
    while ((token = tokenStream.next()) != null) {
      String text = token.termText();
      if (text.length() == 3) {
        result.gram3s.add(text);
      } else if (text.length() == 4) {
        result.gram4s.add(text);
      } else {
        continue;
      }
    }
    result.start3 = input.substring(0, Math.min(input.length(), 3));
    result.start4 = input.substring(0, Math.min(input.length(), 4));
    result.end3 = input.substring(Math.max(0, input.length() - 3), input.length());
    result.end4 = input.substring(Math.max(0, input.length() - 4), input.length());
    return result;
  }
  
  private void addGramQuery(BooleanQuery query, String fieldName, List<String> grams) {
    for (String gram : grams) {
      query.add(new TermQuery(new Term(fieldName, gram)), Occur.SHOULD);
    }
  }
  
  private void addEdgeQuery(BooleanQuery query, String fieldName, String fieldValue) {
    TermQuery start3Query = new TermQuery(new Term(fieldName, fieldValue));
    if (boostEdges) {
      start3Query.setBoost(2.0F);
    }
    query.add(start3Query, Occur.SHOULD);
  }
}

Finally, here is the test case which a client for the SpellChecker could use. It is basically the JUnit test I used to test the code above. The test opens a file of pairs of misspelled word and the expected correct spelling and runs through it. The first test puts the correct word into the index, and the next searches for them. The SpellChecker class is instantiated in the test, but could also be accessed from a Spring BeanFactory, since the SpellChecker class is set up to use setter injection.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
// LuceneSpellCheckerTest.java
package com.mycompany.spellcheck;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.util.List;

import org.apache.commons.io.FileUtils;
import org.apache.commons.lang.StringUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.junit.After;
import org.junit.Before;
import org.junit.BeforeClass;
import org.junit.Test;

public class LuceneSpellCheckerTest {

  private static final String INPUT_FILE = "src/main/resources/my_tests.txt";
  private static final String INDEX_DIR = "/tmp/spellindex";
  private static final Log LOG = LogFactory.getLog(LuceneSpellCheckerTest.class);

  private LuceneSpellChecker spellChecker;
  
  @BeforeClass
  public static void setUpBeforeClass() throws Exception {
    File indexDirFile = new File(INDEX_DIR);
    if (indexDirFile.exists()) {
      FileUtils.forceDelete(new File(INDEX_DIR));
    }
  }
  
  @Before
  public void setUp() throws Exception {
    spellChecker = new LuceneSpellChecker();
    spellChecker.setSpellIndexDir(INDEX_DIR);
    spellChecker.setBoostEdges(true);
    spellChecker.setEditDistanceCutoff(3);
    spellChecker.setSoundexDiffCutoff(4);
    spellChecker.init();
  }
  
  @After
  public void tearDown() throws Exception {
    spellChecker.destroy();
  }
  
  @Test
  public void testIndexing() throws Exception {
    BufferedReader reader = new BufferedReader(
      new InputStreamReader(new FileInputStream(INPUT_FILE)));
    String line;
    while ((line = reader.readLine()) != null) {
      if (line.startsWith("#")) {
        continue;
      }
      String[] parts = StringUtils.split(line, "=");
      if (parts.length == 2) {
        String correctSpelling = StringUtils.split(line, "=")[1];
        if (StringUtils.isNotBlank(correctSpelling)) {
          LOG.debug("Adding to index:" + correctSpelling);
          spellChecker.addToDictionary(correctSpelling);
        }
      }
    }
    spellChecker.flush();
    reader.close();
  }

  @Test
  public void testSearching() throws Exception {
    BufferedReader reader = new BufferedReader(
      new InputStreamReader(new FileInputStream(INPUT_FILE)));
    String line;
    while ((line = reader.readLine()) != null) {
      if (line.startsWith("#")) {
        continue;
      }
      String wrongSpelling = StringUtils.split(line, "=")[0];
      List<String> suggestions = spellChecker.suggestSimilar(wrongSpelling, 2);
      LOG.debug("suggestion[" + wrongSpelling + "]=" + suggestions);
    }
    reader.close();
  }
}

The above code is not as complete as Nicolas's, since it does not have the code to do popularity based boosting in suggestSimilar(). Essentially, this means that if the spell checker has two or more suggestions, the caller can specify that the suggestions be returned in order of decreasing popularity in the index. Internally, the code would call docFreq() for each of the suggestions and reorder accordingly.

From my minimal testing, it appears that multi-word dictionary entries are handled as well as single-word entries, but I cannot say for sure. If this is true, then it would make sense to have the popularity functionality. With Jazzy this is not the case, so the application has to deal with figuring out how to get the "best" multi-word suggestion based on a cartesian join of all single-word suggestions. If the same holds true for this approach, then it may make sense to actually apply the popularity metric on the combination of single word suggestions instead.

19 comments (moderated to prevent spam):

Debasish said...

I am sure u have seen this .. one of the most beautiful spell correctors ever written in 21 lines of Python code .. by one of the master programmers .. http://norvig.com/spell-correct.html.

Cheers.
- Debasish

Sujit Pal said...

Hi Debasish, as a matter of fact I did not, thanks for the link.

Anonymous said...

Hi,

Have you taken a look at JMySpell?
http://jmyspell.javahispano.net/index_en.html

It uses openoffice dictionaries and seems to work much better than Jazzy.

daniel said...

BTW: No need to download the code from the old issue, the spellchecker is an official part of Lucene (contrib/spellchecker).

Sujit Pal said...

Hi, I haven't looked at JMySpell, thanks for the link, will take a look at it. We are using Jazzy with a custom dictionary and I have found single word suggestions from Jazzy to be quite good. However, what I am really after is the ability to provide good spelling suggestions for multi-word strings.

Sujit Pal said...

Hi Daniel, thanks for the comment. I do see the org.apache.lucene.search.spell package in the Lucene 2.0.0 javadocs now, I had missed it earlier...

pavel.veinik said...

Sujit,

I discovered your blog, and many articles are very useful for me.

Thank you very much,
Pavel Veinik.

Sujit Pal said...

Thanks Pavel, you are welcome.

Vijay said...

Hi Salmon ,i found this blog aritcle is very excellent. nice to see more from you

Sujit Pal said...

Thanks Vijay.

Jian said...

I've done something similar and rolled my own did-you-mean program using bigram/trigrams approach. But, what I found out was that, for single word queries, it was ok. But for multi-word queries, due to the sheer volume of the grams used, the performance degraded.

So, I am more thinking towards a google approach, rather than using grams based segmentation technique, just use the query log data and some edit distance algorithm instead.

Sujit Pal said...

Yes, we did a proof of concept also based on the ideas in this post, but had to go back to our old dictionary based approach because of performance issues. We handle multi-word suggestions using the dictionary by doing a vector products each word+suggestion vector and matching against our taxonomy - still expensive, but works adequately.

Santhoshkumar said...

Hi Sujit,
your examples are very useful to me. I need a collocation extraction program in java. I searched in net.only algorithms are available. I need a java implementation. Please can you help me in this?

Sujit Pal said...

Hi Santhoshkumar, if you are asking for code, then no, sorry. I am currently playing with some other shiny object, dont have much time left after that :-).

David Been said...

Has anyone created input files in other languages for the index to use in spell suggestions

Sujit Pal said...

Hi David, to create an input file you could look at your own corpus (presumably in a foreign language), tokenize it into words, group them by count, remove words whose count is below a threshold, and use that as input to your dictionary.

Sujit Pal said...

Thanks for the comment Vocab Monk, if everyone followed your suggestion, there would be no need for automated spell checking :-). Although I can see situations where too much vocabulary on the spelling suggester end (without adequate context) can lead to bad spelling suggestions because of ambiguity.

amit said...

Mr.Sujit, thanks for the wonderful article. I have a doubt can you please help me solve it?
In your example in this article you have taken a case of only one field "dummy" but had there been many fields how would it work? Can you give me some pointers? For multi query search we use MultiFieldQueryParser how can it be used?

Sujit Pal said...

Thanks Amit. The spellcheck index is a separate index from the rest of your corpus, all it has are grams for individual words (from a dictionary or from your corpus), so you will only have this one field. If for some reason you had multiple fields (for example, say you wanted to combine soundex with n-gram in a OR query, you would simply change your query to an OR BooleanQuery so either clause matching would cause the suggestion to come up.