Spelling correction is one of those things that people don't notice when it works well. In web-based search applications, it usually manifests as a little "Did you mean: xxx?" component that appears when the application does not recognize the term being queried. Despite its relatively low visibility, however, users certainly do notice when the suggestion is incorrect.
There are various approaches to spelling correction. One popular approach is to use a Lucene index of character n-grams built from the terms in the index - I have written previously about my implementation of this approach.
Another popular approach is to compute edit costs and return from a dictionary the words that are within a predefined edit cost of the misspelt word. This is the approach used by GNU Aspell and its Java cousin Jazzy, which is what I use here.
Both these approaches work very well for single words, so they are quite usable for applications such as word processors, where you need to flag and suggest alternatives for misspelt words. In a typical search page, however, a user can type in a multi-word phrase with one or more words misspelt. The job of the spelling corrector, in this case, is to tie the best suggestions together so that the corrected phrase makes sense in the context of the original phrase - a much harder problem, as you will no doubt agree.
Various approaches to solving this have been suggested and tried - I noticed one such suggestion almost by accident here on the Aspell TODO list, which set me thinking about this whole thing again.
Thinking about this suggestion a bit, I realized that a much simpler way would be to compute conditional probabilities between consecutive words in the phrase, and then consider the "best" suggestion to be the one which connects the words via the most probable path, i.e. the path with the highest sum of conditional probabilities. This effectively boils down to a graph theory problem: computing the shortest path in a weighted directed graph. This post describes an implementation of this idea.
Consider Knud Sorensen's example from the Aspell TODO list. Two misspelt phrases and their corrected forms are shown below. As you can see, the correct form of the misspelt word 'fone' differs based on the other words in the phrase.
a fone number => a phone number
a fone dress  => a fine dress
The list below shows the suggestions returned by Jazzy for the misspelt word 'fone', ordered by cost, i.e. the first suggestion is the one with the lowest edit cost from the misspelt word. Notice that neither 'phone' nor 'fine' is the first result. The Java code for the CLI that I built for quickly looking up suggestions appears later in this post.
sujit@sirocco:~$ mvn -o exec:java \
-Dexec.mainClass=com.mycompany.myapp.Shell \
-Dexec.args=true
jazzy> fone
foe, one, fine, bone, zone, fore, lone, fond, font, hone, cone, gone,
none, done, tone, fen, foes, on, fan, fin, fun, money, phone, son, fee,
for, fog, fox, fined, found, fount, fence, fines, finer, honey, non,
don, ton, ion, yon, vane, vine, June, gene, mane, mine, mono, bane,
bony, pane, pine, pony, sane, sine, fade, fate, food, foot, vote, face,
foci, fuse, fare, fire, four, free, fame, foam, fume, file, flee, foil,
fool, foul, fowl, fake, lane, line, fife, five, fogs, find, fund, fans,
fins, fang, cane, nine, dine, dune, tune, wane, wine
jazzy> \q
The approach I propose is to construct a graph of the input phrase ('a fone book'), adding vertices corresponding to each original word and each of its spelling suggestions, as shown below. The weight of an edge represents the conditional probability of its target B following its source A, i.e. P(B|A). The numbers are all cooked up for this example, but I describe a way to compute them further down. I still need to populate my database tables with real data; I will describe that in a subsequent post.
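A rough text sketch of this graph (only a few of the suggestion vertices for 'fone' are shown; 'a' is too short to correct and 'book' is recognized by the dictionary, so neither fans out into suggestions):

START --> a --+--> fone  --+--> book --> END
              +--> phone --+
              +--> fine  --+
              `--> ...   --'

Every edge from 'a' into the suggestion layer, and from each suggestion into 'book', carries its own conditional probability.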
What you will immediately notice is that we cannot prune the graph as we encounter each word in the phrase, i.e. we cannot simply select the most likely suggestion as we parse each word, because the "best" path is the most probable path through the entire graph from the start vertex to the finish vertex.
Since we are going to use Dijkstra's shortest path algorithm to find the shortest path (aka graph geodesic) through the graph, and Dijkstra's algorithm minimizes a sum of non-negative edge costs, we need to convert the edge probabilities into a weight function w(A,B), like so:
w(A,B) = 1 - P(B|A)

where: w(A,B) = cost to traverse the edge from vertex A to vertex B
       P(B|A) = probability of the occurrence of B given A

Since every path from START to END has the same number of edges (one per word position), minimizing the sum of the (1 - P) terms along a path is equivalent to maximizing the sum of the P terms, so the minimum cost path is exactly the most probable one.
The probability P(B|A) can be computed as the number of times A and B co-occur in our dataset divided by the number of times word A occurs in the dataset, as shown below:
If the occurrences of A and B are dependent:

  P(B ∩ A) = P(B|A) * P(A)

so:

  P(B|A) = P(B ∩ A) / P(A)
         = N(B ∩ A) / N(A)
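As a worked example, using the same cooked-up counts that appear in the tables further down: if 'a' occurs 100 times in the dataset and the pair ('a', 'phone') co-occurs 13 times, then:

  P(phone|a) = N(phone ∩ a) / N(a) = 13 / 100 = 0.13
  w(a,phone) = 1 - P(phone|a)      = 1 - 0.13 = 0.87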
To get experimental values for N(A) and N(B ∩ A), we will need to extract the actual search terms entered by users from our Apache access logs and populate the following tables (a sketch of a possible loader follows the table definitions):
mysql> desc occur_a;
+---------+---------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------+---------------+------+-----+---------+-------+
| word | varchar(32) | NO | PRI | NULL | |
| n_words | mediumint(11) | NO | | NULL | |
+---------+---------------+------+-----+---------+-------+
mysql> desc occur_ab;
+---------+---------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------+---------------+------+-----+---------+-------+
| word_a | varchar(32) | NO | PRI | NULL | |
| word_b | varchar(32) | NO | PRI | NULL | |
| n_words | mediumint(11) | NO | | NULL | |
+---------+---------------+------+-----+---------+-------+
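Mining the access logs to populate these tables is the subject of a later post, but here is a minimal, hypothetical sketch of what such a loader could look like. It assumes the search phrases have already been extracted from the logs into a plain text file, one phrase per line; the class name OccurrenceLoader and the update-then-insert logic are my own placeholders, not code from the project. Word pairs are alphabetized before storage, matching the lookup convention used by the corrector code shown later.

// OccurrenceLoader.java - hypothetical sketch, not part of the project code
package com.mycompany.myapp;
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Arrays;
import javax.sql.DataSource;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.datasource.DriverManagerDataSource;
public class OccurrenceLoader {
  private final JdbcTemplate jdbcTemplate;
  public OccurrenceLoader() {
    DataSource dataSource = new DriverManagerDataSource(
      "com.mysql.jdbc.Driver", "jdbc:mysql://localhost:3306/spelldb",
      "foo", "secret");
    this.jdbcTemplate = new JdbcTemplate(dataSource);
  }
  public void load(String phraseFile) throws Exception {
    BufferedReader reader = new BufferedReader(new FileReader(phraseFile));
    String line;
    while ((line = reader.readLine()) != null) {
      String[] words = line.trim().toLowerCase().split("\\s+");
      // single-word counts feed occur_a, i.e. N(A)
      for (String word : words) {
        incrementA(word);
      }
      // consecutive-pair counts feed occur_ab, i.e. N(B ∩ A); the pair is
      // alphabetized before storage, matching the corrector's lookup
      for (int i = 0; i < words.length - 1; i++) {
        String[] pair = new String[] {words[i], words[i + 1]};
        Arrays.sort(pair);
        incrementAB(pair[0], pair[1]);
      }
    }
    reader.close();
  }
  private void incrementA(String word) {
    int updated = jdbcTemplate.update(
      "update occur_a set n_words = n_words + 1 where word = ?",
      new Object[] {word});
    if (updated == 0) {
      jdbcTemplate.update(
        "insert into occur_a (word, n_words) values (?, 1)",
        new Object[] {word});
    }
  }
  private void incrementAB(String wordA, String wordB) {
    int updated = jdbcTemplate.update(
      "update occur_ab set n_words = n_words + 1 " +
      "where word_a = ? and word_b = ?",
      new Object[] {wordA, wordB});
    if (updated == 0) {
      jdbcTemplate.update(
        "insert into occur_ab (word_a, word_b, n_words) values (?, ?, 1)",
        new Object[] {wordA, wordB});
    }
  }
  public static void main(String[] args) throws Exception {
    new OccurrenceLoader().load(args[0]);
  }
}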
Without any data in the database tables, the code degrades gracefully - it just returns what we typed in, as you can see below. This happens because we always insert the original word at the first position of the suggestion list returned by Jazzy, so the "best" among equally weighted suggestions is the one that comes first. As before, the Java code for this CLI is provided later in the article.
sujit@sirocco:~$ mvn -o exec:java \
-Dexec.mainClass=com.mycompany.myapp.Shell
spell-check> a fone book
a fone book
spell-check> a fone dress
a fone dress
Once some (still cooked-up) occurrence data is entered manually into these tables...
mysql> select * from occur_a;
+-------+---------+
| word | n_words |
+-------+---------+
| a | 100 |
| book | 43 |
| dress | 10 |
| fine | 12 |
| phone | 18 |
+-------+---------+
5 rows in set (0.00 sec)
mysql> select * from occur_ab;
+--------+--------+---------+
| word_a | word_b | n_words |
+--------+--------+---------+
| a | fine | 8 |
| a | phone | 13 |
| book | phone | 12 |
| dress | fine | 7 |
+--------+--------+---------+
4 rows in set (0.00 sec)
...our spelling corrector behaves more intelligently. The beauty of this approach is that its intelligence can be localized to your industry: if you were in the clothing business, for example, your search terms would be more likely to include fine dresses than phone books, and therefore P(dress|fine) would be higher than P(dress|phone).
sujit@sirocco:~$ mvn -o exec:java \
-Dexec.mainClass=com.mycompany.myapp.Shell
spell-check> a fone book
a phone book
spell-check> a fone dress
a fine dress
spell-check> fone book
phone book
spell-check> fone dress
fine dress
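To see why 'a fone dress' corrects to 'a fine dress', trace the two main candidate paths through the graph using the cooked-up counts above. The START -> a edge and the -> END edges are common to all paths, so only the differing edges matter:

  via 'fine':   w(a,fine)      = 1 - 8/100  = 0.92
                w(fine,dress)  = 1 - 7/12   ≈ 0.42   (row (dress,fine) in occur_ab)
                path cost ≈ 1.34

  via 'phone':  w(a,phone)     = 1 - 13/100 = 0.87
                w(phone,dress) = 1.0         (no (dress,phone) row, so N(B ∩ A) = 0)
                path cost = 1.87

Any other suggestion of 'fone' has no co-occurrence data at all, so both its edges cost 1.0 and its path costs 2.0. Dijkstra therefore picks the path through 'fine', even though 'phone' is the more probable successor to 'a' on its own.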
Here is the code for the actual spelling corrector. It uses Jazzy for its word suggestions, and JGraphT to construct the graph and run Dijkstra's shortest path algorithm (included in the JGraphT library) to find the most likely path based on word co-occurrence probabilities.
// Source: src/main/java/com/mycompany/myapp/SpellingCorrector.java
package com.mycompany.myapp;
import java.io.File;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.sql.DataSource;
import org.apache.commons.lang.StringUtils;
import org.jgrapht.alg.DijkstraShortestPath;
import org.jgrapht.graph.ClassBasedEdgeFactory;
import org.jgrapht.graph.DefaultWeightedEdge;
import org.jgrapht.graph.SimpleDirectedWeightedGraph;
import org.springframework.dao.IncorrectResultSizeDataAccessException;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.datasource.DriverManagerDataSource;
import com.swabunga.spell.engine.SpellDictionary;
import com.swabunga.spell.engine.SpellDictionaryHashMap;
import com.swabunga.spell.engine.Word;
/**
* Uses probability of word-collocations to determine best phrases to be
* returned from a SpellingCorrector for multi-word misspelt queries.
*/
public class SpellingCorrector {
private static final int SCORE_THRESHOLD = 200;
private static final String DICTIONARY_FILENAME =
"src/main/resources/english.0";
private long occurASumWords = 1L;
private JdbcTemplate jdbcTemplate;
@SuppressWarnings("unchecked")
public String getSuggestion(String input) throws Exception {
// initialize Jazzy spelling dictionary
SpellDictionary dictionary = new SpellDictionaryHashMap(
new File(DICTIONARY_FILENAME));
// initialize database connection
DataSource dataSource = new DriverManagerDataSource(
"com.mysql.jdbc.Driver", "jdbc:mysql://localhost:3306/spelldb",
"foo", "secret");
jdbcTemplate = new JdbcTemplate(dataSource);
occurASumWords = jdbcTemplate.queryForLong(
"select sum(n_words) from occur_a");
if (occurASumWords == 0L) {
// just a hack to prevent divide by zero for empty db
occurASumWords = 1L;
}
// set up graph and create root vertex
final SimpleDirectedWeightedGraph<SuggestedWord,DefaultWeightedEdge> g =
new SimpleDirectedWeightedGraph<SuggestedWord,DefaultWeightedEdge>(
new ClassBasedEdgeFactory<SuggestedWord,DefaultWeightedEdge>(
DefaultWeightedEdge.class));
SuggestedWord startVertex = new SuggestedWord("START", 0);
g.addVertex(startVertex);
// set up variables to hold results of previous iteration
List<SuggestedWord> prevVertices =
new ArrayList<SuggestedWord>();
List<SuggestedWord> currentVertices =
new ArrayList<SuggestedWord>();
int tokenId = 1;
prevVertices.add(startVertex);
// parse the string
String[] tokens = input.toLowerCase().split("[ -]");
for (String token : tokens) {
// build up spelling suggestions for individual word
List<String> possibleTokens = new ArrayList<String>();
if (token.trim().length() <= 2) {
// people usually don't make mistakes in words of 2 characters
// or less, just pass such words back unchanged
possibleTokens.add(token);
} else if (dictionary.isCorrect(token)) {
// no need to find suggestions, token is recognized as valid spelling
possibleTokens.add(token);
} else {
possibleTokens.add(token);
List<Word> words =
dictionary.getSuggestions(token, SCORE_THRESHOLD);
for (Word word : words) {
possibleTokens.add(word.getWord());
}
}
// populate the graph with these values
for (String possibleToken : possibleTokens) {
SuggestedWord currentVertex =
new SuggestedWord(possibleToken, tokenId);
g.addVertex(currentVertex);
currentVertices.add(currentVertex);
for (SuggestedWord prevVertex : prevVertices) {
DefaultWeightedEdge edge = new DefaultWeightedEdge();
double weight = computeEdgeWeight(
prevVertex.token, currentVertex.token);
g.setEdgeWeight(edge, weight);
g.addEdge(prevVertex, currentVertex, edge);
}
}
prevVertices.clear();
prevVertices.addAll(currentVertices);
currentVertices.clear();
tokenId++;
} // for token : tokens
// finally set the end vertex
SuggestedWord endVertex = new SuggestedWord("END", tokenId);
g.addVertex(endVertex);
for (SuggestedWord prevVertex : prevVertices) {
DefaultWeightedEdge edge = new DefaultWeightedEdge();
g.setEdgeWeight(edge, 1.0D);
g.addEdge(prevVertex, endVertex, edge);
}
// find shortest path between START and END
DijkstraShortestPath<SuggestedWord,DefaultWeightedEdge> dijkstra =
new DijkstraShortestPath<SuggestedWord, DefaultWeightedEdge>(
g, startVertex, endVertex);
List<DefaultWeightedEdge> edges = dijkstra.getPathEdgeList();
List<String> bestMatch = new ArrayList<String>();
for (DefaultWeightedEdge edge : edges) {
if (startVertex.equals(g.getEdgeSource(edge))) {
// skip the START vertex
continue;
}
bestMatch.add(g.getEdgeSource(edge).token);
}
return StringUtils.join(bestMatch.iterator(), " ");
}
private Double computeEdgeWeight(String prevToken, String currentToken) {
if (prevToken.equals("START")) {
// this is the first word, return 1-P(B)
try {
double nb = (Double) jdbcTemplate.queryForObject(
"select n_words/? from occur_a where word = ?",
new Object[] {occurASumWords, currentToken}, Double.class);
return 1.0D - nb;
} catch (IncorrectResultSizeDataAccessException e) {
// in case there is no match, then we should return weight of 1
return 1.0D;
}
}
double na = 0.0D;
try {
na = (Double) jdbcTemplate.queryForObject(
"select n_words from occur_a where word = ?",
new String[] {prevToken}, Double.class);
} catch (IncorrectResultSizeDataAccessException e) {
// no match, should be 0
na = 0.0D;
}
if (na == 0.0D) {
// if N(A) == 0, A does not exist, and hence N(A ^ B) == 0 too,
// so we guard against a DivideByZero and an additional useless
// computation.
return 1.0D;
}
// for the A^B lookup, alphabetize so A is lexically ahead of B
// since that is the way we store it in the database
String[] tokens = new String[] {prevToken, currentToken};
Arrays.sort(tokens); // alphabetize before lookup
double nba = 0.0D;
try {
nba = (Double) jdbcTemplate.queryForObject(
"select n_words from occur_ab where word_a = ? and word_b = ?",
tokens, Double.class);
} catch (IncorrectResultSizeDataAccessException e) {
// no result found so N(B^A) = 0
nba = 0.0D;
}
return 1.0D - (nba / na);
}
/**
* Holder for the graph vertex information.
*/
private class SuggestedWord {
public String token;
public int id;
public SuggestedWord(String token, int id) {
this.token = token;
this.id = id;
}
@Override
public int hashCode() {
return toString().hashCode();
}
@Override
public boolean equals(Object obj) {
if (!(obj instanceof SuggestedWord)) {
return false;
}
SuggestedWord that = (SuggestedWord) obj;
return (this.id == that.id &&
this.token.equals(that.token));
}
@Override
public String toString() {
return id + ":" + token;
}
};
}
The CLI proved to be very useful for checking out assumptions quickly while I was developing the algorithm. It's quite simple - it just wraps the functionality within a JLine ConsoleReader. I include it here for completeness and to illustrate how easy it is to build. Depending on the presence of a command line argument, it functions either as an interface to the Jazzy dictionary or to the phrase spelling corrector described in this post.
// Source: src/main/java/com/mycompany/myapp/Shell.java
package com.mycompany.myapp;
import java.io.File;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import jline.ConsoleReader;
import org.apache.commons.lang.StringUtils;
import com.swabunga.spell.engine.SpellDictionary;
import com.swabunga.spell.engine.SpellDictionaryHashMap;
import com.swabunga.spell.engine.Word;
public class Shell {
private final int SPELL_CHECK_THRESHOLD = 250;
public Shell() throws Exception {
ConsoleReader reader = new ConsoleReader(
System.in, new PrintWriter(System.out));
SpellingCorrector spellingCorrector = new SpellingCorrector();
for (;;) {
String line = reader.readLine("spell-check> ");
if ("\\q".equals(line)) {
break;
}
System.out.println(spellingCorrector.getSuggestion(line));
}
}
// === this is really for exploratory testing purposes ===
/**
* Wrapper over Jazzy native spell checking functionality.
* @param b always true (to differentiate from the no-arg constructor).
* @throws Exception if one is thrown.
*/
public Shell(boolean b) throws Exception {
ConsoleReader reader = new ConsoleReader(
System.in, new PrintWriter(System.out));
SpellDictionary dictionary = new SpellDictionaryHashMap(
new File("src/main/resources/english.0"));
for (;;) {
String line = reader.readLine("jazzy> ");
if ("\\q".equals(line)) {
break;
}
String suggestions = suggest(dictionary, line);
System.out.println(suggestions);
}
}
/**
* Looks up single words from Jazzy's English dictionary.
* @param dictionary the dictionary object to look up.
* @param incorrect the suspected misspelt word.
* @return if the incorrect word is correct according to
* Jazzy's dictionary, then it is returned, else a set of possible
* corrections is returned. If no possible corrections were found,
* this method returns (no suggestions).
*/
@SuppressWarnings("unchecked")
private String suggest(SpellDictionary dictionary, String incorrect) {
if (dictionary.isCorrect(incorrect)) {
// return the entered word
return incorrect;
}
List<Word> words = dictionary.getSuggestions(
incorrect, SPELL_CHECK_THRESHOLD);
List<String> suggestions = new ArrayList<String>();
final Map<String,Integer> costs =
new HashMap<String,Integer>();
for (Word word : words) {
costs.put(word.getWord(), word.getCost());
suggestions.add(word.getWord());
}
if (suggestions.size() == 0) {
return "(no suggestions)";
}
Collections.sort(suggestions, new Comparator<String>() {
public int compare(String s1, String s2) {
Integer cost1 = costs.get(s1);
Integer cost2 = costs.get(s2);
return cost1.compareTo(cost2);
}
});
return StringUtils.join(suggestions.iterator(), ", ");
}
public static void main(String[] args) throws Exception {
if (args.length == 1) {
// word mode
new Shell(true);
} else {
new Shell();
}
}
}
Note that I still don't know whether this works well for a large set of misspelt phrases - I need to run it against a lot more real data before I can say that with any degree of certainty. It is also fairly slow in my development testing. I have a few ideas about how that can be improved, but I will attempt them after I have some real data to play with. As always, any suggestions/corrections are much appreciated.
Update 2009-04-26: In recent posts I have been building on code written and described in previous posts, so there have (rightly) been quite a few requests for the code. I've created a project on Sourceforge to host it; you will find the complete source code built so far in the project's SVN repository.
7 comments (moderated to prevent spam):
Excellent article. Very helpful.
Thanks Ram, glad you found it useful.
Sujit,
This article is of great help for me.
Thank you.
Thanks for your article. I feel it has a close relation to "decoding" in machine translation. Here is a presentation about this topic.
You are welcome Mahdi, and thanks for the link. I think you are right, there does seem to be a similarity, especially around picking the most probable word at each position. BTW, the link didn't work (it looks like Blogger is putting a prefix on it for some reason), so I am putting the corrected link here for posterity.
Hello Sir,
I am referring to your article for my study of genetic algorithms and found it very helpful. I have downloaded the whole source code, but can you please provide information on how to set it up and run it in Eclipse? I can find no main method to run the code. Please help.
Hi, glad that you found it helpful. To make an Eclipse project you can run mvn eclipse:eclipse; this will download all (public) JAR files to your local repository and create your .classpath and .project files. You can then create a new project in Eclipse with the project root the same as where the pom.xml is. To run the code, I use JUnit tests - they are in a parallel package structure under src/test/java and are named like MainClassNameTest.java. You can run all tests with "mvn test", or an individual test with "mvn test -Dtest=unqualified_test_classname". Also, I have deleted your other comment, which asked the same thing but also supplied an email address, on the assumption that you don't want your email address to become publicly available.