Thursday, February 06, 2014

Substring matching using Aho-Corasick and Redis


The Aho-Corasick algorithm is a fast string matching algorithm that can match a large set of keywords simultaneously against incoming text. It does this by building a trie-like data structure from the keywords, then attempting to navigate the trie along nodes corresponding to the characters it sees in the input.

The first phase builds up the trie by reading through each keyword in the dictionary and creating nodes for each character in each keyword if they don't already exist. To each node, it attaches a set of "transition" nodes, i.e., characters it can traverse to from the current character given the set of keywords. Additionally, the keyword itself is associated with the character that ends it. The trie building complexity is linear, O(m), where m is the total number of characters across all keywords.
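
To make the layout concrete, here is a hand-worked sketch (my own illustration, not from the original post) of the sets that the implementation shown later would write to Redis for the two keywords "cat" and "car". Note that sets are keyed by a single character rather than by the full path from the root, which keeps the Redis mapping flat and simple:

// Sketch of the flat, character-keyed trie for keywords "cat" and "car",
// where '\0' is the start-of-keyword sentinel:
//
//   trn:\0 -> { c }        characters that can start a keyword
//   trn:c  -> { a }        characters that can follow 'c'
//   trn:a  -> { t, r }     characters that can follow 'a'
//   res:t  -> { "cat" }    keywords ending at 't'
//   res:r  -> { "car" }    keywords ending at 'r'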

Once the keyword trie is built, searching it is accomplished in a single pass through the text to be matched. The search process recovers the dictionary keywords that occur as substrings of the text. The complexity of search is also linear, O(n), where n is the length of the input text, since all keywords are matched simultaneously.
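
As an illustration (again mine, matching the implementation below), searching the phrase "a cat" against the structure sketched above visits one Redis set per input character:

// Trace of search("a cat") against the "cat"/"car" structure sketched above:
//
//   'a' : not in trn:\0  -> reset to sentinel, no results
//   ' ' : not in trn:\0  -> stay at sentinel
//   'c' : in trn:\0      -> current node is 'c'
//   'a' : in trn:c       -> current node is 'a'
//   't' : in trn:a       -> current node is 't'; res:t yields "cat"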

Even though the build time is linear, it can become significant for large dictionaries. If the dictionary is relatively static, the trie-building step can be done once and the trie stored in a persistent key-value store such as Redis. Since Redis serves data from memory, you get the best of both worlds: no startup penalty, and search speeds almost as good as with in-memory data structures.
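
For example, once a loader process has called prepare(), any other process can start matching immediately against the same Redis instance. This usage sketch (mine, using the class defined below) assumes Redis is running on localhost with the trie already loaded:

// Hypothetical second process: no startup trie build, since the trn:* and
// res:* sets already live in Redis from an earlier prepare() run.
val aho = new AhoCorasickRedis()
val hits = aho.search("The Seahawks defeated the Broncos")
// hits contains "Seahawks" and "Broncos" (order not guaranteed)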

In this post, I describe a small (and incomplete, but sufficient for my purposes) implementation of the Aho-Corasick algorithm in Scala that relies on Redis to store the keyword trie. It is similar to the (also slightly modified) Python implementation described in this Sidelines blog post.

Here is the code for the algorithm. The prepare() method takes a list of keywords and builds a Redis-backed data structure consisting of two Maps of Sets keyed by the current character, the first one representing the transitions and the second the results (keywords). The search() method takes a phrase to be searched and returns the List of substrings that match keywords in the keyword list.

// Source: src/main/scala/com/mycompany/solr4extras/dictann/AhoCorasickRedis.scala
package com.mycompany.solr4extras.dictann

import com.redis.RedisClient
import scala.collection.mutable.ArrayBuffer

class AhoCorasickRedis {

  val redis = new RedisClient("localhost", 6379)

  // Builds the flat trie in Redis. For each keyword, each character is
  // added to the transition set of the character before it ('\0' is the
  // start-of-keyword sentinel), and the keyword itself is added to the
  // result set of its last character.
  def prepare(keywords: List[String]): Unit = {
    keywords.foreach(keyword => {
      var prevCh = '\0'
      keyword.foreach(ch => {
        redis.sadd(tkey(prevCh), ch)
        prevCh = ch
      })
      redis.sadd(rkey(prevCh), keyword)
    })
  }

  // Scans the phrase one character at a time. If the current character is
  // a valid transition from the previous one, we advance; otherwise we
  // reset to the sentinel. Any keywords in the current character's result
  // set are collected as matches.
  def search(phrase: String): List[String] = {
    val matches = ArrayBuffer[String]()
    var prevCh = '\0'
    phrase.foreach(ch => {
      prevCh = if (redis.sismember(tkey(prevCh), ch)) ch 
               else '\0'
      val cmatches = redis.smembers(rkey(prevCh)) match {
        case Some(results) => results.flatten
        case None => List() 
      }
      matches ++= cmatches
    })
    matches.toSet.toList  // dedup; ordering is not preserved
  }
  
  // Redis key names for a character's transition and result sets.
  def tkey(ch: Char) = "trn:" + ch
  def rkey(ch: Char) = "res:" + ch
}
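
One operational caveat (my note, not from the original post): sadd() only ever adds members, so re-running prepare() with a changed dictionary leaves stale transitions and results behind. If the Redis database is dedicated to this trie, the simplest fix is to clear it before rebuilding:

// Assumes this Redis database holds nothing but the trie: flushdb wipes
// the entire current database, not just the trn:*/res:* keys.
val aho = new AhoCorasickRedis()
aho.redis.flushdb
aho.prepare(List("Seahawks", "Broncos", "Super Bowl"))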

To test this, we run the following simple JUnit test, which loads the trie with three strings and then searches a sentence containing all three. As you can see, the algorithm finds the three strings we expect it to find.

// Source: src/test/scala/com/mycompany/solr4extras/dictann/AhoCorasickRedisTest.scala
package com.mycompany.solr4extras.dictann

import org.junit.Test
import org.junit.Assert

class AhoCorasickRedisTest {

  @Test
  def testBuild(): Unit = {
    val aho = new AhoCorasickRedis()
    aho.prepare(List("Seahawks", "Broncos", "Super Bowl"))
    val matches = aho.search(
      "The Seahawks defeated the Broncos at the Super Bowl.")
    Console.println("matches=" + matches)
    Assert.assertEquals(3, matches.size)
    Assert.assertTrue(matches.contains("Seahawks"))
    Assert.assertTrue(matches.contains("Broncos"))
    Assert.assertTrue(matches.contains("Super Bowl"))
  }
}

For more information about this algorithm, Pekka Kilpelainen's lecture slides provide lots of detail, and Ivan Kuckir's blog post has a nice animation that illustrates how it works. Both links are also listed under the External Links section of the algorithm's Wikipedia page (referenced earlier).


4 comments:

  1. I'd like to acquaint you with OpenSextant's "Solr Text Tagger". It is used by OpenSextant, a place name text-tagger, and by Apache Stanbol, which does some entity tagging. I created this because the Aho-Corasick based tagger used in GATE was taking up a whopping 80GB of RAM for our ~10M place names -- compare that to 100-200MB or so for the Solr Text Tagger. This is in large part due to Lucene's amazing Finite State Transducers. And the tagger tags plenty fast -- measure for yourself, but we've stopped caring at this point how many milliseconds per document. If Solr adds more weight than you want, then you can just depend on Lucene, since most of the functionality doesn't depend on Solr. Similar amazing performance & size gains were seen for Apache Stanbol once it switched to the Solr Text Tagger.

    For further information, see my Lucene Revolution presentation Text Tagging with Finite State Transducers (slides & video). In the presentation, I identified various improvements which have largely been made by now in a combination of the master branch (v1.x) and the MemPF branch (v2.x).

  2. Thanks David, I am aware of your SolrTextTagger application - you pointed it out to me in this article, where I used LingPipe's ExactDictionaryChunker to do exact substring matching. I did look at SolrTextTagger at that time, but it seemed like a bit of overkill resource-wise; I will look at it again based on your recommendation.

  3. Hi, it seems like you'd be able to get keywords back that don't match. Say you added "sapped" and "mad"; then when you search for "sad", you would get "sapped" returned as a match. This is a really interesting idea, so hopefully I'm just missing something :)

  4. You are right, thanks for the counter-example. My implementation is incomplete; there is a final step where the matched string is checked against a traditional dictionary to see if it exists. I have since followed David's advice (above) and switched to using his SolrTextTagger, which is very robust and scalable - I have used it successfully for dictionaries with 14M+ entries.

    ReplyDelete
