Saturday, May 17, 2008

Tokenizing Text : Token Recognition

Last week I wrote about using ICU4j's RuleBasedBreakIterator to tokenize a body of text, adding custom rules to extend the default BreakIterator behavior. Unlike the BreakIterators available in java.text, ICU4j's RuleBasedBreakIterator exposes a handy getRuleStatus() method, which provides a good first cut at recognizing what kind of token is being returned. In this post, I refine the tokenization slightly by passing the tokens through a chain of token recognizers. My latest rule customizations are shown here, minus the generated built-in word rules (for the full file, see my previous post).

!!chain;
...
# Custom : Modified to add an optional trailing % sign
$NumericEx = $Numeric ($Extend |  $Format)*(\%)*;
...
# ============= Custom Rules ================
# Abbreviation: Uppercase alpha chars separated by period and optionally followed by a period 
$Abbreviation = [A-Z](\.[A-Z0-9])+(\.)*;
# Hyphenated Word : sequence of letters or digits, (punctuated by one of [-+&_], with a following letter or digit sequence)+
$HyphenatedWord = [A-Za-z0-9]+([\-\+\&_][A-Za-z0-9]+)+;
# Email address: sequence of letters, digits and punctuation followed by @ and followed by another sequence
$EmailAddress = [A-Za-z0-9_\-\.]+\@[A-Za-z][A-Za-z0-9_]+\.[a-z]+;
# Internet Addresses: http://www.foo.com(/bar)
$InternetAddress = [a-z]+\:\/\/[a-z0-9]+(\.[a-z0-9]+)+(\/[a-z0-9][a-z0-9\.]+)*;
# XML markup: A run begins with < and ends with the first matching >
$XmlMarkup = \<[^\>]+\>; 
# Emoticon: A run that starts with :;B8{[ and contains only one or more of the following -=/{})(
$Emoticon = [B8\:\;\{\[][-=\/\{\}\)\(]+; 
# Internet IP Address : four runs of digits separated by periods (the rule language cannot cap each run at 3 digits; see below)
$InternetIpAddress = [0-9]+\.[0-9]+\.[0-9]+\.[0-9]+;
# Internet Site Address - such as www.ibm.com
$InternetSiteAddress = [a-z][a-z0-9]*(\.[a-z0-9]+)+;

!!forward;
...
# =========== Custom Forwards ====================
$Abbreviation {500};
$HyphenatedWord {501};
$EmailAddress {502};
$InternetAddress {503};
$XmlMarkup {504};
$Emoticon {505};
$InternetIpAddress {506};
$InternetSiteAddress {507};

...
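
As a quick recap of how these rules get used (covered in detail in the previous post): tokens are pulled off the RuleBasedBreakIterator one boundary at a time, and getRuleStatus() reports which rule produced the current boundary. Here is a minimal sketch, assuming the Token and TokenType classes used throughout this post, of how the custom statuses above can be mapped to token types; the real WordTokenizer lives in the previous post.

package com.mycompany.myapp.tokenizers;

import java.util.LinkedList;
import java.util.List;

import com.ibm.icu.text.BreakIterator;
import com.ibm.icu.text.RuleBasedBreakIterator;

// Sketch only: maps the custom rule statuses declared above to the token
// types used in this post. The rules String is assumed to hold the full
// rule file shown (in part) above.
public class RuleStatusMapper {

  public List<Token> tokenize(String rules, String sentence) {
    RuleBasedBreakIterator iterator = new RuleBasedBreakIterator(rules);
    iterator.setText(sentence);
    List<Token> tokens = new LinkedList<Token>();
    int start = iterator.first();
    for (int end = iterator.next(); end != BreakIterator.DONE;
        start = end, end = iterator.next()) {
      String value = sentence.substring(start, end);
      switch (iterator.getRuleStatus()) {
        case 500: tokens.add(new Token(value, TokenType.ABBREVIATION)); break;
        case 504: tokens.add(new Token(value, TokenType.MARKUP)); break;
        case 505: tokens.add(new Token(value, TokenType.EMOTICON)); break;
        case 502: case 503: case 506: case 507:
          tokens.add(new Token(value, TokenType.INTERNET)); break;
        // ... mappings for words, numbers, hyphenated words etc elided ...
        default: tokens.add(new Token(value, TokenType.UNKNOWN)); break;
      }
    }
    return tokens;
  }
}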

Why would I want to do that? Well, some tokens can be split into multiple tokens by doing some basic pattern matching and database lookups - consider the token 120k (WORD), which is split into 120 (NUMBER) and k (ABBREVIATION). Multiple tokens can be combined into a single phrase, again using a database lookup - an example is pony (WORD) and up (WORD), which are combined here into 'pony up' (PHRASE). Yet other tokens can simply be reclassified - examples here are spaces and punctuation, which RuleBasedBreakIterator returns with rule status 0, but which this stage classifies as WHITESPACE or PUNCTUATION.

Another reason (which, to be fair, wasn't an issue in my case) is that the pattern matching in the RuleBasedBreakIterator is rather anemic. For example, to match an internet IP address, I am limited to using "\d+\.\d+\.\d+\.\d+" rather than the slightly more robust "\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}". I wish Sun would hurry up and make this class available as part of the JDK - it currently exists in java.text but is package protected, so it is not usable in application code. Given their history, it is very likely that once it is made public, it will have standard Java regular expression support.
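
Until then, a stricter check is easy to bolt on after tokenization. Here is a minimal sketch along those lines - the class and method names are mine, not part of the tokenizer code - that re-validates a token matched by the coarse $InternetIpAddress rule above:

package com.mycompany.myapp.recognizers;

import java.util.regex.Pattern;

// Sketch only: a stricter post-tokenization check for tokens matched by
// the coarse $InternetIpAddress rule.
public class IpAddressValidator {

  // limits each octet to 1-3 digits, which the rule language above cannot
  private static final Pattern STRICT_IP = Pattern.compile(
    "\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}");

  public static boolean isValidIpAddress(String token) {
    if (!STRICT_IP.matcher(token).matches()) {
      return false;
    }
    // additionally enforce the 0-255 range on each octet
    for (String octet : token.split("\\.")) {
      if (Integer.parseInt(octet) > 255) {
        return false;
      }
    }
    return true;
  }
}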

On the flip side, however, had I used java.text.BreakIterator, this layer would have been far beefier than it is now. The ICU4j RuleBasedBreakIterator does a lot of token recognition already, thanks to its getRuleStatus() method, so this reduces the code at this level.

So, starting again from our test input (slightly modified again, this time to show the phrase recognition in action):

Jaguar will sell its new XJ-6 model in the U.S. for a small fortune :-). 
Expect to pony up around USD 120ks. Custom options can set you back another 
few 10,000 dollars. For details, go to <a href="http://www.jaguar.com/sales" 
alt="Click here">Jaguar Sales</a> or contact xj-6@jaguar.com.

Currently I have only three recognizers: a boundary token recognizer (which re-classifies boundary tokens into whitespace and punctuation tokens), an abbreviation recognizer and a phrase recognizer. I have used data from the TextMine project to power some of them. The TextMine project also provides (Perl-based) recognizers for places, organizations, etc., but I don't need those at the moment, so I didn't build them.

We also have a RecognizerChain that implements the Chain of Responsibility pattern, taking a List of IRecognizer objects as input and calling recognize() on each of them in turn. Here is my test case:

package com.mycompany.myapp.recognizers;

import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

import javax.sql.DataSource;

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.junit.BeforeClass;
import org.junit.Test;
import org.springframework.jdbc.datasource.DriverManagerDataSource;

import com.mycompany.myapp.tokenizers.SentenceTokenizer;
import com.mycompany.myapp.tokenizers.Token;
import com.mycompany.myapp.tokenizers.WordTokenizer;

public class RecognizerChainTest {

  private final Log log = LogFactory.getLog(getClass());
  
  private static RecognizerChain chain;
  
  @BeforeClass
  public static void setupBeforeClass() throws Exception {
    DataSource dataSource = new DriverManagerDataSource(
      "com.mysql.jdbc.Driver", "jdbc:mysql://localhost:3306/tmdb", "me", "xxx");
    chain = new RecognizerChain(Arrays.asList(new IRecognizer[] {
      new BoundaryRecognizer(),
      new AbbreviationRecognizer(dataSource),
      new PhraseRecognizer(dataSource)
    }));
    chain.init();
  }
  
  @Test
  public void testRecognizeAbbreviations() throws Exception {
    // the test input shown above
    String paragraph = "Jaguar will sell its new XJ-6 model in the U.S. " +
      "for a small fortune :-). Expect to pony up around USD 120ks. " +
      "Custom options can set you back another few 10,000 dollars. " +
      "For details, go to <a href=\"http://www.jaguar.com/sales\" " +
      "alt=\"Click here\">Jaguar Sales</a> or contact xj-6@jaguar.com.";
    SentenceTokenizer sentenceTokenizer = new SentenceTokenizer();
    sentenceTokenizer.setText(paragraph);
    WordTokenizer wordTokenizer = new WordTokenizer();
    List<Token> tokens = new LinkedList<Token>();
    String sentence = null;
    while ((sentence = sentenceTokenizer.nextSentence()) != null) {
      wordTokenizer.setText(sentence);
      Token token = null;
      while ((token = wordTokenizer.nextToken()) != null) {
        tokens.add(token);
      }
      List<Token> recognizedTokens = chain.recognize(tokens);
      for (Token recognizedToken: recognizedTokens) {
        log.debug("token=" + recognizedToken.getValue() + 
          " [" + recognizedToken.getType() + "]");
      }
      tokens.clear();
    }
  }
}

As you can see, we construct a RecognizerChain out of the three IRecognizer implementations and initialize it during setup; we then run the Tokens generated from each sentence through the chain, which results in a different set of Tokens. All recognizers (including the RecognizerChain) implement IRecognizer, which exposes a recognize() and an init() method.

package com.mycompany.myapp.recognizers;

import java.util.List;

import com.mycompany.myapp.tokenizers.Token;

public interface IRecognizer {

  /**
   * Rule initialization code goes here.
   * @throws Exception
   */
  public void init() throws Exception;
  
  /**
   * Runs through the list of input tokens, classifying as many tokens as
   * it can into this particular entity.
   * @param tokens the List of Tokens.
   * @return the output List of Tokens. The size of the input and output
   * may not match, since some tokens may be coalesced into a single one
   * or a token may be broken up into multiple tokens.
   */
  public List<Token> recognize(List<Token> tokens);
}

BoundaryRecognizer

The BoundaryRecognizer classifies the tokens returned with rule status 0 (UNKNOWN) from RuleBasedBreakIterator into either WHITESPACE or PUNCTUATION. Not much to say here; it just matches each token against a pattern of either whitespace or punctuation. Here is the code:

package com.mycompany.myapp.recognizers;

import java.util.LinkedList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.mycompany.myapp.tokenizers.Token;
import com.mycompany.myapp.tokenizers.TokenType;

/**
 * Identifies Punctuations in the returned text. A BreakIterator will treat
 * both punctuation and whitespace as word boundaries. This is usually the
 * first extractor in a chain, so it reduces the number of unknowns.
 */
public class BoundaryRecognizer implements IRecognizer {

  private Pattern whitespacePattern;
  private Pattern punctuationPattern;
  
  public void init() throws Exception {
    this.whitespacePattern = Pattern.compile("\\s+");
    // use + so that runs of punctuation (e.g. "...") also match
    this.punctuationPattern = Pattern.compile("\\p{Punct}+");
  }

  public List<Token> recognize(List<Token> tokens) {
    List<Token> extractedTokens = new LinkedList<Token>();
    for (Token token : tokens) {
      String value = token.getValue();
      TokenType type = token.getType();
      if (type != TokenType.UNKNOWN) {
        // we already know what this is, continue
        extractedTokens.add(token);
        continue;
      }
      // whitespace
      Matcher whitespaceMatcher = whitespacePattern.matcher(value);
      if (whitespaceMatcher.matches()) {
        extractedTokens.add(new Token(value, TokenType.WHITESPACE));
        continue;
      }
      // punctuation
      Matcher punctuationMatcher = punctuationPattern.matcher(value);
      if (punctuationMatcher.matches()) {
        extractedTokens.add(new Token(value, TokenType.PUNCTUATION));
        continue;
      }
      // we came this far, then its still UNKNOWN
      extractedTokens.add(token);
    }
    return extractedTokens;
  }
}

AbbreviationRecognizer

The RuleBasedBreakIterator already has rules to identify abbreviations. However, it is likely to miss some special cases which are covered here. Read the class Javadocs in the code below for details.

package com.mycompany.myapp.recognizers;

import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import javax.sql.DataSource;

import org.apache.commons.lang.StringUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.springframework.jdbc.core.JdbcTemplate;

import com.mycompany.myapp.tokenizers.Token;
import com.mycompany.myapp.tokenizers.TokenType;

/**
 * Rule to identify Abbreviation entities. This is a multi-pass rule that does
 * the following checks:
 * (1) Check for abbreviations embedded with numbers, eg 3.0pct. This will 
 *     result in two tokens 3.0 (NUMBER) and pct (ABBREVIATION). The first
 *     token will have ___ appended to it to signify that it should be joined
 *     to the token following it.
 * (2) Check for abbreviations with periods, eg. U.S.
 * (3) Check for abbreviations without periods, eg US or MD.
 * In all cases, the abbreviation is looked up against the co_abbrev 
 * (enc_type=a) table to ensure a correct match.
 */
public class AbbreviationRecognizer implements IRecognizer {

  private final Log log = LogFactory.getLog(getClass());
  
  private JdbcTemplate jdbcTemplate;
  private Pattern abbrevEmbeddedInWordPattern;
  private Pattern abbrevWithPeriodsPattern;
  private Pattern abbrevWithAllCapsPattern;
  
  public AbbreviationRecognizer(DataSource dataSource) {
    super();
    setDataSource(dataSource);
  }

  public void setDataSource(DataSource dataSource) {
    this.jdbcTemplate = new JdbcTemplate(dataSource);
  }

  public void init() throws Exception {
    abbrevEmbeddedInWordPattern = Pattern.compile("\\d+(\\.\\d+)*(\\w+)"); // eg 3.3pct
    abbrevWithPeriodsPattern = Pattern.compile("\\w(\\.\\w)+"); // U.S
    abbrevWithAllCapsPattern = Pattern.compile("[A-Z]+"); // MD, USA
  }

  public List<Token> recognize(List<Token> tokens) {
    List<Token> recognizedTokens = new LinkedList<Token>();
    for (Token token : tokens) {
      TokenType type = token.getType();
      if (type != TokenType.WORD) {
        // we only apply abbreviation recognition rules to WORD tokens, so
        // if this is not a word, its a pass-through for this rule set.
        recognizedTokens.add(token);
        continue;
      }
      String word = token.getValue();
      // match abbreviations embedded in numbers
      Matcher abbrevEmbeddedInWordMatcher = abbrevEmbeddedInWordPattern.matcher(word);
      if (abbrevEmbeddedInWordMatcher.matches()) {
        String abbrevPart = abbrevEmbeddedInWordMatcher.group(2);
        String numberPart = word.substring(0, word.indexOf(abbrevPart));
        if (isAbbreviation(abbrevPart)) {
          recognizedTokens.add(new Token(numberPart + "___", TokenType.NUMBER));
          recognizedTokens.add(new Token(abbrevPart, TokenType.ABBREVIATION));
          continue;
        }
      }
      // match if word contains embedded periods
      Matcher abbrevWithPeriodsMatcher = abbrevWithPeriodsPattern.matcher(word);
      if (abbrevWithPeriodsMatcher.matches()) {
        if (isAbbreviation(word)) {
          token.setType(TokenType.ABBREVIATION);
          recognizedTokens.add(token);
          continue;
        }
      }
      // match if word is all uppercase
      Matcher abbrevWithAllCapsMatcher = abbrevWithAllCapsPattern.matcher(word);
      if (abbrevWithAllCapsMatcher.matches()) {
        // embed periods in the potential abbreviation, and check for both 
        // the original and the period embedded word against our database list
        char[] wordchars = word.toCharArray();
        List<Character> wordChars = new ArrayList<Character>();
        for (int i = 0; i < wordchars.length; i++) {
          wordChars.add(wordchars[i]);
        }
        String periodEmbeddedWord = StringUtils.join(wordChars.iterator(), ".");
        if (isAbbreviation(word) || isAbbreviation(periodEmbeddedWord)) {
          token.setType(TokenType.ABBREVIATION);
          recognizedTokens.add(token);
          continue;
        }
      }
      // if we came this far, none of our tests matched, so we cannot mark
      // this token as an abbreviation...pass through
      recognizedTokens.add(token);
    }
    return recognizedTokens;
  }

  @SuppressWarnings("unchecked")
  private boolean isAbbreviation(String abbrevPart) {
    List<Map<String,String>> rows = jdbcTemplate.queryForList(
      "select enc_name from co_abbrev where enc_type = ? and enc_name = ?", 
      new String[] {"a", StringUtils.lowerCase(abbrevPart)});
    // the word is an abbreviation if the lookup returned any rows
    return !rows.isEmpty();
  }
}

The isAbbreviation() method looks the word up in the co_abbrev table, which comes from the TextMine project. Here is the structure of the table.

+-----------------+-----------+------+-----+---------+-------+
| Field           | Type      | Null | Key | Default | Extra |
+-----------------+-----------+------+-----+---------+-------+
| enc_name        | char(100) | NO   | MUL | NULL    |       | 
| enc_type        | char(50)  | YES  |     | NULL    |       | 
| enc_description | char(50)  | YES  |     | NULL    |       | 
+-----------------+-----------+------+-----+---------+-------+
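
Since isAbbreviation() fires a database query for every candidate token, one obvious refinement - a sketch of an alternative, not something the code above does - would be to pre-load the enc_type 'a' rows into an in-memory Set during init() and test membership locally:

package com.mycompany.myapp.recognizers;

import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

import javax.sql.DataSource;

import org.apache.commons.lang.StringUtils;
import org.springframework.jdbc.core.JdbcTemplate;

// Sketch only: caches the co_abbrev contents in memory so isAbbreviation()
// becomes a Set lookup instead of a query per token. The class name is
// hypothetical.
public class AbbreviationCache {

  private JdbcTemplate jdbcTemplate;
  private Set<String> abbreviations = new HashSet<String>();

  public AbbreviationCache(DataSource dataSource) {
    this.jdbcTemplate = new JdbcTemplate(dataSource);
  }

  @SuppressWarnings("unchecked")
  public void init() {
    List<Map<String,Object>> rows = jdbcTemplate.queryForList(
      "select enc_name from co_abbrev where enc_type = 'a'");
    for (Map<String,Object> row : rows) {
      // char(100) columns may come back space padded, hence the trim
      abbreviations.add(StringUtils.trim((String) row.get("enc_name")));
    }
  }

  public boolean isAbbreviation(String abbrevPart) {
    return abbreviations.contains(StringUtils.lowerCase(abbrevPart));
  }
}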

PhraseRecognizer

The PhraseRecognizer uses a predefined table of phrases. Each phrase is modeled as a triplet consisting of the lead word, the number of tokens, and the actual phrase. For example, the phrase "pony up" is modeled as (pony, 3, pony up) - the count is 3 rather than 2 because the whitespace between the two words is itself a token. So whenever we see a word token that is a lead word, we look ahead that number of tokens and check whether we have a match. Obviously, the success of this strategy depends on having a comprehensive collection of phrases.
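
To make the token arithmetic concrete, here is a quick sketch (the class name is hypothetical) of how the token count stored with each phrase can be derived from the phrase text, an n-word phrase spanning 2n - 1 tokens:

package com.mycompany.myapp.recognizers;

// Sketch only: derives the token count stored with each phrase. The
// whitespace between consecutive words is emitted as a token in its own
// right, so an n-word phrase spans 2n - 1 tokens.
public class PhraseTokenCounter {

  public static int countTokens(String phrase) {
    int numWords = phrase.split("\\s+").length;
    return 2 * numWords - 1;
  }

  public static void main(String[] args) {
    System.out.println(countTokens("pony up"));       // prints 3
    System.out.println(countTokens("small fortune")); // prints 3
  }
}

The code for the PhraseRecognizer is shown below: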

package com.mycompany.myapp.recognizers;

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import javax.sql.DataSource;

import org.apache.commons.lang.StringUtils;
import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;
import org.springframework.jdbc.core.JdbcTemplate;

import com.mycompany.myapp.tokenizers.Token;
import com.mycompany.myapp.tokenizers.TokenType;

/**
 * Combines co-located words into phrases by looking up a table of
 * common phrases.
 */
public class PhraseRecognizer implements IRecognizer {

  private final Log log = LogFactory.getLog(getClass());
  
  private JdbcTemplate jdbcTemplate;
  
  public PhraseRecognizer(DataSource dataSource) {
    super();
    setDataSource(dataSource);
  }
  
  public void setDataSource(DataSource dataSource) {
    this.jdbcTemplate = new JdbcTemplate(dataSource);
  }
  
  public void init() throws Exception { /* :NOOP: */ }

  @SuppressWarnings("unchecked")
  public List<Token> recognize(List<Token> tokens) {
    List<Token> extractedTokens = new ArrayList<Token>();
    int numTokens = tokens.size();
    TOKEN_LOOP:
    for (int i = 0; i < numTokens; i++) {
      Token token = tokens.get(i);
      TokenType type = token.getType();
      if (type != TokenType.WORD) {
        // we don't care about phrases that begin with types other than word
        extractedTokens.add(token);
        continue;
      }
      String word = token.getValue();
      List<Map<String,Object>> rows = jdbcTemplate.queryForList(
        "select coc_phrase, coc_num_words from my_colloc where coc_lead_word = ?", 
        new String[] {StringUtils.lowerCase(word)});
      for (Map<String,Object> row : rows) {
        String phrase = (String) row.get("COC_PHRASE");
        int numWords = (Integer) row.get("COC_NUM_WORDS");
        if (numTokens >= i + numWords) {
          // we don't want to look beyond the actual size of the sentence
          String inputPhrase = getInputPhrase(tokens, i, numWords);
          if (phrase.equals(inputPhrase)) {
            extractedTokens.add(new Token(phrase + "|||" + numWords, TokenType.PHRASE));
            // move the pointer forward (numWords - 1)
            i += (numWords - 1);
            continue TOKEN_LOOP;
          }
        }
      }
      // if we came this far, then there is no phrase starting at
      // this position...pass through
      extractedTokens.add(token);
    }
    return extractedTokens;
  }

  private String getInputPhrase(List<Token> tokens, int start, int length) {
    List<Token> tokenSublist = tokens.subList(start, start + length);
    StringBuilder phraseBuilder = new StringBuilder();
    for (Token token : tokenSublist) {
      phraseBuilder.append(token.getValue());
    }
    return phraseBuilder.toString();
  }
}

The PhraseRecognizer uses the my_colloc table, adapted from the TextMine project's co_colloc table, with a simpler structure that is more in line with what we want to do with this data. Here is the structure of the my_colloc table.

+---------------+------------+------+-----+---------+-------+
| Field         | Type       | Null | Key | Default | Extra |
+---------------+------------+------+-----+---------+-------+
| coc_lead_word | char(30)   | NO   | MUL | NULL    |       | 
| coc_phrase    | char(255)  | NO   |     | NULL    |       | 
| coc_num_words | int(11)    | NO   |     | NULL    |       | 
+---------------+------------+------+-----+---------+-------+

RecognizerChain

Finally, the RecognizerChain just puts these things all together:

package com.mycompany.myapp.recognizers;

import java.util.LinkedList;
import java.util.List;

import com.mycompany.myapp.tokenizers.Token;

public class RecognizerChain implements IRecognizer {

  private List<IRecognizer> recognizers;
  
  public RecognizerChain(List<IRecognizer> recognizers) {
    super();
    setRecognizers(recognizers);
  }
  
  public void setRecognizers(List<IRecognizer> recognizers) {
    this.recognizers = recognizers;
  }

  public void init() throws Exception {
    for (IRecognizer recognizer : recognizers) {
      recognizer.init();
    }
  }
  
  /**
   * Applies the chain of IRecognizer implementations to the input Token
   * List and transforms it into another Token List.
   * @param tokens the input List of Tokens.
   * @return another List of Tokens.
   */
  public List<Token> recognize(final List<Token> tokens) {
    List<Token> recognizedTokens = new LinkedList<Token>();
    recognizedTokens.addAll(tokens);
    for (IRecognizer recognizer : recognizers) {
      recognizedTokens = recognizer.recognize(recognizedTokens);
    }
    return recognizedTokens;
  }
}

The results of running the unit test are shown below. As you can see, we have successfully determined that 120k is not a word, but really a number followed by an abbreviation. We have also successfully identified that USD as well as U.S. are abbreviations. In addition, we have identified three phrases in this paragraph: "pony up", "small fortune" and "go to".

Sentence=Jaguar will sell its new XJ-6 model in the U.S. for a small fortune :-). 
Token=Jaguar [WORD]
Token=  [WHITESPACE]
Token=will [WORD]
Token=  [WHITESPACE]
Token=sell [WORD]
Token=  [WHITESPACE]
Token=its [WORD]
Token=  [WHITESPACE]
Token=new [WORD]
Token=  [WHITESPACE]
Token=XJ-6 [WORD]
Token=  [WHITESPACE]
Token=model [WORD]
Token=  [WHITESPACE]
Token=in [WORD]
Token=  [WHITESPACE]
Token=the [WORD]
Token=  [WHITESPACE]
Token=U.S. [ABBREVIATION]
Token=  [WHITESPACE]
Token=for [WORD]
Token=  [WHITESPACE]
Token=a [WORD]
Token=  [WHITESPACE]
Token=small fortune|||3 [PHRASE]
Token=  [WHITESPACE]
Token=:-) [EMOTICON]
Token=. [PUNCTUATION]
Token=  [WHITESPACE]
Sentence=Expect to pony up around USD 120ks. 
Token=Expect [WORD]
Token=  [WHITESPACE]
Token=to [WORD]
Token=  [WHITESPACE]
Token=pony up|||3 [PHRASE]
Token=  [WHITESPACE]
Token=around [WORD]
Token=  [WHITESPACE]
Token=USD [ABBREVIATION]
Token=  [WHITESPACE]
Token=120___ [NUMBER]
Token=ks [ABBREVIATION]
Token=. [PUNCTUATION]
Token=  [WHITESPACE]
Sentence=Custom options can set you back another few 10,000 dollars. 
Token=Custom [WORD]
Token=  [WHITESPACE]
Token=options [WORD]
Token=  [WHITESPACE]
Token=can [WORD]
Token=  [WHITESPACE]
Token=set [WORD]
Token=  [WHITESPACE]
Token=you [WORD]
Token=  [WHITESPACE]
Token=back [WORD]
Token=  [WHITESPACE]
Token=another [WORD]
Token=  [WHITESPACE]
Token=few [WORD]
Token=  [WHITESPACE]
Token=10,000 [NUMBER]
Token=  [WHITESPACE]
Token=dollars [WORD]
Token=. [PUNCTUATION]
Token=  [WHITESPACE]
Sentence=For details, go to <a href="http://www.jaguar.com/sales" alt="Click here">Jaguar Sales</a> or contact xj-6@jaguar.com.
Token=For [WORD]
Token=  [WHITESPACE]
Token=details [WORD]
Token=, [PUNCTUATION]
Token=  [WHITESPACE]
Token=go to|||3 [PHRASE]
Token=  [WHITESPACE]
Token=<a href="http://www.jaguar.com/sales" alt="Click here"> [MARKUP]
Token=Jaguar [WORD]
Token=  [WHITESPACE]
Token=Sales [WORD]
Token=</a> [MARKUP]
Token=  [WHITESPACE]
Token=or [WORD]
Token=  [WHITESPACE]
Token=contact [WORD]
Token=  [WHITESPACE]
Token=xj-6@jaguar.com [INTERNET]
Token=. [PUNCTUATION]

Where to go from here? Well, I have a few ideas, but to properly implement these ideas, I have to pick up a few other technologies, which will take me a while, so I will probably not be talking about this much over the next few weeks. But once I have these technologies under my belt, I will take up again where I left off here. Till then, I hope this has been useful - I know I learnt quite a bit over the last three weeks or so.

Update 2009-04-26: In recent posts, I have been building on code written and described in previous posts, so there were (and rightly so) quite a few requests for the code. So I've created a project on Sourceforge to host the code. You will find the complete source code built so far in the project's SVN repository.

15 comments (moderated to prevent spam):

Abhilasha said...

Respected Sir
I was just going through your programs on token recognizers. Since I am new to programming Java, I was just trying to make your code understandable a bit to me. I got stuck in the methods used by you in com.mycompany.myapp.tokenizers.Token
as setValue(value), setType(type). Similiar is the case with some other classes too. Although I got the idea a bit as to what, probably, these functions would be doing, but being a newbie, I couldn't understand it totally.
I mean could you just get me as to how these methods will be used. Though I am very confusing in my words, I just hope that you got what I am asking.
Thanks
Abhilasha

Sujit Pal said...

Hi Abhilasha, setXXX() methods are used to set variables in an object from external code. I suggest you read up a bit on Java, this is a fairly common idiom.

Unknown said...

UR A GREAT SOUL...RODGL

Sujit Pal said...

Thanks Deepika, I am guessing from your comment that the post was useful to you. What is RODGL, btw? - I've heard of ROGL, not sure about this one.

Unknown said...

respect on diligent great learners Sir:)

Srijith said...

Hello sujit...Great piece of code.In the Phrase Recognizer program u r refered predefined set of phrases in the phrase table.How can i get that phrase table resource..

Srijith said...

Hello sujit..Great piece of code..U r using a predefined Phrase Recognizer table details.How can i get that resource,so i can able to use it...
Thank u...

Srijith said...

Hi...sujit..abbrevations are not used in the creation of words index.rght?then y it called in the vectorGenerator program...?

Sujit Pal said...

Thanks Srijith. To answer your question, I adapted the data for phrases from Dr Konchady's TextMine project. You can find the schema (tmdb_my_colloc.sql) and load script (tmdb_load.sql) in the src/main/sql directory of my JTMT project.

To answer your question about why abbreviations are called in the vector generator program, I don't remember, sorry :-). Its been a while...but if you point me to where you saw it used, perhaps I can try to figure out why.

Srijith said...

Hi..Sujit..Thnk u...
getWordFrequencies function in the VectorGenerator.java, u called AbbreviationRecognizer.But the LSI matrix contains content words only..
Since u r not including abbreaviations in LSI..Is it necessary??

Sujit Pal said...

If we can guarantee that the LSI matrix will only contain content words, and its very likely, since abbreviations tend to be used less often in a text than regular content words, then yes, you are right. I would be hesitant to take it out, though, because some documents may have many occurrences of this and they would show up in the LSI matrix - you know, medical/legal/engineering type docs, where they define the abbreviation once and then keep using it throughout the document.

Unknown said...

can you help me?
show me source code suffix tree clustering..
this algoritma, so cool, but i'm not sure because. I'm newbie for text mining...

Sujit Pal said...

Hi Muhammad, it is indeed a very interesting algorithm, thanks for pointing it out. I found a Java implementation in Carrot that you may perhaps use for reference. I may build one myself if I need it, but its unlikely, the only time I foresee needing it is if I want to cluster my search results, and Solr already provides the Carrot plugin (it would be mainly XML configuration to enable).

Unknown said...

hi sir,

I am going thourgh ur code for tokenizers but I am unable to trace out the package named com.mycompany.myap.tokenizers.tokne can you please help

Sujit Pal said...

I have the source code for this project on jtmt.sf.net, but I would suggest not using this approach. I found ICU4j to be very hard to understand and configure and nowadays I use OpenNLP's tokenizers instead.