Saturday, April 21, 2007

Partitioning Spring configuration with JavaConfig

There has been a lot of buzz about annotation-based dependency injection lately. Spring offers JavaConfig (still in its 1.0 milestone release), there is Guice from Google, and PicoContainer is adding annotation-based configuration support as well. In this post, I describe my experience converting the XML-based ApplicationContext of a small web application to one that is a hybrid of XML and JavaConfig annotations.

If you have worked on moderate to large Spring-enabled web applications, you will be familiar with the XML bloat in the Spring configuration files. Often, the bloat can be addressed by partitioning the context into a two-level hierarchy of XML files. The top level contains a single file with global configuration, and the next level contains module configurations, one file per module. Something like this:

<!-- Top level configuration -->
<beans>
  <!-- common bean definitions -->
  <bean id="common1" class="com.mycompany.common.Bean1"/>
  ...
  <!-- module definitions -->
  <import resource="classpath:applicationContext-foo.xml"/>
  <import resource="classpath:applicationContext-bar.xml"/>
  ...
</beans>

Most of the applicationContext-*.xml files consist of beans being injected into other beans, and these definitions can now live in the Java layer with Spring JavaConfig annotations. This is probably nicer, since Java IDEs typically have better tool support for Java code than for XML files. For example, in an IDE such as Eclipse, it is easier to find a bean definition method in a @Configuration class than in XML. Also, you get immediate feedback when a bean is built incorrectly in Java, rather than waiting for the application to complain during startup, as is the case with XML.
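
To make the point concrete, here is roughly what a single bean definition looks like as a JavaConfig method (the setter call is hypothetical and only there for illustration; the real configuration class for my application is shown later in this post):

@Configuration
public class CommonConfiguration {

  @Bean(scope = Scope.SINGLETON)
  public Bean1 common1() {
    Bean1 common1 = new Bean1();
    // a typo in the setter name is now a compile error, not a startup failure
    common1.setSomeProperty("someValue"); // hypothetical property, for illustration
    return common1;
  }
}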

Here is the applicationContext.xml file I started with. As you can see, nothing fancy. There is a database DataSource configuration, a reference to a Lucene index directory, and a bunch of bean definitions that use dependency injection to build more complex beans from the previously defined ones.

<beans ...>

  <!-- Database Datasource configuration -->
  <bean id="dataSource" class="org.springframework.jdbc.datasource.DriverManagerDataSource">
    <property name="driverClassName" value="com.mysql.jdbc.Driver" />
    <property name="url" value="jdbc:mysql://localhost:3306/facetdb" />
    <property name="username" value="root" />
    <property name="password" value="secret" />
  </bean>

  <!-- Lucene index datasource configuration -->
  <bean id="fsDirectory" class="org.springmodules.lucene.index.support.FSDirectoryFactoryBean">
    <property name="location" value="file:/tmp/soapindex" />
  </bean>

  <bean id="searcherFactory" class="org.springmodules.lucene.search.factory.SimpleSearcherFactory">
    <property name="directory" ref="fsDirectory" />
  </bean>

  <!-- DAO -->
  <bean id="facetsDao" class="com.mycompany.mymodule.db.FacetsDao">
    <property name="dataSource" ref="dataSource" />
  </bean>

  <!-- Searcher -->
  <bean id="facetedSoapSearcher" class="com.mycompany.mymodule.search.FacetedSoapSearcher">
    <property name="searcherFactory" ref="searcherFactory" />
    <property name="analyzer">
      <bean class="org.apache.lucene.analysis.SimpleAnalyzer" />
    </property>
    <property name="facetsDao" ref="facetsDao" />
  </bean>

  <!-- Controller -->
  <bean id="facetedSearchController" class="com.mycompany.mymodule.controller.FacetedSearchController">
    <property name="facetedSoapSearcher" ref="facetedSoapSearcher" />
  </bean>

</beans>

My plan was to leave the dataSource and fsDirectory definitions in XML, since they represent the parts of the application that operations folks may need to customize, and move the rest of the definitions out to a Java class. After I was done, this is what the Java class looked like:

@Configuration
public class MyConfiguration extends ConfigurationSupport {

  @Bean(scope = Scope.SINGLETON)
  public SearcherFactory searcherFactory() throws Exception {
    SimpleSearcherFactory searcherFactory = new SimpleSearcherFactory();
    searcherFactory.setDirectory((FSDirectory) getBean("fsDirectory"));
    return searcherFactory;
  }

  @Bean(scope = Scope.SINGLETON)
  public FacetsDao facetsDao() {
    FacetsDao facetsDao = new FacetsDao();
    facetsDao.setDataSource((DataSource) getBean("dataSource"));
    return facetsDao;
  }

  @Bean(scope = Scope.SINGLETON)
  public FacetedSoapSearcher facetedSoapSearcher() throws Exception {
    FacetedSoapSearcher facetedSoapSearcher = new FacetedSoapSearcher();
    facetedSoapSearcher.setSearcherFactory(searcherFactory());
    facetedSoapSearcher.setAnalyzer(new SimpleAnalyzer());
    facetedSoapSearcher.setFacetsDao(facetsDao());
    return facetedSoapSearcher;
  }

  @Bean(scope = Scope.SINGLETON)
  public FacetedSearchController facetedSearchController() throws Exception {
    FacetedSearchController facetedSearchController = new FacetedSearchController();
    facetedSearchController.setFacetedSoapSearcher(facetedSoapSearcher());
    return facetedSearchController;
  }
}

Notice that our configuration class subclasses ConfigurationSupport, which gives us access to the getBean(String) method for pulling beans out of the main application context. For unit testing, this class can be used standalone. You can get bean references out of an AnnotationApplicationContext like so:

  ApplicationContext context = new AnnotationApplicationContext(MyConfiguration.class);
  MyBean myBean = (MyBean) context.getBean("myBean");
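
For example, a quick JUnit sketch of such a standalone test might look like the following. Note that in my case the dataSource and fsDirectory beans still come from the XML context, so they would have to be made available (or stubbed out) for this to actually run:

public class MyConfigurationTest extends TestCase {

  public void testFacetedSearchControllerIsWired() throws Exception {
    // stands up the JavaConfig-defined beans without the rest of the web application
    ApplicationContext context = new AnnotationApplicationContext(MyConfiguration.class);
    FacetedSearchController controller =
      (FacetedSearchController) context.getBean("facetedSearchController");
    assertNotNull(controller);
  }
}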

On the XML side, you need to declare this class as a bean in the applicationContext, along with a ConfigurationPostProcessor that reads the annotated configuration classes and registers their bean definitions in the applicationContext. The new applicationContext.xml file is shown below:

<beans ...>

  <!-- common definitions -->

  <!-- Database Datasource configuration -->
  <bean id="dataSource" class="org.springframework.jdbc.datasource.DriverManagerDataSource">
    <property name="driverClassName" value="com.mysql.jdbc.Driver" />
    <property name="url" value="jdbc:mysql://localhost:3306/facetdb" />
    <property name="username" value="root" />
    <property name="password" value="secret" />
  </bean>

  <!-- Lucene index datasource configuration -->
   <bean id="fsDirectory" class="org.springmodules.lucene.index.support.FSDirectoryFactoryBean">
    <property name="location" value="file:/tmp/soapindex" />
  </bean>

  <!-- module definitions -->
  <bean class="com.mycompany.mymodule.MyConfiguration"/>

  <bean class="org.springframework.beans.factory.java.ConfigurationPostProcessor"/>

</beans>

The conversion is almost too easy, but it all works. My unit tests behave identically under the old and new configuration setups. My initial concern was that the annotations did not have all the features I was used to from the bean element's attributes. I use init-method and destroy-method quite frequently, but could not find any documentation on them. A quick look at the source code for the Bean annotation reveals a TODO to add initMethod and destroyMethod attributes, so I am guessing that a later release will have these features. I think it is important to include them, more important than autowiring anyway, which is already supported.
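
In the meantime, one workaround for initialization (my own, not a JavaConfig feature) is to simply call the init method by hand inside the @Bean method, since we are writing plain Java anyway. The ExpensiveBean class and its init() method below are hypothetical:

@Bean(scope = Scope.SINGLETON)
public ExpensiveBean expensiveBean() throws Exception {
  ExpensiveBean bean = new ExpensiveBean();
  bean.setDataSource((DataSource) getBean("dataSource"));
  bean.init(); // what init-method="init" would have done in XML
  return bean;
}

There is no equally clean substitute for destroy-method, though, which is why I would still like to see both attributes on the annotation.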

One small nit: I realize this is beta software, but it would be nice if JavaConfig could be made available from ibiblio's maven2 repository. For those of us who use maven2, figuring out the dependencies from the currently available JAR is like going back to writing Java code in vim after having used an IDE for a while - possible, but not fun. FWIW, I had to add aspectj-runtime-1.0.jar to my classpath from the libraries included with the distribution.

Resources

  1. Craig Walls, the author of Spring in Action, compares Spring JavaConfig and Guice in his "Guice vs Spring" blog entry.
  2. A page from Sun describing annotations in JDK 1.5.

Saturday, April 14, 2007

Lucene Search within Search with BitSets

The Search within Search functionality is typically used when presenting results from an index in response to a user-defined query. In addition to the search results returned to the user, we may want to show the user that he can drill down to get more focused results for his query. For example, if the user searched for "cancer" on a medical search engine, in addition to pages that discuss cancer, we may want to show him how many hits for "brain cancer", "lung cancer", etc., are available in our index. These would be represented as a set of navigation links on the page. Clicking one of these links would spawn a fresh search with the augmented query term.

If you use Lucene, you will know that the popular ways of doing this are to either use a BooleanQuery or to search with a QueryFilter. A less popular, but incredibly powerful, way to count facet hits is to use the BitSets returned by QueryFilter. My ex-colleague Chris Hostetter refers to it in passing when announcing that CNET category pages are powered by Lucene.
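
The core idea is just set intersection over document ids. Here is a toy example with java.util.BitSet, unrelated to any real index, that shows the arithmetic:

import java.util.BitSet;

public class BitSetIntersectionDemo {

  public static void main(String[] args) {
    // imagine an index of 8 documents; bit i is set if document i matched the query
    BitSet cancerDocs = new BitSet(8);   // matches for "cancer"
    cancerDocs.set(0); cancerDocs.set(2); cancerDocs.set(3); cancerDocs.set(7);

    BitSet lungDocs = new BitSet(8);     // matches for "lung"
    lungDocs.set(2); lungDocs.set(5); lungDocs.set(7);

    BitSet both = (BitSet) cancerDocs.clone();
    both.and(lungDocs);                  // intersection: documents matching both
    System.out.println("lung cancer count = " + both.cardinality()); // prints 2
  }
}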

In this article, I present a facet hit counting class which, given a reference to an IndexSearcher, the base Lucene Query, and a Map of facet names to their corresponding Query objects, returns a Map of facet names to their hit counts.

Caller code

The code for the caller would look something like this:

  IndexSearcher searcher = new IndexSearcher("/path/to/index");
  Query baseQuery = new TermQuery(new Term("body", "cancer"));
  Map<String,Query> subQueries = new HashMap<String,Query>();
  subQueries.put("Lung Cancer", new TermQuery(new Term("body", "lung")));
  subQueries.put("Brain Cancer", new TermQuery(new Term("body", "brain")));
  subQueries.put("Skin Cancer", new TermQuery(new Term("body", "skin")));
  ...
  BitSetFacetHitCounter facetHitCounter = new BitSetFacetHitCounter();
  facetHitCounter.setSearcher(searcher);
  facetHitCounter.setBaseQuery(baseQuery);
  facetHitCounter.setSubQueries(subQueries);
  Map<String,Integer> counts = facetHitCounter.getFacetHitCounts();

The BitSetFacetHitCounter class

The code for the BitSetFacetHitCounter is shown below. The getFacetHitCounts() method creates QueryFilter objects for the base query and each of the subqueries and extracts their BitSets. Each bit in a BitSet corresponds to a Document in the index: if the bit is set, the Document matched the query, otherwise it did not. The intersection of the BitSets for the base query and a subquery is another BitSet, whose bits are set only for those records where both the base query and the subquery are satisfied. In our example, the resulting BitSet will have only the bits for records containing both "cancer" and "lung" set, so counting the 1's gives us the number of records for "lung cancer".

public class BitSetFacetHitCounter implements IFacetHitCounter {

  private Query baseQuery;
  private Map<String,Query> subQueries;
  private IndexSearcher searcher;

  public BitSetFacetHitCounter() {
    super();
  }

  public void setBaseQuery(Query baseQuery) {
    this.baseQuery = baseQuery;
  }

  public void setSubQueries(Map<String,Query> subQueries) {
    this.subQueries = subQueries;
  }

  public void setSearcher(IndexSearcher searcher) {
    this.searcher = searcher;
  }

  public Map<String,Integer> getFacetHitCounts() throws Exception {
    Map<String,Integer> facetCounts = new HashMap<String,Integer>();
    IndexReader reader = searcher.getIndexReader();
    QueryFilter baseQueryFilter = new QueryFilter(baseQuery);
    BitSet baseBitSet = baseQueryFilter.bits(reader);
    for (String attribute : subQueries.keySet()) {
      QueryFilter filter = new QueryFilter(subQueries.get(attribute));
      BitSet filterBitSet = filter.bits(reader);
      facetCounts.put(attribute, getFacetHitCount(baseBitSet, filterBitSet));
    }
    return facetCounts;
  }

  private int getFacetHitCount(BitSet baseBitSet, BitSet filterBitSet) {
    filterBitSet.and(baseBitSet);
    return filterBitSet.cardinality();
  }
}

Alternate implementations

The other options for doing this are using BooleanQueries and searching with QueryFilters. The code for the getFacetHitCounts() method using each of these approaches is shown below:

  // using Boolean queries
  public Map<String,Integer> getFacetHitCounts() throws Exception {
    Map<String,Integer> facetCounts = new HashMap<String,Integer>();
    for (String attribute : subQueries.keySet()) {
      BooleanQuery facetQuery = new BooleanQuery();
      facetQuery.add(baseQuery, BooleanClause.Occur.MUST);
      facetQuery.add(subQueries.get(attribute), BooleanClause.Occur.MUST);
      Hits hits = searcher.search(facetQuery);
      facetCounts.put(attribute, hits.length());
    }
    return facetCounts;
  }

  // using search with QueryFilters
  public Map<String, Integer> getFacetHitCounts() throws Exception {
    Map<String,Integer> facetCounts = new HashMap<String,Integer>();
    for (String attribute : subQueries.keySet()) {
      QueryFilter filter = new QueryFilter(subQueries.get(attribute));
      Hits hits = searcher.search(baseQuery, filter);
      facetCounts.put(attribute, hits.length());
    }
    return facetCounts;
  }

Some performance numbers

I ran the same test case through all three implementations, working with 6 facets, in order to compare results. My experiments are not controlled; I simply tracked elapsed time using System.currentTimeMillis() in my JUnit test code, so your mileage may vary.

Implementation                  Elapsed time (ms)
Bit Set implementation          40
Query Filter implementation     41
Boolean Query implementation    50

As you can see, there does not seem to be much difference, performance-wise, between the three implementations. However, I am guessing that the BitSet approach will outperform the others as the number of invocations increases. Also, both the QueryFilter and the BitSet approaches can take advantage of QueryFilter's caching within Lucene, which is useful if you are not using external caching.
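
As I understand it, that caching is per QueryFilter instance and per IndexReader, so to benefit from it you have to hold on to the filter objects across calls rather than constructing new ones per request. A sketch of what that might look like (the field and method names are mine):

  // reuse QueryFilter instances across requests so the bits computed for a given
  // IndexReader can be cached inside each filter
  private Map<String,QueryFilter> facetFilters = new HashMap<String,QueryFilter>();

  private QueryFilter getFacetFilter(String facetName, Query facetQuery) {
    QueryFilter filter = facetFilters.get(facetName);
    if (filter == null) {
      filter = new QueryFilter(facetQuery);
      facetFilters.put(facetName, filter);
    }
    return filter;
  }

One caveat if you do this with the BitSet approach above: clone the BitSet returned by bits() before calling and() on it, since and() modifies the receiving BitSet in place.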

Saturday, April 07, 2007

Document Classification using Naive Bayes

I have written earlier about faceted searching, where each facet a document exposes represents a tag associated with that document. Of course, one of the most difficult aspects of setting up such a system is setting up the tags themselves. One way to build up the tag associations is to delegate the task to the creator of the document, an approach taken by sites with user-generated content. Often, however, we have a large number of untagged documents that we want to present as a searchable faceted collection. In such cases, we have to assign the tags ourselves, which can be quite labor-intensive if done manually.

One popular way to automatically tag documents is to use the Naive Bayes Classifier (NBC) algorithm. You can read about the math in the link, but basically NBC relies on the fact that if we know the probabilities of words appearing in documents of a certain category, then given the set of words in a new document, we can predict whether or not the new document belongs to that category.
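
The arithmetic behind this is simple: for each (non stop) word we track how often it appears in documents of the category versus other documents, turn that into a per-word probability, and then combine the per-word probabilities into a single score. A bare-bones sketch of that combination step, using the standard two-class formula with equal priors (not Classifier4J's actual code):

  // p[i] is P(category | word i), e.g. matchCount / (matchCount + nonMatchCount)
  public static double combine(double[] p) {
    double product = 1.0d;
    double inverseProduct = 1.0d;
    for (double pi : p) {
      product *= pi;
      inverseProduct *= (1.0d - pi);
    }
    // P(category | all words) under the naive independence assumption,
    // e.g. combine(new double[] {0.9, 0.8, 0.3}) is about 0.94
    return product / (product + inverseProduct);
  }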

I first heard of NBC from an ex-colleague who suggested the automatic tagging idea, extending what he understood about how the SpamAssassin email spam filter works. Shortly thereafter, I read about it in Dr. Dobb's Journal, but I never had the opportunity to actually use it until now.

I figured that since NBC seemed to be a useful algorithm, there would be open source implementations available on the web. I found quite a few here. I looked through several, but the only one I saw with halfway decent user documentation was Classifier4J, so I chose that for my implementation of the automated tagger.

For my test data, I chose a collection of 21 articles I had written for my website years ago and manually categorized into "Databases", "Web Development" and "Linux". The plan was to train a BayesianClassifier instance per category with one match document from the target category and two non-match documents from the other two categories, then have it analyze all 21 documents. My initial implementation used the components provided in the classifier4j distribution - SimpleWordsDataSource for the words data source, SimpleHTMLTokenizer for the tokenizer and DefaultStopWordsProvider for the stop words provider, as sketched below.
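
That first cut was wired up roughly like the sketch below, one classifier per category. The file variables are placeholders, exception handling is omitted, and I am assuming no-arg constructors for the stock components:

  // stock classifier4j components, one classifier per category
  BayesianClassifier dbClassifier = new BayesianClassifier(
    new SimpleWordsDataSource(), new SimpleHTMLTokenizer(), new DefaultStopWordsProvider());

  // one match document from the "Databases" category...
  dbClassifier.teachMatch(FileUtils.readFileToString(dbTrainingFile, "UTF-8"));
  // ...and one non-match document from each of the other two categories
  dbClassifier.teachNonMatch(FileUtils.readFileToString(webTrainingFile, "UTF-8"));
  dbClassifier.teachNonMatch(FileUtils.readFileToString(linuxTrainingFile, "UTF-8"));

  // probability that a given document belongs to the "Databases" category
  double p = dbClassifier.classify(FileUtils.readFileToString(someFile, "UTF-8"));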

However, the classification results were quite poor, and I wanted to find out why. I tried to build the package from source, but the project uses Maven 1.x, which I am not familiar with, and I ended up building an empty jar file. I then tried to look at the words and their probabilities using the Eclipse debugger, but that did not give me any additional insight. So, even though I try to avoid recreating functionality as much as possible, I ended up replacing most of the user-level components, depending only on classifier4j's core classes to do the probability calculations.

Usage

For convenience, I created the AutoTagger class, which is called from client code as follows:

  AutoTagger autoTagger = new AutoTagger();
  autoTagger.setStopwordFile(new File("/path/to/my/stopwords.txt"));
  autoTagger.setDataSource(new DriverManagerDataSource("com.mysql.jdbc.Driver",
    "jdbc:mysql://localhost:3306/classifierdb", "user", "pass"));

  autoTagger.addTrainingFiles("database", databaseFilesArray);
  autoTagger.addTrainingFiles("web", webFilesArray);
  autoTagger.addTrainingFiles("linux", linuxFilesArray);
  autoTagger.train();

  double p = autoTagger.getProbabilityOfFileInCategory("database", someDbFile);

Internally, the AutoTagger holds a Map of Classifier objects keyed by category. The train() call teaches each classifier the matching words for its category as well as the non-matching words from all the other categories. The Bayesian classifier tends to produce probabilities that are either 0.01, indicating no match, or 0.99, indicating a match.

The source for the AutoTagger class is shown below.

public class AutoTagger {

  private static final double CLASSIFICATION_CUTOFF_PROBABILITY = 0.5;
  
  private File stopwordFile;
  private DataSource dataSource;
  
  private Map<String,BatchingBayesianClassifier> classifiers = 
    new HashMap<String,BatchingBayesianClassifier>();
  private MultiMap categoryMap = new MultiHashMap();
  
  public AutoTagger() {
    super();
  }
  
  public void setStopwordFile(File stopwordFile) {
    this.stopwordFile = stopwordFile;
  }
  
  public void setDataSource(DataSource dataSource) {
    this.dataSource = dataSource;
  }
  
  public void addTrainingFiles(String category, File[] trainingFiles) {
    for (File trainingFile : trainingFiles) {
      categoryMap.put(category, trainingFile);
    }
    // if an instance of the classifier does not exist for category, create one
    if (! classifiers.containsKey(category)) {
      BatchingBayesianClassifier classifier = new BatchingBayesianClassifier(
        new JdbcWordsDataSource(dataSource),
        new CyberNekoHtmlTokenizer(DefaultTokenizer.BREAK_ON_WORD_BREAKS),
        new FileDrivenStopWordProvider(stopwordFile));
      classifiers.put(category, classifier);
    }
  }
  
  @SuppressWarnings("unchecked")
  public void train() throws WordsDataSourceException, ClassifierException, IOException {
    List<String> categoryList = new ArrayList<String>();
    categoryList.addAll(categoryMap.keySet());
    // teach the classifiers in all categories
    for (int i = 0; i < categoryList.size(); i++) {
      String matchCategory = categoryList.get(i);
      List<String> nonmatchCategories = new ArrayList<String>();
      for (int j = 0; j < categoryList.size(); j++) {
        if (i != j) {
          nonmatchCategories.add(categoryList.get(j));
        }
      }
      BatchingBayesianClassifier classifier = classifiers.get(matchCategory);
      List<File> teachMatchFiles = (List<File>) categoryMap.get(matchCategory);
      for (File teachMatchFile : teachMatchFiles) {
        classifier.teachMatch(matchCategory, FileUtils.readFileToString(teachMatchFile, "UTF-8"));
        classifiers.put(matchCategory, classifier);
        for (String nonmatchCategory : nonmatchCategories) {
          classifier.teachNonMatch(nonmatchCategory,
            FileUtils.readFileToString(teachMatchFile, "UTF-8"));
          classifiers.put(nonmatchCategory, classifier);
        }
      }
    }
    classifiers.clear();
  }
  
  public boolean isFileInCategory(String category, File file)
      throws ClassifierException, WordsDataSourceException, IOException {
    return getProbabilityOfFileInCategory(category, file) >= CLASSIFICATION_CUTOFF_PROBABILITY;
  }
  
  public double getProbabilityOfFileInCategory(String category, File file) 
      throws ClassifierException, WordsDataSourceException, IOException {
    if (! classifiers.containsKey(category)) {
      BatchingBayesianClassifier classifier = new BatchingBayesianClassifier(
        new JdbcWordsDataSource(dataSource),
        new CyberNekoHtmlTokenizer(DefaultTokenizer.BREAK_ON_WORD_BREAKS),
        new FileDrivenStopWordProvider(stopwordFile));
      classifiers.put(category, classifier);
    }
    BatchingBayesianClassifier classifier = classifiers.get(category);
    if (classifier == null) {
      throw new IllegalArgumentException("Unknown category:" + category);
    }
    return classifier.classify(category, FileUtils.readFileToString(file, "UTF-8"));
  }
}

JdbcWordsDataSource

To be able to view (for debugging) the words that were being considered during classification, I needed to put them in a database. However, the provided JDBCWordsDataSource is very slow, because it does an insert or update for each non-stop word in the input document. I created a similar implementation, JdbcWordsDataSource, that accumulates the inserts and updates until the entire document has been read and then applies them all at once. It does the same thing during classification, batching up all the words and issuing a single select call to get back all the word probability data. This produces a much more tolerable response time for the train() call (which is actually three calls in my case, one teachMatch() and two teachNonMatch() calls), and an almost instantaneous response for the classify() call. The code for my JdbcWordsDataSource is shown below:

/**
 * A Jdbc based implementation of ICategorisedWordsDataSource that can be
 * independently trained using files.
 */
public class JdbcWordsDataSource implements ICategorisedWordsDataSource {

  private JdbcTemplate jdbcTemplate;
  private Map<String,Integer> wordCountMap = new HashMap<String,Integer>();
  private Transformer quotingLowercasingTransformer = new Transformer() {
    public Object transform(Object input) {
      return "'" + StringUtils.lowerCase((String) input) + "'";
    }
  };
  
  public JdbcWordsDataSource(DataSource dataSource) {
    this.jdbcTemplate = new JdbcTemplate(dataSource);
  }
  
  public void addMatch(String word) throws WordsDataSourceException {
    addMatch(ICategorisedClassifier.DEFAULT_CATEGORY, word);
  }

  public void addMatch(String category, String word) throws WordsDataSourceException {
    addWord(word);
  }

  public void addNonMatch(String word) throws WordsDataSourceException {
    addNonMatch(ICategorisedClassifier.DEFAULT_CATEGORY, word);
  }

  public void addNonMatch(String category, String word) throws WordsDataSourceException {
    addWord(word);
  }

  public WordProbability getWordProbability(String word) throws WordsDataSourceException {
    return getWordProbability(ICategorisedClassifier.DEFAULT_CATEGORY, word);
  }

  @SuppressWarnings("unchecked")
  public WordProbability getWordProbability(String category, String word) 
      throws WordsDataSourceException {
    int matchCount = 0;
    int nonmatchCount = 0;
    List<Map<String,Integer>> rows = jdbcTemplate.queryForList(
      "select match_count, nonmatch_count " +
      "from word_probability " +
      "where word = ? and category = ?", 
      new String[] {word, category});
    for (Map<String,Integer> row : rows) {
      matchCount = row.get("MATCH_COUNT");
      nonmatchCount = row.get("NONMATCH_COUNT");
      break;
    }
    return new WordProbability(word, matchCount, nonmatchCount);
  }

  @SuppressWarnings("unchecked")
  public WordProbability[] calcWordsProbability(String category, String[] words) {
    List<WordProbability> wordProbabilities = new ArrayList<WordProbability>();
    List<String> wordsList = Arrays.asList(words);
    String query = "select word, match_count, nonmatch_count from word_probability where word in (" +
      StringUtils.join(new TransformIterator(wordsList.iterator(), quotingLowercasingTransformer), ',') +
      ") and category=?"; 
    List<Map<String,Object>> rows = jdbcTemplate.queryForList(query, new String[] {category});
    for (Map<String,Object> row : rows) {
      String word = (String) row.get("WORD");
      int matchCount = (Integer) row.get("MATCH_COUNT");
      int nonmatchCount = (Integer) row.get("NONMATCH_COUNT");
      WordProbability wordProbability = new WordProbability(word, matchCount, nonmatchCount);
      wordProbability.setCategory(category);
      wordProbabilities.add(wordProbability);
    }
    return wordProbabilities.toArray(new WordProbability[0]);
  }
  
  public void initWordCountMap() {
    wordCountMap.clear();
  }
  
  public void flushWordCountMap(String category, boolean isMatch) {
    for (String word : wordCountMap.keySet()) {
      int count = wordCountMap.get(word);
      if (isWordInCategory(category, word)) {
        updateWordMatch(category, word, count, isMatch);
      } else {
        insertWordMatch(category, word, count, isMatch);
      }
    }
  }
  
  @SuppressWarnings("unchecked")
  public void removeDuplicateWords() {
    List<Map<String,Object>> rows = jdbcTemplate.queryForList(
      "select word, count(*) dup_count " +
      "from word_probability " +
      "group by word " +
      "having dup_count > 1");
    List<String> words = new ArrayList<String>();
    for (Map<String,Object> row : rows) {
      words.add((String) row.get("WORD"));
    }
    jdbcTemplate.update("delete from word_probability where word in (" +
      StringUtils.join(new TransformIterator(words.iterator(), quotingLowercasingTransformer), ',') +
      ")");
  }
  
  private void addWord(String word) {
    int originalCount = 0;
    if (wordCountMap.containsKey(word)) {
      originalCount = wordCountMap.get(word);
    }
    wordCountMap.put(word, (originalCount + 1));
  }
  
  /**
   * Return true if the word is found in the category.
   * @param category the category to look up 
   * @param word the word to look up.
   * @return true or false
   */
  @SuppressWarnings("unchecked")
  private boolean isWordInCategory(String category, String word) {
    List<Map<String,String>> rows = jdbcTemplate.queryForList(
      "select word from word_probability where category = ? and word = ?", 
      new String[] {category, word});
    return (rows.size() > 0);
  }

  /**
   * @param category the category to update.
   * @param word the word to update.
   * @param isMatch if true, the word is a match for the category.
   */
  private void updateWordMatch(String category, String word, int count, boolean isMatch) {
    if (isMatch) { 
      jdbcTemplate.update(
        "update word_probability set match_count = match_count + ? " +
        "where category = ? and word = ?", 
        new Object[] {count, category, word});
    } else {
      jdbcTemplate.update(
        "update word_probability set nonmatch_count = nonmatch_count + ? " +
        "where category = ? and word = ?", 
        new Object[] {count, category, word});
    }
  }

  /**
   * @param category the category to insert.
   * @param word the word to update.
   * @param isMatch if true, the word is a match for the category.
   */
  private void insertWordMatch(String category, String word, int count, boolean isMatch) {
    if (isMatch) {
      jdbcTemplate.update("insert into word_probability(" +
        "category, word, match_count, nonmatch_count) values (?, ?, ?, 0)", 
        new Object[] {category, word, count});
    } else {
      jdbcTemplate.update("insert into word_probability(" +
          "category, word, match_count, nonmatch_count) values (?, ?, 0, ?)", 
          new Object[] {category, word, count});
    }
  }
}

The JdbcWordsDataSource decouples word accumulation and persistence into two separate methods, which need to be called by the classifier. The accumulation is done entirely in memory, and a call to flushWordCountMap() actually persists the map to the database.
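
For reference, the word_probability table that this class works against is nothing more than the four columns used in the SQL above. A minimal schema (my own; the column sizes are arbitrary) can be created like so:

  // minimal schema assumed by JdbcWordsDataSource; column names match the SQL above
  jdbcTemplate.execute(
    "create table word_probability (" +
    "  category       varchar(32)  not null, " +
    "  word           varchar(255) not null, " +
    "  match_count    int not null default 0, " +
    "  nonmatch_count int not null default 0)");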

BatchingBayesianClassifier

In order to use the batching capability, I needed to create a subclass of BayesianClassifier that accepts only these particular implementations and overrides the database-dependent methods of the parent. The BatchingBayesianClassifier is shown below:

/**
 * Batches words for performance against the JdbcWordsDataSource. This is 
 * specific to this application's needs, so the constructor forces the caller
 * to provide specific implementations of the super-class's ctor args.
 */
public class BatchingBayesianClassifier extends BayesianClassifier {

  public BatchingBayesianClassifier(JdbcWordsDataSource wordsDataSource, 
      CyberNekoHtmlTokenizer tokenizer, FileDrivenStopWordProvider stopwordsProvider) {
    super(wordsDataSource, tokenizer, stopwordsProvider);
  }
  
  protected boolean isMatch(String category, String input[]) throws WordsDataSourceException {
    return (super.classify(category, input) > super.getMatchCutoff());
  }

  protected double classify(String category, String words[]) throws WordsDataSourceException {
    List<String> nonStopwords = new ArrayList<String>();
    FileDrivenStopWordProvider stopwordsProvider = (FileDrivenStopWordProvider) super.getStopWordProvider();
    for (String word : words) {
      if (stopwordsProvider.isStopWord(word)) {
        continue;
      }
      nonStopwords.add(word);
    }
    JdbcWordsDataSource wds = (JdbcWordsDataSource) super.getWordsDataSource();
    WordProbability[] wordProbabilities = wds.calcWordsProbability(category, nonStopwords.toArray(new String[0]));
    return super.normaliseSignificance(super.calculateOverallProbability(wordProbabilities));
  }

  protected void teachMatch(String category, String words[]) throws WordsDataSourceException {
    JdbcWordsDataSource wds = (JdbcWordsDataSource) super.getWordsDataSource();
    wds.initWordCountMap();
    super.teachMatch(category, words);
    wds.flushWordCountMap(category, true);
  }

  protected void teachNonMatch(String category, String words[]) throws WordsDataSourceException {
    JdbcWordsDataSource wds = (JdbcWordsDataSource) super.getWordsDataSource();
    wds.initWordCountMap();
    super.teachNonMatch(category, words);
    wds.flushWordCountMap(category, false);
  }

}

CyberNekoHtmlTokenizer

I also created my own implementation of the HTML tokenizer using the NekoHTML parser from CyberNeko. This was because the SimpleHTMLTokenizer was crashing on the (admittedly bad and nowhere near spec-compliant) HTML in my documents. The NekoHTML parser is more forgiving, and I was able to pull out the body of an HTML document with the following implementation:

public class CyberNekoHtmlTokenizer extends DefaultTokenizer {

  public CyberNekoHtmlTokenizer() {
    super();
  }
  
  public CyberNekoHtmlTokenizer(int tokenizerConfig) {
    super(tokenizerConfig);
  }
  
  /**
   * Uses the Cyberneko HTML parser to parse out the body content from the
   * HTML file as a stream of text.
   * @see net.sf.classifier4J.ITokenizer#tokenize(java.lang.String)
   */
  public String[] tokenize(String input) {
    return super.tokenize(getBody(input));
  }
  
  public String getBody(String input) {
    try {
      DOMParser parser = new DOMParser();
      parser.parse(new InputSource(new ByteArrayInputStream(input.getBytes())));
      Document doc = parser.getDocument();
      NodeList bodyTags = doc.getElementsByTagName("BODY");
      if (bodyTags.getLength() == 0) {
        throw new Exception("No body tag in this HTML document");
      }
      Node bodyTag = bodyTags.item(0);
      return bodyTag.getTextContent();
    } catch (Exception e) {
      throw new RuntimeException("HTML Parsing failed on this document", e);
    }
  }
}

FileDrivenStopWordProvider

The DefaultStopWordProvider contained a hard-coded array of stop words, which was pretty basic, so I built one that works off a file (the contents of which I scraped from the classifier4j message board, btw) and also treats numbers as stop words. The code for the FileDrivenStopWordProvider is shown below:

public class FileDrivenStopWordProvider implements IStopWordProvider {

  // LOGGER was not declared in the original listing; a log4j Logger is assumed here
  private static final Logger LOGGER = Logger.getLogger(FileDrivenStopWordProvider.class);

  private SortedSet<String> words = new TreeSet<String>();
  
  public FileDrivenStopWordProvider(File stopWordFile) {
    BufferedReader reader = null;
    try {
      reader = new BufferedReader(
          new InputStreamReader(new FileInputStream(stopWordFile)));
      String word;
      while ((word = reader.readLine()) != null) {
        words.add(StringUtils.lowerCase(word.trim()));
      }
    } catch (FileNotFoundException e) {
      LOGGER.error("File:" + stopWordFile.getName() + " does not exist", e);
    } catch (IOException e) {
      LOGGER.error("Error reading file:" + stopWordFile.getName(), e);
    } finally {
      // commons-io utility; quietly closes the reader even if loading failed
      IOUtils.closeQuietly(reader);
    }
  }
  
  public boolean isStopWord(String word) {
    return words.contains(StringUtils.lowerCase(word.trim())) || StringUtils.isNumeric(word);
  }
}

Results

I ran the AutoTagger in two scenarios. The first was a low-training scenario, where I took the first file created in each category, trained the classifiers with those, and then ran the rest of the files against the trained classifiers. The assumption was that I knew what I was doing when classifying the first article, rather than attempting to shoehorn an article into an existing category set. The results of the run are shown below. The rows in gray indicate the files that were used for training.

File name Orig. class P(database) P(web) P(linux) Tags
artdb001 database 0.99 0.01 0.01 database
artdb002 database 0.99 0.01 0.01 database
artdb003 database 0.01 0.01 0.01 (none)
artdb005 database 0.01 0.01 0.01 (none)
artdb006 database 0.01 0.01 0.01 (none)
artdb007 database 0.01 0.01 0.01 (none)
artwb001 web 0.01 0.99 0.01 web
artwb002 web 0.01 0.01 0.01 (none)
artwb003 web 0.01 0.01 0.01 (none)
artwb004 web 0.01 0.01 0.01 (none)
artwb005 web 0.01 0.01 0.01 (none)
artwb006 web 0.01 0.01 0.01 (none)
artwb007 web 0.01 0.01 0.01 (none)
artli001 linux 0.01 0.01 0.01 (none)
artli002 linux 0.01 0.01 0.01 (none)
artli003 linux 0.01 0.01 0.01 (none)
artli004 linux 0.01 0.01 0.01 (none)
artli005 linux 0.01 0.01 0.01 (none)
artli006 linux 0.01 0.01 0.99 linux
artli007 linux 0.01 0.01 0.01 (none)
artli008 linux 0.01 0.01 0.01 (none)

As you can see, the results are not too great. Almost none of the documents besides the ones used for training were matched, which could be due to the paucity of training data. To rectify the situation, I created a high-training scenario, where all but one of the files in each category are used for training, and the trained classifiers are then let loose on the remaining file to see what category it falls into. The results for this test are shown below:

File name Orig. class P(database) P(web) P(linux) Tags
artdb001 database 0.99 0.01 0.01 database
artdb002 database 0.99 0.01 0.01 database
artdb003 database 0.99 0.01 0.01 database
artdb005 database 0.01 0.01 0.01 (none)
artdb006 database 0.99 0.99 0.01 database, web
artdb007 database 0.01 0.01 0.01 (none)
artwb001 web 0.01 0.99 0.01 web
artwb002 web 0.01 0.99 0.01 web
artwb003 web 0.01 0.01 0.01 (none)
artwb004 web 0.01 0.01 0.01 (none)
artwb005 web 0.01 0.99 0.01 web
artwb006 web 0.01 0.99 0.01 web
artwb007 web 0.99 0.99 0.01 database, web
artli001 linux 0.01 0.01 0.01 (none)
artli002 linux 0.01 0.01 0.99 linux
artli003 linux 0.01 0.01 0.01 (none)
artli004 linux 0.99 0.99 0.99 database, web, linux
artli005 linux 0.01 0.01 0.99 linux
artli006 linux 0.01 0.01 0.99 linux
artli007 linux 0.01 0.01 0.99 linux
artli008 linux 0.01 0.01 0.99 linux

The results are better than in the first run, but the classifier still misses a few. A surprising finding is that some articles are found to belong to multiple categories. It is not so surprising, though, when you consider that the same person wrote all three types of articles, so a web article could involve a database, and a linux article could describe a database or webserver installation.

Conclusion

The NBC algorithm probably works best when much more training data is available than what I provided, and when the documents are more stratified (for example, politics versus technology), so there is less chance of overlapping words between categories. In my case, it does detect some things, but the results could probably be improved by providing more training data or by pruning the words in the database after training is complete and before classification is done.