## Friday, March 06, 2009

### Binary Naive Bayes Classifier using Lucene

Something came up at work recently that sparked off my interest in this stuff again, and it also meshes nicely with my objective of working through my TMAP book, so this week, I decided to explore building a binary Naive Bayes Classifier with Lucene. This post is a result of that exploration.

The math behind Naive Bayesian Classifiers is explained in detail here, but I needed to work backward through it to figure out what values to collect and use for training the classifier and to do the actual classification. So...

For classification purposes, document D is in category C if:
        p(C|D)
    r = -------- > 1
        p(¬C|D)

where:
    r       = likelihood ratio
    p(C|D)  = probability that category is C given document D
    p(¬C|D) = probability that category is not C given document D

We can rewrite p(C|D) as follows:

    p(C|D) = p(D ∩ C) / p(D)           ... by probability axiom
           = p(C) * p(D|C) / p(D)      ... by Bayes theorem

Similarly, we can rewrite p(¬C|D):

    p(¬C|D) = p(¬C) * p(D|¬C) / p(D)

Here p(C) and p(¬C) can be represented in terms of the number of
documents in each category C and ¬C divided by the total number of
documents in the training set:

    p(C)  = μ(C) / n
    p(¬C) = μ(¬C) / n

where:
    μ(C)  = number of docs in category C in training set.
    μ(¬C) = number of docs not in category C in training set.
    n     = number of docs in the training set.

If the words in the documents in the training set are represented by the
set {w0, w1, ..., wk}, then:

    p(D|C)  = Πi=0..k p(wi|C)
    p(D|¬C) = Πi=0..k p(wi|¬C)

So our likelihood ratio can be rewritten as:

        μ(C)       Πi=0..k p(wi|C)
    r = ------  *  ----------------
        μ(¬C)      Πi=0..k p(wi|¬C)

So, during our training phase, we need to compute the ratio of number of documents in C vs ¬C, and for each individual word, we need to find the probability of the word in category C and not C. The probability of individual words can be found by dividing the frequency of a word in each category by the total number of words in the training set.
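To make this arithmetic concrete, here is a minimal plain-Java sketch of the training computations. This is illustrative only - the class and method names here are made up, and the real classifier pulls these counts out of the Lucene index instead of in-memory word lists.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch of the Naive Bayes training computations (no Lucene). */
public class TrainingSketch {

  /**
   * Computes p(w|C) and p(w|notC) for every word by dividing the word's
   * frequency in each category by the total number of words in the
   * training set. Returns word => double[] {p(w|C), p(w|notC)}.
   */
  public static Map<String,double[]> wordProbabilities(
      List<String> wordsInC, List<String> wordsNotInC) {
    double totalWords = wordsInC.size() + wordsNotInC.size();
    Map<String,double[]> probs = new HashMap<String,double[]>();
    for (String word : wordsInC) {
      double[] p = probs.containsKey(word) ? probs.get(word) : new double[2];
      p[0] += 1.0D / totalWords;    // accumulate p(w|C)
      probs.put(word, p);
    }
    for (String word : wordsNotInC) {
      double[] p = probs.containsKey(word) ? probs.get(word) : new double[2];
      p[1] += 1.0D / totalWords;    // accumulate p(w|notC)
      probs.put(word, p);
    }
    return probs;
  }

  /** The category document ratio mu(C) / mu(notC). */
  public static double categoryDocRatio(int docsInC, int docsNotInC) {
    return (double) docsInC / (double) docsNotInC;
  }

  public static void main(String[] args) {
    List<String> inC = Arrays.asList("cocoa", "bean", "cocoa");
    List<String> notInC = Arrays.asList("coffee", "bean", "sugar");
    Map<String,double[]> probs = wordProbabilities(inC, notInC);
    System.out.println(probs.get("cocoa")[0]);    // 2 occurrences / 6 words
    System.out.println(categoryDocRatio(10, 20)); // 0.5
  }
}
```

The word probabilities and the document ratio are the only two artifacts the classification phase needs, which is why they are what the real classifier exposes via getters.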

My classifier uses Lucene, so it expects an index of pre-classified documents as its input, as well as the field name that contains the classification information. Because it is a binary classifier, it also needs to know the class that should be treated as positive (ie, C). Here is the code for the classifier - we show the training portion here.

The caller sets the index directory, the category name and value, and the other parameters (which are explained in more detail later), then calls the train() method. Once the training is complete, the word probabilities and the category document ratio are available via getters. This models how one would normally do this in real life: the training phase is typically time- and resource-intensive, but the trained data can then be used for any number of classification calls.

In the real world, though, we would probably be dealing with far larger volumes of pre-classified documents, so using an in-memory Map<String,double[]> probably wouldn't scale too well. In that case, we may want to think about using a database table to store these values. I tried that in an earlier attempt using the Classifier4J Naive Bayes classifier, so you may want to check it out if you are interested in going that route.

In the classification phase, we feed in a body of text, and for each word that is found in the training set, multiply its p(w|C)/p(w|¬C) value into the likelihood ratio. Notice that the frequency of the word in the input document does not matter, since it cancels out when calculating the ratio. Finally, we multiply in the document ratio to get the likelihood. Here is the classify() method - it's part of the same class as above; I have separated it out for purposes of explanation.
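As a standalone illustration of this step, here is a plain-Java sketch of the likelihood-ratio computation. The names are again made up; the real classify() method reads its word probabilities from the training phase described above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative sketch of the likelihood-ratio computation in classify(). */
public class ClassifySketch {

  /**
   * Multiplies p(w|C)/p(w|notC) into the ratio for each distinct input word
   * that was seen in training, then multiplies in the category document
   * ratio. wordProbs maps word => double[] {p(w|C), p(w|notC)}.
   */
  public static boolean classify(List<String> docWords,
      Map<String,double[]> wordProbs, double categoryDocRatio) {
    double r = categoryDocRatio;
    Set<String> seen = new HashSet<String>();
    for (String word : docWords) {
      double[] p = wordProbs.get(word);
      // skip words not seen in training; count each distinct word once,
      // since its input frequency cancels out of the ratio; also guard
      // against division by zero for words never seen outside C
      if (p == null || p[1] == 0.0D || !seen.add(word)) {
        continue;
      }
      r *= p[0] / p[1];
    }
    return r > 1.0D;   // in category C if the likelihood ratio exceeds 1
  }

  public static void main(String[] args) {
    Map<String,double[]> probs = new HashMap<String,double[]>();
    probs.put("cocoa", new double[] {0.4D, 0.1D});
    probs.put("coffee", new double[] {0.1D, 0.4D});
    System.out.println(classify(
        Arrays.asList("cocoa", "cocoa", "bean"), probs, 1.0D)); // true
    System.out.println(classify(
        Arrays.asList("coffee"), probs, 1.0D)); // false
  }
}
```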

### New TokenFilter to remove numbers

The training data (see below for more information about the data) I used contained a lot of numbers which did not seem to have much discriminatory value, so I decided to filter them out of the words the classifier was being trained with. It was also a good excuse to learn how to build a Lucene TokenFilter :-), which is shown below. As before, Marcus Tripp's blog post was very helpful.

```java
// Source: src/main/java/com/mycompany/myapp/tokenizers/lucene/NumericTokenFilter.java
package com.mycompany.myapp.tokenizers.lucene;

import java.io.IOException;

import org.apache.commons.lang.math.NumberUtils;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

/**
 * Filters out numeric tokens from the TokenStream.
 */
public class NumericTokenFilter extends TokenFilter {

  public NumericTokenFilter(TokenStream input) {
    super(input);
  }

  @Override
  public Token next(Token token) throws IOException {
    while ((token = input.next(token)) != null) {
      String term = token.term();
      term = term.replaceAll(",", "");
      if (! NumberUtils.isNumber(term)) {
        return token;
      }
    }
    return null;
  }
}
```

I should probably have built a different analyzer for this purpose, but I figured that numbers are a good thing to remove in most cases anyway, so I just stuck it into the SummaryAnalyzer I described in my post last week. The tokenStream() method is the only one that changed, so I show it here:

```java
// Source: src/main/java/com/mycompany/myapp/summarizers/SummaryAnalyzer.java
// (changed since last week, only changed method shown)
public class SummaryAnalyzer extends Analyzer {
  ...
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    return new PorterStemFilter(
      new StopFilter(
        new LowerCaseFilter(
          new NumericTokenFilter(
            new StandardFilter(
              new StandardTokenizer(reader)))),
        stopset));
  }
  ...
}
```

You can hook up other Analyzer implementations by calling the setAnalyzer() method of LuceneNaiveBayesClassifier. The default (if not specified) is StandardAnalyzer.

### Feature Selection with Information Gain

If you look at the results, you will see that in the first test, we used all the words we found (after tokenization) for our probability calculations. This can sometimes lead to misleading results, since there may be a large number of words with little discriminatory power (ie, those which occur with similar frequency in both categories). To remedy this, we compute the information gain for each word, and consider only the √k words with the highest information gain for classification, where k is the total number of words.

Information gain is computed using the following formula:

                               p(wi|C)
    I(wi) = p(wi|C) * log( -------------- )
                            p(wi) * p(C)

We can enable feature selection in our LuceneNaiveBayesClassifier by setting the selectTopFeatures property to true. This will invoke the InfoGainFeatureSelector shown below:

```java
// Source: src/main/java/com/mycompany/myapp/classifiers/InfoGainFeatureSelector.java
package com.mycompany.myapp.classifiers;

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InfoGainFeatureSelector {

  private double pCategory;
  private List<Word> wordProbabilities;

  public void setPCategory(double pCategory) {
    this.pCategory = pCategory;
  }

  public void setWordProbabilities(
      Map<String,double[]> wordProbabilities) {
    this.wordProbabilities = new ArrayList<Word>();
    for (String word : wordProbabilities.keySet()) {
      double[] probabilities = wordProbabilities.get(word);
      this.wordProbabilities.add(
        new Word(word, probabilities[0], probabilities[1]));
    }
  }

  public Map<String,double[]> selectFeatures() throws Exception {
    for (Word word : wordProbabilities) {
      if (word.pInCat > 0.0D) {
        word.infoGain = word.pInCat * Math.log(
          word.pInCat / ((word.pInCat + word.pNotInCat) * pCategory));
      } else {
        word.infoGain = 0.0D;
      }
    }
    Collections.sort(wordProbabilities);
    List<Word> topFeaturesList = wordProbabilities.subList(
      0, (int) Math.round(Math.sqrt(wordProbabilities.size())));
    Map<String,double[]> topFeatures = new HashMap<String,double[]>();
    for (Word topFeature : topFeaturesList) {
      topFeatures.put(topFeature.term,
        new double[] {topFeature.pInCat, topFeature.pNotInCat});
    }
    return topFeatures;
  }

  private class Word implements Comparable<Word> {

    private String term;
    private double pInCat;
    private double pNotInCat;
    public double infoGain;

    public Word(String term, double pInCat, double pNotInCat) {
      this.term = term;
      this.pInCat = pInCat;
      this.pNotInCat = pNotInCat;
    }

    public int compareTo(Word o) {
      if (infoGain == o.infoGain) {
        return 0;
      } else {
        return infoGain > o.infoGain ? -1 : 1;
      }
    }

    public String toString() {
      return term + "(" + pInCat + "," + pNotInCat + ")=" + infoGain;
    }
  }
}
```

### Prevent Overfitting

When our training data is small, we may add some numbers to the numerator and denominator when calculating the word probability, in order to make the word probability distribution smoother. Specifically, we add 1 to the numerator and k to the denominator, where k is the number of unique words in our training set. Thus:

    p(wi|C)  = (μ(wi|C) + 1) / (n + k)
    p(wi|¬C) = (μ(wi|¬C) + 1) / (n + k)

where:
    n = number of words in the training set.
    k = number of unique words in the training set.

This can be turned on in the LuceneNaiveBayesClassifier by setting the preventOverfitting property to true. In my (admittedly limited) testing, I did not see any changes in results after doing this, however.
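In code, the smoothing is just a different numerator and denominator when computing each word probability. Here is a small sketch; the class and method names are mine for illustration, not the ones in the classifier.

```java
/** Illustrative sketch of the add-one smoothing described above. */
public class SmoothingSketch {

  /**
   * Smoothed probability (freq + 1) / (n + k), where n is the total number
   * of words and k the number of unique words in the training set. An
   * unseen word (freq = 0) now gets a small non-zero probability instead
   * of zeroing out the whole product of word probabilities.
   */
  public static double smoothedProbability(int freqInCategory,
      int totalWords, int uniqueWords) {
    return (freqInCategory + 1.0D) / (totalWords + uniqueWords);
  }

  public static void main(String[] args) {
    System.out.println(smoothedProbability(0, 100, 25));  // 1/125
    System.out.println(smoothedProbability(4, 100, 25));  // 5/125
  }
}
```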

### Test code and data

The training data I used is the set of 54 files from Reuters that is hardcoded in the cluster.pl file in the TextMine project (on which the TMAP book is based). It is categorized into three categories - "cocoa", "coffee" and "sugar". The information in these files seems to be primarily aimed at commodity market investors or people within the respective industries.

For classification, I chose some files such as this file on cocoa and this file on coffee from the Reuters site. I train the classifier with the cocoa and "not cocoa" (ie the coffee and sugar) documents, then try to classify some cocoa documents and some coffee documents.

Here is my JUnit test. As you can see, the test builds the index out of the input files before the tests run (@BeforeClass) and deletes it afterwards (@AfterClass). Four tests are run, each with a different combination of featureSelection and overfitting prevention settings, and each test attempts to classify 5 documents (3 cocoa and 2 coffee).

The results of the test are shown below.

```
Training (featureSelection=false, preventOverfitting=false)
File: cocoa.txt in category:'cocoa'? true
File: cocoa1.txt in category:'cocoa'? false
File: cocoa2.txt in category:'cocoa'? false
File: coffee.txt in category:'cocoa'? false
File: coffee1.txt in category:'cocoa'? false

Training (featureSelection=true, preventOverfitting=false)
File: cocoa.txt in category:'cocoa'? true
File: cocoa1.txt in category:'cocoa'? true
File: cocoa2.txt in category:'cocoa'? true
File: coffee.txt in category:'cocoa'? false
File: coffee1.txt in category:'cocoa'? false

Training (featureSelection=true, preventOverfitting=true)
File: cocoa.txt in category:'cocoa'? true
File: cocoa1.txt in category:'cocoa'? true
File: cocoa2.txt in category:'cocoa'? true
File: coffee.txt in category:'cocoa'? false
File: coffee1.txt in category:'cocoa'? false

Training (featureSelection=false, preventOverfitting=true)
File: cocoa.txt in category:'cocoa'? true
File: cocoa1.txt in category:'cocoa'? false
File: cocoa2.txt in category:'cocoa'? false
File: coffee.txt in category:'cocoa'? false
File: coffee1.txt in category:'cocoa'? false
```

As you can see, at least for this data set, feature selection was necessary for it to classify all the test documents correctly (tests 2 and 3). Overfitting prevention did not seem to have any effect in these tests.

Update 2009-04-26: In recent posts, I have been building on code written and described in previous posts, so there were (and rightly so) quite a few requests for the code. So I've created a project on Sourceforge to host the code. You will find the complete source code built so far in the project's SVN repository.

#### 27 comments (moderated to prevent spam):

Mike said...

Thanks for the really nice post. I'm also doing some data analysis on a Lucene index and was wondering if you know how to get at collection term frequency (as opposed to document frequency)? For various calculations I need to know how many times a term occurs in the entire index. (i.e. sum of doc.term.tf across all docs) Seems like this basic number should be in the index somewhere without having to calculate it.

Sujit Pal said...

Hi Mike, thanks, and yes, you can use a similar approach to mine. Because I want to find the term frequencies across a single document, I stuff one document at a time into a Lucene index backed by a RAMDirectory. But if you already have an index with all your docs, then IndexReader.getTermFreqVector() should give you what you need.

thushara said...

the formula for the ratio (r) seems different from the ratio given by the wikipedia entry you mention in the post. i think it should be:

r = P(C)/P(!C) * product [P(wi|C)/P(wi|!C)]

sorry, i can't use the regular notation with this reduced functionality input box.

Sujit Pal said...

Hi thushara, I was trying to derive the formula for r in terms of data values I had already. I looked at the formula you mention above, and I can derive my formula for r from it. Although my formula for r can be simplified somewhat by factoring out the product so it becomes a product of fractions.

Anonymous said...

Sujit, thanks for the great post!

Do you happen to run this Binary Naive Bayes classifier against the category coffee using the 5 documents: cocoa.txt, cocoa1.txt, cocoa2.txt, coffee.txt and coffee1.txt.

When I tried to run the classifier against category cocoa, I have the same correct result as you posted.

But when I tried to classify these 5 documents against category coffee, the classifier think cocoa1.txt and cocoa2.txt belong to coffee when featureSelection is true.

Do you have any suggestion on where might be the problem?

Thanks a lot,
Lisong.

Sujit Pal said...

Thanks, Lisong. I suspect that turning on feature selection results in the classifier overfitting the data, so we get stellar results for one category but really bad ones for another. I recently ran a cross validation for another of my "bright ideas" (a Lucene based classifier), and it came back with an accuracy score of 35% :-/.

lisongliu said...

Sujit,

I recently used Weka package against the same training set and test set. Weka's Naives Bayes based classifier got 3 out of 5 (coffee.txt, coffee1.txt and cocoa.txt) right. However, weka's support vector classifier is able to do a better job. It classified all 5 correctly.

Like you said, there might be some areas that need to be improved or there is some limitation with Naive Bayes itself.

Thanks,
Lisong.

Sujit Pal said...

Thanks, yet another reason to learn how to use Weka...I have been putting it off for a while, have been through the Witten-Frank book, but haven't actually used it for anything.

nakamoto said...

it is necessary for me. but i can't execute your code. please says to me for the details. Thanks!

Sujit Pal said...

Thanks Nakamoto. To run this stuff, I used the JUnit test shown in the post. Did that not run for you? Can you post the stack trace?

Anonymous said...

Hi..Sujith..

Thanks a lot for u blog..

I m not familiar with lucene.Is it possible to build Binary Naive Bayes classifier and all JTMT (summarizers,classifiers) projects by Eclipse tool..

Sujit Pal said...

Hi Anonymous, thanks, and you are welcome. By "build" do you mean compile? If so, yes, in Eclipse, your build path should contain the contents of the lib directory (the lib directory contains the lucene jars as well).

Savita said...

thank you for the post ..i want to implement Bayes theorem using java programming language can u please help me out..am beginner with the classifier of data mining concept..can you help me out if is it k with you..

Sujit Pal said...

Hi Savita, not sure if I can provide additional help. Did the code in this post not help? If not, can you pose specific questions? There is also a non-lucene Naive Bayes classifier from the classifier4j (classifier4j.sf.net) that may be helpful if you want a pure-Java (ie no Lucene) solution.

sab said...

could you tell me where i can find pure java solution to naive bayes classifier??

Sujit Pal said...

Hi Sab, you may want to take a look at classifier4j (classifier4j.sf.net).

CH.M.N.NAVEEN said...

hello
sir,

can you please tell me which class is the main class of the program.

regards,
ch.naveen

Sujit Pal said...

Hi ch.naveen, I don't have a main class, but the closest to that is ClassifierTest, which is a JUnit test case and which is run using the "unittest" target in the build.xml (calls the Ant junit target). You can use that to build your own main method if you want.

Hardik said...

its an awesome blog by salmon Run
Can anyone help me out to use the above logic in c#
Thanks

Sujit Pal said...

Hardik said...

Hello Sujit Pal
As i dont know java so m not able to understand the above code
I really need the above coding to be converted into .net
Thanks

Hardik said...

continue for the above comment:OR i have tried lucene for document categorization which give me socre according to the document classified
Can i use the lucene logic for my document classification or do i have to use navies and lucene both for document classification..
Thanks

Hardik said...

Hi Sujit
I have converted the above code in c# but have not used your summary analyzer.I have used the standard analyzer.I have merged all the different indexes(sugar,cocoa,coffee) with one index located at src > test > resources > data > scc-index.Now i classify the 5 files given by you
According to your above output m getting accurate result only when i pass false,true(SelectTopFeatures,Preventoverfitting)..rest in all the conditions my likelihoodRatio always return 0.0.
Can you help me out
Thanks

Hardik said...

I tried your above logic with cocoa as classification in c# with feature selection=true
My results are
cocoa.txt=true
cocoa1.txt=true
cocoa1.txt=true
coffee.txt=true
coffee1.txt=false
By looking at your result only coffee.txt shows wrong result

Now when i classify as coffee category
My results are
cocoa.txt=true
cocoa1.txt=true
cocoa1.txt=true
coffee.txt=true
coffee1.txt=true

so is the above results correct?
i have used the files from src/test/resources/data/

and i have merged all the three index into one index located at
src/test/resources/data/scc-index/

Sujit Pal said...

Hi Hardik, the files I have in there are pulled from the data in the TMAP book, and they have considerable amount of overlap because the subject areas have considerable overlap which is why I did the feature selection. Even so the results are not perfect - I did not do any calculations to find how accurate the classifier is - I did not know how to at that point. One useful starting point would be to just run my code as is and see if that returns the baseline reported. Then run your C# version and see if you can replicate the results (assuming you consider the results "good" and want to replicate it of course). Another option may be to look at classifier algorithms from toolkits such as Weka or Mahout and see if that gives you better results and if so use them instead.

Anonymous said...

Awesome post. I am trying to classify documents with large data set (Train set is about 6M with about 50k categories. I see that this approach would throw out of memory error. Any thoughts of how to do it?

Thanks

Sujit Pal said...

I think it may be because I am accumulating the (term, score) tuples into the trainingSet map. If you change the code to write this out to a database table, it will probably work out. Remember to batch up your inserts during training and create your index only after training is complete, otherwise the training process can become very slow.