Sunday, April 28, 2013

scalingpipe - porting LingPipe tutorial examples to Scala


Recently, I was tasked with evaluating LingPipe for use in our NLP pipeline. I have looked at LingPipe before, but have generally kept away from it because of its licensing. While the license is quite friendly to individual developers such as myself (as long as I share the results of my work, I can use LingPipe without paying royalties), a lot of what I do is motivated by problems at work, and LingPipe-based solutions are only practical when the company is open to the licensing costs involved.

So anyway, in an attempt to kill two birds with one stone, I decided to work through the LingPipe tutorial, but in Scala. I figured that would let me pick up the LingPipe API and also give me some additional experience with Scala. I looked around to see if anybody had done something similar, and came upon the scalingpipe project on GitHub, where Alexy Khrabrov had started by porting the Interesting Phrases tutorial example.

So I forked the project, and about a month later, I had built almost all the LingPipe tutorial demos in Scala. There are now 54 examples across all 19 categories of the tutorial. Read the project's README.md and the code for specifics. I now have some insight into LingPipe's capabilities, and a working knowledge of the API. Here is a link to my scalingpipe fork. I have sent Alexy a pull request (my first attempt at contributing to other people's projects on GitHub!).

Initially, LingPipe code appeared more complex than it needed to be - perhaps because of the heavy use of the Visitor Pattern in the tagging/chunking code, where custom ObjectHandler.handle() methods are invoked through the framework. The Text Processing with LingPipe 4 book (aka the LingPipe Book) by Bob Carpenter and Breck Baldwin has a good explanation of this approach, as well as (a lot of) NLP theory and how it is implemented in LingPipe. If you end up using LingPipe for anything beyond trivial stuff or copying/porting the examples, you should probably read the book.
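To make the callback style concrete, here is a minimal sketch of the idea in plain Scala. It does not use LingPipe at all: Handler, ToyCorpusParser and TokenCounter are hypothetical stand-ins I made up for illustration, and the real LingPipe interfaces may differ in their details. The point is only that the parser owns the iteration and pushes each item into your handle() method, rather than your code pulling items out of it.

```scala
// Hypothetical stand-in for a handler interface (NOT the LingPipe API).
trait Handler[E] {
  def handle(e: E): Unit
}

// Hypothetical toy "parser": it owns the loop and pushes each
// whitespace-tokenized line to whatever handler it is given.
class ToyCorpusParser(lines: Seq[String]) {
  def parse(handler: Handler[Seq[String]]): Unit =
    lines.foreach(line => handler.handle(line.split("\\s+").toSeq))
}

object HandlerDemo {
  // A custom handler that accumulates state across callbacks.
  class TokenCounter extends Handler[Seq[String]] {
    var count = 0
    def handle(tokens: Seq[String]): Unit = count += tokens.size
  }

  // Wire the handler into the parser and read the result afterwards.
  def countTokens(lines: Seq[String]): Int = {
    val counter = new TokenCounter
    new ToyCorpusParser(lines).parse(counter)
    counter.count
  }

  def main(args: Array[String]): Unit =
    println(countTokens(Seq("the quick brown fox", "jumps over")))
}
```

The part that felt inverted to me at first is that the result comes back as state on the handler after parse() returns, not as a return value from the parser itself.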

Now that I have a reasonable understanding of LingPipe, I plan on re-reading Building Search Applications: Lucene, LingPipe and Gate by Dr Manu Konchady, and porting some of the Lucene/LingPipe examples in there. I've had this book for a long time, and all three components described in it are now at least two major versions behind what's currently available, but I believe there may be some powerful stuff in there.
