Monday, October 22, 2012

First Steps with NLTK

Most of what I know about NLP is a byproduct of search, i.e., finding named entities in (medical) text and annotating them with concept IDs (i.e., node IDs in our taxonomy graph). My interest in NLP so far has been mostly as a user, for example using OpenNLP to do POS tagging and chunking. I've been meaning to learn a bit more, and I did take the Stanford Natural Language Processing class on Coursera. It taught me a few things, but still not enough for me to see where a deeper knowledge would actually help me. Recently (over the past month and a half), I have been reading the NLTK Book and the NLTK Cookbook in an effort to learn more about NLTK, the Natural Language Toolkit for Python.

This is not the first time I've been through the NLTK book, but it is the first time I have tried working out all the examples and (some of) the exercises (available on GitHub here), and I feel I now understand the material a lot better than before. I also realize that there are parts of NLP that I can safely ignore at my (user) level, either because they are not that mature yet or because their scope of applicability is rather narrow. In this post, I will describe what I learned, where NLTK shines, and what one can do with it.

The NLTK book consists of 4 sections. Chapters 1-4 cover the basics; Chapters 5-7 cover language processing, tagging, classification and information extraction; Chapters 8-10 cover sentence parsing, syntax, structure and representations of meaning; and Chapter 11 covers managing linguistic data. Of these, I found the first 2 sections useful for my needs.

The NLTK package structure is reproduced below from the NLTK book, with links to the relevant sections of the PyDocs so I can easily reference the various modules when needed.

nltk.corpus* Contains corpus readers for various built-in corpora.
nltk.{tokenize, stem}* Tokenizers to split text into paragraphs, sentences and words, and stemmers to remove morphological affixes from words.
nltk.collocations* Tools to identify collocations in text.
nltk.tag* Classes and Interfaces for part-of-speech tagging.
nltk.{classify, cluster}* Various algorithms for classification and clustering, and classes for labeling tokens with category labels.
nltk.chunk* Classes and Interfaces for Chunk Parsing.
nltk.parse Classes and Interfaces for producing parses from text.
nltk.{sem, inference} Classes for Lambda calculus, first order logic, model checking.
nltk.metrics* Classes and methods for scoring processing modules, e.g. precision, recall, agreement coefficients, etc.
nltk.probability* Classes for counting and representing probability information, such as frequency distributions.
nltk.{app, chat}* Applications (graphical concordance, wordnet browser, chatbot).
nltk.toolbox Classes to manipulate data in SIL toolbox format.
* == "Interesting" from my point of view.

What I found nice about NLTK is that it comes with a whole set of tagged corpora, such as the Brown, Treebank and CoNLL-2000 corpora, so you can experiment without having to go looking for tagged data. You can do a lot with these corpora and some simple tools such as frequency distributions (nltk.FreqDist) and conditional frequency distributions (nltk.ConditionalFreqDist).
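As a rough illustration of the two classes, here is a minimal sketch. It assumes NLTK 3 (where FreqDist has most_common) and uses a small hand-made list of (word, tag) pairs as a stand-in for a real tagged corpus, so no corpus download is needed.

```python
import nltk

# Toy (word, tag) pairs standing in for a tagged corpus such as Brown.
tagged_words = [
    ("the", "DT"), ("dog", "NN"), ("saw", "VBD"), ("the", "DT"),
    ("saw", "NN"), ("and", "CC"), ("the", "DT"), ("cat", "NN"),
]

# FreqDist counts how often each sample (here, each word) occurs.
word_freq = nltk.FreqDist(word for word, _ in tagged_words)
print(word_freq["the"])          # number of times "the" occurs
print(word_freq.most_common(1))  # the most frequent word with its count

# ConditionalFreqDist keeps one FreqDist per condition.
# Here the condition is the word and the sample is its tag.
tags_by_word = nltk.ConditionalFreqDist(tagged_words)
print(sorted(tags_by_word["saw"]))  # all tags observed for "saw"
```

With a real corpus, the toy list would be replaced by something like the tagged words returned by one of the corpus readers.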

WordNet is also treated as a corpus by NLTK. This effectively gives the user access to a dictionary that "knows" the meaning of words, and can form the basis for some interesting applications.
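A minimal sketch of looking up word senses follows. It assumes NLTK 3 (where name() and definition() are methods) and that the WordNet data has been downloaded; the word "dog" is just an example, and the fallback to an empty list is my own addition so the snippet still runs without the data.

```python
from nltk.corpus import wordnet as wn

try:
    # Each synset is one sense of the word, with a short gloss.
    senses = [(s.name(), s.definition()) for s in wn.synsets("dog")]
except LookupError:
    # Raised when the WordNet data is missing;
    # nltk.download("wordnet") fetches it.
    senses = []

# Show the first few senses, if any were found.
for name, gloss in senses[:3]:
    print(name, "-", gloss)
```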

The tokenizing APIs convert text into lists of tokens. A block of text can be thought of as a list of paragraphs, which can be thought of as a list of sentences, which can be thought of as a list of words. Each word can have properties such as POS tags, IOB tags, etc., and the word and its properties can be stored as a tuple. This makes analysis extremely simple and extensible using Python's list comprehensions. Convenience methods for converting back and forth between string and tuple (nltk.tag.str2tuple and nltk.tag.tuple2str) are also available.
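To make the word-and-tuple idea concrete, here is a small sketch. The sentence and its tags are made up for illustration, and wordpunct_tokenize is chosen only because it is regex based and needs no extra model data.

```python
from nltk.tokenize import wordpunct_tokenize
from nltk.tag import str2tuple, tuple2str

# Split a sentence into word and punctuation tokens.
words = wordpunct_tokenize("The cat sat on the mat.")
print(words)

# Parse a hand-tagged string into (word, tag) tuples.
tagged_text = "The/DT cat/NN sat/VBD on/IN the/DT mat/NN ./."
tagged = [str2tuple(token) for token in tagged_text.split()]

# A list comprehension is enough to filter on a property of each word.
nouns = [word for word, tag in tagged if tag.startswith("NN")]
print(nouns)

# Convert the tuples back into the word/TAG string form.
round_trip = " ".join(tuple2str(pair) for pair in tagged)
print(round_trip)
```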

Building n-gram collocations (bigrams and trigrams have their own special methods) is also a single method call, and uses the same tuple structure to return a list of n-gram tuples. Coupled with FreqDist, one can make interesting observations about a corpus with only a few lines of code.
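For example, a minimal sketch along these lines (assuming NLTK 3, where the n-gram helpers return generators and FreqDist has most_common; the sample sentence is arbitrary):

```python
import nltk
from nltk.util import bigrams, trigrams, ngrams

tokens = "to be or not to be that is the question".split()

# Each helper yields tuples of adjacent tokens.
bigram_list = list(bigrams(tokens))
trigram_list = list(trigrams(tokens))
fourgram_list = list(ngrams(tokens, 4))

# Counting the tuples shows which word pairs recur.
bigram_freq = nltk.FreqDist(bigram_list)
print(bigram_freq.most_common(1))
```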

NLTK also comes with two stemmers (Porter and Lancaster) for English, and Snowball stemmers for a variety of other European languages.
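A quick sketch of how the stemmers are called; the sample words and the choice of German for Snowball are arbitrary. The three stemmers can return different stems for the same word, so it is worth printing them side by side.

```python
from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
german = SnowballStemmer("german")

# Compare the two English stemmers on a few words.
for word in ["running", "flies", "generously"]:
    print(word, porter.stem(word), lancaster.stem(word))

# Snowball is instantiated per language.
print(german.stem("katzen"))

# The languages Snowball supports are listed on the class.
print(SnowballStemmer.languages)
```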

While NLTK provides a large number of rule-based strategies for POS tagging, IOB (chunk) tagging, etc., tagging of any of these kinds (including Named Entity Recognition) can also be treated as a classification problem where the features are the words or POS tags prior to the current word. NLTK also allows you to build tagger stacks by chaining taggers through their backoff parameter. For classification-based approaches, it provides a large number of classification algorithms such as Decision Tree, Naive Bayes, Maximum Entropy, etc. There is a similar strategy for phrase chunking and NER as well.
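Both ideas can be sketched on a toy example. The two training sentences below are made up, and the features function (with its feature names) is a hypothetical helper of mine, not something from NLTK. A real tagger would be trained on one of the tagged corpora.

```python
import nltk

# Tiny hand-tagged training set, for illustration only.
train_sents = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
]

# Tagger stack: bigram -> unigram -> default tag "NN".
default = nltk.DefaultTagger("NN")
unigram = nltk.UnigramTagger(train_sents, backoff=default)
bigram = nltk.BigramTagger(train_sents, backoff=unigram)

# "quietly" was never seen, so it falls through to the default tagger.
tagged = bigram.tag(["the", "dog", "sleeps", "quietly"])
print(tagged)

def features(words, i, prev_tag):
    """Hypothetical feature extractor: current word plus left context."""
    return {
        "word": words[i],
        "prev_word": words[i - 1] if i > 0 else "<START>",
        "prev_tag": prev_tag,
    }

# Tagging as classification: one labeled feature dict per token.
train_set = []
for sent in train_sents:
    words = [word for word, _ in sent]
    prev_tag = "<START>"
    for i, (_, tag) in enumerate(sent):
        train_set.append((features(words, i, prev_tag), tag))
        prev_tag = tag

classifier = nltk.NaiveBayesClassifier.train(train_set)
guess = classifier.classify(features(["the", "dog"], 1, "DT"))
print(guess)
```

Swapping in another classifier such as nltk.DecisionTreeClassifier should only require changing the train call, since they consume the same (features, label) pairs.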

Overall, I am quite impressed with the package and can't wait to get started with it. More to come in the following weeks, stay tuned.

2 comments:

Ahmed said...

Can you tell me how I can split a text into paragraphs?

Sujit Pal said...

There is nothing in NLTK as far as I know... but you can exploit the structure of the text. For example, there is usually a blank line between paragraphs, so you can split the body on a double newline ("\n\n"). Also, some texts start paragraphs with an indent, so you can look for that.
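A minimal sketch of the blank-line approach in plain Python; split_paragraphs is a hypothetical helper name, and the regular expression also tolerates blank lines that contain stray spaces or tabs.

```python
import re

def split_paragraphs(text):
    """Split text into paragraphs on one or more blank lines."""
    # A boundary is a newline, optional whitespace, then another newline.
    parts = re.split(r"\n\s*\n", text.strip())
    # Drop empty pieces and trim surrounding whitespace.
    return [part.strip() for part in parts if part.strip()]

sample = "First paragraph,\nstill first.\n\nSecond paragraph.\n\n\nThird."
paragraphs = split_paragraphs(sample)
print(paragraphs)
```

Texts that mark paragraphs only with an indent would need a different pattern.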