Saturday, January 10, 2009

2008: A retrospective

Well, it is (was?) that time of year again, when people look back on the year past and make bold plans for the one ahead. This once-a-year all-fluff-no-stuff post actually started off because I had nothing to write about that first week three years ago, and it seemed a good idea to write a summary-style post. But I've been doing this now for two years, so it's almost as much a tradition as one friend's yearly pilgrimage to Reno the day after Christmas. Anyway, I guess it's good to take stock of the things I did last year and have a road map in my head of what I want to accomplish in the year ahead, so here goes.

Last year, my main focus was on learning various Information Retrieval techniques and algorithms and the math behind them. The main source for these has been the TMAP book and the Internet. My motivation for learning this was to gain some background to help improve the algorithms for the proprietary indexing processes at work. A lot of companies who do Information Retrieval typically just use a standard IR library like Lucene and build around it. Unlike them, our indexing algorithms are intimately tied to our health taxonomy, and occur before we stuff the data into a Lucene index. I've been curious about the right way to do certain things in there, and the stuff in the TMAP book was quite useful.

I also did some work trying to figure out various ways of applying graph data structures to model an ontology, with a view to providing a simpler and more maintainable API to our taxonomy for internal client applications. This never made it beyond the blog post, though, because it would have replaced a (less maintainable, in my opinion) API that we already had in production, and replicating and regression-testing all the services would have been quite a bit of work. I guess I will try to revive this idea at a future job.

Another thing I worked on for a few weeks and described in my blog was an event-driven workflow application that modeled an internal workflow that we have been trying to automate for some time. Ultimately, the automation was done (more elegantly, in my opinion) using multiprocess make (make -j). However, the exercise introduced me to Spring's and OSWorkflow's event handling functionality, which I guess will help me in some future application.

Later in the year, I started looking at various ways of processing large volumes of data in reasonable amounts of time. To that end, I started looking at Hadoop and a bit of multithreaded programming using the Actor framework. In both cases, the motivation came from some excellent talks organized by East Bay IT Group (ebig), which I joined this year.

There have been a few successes based on this blog too. One of my ideas that I introduced here and that I just accidentally talked about to one of our product people at the water cooler will ultimately become a product later this year.

This year, I plan to learn more math- and statistics-based Information Retrieval/Data Mining techniques. I still have a few chapters to go through in the TMAP book. I also plan to take a look at some larger frameworks such as GATE and LingPipe that are described very nicely in Dr Konchady's new book Building Search Applications - Lucene, LingPipe and GATE.

I also want to explore some larger frameworks that may be useful in my work, such as Solr and Carrot. I have used Carrot, but without fully understanding the clustering algorithms. I am hoping that my newfound knowledge of various IR algorithms will help me in doing this. Also, I want to use Hadoop and the Actor framework to build some real applications.

I also hope to learn a bit of Scala over this year. I started learning it late last year, and so far, it seems a very cool language. I still don't have a good use for it, but I am guessing that will become apparent as I learn more about the language.

There is a lot of open source software around, and engineers who are aware of and willing to use open source software in their work will be able to do more with less, something that we probably all need to do in the tougher economic climate ahead. I have already been doing this, and I plan to do more of it this year.

Well, that's pretty much it for my New Year's resolutions. Y'all have a very Happy New Year, everyone!

2 comments:

Unknown said...

This seems to be the right place to say thank you for a year full of great blog posts in 2008. I really like your writing style and your hands-on approach to technology. Please keep up the good work. I'm looking forward to following your posts in 2009.

Best, Michael

Sujit Pal said...

Thanks very much for your kind words and good wishes, Michael.