Sunday, August 12, 2012

Scalding for the Impatient

A few weeks ago, I wrote about Pig, a DSL that allows you to specify a data processing flow in terms of PigLatin operations, and results in a sequence of Map-Reduce jobs on the backend. Cascading is similar to Pig, except that it provides a (functional) Java API to specify a data processing flow. One obvious advantage is that everything can now be in a single language (no more having to worry about UDF integration issues), but there are others as well, as detailed here and here.

Cascading is well documented, and there is also a very entertaining series of articles titled Cascading for the Impatient that builds up a Cascading application to calculate TF-IDF of terms in a (small) corpus. The objective is to showcase the features one would need to get up and running quickly with Cascading.

Scalding is a Scala DSL built on top of Cascading. As you would expect, Cascading code is an order of magnitude shorter than equivalent Map-Reduce code. But because Java is not a functional language, implementing functional constructs leads to some verbosity in Cascading that is eliminated in Scalding, leading to even shorter and more readable code.
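To make the brevity claim concrete, here is a sketch of the canonical fields-based word count in Scalding. The class and argument names are my own, and the calls follow the Scalding 0.7-era fields-based API as I remember it, so treat this as illustrative rather than authoritative:

```scala
import com.twitter.scalding._

// A fields-based word count job: read lines, split each line into words,
// group by word to count occurrences, and write (word, count) pairs as TSV.
class WordCountJob(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => line.toLowerCase.split("\\s+") }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))
}
```

The whole pipeline is a single expression; the equivalent Cascading version needs explicit Pipe, Function, and Aggregator objects wired together by hand.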

I was looking for something to try my newly acquired Scala skills on, so I hit upon the idea of building up a similar application to calculate TF-IDF for terms in a corpus. The table below summarizes the progression of the Cascading for the Impatient series. I've provided links to the original articles for the theory (which is very nicely explained there) and links to the source codes for both the Cascading and Scalding versions.

Article[1] Description #-mappers #-reducers Code[2]
Part 1 Distributed File Copy 1 0
Part 2 Word Count 1 1
Part 3 Word Count with Scrub 1 1
Part 4 Word Count with Scrub and Stop Words 1 1
Part 5 TF-IDF 11 9

[1] - points to "Cascading for the Impatient Articles"
[2] - Cascading version links point to Paco Nathan's github repo; Scalding versions point to mine.

The code for the Scalding version is fairly easy to read if you know Scala (somewhat harder, but still possible, if you don't). The first thing to note is the relative sizes - the Scalding code is shorter and more succinct than the Cascading version. The second thing to note is that the Scalding-based code uses method calls that are not Cascading methods. You can read about the Scalding methods in the API Reference (I used the Fields-based reference exclusively). The tutorial and example code in the Scalding distribution are also helpful.
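As a reference for what the Part 5 pipeline ultimately computes, here is a plain-Scala sketch of the TF-IDF math on a small in-memory corpus. This is not Scalding code, just the underlying formulas (the object and method names are my own), which may help before reading the distributed versions:

```scala
// Plain-Scala sketch of TF-IDF: tf(t, d) is the relative frequency of term t
// in document d, and idf(t) = log(N / df(t)), where N is the corpus size and
// df(t) is the number of documents containing t.
object TfIdfSketch {
  def termFreq(doc: Seq[String]): Map[String, Double] =
    doc.groupBy(identity).map { case (t, occ) => (t, occ.size.toDouble / doc.size) }

  def idf(corpus: Seq[Seq[String]]): Map[String, Double] = {
    val n = corpus.size.toDouble
    corpus.flatMap(_.distinct).groupBy(identity)
      .map { case (t, docs) => (t, math.log(n / docs.size)) }
  }

  def tfIdf(corpus: Seq[Seq[String]]): Seq[Map[String, Double]] = {
    val idfs = idf(corpus)
    corpus.map(doc => termFreq(doc).map { case (t, tf) => (t, tf * idfs(t)) })
  }
}
```

A term appearing in every document gets an IDF of log(N/N) = 0, which is why common words score zero even before stop word removal.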

Project Setup

For my version, I created my own Scala project and used Scalding as a dependency. I describe the steps here so you can do the same if you are so inclined.

Assuming you are going to be using Scalding in your applications, you need to download and build the Scalding JAR, then publish it to your local (or corporate) code repository (sbt uses ivy2). To do this, run the following sequence of commands:

sujit@cyclone:scalding$ git clone https://github.com/twitter/scalding.git
sujit@cyclone:scalding$ cd scalding
sujit@cyclone:scalding$ sbt assembly # build scalding jar
sujit@cyclone:scalding$ sbt publish-local # to add to local ivy2 repo.

Scalding also comes with a ruby script scald.rb that you use to run Scalding jobs. It is quite convenient - it forces all arguments to be named (resulting in cleaner, more explicit argument handling code) and allows switching from local development to Hadoop mode with a single switch. It is available in the scripts subdirectory of your Scalding download. To use it outside Scalding (i.e., in your own project), you will need to soft-link it into a directory on your PATH. Copying it does not work because it depends on other parts of the Scalding download.
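For illustration, an invocation might look like the following (the job file and argument names here are my own examples, not from the series; check the script's usage output for your Scalding version):

```shell
# Run the job in local mode during development; replacing --local with
# --hdfs submits the same job to the Hadoop cluster instead.
scald.rb --local WordCountJob.scala --input data/input.txt --output data/output.tsv
```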

The next step is to generate and setup your Scalding application project. Follow these steps:

  1. Generate your project using giter8 - type g8 typesafehub/scala-sbt at the command line, and answer the prompts. Your project is created in a directory named after the Scala project name you supply.
  2. Move to the project directory - type cd scalding-impatient (in my case).
  3. Build a basic build.sbt - create a file build.sbt in the project base directory and populate it with the key value pairs for name, version and scalaVersion (blank lines between pairs are mandatory).
  4. Copy over scalding libraryDependencies - copy the libraryDependencies lines from the scalding build.sbt file and drop them into your project's build.sbt. I am not sure if this is really necessary, or whether scalding declares its transitive dependencies, which would be picked up with the single dependency declaration on scalding (see below). You may want to try omitting this step - if you succeed, please let me know and I will update accordingly.
  5. Add the scalding libraryDependency - define the libraryDependency to the scalding JAR you just built and published. The line to add is libraryDependencies += "com.twitter" % "scalding_2.9.2" % "0.7.3".
  6. Rebuild your Eclipse files - Check my previous post for details about SBT-Eclipse setup. If you are all set up, then type sbt eclipse to generate these. Your project is now ready for development using Eclipse.
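Putting steps 3 through 5 together, a minimal build.sbt might look like this (the name and version are my own choices; the Scala and Scalding versions match what I used at the time, so adjust for your build):

```scala
name := "scalding-impatient"

version := "0.1"

scalaVersion := "2.9.2"

libraryDependencies += "com.twitter" % "scalding_2.9.2" % "0.7.3"
```

Note the blank lines between settings - they are mandatory in the .sbt format of this era.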

And that's pretty much it. Hope you have as much fun coding in Scala/Scalding as I did.

6 comments (moderated to prevent spam):

Felix said...

can you do a tutorial with avro as input and output.

Sujit Pal said...

I'll try - don't know much about Avro except that it's a JSON-based serialization format.

Sujit Pal said...

Posting this on behalf of Saad Rashid, since I inadvertently deleted his comment when trying to get rid of the duplicates...

"""
Hi, Can you please provide Cascading for Impatient Part6 implementation in Scalding. There is no TDD based implementation of Checkpoint, Assertion, Debug & Sample found in Scalding DSL.

Thanks,

Saad Rashid.

"""

Sujit Pal said...

Hi Saad, I will check it out, but as far as I know (although to be honest, I haven't used Scalding that much beyond doing this one and then using it for some data cleanup work later, so my not knowing doesn't mean anything) these features don't exist natively, but you can probably build some of them, such as Debug and Sample, in your application.

Joe Bennett said...

Hi, I am new to scalding and the hadoop environment. I am working on an inverted index and using MultipleTextLineFiles source to process many input files. Do you happen to know how I would get the filename and associate them with my inverted index? I have looked online but have not had any luck...

Sujit Pal said...

Hi Joe, I wasn't familiar with MultipleTextLineFiles, but some googling turned up this page; it looks like it's not supported OOB - you have to build a custom subclass that is given this information (so it can return it).