A few weeks ago, I wrote about Pig, a DSL that lets you specify a data processing flow in terms of Pig Latin operations and compiles it into a sequence of Map-Reduce jobs on the backend. Cascading is similar to Pig, except that it provides a (functional) Java API for specifying a data processing flow. One obvious advantage is that everything can now be in a single language (no more having to worry about UDF integration issues), but there are others as well, as detailed here and here.
Cascading is well documented, and there is also a very entertaining series of articles titled Cascading for the Impatient that builds up a Cascading application to calculate TF-IDF of terms in a (small) corpus. The objective is to showcase the features one would need to get up and running quickly with Cascading.
Scalding is a Scala DSL built on top of Cascading. As you would expect, Cascading code is an order of magnitude shorter than equivalent Map-Reduce code. But because Java is not a functional language, implementing functional constructs leads to some verbosity in Cascading that is eliminated in Scalding, leading to even shorter and more readable code.
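To make the comparison concrete, here is a word count sketched with plain Scala collections — not Scalding's pipe API, just an illustration of the functional style (flatMap, groupBy) that Scalding exposes over Hadoop pipes; the object and method names here are mine. The Cascading equivalent would need a pipe assembly with anonymous operation classes for each step.

```scala
// Word count over an in-memory "corpus" using plain Scala collections.
// Scalding lets you write essentially this same chain against Hadoop data.
object WordCountSketch {
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.toLowerCase.split("\\s+")) // tokenize each line
      .groupBy(identity)                    // group occurrences by word
      .map { case (word, occs) => (word, occs.size) }

  def main(args: Array[String]): Unit =
    println(wordCount(List("hello world", "goodbye cruel world")))
}
```

The entire flow is three chained higher-order functions; this is the verbosity gap the paragraph above is describing.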
I was looking for something to try my newly acquired Scala skills on, so I hit upon the idea of building up a similar application to calculate TF-IDF for terms in a corpus. The table below summarizes the progression of the Cascading for the Impatient series. I've provided links to the original articles for the theory (which is very nicely explained there) and links to the source code for both the Cascading and Scalding versions.
| Article | Description | Cascading | Scalding |
|---------|-------------|-----------|----------|
| Part 1 | Distributed File Copy | 1 | 0 |
| Part 2 | Word Count | 1 | 1 |
| Part 3 | Word Count with Scrub | 1 | 1 |
| Part 4 | Word Count with Scrub and Stop Words | 1 | 1 |

 - Article links point to the "Cascading for the Impatient" articles.
 - Cascading version links point to Paco Nathan's GitHub repo; Scalding version links point to mine.
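For reference, the quantity the finished application computes can be sketched in plain Scala as well — a local, in-memory version of the TF-IDF calculation, not Scalding code (the theory is covered in the articles linked above; the names here are mine):

```scala
import scala.math.log

object TfIdfSketch {
  // docs maps a document id to its token list.
  // Returns tf-idf scores keyed by (docId, term), using
  // tf = term count / doc length and idf = ln(N / df).
  def tfIdf(docs: Map[String, Seq[String]]): Map[(String, String), Double] = {
    val n = docs.size.toDouble
    // document frequency: in how many docs each term appears
    val df: Map[String, Int] = docs.values
      .flatMap(_.distinct)
      .groupBy(identity)
      .map { case (term, ts) => (term, ts.size) }
    for {
      (docId, tokens) <- docs
      (term, occs)    <- tokens.groupBy(identity)
      tf = occs.size.toDouble / tokens.size
    } yield ((docId, term), tf * log(n / df(term)))
  }
}
```

A term that appears in every document gets an idf of ln(1) = 0, so its tf-idf score is zero regardless of how often it occurs — which is exactly why stop-word removal (Part 4) matters less once you weight by idf.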
The code for the Scalding version is fairly easy to read if you know Scala (somewhat harder, but still possible, if you don't). The first thing to note is the relative size: the Scalding code is shorter and more succinct than the Cascading version. The second is that the Scalding code uses method calls that are not Cascading methods; you can read about these in the Scalding API Reference (I used the Fields-based reference exclusively). The tutorial and example code in the Scalding distribution are also helpful.
As you can see, I created my own Scala project and used Scalding as a dependency. I describe the steps here so you can do the same if you are so inclined.
Assuming you are going to be using Scalding in your applications, you need to download and build the Scalding JAR, then publish it to your local (or corporate) code repository (sbt uses Ivy2). To do this, run the following sequence of commands:

```shell
sujit@cyclone:scalding$ git clone https://github.com/twitter/scalding.git
sujit@cyclone:scalding$ cd scalding
sujit@cyclone:scalding$ sbt assembly       # build scalding jar
sujit@cyclone:scalding$ sbt publish-local  # add to local ivy2 repo
```
Scalding also comes with a Ruby script, scald.rb, that you use to run Scalding jobs. It is quite convenient: it forces all arguments to be named (resulting in cleaner, more explicit argument-handling code) and lets you switch from local development to Hadoop mode with a single switch. It lives in the scripts subdirectory of your Scalding download. To use it outside Scalding (i.e., in your own project), you will need to soft-link it into a directory on your PATH; copying it does not work because it depends on other parts of the Scalding download.
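As an illustration of the mode switch (the job file and paths here are hypothetical), running a job locally versus on Hadoop differs only in the flag; the named job arguments stay the same:

```shell
# local mode: run against small test files on the local filesystem
scald.rb --local src/main/scala/WordCountJob.scala \
  --input data/input.txt --output data/output.txt

# hadoop mode: same job, same named arguments, different switch
scald.rb --hdfs src/main/scala/WordCountJob.scala \
  --input input/path --output output/path
```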
The next step is to generate and setup your Scalding application project. Follow these steps:
- Generate your project using giter8 - type g8 typesafehub/scala-sbt at the command line and answer the prompts. Your project is created as a directory named after the Scala Project Name you entered.
- Move to the project directory - type cd scalding-impatient (in my case).
- Build a basic build.sbt - create a file build.sbt in the project base directory and populate it with the key value pairs for name, version and scalaVersion (blank lines between pairs are mandatory).
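A minimal build.sbt along these lines (the name is whatever you chose at project generation; the Scala version matches the scalding_2.9.2 artifact used below; note the mandatory blank lines between settings):

```scala
name := "scalding-impatient"

version := "1.0"

scalaVersion := "2.9.2"
```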
- Copy over scalding libraryDependencies - copy the libraryDependencies lines from Scalding's build.sbt and drop them into your project's build.sbt. I am not sure whether this is really necessary, or whether Scalding declares its transitive dependencies so that they would be picked up by the single dependency declaration on Scalding (see below). You may want to try omitting this step; if you succeed, please let me know and I will update accordingly.
- Add the scalding libraryDependency - declare a dependency on the Scalding JAR you just built and published. The line to add is libraryDependencies += "com.twitter" % "scalding_2.9.2" % "0.7.3".
- Rebuild your Eclipse files - Check my previous post for details about SBT-Eclipse setup. If you are all set up, then type sbt eclipse to generate these. Your project is now ready for development using Eclipse.
And that's pretty much it. I hope you have as much fun coding in Scala/Scalding as I did.