Sunday, June 23, 2024

Book Report: Pandas Workout

Unlike many Data Scientists, I didn't automatically reach for Pandas when I needed to analyze data. I came to this discipline (Data Science) as a Java Software Engineer who used Python for scripting, so I was quite comfortable operating on JSON / CSV / text files directly, loading data into relational databases and running SQL against them, and building visualizations with Matplotlib. So when Pandas first hit the scene, I thought it was a nice library, but I just didn't see the logic in spending time to learn another interface for things I could already do. Of course, Pandas has matured since then (and so have I, hopefully), and when faced with a data analysis / preparation / cleanup task, I now often reach not only for Pandas but, depending on the task, also for its various incarnations such as PySpark, Dask DataFrames and RAPIDS cuDF. When I use Pandas (or one of these incarnations), I often find myself depending heavily on Stack Overflow (and lately GitHub Copilot) for things I know can be done but don't know how to do. To some extent I blame this on never having spent the time to understand Pandas in depth. So when I was offered the chance to review Pandas Workout by Reuven Lerner, I welcomed it as a way to close this gap in my knowledge.

The book is about Pandas fundamentals rather than solving specific problems with Pandas. For those, you will still want to consult Stack Overflow :-). In fact, in the foreword the author specifically targets my demographic (people who need to consult Stack Overflow when solving problems with Pandas). But he promises that after reading the book you will understand why some solutions are better than others.

Pandas started as an open source project by Wes McKinney, and has grown somewhat organically into the top Data Science toolkit that it is today. As a result, there are often multiple ways to do something in Pandas. While all these ways may produce identical results, their performance characteristics may differ, so there is usually an implicit "right" way. The book gives you the mental model to decide which of the different approaches is the "right" one.
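To make the "multiple ways, different performance" point concrete, here is a minimal sketch of my own (it is not taken from the book). The DataFrame and its column names are made up for illustration. All three approaches compute the same per-row total, but the vectorized version is generally the one to prefer on large data, because it avoids running Python code once per row.

```python
import pandas as pd

# Hypothetical data, invented purely for this illustration.
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Approach 1: explicit Python loop over rows (usually the slowest).
totals = []
for _, row in df.iterrows():
    totals.append(row["price"] * row["qty"])
loop_result = pd.Series(totals, index=df.index)

# Approach 2: apply a function to each row (still runs Python per row).
apply_result = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# Approach 3: vectorized column arithmetic (typically much faster).
vectorized_result = df["price"] * df["qty"]

# All three approaches agree on the values.
print(loop_result.tolist() == apply_result.tolist() == vectorized_result.tolist())
```

On three rows the difference is invisible, so to see the gap you would need to time the three variants on a much larger DataFrame.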

The book is organized into the following chapters. Each chapter covers a particular aspect of Pandas usage. I have included a super-short TL;DR-style abstract for each chapter for your convenience.

  1. Series -- Pandas Series objects are the basic building block of Pandas. A Series represents a typed sequence of data and is used to construct DataFrames and Indexes. Many methods on the Series object apply in a similar way to DataFrames as well. This is a foundational chapter, and understanding it will help with later chapters.
  2. Data Frames -- DataFrames represent tabular data as a sequence of Series, where each Series object represents a column in the table. Pandas inherits the idea of DataFrames from R, and the incarnations I listed (and a few that I didn't) use the DataFrame as a basic abstraction as well. This chapter teaches you how to select from and manipulate DataFrames. Unless you've used Pandas extensively before, there is a high chance you will learn some useful new tricks here (I did, several of them).
  3. Import and Export -- covers reading and writing CSV and JSON formats to and from DataFrames. Covers some simple sanity checks you can run to verify that the import or export worked correctly. I learned about the pd.read_html function here, probably not that useful, but interesting to know!
  4. Indexes -- Indexes are used by Pandas to efficiently find data in DataFrames. While it may be possible to get by without Indexes, your Pandas code would take longer to run and consume more resources. The chapter deals with indexing techniques. I happened to know a lot of them, but there were a few that I didn't, especially the techniques around pivot tables.
  5. Cleaning -- this chapter teaches a skill that is fundamental to (and maybe even the bane of) a Data Scientist's job. An often-quoted figure is that we spend 80% of our time cleaning data. Along with the techniques themselves (remove / interpolate / ignore), this chapter contains commentary that will help you frame these decisions in your own data cleaning tasks.
  6. Grouping, Joining and Sorting -- these three operations are so central to data analysis that SQL has dedicated keywords for each of them (GROUP BY, JOIN and ORDER BY). This chapter covers various recipes to do these operations efficiently and correctly in Pandas.
  7. Advanced Grouping, Joining and Sorting -- this chapter goes into greater detail on how to combine these operations to deal with specific use-cases, the so-called "split-apply-combine" technique, including the concept of a general aggregation function agg. It also shows how to do method chaining using assign.
  8. Midway Project -- describes a project and asks questions that you should be able to answer from the data using the techniques you have learned so far. Comes with solutions.
  9. Strings -- one reason I don't have much experience with Pandas is that it is focused on numeric tables for the most part. However, Pandas also has impressive string handling facilities via the str accessor. This chapter was something of an eye-opener for me, showing me how to use Pandas for text analysis and pre-processing.
  10. Dates -- this chapter describes Pandas date and time handling capabilities. This can be useful when trying to work with time series or when trying to derive numerical features from columns containing datetime objects to combine with other numeric or text data.
  11. Visualizations -- this chapter describes visualization functionality you can invoke from within Pandas, which is powered by either Matplotlib or Seaborn. This is more convenient than exporting the data to Numpy and using the two packages to draw the charts.
  12. Performance -- performance has been a focus for most of the preceding chapters in this book. However, the recipes in this chapter are in the advanced tricks category, and include converting strings to categorical values, optimizing reads and writes using Apache Arrow-backed formats, and using fast special-purpose functions for specific tasks.
  13. Final Project -- describes a project similar to the Midway project with questions that you should be able to answer from the data using the techniques you have learned so far.
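To give a flavor of a few of the techniques named in the chapter list (the str accessor, method chaining with assign, and split-apply-combine with agg), here is a small sketch of my own. It is not an exercise from the book, and the taxi-ride data, column names and output column names are all hypothetical.

```python
import pandas as pd

# Hypothetical taxi-ride data, invented for this sketch.
rides = pd.DataFrame({
    "city": ["  new york", "Boston ", "new york", "boston"],
    "distance_km": [2.0, 5.0, 4.0, 3.0],
    "fare": [10.0, 20.0, 16.0, 15.0],
})

summary = (
    rides
    # assign adds or replaces columns without breaking the method chain.
    .assign(
        # The str accessor cleans up inconsistent city names.
        city=lambda d: d["city"].str.strip().str.title(),
        fare_per_km=lambda d: d["fare"] / d["distance_km"],
    )
    # Split: one group per cleaned city name.
    .groupby("city")
    # Apply and combine: named aggregations, one output column each.
    .agg(
        rides=("fare", "count"),
        mean_fare=("fare", "mean"),
        max_fare_per_km=("fare_per_km", "max"),
    )
    .sort_values("mean_fare", ascending=False)
)

print(summary)
```

Writing the steps as one chain keeps each transformation visible in order, which is close in spirit to how one would read the equivalent SQL query.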

I think the book has value beyond just teaching Pandas fundamentals though. The author sprinkles insights about Data Analysis and Data Science throughout the book, around learning to structure the problem and planning the sequence of steps that are best suited for the tools at hand, the importance of critical thinking, the importance of knowing the data and interpreting the results of the analysis, etc.

Each exercise (there are 50 in all) involves downloading a dataset, on subjects as diverse as tourism, taxi rides, SAT scores, parking tickets, the Olympic Games, oil prices, etc. I think knowing that such datasets (and possibly related ones) are available can also be very valuable to Data Scientists for their future projects.

I think Pandas is popular for the same reason that Jupyter Notebooks are. It is a nice, self-contained platform that allows a Data Scientist to demonstrate a series of data transformations from problem to solution in a clear, concise and standard manner, not only to customers, but to other Data Scientists as well. More than any other reason, I feel that this will continue to drive the popularity of Pandas and its various incarnations, and as a Data Scientist, it makes sense to learn how to use it properly. And the book definitely fulfils its promise of teaching you how to do that.