Sunday, February 01, 2026

Book Review: Software Engineering for Data Scientists

As a Software Engineer (backend Web Development, then Search) turned Data Scientist, I was particularly interested in what the book Software Engineering for Data Scientists by Andrew Treadway had to say about the reverse transition. Transitioning between sub-disciplines is a given in our industry -- I started life as a sales/support engineer, then moved to application programming, then back and forth between architect, programmer, part-time sysadmin and full-time DBA, before getting into backend Web Development with Java. Nevertheless, the shift from that into Data Science (DS) has been the most challenging for me. Having lived through the time when Data Scientist first became a job title, and through the genesis and evolution of Deep Learning, Transformers, Large Language Models, and Agents, I find that the field continues to be a moving target, growing and changing at breakneck speed.

Most applications today incorporate a healthy dose of Data Science based components. As a result, Data Scientists are increasingly being integrated into application development teams and are expected to work collaboratively within those teams' frameworks. The book addresses this new requirement in four parts -- the first part covers information that Data Scientists transitioning into such teams need in order to get going, the second covers scaling to larger datasets and compute clusters, the third covers issues around production deployments, and the fourth covers monitoring. Here is my somewhat detailed, chapter by chapter review of the book.

Part I: Getting Started

Chapter 1: Software Engineering Principles -- Josh Wills, an early DS practitioner and evangelist, famously defined a Data Scientist as someone better at Statistics than a Software Engineer and better at Software Engineering than a Statistician. I think, as the field has matured over time and tooling has improved to incorporate the necessary statistics, the bar around Software Engineering (SE) has gone even higher. This chapter describes a typical DS workflow -- EDA / Data Validation, Data Cleaning, Feature Engineering, Model Training, Evaluation, Deployment and Monitoring -- and how Data Scientists with good SE skills produce better code structure, better code collaboration, more efficient scaling and testing, and easier deployments.

Chapter 2: Source Code Control for Data Scientists -- the author describes git, a distributed source code control system (and currently the de-facto standard), covers common git commands for typical DS / SE work, and introduces the reader to the feature branch workflow. I noticed that the author did not cover data version control systems such as dvc, but that could be because many companies nowadays prefer central data catalogs, where the DS no longer needs to worry about versioning.

Chapter 3: Code Structure and Style -- nowadays it is possible to enforce a common coding style across an application using tools such as pylint and black. The chapter introduces these tools and talks about PEP-8, the Style Guide for Python Code. The author provides some additional general guidelines, such as modularizing code to avoid repetition (DRY). It also goes into further details such as incorporating type safety into Python using mypy, exception handling, and creating documentation from docstrings using pdoc (there is also the less capable but built-in pydoc).
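
To make this concrete, here is a small function of my own (not from the book) written the way the chapter recommends: PEP-8 naming, type hints that mypy can check, and a docstring that pdoc or pydoc could render into documentation.

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Compute the mean absolute error between two equal-length sequences.

    Args:
        y_true: observed target values.
        y_pred: predicted target values.

    Returns:
        The average of the absolute differences.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
```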

Chapter 4: Object Oriented Programming for Data Scientists -- this chapter covers basic concepts of Object Oriented Programming (OOP) such as classes, methods, instances, constructors, etc., and provides an example of using OOP in a Machine Learning (ML) pipeline built with scikit-learn, demonstrating how OOP can improve code modularity. Having come from a Java / Spring background, I would have liked to see some discussion of Dependency Injection (DI) with Python here, but I guess this may be something the DS is expected to pick up as they get more familiar with SE.
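
The book's example is more elaborate, but the general idea of wrapping pipeline steps behind a class looks something like this sketch of mine (the class, its methods, and the hyperparameter are my own invention, not the book's):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class ChurnModel:
    """Wraps preprocessing and model fitting behind a small, testable interface."""

    def __init__(self, C: float = 1.0):
        self.pipeline = Pipeline([
            ("scale", StandardScaler()),       # feature engineering step
            ("clf", LogisticRegression(C=C)),  # model step
        ])

    def fit(self, X, y):
        self.pipeline.fit(X, y)
        return self

    def predict(self, X):
        return self.pipeline.predict(X)
```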

Chapter 5: Creating Progress Bars and Timeouts in Python -- even though it may feel a bit strange to see these two items lumped into their own chapter, it makes sense when you realize that DS jobs are typically long-running batch jobs, so the ability to show progress and to stop jobs whose performance has degraded are both quite important. The chapter shows how to use tqdm to show progress in your Python code, and how similar functionality is integrated into scikit-learn. It also covers how to handle timeouts using the stopit package.
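
For the curious, wrapping an iterable in tqdm is all it takes to get a progress bar; the example below is mine, with a sleep standing in for real per-batch work.

```python
import time

from tqdm import tqdm

# tqdm wraps any iterable and prints a live progress bar as the loop advances
for batch_id in tqdm(range(100), desc="scoring batches"):
    time.sleep(0.01)  # stand-in for the real work done on each batch
```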

Part II: Scaling

Chapter 6: Making your Code Faster and More Efficient -- there is a lot of good information here, some of which I knew and some I didn't. It starts by introducing Big O notation, then shows how to profile a block of code using the kernprof line profiler and its @profile decorator. It also describes several strategies for making your code faster, such as replacing loops with select on Pandas dataframes, parallelizing with Numpy, avoiding Pandas apply, and using list comprehensions and numpy.vectorize. It then introduces multi-processing using Python's built-in multiprocessing library and the n_jobs parameter in Scikit-Learn. It touches on multithreading and asynchronous programming as possible additional techniques for addressing slow code, but does not go into details. It introduces caching with the @functools.lru_cache decorator. Finally, it describes some useful built-in Python data structures, such as set and the priority queue, as well as the Numpy array, which uses vectorized operations internally.
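
To make the vectorization advice concrete, here is a small sketch of my own (column names made up) contrasting a row-wise apply with a vectorized column operation on the same DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": np.random.rand(100_000) * 100,
    "qty": np.random.randint(1, 10, size=100_000),
})

# Slow: apply calls a Python function once per row
df["revenue_slow"] = df.apply(lambda row: row["price"] * row["qty"], axis=1)

# Fast: a single vectorized operation over whole columns
df["revenue_fast"] = df["price"] * df["qty"]
```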

Chapter 7: Memory Management with Python -- another useful chapter for me. It covers the use of the guppy memory profiler and the @profile decorator, discusses memory management strategies for Pandas and Scikit-Learn (using model.partial_fit()), recommends preferring Numpy arrays over Python lists, and presents Parquet as a more memory-efficient alternative to CSV files.
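
As an illustration of the partial_fit() idea, here is a sketch of mine that streams a CSV in chunks so the full dataset never has to sit in RAM (the file name and column names are hypothetical):

```python
import pandas as pd
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
classes = [0, 1]  # partial_fit needs the full set of labels up front

# Read the file in chunks and update the model incrementally
for chunk in pd.read_csv("transactions.csv", chunksize=100_000):
    X = chunk[["amount", "age"]].to_numpy()
    y = chunk["label"].to_numpy()
    model.partial_fit(X, y, classes=classes)
```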

Chapter 8: Alternatives to Pandas -- this chapter covers Dask and PySpark, two popular "big data" libraries that work with dataframes, distribute the workload across a cluster of machines, and support datasets too large to fit into RAM. Both use lazy evaluation, unlike Pandas, which executes eagerly. Examples are provided for both Dask and PySpark. The chapter also mentions the modin package, a drop-in replacement for the Pandas API that delegates execution to Dask or Ray (another big data platform), and Polars, a Pandas-like package written in Rust for speed.
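
A minimal sketch of what lazy evaluation looks like with Dask dataframes (file pattern and column names are my own invention):

```python
import dask.dataframe as dd

# Nothing is read yet; Dask only builds a task graph at this point
df = dd.read_csv("events-*.csv")
daily_counts = df.groupby("event_date")["user_id"].count()

# .compute() triggers the actual (possibly distributed) execution
result = daily_counts.compute()
print(result.head())
```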

Part III: Deploying to Production

Chapter 9: Putting your Code into Production -- this chapter talks about various strategies for making the results of your DS artifact (e.g., a trained model) available to consumers. The first strategy covered is the simple recurring batch job; an important consideration here is protecting user credentials, so approaches such as keyrings are discussed. A slightly more advanced approach is to expose the model behind a REST API, with tools such as FastAPI and uvicorn highlighted in the examples. Another strategy discussed is to create a high-level CLI that lets users call your model without knowing too much about its internals.
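
The FastAPI route is remarkably lightweight; here is a toy sketch of mine (endpoint, fields, and the hard-coded score are all placeholders for a real model call):

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class Features(BaseModel):
    age: float
    income: float


@app.post("/predict")
def predict(features: Features):
    # a real service would call model.predict(...) here; hard-coded for illustration
    score = 0.5
    return {"score": score}

# Run with: uvicorn app:app --reload
```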

Chapter 10: Testing in Python -- while unit testing is very important in the SE context, DS has traditionally not been very strict about it. But there is value in testing DS pipelines and config files as well, to ensure that all supported edge cases work correctly. Unit testing packages such as pytest and unittest are discussed, as is the test coverage tool coverage.
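
A pytest-style test for a data-cleaning helper might look like the made-up example below; in a real project the helper would be imported from the pipeline rather than defined in the test file.

```python
# test_cleaning.py -- run with: pytest test_cleaning.py
import pandas as pd


def drop_missing_prices(df: pd.DataFrame) -> pd.DataFrame:
    """Toy cleaning helper under test (normally imported from the pipeline code)."""
    return df.dropna(subset=["price"])


def test_drop_missing_prices_removes_nan_rows():
    df = pd.DataFrame({"price": [10.0, None, 5.0]})
    cleaned = drop_missing_prices(df)
    assert cleaned["price"].isna().sum() == 0
    assert len(cleaned) == 2
```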

Chapter 11: Scheduling and Packaging your Code -- this chapter covers scheduling your DS pipeline on Windows and Unix, packaging code with build and twine so application code can call it as a local library, and creating desktop executables with PyQt and pyinstaller. I found this chapter particularly informative, since previously I had been exposing my DS artifacts through APIs and Streamlit. It is always good to learn new ways to do things.

Chapter 12: Reporting and Logging in Python -- this chapter covers customizing logging formats so application logs can be parsed to produce useful insights about runs. Additional material includes generating PDF reports using reportlab and sending them automatically over email. I prefer markdown reports rendered in the user's browser, with notifications sent via email, but PDF looks interesting as well.
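
Customizing the log format only takes a few lines with the standard logging module; the format string and logger name below are just my own example of a layout that is easy to parse downstream.

```python
import logging

# A delimited format that downstream scripts can split and aggregate easily
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
)
logger = logging.getLogger("pipeline")

logger.info("training started, n_rows=%d", 125_000)
```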

Part IV: Monitoring

Chapter 13: Introduction to Web Development for Data Science -- this is a generic chapter on web development using Flask, because the author feels (rightly) that a DS should be capable of building simple web applications, and it provides an example of building a web application that helps with ML model training. However, I feel that this chapter might have been better placed in an Appendix alongside the Dask appendix, since monitoring is already covered in some depth in the previous Part.
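
For readers who have not used Flask, a minimal app is only a few lines; this sketch of mine (route and payload are hypothetical, not the book's example) shows the general shape of such a web front-end for kicking off a training job.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


@app.route("/train", methods=["POST"])
def train():
    params = request.get_json()
    # a real app would launch the training job here
    return jsonify({"status": "started", "params": params})


if __name__ == "__main__":
    app.run(debug=True)
```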

Appendix: Dask with Coiled and AWS -- covers using Dask with the Coiled tool on the Amazon Web Services (AWS) cloud platform.

Overall, I thought the book provided good value. It is interesting how much of a head start my SE background gave me. However, in keeping with the "grass is greener" mindset, I suspect the move from DS to SE is less of a hurdle than the move in the other direction, though I will defer to those who have actually made that transition.
