Sunday, February 01, 2026

Book Review: Software Engineering for Data Scientists

As a Software Engineer (backend Web Development then Search) turned Data Scientist, I was particularly interested in what the book Software Engineering for Data Scientists by Andrew Treadway had to say about the reverse transition. Transitioning between sub-disciplines is a given in our industry -- I started life as a sales/support engineer, then moved to application programming, then back and forth between architect, programmer, part-time sysadmin and full-time DBA, before getting into backend Web Development with Java. Nevertheless, the shift from that into Data Science (DS) has been the most challenging for me. Having lived through the time when Data Scientist first became a job title, and through the genesis and evolution of Deep Learning, Transformers, Large Language Models and Agents, I have found the field to be a moving target, growing and changing at breakneck speed.

Most applications today incorporate a healthy dose of Data Science based components. As a result, Data Scientists are increasingly being integrated into these teams and are expected to work collaboratively within team frameworks. The book addresses this new requirement in four parts -- the first covers information that Data Scientists transitioning into such teams need to get going, the second covers scaling to larger datasets and compute clusters, the third covers issues around production deployments, and the fourth covers monitoring. Here is my somewhat detailed, chapter-by-chapter review of the book.

Part I: Getting Started

Chapter 1: Software Engineering Principles -- Josh Wills, an early DS practitioner and evangelist, famously defined a Data Scientist as someone better at Statistics than a Software Engineer and better at Software Engineering than a Statistician. I think, as the field has matured over time and tooling has improved to incorporate the necessary statistics, the bar around Software Engineering (SE) has gone even higher. This chapter describes a typical DS workflow -- EDA / Data Validation, Data Cleaning, Feature Engineering, Model Training, Evaluation, Deployment and Monitoring -- and how Data Scientists with good SE skills produce better code structure, better collaboration, more efficient scaling and testing, and easier deployments.

Chapter 2: Source Code Control for Data Scientists -- the author describes git, a distributed source code control system (and currently the de-facto standard), covers common git commands for typical DS / SE work, and introduces the reader to the feature branch workflow. I noticed that the author does not cover data version control systems such as dvc, but that could be because many companies nowadays prefer central data catalogs where the DS no longer needs to worry about versioning.

Chapter 3: Code Structure and Style -- nowadays it is possible to enforce a common coding style across an application using tools such as pylint and black. The chapter introduces these tools and talks about PEP-8, the Style Guide for Python Code. The author provides some additional general guidelines, such as modularizing code to avoid repetition (DRY), and goes into further details such as incorporating type safety into Python using mypy, exception handling, and creating documentation from inline comments using pdoc (there is also the less capable but built-in pydoc).
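To make this concrete, here is a small sketch (my own, not from the book) of the kind of type-annotated, documented function that black, pylint, mypy and pdoc can all work with -- the function and column names are hypothetical:

```python
from typing import Optional

import pandas as pd


def clean_prices(df: pd.DataFrame, max_price: Optional[float] = None) -> pd.DataFrame:
    """Drop rows with missing prices and optionally cap the price column.

    Args:
        df: input dataframe with a "price" column.
        max_price: if given, rows with price above this value are removed.

    Returns:
        A cleaned copy of the input dataframe.
    """
    cleaned = df.dropna(subset=["price"])
    if max_price is not None:
        cleaned = cleaned[cleaned["price"] <= max_price]
    return cleaned
```

Running mypy over such a file checks the annotations, black normalizes the formatting, and pdoc can build documentation from the docstring.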

Chapter 4: Object Oriented Programming for Data Scientists -- this chapter covers basic concepts of Object Oriented Programming (OOP) such as classes, methods, instances and constructors, and provides an example of using OOP in a Machine Learning (ML) pipeline built on scikit-learn, demonstrating how OOP can improve code modularity. Having come from a Java / Spring background, I would have liked to see some discussion of Dependency Injection (DI) with Python here, but I guess this may be something the DS is expected to pick up as they get more familiar with SE.
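As an illustration of the idea (my own sketch, not the book's example), a custom transformer class can plug directly into a scikit-learn Pipeline, which is where OOP pays off in DS code:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class LogScaler(BaseEstimator, TransformerMixin):
    """Example custom transformer: log-transform strictly positive features."""

    def fit(self, X, y=None):
        # nothing to learn here, but fit() must return self per the sklearn contract
        return self

    def transform(self, X):
        return np.log1p(X)


# the custom class composes with standard components in a Pipeline
pipe = Pipeline([("log", LogScaler()), ("clf", LogisticRegression())])
X = np.abs(np.random.randn(100, 3))
y = np.random.randint(0, 2, 100)
pipe.fit(X, y)
print(pipe.predict(X[:5]))
```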

Chapter 5: Creating Progress Bars and Timeouts in Python -- even though it may feel a bit strange to see these two items lumped into their own chapter, it makes sense when you realize that DS jobs are typically long-running batch jobs, so the ability to show progress and to stop long-running jobs whose performance has degraded are both quite important. The chapter shows how to use tqdm to show progress in your Python code, and how similar functionality is integrated into scikit-learn. It also covers how to respond to timeouts using the stopit package.
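The tqdm part is as simple as wrapping any iterable; a minimal sketch (the per-item work here is just a placeholder):

```python
import time

from tqdm import tqdm

items = range(100)
for item in tqdm(items, desc="processing"):
    time.sleep(0.01)  # stand-in for the real per-item work
```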

Part II: Scaling

Chapter 6: Making your Code Faster and More Efficient -- there is lots of good information here, some of which I knew and some that I didn't. It starts by introducing Big O notation, then shows how you can profile a block of code using the kernprof line profiler and the @profile decorator. It also describes several strategies for making your code faster, such as replacing loops with select on Pandas dataframes, parallelizing operations with Numpy, avoiding Pandas apply, and using list comprehensions and numpy.vectorize. It then introduces multi-processing using the built-in multiprocessing library in Python and the n_jobs parameter in Scikit-Learn, and touches on Multithreading and Asynchronous Programming as possible additional techniques to address slow code without going into details. It also introduces caching using the functools.lru_cache decorator. Finally, it describes some useful built-in Python data structures like sets and priority queues, and the Numpy array, which uses vectorized operations internally.
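Two of these ideas in a small sketch of my own (the names and data are made up): vectorized selection instead of row-wise loops or apply, and caching an expensive function with functools.lru_cache:

```python
from functools import lru_cache

import numpy as np
import pandas as pd

# vectorized alternative to looping over rows or using df.apply()
df = pd.DataFrame({"price": np.random.rand(1_000_000) * 100})
df["discounted"] = np.where(df["price"] > 50, df["price"] * 0.9, df["price"])


@lru_cache(maxsize=None)
def slow_lookup(key: str) -> int:
    # imagine an expensive computation or I/O call here;
    # repeated calls with the same key are served from the cache
    return sum(ord(c) for c in key)
```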

Chapter 7: Memory Management with Python -- another useful chapter for me. It covers the use of the Python memory profiler guppy and the @profile decorator. It also discusses memory management strategies for Pandas and Scikit-Learn (using model.partial_fit()), preferring Numpy arrays over Python lists, and Parquet as a more memory efficient alternative to CSV files.
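For instance, here is a minimal sketch of my own (with a hypothetical train.csv and label column) of chunked incremental training with partial_fit, so the full dataset never has to fit in memory at once:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()
classes = np.array([0, 1])

# train incrementally on chunks instead of loading the whole file
for chunk in pd.read_csv("train.csv", chunksize=10_000):  # "train.csv" is hypothetical
    X = chunk.drop(columns=["label"]).to_numpy()
    y = chunk["label"].to_numpy()
    clf.partial_fit(X, y, classes=classes)
```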

Chapter 8: Alternatives to Pandas -- this chapter covers Dask and PySpark, two popular "big-data" libraries that work with Dataframes, distribute the workload across a cluster of machines, and support datasets too large to fit into RAM. Both do lazy evaluation, unlike Pandas, which does eager execution. Examples are provided for both Dask and PySpark. The chapter also mentions the modin package, which lets you run Pandas operations on top of Dask or Ray (another big data platform). It also mentions Polars, a Pandas-like package written in Rust for speed.
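A tiny sketch of the lazy evaluation idea with Dask (the file glob and column names are hypothetical) -- nothing actually executes until compute() is called:

```python
import dask.dataframe as dd

# read_csv only builds a task graph; no data is loaded yet (lazy evaluation)
ddf = dd.read_csv("events-*.csv")  # hypothetical set of CSV files

# still lazy: this just describes the groupby/aggregation
counts = ddf.groupby("event_type")["event_id"].count()

# compute() triggers the actual work, potentially across a cluster
result = counts.compute()
print(result)
```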

Part III: Deploying to Production

Chapter 9: Putting your Code into Production -- this chapter talks about various strategies for making the results of your DS artifact (e.g. a trained model) available to consumers. The first strategy covered is the simple recurring batch job. An important consideration there is protecting user credentials, so approaches such as keyrings are discussed. A slightly more advanced approach is to create a REST API, with tools such as FastAPI and uvicorn highlighted in the examples. Another strategy discussed is to create a high-level CLI that lets users call your model without knowing too much about the internals.
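A minimal FastAPI sketch of my own (the placeholder scoring logic stands in for a real model.predict() call) shows how little code a REST endpoint needs:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class PredictRequest(BaseModel):
    features: list[float]


@app.post("/predict")
def predict(req: PredictRequest):
    # in a real service this would call the trained model on the request features
    score = sum(req.features) / max(len(req.features), 1)  # placeholder "model"
    return {"score": score}

# run with: uvicorn app:app --reload  (assuming this file is saved as app.py)
```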

Chapter 10: Testing in Python -- while unit testing is very important in the SE context, DS has traditionally not been very strict about it. But there is value in testing DS pipelines and config files as well, to ensure that all supported edge cases work correctly. Unit testing packages such as pytest and unittest are discussed, as well as the test coverage tool coverage.
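A short pytest sketch (testing the hypothetical clean_prices function from my earlier example) -- pytest discovers test_* functions automatically and you run them with the pytest command:

```python
# test_cleaning.py -- a hypothetical pytest module for a data cleaning function
import pandas as pd

from cleaning import clean_prices  # hypothetical module under test


def test_rows_with_missing_prices_are_dropped():
    df = pd.DataFrame({"price": [10.0, None, 25.0]})
    cleaned = clean_prices(df)
    assert len(cleaned) == 2


def test_prices_above_cap_are_removed():
    df = pd.DataFrame({"price": [10.0, 99.0]})
    cleaned = clean_prices(df, max_price=50.0)
    assert cleaned["price"].tolist() == [10.0]
```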

Chapter 11: Scheduling and Packaging your Code -- this chapter covers scheduling your DS pipeline on Windows and Unix, packaging code with build and twine so application code can call your code as a local library, and creating desktop executables with PyQt and pyinstaller. I found this chapter particularly informative, since previously I had been exposing my DS artifacts using APIs and Streamlit. It is always good to learn new ways to do things.

Chapter 12: Reporting and Logging in Python -- covers customizing logging formats so application logs can be parsed to produce useful insights about runs. Additional material includes generating PDF reports using reportlab and sending them automatically over email. I prefer markdown reports rendered in the user's browser, with notifications sent via email, but PDF looks interesting as well.
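A small sketch of the logging idea (the pipe-delimited format is my own choice, not necessarily the book's): a consistent format string makes run logs easy to parse downstream:

```python
import logging

# structured, parse-friendly log lines with a fixed set of fields
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s|%(levelname)s|%(name)s|%(message)s",
)
logger = logging.getLogger("pipeline")

logger.info("run_started|rows=%d", 125000)
logger.info("run_finished|status=%s", "ok")
```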

Part IV: Monitoring

Chapter 13: Introduction to Web Development for Data Science -- this is a generic chapter on web development using Flask, included because the author feels (rightly) that DS should be capable of building simple web applications, and it provides an example of building a web application that helps with ML model training. However, I feel that perhaps this chapter should have gone into an Appendix along with the Dask appendix, since monitoring is already covered in some depth in the previous Part.
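For reference, a minimal Flask sketch of my own (not the book's model training example):

```python
from flask import Flask, request

app = Flask(__name__)


@app.route("/train", methods=["POST"])
def train():
    # stand-in endpoint: in a real app this would kick off model training
    params = request.get_json(force=True)
    return {"status": "training started", "params": params}


if __name__ == "__main__":
    app.run(debug=True)
```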

Appendix: Dask with Coiled and AWS -- covers using Dask with the Coiled tool on the Amazon Web Services (AWS) cloud platform.

Overall, I thought the book provided good value. It is interesting how much of a head start I got from having been an SE first. However, in keeping with the grass-is-greener mindset, I feel that the move from DS to SE is probably less of a hurdle than the move in the other direction, but I will defer to those who have actually made that move.

Saturday, January 10, 2026

Book Review: Transformers In Action

The Attention Is All You Need paper proposed the Transformer Architecture as an improvement to the dominant encoder-decoder models of the time (both recurrent and convolutional). Those models used an attention mechanism to connect the encoder and decoder parts, but the Transformer Architecture flipped the script, putting the Attention Mechanism at the center. An early implementation of the Transformer Architecture was BERT, which used the Transformer as an encoder. Later models such as BART used encoder and decoder Transformer components in a sequence-to-sequence setup. Since then, there has been an explosion of variants around this basic model, accompanied by a steady breaking of benchmarks on tasks where older recurrent and convolutional sequence-to-sequence models had reigned supreme.

A second major breakthrough was the emergence of decoder-only Transformer models for text generation. Early models were less than encouraging, but as researchers trained ever larger models on ever larger datasets, their text generation capabilities improved to the point where they became viable candidates for use as pre-trained general purpose inference models. These models are also based on Transformers, but are generally differentiated by calling them Large Language Models (LLMs) or Foundation Models (FMs).

From a user's point of view, once you get past the somewhat larger computing requirements, the first category (BERT-like Transformer models) is actually easier to fine-tune for custom tasks than its predecessors, thanks to tooling available from libraries such as HuggingFace Transformers and SentenceTransformers. The second category (LLMs), at least initially, was the domain of compute and data rich organizations, who would create these models and make them available to others over an HTTP API as inference-only models, often for a fee. Because of the massive number of parameters and volume of training data, these models were generalized enough to do inference on diverse tasks in diverse domains without additional fine-tuning. Of course, because they were generative models, their outputs were not deterministic, prompting cautions such as the On the Dangers of Stochastic Parrots paper, and patterns to mitigate the problem like Retrieval Augmented Generation (RAG) and Chain of Thought (CoT) prompting. More recently, fine-tuning has become practical for this class of models with the advent of Parameter Efficient Fine Tuning (PEFT) techniques. And with the advent of multimodal LLMs and reasoning capabilities, they are now more than just Large Language Models.

Anyway, the point of this (probably incomplete) history lesson is that the Transformers in Action book by Nicole Koenigstein, which I am reviewing here, primarily covers Transformers in the second category, except for the first two chapters, which cover the basics of the Transformer architecture. If you are more interested in the first category, I would recommend Transformers for Natural Language Processing by Denis Rothman, which I have reviewed on Amazon previously.

Back to the review. The book is organized in three parts, with the first part consisting of Chapters 1 and 2, the second part consisting of Chapters 3-5 and the third part consisting of Chapters 6-10.

In Part 1, Chapter 1 describes the Transformer architecture at a high level, how it incorporates ideas from earlier neural models and how it differs from them. It covers the idea of in-context learning (zero-shot and few-shot), the distinguishing feature of Transformer based LLMs. Chapter 2 does a deep dive into the Transformer Architecture and its components, covering ideas such as the Stacked Encoder-Decoder, Add and Norm (LayerNorm) layers, the Query-Key-Value Attention Mechanism, and the position-wise Feed Forward Network (FFN).
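For readers who like to see the math as code, here is my own minimal sketch of single-head scaled dot-product attention (the learned projection matrices and multi-head machinery are omitted):

```python
import numpy as np


def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d_k) arrays; returns attention-weighted values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over keys
    return weights @ V


seq_len, d_model = 4, 8
x = np.random.randn(seq_len, d_model)
out = scaled_dot_product_attention(x, x, x)    # self-attention: Q = K = V = x
print(out.shape)  # (4, 8)
```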

In Part 2, Chapter 3 moves the discussion into decoder-only Transformers, i.e. Large Language Models, the central theme of this book. It describes variants of the Transformer Architecture, contrasting encoder-only models such as BERT with decoder-only Autoregressive models that predict the next token. It touches on Causal Attention and the KV Cache as necessary ingredients for this type of model, as well as the use of encoder-only models as Embedding models and how they relate to RAG. It also mentions Mixture of Experts (MoE) as a promising architectural variant of decoder-only models.
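Causal attention boils down to masking out future positions before the softmax; a tiny sketch of my own:

```python
import numpy as np

seq_len = 5
scores = np.random.randn(seq_len, seq_len)  # raw attention scores (query x key)

# causal mask: position i may only attend to positions j <= i
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores = np.where(mask, -np.inf, scores)

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))  # upper triangle (future positions) is all zeros
```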

Chapter 4 covers some basics about parameters that control the behavior of LLMs, such as top-k and top-p sampling and temperature, as well as prompting styles such as Zero-shot, Few-shot, CoT (Chain of Thought), Contrastive CoT where both right and wrong reasoning traces are provided, Chain of Verification (CoVe) where the model reflects on and verifies its output, Tree of Thought (ToT) which introduces intermediate steps in problem solving traces, and Thread of Thought (ThoT) which partitions the problem into sub-problems and combines the threads from the sub-solutions into the final generation.
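A toy sketch of my own showing how temperature, top-k and top-p interact when sampling the next token from a small made-up vocabulary:

```python
import numpy as np


def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    """Toy sampler combining temperature, top-k and top-p (nucleus) filtering."""
    logits = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()

    if top_k is not None:               # keep only the k most likely tokens
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs < cutoff, 0.0, probs)
    if top_p is not None:               # keep the smallest set with mass >= top_p
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cum, top_p) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = 1.0
        probs = probs * mask

    probs = probs / probs.sum()
    return np.random.choice(len(probs), p=probs)


print(sample_next_token([2.0, 1.0, 0.5, -1.0], temperature=0.7, top_k=3, top_p=0.9))
```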

Chapter 5 covers Preference Alignment and RAG. The first part, Preference Alignment, is aimed at fine-tuning the behavior of an LLM to a particular domain or behavior. It covers Reinforcement Learning from Human Feedback (RLHF) as a Markov Decision Process (MDP) and the use of Proximal Policy Optimization (PPO). It describes specializations of PPO such as DPO (Direct Preference Optimization), which does not need an explicit reward model, and GRPO (Group Relative Policy Optimization), which removes the need for an explicit value function. Both DPO and GRPO are preceded by SFT (Supervised Fine Tuning) to align a transformer to a domain. The second part covers RAG, which is more familiar to most people using LLMs -- the discussion formalizes the structure of a RAG pipeline (retriever, generator and refinement layer) and describes some popular RAG variants, i.e. Agentic RAG, Corrective RAG, Self RAG and Fusion RAG.
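As I understand it, the DPO objective reduces to a simple per-example loss over the log-probabilities of the chosen and rejected responses under the policy and reference models; a small numeric sketch of my own (the log-probabilities are made up):

```python
import numpy as np


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss from log-probs of chosen vs. rejected responses."""
    # implicit reward margin: how much more the policy prefers the chosen response
    # over the rejected one, relative to the reference model
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    return -np.log(1.0 / (1.0 + np.exp(-beta * margin)))  # -log(sigmoid(beta * margin))


print(dpo_loss(-12.0, -15.0, -13.0, -14.0))
```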

In Part 3, Chapter 6 discusses Multimodal models, how they are different from text-only LLMs, and how they work by projecting text and non-text input into a shared embedding space. It differentiates between Converter-based alignment, where all modalities are projected onto the same space, and Perception-based alignment, where modality-specific encoders produce each embedding and the LLM uses an Attention mechanism to combine them.

Chapter 7 discusses Small Language Models (SLMs), which are decoder-only transformer models with 8-13 billion parameters. These are larger than encoder-only or encoder-decoder style models, but smaller than other decoder-only models. Such models are usually better at general purpose inference than (smaller) encoder-only or encoder-decoder models, but not as good as their full-size counterparts. SLMs focus on specialization and efficiency, and can be deployed on edge devices or as specialized components co-existing with LLMs in RAG pipelines. They can be used to generate data to train their larger counterparts using Weak to Strong Learning and Approximate Gradient Proxies, and can function as Auxiliary Reward Models for RLHF. They can be deployed as specialized tools or agents in Agentic Workflows, e.g. classifiers for sentiment analysis, compliance checks, intent detection, guard models and coding models. They also work well in privacy-conscious domains, where you don't want your requests going out to a third party model provider. Finally, they are more practical to fine-tune for your specific use case than a full-sized LLM.

Chapter 8 discusses training and evaluating Large Language Models, and suggests the use of Ray Tune for Hyperparameter Tuning and the Weights and Biases (W&B) platform for logging and tracking GPU utilization. It details various PEFT techniques such as LoRA (Low Rank Adaptation), DoRA (Weight Decomposed LoRA), Quantization, QLoRA (Quantized LoRA), QA-LoRA (Quantization Aware LoRA), and LQ-LoRA (Low Rank plus Quantized Matrix Decomposition LoRA). Unfortunately, the author has not included many examples of this in the book, possibly because it was perceived as out of scope for the book's average reader. I had also expected some coverage of evaluation techniques, which I did not find -- evaluation is a real problem for teams building RAG or other inference-only pipelines, and it is complex because outputs are non-deterministic. Perhaps this is an oversight that can be addressed in a future edition of the book.
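The LoRA idea itself fits in a few lines; here is a sketch of my own of the forward pass, where only the small A and B matrices are trained while the pretrained weight W stays frozen:

```python
import numpy as np

d_in, d_out, r, alpha = 512, 512, 8, 16

W = np.random.randn(d_out, d_in) * 0.02   # frozen pretrained weight
A = np.random.randn(r, d_in) * 0.01       # trainable low-rank factor (r x d_in)
B = np.zeros((d_out, r))                  # trainable, zero-initialized so the
                                          # adapter starts out as a no-op
x = np.random.randn(d_in)

# LoRA forward pass: frozen path plus a low-rank update scaled by alpha / r
y = W @ x + (alpha / r) * (B @ (A @ x))
print(y.shape)  # (512,)
```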

Chapter 9 covers deployment issues associated with LLMs, namely around optimization and scaling. Model Optimization techniques such as pruning (removing neurons or edges) and distillation (from larger models to more efficient smaller models for specific tasks), and Memory Optimization techniques such as various types of sharding (tensor, pipeline, optimizer and hybrid), are described. The chapter also describes Inference Optimization techniques such as KV Caching, Paged Attention, vLLM and Operator Fusion, GPU Optimizations such as Tiling and Flash Attention, and extensions to support long contexts such as Rotary Position Embeddings (RoPE), iRoPE which alternates between RoPE and NoPE (No Positional Embeddings), block sparse and linear attention, and sliding window and chunked attention.
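KV Caching is the one I find easiest to picture in code; a toy sketch of my own of one autoregressive decode step that reuses cached keys and values instead of recomputing them for the whole prefix:

```python
import numpy as np

d_k, d_v = 8, 8
k_cache, v_cache = [], []   # keys / values of all previously generated tokens


def decode_step(q_t, k_t, v_t):
    """One decode step: append this token's K/V and attend over the cache."""
    k_cache.append(k_t)
    v_cache.append(v_t)
    K = np.stack(k_cache)                 # (t, d_k) -- reused, never recomputed
    V = np.stack(v_cache)                 # (t, d_v)
    scores = K @ q_t / np.sqrt(d_k)       # (t,)
    weights = np.exp(scores - scores.max())
    weights = weights / weights.sum()
    return weights @ V                    # context vector for the new token


for _ in range(5):                        # generate 5 tokens
    out = decode_step(np.random.randn(d_k), np.random.randn(d_k), np.random.randn(d_v))
print(out.shape)  # (8,)
```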

Finally, Chapter 10 covers techniques to create Responsible and Ethical LLM based applications. It outlines some possible reasons for LLM bias based on the geographical distribution of training data, approaches to flag and filter hateful or toxic generations using pre-trained BERT class models such as RoBERTa-Toxicity and HateBERT, the use of custom logging on W&B for interpretability analysis, the use of perturbation models in the Captum tool to determine feature attribution, and explanations using Local Interpretable Model-Agnostic Explanations (LIME). It also describes some rule-based techniques to ensure Responsible behavior of LLMs, such as adding disclaimer text, penalizing tokens if they match a blacklist, and using rule and LLM driven input and output guards like the ones provided by llm-guard. It also suggests using the safe/unsafe classifiers in Purple Llama to address lifecycle vulnerabilities and prevent jailbreaks.
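The blacklist penalty technique is simple enough to sketch (the toy vocabulary and penalty value are my own):

```python
import numpy as np

vocab = ["the", "cat", "sat", "badword1", "badword2"]   # hypothetical toy vocabulary
blacklist = {"badword1", "badword2"}

logits = np.random.randn(len(vocab))

# rule-based guard: push blacklisted tokens' logits down so they are
# (almost) never sampled; a large negative penalty effectively removes them
penalty = -1e9
for i, token in enumerate(vocab):
    if token in blacklist:
        logits[i] += penalty

probs = np.exp(logits - logits.max())
probs = probs / probs.sum()
print(dict(zip(vocab, np.round(probs, 3))))
```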

As with the earlier Transformers book, what I found most useful about this book was the coverage. While I feel fortunate to have actually lived through these transformative times rather than read about them in a book, the pace of breakthroughs in the state of the art is hard to keep up with unless you are actively doing the research yourself (and maybe not even then). As a result, you end up knowing about a few things that you have used or considered using or found interesting, but are woefully ignorant about a lot of the other things in the field. Books like this not only fill in your knowledge gaps, they also give you new ideas based on things that you just learned.

In addition, this book describes many useful techniques to improve your LLM pipelines. Many of us, me included, have built traditional and neural (pre-transformer and transformer based) ML pipelines, and have been building RAG pipelines over the past couple of years. But we may not be familiar with all the latest prompting techniques, or we may not have fine-tuned a SLM because of the compute requirements. Books like these show us how to do it, and thereby make us more productive and more effective users of LLMs.