Tuesday, December 31, 2024

Packaging ML Pipelines from Experiment to Deployment

As ML Engineers, we are generally tasked with solving some business problem with technology. Typically it involves leveraging data assets that your organization already owns or can acquire. Generally, unless it is a very simple problem, there will be more than one ML model involved, maybe different types of models depending on the sub-task, maybe other supporting tools such as a Search Index or Bloom Filter or third-party API. In such cases, these different models and tools would be organized into an ML Pipeline, where they cooperate to produce the desired solution.

My general (very high level, very hand-wavy) process is to first convince myself that my proposed solution will work, then convince my project owners / peers, and finally to deploy the pipeline as an API to convince the application team that the solution solves the business problem. Of course, generating the initial proposed solution is a task in itself, and may need to be composed of multiple sub-solutions, each of which needs to be tested individually as well. So very likely the initial "proposed solution" is a partial bare-bones pipeline to begin with, and improves through successive iterations of feedback from the project and application teams.

In the past, I have treated these phases as largely disjoint, and each phase is built (mostly) from scratch with a lot of copy-pasting of code from the previous phase. That is, I would start with notebooks (on Visual Studio Code of course) for the "convince myself" phase, copy-paste a lot of the functionality into a Streamlit application for the "convince project owners / peers" phase, and finally do another round of copy-pasting to build the backend for a FastAPI application for the "convince application team" phase. While this works in general, folding iterative improvements into each phase gets to be messy, time-consuming, and potentially error-prone.

Inspired by some of my fellow ML Engineers who are more steeped in Software Engineering best practices than I am, I decided to optimize the process by making it DRY (Don't Repeat Yourself). My modified process is as follows:

Convince Yourself -- continue using a combination of notebooks and short code snippets to test out sub-task functionality and compose sub-tasks into candidate pipelines. The focus is on exploration of different options, in terms of pre-trained third party models and supporting tools, fine-tuning candidate models, understanding the behavior of the individual components and the pipeline on small subsets of data, etc. There is no change here; the process can be as organized or chaotic as you like -- if it works for you, it works for you.

Convince Project Owners -- in this phase, your audience is a set of people who understand the domain very well, are generally interested in how you are solving it, and want to see how your solution behaves in weird edge cases (that they have seen in the past and that you may not have imagined). They could run your notebooks in a pinch, but they would prefer an application-like interface with lots of debug information to show them how your pipeline is doing what it is doing.

Here the first step is to extract and parameterize functionality from my notebook(s) into functions. Functions would represent individual steps in a multi-step pipeline, and should be able to return additional debug information when given a debug parameter. There should also be a function representing the entire pipeline, composed of calls to the individual steps. This is also the function that would deal with optional / new functionality across multiple iterations through feature flags. These functions should live in a central model.py file that would be called from all subsequent clients. Functions should have associated unit tests (unittest or pytest).
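A minimal sketch of what such a model.py might look like; the step names, feature flags and return structures are purely illustrative.

```python
# model.py -- hypothetical sketch; step names and flags are illustrative
from typing import Any

def retrieve_candidates(query: str, debug: bool = False) -> dict[str, Any]:
    """Step 1: look up candidate records for the query."""
    candidates = ["cand-1", "cand-2"]          # placeholder for real retrieval
    result = {"candidates": candidates}
    if debug:
        result["debug"] = {"num_candidates": len(candidates)}
    return result

def rerank(candidates: list[str], debug: bool = False) -> dict[str, Any]:
    """Step 2: re-order candidates with a (placeholder) scoring model."""
    ranked = sorted(candidates)                # placeholder for a real model
    result = {"ranked": ranked}
    if debug:
        result["debug"] = {"scores": [1.0] * len(ranked)}
    return result

def run_pipeline(query: str, debug: bool = False,
                 use_reranker: bool = True) -> dict[str, Any]:
    """The whole pipeline; feature flags gate optional / new functionality."""
    step1 = retrieve_candidates(query, debug=debug)
    output = {"answer": step1["candidates"]}
    if use_reranker:                           # feature flag
        step2 = rerank(step1["candidates"], debug=debug)
        output["answer"] = step2["ranked"]
    if debug:
        output["debug"] = {"retrieve": step1.get("debug"),
                           "rerank": step2.get("debug") if use_reranker else None}
    return output
```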

The Streamlit application should call the function representing the entire pipeline with debug turned on, and display the returned debug information. This ensures that as the pipeline evolves, no changes need to be made to the Streamlit client. Streamlit provides its own unit testing functionality in the form of the AppTest class, which can be used to run a few inputs through the app. The focus is more on ensuring that the app does not fail when run non-interactively, so the test can run on a schedule (perhaps via a GitHub Action).
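The Streamlit client and its AppTest smoke test then stay tiny, since all the logic lives in model.py. A hypothetical sketch:

```python
# app.py -- hypothetical Streamlit client; calls only the top-level pipeline
import streamlit as st
from model import run_pipeline

query = st.text_input("Query")
if st.button("Run"):
    result = run_pipeline(query, debug=True)   # debug=True to show intermediate info
    st.json(result.get("debug", {}))
    st.write(result["answer"])
```

```python
# test_app.py -- smoke test with Streamlit's AppTest; run via pytest
from streamlit.testing.v1 import AppTest

def test_app_runs_without_exception():
    at = AppTest.from_file("app.py").run()
    at.text_input[0].set_value("some test query").run()
    at.button[0].click().run()
    assert not at.exception          # the app should not raise on a basic input
```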

Convince Project Team -- while this is similar to the previous step, I think of it as having the pipeline evaluated by domain experts in the project team against a larger dataset than what was achievable with the Streamlit application. We don't need as much intermediate / debugging information to illustrate how the process works. The focus here is on establishing that the solution generalizes over a sufficiently large and diverse set of data. This should be able to leverage the functions in the model.py module we built in the previous phase. The output expected for this stage is a batch report, where you call the function representing the pipeline (with debug set to False this time), and format the returned value(s) into a file.
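The batch report can be a small script over the same pipeline function; something along these lines (file names and formats are illustrative):

```python
# report.py -- hypothetical batch report over a larger evaluation dataset
import csv
from model import run_pipeline

def write_report(input_path: str, output_path: str) -> None:
    with open(input_path) as fin, open(output_path, "w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.writer(fout)
        writer.writerow(["query", "answer"])
        for row in reader:
            result = run_pipeline(row["query"], debug=False)
            writer.writerow([row["query"], result["answer"]])

if __name__ == "__main__":
    write_report("eval_inputs.csv", "eval_report.csv")
```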

Convince Application Team -- this would expose a self-describing API that the application team can call to integrate your work into the application solving the business problem. This is again just a wrapper around your function call to the pipeline with debug set to False. Having this up as early as possible allows the application team to start working, provides you with valuable feedback around inputs and outputs, and surfaces edge cases where your pipeline might produce incorrect or inconsistent results.
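A sketch of what the FastAPI wrapper (api.py) might look like; FastAPI generates the self-describing OpenAPI docs automatically.

```python
# api.py -- hypothetical FastAPI wrapper around the pipeline function
from fastapi import FastAPI
from pydantic import BaseModel
from model import run_pipeline

app = FastAPI()

class PipelineRequest(BaseModel):
    query: str

@app.post("/predict")
def predict(request: PipelineRequest) -> dict:
    return run_pipeline(request.query, debug=False)

# run with: uvicorn api:app --reload
# interactive, self-describing docs are served at /docs
```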

I also used the requests library to build unit tests for the API; the objective is just to be able to verify from the command line that it doesn't fail.
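For example, a smoke test along these lines, assuming the API sketched above is already running locally:

```python
# test_api.py -- smoke test for the API using requests; assumes the server is
# already running locally on port 8000
import requests

def test_predict_endpoint_does_not_fail():
    resp = requests.post("http://localhost:8000/predict",
                         json={"query": "some test query"}, timeout=30)
    assert resp.status_code == 200
    assert "answer" in resp.json()
```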

There is likely to be a feedback loop back to the Convince Yourself phase from each of these phases as inconsistencies are spotted and edge cases are uncovered. These may result in additional components being added to or removed from the pipeline, or their functionality changed. These changes should ideally only affect the model.py file, unless we need to add additional inputs, in which case the changes would also affect the Streamlit app.py and the FastAPI api.py.

Finally, I orchestrated all these using Snakemake, which I learned about at the recent PyData Global conference I attended. This allows me to not have to remember all the commands associated with running the Streamlit and FastAPI clients, running the different kinds of unit tests, etc., if I have to come back to the application after a while.
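A Snakefile for this setup can be as simple as a handful of rules that wrap the shell commands; rule names and commands below are illustrative.

```python
# Snakefile -- hypothetical orchestration of the clients and tests
rule unit_tests:
    shell: "pytest test_model.py test_app.py"

rule streamlit_app:
    shell: "streamlit run app.py"

rule api_server:
    shell: "uvicorn api:app --host 0.0.0.0 --port 8000"

rule api_tests:
    shell: "pytest test_api.py"

rule batch_report:
    shell: "python report.py"
```

Individual rules can then be run with, for example, snakemake --cores 1 unit_tests.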

I implemented this approach on a small project recently, and the process is not as clear-cut as I have described; there was a fair amount of refactoring as I moved from "Convince Project Owners" to "Convince Application Team". However, it feels less like a chore than it did when I had to fold in iterative improvements using the copy-paste approach. I think it is a step in the right direction, at least for me. What do you think?

Sunday, December 08, 2024

Trip Report - PyData Global 2024

I attended PyData Global 2024 last week. It's a virtual conference, so I was able to attend it from the comfort of my home, although presentations seemed to be scheduled to be maximally convenient, time-wise, for folks on the US East Coast and in Western Europe, so some of them were a bit early for me. There were four main tracks -- the General Track, the Data / Data Science Track, the AI / ML track and the LLM track -- where talks were presented in parallel. Fortunately, because it was virtual, there were recordings, which were made available almost immediately following the actual talk. So I was able to watch recordings of some of the talks I would have missed otherwise, and even squeeze in a few urgent work related meetings during the conference. So anyway, it's not like I watched every presentation, but I did get to watch quite a few based on my interests. Some were genuinely groundbreaking and / or new to me (and hence useful), and some others less so. But I enjoyed being there and being part of the awesome PyData community, so overall it was a net positive for me. Here is a Trip Report of the talks I attended, hope you find it useful.

Day 1 -- 03-Dec-2024

Understanding API Dispatching in NetworkX

The presenter describes how the NetworkX library seamlessly interfaces with faster algorithms from more modern, high performance libraries, while exposing the same (or almost the same) API to the user. The additional information is usually in the form of additional parameters, or custom subclasses of the original parameter. One cool idea is that a new backend must minimally also pass the tests written for the original NetworkX backend. I am probably never going to be a PyData library maintainer, but I thought this was a useful technique that one could use to hook up legacy code, which most of us probably have a lot of in our own applications, with newer backends with minimal risk.
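As I understand it, the dispatching is exposed to the user as an extra backend argument (or configuration) on the existing NetworkX calls; a rough sketch, assuming a backend package such as nx-cugraph is installed:

```python
# illustrative sketch of NetworkX backend dispatching
import networkx as nx

G = nx.erdos_renyi_graph(1000, 0.01)

# default pure-Python implementation
bc = nx.betweenness_centrality(G)

# same API, dispatched to a faster backend (backend name is illustrative,
# and requires the corresponding backend package to be installed)
# bc = nx.betweenness_centrality(G, backend="cugraph")
```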

Streamlining AI Development and Deployment with KitOps

The presentation provides a tutorial for using KitOps, a standards-based packaging and versioning system for AI / ML projects. It is definitely more integrated and feature-rich than a strategy of saving your code with Git and your data with DVC, but it also requires you to learn a new command (kit) with an extensive set of subcommands that do almost anything you can dream of doing with AI / ML deployment.

Enabling Multi-Language Programming in Data Engineering Workflows

The presentation demonstrates the use of Snakemake, an open-source, Python-based command-line orchestration tool, to orchestrate a Clinical Trials Data Engineering workflow containing code written in Python, R and SAS. An interesting (probably innovative) twist was the use of Jinja2 to generate Snakemake files from workflow-specific templates. It seems very similar to Makefiles, which I used earlier, before my Java / Scala days, when we switched to more JVM-friendly alternatives like Ant, Maven and SBT. More recently, I see some (Python) projects using them as well, although Jenkins and Airflow seem more popular. I think Snakemake is likely to be useful for the kind of work I do, which may not be able to justify the costs associated with Airflow or similar, but which would benefit from orchestration functionality nonetheless.

Keynote -- Embrace the Unix Command Line and Supercharge your PyData Workflow

The speaker describes various Unix commands (only some of which I was aware of, I am sorry to say, despite my relatively long association with Unix) that can make your life as a Data Scientist / Engineer easier. I am also very envious of his very colorful and information-rich command prompt. That said, there is some intersection between the tools he describes and the ones I use, and I have a few of my own that I swear by that he doesn't cover. But definitely a good presentation to watch if you use Unix / Linux; you will probably pick up a few new useful commands.

akimbo: vectorized processing of nested / ragged dataframe columns

The presenter describes akimbo, a Dataframe accessor for nested, ragged and otherwise awkward non-tabular data. Using the Akimbo accessor allows for vector speed compute on structures that are hard to express in Numpy form. Akimbo can be used from within Pandas, Polars, CuDF and Dask, as long as they use the Arrow backend.

Cost-effective data annotation with Bayesian experimental design

As the title implies, this talk is more about experimental design rather than a specific DS / ML framework. It describes techniques for identifying the most informative data points for human labeling, which in turn is likely to be most useful for model training. It reminded me a bit of Active Learning, where you identify high confidence predictions from an intermediate model to train future models. The presenter also relates this approach to binary search, which has similar characteristics. He also references OptBayesExpt, a package for Optimal Bayesian Experiment Design.

Effective GenAI Evaluations: Mitigate Hallucinations and Ship Fast

The presenter is one of the founders of Galileo, a company I follow for their cutting-edge research in areas relating to Generative AI. Among their innovations is ChainPoll, a technique that uses Chain of Thought (CoT) reasoning to determine if an LLM is hallucinating. He then describes Luna-8B (based on the BERT class DeBERTa-v3-large model), a model now offered as part of the Galileo software, that is capable of detecting hallucinations without CoT. He also talks about LunaFlow, also part of the Galileo software, that wraps the Luna-8B model.

Holistic Evaluation of Large Language Models

The presentation talks about NLP metrics such as BLEU and ROUGE, and how they are not really suitable to evaluate the generated output of LLMs. It then goes on to introduce more advanced metrics such as BERTScore and perplexity. Overall, a good overview of NLP metrics for folks who are new to NLP.

Let's get you started with asynchronous programming

I got my own start in asynchronous programming via LangChain's ainvoke call, mostly prescriptively, based on examples and on suggestions from error messages from the Python interpreter. I found this session useful as it gave me a more holistic understanding of asynchronous programming in Python, including learning what a Python co-routine is.
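For readers who, like me, came to this sideways, here is a minimal example of the core idea -- a coroutine defined with async def only makes progress when awaited on an event loop:

```python
# minimal asyncio sketch
import asyncio

async def fetch(name: str, delay: float) -> str:
    await asyncio.sleep(delay)          # yields control back to the event loop
    return f"{name} done"

async def main() -> None:
    # the two coroutines run concurrently, so this takes ~1s rather than ~2s
    results = await asyncio.gather(fetch("a", 1.0), fetch("b", 1.0))
    print(results)

asyncio.run(main())
```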

Fairness Tales: Measure / Mitigate Unfair Bias in ML

This presentation describes various fairness metrics that use the distribution of features and labels in the training data itself to determine whether the data (and thence the model) is biased or not. The metrics are illustrated in the context of a recruitment application.

Understanding Polars Data Types

A good general overview of data types used in Polars and what each is good for. I am trying to move off Pandas and on to Polars for new projects, so I thought this was useful.

Build simple and scalable data pipelines with Polars and DeltaLake

This was a very interesting presentation that showed the challenges of building a pipeline over data which may need to be updated retroactively and whose format may change over time. The presenter shows that using Polars (which uses the Parquet file format by default) and Pandas (with the Parquet file format) along with DeltaLake (a standalone Rust based implementation called delta-rs) can address all these problems very effectively, as well as provide ACID query and update guarantees on the data. I also learned that DeltaLake does not imply Spark or Databricks as I had previously thought.
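A minimal sketch of the Polars side, assuming the deltalake (delta-rs) package is installed; the table path is illustrative:

```python
# Polars + delta-rs sketch, no Spark required
import polars as pl

df = pl.DataFrame({"id": [1, 2], "value": ["a", "b"]})
df.write_delta("./sales_table")                        # creates a Delta table

# later update with ACID guarantees (append shown; overwrite also possible)
pl.DataFrame({"id": [3], "value": ["c"]}).write_delta(
    "./sales_table", mode="append")

print(pl.read_delta("./sales_table"))
```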

Measuring the User Experience and the impact of Effort on Business Outcomes

Another presentation that is not about libraries or application development. The presenter describes the defining features of user experience within an application, and shows that User Effort, i.e. how much effort the user has to expend to achieve their goals, is the most meaningful success metric. She then describes some possible approaches, both statistical and domain derived, to derive the User Effort metric for a given application.

Day 2 -- 04-Dec-2024

Boosting AI Reliability: Uncertainty Quantification with MAPIE

This presentation describes the MAPIE library, which is described as a Model Agnostic Prediction Interval Estimator, used for quantifying the uncertainty and risk of ML models. It can be used to compute conformal prediction intervals (similar to confidence intervals, but predicting a range of values for future observations) and to calibrate models (transform model scores into probabilities). It can be called via a wrapper around any Scikit-Learn (or compatible) model.
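A minimal sketch of computing conformal prediction intervals, based on my (possibly dated) understanding of the MAPIE API; check the current docs, since the interface has been evolving:

```python
# conformal prediction intervals with MAPIE (sketch)
from mapie.regression import MapieRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

mapie = MapieRegressor(estimator=RandomForestRegressor(), cv=5)
mapie.fit(X_train, y_train)

# y_pis holds lower and upper bounds for each test point at the given alpha
y_pred, y_pis = mapie.predict(X_test, alpha=0.1)   # 90% prediction intervals
```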

The art of wrangling your GPU Python Environments

This presentation discusses the challenges of effectively configuring GPU environments, given the myriad dependencies across hardware, drivers, CUDA, C++ and Python. The presenters describe how the Conda package manager handles this via virtual packages, which let it reason about GPU capabilities (such as driver and CUDA versions) that it does not manage itself. They also describe RAPIDS (they are from NVidia) and Rapids Doctor (also from NVidia), a new tool that allows users to quickly resolve GPU issues.

Extraction Pipelines: ColPali's Vision Powered RAG for Enterprise Documents

ColPali is a recent, encouraging approach to "Multimodal RAG". Effectively, it cuts up an input PDF into patches and encodes them with a specialized multimodal-aware embedding model, then uses ColBERT-style late interaction to find the parts of the input that best satisfy the query. This presentation covers how ColPali works, effectively enabling the pipeline to "see" and reason over documents.

Fast, Intuitive Feature Selection via regression on Shapley Values

This presentation describes a novel approach to feature selection. Ordinarily, one would detect the most important features by adding or removing features one by one and training a model for a few epochs each time. This approach instead derives the Shapley values once, performs a linear or logistic regression of the target on the Shapley values of the features, and uses the results to implement a feature selection heuristic that is competitive with the earlier, more heavyweight approaches. They provide an open source library, shap-select, that implements this approach.
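A rough sketch of the idea as I understood it (this is not the shap-select API, just the underlying recipe): compute Shapley values once, regress the target on them, and keep features whose coefficients are statistically significant.

```python
# Shapley-value regression for feature selection (illustrative sketch)
import pandas as pd
import shap
import statsmodels.api as sm
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=4,
                           random_state=42)
X = pd.DataFrame(X, columns=[f"f{i}" for i in range(10)])

model = GradientBoostingClassifier().fit(X, y)
shap_values = shap.TreeExplainer(model).shap_values(X)  # one value per feature

# regress the target on the Shapley values; keep features with significant
# coefficients (a simple linear probability model is used here for brevity)
exog = sm.add_constant(pd.DataFrame(shap_values, columns=X.columns))
reg = sm.OLS(y, exog).fit()
selected = [f for f, p in reg.pvalues.items() if f != "const" and p < 0.05]
print(selected)
```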

Keynote: Do Python and Data Science Matter in our AI Future?

Not sure if the presenter ended up answering the question (he likely did, I might have missed it). But he raised some very important issues about software (especially Open Source software) being more about relationships than property, and about how collaboration is bigger than capitalism. One of his observations that resonated with me was that Open Source is a path to permission-less innovation. Another interesting observation was that a dataset is just a quantized, frozen model.

GraphRAG: Bringing together graph and vector search to empower retrieval

This presentation posits that vector search can be augmented by graph based search, and then demonstrates this by augmenting a Naive RAG pipeline (query -(retriever)-> context, query + context -(LLM)-> answer) with a Kuzu-backed graph DB. I learned several things from this presentation -- first, it is probably more convenient to use Kuzu instead of Neo4j Community Edition for my graph POCs, and second, beyond just the entity-relationship paths, it may be worth looking at returning representative content for the entities along these paths. Definitely something to try out in the future.

Rapid deduplication and fuzzy matching of large datasets using Splink

This presentation describes Splink, a data linkage library for medium to large datasets. Splink is available on Databricks, where it is suitable for deduplicating datasets with 100 million+ records. Interestingly, when we deduplicate within the same dataset, it is called deduplication, but when doing this across multiple datasets, it is called record linkage.

Statically Compiled Julia for Library Development

Julia is a JIT-compiled language and it can be called from Python. When called from Python, the Julia functionality is statically compiled down to high-performing native code. Unfortunately, this currently means that the entire Julia runtime is statically linked. This presentation describes work in the Julia community to modify this behavior, so that it restricts the linked modules to only those referenced from the exposed entry-points, resulting in smaller and lighter-weight executables.

Let our Optima Combine

This presentation introduces Constraint Optimization and the OR-Tools library from Google. It's been a while since I used Linear Programming or similar tools, so it was nice to know they exist for Python. If I ever end up doing this for work or hobby, I might look at OR-Tools.

Unlocking the Power of Hybrid Search: A Deep Dive into Python powered Precision and Scalability

This presentation described a Hybrid RAG pipeline with a combination of vector and lexical search and an RRF (Reciprocal Rank Fusion) head to merge results, and showed that the merged results end up being generally more useful for answer generation since they combine the best of both worlds.
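RRF itself is only a few lines of code: each document gets a score contribution of 1 / (k + rank) from every result list it appears in, and the merged list is sorted by the summed score. A quick sketch:

```python
# Reciprocal Rank Fusion: score(d) = sum over result lists of 1 / (k + rank)
def rrf_merge(result_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# merge a vector-search ranking with a lexical (e.g. BM25) ranking
merged = rrf_merge([["d3", "d1", "d2"], ["d1", "d4", "d3"]])
print(merged)   # d1 and d3 float to the top because both lists agree on them
```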

Automatic Differentiation, a tale of two languages

The presentation looks at the differences between Python and Julia with respect to how AutoDiff functionality is implemented. With Python, it is part of external frameworks like Pytorch / Tensorflow / JAX, whereas with Julia it is part of the language. Julia has multiple pluggable AutoDiff implementations that can be used in different situations. This talk also helped address some questions about differences between the AutoDiff implementations in Pytorch and Tensorflow that came up in our Deep Learning book reading group on TWIML.

Navigating Cloud Expenses in Data and AI: Strategies for Scientists and Engineers

The presentation describes the Open Source Metaflow library and its managed version Outerbounds, meant to help with development and deployment of DS / ML / AI projects. An interesting observation from the presenter is the complementarity of requirements from the Data Scientist versus the Operations Engineer. The presenter identifies issues such as GPU rent-vs-buy decisions, the human-vs-infra cost tradeoff and the importance of choosing the right instance type for the problem being solved, and shows how Outerbounds helps to identify and solve these issues.

Julia ML Ecosystem

Last time I looked at Julia, it was just starting out as a "Data Science language" that had nowhere close to the ecosystem that Python had (and continues to have). This presentation showed me a different (and much improved) picture, where the ecosystem now includes equivalents for linear algebra (similar to Numpy / Pytorch / JAX), dataframe processing (Dataframe.jl and Tidier.jl analogous to Pandas / Polars), visualization (Makie.jl, JuliaPlots.jl and AlgebraOfGraphics.jl analogous to Matplotlib / Seaborn), Machine Learning (ML.jl analogous to Scikit Learn), Deep Learning (Flux.jl analogous to Keras), etc. In addition, it is possible to call Python from Julia (and vice versa) so they can take advantage of each other's ecosystems.

Pytorch Workflow Mastery: A Guide to Track and Optimize Model Performance

This presentation is a good introduction to using Pytorch, demonstrating how to build a basic Convolutional Neural Network and train it with images from the CIFAR-10 dataset. It covers a few things that have become available / popular since I started working with Pytorch, so these parts were useful. Among them are the use of model.compile to generate a compiled model (similar to Tensorflow's Data Flow Graph), the use of canned metrics via the torchmetrics package, and integration with Weights and Biases (wandb.init()) and Optuna for Bayesian Hyperparameter optimization.

New Features in Apache Spark 4.0

I attended this presentation because I am a former Spark user. I haven't used it (at least not heavily) for the last couple of years, since the data I need is now more conveniently available in Snowflake. But I was curious about what new functionality it has gained since I last used it. The presentation covers the ANSI SQL mode, the VARIANT data type that now allows JSON and XML data to be natively parsed (up to 8x faster), the changes in Spark Connect to decouple the client from the server, making Spark connectors possible in various languages such as Rust and Go, parameterized queries, and User Defined Table functions.

Day 3 -- 05-Dec-2024

The LEGO Approach to designing PyData workflows

The presenter describes her idea of designing application systems with components that interlock with each other like Lego bricks, and her implementation of these ideas in the DataJourney framework.

Time Series Analysis with StatsModels

This was a workshop conducted by Allen Downey, the author of Think Stats. Specifically, the workshop covered Chapter 12 of the book, applying the statsmodels library to do Time Series analysis. The workshop uses statsmodels to decompose a time series representing electricity generation over the last 20+ years into trend, seasonal and random components, using additive and multiplicative decompositions to predict future data points from past data, and using ARIMA (AutoRegressive Integrated Moving Average) models. I feel like I understand time series and ARIMA better than I used to, although I am sure I have just scratched the surface of this topic.
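A minimal sketch of the two statsmodels pieces covered in the workshop, run here on a synthetic monthly series standing in for the electricity generation data:

```python
# decomposition and ARIMA with statsmodels (synthetic data for illustration)
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima.model import ARIMA

idx = pd.date_range("2001-01-01", periods=240, freq="MS")    # 20 years, monthly
y = pd.Series(np.linspace(100, 200, 240)                      # trend
              + 10 * np.sin(2 * np.pi * np.arange(240) / 12)  # seasonality
              + np.random.default_rng(42).normal(0, 2, 240),  # noise
              index=idx)

decomp = seasonal_decompose(y, model="additive")   # trend / seasonal / resid
print(decomp.trend.dropna().head())

arima = ARIMA(y, order=(1, 1, 1)).fit()
print(arima.forecast(steps=12))                    # predict the next 12 months
```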

Building an AI Travel Agent that never Hallucinates

Hallucination is a feature of LLMs rather than a bug. So it seems like a tall order to build an AI Travel Agent (or any LLM based agent in general) that never hallucinates. However, and somewhat obviously in hindsight, one way to address the problem would be to severely limit its capabilities to make decisions. The CALM (Conversational AI with Language Models) framework from Rasa implements this by setting up the equivalent of a phone tree and giving the LLM only the capability to jump from node to node in the tree. I thought this was brilliant, because for most applications where you want an agent, you don't need (or want) full-blown AGI.

Evaluating RAGs: On the correctness and coherence of Open Source eval metrics

This presentation is a bit meta: it evaluates LLM evaluation metrics available from Open Source frameworks such as RAGAS and TruLens, across different LLMs like Claude Sonnet, GPT 3.5 and GPT-4, Llama2-70B and Llama3-70B. The results show that these metrics yield wildly different values for the same content. They indicate that future work is needed to evaluate these results against human judgment.

Building Knowledge Graph based Agents with Structured Text Generation and Open-Weights Models

This was a great presentation on using Structured Text Generation (with outlines) over content to build a Knowledge Graph. Structured Text output also makes it convenient to model Agents that execute actions through function calls. The presenter uses these ideas to first generate a Knowledge Graph from a dataset, then implements an Agentic Query pipeline that queries this Knowledge Graph.

From Features to Inference: Build a Core ML Platform from Scratch

This is a very impressive live coding presentation where the presenter sets up an ML pipeline from scratch, including an Inference Engine, Model Registry, Feature Store and an Event Bus to connect them all together using an Event Driven design. One good piece of advice here was to align the software with the language of the business, i.e. domain driven design. Another was to build "default" implementations that you can write tests against, and replace them with "real" components as and when they become available. Expectations for these components are already codified in the unit tests, so the new components must satisfy the same expectations. There are some very interesting (dependency injection like) code patterns, some of which reminded me of my Java / Spring days.
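A small sketch of the "default implementation first" idea (names are hypothetical): code against an interface, write tests against a trivial default, and swap in the real component later without touching the tests.

```python
# default implementation first, real component later (illustrative sketch)
from typing import Protocol

class FeatureStore(Protocol):
    def get_features(self, entity_id: str) -> dict: ...

class InMemoryFeatureStore:
    """Default implementation used until the real feature store is ready."""
    def __init__(self, data: dict[str, dict]):
        self._data = data
    def get_features(self, entity_id: str) -> dict:
        return self._data.get(entity_id, {})

def predict(store: FeatureStore, entity_id: str) -> float:
    feats = store.get_features(entity_id)
    return float(len(feats))             # placeholder for a real model

def test_predict_with_default_store():
    store = InMemoryFeatureStore({"e1": {"x": 1.0, "y": 2.0}})
    assert predict(store, "e1") == 2.0   # the real store must pass this too
```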

Putting the Data Science back into LLM Evaluation

This presentation covers a lot of familiar ground for folks who have worked with LLMs for some time. However, there are some new ideas here as well. One of them is the use of heuristic-based guardrails, such as matching the length of the output, checking patterns in the output using regexes, and using computed metrics such as Flesch-Kincaid scores. Another is the use of chatbot-arena style scoring to evaluate relative improvements. The presenters have created Parlance, an open source LLM evaluation tool that implements such a chatbot-arena style model-to-model comparison metric.

Making Gaussian Processes Useful

The presentation is about Gaussian Processes, but because they appear here as part of hierarchical probabilistic models, which most people are not that familiar with, the first part introduces PyMC and hierarchical models, and the second part covers how Gaussian Processes can model the effect of continuous variables as a family of functions rather than a single variable. I watched this presentation because I was familiar with probabilistic hierarchical models, having used PyMC3 in the past, when it was backed by the forked version of Theano and NUTS was the state of the art sampler. Now it is backed by JAX and there is an even faster sampler based on Rust. But GPs were new to me, so I learned something new.

I might watch a few more presentations when I have time. PyData / NumFocus are generally very good about sharing the presentations openly, but it is likely to be 1-2 months before that happens. I will watch for the announcement and update this post with the information, but in the meantime, that's all I have to say about PyData Global 2024. I hope you found it interesting and useful.