I attended PyData Global 2025 earlier this month. I had hoped to write this up sooner, but I've been busy, so I am only now getting to it on Christmas morning. Merry Christmas to all my readers, and best wishes for a Happy New 2026; hopefully it will be even better and more exciting (on the technology front) than this one! Taking stock of the year earlier today, I think I have some serious catching up to do in terms of reading about new things that happened while I was busy doing other things. Hopefully I will have some writeups about them here in the coming year, although I am aware that I have made (and broken) similar promises before.
Anyway, back to PyData Global. It was held over 3 days, December 9-11, and the baseline timezone was UTC, so for me the talks started very early in the morning (2:30-3:30 am) and ended midday (1:30 pm on the first day and 11 am on the other two). So I ended up watching a lot of recordings: basically I would attend live the talks scheduled past 6-7 am my time, and then loop back to watch the recordings of the ones I had missed earlier in the day. Since I was watching so many recordings, it was tempting and easy to skip over the prologue that speakers include to level-set with everyone in the audience, and I am afraid I succumbed to that temptation repeatedly. I was also multi-tasking with some work stuff, which meant I ended up picking and choosing more than I otherwise would have at a "real" physical conference.
Here are the talks I attended and my takeaways from them.
Day 1
Scaling Fuzzy Product Matching with BM25: A Comparative Study of Python and Database Solutions -- this attempts to solve a product name matching problem, where the same product can be referred to by slightly different names. The strategy is to use BM25 search (available in DuckDB) to find similar names, reducing an O(n^2) problem to a much smaller one, and finally using Dask and cuDF to merge the data. I also learned about the bm25s package for sparse BM25 matching in Python.
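To get a feel for the bm25s API, here is a minimal sketch (my own, with made-up product names, not the speaker's code) of using it to shortlist candidate matches before a more expensive fuzzy merge:

```python
# pip install bm25s
import bm25s

# Catalog of product names; the same products show up in other feeds
# under slightly different names.
catalog = [
    "Apple iPhone 15 Pro 128GB Black",
    "Samsung Galaxy S24 Ultra 256GB Gray",
    "Apple iPhone 15 Pro Max 256GB Blue",
]

# Build a sparse BM25 index over the tokenized catalog.
retriever = bm25s.BM25()
retriever.index(bm25s.tokenize(catalog))

# For each incoming name, retrieve only the top-k candidates instead of
# comparing against every catalog entry, shrinking the O(n^2) problem.
results, scores = retriever.retrieve(
    bm25s.tokenize("iphone 15 pro black 128 gb"), corpus=catalog, k=2)
print(results[0], scores[0])  # top-2 candidate names and their scores
```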
Lessons learnt in optimizing a large-scale Pandas application using Polars, Fireducks and cuDF -- nice coverage of optimization strategies for DataFrames, which the presenter calls T1 (replacing for-loops with iterator-loops), T2 (replacing loops with vector operations) and T3 (strategically filtering before applying join or aggregate functions), along with a comparison of different DataFrame handling packages. The speaker covers the strengths of each library relative to Pandas (lazy mode and multi-threading for Polars and Fireducks, GPU parallelism for cuDF) and finds that on his dataset Polars and Fireducks beat Pandas thanks to multi-threading, with cuDF doing best of all thanks to GPU parallelism.
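Here is a small sketch of what T2 and T3 look like in Polars' lazy mode (my own toy example, not the speaker's code): filtering and aggregating before the join keeps the joined data small, and the lazy query optimizer pushes these steps down automatically.

```python
import polars as pl

orders = pl.LazyFrame({
    "customer_id": [1, 2, 3, 1, 2],
    "amount": [10.0, 25.0, 5.0, 40.0, 15.0],
})
customers = pl.LazyFrame({
    "customer_id": [1, 2, 3],
    "region": ["US", "EU", "US"],
})

result = (
    orders
    .filter(pl.col("amount") > 10.0)             # T3: prune rows early
    .group_by("customer_id")
    .agg(pl.col("amount").sum().alias("total"))  # T2: vectorized aggregation
    .join(customers, on="customer_id")           # join only the small result
    .collect()
)
print(result)
```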
From Feature Engineering to Context Engineering for Agents -- the speaker makes the argument that Context Engineering for Agent based applications is the same as Feature Engineering for more traditional Machine Learning (ML) applications. The example he cites is Retrieval Augmented Generation (RAG) systems, where the retrieved context is used for in-context learning to help the LLM Agent return better generations. The speaker is also the author of Building Machine Learning Systems with a Feature Store, which he offered to the audience as a free download (I have downloaded it and look forward to reading it once I have some time).
Python Worst Practices: Learn from the Expert -- very entertaining talk about what not to do when building Data Science applications. To be fair, the practices he highlights are not all that uncommon, which underscores why the ability to think in terms of the domain rather than algorithms is so important.
Text Mining Orkut's Community Data with Python: Cultural Memory, Platform Neglect and Digital Amnesia -- Orkut used to be Google's answer to Facebook (and MySpace) but it never took off in the US. It was more popular in Brazil and India, until Google pulled the plug on it. The speaker is from Brazil and he describes his project to text mine Orkut to analyze how and why it failed. Even if you don't care about the history of Orkut, the talk is worth it if you are curious about the text mining and visualization techniques used in it. The repository behind the talk is at rodrigosf672/orkut-pydataglobal2025 on GitHub.
Using traditional AI and LLM to automate complex and critical documents in Healthcare -- description of a case study using Clinical Trials data from a Project Manager's point of view. Lots of useful lessons for anyone looking to implement an AI solution in Healthcare.
Why Julia's GPU Accelerated ODE Solvers are 20x-100x Faster than JAX and Pytorch -- I don't use Julia, but may someday, and I am intrigued by the idea of applying ODE solvers to non-neural optimization problems as well. The talk goes into a lot of detail about Julia's ODE solver and how it is superior (in terms of scope and performance) to the ones built into JAX or PyTorch.
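For reference, the JAX side of that comparison looks roughly like this, using the ODE solver that ships with JAX (a toy exponential-decay sketch of my own, not one of the talk's benchmarks):

```python
import jax.numpy as jnp
from jax.experimental.ode import odeint

# Simple exponential decay: dy/dt = -y, with y(0) = 1.
def f(y, t):
    return -y

ts = jnp.linspace(0.0, 5.0, 50)     # times at which to report the solution
ys = odeint(f, jnp.array(1.0), ts)  # integrate the ODE from y(0)
print(ys[-1])                       # should be close to exp(-5)
```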
Where have all the Metrics gone? -- the speaker makes the point that traditional metrics are still relevant in the age of AI, except that wrongness is now multi-dimensional, i.e. the LLM can make mistakes in more than one way, often at the same time. She then goes on to describe different kinds of failure modes for LLMs, classifying them as Domain Failure, Form Failure, Mode Collapse, Consistency Failure, Boundary Failure and Temporal Failure, and suggests a pragmatic way to manage these failures by ranking and measuring them separately. This requires structuring the application so the wrongness of each component can be measured on its own. She advocates for using traditional metrics as well as coming up with new ones.
The Boringly Simple Loop Powering GenAI Apps -- very nice talk that attempts to unify different AI architectures as variations of a nested two-loop pipeline. The speaker shows how simpler pipelines such as RAG or workflow systems are just specializations of the general pipeline, and along the way discusses the advantages and disadvantages of each architecture. Definitely worth watching if you are interested in AI architectures.
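As I understood it, the general pattern looks something like the sketch below, where the outer loop iterates reasoning steps and the inner loop executes tool calls; `call_llm` and `TOOLS` are hypothetical stand-ins, not a real API.

```python
def call_llm(messages):
    # Stub: a real implementation would call an LLM that either
    # requests tool calls or returns a final answer.
    return {"type": "final", "content": "stub answer"}

TOOLS = {"search": lambda q: f"results for {q!r}"}

def run_agent(user_input, max_steps=5):
    messages = [{"role": "user", "content": user_input}]
    for _ in range(max_steps):                # outer loop: reasoning steps
        response = call_llm(messages)
        if response["type"] == "final":
            return response["content"]
        for call in response["tool_calls"]:   # inner loop: tool execution
            result = TOOLS[call["name"]](call["args"])
            messages.append({"role": "tool", "content": result})
    return "step budget exhausted"

print(run_agent("What is the airspeed of an unladen swallow?"))
```

In this framing, a plain RAG pipeline is just the special case where the outer loop runs exactly once and the only tool is the retriever.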
When AI Makes Things Up: Understanding and Tackling Hallucinations -- the speaker talks about why LLMs hallucinate, and strategies developers can adopt to alleviate the problem where possible. She also talks about detecting hallucinations using both human and LLM based oversight (consistency checks), and about estimating model confidence via two approaches: one that requires access to token statistics, and one that measures output variance and does not. She also covers some high level strategies to prevent hallucinations.
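The output-variance idea is easy to sketch: sample several answers to the same prompt and measure agreement, treating low agreement as a hallucination signal. Here `ask_llm` is a hypothetical stand-in for a sampled LLM call:

```python
from collections import Counter

def consistency_score(prompt, ask_llm, n=5):
    # Sample n independent answers at a nonzero temperature.
    answers = [ask_llm(prompt, temperature=0.8) for _ in range(n)]
    # Fraction of samples agreeing with the majority answer; a low value
    # means the model is inconsistent, hence possibly hallucinating.
    majority, count = Counter(answers).most_common(1)[0]
    return majority, count / n
```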
Day 2
PyData/Sparse and Finch: extending sparse computing in Python ecosystem -- this is mostly about Finch, a sparse tensor compiler written in Julia that can be accessed from Python for sparse array programming. It creates an intermediate notation (Finch assembly code) that can be translated to the underlying architecture (CPU, GPU, etc.).
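The Python entry point is the pydata/sparse package; here is a quick sketch of the array API it exposes (my own toy example):

```python
import numpy as np
import sparse

# A dense 1000x1000x1000 float64 array would need ~8 GB; the COO format
# stores only the nonzero entries.
x = sparse.random((1000, 1000, 1000), density=1e-6)
y = x.sum(axis=2)                      # reductions stay sparse-aware
z = sparse.COO.from_numpy(np.eye(5))   # converting from dense NumPy
print(x.nnz, y.shape, z.density)
```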
How to effectively use text embeddings in tree based models -- tree based algorithms (Random Forest, XGBoost, etc.) typically work with data decomposed into low-dimensional feature vectors, where the features are usually manually selected. Using embeddings directly as feature vectors would produce very deep trees and overfitting. The proposed solution is to use the embeddings to train multiple base predictors, whose outputs become the feature vector for the tree based model. The speaker demonstrates this technique using a StackingRegressor to create a 2-layer model ensemble, which helps with feature generation and results in explainable models.
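Here is a minimal sketch of that 2-layer ensemble with scikit-learn, using random numbers as a stand-in for real text embeddings and targets:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.svm import LinearSVR

X_emb = np.random.randn(500, 384)   # stand-in for 384-dim text embeddings
y = np.random.randn(500)            # stand-in for the regression target

stack = StackingRegressor(
    # Base learners consume the high-dimensional embeddings directly.
    estimators=[("ridge", Ridge()), ("svr", LinearSVR(max_iter=5000))],
    # The tree model only ever sees the base learners' predictions.
    final_estimator=RandomForestRegressor(n_estimators=100),
)
stack.fit(X_emb, y)
```

By default StackingRegressor feeds the final estimator only the base models' cross-validated predictions, which is exactly the low-dimensional feature vector the talk calls for.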
Bayesian Decision Analysis with PyMC: Beyond AB Testing (Downey) -- this is a 90 minute workshop on using PyMC for Bayesian Decision Analysis; specifically, Bayesian AB testing of digital marketing strategies. This is a very hands-on session in which attendees are guided through various modeling approaches using PyMC. All notebooks are available at AllenDowney/BDAWithPyMC on GitHub. Great session, as is always the case with Dr Downey's talks. I plan to come back to this one in the future.
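As a flavor of what PyMC makes easy, here is a minimal Bayesian A/B test (my own sketch with made-up numbers, not one of Dr Downey's notebooks):

```python
import pymc as pm

with pm.Model():
    # Uninformative Beta priors on the two conversion rates.
    p_a = pm.Beta("p_a", alpha=1, beta=1)
    p_b = pm.Beta("p_b", alpha=1, beta=1)
    # Observed conversions out of 1000 visitors per variant (made up).
    pm.Binomial("obs_a", n=1000, p=p_a, observed=48)
    pm.Binomial("obs_b", n=1000, p=p_b, observed=63)
    pm.Deterministic("lift", p_b - p_a)
    idata = pm.sample()

# Probability that B beats A, read straight off the posterior samples.
print((idata.posterior["lift"] > 0).mean().item())
```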
UQLM: Detecting LLM Hallucinations with Uncertainty Quantification -- the speaker introduces their package UQLM for Uncertainty Quantification. They define hallucination as non-factual content that sounds plausible, which is impossible to prevent at scale using Human-in-the-loop (HITL) strategies. Their solution is to quantify the uncertainty of the model during text generation. Their package offers black-box and LLM-as-judge scorers that can work without requiring access to the token statistics, and white-box scorers that do. The project is hosted at cvs-health/uqlm on GitHub.
Lessons in Decision Making from the Monty Hall Problem -- the Monty Hall problem illustrates why probability is so non-intuitive. The speaker shows how extending the problem from 3 doors to N (where N >> 3) makes it less of an edge case and much more intuitive, an approach that I thought might be applicable to other situations as well. He also covers applications of this kind of thinking in industry.
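The N-door variant is also trivial to simulate, which makes the intuition concrete: if the host opens all but one losing door, switching wins with probability (N-1)/N. A quick sketch:

```python
import random

def play(n_doors=100, switch=True, trials=100_000):
    wins = 0
    for _ in range(trials):
        car = random.randrange(n_doors)     # door hiding the car
        pick = random.randrange(n_doors)    # contestant's initial pick
        # The host opens every other losing door, so the one remaining
        # closed door hides the car unless the initial pick was right.
        wins += (pick != car) if switch else (pick == car)
    return wins / trials

print(play(switch=True), play(switch=False))  # ~0.99 vs ~0.01
```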
Let Me Structure Freely? How to Improve LLM Structured Output Quality -- I have been working mostly with Anthropic models, which don't have as much formal support for Structured Input and Output as OpenAI's models, so the dependence on Structured I/O was new to me and prompted me to Google it separately (and apply it in cases where I work with OpenAI's models). Apart from the benefits of Structured Output, the speaker also talks about an extension to the DSPy library (not yet merged) called StructureOfThought that allows for structured chain-of-thought style introspection in LLMs for reasoning problems.
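For anyone else new to this: at the time of writing, the OpenAI Python SDK lets you pass a Pydantic model as the response format and get back a validated instance. A minimal sketch, with a placeholder model name and prompt (and an API key assumed in the environment):

```python
from openai import OpenAI
from pydantic import BaseModel

# The schema the model's output must conform to.
class Invoice(BaseModel):
    vendor: str
    total: float
    currency: str

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Extract: ACME owes $12.50"}],
    response_format=Invoice,
)
invoice = completion.choices[0].message.parsed  # a validated Invoice
```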
Optimal variable binning in Logistic Regression -- the speaker introduces variable binning for Logistic Regression. The idea is to discretize continuous variables into categorical bins. Computing the weight of evidence (WoE) per bin, or the information value (IV) globally across all bins, provides a useful feature selection metric for deciding whether the variable is predictive. The binning itself is framed as an optimization problem that maximizes a chosen metric such as GINI or Information Value. The speaker reports that optimal binning of the age feature gave the best results for his application. More details are in the guillermo-navas-palencia/optbinning repository on GitHub.
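The package's quickstart is compact enough to sketch here (toy data, and assuming I am reading the optbinning docs right):

```python
import numpy as np
from optbinning import OptimalBinning

# Toy data: default probability loosely increasing with age.
age = np.random.randint(18, 80, size=1000)
default = (np.random.rand(1000) < (age / 200)).astype(int)

optb = OptimalBinning(name="age", dtype="numerical", solver="cp")
optb.fit(age, default)
table = optb.binning_table.build()       # per-bin WoE, event rates, IV
woe = optb.transform(age, metric="woe")  # encode raw values as WoE
print(table)
```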
Decisions under uncertainty: A Hands-on Guide to Bayesian Decision Theory -- Bayesian Decision Theory is all about picking the action that optimizes the expected utility or cost. In its simplest form, it involves enumerating the possible actions in each state, estimating the utility / cost of each action based on your domain priors, and choosing the action with the highest expected utility. Predictive models fit the same framework, since the decision threshold applied to their output encodes a domain-specific utility. Where exact probabilities are unknown they can be represented by distributions, and Gaussian processes can be used to optimize costs. The speaker covers applications of Bayesian Decision Theory such as Hyperparameter Optimization and Experiment Design.
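In its simplest form this is just a weighted average over states; here is a worked toy example with made-up numbers:

```python
import numpy as np

states = ["rain", "no_rain"]
prior = np.array([0.3, 0.7])   # domain prior over states

# utility[action][state]: payoff of each action in each state.
utility = {
    "take_umbrella": np.array([5.0, -1.0]),
    "leave_it":      np.array([-10.0, 2.0]),
}

# Expected utility of each action under the prior; pick the best one.
expected = {a: float(u @ prior) for a, u in utility.items()}
best = max(expected, key=expected.get)
print(expected, "->", best)   # take_umbrella wins: 0.8 vs -1.6
```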
From Pandas to Policy as code; the future of ML Data Engineering (keynote) -- this keynote contains a lot of good general advice for Data Engineers. The gist is to minimize data movement by processing data at the point of generation and shipping only the result of the processing rather than the entire payload; indirectly, this is also an argument for efficient edge processing. Once processing is done, the raw data can be archived via a slower process since it is no longer time-sensitive. This also keeps pipelines compliant with regulations such as GDPR and minimizes exposure. The message is to apply data policy at the source.
Day 3
Revolutionizing Safety Log Analysis in Oil and Gas: A Multi-Stage LLM Approach for Enhanced Hazard
How big are SLMs -- I have been interested in the possibility of deploying multiple special purpose Small Language Models (SLMs) in place of a single general purpose LLM driven by prompts, so I thought this talk might be interesting, and I was not disappointed. The speaker defines SLMs as models with 1M-10B parameters and references some popular SLMs (Phi-4, Mistral Small 3, Gemma, Llama 3.2, SmolLM v2, Qwen2) which I plan on exploring further. She enumerates some popular approaches for fine-tuning SLMs, both at the model and the data level, and also mentions the possibility of distilling LLMs into SLMs, specifically Llama into BabyLlama. I thought it was a very good overview; if someone went down all the rabbit holes the talk opens up, they would end up with a very comprehensive and useful book on SLMs.
Beyond Just Predictions: Causal Thinking in Machine Learning -- this talk introduces Causality in Machine Learning, where you want to estimate the effect of a treatment from the data, with focused approaches such as Uplift Modeling. The speaker covers the Conditional Average Treatment Effect (CATE) and how to estimate it using Meta-Learners: the S-Learner, which trains a single model with the treatment as an extra feature and computes the lift as the difference between its predictions with and without the treatment, and the T-Learner, which trains separate models for the treated and control groups and computes the difference of their predictions, capturing heterogeneous treatment effects.
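Both meta-learners are easy to sketch with scikit-learn (my own toy example with a known treatment effect, not the speaker's code):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))            # covariates
t = rng.integers(0, 2, size=1000)         # 0/1 treatment indicator
y = X[:, 0] + t * (1 + X[:, 1]) + rng.normal(size=1000)  # true ATE ~ 1

# S-Learner: one model with the treatment as an extra feature.
s = GradientBoostingRegressor().fit(np.column_stack([X, t]), y)
cate_s = (s.predict(np.column_stack([X, np.ones(len(X))]))
          - s.predict(np.column_stack([X, np.zeros(len(X))])))

# T-Learner: separate models for treated and control groups.
m1 = GradientBoostingRegressor().fit(X[t == 1], y[t == 1])
m0 = GradientBoostingRegressor().fit(X[t == 0], y[t == 0])
cate_t = m1.predict(X) - m0.predict(X)

print(cate_s.mean(), cate_t.mean())  # both should be near 1.0
```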
Detecting Regime Shifts in Time Series with Python: Entropy based Change Point Detection -- the goal of change point detection is to detect changes in a time series that cannot be explained by randomness. It has applications in anomaly detection, quality control, data drift, etc. Changes can be in the average, the variance or the frequency. The speaker describes techniques such as periodic sliding window statistics, along with metrics to measure the change (KL Divergence for continuous variables and Pearson distance for discrete ones), and discusses estimating the optimal kernel width and threshold.
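A minimal version of the sliding-window approach is easy to write down (my own sketch; window size and threshold are exactly the knobs the talk discusses tuning):

```python
import numpy as np
from scipy.stats import entropy

def kl_scores(x, window=100, bins=20):
    # Score each position by the KL divergence between the empirical
    # distributions of the windows just before and just after it.
    scores = []
    lo, hi = x.min(), x.max()
    for i in range(window, len(x) - window):
        p, _ = np.histogram(x[i - window:i], bins=bins, range=(lo, hi))
        q, _ = np.histogram(x[i:i + window], bins=bins, range=(lo, hi))
        # Add-one smoothing keeps the KL divergence finite on empty bins.
        scores.append(entropy(p + 1, q + 1))
    return np.array(scores)

# Toy series with a variance shift halfway through.
x = np.concatenate([np.random.normal(0, 1, 500),
                    np.random.normal(0, 3, 500)])
scores = kl_scores(x)
print(scores.argmax() + 100)  # estimated change point, should be near 500
```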
Overall, I thought it was a good conference. I got to hear about cool things the Python Data Science community is doing, and picked up a few ideas I would like to try out in my own applications. The talks listed above are the ones I attended; if you have favorites that you don't see listed here, please let me know so I can check them out.