Sunday, December 08, 2024

Trip Report - PyData Global 2024

I attended PyData Global 2024 last week. It's a virtual conference, so I was able to attend from the comfort of my home, although presentations seemed to be scheduled to be maximally convenient, time-wise, for folks on the US East Coast and in Western Europe, so some of them were a bit early for me. There were four main tracks -- the General Track, the Data / Data Science Track, the AI / ML Track and the LLM Track -- where talks were presented in parallel. Fortunately, because it was virtual, there were recordings, which were made available almost immediately after each talk. So I was able to watch recordings of some of the talks I would have missed otherwise, and even squeeze in a few urgent work-related meetings during the conference. So anyway, it's not like I watched every presentation, but I did get to watch quite a few based on my interests. Some were genuinely groundbreaking and / or new to me (and hence useful), and some others less so. But I enjoyed being there and being part of the awesome PyData community, so overall it was a net positive for me. Here is a trip report of the talks I attended; I hope you find it useful.

Day 1 -- 03-Dec-2024

Understanding API Dispatching in NetworkX

The presenter describes how the NetworkX library seamlessly interfaces with faster algorithms from more modern, high-performance libraries, while exposing the same (or almost the same) API to the user. The additional information is usually in the form of additional parameters, or custom subclasses of the original parameter. One cool idea is that a new backend must, at a minimum, also pass the tests written for the original NetworkX backend. I am probably never going to be a PyData library maintainer, but I thought this was a useful technique one could use to hook up legacy code, which most of us probably have a lot of in our own applications, to newer backends with minimal risk.
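To make the dispatching concrete, here is a minimal sketch of what the user-facing side looks like, assuming NetworkX 3.2+ with NVIDIA's nx-cugraph backend package installed; the function is real, but treat the backend setup as an assumption to verify against the docs.

    import networkx as nx

    G = nx.erdos_renyi_graph(1000, 0.01, seed=42)

    # Default: the pure-Python NetworkX implementation.
    bc = nx.betweenness_centrality(G)

    # Same API, but dispatched to a GPU backend (assumes nx-cugraph installed).
    bc_gpu = nx.betweenness_centrality(G, backend="cugraph")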

Streamlining AI Development and Deployment with KitOps

The presentation provides a tutorial on using KitOps, a standards-based packaging and versioning system for AI / ML projects. It is definitely more integrated and feature-rich than the strategy of saving your code with Git and your data with DVC, but it also requires you to learn a new command (kit) with an extensive set of subcommands that do almost anything you can dream of doing with AI / ML deployment.

Enabling Multi-Language Programming in Data Engineering Workflows

The presentation demonstrates the use of Snakemake, an open-source, Python-based, command-line orchestration tool, to orchestrate a Clinical Trials Data Engineering workflow containing code written in Python, R and SAS. An interesting (probably innovative) twist was the use of Jinja2 to generate Snakemake files from workflow-specific templates. Snakemake seems very similar to Makefiles, which I used before my Java / Scala days, when we switched to more JVM-friendly alternatives like Ant, Maven and SBT. More recently, I see some (Python) projects using Makefiles as well, although Jenkins and Airflow seem more popular. I think Snakemake is likely to be useful for the kind of work I do, which may not be able to justify the costs associated with Airflow or similar, but which would benefit from orchestration functionality nonetheless; a small sketch follows.
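For flavor, here is a tiny hypothetical Snakefile in the spirit of the talk (rule, script and file names are made up): each rule declares its inputs and outputs, Snakemake infers the execution order like Make does, and each step can shell out to a different language.

    # Snakefile (hypothetical): a two-step, two-language pipeline.
    rule all:
        input:
            "results/report.csv"

    rule clean:
        input:
            "data/raw_trial_data.csv"
        output:
            "data/clean.csv"
        shell:
            "python scripts/clean.py {input} {output}"

    rule analyze:
        input:
            "data/clean.csv"
        output:
            "results/report.csv"
        shell:
            "Rscript scripts/analyze.R {input} {output}"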

Keynote -- Embrace the Unix Command Line and Supercharge your PyData Workflow

The speaker describes various Unix commands (only some of which I was aware of, I am sorry to say, despite my relatively long association with Unix) that can make your life as a Data Scientist / Engineer easier. I am also very envious of his very colorful and information-rich command prompt. That said, there is some intersection between the tools he describes and the ones I use, and I have a few of my own that I swear by that he doesn't cover. But definitely a good presentation to watch if you use Unix / Linux; you will probably pick up a few useful new commands.

akimbo: vectorized processing of nested / ragged dataframe columns

The presenter describes akimbo, a dataframe accessor for nested, ragged and otherwise awkward non-tabular data. Using the akimbo accessor allows for vector-speed compute on structures that are hard to express in Numpy form. Akimbo can be used from within Pandas, Polars, CuDF and Dask, as long as they use the Arrow backend.
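The underlying engine is the awkward-array library; here is a minimal example of the kind of vectorized ragged operation involved (this uses the awkward API directly, not akimbo's accessor, which per the talk wraps operations like this behind the dataframe):

    import awkward as ak

    # A ragged column: each row holds a variable-length list of readings.
    ragged = ak.Array([[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]])

    # Vectorized per-row reduction, no Python-level loop over the rows.
    row_sums = ak.sum(ragged, axis=1)   # [3.0, 3.0, 15.0]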

Cost-effective data annotation with Bayesian experimental design

As the title implies, this talk is more about experimental design than about a specific DS / ML framework. It describes techniques for identifying the most informative data points for human labeling, which in turn are likely to be most useful for model training. It reminded me a bit of Active Learning, where you identify high-confidence predictions from an intermediate model to train future models. The presenter also relates this approach to binary search, which has similar characteristics. He also references OptBayesExpt, a package for Optimal Bayesian Experiment Design.
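As a rough illustration of the "most informative points" idea, here is a generic uncertainty-sampling sketch (not OptBayesExpt's API, and a simpler criterion than the talk's Bayesian one):

    import numpy as np

    def pick_most_informative(proba: np.ndarray, k: int = 10) -> np.ndarray:
        # proba: (n_samples, n_classes) predicted probabilities on the
        # unlabeled pool; the points with the highest predictive entropy
        # are the ones the current model is least sure about.
        entropy = -np.sum(proba * np.log(proba + 1e-12), axis=1)
        return np.argsort(entropy)[-k:]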

Effective GenAI Evaluations: Mitigate Hallucinations and Ship Fast

The presenter is one of the founders of Galileo, a company I follow for their cutting-edge research in areas relating to Generative AI. Among their innovations is ChainPoll, a technique that uses Chain of Thought (CoT) reasoning to determine if an LLM is hallucinating. He then describes Luna-8B (based on the BERT-class DeBERTa-v3-large model), a model now offered as part of the Galileo software, which is capable of detecting hallucinations without CoT. He also talks about LunaFlow, also part of the Galileo software, which wraps the Luna-8B model.

Holistic Evaluation of Large Language Models

The presentation talks about NLP metrics such as BLEU and ROUGE, and how they are not really suitable for evaluating the generated output of LLMs. It then goes on to introduce more advanced metrics such as BERTScore and perplexity. Overall, a good overview of NLP metrics for folks who are new to NLP.
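For reference, perplexity is just the exponentiated average negative log-likelihood of the tokens under the model; a toy computation (the log-probabilities are made-up values):

    import math

    # Hypothetical per-token log-probabilities from a language model.
    log_probs = [-0.2, -1.5, -0.7, -0.1]

    # Lower perplexity means the model found the text less surprising.
    perplexity = math.exp(-sum(log_probs) / len(log_probs))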

Let's get you started with asynchronous programming

I got my start in asynchronous programming via LangChain's ainvoke call, mostly by following examples prescriptively and acting on suggestions in error messages from the Python interpreter. I found this session useful as it gave me a more holistic understanding of asynchronous programming in Python, including learning what a Python coroutine is.
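A minimal asyncio example of the coroutine idea covered in the talk: calling a coroutine function returns a coroutine object, and nothing runs until the event loop drives it.

    import asyncio

    async def fetch(name: str, delay: float) -> str:
        await asyncio.sleep(delay)   # yields control instead of blocking
        return f"{name} done"

    async def main():
        # Both coroutines run concurrently on a single thread.
        results = await asyncio.gather(fetch("a", 1.0), fetch("b", 1.0))
        print(results)   # ['a done', 'b done'] after ~1s, not ~2s

    asyncio.run(main())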

Fairness Tales: Measure / Mitigate Unfair Bias in ML

This presentation describes various fairness metrics that use the distribution of features and labels in the training data itself to determine whether the data (and thence the model) is biased or not. The metrics are illustrated in the context of a recruitment application.
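One of the simplest metrics in this family is demographic parity, the gap in positive-outcome rates across groups; here is a toy version for the recruitment setting (hypothetical data, and only one of several metrics the talk covers):

    import pandas as pd

    # Hypothetical recruitment outcomes by group.
    df = pd.DataFrame({
        "group": ["a", "a", "a", "b", "b", "b"],
        "hired": [1, 1, 0, 1, 0, 0],
    })

    rates = df.groupby("group")["hired"].mean()
    parity_gap = rates.max() - rates.min()   # 0.0 would be perfect parity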

Understanding Polars Data Types

A good general overview of data types used in Polars and what each is good for. I am trying to move off Pandas and on to Polars for new projects, so I thought this was useful.
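A small example of the kind of type-awareness the talk encourages; the cast to Categorical saves memory on repeated strings, and List columns keep ragged data in a single typed column:

    import polars as pl

    df = pl.DataFrame({
        "city": ["oslo", "pune", "oslo"],
        "temps": [[1.2, 3.4], [9.9], [2.0, 4.1]],
    }).with_columns(pl.col("city").cast(pl.Categorical))

    print(df.schema)   # city is Categorical, temps is List(Float64)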

Build simple and scalable data pipelines with Polars and DeltaLake

This was a very interesting presentation that showed the challenges of building a pipeline over data which may need to be updated retroactively and whose format may change over time. The presenter shows that using Polars (which uses the Parquet file format by default) or Pandas (with the Parquet file format) along with DeltaLake (a standalone Rust-based implementation called delta-rs) can address all these problems very effectively, as well as provide ACID query and update guarantees on the data. I also learned that DeltaLake does not imply Spark or Databricks, as I had previously thought.
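A minimal sketch of the Polars side, assuming the deltalake (delta-rs) package is installed; the path and data are hypothetical:

    import polars as pl

    df = pl.DataFrame({"id": [1, 2], "score": [0.5, 0.9]})

    # Writes a Delta table (Parquet files plus a transaction log) via
    # delta-rs -- no Spark or Databricks involved.
    df.write_delta("/tmp/scores")

    # Later runs can append, and readers get a consistent snapshot.
    df.write_delta("/tmp/scores", mode="append")
    snapshot = pl.read_delta("/tmp/scores")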

Measuring the User Experience and the impact of Effort on Business Outcomes

Another presentation that is not about libraries or application development. The presenter describes the defining features of user experience within an application, and shows that User Effort, i.e. how much effort the user has to expend to achieve their goals, is the most meaningful success metric. She then describes some possible approaches, both statistical and domain derived, to derive the User Effort metric for a given application.

Day 2 -- 04-Dec-2024

Boosting AI Reliability: Uncertainty Quantification with MAPIE

This presentation describes the MAPIE library, which is described as a Model Agnostic Prediction Interval Estimator, used for quantifying the uncertainty and risk of ML models. It can be used to compute conformal prediction intervals (similar to confidence intervals, but predicting a range of values for future observations) and to calibrate models (transform model scores into probabilities). It can be called via a wrapper around any Scikit-Learn (or compatible) model.
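A short sketch of the wrapper pattern using MAPIE's classic (pre-1.0) regression API; newer releases have reworked the interface, so treat the exact names as version-dependent:

    from mapie.regression import MapieRegressor
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    mapie = MapieRegressor(RandomForestRegressor(), method="plus", cv=5)
    mapie.fit(X_tr, y_tr)

    # y_pis has shape (n_samples, 2, 1): lower and upper bounds per point.
    y_pred, y_pis = mapie.predict(X_te, alpha=0.1)   # ~90% coverage intervals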

The art of wrangling your GPU Python Environments

This presentation discusses the challenges of correctly configuring GPU environments, given the myriad dependencies across hardware, drivers, CUDA, C++ and Python. The presenters describe how the Conda package manager handles this via virtual packages, which allow it to express constraints on GPU capabilities that are provided by the system rather than by Conda itself. They also describe RAPIDS (the presenters are from NVIDIA) and Rapids Doctor (also from NVIDIA), a new tool that allows users to quickly resolve GPU issues.

Extraction Pipelines: ColPali's Vision Powered RAG for Enterprise Documents

ColPali is a recent and encouraging approach to "Multimodal RAG". Effectively, it cuts up an input PDF into patches, encodes them via a specialized multimodal-aware embedding, then uses ColBERT-style late interaction to find the parts of the input that best satisfy the query. This presentation covers how ColPali works, effectively enabling the pipeline to "see" and reason over documents.
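The late-interaction scoring step is simple enough to sketch in a few lines of numpy (a generic MaxSim illustration, not ColPali's actual code):

    import numpy as np

    def late_interaction_score(query_emb: np.ndarray, patch_emb: np.ndarray) -> float:
        # query_emb: (q, d) query-token embeddings; patch_emb: (p, d)
        # page-patch embeddings, both L2-normalized. For each query token,
        # take its best-matching patch, then sum over query tokens.
        sims = query_emb @ patch_emb.T        # (q, p) cosine similarities
        return float(sims.max(axis=1).sum())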

Fast, Intuitive Feature Selection via regression on Shapley Values

This presentation describes a novel approach to feature selection. Ordinarily, one would find the most important features by adding or removing features one by one and training a model for a few epochs each time. This approach instead derives the Shapley values once, does a linear or logistic regression of the target on the Shapley values of the features, and uses the results to implement a feature selection heuristic that is competitive with the earlier, more heavyweight approaches. They provide an open source library, shap-select, that implements this approach.
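A rough sketch of the core idea (not shap-select's actual implementation, whose heuristic has more detail): compute Shapley values on a validation set, then inspect the significance of a regression of the target on them.

    import shap
    import statsmodels.api as sm
    from sklearn.datasets import make_classification
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

    model = GradientBoostingClassifier().fit(X_tr, y_tr)
    sv = shap.TreeExplainer(model).shap_values(X_val)  # (n_samples, n_features)

    # Logistic regression of the target on the Shapley values; features with
    # insignificant coefficients are candidates for removal.
    fit = sm.Logit(y_val, sm.add_constant(sv)).fit(disp=0)
    print(fit.pvalues)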

Keynote: Do Python and Data Science Matter in our AI Future?

Not sure if the presenter ended up answering the question (he likely did, and I might have missed it). But he raised some very important issues about software (especially Open Source software) being more about relationships than property, and how collaboration is bigger than capitalism. One of his observations that resonated with me was that Open Source is a path to permission-less innovation. Another interesting observation was that a dataset is just a quantized, frozen model.

GraphRAG: Bringing together graph and vector search to empower retrieval

This presentation posits that vector search can be augmented by graph-based search, and then demonstrates this by augmenting a Naive RAG pipeline (query -(retriever)-> context, query + context -(LLM)-> answer) with a Kuzu-backed graph DB. I learned several things from this presentation -- first, it is probably more convenient to use Kuzu instead of Neo4j Community Edition for my graph POCs, and second, more than just the entity-relationship paths, it may be worth looking at returning representative content for entities along these paths. Definitely something to try out in the future.
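For anyone curious how lightweight Kuzu is, here is a minimal embedded example (the schema and data are made up); it speaks Cypher and needs no server, which is what makes it attractive for POCs:

    import kuzu

    db = kuzu.Database("./kg_demo")
    conn = kuzu.Connection(db)

    conn.execute("CREATE NODE TABLE Entity(name STRING, PRIMARY KEY(name))")
    conn.execute("CREATE REL TABLE RELATED_TO(FROM Entity TO Entity)")
    conn.execute("CREATE (:Entity {name: 'PyData'}), (:Entity {name: 'Kuzu'})")
    conn.execute(
        "MATCH (a:Entity {name: 'PyData'}), (b:Entity {name: 'Kuzu'}) "
        "CREATE (a)-[:RELATED_TO]->(b)"
    )

    result = conn.execute(
        "MATCH (a:Entity)-[:RELATED_TO]->(b:Entity) RETURN a.name, b.name"
    )
    while result.has_next():
        print(result.get_next())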

Rapid deduplication and fuzzy matching of large datasets using Splink

This presentation describes Splink, a data linkage library for medium-to-large datasets. Splink is available on Databricks, where it is suitable for deduplicating datasets with 100 million+ records. Interestingly, when we find duplicates within the same dataset it is called deduplication, but when we do this across multiple datasets it is called record linkage.

Statically Compiled Julia for Library Development

Julia is a JIT-compiled language, and it can be called from Python. When called from Python, the Julia functionality is statically compiled down to high-performing native code. Unfortunately, this currently means that the entire Julia runtime is statically linked. This presentation describes work in the Julia community to modify this behavior so that it restricts the modules linked to only those referenced from the exposed entry-points, resulting in smaller and lighter-weight executables.

Let our Optima Combine

This presentation introduces Constraint Optimization and the OR-Tools library from Google. It's been a while since I used Linear Programming or similar tools, so it was nice to know they exist for Python. If I ever end up doing this kind of thing for work or a hobby, I might look at OR-Tools.
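A small taste of the OR-Tools CP-SAT API (a made-up toy problem, just to show the shape of the code):

    from ortools.sat.python import cp_model

    model = cp_model.CpModel()
    x = model.NewIntVar(0, 10, "x")
    y = model.NewIntVar(0, 10, "y")
    model.Add(x + 2 * y <= 14)    # a constraint
    model.Maximize(x + y)         # the objective

    solver = cp_model.CpSolver()
    status = solver.Solve(model)
    if status in (cp_model.OPTIMAL, cp_model.FEASIBLE):
        print(solver.Value(x), solver.Value(y))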

Unlocking the Power of Hybrid Search: A Deep Dive into Python powered Precision and Scalability

This presentation described a Hybrid RAG pipeline that combines vector and lexical search, with an RRF (Reciprocal Rank Fusion) head to merge the results, and showed that the merged results end up being generally more useful for answer generation, since they combine the best of both worlds.
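RRF itself is only a few lines of code: every document earns 1 / (k + rank) from each ranking it appears in, with k = 60 being the value from the original RRF paper. A generic sketch:

    def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
        # rankings: one ranked list of doc ids per retriever,
        # e.g. [vector_hits, lexical_hits].
        scores: dict[str, float] = {}
        for ranking in rankings:
            for rank, doc_id in enumerate(ranking, start=1):
                scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    merged = rrf_merge([["d1", "d2", "d3"], ["d3", "d1", "d4"]])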

Automatic Differentiation, a tale of two languages

The presentation looks at the differences between Python and Julia with respect to how AutoDiff functionality is implemented. With Python, it is part of external frameworks like Pytorch / Tensorflow / JAX, whereas with Julia it is part of the language. Julia has multiple pluggable AutoDiff implementations that can be used in different situations. This talk also helped address some questions about differences between the Pytorch and Tensorflow AutoDiff implementations that came up in our Deep Learning book reading group on TWIML.
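On the Python side, framework-level reverse-mode autodiff looks like this (standard Pytorch, nothing specific to the talk):

    import torch

    x = torch.tensor(2.0, requires_grad=True)
    y = x ** 3 + 2 * x    # the graph is recorded as the ops execute
    y.backward()          # reverse-mode autodiff
    print(x.grad)         # dy/dx = 3*x**2 + 2 = 14.0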

Navigating Cloud Expenses in Data and AI: Strategies for Scientists and Engineers

The presentation describes the Open Source Metaflow library and its managed version Outerbounds, meant to help with development and deployment of DS / ML / AI projects. An interesting observation from the presenter is the complementarity of requirements from the Data Scientist versus the Operations Engineer. The presenter identifies issues such as GPU rent-vs-buy decisions, the human-vs-infra cost tradeoff and the importance of choosing the right instance type for the problem being solved, and shows how Outerbounds helps to identify and solve these issues.
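The instance-type point maps directly onto Metaflow's per-step resource declarations; a minimal flow sketch (the step names and numbers are made up):

    from metaflow import FlowSpec, resources, step

    class TrainFlow(FlowSpec):

        @step
        def start(self):
            self.next(self.train)

        # Declaring resources per step lets the scheduler size an instance
        # for this step only, instead of renting a GPU for the whole run.
        @resources(gpu=1, memory=16000)
        @step
        def train(self):
            self.next(self.end)

        @step
        def end(self):
            pass

    if __name__ == "__main__":
        TrainFlow()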

Julia ML Ecosystem

Last time I looked at Julia, it was just starting out as a "Data Science language" that had nowhere close to the ecosystem that Python had (and continues to have). This presentation showed me a different (and much improved) picture, where it has already implemented equivalents for linear algebra (similar to Numpy / Pytorch / JAX), dataframe processing (DataFrames.jl and Tidier.jl, analogous to Pandas / Polars), visualization (Makie.jl, Plots.jl and AlgebraOfGraphics.jl, analogous to Matplotlib / Seaborn), Machine Learning (MLJ.jl, analogous to Scikit-Learn) and Deep Learning (Flux.jl, analogous to Keras), etc. In addition, it is possible to call Python from Julia (and vice versa), so the two can take advantage of each other's ecosystems.
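The vice-versa direction, calling Julia from Python, is a pip install away via the juliacall package (the package and its seval call are real; the snippet itself is just a toy):

    from juliacall import Main as jl   # pip install juliacall

    jl.seval("using Statistics")
    print(jl.seval("mean([1.0, 2.0, 3.0])"))   # 2.0, computed in Julia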

Pytorch Workflow Mastery: A Guide to Track and Optimize Model Performance

This presentation is a good introduction to using Pytorch, demonstrating how to build a basic Convolutional Neural Network and train it with images from the CIFAR-10 dataset. It covers a few things that have become available / popular since I started working with Pytorch, so these parts were useful. Among them are the use of model.compile to generate a compiled model (similar to Tensorflow's Data Flow Graph), the use of canned metrics via the torchmetrics package, and integration with Weights and Biases (wandb.init()) and with Optuna for Bayesian hyperparameter optimization.
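A compressed sketch of two of those newer pieces together, using the torch.compile form of compilation (with a stand-in linear model rather than the talk's CNN, and random tensors in place of CIFAR-10):

    import torch
    import torch.nn as nn
    import torchmetrics

    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
    compiled = torch.compile(model)     # PyTorch 2.x graph capture / fusion

    metric = torchmetrics.classification.MulticlassAccuracy(num_classes=10)
    x = torch.randn(8, 3, 32, 32)       # stand-in for a CIFAR-10 batch
    labels = torch.randint(0, 10, (8,))

    logits = compiled(x)
    metric.update(logits, labels)
    print(metric.compute())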

New Features in Apache Spark 4.0

I attended this presentation because I am a former Spark user. I haven't used it (at least not heavily) for the last couple of years, since the data I need is now more conveniently available in Snowflake. But I was curious about what new functionality it has gained since I last used it. The presentation covers the ANSI SQL mode, the VARIANT data type that allows JSON and XML data to be natively parsed (up to 8x faster), the changes in Spark Connect that decouple the client from the server (making possible Spark connectors in various languages such as Rust and Go), parameterized queries, and User Defined Table Functions.
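Two of those features are easy to show from PySpark; parameterized queries have been around since Spark 3.4, while parse_json / VARIANT is the Spark 4.0 addition (treat the exact SQL below as an assumption to check against the 4.0 docs):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Parameterized SQL: no string interpolation, no injection worries.
    spark.sql("SELECT * FROM range(10) WHERE id > :lo", args={"lo": 5}).show()

    # VARIANT: parse JSON once, then query fields natively.
    spark.sql("""SELECT parse_json('{"a": 1, "b": [2, 3]}') AS v""").show()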

Day 3 -- 05-Dec-2024

The LEGO Approach to designing PyData workflows

The presenter describes her idea of designing application systems from components built to interlock with each other like Lego bricks, and her implementation of these ideas in the DataJourney framework.

Time Series Analysis with StatsModels

This was a workshop conducted by Allen Downey, the author of Think Stats. Specifically, this workshop covered Chapter 12 of the book, applying the statsmodels library to Time Series analysis. The workshop uses statsmodels to decompose a time series representing electricity generation over the last 20+ years into trend, seasonal and random components, to predict future data points from past data using additive and multiplicative decompositions, and to fit ARIMA (autoregressive integrated moving average) models. I feel like I understand time series and ARIMA better than I used to, although I am sure I have just scratched the surface of this topic.
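The statsmodels calls involved are compact; a toy version with a synthetic monthly series standing in for the electricity data:

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.arima.model import ARIMA
    from statsmodels.tsa.seasonal import seasonal_decompose

    idx = pd.date_range("2000-01-01", periods=120, freq="MS")
    rng = np.random.default_rng(0)
    series = pd.Series(
        np.linspace(50, 80, 120)                              # trend
        + 10 * np.sin(np.arange(120) * 2 * np.pi / 12)        # seasonality
        + rng.normal(0, 2, 120),                              # noise
        index=idx,
    )

    parts = seasonal_decompose(series, model="additive", period=12)
    trend, seasonal, resid = parts.trend, parts.seasonal, parts.resid

    forecast = ARIMA(series, order=(1, 1, 1)).fit().forecast(steps=12)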

Building an AI Travel Agent that never Hallucinates

Hallucination is a feature of LLMs rather than a bug. So it seems like a tall order to build an AI Travel Agent (or any LLM-based agent in general) that never hallucinates. However, and somewhat obviously in hindsight, one way to address the problem is to severely limit the agent's decision-making capabilities. The CALM (Conversational AI with Language Models) framework from Rasa implements this by setting up the equivalent of a phone tree and giving the LLM only the capability to jump from node to node in the tree. I thought this was brilliant, because for most applications where you want an agent, you don't need (or want) full-blown AGI.
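A toy sketch of the phone-tree idea (this is not Rasa's API, just the concept): the LLM may only pick among the current node's outgoing edges, so it can never invent an action.

    # Hypothetical conversation flow for a travel agent.
    FLOW = {
        "start": ["search_flights", "cancel_booking"],
        "search_flights": ["pick_flight", "start"],
        "pick_flight": ["confirm", "start"],
        "cancel_booking": ["confirm", "start"],
        "confirm": ["start"],
    }

    def next_node(current: str, llm_choice: str) -> str:
        # Constrain the model: anything outside the allowed transitions
        # is rejected, so hallucinated actions cannot be executed.
        return llm_choice if llm_choice in FLOW[current] else current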

Evaluating RAGs: On the correctness and coherence of Open Source eval metrics

This presentation is a bit meta: it evaluates the LLM evaluation metrics available from Open Source frameworks such as RAGAS and TruLens, across different LLMs like Claude Sonnet, GPT-3.5 and GPT-4, Llama2-70B and Llama3-70B. The results show that these metrics yield wildly different values for the same content. The presenters indicate that future work is needed to evaluate these results against human judgment.

Building Knowledge Graph based Agents with Structured Text Generation and Open-Weights Models

This was a great presentation on using Structured Text Generation (via the outlines library) to build a Knowledge Graph from content. Structured Text output also makes it convenient to model Agents that execute actions through function calls. The presenter uses these ideas to first generate a Knowledge Graph from a dataset, then implements an Agentic Query pipeline that queries this Knowledge Graph.
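The shape of the structured-generation step, using outlines' pre-1.0 API (the model name and schema are my own placeholders, and the API has been evolving, so verify against the current docs):

    from pydantic import BaseModel
    import outlines

    class Triple(BaseModel):
        subject: str
        predicate: str
        object: str

    model = outlines.models.transformers("microsoft/Phi-3-mini-4k-instruct")
    generator = outlines.generate.json(model, Triple)

    # The output is guaranteed to parse into a Triple, ready to insert
    # into a knowledge graph.
    triple = generator("Extract one fact: PyData Global 2024 was held online.")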

From Features to Inference: Build a Core ML Platform from Scratch

This is a very impressive live-coding presentation where the presenter sets up an ML pipeline from scratch, including an Inference Engine, Model Registry, Feature Store and an Event Bus to connect them all together using an Event Driven design. One good piece of advice here was to align the software with the language of the business, i.e. domain-driven design. Another was to build "default" implementations that you can write tests against, and replace them with "real" components as and when they come up. Expectations for these components are already codified in the unit tests, so the new components must satisfy the same expectations. There are some very interesting (dependency-injection-like) code patterns, some of which reminded me of my Java / Spring days.
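The "default implementation" pattern in miniature (a generic sketch, not the presenter's code):

    from typing import Protocol

    class FeatureStore(Protocol):
        def get(self, entity_id: str) -> dict: ...

    class InMemoryFeatureStore:
        # The "default" implementation: enough to write tests against,
        # swapped for a real store later without touching the callers.
        def __init__(self) -> None:
            self._data: dict[str, dict] = {}

        def get(self, entity_id: str) -> dict:
            return self._data.get(entity_id, {})

    class InferenceService:
        def __init__(self, store: FeatureStore) -> None:
            self.store = store   # injected dependency, Spring-style

        def predict(self, entity_id: str) -> float:
            features = self.store.get(entity_id)
            return float(len(features))   # stand-in for a real model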

Putting the Data Science back into LLM Evaluation

This presentation covers a lot of familiar ground for folks who have worked with LLMs for some time. However, there are some new ideas here as well. One of them is the use of heuristic-based guardrails, such as checking the length of the output, matching patterns in the output using regexes, and using computed metrics such as Flesch-Kincaid scores. Another is the use of chatbot-arena style scoring to evaluate relative improvements. The presenters have created Parlance, an open source LLM evaluation tool that implements such a chatbot-arena style model-to-model comparison metric.
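Heuristic guardrails of this sort are cheap to write; a sketch (the thresholds and patterns are arbitrary, and the readability check uses the textstat package):

    import re
    import textstat   # provides flesch_kincaid_grade, among others

    def passes_guardrails(answer: str) -> bool:
        if not 20 <= len(answer.split()) <= 200:   # length budget
            return False
        if re.search(r"(?i)as an ai language model", answer):
            return False                           # boilerplate tell
        return textstat.flesch_kincaid_grade(answer) <= 12.0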

Making Gaussian Processes Useful

The presentation is about Gaussian Processes, but because these appear here as part of hierarchical probabilistic models, which most people are not that familiar with, the first part introduces PyMC and hierarchical models; the second part then covers how Gaussian Processes can model the effect of a continuous variable as a family of functions rather than as a single variable. I watched this presentation because I was familiar with probabilistic hierarchical models, having used PyMC3 in the past, when it was backed by the forked version of Theano and NUTS was the state-of-the-art sampler. Now it is backed by JAX and there is an even faster sampler based on Rust. But GPs were new to me, so I learned something new.
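A minimal PyMC v5 sketch of a GP regression, in the spirit of the talk (the data is synthetic; the API names are current as far as I know, but worth double-checking against the PyMC docs):

    import numpy as np
    import pymc as pm

    X = np.linspace(0, 10, 50)[:, None]
    y = np.sin(X).ravel() + 0.1 * np.random.default_rng(0).normal(size=50)

    with pm.Model():
        ell = pm.Gamma("ell", alpha=2, beta=1)    # lengthscale prior
        cov = pm.gp.cov.ExpQuad(1, ls=ell)        # a family of smooth functions
        gp = pm.gp.Marginal(cov_func=cov)
        sigma = pm.HalfNormal("sigma", sigma=0.5)
        gp.marginal_likelihood("y_obs", X=X, y=y, sigma=sigma)
        idata = pm.sample()   # NUTS by default; nutpie (Rust) is faster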

I might watch a few more presentations when I have time. PyData / NumFOCUS are generally very good about sharing the presentations openly, but it is likely to be 1-2 months before that happens. I will watch for the announcement and update this post with the information, but in the meantime, that's all I have to say about PyData Global 2024. I hope you found it interesting and useful.
