Saturday, December 09, 2023

PyData Global 2023: Trip Report

I had the opportunity to present at PyData Global this year. It is a virtual conference that ran over three days (December 6 to 8) in multiple tracks. I talked about Building Learning to Rank models for search using Large Language Models. For those who attended the conference, I have already shared the links to the slides and the associated code on its Discord channel; for those who did not, they are linked below.

As a speaker, I got a complimentary pass to attend the conference. Because it was virtual, I tended to pick and choose talks based on their titles, fitting the conference around my work schedule rather than giving it my full attention as I would an in-person conference. On the flip side, talks were recorded, so even though they were presented in multiple parallel tracks, I could always listen to a recording if I missed the live event. The list below may therefore not be as complete as it would be for an in-person event, but it probably represents my interests more closely. So there is a trade-off, and I thought the virtual format worked out well for me in this case.

I was also quite impressed by the features of the Airmeet conferencing platform that was used to host PyData Global. One immediate advantage for attendees is that the schedule automatically links to live talks as they occur and to their recordings once they end. There is also a virtual backstage, where speakers work with the host to verify that their camera, audio, and screen sharing work. My screen sharing didn't work initially, and after a few panic-filled moments it turned out that my Chrome browser did not have permission to share the screen. Overall, it is definitely a step up from the Zoom and MS Teams setups, with their heavy reliance on human coordination, that we use for our internal conferences.

In any case, if you were also at PyData Global, you have your own set of talks that you attended. I list mine below, along with what I learned from them. If you find one here that you missed and like my review, you might consider going back to the Airmeet platform and watching it as well. For those who did not attend, I believe the organizers will move these talks to the PyData public channel on YouTube around the end of the year, so these reviews might help you choose which ones to watch once they become available.

Day 1

  • Arrow Revolution in Pandas and Dask -- this was a fairly technical presentation about how using PyArrow as a Pandas backend instead of NumPy has improved Pandas response times, along with a discussion of how copy-on-write improves Pandas performance. The presenter also covered the new query optimizer for Dask, which can automatically rewrite the input operations DAG (directed acyclic graph) to be more performant via an additional optimize() call. I ended up learning a lot, although my initial curiosity going in was mainly about PyArrow for Parquet support in Pandas and interoperability with Spark. A small sketch of the two Pandas features appears after this list.
  • Extremes, Outliers and GOATS: on life in a lognormal world -- a really informative and entertaining talk by the great Allen Downey about his thesis, backed by data, that real-world phenomena can often be modeled better using a lognormal distribution than the more ubiquitous Gaussian (normal) distribution. He also makes a logical case for why this may be so, especially for outlier events. If you find statistical modeling interesting (and probably even if you don't), you won't want to miss this talk. A small fitting example appears after this list.
  • But what is a Gaussian Process? Regression while knowing how certain you are -- a great presentation on the intuition behind Gaussian Processes (GPs). I had heard the name before but didn't know (still don't, to be honest) how they can be used to solve real-world problems. Perhaps an example using PyMC or scipy.stats around a particular data science use case might have been more useful; I include a small scikit-learn regression example after this list. However, the intuition is important, and maybe this understanding will help me find a good use case and implement a solution faster.
  • Build and deploy a Snowflake Native Application using Python -- I want to learn how to work with Snowflake using Python. I know there are tons of tutorials for this, but I was hoping this talk would provide a quick example-driven overview and save me some time. However, it is a very specific tutorial targeted at Snowflake app developers, on how to package up their product so it can be listed on the Snowflake App Marketplace. So while it does cover some of what I was looking for, that material is only a subset of what is presented. Unless you are an aspiring Snowflake app developer, I think you may be better off learning from subject-specific tutorials on the Snowflake website.
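
For reference, here is a minimal sketch (mine, not the presenter's) of the two Pandas features covered in the Arrow Revolution talk: the PyArrow dtype backend and copy-on-write. The file name is hypothetical.

```python
import pandas as pd

# Copy-on-write avoids defensive copies; it is slated to become the default
# in pandas 3.0, but in pandas 2.x it must be enabled explicitly.
pd.options.mode.copy_on_write = True

# Ask read_parquet to back the resulting DataFrame with PyArrow arrays
# instead of NumPy arrays (pandas >= 2.0).
df = pd.read_parquet("data.parquet", dtype_backend="pyarrow")

# Existing frames can be converted to the PyArrow backend as well.
df2 = pd.DataFrame({"a": [1, 2, 3]}).convert_dtypes(dtype_backend="pyarrow")
print(df2.dtypes)  # a    int64[pyarrow]
```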
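And here is a quick illustration of Allen Downey's point, on synthetic data of my own making: heavy-tailed data fits a lognormal distribution far better than a normal one, which shows up directly in the log-likelihoods.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic heavy-tailed data, e.g. incomes or city sizes.
data = rng.lognormal(mean=3.0, sigma=0.8, size=10_000)

# Fit both distributions by maximum likelihood and compare.
norm_params = stats.norm.fit(data)
lognorm_params = stats.lognorm.fit(data, floc=0)

ll_norm = stats.norm.logpdf(data, *norm_params).sum()
ll_lognorm = stats.lognorm.logpdf(data, *lognorm_params).sum()
print(f"normal log-likelihood:    {ll_norm:.1f}")
print(f"lognormal log-likelihood: {ll_lognorm:.1f}")  # much higher
```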
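Finally, the kind of GP example I was hoping for: a minimal Gaussian Process regression with scikit-learn (not code from the talk), showing the "knowing how certain you are" part via the predictive standard deviation.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(20, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=20)

# alpha adds observation noise to the kernel diagonal.
gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=0.1 ** 2)
gp.fit(X, y)

X_test = np.linspace(0, 10, 5).reshape(-1, 1)
mean, std = gp.predict(X_test, return_std=True)  # std grows away from the data
for x, m, s in zip(X_test.ravel(), mean, std):
    print(f"x={x:4.1f}  f(x)={m:+.2f} ± {2 * s:.2f}")
```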

Day 2

  • Pandas 2, Dask or Polars: Quickly tackling larger data on a single machine -- a comparative study of the three popular (only?) DataFrame manipulation libraries in Python, in terms of functionality and performance. Having switched recently from Pandas to Polars, and having used Dask earlier for multi-machine jobs, I got a lot out of the talk, including some validation that the move to Polars was probably a good long-term decision. I also learned that Dask was originally built to exploit multiple cores on a single machine, and only later added the scheduler to distribute jobs across multiple machines. A side-by-side snippet of the three APIs follows this list.
  • Who needs ChatGPT? Rock solid AI pipelines with HuggingFace and Kedro -- the main thing I got out of this talk was its coverage of Kedro, an ML development framework originally developed at McKinsey & Co and since open sourced. I had heard of Kedro before but hadn't had time to check it out. The main idea of Kedro seems to be to represent the ML pipeline as a declarative DAG, with a YAML data catalog and other features, such as a mandated project structure, that support this configuration (a tiny pipeline sketch follows this list). The presenter walks through a use case involving HuggingFace models. Now that I understand Kedro a bit, I think I might try to use it for my next project.
  • Customizing and Evaluating LLMs, an Ops Perspective -- this talk has a lot of useful information if you are new to application development with generative Large Language Models (LLMs). Less so for me, having been on both the development end and, more recently, the evaluation end of an LLM-based application. But it is definitely good to learn about best practices in this area in general. Two software packages I got from the presentation are giskard and deepchecks. I had originally looked at giskard in connection with LLM bias evaluation, and deepchecks seems to be more MLOps/observability-oriented evaluation tailored to LLMs, but I need to look at both further.
  • Optimize first, parallelize second: a better path to faster data processing -- the theme of this presentation is to optimize your base job as much as possible before trying to parallelize it, which I think is really good advice. To that I would also add (based on experience): make sure the job functions correctly first, because otherwise you end up with lots of garbage after having spent a lot of compute and time. Optimizing the base job also means the parallelized version completes sooner, since parallelization multiplies the effect of whatever optimization effort you put in. A small before-and-after illustration follows this list.
  • Real Time Machine Learning -- the main idea behind this presentation is the creation of a training sample for real-time ML that accurately reflects the historical distribution but does not grow drastically in size. This is achieved through coresets, which are data samples drawn from consecutive windows of time that accurately reflect the overall data distribution in each window (a naive illustration of the windowing idea follows this list). The presenters are part of DataHeroes AI, which provides a coreset implementation. I haven't worked on real-time training, so I don't have an immediate need for this, but it's good to know. Maybe we can use the idea when retraining models to address drift.
  • Tricking Neural Networks: Explore Adversarial Attacks -- this is something I am interested in, although I have almost no experience with it. I thought the presentation did a good job of covering some basic theory behind adversarial attacks and highlighting some use cases (see the FGSM sketch after this list). There is also a list of papers in the slides that may be useful for getting more information.
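
Here is a side-by-side sketch (mine, not from the Pandas/Dask/Polars talk) of the same aggregation in the three libraries being compared:

```python
import pandas as pd
import polars as pl
import dask.dataframe as dd

data = {"group": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]}

# Pandas: eager execution.
pdf = pd.DataFrame(data)
print(pdf.groupby("group")["value"].sum())

# Polars: the lazy API lets the query optimizer see the whole plan.
ldf = pl.LazyFrame(data)
print(ldf.group_by("group").agg(pl.col("value").sum()).collect())

# Dask: partitions the frame and schedules work across cores (or machines).
ddf = dd.from_pandas(pdf, npartitions=2)
print(ddf.groupby("group")["value"].sum().compute())
```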
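To give a flavor of Kedro, here is a minimal pipeline sketch; the function and dataset names are hypothetical, and in a real Kedro project the datasets ("raw_reviews", etc.) would be declared in the YAML data catalog (conf/base/catalog.yml).

```python
from kedro.pipeline import node, pipeline

def clean_text(raw_reviews):
    return [r.strip().lower() for r in raw_reviews]

def embed(clean_reviews):
    # Stand-in for a HuggingFace model call.
    return [hash(r) for r in clean_reviews]

def create_pipeline(**kwargs):
    # Nodes are wired into a DAG by matching input/output dataset names.
    return pipeline(
        [
            node(clean_text, inputs="raw_reviews", outputs="clean_reviews"),
            node(embed, inputs="clean_reviews", outputs="review_embeddings"),
        ]
    )
```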
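As an illustration of "optimize first, parallelize second" (my example, not the presenter's): vectorizing the base computation already gives a large speedup before any parallelism is involved, and any later parallelization multiplies that gain.

```python
import time
import numpy as np

xs = np.random.default_rng(1).normal(size=2_000_000)

# Unoptimized base job: a pure-Python loop.
t0 = time.perf_counter()
slow = sum(x * x for x in xs)
t1 = time.perf_counter()

# Optimized base job: vectorized with NumPy.
fast = float(np.dot(xs, xs))
t2 = time.perf_counter()

print(f"loop:       {t1 - t0:.3f}s")
print(f"vectorized: {t2 - t1:.3f}s")  # typically orders of magnitude faster
assert np.isclose(slow, fast)
```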
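DataHeroes has its own coreset construction, which I have not used; the following is only a naive stand-in to illustrate the windowing idea from the Real Time Machine Learning talk: keep a small, label-stratified sample per time window, so the retained data tracks the historical distribution without growing unboundedly.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def windowed_sample(X, y, timestamps, window_size, keep_fraction=0.05, seed=0):
    """Naive per-window stratified sampling (not a true coreset)."""
    kept_X, kept_y = [], []
    for w0 in np.arange(timestamps.min(), timestamps.max(), window_size):
        mask = (timestamps >= w0) & (timestamps < w0 + window_size)
        if mask.sum() < 2:
            continue
        # Stratify on labels so each window's class balance is preserved.
        Xw, _, yw, _ = train_test_split(
            X[mask], y[mask],
            train_size=keep_fraction,
            stratify=y[mask],
            random_state=seed,
        )
        kept_X.append(Xw)
        kept_y.append(yw)
    return np.concatenate(kept_X), np.concatenate(kept_y)
```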
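The classic adversarial attack covered in most introductions to the topic is FGSM (Fast Gradient Sign Method). This PyTorch sketch is my illustration, not code from the talk: it nudges the input in the direction that increases the loss, by an imperceptibly small epsilon.

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # Perturb each pixel by +/- epsilon in the direction of increasing loss.
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0, 1).detach()

# Usage: x_adv often flips the model's prediction even though it looks
# identical to x to a human.
# x_adv = fgsm_attack(model, images, labels)
```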

Day 3

  • Accelerating fuzzy document deduplication to improve LLM training with RAPIDS and Dask -- I attended this talk because I was curious about the "fuzzy document deduplication" mentioned in the title, but the talk also covered RAPIDS and Dask, both of which obviously help with performance. In any case, the fuzzy deduplication works by hashing the documents with MinHash, bucketing them, doing an all-pairs exact match within each bucket using Jaccard similarity, and then treating only the high-scoring document pairs as duplicates and removing them. I thought it was a cool idea that solves an O(n²) task in a scalable manner; a small single-machine sketch appears after this list.
  • Training large scale models with PyTorch -- the presentation started with the clearest description of scaling laws (more data, more parameters, more training) I have heard so far, and described advanced PyTorch distributed training functionality that addresses the scaling issues associated with each of these dimensions. I use PyTorch and have only recently started encountering situations where I might need these features, so I found this presentation really useful. A skeletal DistributedDataParallel example follows this list.
  • Modeling extreme events with PyMC -- I had attended a presentation on the intuition behind Gaussian Processes (GPs) earlier, and this presentation showed a few case studies where extreme events (mainly climate change events) are modeled using GPs. I thought these were fascinating, and I understand GPs a little better now, but I think I need to work at it some more; a minimal PyMC GP sketch follows this list.
  • Keras 3 for the Curious and Creative -- I attended this talk because I am excited about the new features of Keras 3, which has gone back to its roots as a multi-framework Deep Learning API (TensorFlow, PyTorch and JAX), and I was looking for an in-depth run through the release notes, perhaps with some examples from the Keras documentation covering specific functionality, like the quick overviews directed at engineers and scientists. The presentation turned out to be more of an introduction to Deep Learning with Keras, which wasn't exactly what I was looking for. A minimal multi-backend example follows this list.
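
Here is a small sketch of the MinHash + LSH bucketing scheme described in the deduplication talk, using the datasketch library on a single machine (the talk itself used RAPIDS and Dask to scale this out).

```python
from datasketch import MinHash, MinHashLSH

docs = {
    "d1": "the quick brown fox jumps over the lazy dog",
    "d2": "the quick brown fox jumped over a lazy dog",
    "d3": "an entirely different document about pydata",
}

def minhash(text, num_perm=128):
    m = MinHash(num_perm=num_perm)
    for token in set(text.split()):
        m.update(token.encode("utf8"))
    return m

# LSH buckets the documents so only likely duplicates get compared pairwise.
lsh = MinHashLSH(threshold=0.5, num_perm=128)
sigs = {k: minhash(v) for k, v in docs.items()}
for k, m in sigs.items():
    lsh.insert(k, m)

print(lsh.query(sigs["d1"]))  # d1 and d2 share a bucket; d3 does not
```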
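For the PyTorch scaling talk, here is a skeletal DistributedDataParallel setup, the most basic of the tools it covered (it also went into more advanced options). This is a generic sketch, not the presenter's code; launch it with torchrun.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = f"cuda:{local_rank}"

    model = torch.nn.Linear(128, 10).to(device)
    model = DDP(model, device_ids=[local_rank])  # gradients sync across ranks

    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    x = torch.randn(32, 128, device=device)
    y = torch.randint(0, 10, (32,), device=device)
    loss = torch.nn.functional.cross_entropy(model(x), y)
    loss.backward()
    opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # e.g., torchrun --nproc_per_node=4 train.py
```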
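And a minimal PyMC Gaussian Process regression in the spirit of the extreme events talk, on synthetic data of my own (the actual case studies used climate data); the API shown is PyMC 5.

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(7)
X = np.linspace(0, 10, 30)[:, None]
y = np.sin(X).ravel() + rng.normal(0, 0.2, size=30)

with pm.Model():
    # Priors on the kernel length scale and observation noise.
    length_scale = pm.Gamma("length_scale", alpha=2, beta=1)
    sigma = pm.HalfNormal("sigma", sigma=0.5)

    cov = pm.gp.cov.ExpQuad(input_dim=1, ls=length_scale)
    gp = pm.gp.Marginal(cov_func=cov)
    gp.marginal_likelihood("y_obs", X=X, y=y, sigma=sigma)

    idata = pm.sample(500, tune=500, chains=2)
```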
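Finally, the Keras 3 feature I was hoping the talk would dig into: the same model code runs on TensorFlow, PyTorch or JAX, with the backend chosen before keras is imported.

```python
import os
os.environ["KERAS_BACKEND"] = "jax"  # or "tensorflow" or "torch"

import keras

model = keras.Sequential([
    keras.Input(shape=(784,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()  # identical regardless of the backend selected above
```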

These are all the talks I attended at PyData Global 2023. I might watch a few more of the recordings before the organizers discontinue access to the Airmeet platform and put the talks up on their YouTube channel. Meanwhile, I hope I have provided enough information about these talks for you to make an informed decision.
