Saturday, December 09, 2023

PyData Global 2023: Trip Report

I had the opportunity to present at PyData Global this year. It was a virtual conference that ran over three days, December 6 to 8, with multiple parallel tracks. I talked about Building Learning to Rank models for search using Large Language Models. For those who attended the conference, I have already shared the links to the slides and the associated code on its Discord channel; for those who did not, they are linked below.

As a speaker, I got a complimentary pass to attend the conference. Because it is virtual, I tend to pick and choose talks based on their title, fitting the conference into my work schedule rather than giving it my full attention as I would an in-person conference. On the flip side, talks were recorded, so even though they were presented in multiple parallel tracks, I could always listen to a recording if I missed the live event. The list below may therefore not be as complete as it would have been for an in-person event, but it probably more closely represents my interests. There is a trade-off here, and I thought the virtual format worked out well for me in this case.

I was also quite impressed by the features of the Airmeet conferencing platform that was used to host PyData Global. One immediate advantage for the attendee is that the schedule automatically links to live talks as they occur, and to the recordings once they are complete. There is also a virtual backstage where speakers work with the host to verify that their camera, audio and screen sharing work. My screen sharing didn't work initially, and after a few panic-filled moments it turned out that my Chrome browser did not have permission to share the screen. Overall, definitely a step up from the Zoom and MS Teams setups, with lots of human coordination, that we use for our internal conferences.

In any case, if you were also at PyData Global, you have your own set of talks that you attended. I list mine below, along with what I learned from them. If you find one here that you missed and you like my review, you might consider going back to the Airmeet platform and watching it as well. For those who did not attend, I believe the organizers will move these talks to the PyData public channel on YouTube around the end of the year, so these reviews might help you choose which ones to watch once they become available.

Day 1

  • Arrow Revolution in Pandas and Dask -- this was a pretty technical presentation about how using PyArrow instead of NumPy as the Pandas backend has improved Pandas response times, as well as a discussion of how to use copy-on-write to improve Pandas performance. The presenter also talked about the new query optimizer for Dask, which can automatically rewrite the input operations DAG (directed acyclic graph) to be more performant via an additional optimize() call (see the sketch after this list). I ended up learning a lot, although my initial curiosity going in was mainly about PyArrow for Parquet support in Pandas and interoperability with Spark.
  • Extremes, Outliers and GOATS: on life in a lognormal world -- a really informative and entertaining talk by the great Allen Downey about his thesis, backed by data, that real-world phenomena can often be modeled better using a lognormal distribution than the more ubiquitous Gaussian (normal) distribution. He also makes a logical case for why this may be so, especially for outlier events. If you find statistical modeling interesting (and probably even if you don't) you won't want to miss this talk.
  • But what is a Gaussian Process? Regression while knowing how certain you are -- a great presentation on the intuition behind Gaussian Processes (GPs). I had heard the name before but didn't know (still don't, to be honest) how they can be used to solve real-world problems. Perhaps an example using PyMC or scipy.stats around a particular data science use case might have been more useful. However, the intuition is important, and maybe this understanding will help me find a good use case and implement a solution faster.
  • Build and deploy a Snowflake Native Application using Python -- I want to learn how to work with Snowflake using Python. I know there are tons of tutorials for this, but I was hoping that this talk would provide a quick example-driven overview and save me some time. However, it turned out to be a very specific tutorial targeted at Snowflake app developers, on how to package up their product so it can be listed on the Snowflake App Marketplace. So while it does cover some of what I was looking for, that is only a subset of what is actually presented. Unless you are an aspiring Snowflake app developer, I think you may be better off learning from the subject-specific tutorials on the Snowflake website.
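To make the Pandas and Dask pieces of the Arrow talk concrete, here is a minimal sketch of opting in to the PyArrow backend and copy-on-write in Pandas 2.x, plus asking Dask to rewrite a lazy computation; the file and column names are made up:

```python
import dask
import dask.dataframe as dd
import pandas as pd

# Opt in to copy-on-write, which avoids defensive copies in chained operations
pd.set_option("mode.copy_on_write", True)

# Read a Parquet file with PyArrow-backed dtypes instead of NumPy (Pandas >= 2.0)
df = pd.read_parquet("events.parquet", dtype_backend="pyarrow")

# Build a lazy Dask computation, then ask Dask to optimize the task graph
ddf = dd.read_parquet("events/*.parquet")
result = ddf.groupby("user_id")["amount"].sum()
(optimized,) = dask.optimize(result)  # returns an optimized equivalent collection
print(optimized.compute())
```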

Day 2

  • Pandas 2, Dask or Polars: Quickly tackling larger data on a single machine -- a comparative study of the three popular (only?) Dataframe manipulation libraries in Python in terms of functionality and performance. Having switched recently from Pandas to Polars, and having used Dask for handling multi-machine jobs earlier, I got a lot out of the talk, including some validation that the move to Polars was probably a good decision long term. I also learned that Dask was originally built to exploit multiple cores on a single machine, and only later added the scheduler to distribute the job across multiple machines.
  • Who needs ChatGPT? Rock solid AI pipelines with HuggingFace and Kedro -- the main thing I got out of this talk was its coverage of Kedro, an ML development framework originally developed at McKinsey & Co and since open-sourced. I had heard of Kedro before, but hadn't had the time to check it out. The main idea of Kedro seems to be to represent the ML pipeline DAG as YAML configuration, supported by other features such as a mandated project structure (a minimal sketch follows this list). The presenter walks through a use case involving HuggingFace models. Now that I understand Kedro a bit, I think I might try to use it for my next project.
  • Customizing and Evaluating LLMs, an Ops Perspective -- this talk has a lot of useful information if you are new to application development with generative Large Language Models (LLMs). Not so much for me, having been on both the development end and, more recently, the evaluation end of an LLM-based application. But it is definitely good to learn about best practices in this area in general. Two software packages I got from the presentation are giskard and deepchecks. I had originally looked at giskard in connection with LLM bias evaluation, and deepchecks seems to be more about MLOps / observability-based evaluation tailored to LLMs, but I need to look at both further.
  • Optimize first, parallelize second: a better path to faster data processing -- the theme of this presentation is to optimize your base job to the extent possible before trying to parallelize it, which I think is really good advice. To that I would also add (based on experience): make sure the job functions correctly first, because otherwise you end up with lots of garbage after having spent a lot of compute and time. Optimizing the base job also means the parallelized version completes sooner, since parallelization multiplies the effect of whatever optimization effort you put in.
  • Real Time Machine Learning -- the main idea behind this presentation is the creation of a training sample for real-time ML that accurately reflects the historical distribution but does not grow drastically in size. This is achieved through coresets, data samples drawn from consecutive windows of time that accurately reflect the overall data distribution in each window. The presenters are part of DataHeroes AI, which provides a coreset implementation. I haven't worked on real-time training, so I don't have an immediate need for this, but it's good to know. Maybe the idea can be used for retraining models to address drift.
  • Tricking Neural Networks: Explore Adversarial Attacks -- this is something I am interested in, although I have almost no experience in it. I thought the presentation did a good job at presenting some basic theory behind adversarial attacks and highlighting some use cases. There is also a list of papers in the slides that may be useful to get more information.
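Since Kedro was my main takeaway from Day 2, here is a minimal sketch of what a Kedro pipeline might look like. The node functions and dataset names here are hypothetical; the dataset names would map to entries in Kedro's YAML catalog (conf/base/catalog.yml), which is the configuration-driven aspect I mentioned above:

```python
from kedro.pipeline import Pipeline, node
from sklearn.linear_model import LogisticRegression

def clean(raw_df):
    # hypothetical preprocessing step over a catalog dataset
    return raw_df.dropna()

def train_model(clean_df):
    # hypothetical training step returning a fitted model
    X, y = clean_df.drop(columns=["label"]), clean_df["label"]
    return LogisticRegression().fit(X, y)

# "raw_data", "clean_data" and "model" name datasets declared in the
# YAML catalog; Kedro wires the DAG from these input/output names
pipeline = Pipeline([
    node(clean, inputs="raw_data", outputs="clean_data", name="clean"),
    node(train_model, inputs="clean_data", outputs="model", name="train"),
])
```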

Day 3

  • Accelerating fuzzy document deduplication to improve LLM training with RAPIDS and Dask -- I attended this talk because I was curious about the "fuzzy document deduplication" mentioned in the title, but the talk also covered RAPIDS and Dask, both of which obviously help with improving performance. The fuzzy deduplication works by hashing the documents using MinHash, bucketing them, doing an all-pairs exact match within each bucket using Jaccard similarity, and then treating only the high-scoring document pairs as duplicates and removing them (sketched after this list). I thought it was a cool idea that solves an O(n²) task in a scalable manner.
  • Training large scale models with PyTorch -- the presentation started with the clearest description of scaling laws (more data, more parameters, more training) I have heard so far, and went on to describe advanced PyTorch distributed training functionality that addresses the scaling issues associated with each of these axes. I use PyTorch, and have only recently started encountering issues where I might need to look at these features, so I found this presentation really useful.
  • Modeling extreme events with PyMC -- I had attended a presentation on the intuition around Gaussian Processes (GP) earlier, and this presentation shows a few case studies where extreme events (mainly climate change events) are modeled using GPs. I thought these were fascinating and I understand GPs a little better now, but I think I might need to work at it some more.
  • Keras 3 for the Curious and Creative -- I attended this talk because I am excited about the new features of Keras 3, which has gone back to its roots as a multi-framework Deep Learning API (TensorFlow, PyTorch and JAX), and I was looking for an in-depth run through the release notes, perhaps with some examples from the Keras documentation covering specific functionality, like the quick overviews directed at engineers and scientists. The presentation turned out to be more of an introduction to Deep Learning with Keras, which wasn't exactly what I was looking for.
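The fuzzy deduplication scheme from the RAPIDS talk is easy to prototype on a single machine with the datasketch library. This is a sketch of the algorithm on toy documents, not the GPU-accelerated implementation the presenters used; a real pipeline would shingle the text and use a higher similarity threshold:

```python
from datasketch import MinHash, MinHashLSH

def minhash(text, num_perm=128):
    # build a MinHash signature from the document's tokens
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode("utf8"))
    return m

docs = {
    "d1": "the quick brown fox jumps over the lazy dog",
    "d2": "the quick brown fox jumped over a lazy dog",
    "d3": "an entirely unrelated document about searching",
}
hashes = {key: minhash(text) for key, text in docs.items()}

# bucket documents by approximate similarity using locality sensitive hashing
lsh = MinHashLSH(threshold=0.5, num_perm=128)
for key, mh in hashes.items():
    lsh.insert(key, mh)

# within each bucket, confirm candidate pairs with estimated Jaccard similarity
for key, mh in hashes.items():
    for cand in lsh.query(mh):
        if cand != key:
            print(key, cand, hashes[key].jaccard(hashes[cand]))
```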

These are all the talks I attended at PyData Global 2023. I might watch a few more recorded talks before the organizers discontinue access to the Airmeet platform and put them up on their YouTube channel. Meanwhile, I hope I have provided enough information on these talks for you to make an informed decision on which ones to watch.

Sunday, December 03, 2023

Building Learning to Rank Models with Generative AI

Generative AI has been the new cool kid on the AI / ML block since early this year. Like everyone else, I continue to be amazed by each successive success story as these models break existing benchmark records and enable novel applications built on top of their new functionality. I have also been lucky to be involved in a Generative AI project since the middle of this year, which gave me access to these LLMs to build some cool tools. These tools morphed into a small side project which I have the opportunity to share at PyData Global 2023. This post gives a high-level overview of the project. I hope it piques your interest enough for you to attend my presentation, as well as the many other cool presentations scheduled at PyData Global 2023.

I used to work in search, and over the past few years, search and Natural Language Processing (NLP) have moved from being heuristics-based to statistical models to embedding models to knowledge graphs to deep learning to transformers to Generative AI. Over this same period, I have become more and more interested in "search adjacent" areas, such as NLP and Machine Learning (ML) techniques for content enrichment and semantic search. As these disciplines have converged, I find myself increasingly at the intersection of search and ML, which is really an exciting place to be, since there are so many more choices when deciding how to build our search pipelines.

One such choice is to use data to drive your search development process. The general strategy is to build a baseline search pipeline using either a statistical model for lexical search or a vector model for vector-based search, or combining the two in some manner. The search engineer then improves the search behavior based on observations of user behavior or feedback from domain experts (who generally also happen to be users of the system). However, user behavior is complex, and while this is technically still "user data", basing actions on a few observations usually leaves the engineer playing a never-ending game of whack-a-mole.

A more versatile approach is to use the power of machine learning to create Learning to Rank (LTR) models based on all of the observed user feedback. The advantage of this approach is that the resulting solutions are usually more rounded and more resistant to small changes in user behavior. While it is virtually impossible for a human to see all facets of a complex problem at the same time, to an ML model these behaviors are just points in multi-dimensional space that it manipulates using math. A major barrier to using ML, however, is that you need to be able to interpret the feedback and tie it to user intent, and you need systems in place to collect the feedback efficiently. E-commerce sites, for example, satisfy both conditions, which is why LTR models are quite common in that domain.
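To make the LTR idea concrete, here is a minimal sketch of training such a model with LightGBM's LambdaMART-style ranker on made-up data; in practice the features would come from the search engine (BM25 scores, field matches, popularity signals, etc.) and the labels from the collected feedback:

```python
import numpy as np
from lightgbm import LGBMRanker

# toy data: 100 queries x 10 candidate documents each, 5 ranking features
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 5))      # one row per (query, document) pair
y = rng.integers(0, 2, size=1000)   # 1 = relevant, 0 = irrelevant
group = [10] * 100                  # number of candidates per query

ranker = LGBMRanker(objective="lambdarank", n_estimators=100)
ranker.fit(X, y, group=group)

# at query time, score each candidate for a query and sort descending
scores = ranker.predict(X[:10])
ranking = np.argsort(-scores)
```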

In domains where these conditions don't hold, search engineers may resort to collecting judgment labels on query-document pairs from human experts. However, because this work is onerous and expensive, there are usually not enough labels to train LTR models, and the engineer usually ends up using the labeled data as a validation set for their one-off changes. This is definitely better than flying blind, which admittedly also happens, but less optimal than training an LTR model.

Generative Large Language Models (LLMs) such as OpenAI's GPT, Anthropic's Claude, etc., provide a way for the engineer to prompt the model with a query and a document's text and ask it for a "relevant" or "irrelevant" judgment depending on whether the document is relevant for the query or not. This approach has the potential to produce judgment labels in far greater quantity, and at an order of magnitude lower cost, than human experts can provide, thus making the LTR approach practical regardless of domain.
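A minimal sketch of such a judgment call, using the OpenAI Python client, might look like the following; the prompt wording and model name are illustrative, not the ones from my presentation:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = """You are a search relevance judge.
Query: {query}
Document: {document}
Answer with a single word: relevant or irrelevant."""

def judge(query: str, document: str, model: str = "gpt-4") -> str:
    # ask the LLM for a binary relevance judgment on a (query, document) pair
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user",
                   "content": PROMPT.format(query=query, document=document)}],
    )
    return resp.choices[0].message.content.strip().lower()

label = judge("treatment for type 2 diabetes",
              "Metformin is generally the first-line medication for ...")
```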

In my presentation, I describe a case study where I did exactly this, then used the generated judgments to train multiple LTR models and evaluate their performance against each other. Looking forward to seeing you there!