Sunday, December 03, 2023

Building Learning to Rank Models with Generative AI

Generative AI has been the new cool kid on the AI / ML block since early this year. Like everyone else, I continue to be amazed with each successive success story as these models break existing benchmark records and enable novel applications built on top of their new functionality. I was also lucky to be involved in a Generative AI project since the middle of this year, which gave me access to these LLMs to build some cool tools. These tools morphed into a small side project that I have the opportunity to share at PyData Global 2023. This post gives a high-level overview of the project. I hope it piques your interest enough for you to attend my presentation, as well as the many other cool presentations scheduled at PyData Global 2023.

I used to work in search, and over the past few years, search and Natural Language Processing (NLP) have moved from being heuristics-based to statistical models to embedding models to knowledge graphs to deep learning to transformers to Generative AI. Over this same period, I have become more and more interested in "search adjacent" areas, such as NLP and Machine Learning (ML) techniques for content enrichment and semantic search. As these disciplines have converged, I find myself increasingly at the intersection of search and ML, which is a really exciting place to be, since there are so many more choices when deciding how to build our search pipelines.

One such choice is to use data to drive your search development process. The general strategy is to build a baseline search pipeline using either a statistical model for lexical search or an embedding model for vector search, or a combination of the two. The search engineer then improves the search behavior based on observations of user behavior or feedback from domain experts (who generally also happen to be users of the system). However, user behavior is complex, and while this approach technically still uses "user data", basing actions on a handful of observations usually leaves the engineer playing a never-ending game of whack-a-mole.
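As an aside, one common way to combine lexical and vector results in such a baseline pipeline is Reciprocal Rank Fusion (RRF), which merges ranked lists without needing to calibrate their raw scores against each other. Here is a minimal sketch; the document IDs and the two retrievers' result lists are made up for illustration, and `k=60` is just the conventional default:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked result lists into one: each document
    contributes 1 / (k + rank) per list it appears in, and the
    contributions are summed across lists."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# hypothetical top results from a lexical (BM25-style) retriever
# and a vector (embedding-similarity) retriever for the same query
lexical = ["doc3", "doc1", "doc7"]
vector = ["doc1", "doc9", "doc3"]
fused = reciprocal_rank_fusion([lexical, vector])
```

Because RRF operates only on ranks, it sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales.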

A more versatile approach might be to use the power of machine learning to create Learning to Rank (LTR) models based on all of the observed user feedback. The advantage of this approach is that the resulting solutions are usually more rounded and more resistant to small changes in user behavior. While it is virtually impossible for a human to see all facets of a complex problem at the same time, to an ML model these behaviors are just points in a multi-dimensional space that it manipulates using math. A major barrier to using ML, however, is that you need to be able to interpret the feedback and tie it to user intent, and you need systems in place to collect that feedback efficiently. These conditions hold in e-commerce, for example, which is why LTR models are quite common in that domain.
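To make the idea concrete, here is a deliberately toy pairwise LTR trainer, not the models from the talk: each training example is a (relevant, irrelevant) document pair for some query, represented by hypothetical feature vectors, and a perceptron-style update nudges the weights whenever the irrelevant document scores at least as high as the relevant one:

```python
def train_pairwise_ltr(pairs, n_features, epochs=20, lr=0.1):
    """Toy pairwise Learning-to-Rank trainer. Each pair is
    (features_of_relevant_doc, features_of_irrelevant_doc); the
    weights move toward the relevant document's features whenever
    the model makes a ranking mistake on a pair."""
    w = [0.0] * n_features
    score = lambda x: sum(wi * xi for wi, xi in zip(w, x))
    for _ in range(epochs):
        for pos, neg in pairs:
            if score(pos) <= score(neg):  # ranking mistake
                for i in range(n_features):
                    w[i] += lr * (pos[i] - neg[i])
    return w

# hypothetical features per document: [lexical_score, embedding_similarity]
pairs = [
    ([2.0, 0.9], [1.0, 0.3]),  # relevant vs. irrelevant for one query
    ([1.5, 0.8], [2.0, 0.1]),  # here lexical score alone would mislead
]
w = train_pairwise_ltr(pairs, n_features=2)
```

Production LTR models (LambdaMART, neural rankers) are far more sophisticated, but the shape of the problem is the same: turn feedback into preference pairs or graded labels, then fit a scoring function over ranking features.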

In domains where these conditions don't hold, search engineers may resort to collecting judgment labels on query-document pairs from human experts. However, because this work is onerous and expensive, the labels are usually not numerous enough to train LTR models, and the engineer usually ends up using the labeled data as a validation set for one-off changes. This is definitely better than flying blind, which admittedly also happens, but still less effective than training an LTR model.

Generative Large Language Models (LLMs) such as OpenAI's GPT and Anthropic's Claude provide a way for the engineer to prompt the model with a query and a document's text and ask it to judge whether the document is "relevant" or "irrelevant" for the query. This approach has the potential to produce a virtually unlimited supply of judgment labels at an order of magnitude lower cost than human experts, making the LTR approach practical regardless of domain.
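A minimal sketch of that judgment step might look like the following. The prompt wording, function names, and sample texts are illustrative (not the ones used in the project), and the actual call to a model API is omitted; the point is the pattern of constraining the model to a binary answer and parsing it defensively:

```python
def build_judgment_prompt(query, doc_text):
    """Prompt asking an LLM for a binary relevance judgment on a
    query-document pair. Wording is illustrative only."""
    return (
        "You are a search relevance judge. Given a query and a document, "
        "answer with exactly one word: RELEVANT or IRRELEVANT.\n\n"
        f"Query: {query}\n"
        f"Document: {doc_text}\n"
        "Judgment:"
    )

def parse_judgment(llm_response):
    """Map the model's free-text reply to a binary label
    (1 = relevant, 0 = irrelevant); return None for unusable
    replies rather than guessing a label."""
    words = llm_response.strip().split()
    if not words:
        return None
    token = words[0].strip(".,").upper()
    if token == "RELEVANT":
        return 1
    if token == "IRRELEVANT":
        return 0
    return None

prompt = build_judgment_prompt("python generators", "A tutorial on the yield keyword.")
label = parse_judgment("RELEVANT")
```

Run over a large batch of query-document pairs, the parsed labels become the training set for the LTR models described next; in practice you would also want to spot-check a sample of the LLM's judgments against human labels.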

In my presentation, I describe a case study where I did this, then used the generated judgments to train multiple LTR models and evaluate their performance against each other. Looking forward to seeing you there!
