Last month, I decided to sign up for the Google AI Hackathon, where Google provided access to their Gemini Large Language Model (LLM) and tasked participants with building a creative application on top of it. I had previously worked with Anthropic's Claude and OpenAI's GPT-3 at work, and I was curious to see how Gemini stacked up against them. I was joined in this effort by David Campbell and Mayank Bhaskar, my non-work colleagues from the TWIML (This Week In Machine Learning) Slack. Winners for the Google AI Hackathon were declared last Thursday, and while our project sadly did not win anything, the gallery provides examples of some very cool applications of LLMs (and Gemini in particular) for both business and personal tasks.
Our project was to automate the evaluation of RAG (Retrieval Augmented Generation) pipelines using LLMs. I have written previously about the potential of LLMs to evaluate search pipelines, but the scope of this effort is broader in that it attempts to evaluate all aspects of the RAG pipeline, not just search. We were inspired by the RAGAS project, which defines 8 metrics that cover various aspects of the RAG pipeline. Another inspiration for our project was the ARES paper, which shows that fine-tuning the LLM judges on synthetically generated outputs improves evaluation confidence.
Here is a short (3-minute) video description of our project on YouTube. This was part of our submission for the hackathon. We provide some more information about our project in the blog post below.
We re-implemented the RAGAS metrics using LangChain Expression Language (LCEL) and applied them to (question, answer, context, ground truth) tuples from the AmnestyQA dataset to generate scores for these metrics. My original reason for doing this, rather than using what RAGAS provides directly, was that I couldn't make RAGAS work properly with Claude. This was because Claude cannot read and write JSON as well as GPT-3 (it works better with XML), and RAGAS was developed using GPT-3. All the RAGAS metrics are prompt-based and transferable across LLMs with minimal change, and the code is quite well written. I wasn't sure if I would encounter similar issues with Gemini, so it seemed easier to just re-implement the metrics from the ground up for Gemini using LCEL than to figure out how to make RAGAS work with Gemini. As we will see shortly, this ended up being a good decision.
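To give a flavor of what these LCEL re-implementations look like, here is a minimal sketch of a Faithfulness-style metric. This is not our exact code: the prompt wording, the model name and the helper function are assumptions made for illustration.

```python
# Minimal sketch of an LCEL-style metric (illustrative, not our exact code).
# Assumes the langchain-google-genai package and a GOOGLE_API_KEY in the environment.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(model="gemini-pro", temperature=0.0)

# Step 1: extract individual facts from the answer.
extract_facts_chain = (
    ChatPromptTemplate.from_template(
        "Extract the individual factual statements made in the answer below.\n"
        "Return one fact per line.\n\nQuestion: {question}\nAnswer: {answer}\nFacts:"
    )
    | llm | StrOutputParser()
)

# Step 2: judge whether each extracted fact is supported by the retrieved context.
judge_fact_chain = (
    ChatPromptTemplate.from_template(
        "Context:\n{context}\n\nFact: {fact}\n"
        "Is the fact supported by the context? Answer YES or NO."
    )
    | llm | StrOutputParser()
)

def faithfulness(question: str, answer: str, context: str) -> float:
    """Fraction of facts in the answer that are supported by the context."""
    raw_facts = extract_facts_chain.invoke({"question": question, "answer": answer})
    facts = [f.strip() for f in raw_facts.splitlines() if f.strip()]
    if not facts:
        return 0.0
    verdicts = [judge_fact_chain.invoke({"context": context, "fact": f}) for f in facts]
    return sum(1 for v in verdicts if v.strip().upper().startswith("YES")) / len(facts)
```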
Next we re-implemented the metrics with DSPy. DSPy is a framework for optimizing LLM prompts. Unlike RAGAS, where we tell the LLM how to compute the metrics, the general approach with DSPy is to use very generic prompts and show the LLM what to do using few-shot examples. The distinction is reminiscent of doing prediction using Rules Engines versus using Machine Learning. Extending the analogy a bit further, DSPy provides its BootstrapFewShotWithRandomSearch optimizer, which allows you to search through its "hyperparameter space" of few-shot examples to find the best subset of examples to optimize the prompt with, with respect to some score metric you are optimizing for. In our case, we built the score metric to minimize the difference between the score reported by the LCEL version of the metric and the score reported by the DSPy version. The result of this procedure is a set of prompts to generate the 8 RAG evaluation metrics that are optimized for the given domain.
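Sketched below is roughly what this optimization step looks like in DSPy. The signature fields, the agreement threshold and the placeholder training data are assumptions for illustration; the optimizer and its compile call are standard DSPy usage.

```python
# Sketch of optimizing a DSPy judge against the LCEL scores (illustrative, not
# our exact code). Assumes an LM has been configured via dspy.settings.configure(lm=...).
import dspy
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

class Faithfulness(dspy.Signature):
    """Rate how faithful the answer is to the retrieved context, from 0.0 to 1.0."""
    question = dspy.InputField()
    context = dspy.InputField()
    answer = dspy.InputField()
    score = dspy.OutputField(desc="a float between 0.0 and 1.0")

faithfulness_judge = dspy.Predict(Faithfulness)

# lcel_scored_tuples is a placeholder: (question, context, answer, lcel_score)
# records produced by the LCEL implementation of the same metric.
trainset = [
    dspy.Example(question=q, context=c, answer=a, lcel_score=s)
        .with_inputs("question", "context", "answer")
    for (q, c, a, s) in lcel_scored_tuples
]

def agrees_with_lcel(example, pred, trace=None):
    """Score metric: reward DSPy outputs that are close to the LCEL score."""
    try:
        return abs(float(pred.score) - float(example.lcel_score)) <= 0.1
    except ValueError:
        return False

optimizer = BootstrapFewShotWithRandomSearch(
    metric=agrees_with_lcel,
    max_bootstrapped_demos=4,
    num_candidate_programs=8,
)
optimized_judge = optimizer.compile(faithfulness_judge, trainset=trainset)
```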
To validate this claim, we generated histograms of scores for each metric using the LCEL and DSPy prompts, and compared how bimodal, i.e. how tightly clustered around 0 and 1, they were. The intuition is that the more confident the LLM is about its evaluation, the more its judgments will cluster around 0 or 1. In practice, we do see this happening with the DSPy prompts for all but 2 of the metrics, although the differences are not very large. This may be because the AmnestyQA dataset is very small, only 20 questions.
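The comparison itself is simple; something along the lines of the sketch below, where the two score lists are placeholders for the per-tuple scores produced by the LCEL and DSPy variants of a metric.

```python
# Compare how tightly the LCEL and DSPy scores cluster around 0 and 1 (illustrative).
import matplotlib.pyplot as plt

def plot_score_histograms(lcel_scores, dspy_scores, metric_name):
    """Overlay histograms of one metric's scores from the two prompt variants."""
    bins = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    plt.hist(lcel_scores, bins=bins, alpha=0.5, label="LCEL")
    plt.hist(dspy_scores, bins=bins, alpha=0.5, label="DSPy")
    plt.xlabel(f"{metric_name} score")
    plt.ylabel("number of (question, answer, context) tuples")
    plt.legend()
    plt.show()
```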
To address the small size of the AmnestyQA dataset, Dave used the LLM to generate some more (question, context, answer, ground_truth) tuples, given a question and answer pair from AmnestyQA and a Wikipedia retriever endpoint. The plan was to use this larger dataset for optimizing the DSPy prompts. However, rather than doing this completely unsupervised, we wanted a way for humans to validate and score the LCEL scores from these additional questions. We would then use these validated scores as the basis for optimizing the DSPy prompts for computing the various metrics.
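The generation step could look roughly like the sketch below. The prompts, the use of LangChain's WikipediaRetriever, and the helper names are our assumptions for illustration, not Dave's actual code.

```python
# Sketch: generate an additional (question, context, answer) tuple from a seed
# question/answer pair and a Wikipedia retriever (illustrative).
from langchain_community.retrievers import WikipediaRetriever
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_google_genai import ChatGoogleGenerativeAI

llm = ChatGoogleGenerativeAI(model="gemini-pro", temperature=0.7)
retriever = WikipediaRetriever(top_k_results=3)

question_gen = (
    ChatPromptTemplate.from_template(
        "Here is a question and its answer:\nQ: {question}\nA: {answer}\n"
        "Write a new, related question on the same topic."
    )
    | llm | StrOutputParser()
)
answer_gen = (
    ChatPromptTemplate.from_template(
        "Answer the question using only the context below.\n"
        "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    | llm | StrOutputParser()
)

def generate_tuple(seed_question: str, seed_answer: str) -> dict:
    """Produce one synthetic evaluation tuple from a seed AmnestyQA pair."""
    new_question = question_gen.invoke(
        {"question": seed_question, "answer": seed_answer})
    docs = retriever.invoke(new_question)
    context = "\n".join(d.page_content for d in docs)
    answer = answer_gen.invoke({"question": new_question, "context": context})
    # ground_truth would be filled in by a separate generation / human validation step
    return {"question": new_question, "context": context, "answer": answer}
```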
This would require a web-based tool that allows humans to examine the output of each step of the LCEL metric scoring process. For example, the Faithfulness metric has two steps: the first is to extract facts from the answer, and the second is to provide a binary judgment of whether the context contains each fact. The score is computed by adding up the individual binary scores. The tool would allow us to view and update what facts were extracted in the first stage, and the binary output for each of the fact-context pairs. This is where implementing the RAGAS metrics on our own helped us: we refactored the code so the intermediate results were also available to the caller. Once the tool was in place, we would use it to validate our generated tuples and attempt to re-optimize the DSPy prompts. Mayank and Dave had started on this, but unfortunately we ran out of time before we could complete this step.
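Concretely, the refactoring amounts to returning the intermediate outputs alongside the final score, so the validation tool can display and correct them. The sketch below uses hypothetical names; extract_facts and judge_fact stand in for the LCEL chains from the earlier sketch.

```python
# Sketch: expose intermediate outputs of the faithfulness metric for human review
# (names and structure are illustrative, not our exact code).
from dataclasses import dataclass
from typing import List

@dataclass
class FaithfulnessResult:
    facts: List[str]      # step 1: facts extracted from the answer
    verdicts: List[bool]  # step 2: does the context contain each fact?
    score: float          # aggregated from the binary verdicts

def compute_faithfulness(question, answer, context) -> FaithfulnessResult:
    # extract_facts and judge_fact are assumed helpers wrapping the LCEL chains.
    facts = extract_facts(question, answer)
    verdicts = [judge_fact(context, fact) for fact in facts]
    score = sum(verdicts) / len(verdicts) if verdicts else 0.0
    return FaithfulnessResult(facts=facts, verdicts=verdicts, score=score)
```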
Another thing we noticed is that calculating most of the metrics involves one or more subtasks that make some kind of binary (true / false) decision about a pair of strings. This is something that a smaller model, such as a T5 or a Sentence Transformer, could do quite easily, more predictably, faster, and at lower cost. As before, we could extract the intermediate outputs from the LCEL metrics to create training data for this. We could use DSPy and its BootstrapFinetune optimizer to fine-tune these smaller models, or fine-tune Sentence Transformers or BERT models for binary classification and hook them up into the evaluation pipeline.
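As a sketch of the second option, a cross-encoder could be fine-tuned to make the binary "supported / not supported" judgment. The base model, hyperparameters and the placeholder training data below are assumptions, not something we actually trained.

```python
# Sketch: fine-tune a small cross-encoder for the binary judgment, using
# (context, fact, label) triples harvested from the intermediate outputs of the
# LCEL metrics (illustrative, not our exact code).
from torch.utils.data import DataLoader
from sentence_transformers import CrossEncoder, InputExample

# labeled_fact_context_pairs is a placeholder for human-validated triples.
train_examples = [
    InputExample(texts=[context, fact], label=float(label))
    for context, fact, label in labeled_fact_context_pairs
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", num_labels=1)
model.fit(train_dataloader=train_dataloader, epochs=2, warmup_steps=100)

# At evaluation time, the model scores a (context, fact) pair in one forward pass.
supported_score = model.predict([("some retrieved context ...", "some extracted fact")])
```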
Anyway, that was our project. Obviously, there is quite a bit of work remaining to turn it into a viable product for LLM-based evaluation using the strategy we laid out. But we believe we have demonstrated that the approach can work: given sufficient training data (about 50-100 examples for the optimized prompt, and maybe 300-500 each for the binary classifiers), it should be possible to build metrics that are tailored to one's domain and that deliver evaluation judgments with greater confidence than those built using simple prompt engineering. In case you are interested in exploring further, you can find our code and preliminary results at sujitpal/llm-rag-eval on GitHub.