Saturday, May 18, 2024

Finetuning RAGAS Metrics using DSPy

Last month, I decided to sign up for the Google AI Hackathon, where Google provided access to their Gemini Large Language Model (LLM) and tasked participants with building a creative application on top of it. I have worked with Anthropic's Claude and OpenAI's GPT-3 at work previously, and I was curious to see how Gemini stacked up against them. I was joined in that effort by David Campbell and Mayank Bhaskar, my non-work colleagues from the TWIML (This Week In Machine Learning) Slack. Winners for the Google AI Hackathon were declared last Thursday, and while our project sadly did not win anything, the gallery provides examples of some very cool applications of LLMs (and Gemini in particular) for both business and personal tasks.

Our project was to automate the evaluation of RAG (Retrieval Augmented Generation) pipelines using LLMs. I have written previously about the potential of LLMs to evaluate search pipelines, but the scope of this effort is broader in that it attempts to evaluate all aspects of the RAG pipeline, not just search. We were inspired by the RAGAS project, which defines 8 metrics that cover various aspects of the RAG pipeline. Another inspiration for our project was the ARES paper, which shows that fine-tuning the LLM judges on synthetically generated outputs improves evaluation confidence.

Here is a short (3-minute) video description of our project on YouTube. This was part of our submission for the hackathon. We provide some more information about our project below.

We re-implemented the RAGAS metrics using LangChain Expression Language (LCEL) and applied them to (question, answer, context, ground truth) tuples from the AmnestyQA dataset to generate scores for these metrics. My original reason for doing this, rather than using what RAGAS provides directly, was that I couldn't make RAGAS work properly with Claude. This was because Claude cannot read and write JSON as well as GPT-3 (it works better with XML), and RAGAS was developed using GPT-3. All the RAGAS metrics are prompt-based and transferable across LLMs with minimal change, and the code is quite well written, but I wasn't sure if I would encounter similar issues with Gemini, so it seemed easier to just re-implement the metrics from the ground up for Gemini using LCEL than to figure out how to make RAGAS work with Gemini. As we will see shortly, this ended up being a good decision.

Next we re-implemented the metrics with DSPy. DSPy is a framework for optimizing LLM prompts. Unlike RAGAS, where we tell the LLM how to compute the metrics, with DSPy the general approach is to have very generic prompts and show the LLM what to do using few shot examples. The distinction is reminiscent of doing prediction using Rules Engines versus using Machine Learning. Extending the analogy a bit further, DSPy provides its BootstrapFewShotWithRandomSearch optimizer that allows you to search through its "hyperparameter space" of few shot examples, to find the best subset of examples to optimize the prompt with, with respect to some score metric you are optimizing for. In our case, we built the score metric to minimize the difference between the score reported by the LCEL version of the metric and the score reported by the DSPy version. The result of this procedure is a set of prompts for generating the 8 RAG evaluation metrics that are optimized for the given domain.
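
To make this concrete, here is a minimal sketch of what one such DSPy program and its optimization loop might look like. The Faithfulness signature, its fields, and the trainset of dspy.Example objects (each carrying the LCEL-computed score) are illustrative assumptions, not our exact code.

import dspy
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

# generic signature: the prompt says very little about *how* to score;
# the few shot examples chosen by the optimizer do the teaching
class Faithfulness(dspy.Signature):
    """Score how faithful the answer is to the retrieved context."""
    question = dspy.InputField()
    context = dspy.InputField()
    answer = dspy.InputField()
    score = dspy.OutputField(desc="a number between 0.0 and 1.0")

faithfulness = dspy.ChainOfThought(Faithfulness)

# score metric: reward DSPy outputs that agree with the LCEL-computed score
def score_agreement(example, pred, trace=None):
    try:
        return 1.0 - abs(float(pred.score) - float(example.score))
    except ValueError:
        return 0.0

optimizer = BootstrapFewShotWithRandomSearch(
    metric=score_agreement,
    max_bootstrapped_demos=3,
    num_candidate_programs=8)
# trainset: a list of dspy.Example(question=..., context=..., answer=..., score=...)
optimized_faithfulness = optimizer.compile(faithfulness, trainset=trainset)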

To validate this claim, we generated histograms of scores for each metric using the LCEL and DSPy prompts, and compared how bimodal, i.e. how tightly clustered around 0 and 1, they were. The intuition is that the more confident the LLM is about the evaluation, the more it will tend to deliver a confident judgment clustered around 0 or 1. In practice, we do see this happening with the DSPy prompts for all but 2 of the metrics, although the differences are not very large. This may be because the AmnestyQA dataset is very small, containing only 20 questions.

To address the small size of the AmnestyQA dataset, Dave used the LLM to generate more (question, context, answer, ground_truth) tuples, given a question and answer pair from AmnestyQA and a Wikipedia retriever endpoint. The plan was for us to use this larger dataset for optimizing the DSPy prompts. However, rather than doing this completely unsupervised, we wanted to have a way for humans to validate and score the LCEL scores from these additional questions. We would then use these validated scores as the basis for optimizing the DSPy prompts for computing the various metrics.

This would require a web-based tool that allows humans to examine the output of each step of the LCEL metric scoring process. For example, the Faithfulness metric has two steps: the first is to extract facts from the answer, and the second is to provide a binary judgment of whether the context contains each fact. The score is computed by adding up the individual binary scores. The tool would allow us to view and update what facts were extracted in the first stage, and the binary output for each of the fact-context pairs. This is where implementing the RAGAS metrics on our own helped us: we refactored the code so the intermediate results were also available to the caller. Once the tool was in place, we would use it to validate our generated tuples and attempt to re-optimize the DSPy prompts. Mayank and Dave had started on this, but unfortunately we ran out of time before we could complete this step.
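
As an illustration of how the refactoring helps, here is a minimal sketch of how the two intermediate outputs of the Faithfulness metric could be combined once a human has validated (or corrected) them. Normalizing the sum by the number of facts, so the score lands between 0 and 1, is my assumption.

from typing import List

def faithfulness_from_intermediates(facts: List[str], verdicts: List[int]) -> float:
    # facts: statements extracted from the answer in the first step
    # verdicts: 1 if the context supports the fact, 0 otherwise (second step)
    if not facts:
        return 0.0
    return sum(verdicts) / len(facts)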

Another thing we noticed is that the calculation of most of the metrics involves one or more subtasks that make some kind of binary (true / false) decision about a pair of strings. This is something that a smaller model, such as a T5 or a Sentence Transformer, could do quite easily, more predictably, faster, and at lower cost. As before, we could extract the intermediate outputs from the LCEL metrics to create training data for this. We could use DSPy and its BootstrapFinetune optimizer to fine-tune these smaller models, or fine-tune Sentence Transformers or BERT models for binary classification and hook them up into the evaluation pipeline.
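
As a sketch of the second option, a cross-encoder from the Sentence Transformers library could be fine-tuned on (fact, context) pairs labeled with the validated binary judgments. The load_fact_context_labels helper, the training file, and the base model are hypothetical placeholders.

from sentence_transformers import CrossEncoder, InputExample
from torch.utils.data import DataLoader

# (fact, context, label) triples harvested from the LCEL intermediate outputs
train_examples = [
    InputExample(texts=[fact, context], label=float(label))
    for fact, context, label in load_fact_context_labels("train.jsonl")
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# a small cross-encoder with a single output for the binary decision
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2", num_labels=1)
model.fit(train_dataloader=train_dataloader, epochs=2, warmup_steps=100)

# at inference time this score can stand in for the LLM's binary judgment
score = model.predict([("fact extracted from answer", "retrieved context")])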

Anyway, that was our project. Obviously, there is quite a bit of work remaining to make it into a viable product for LLM-based evaluation using the strategy we laid out. But we believe we have demonstrated that the approach is viable: given sufficient training data (about 50-100 examples for the optimized prompt, and maybe 300-500 each for the binary classifiers), it should be possible to build metrics that are tailored to one's domain and that can deliver evaluation judgments with greater confidence than those built using simple prompt engineering. In case you are interested in exploring further, you can find our code and preliminary results at sujitpal/llm-rag-eval on GitHub.

Tuesday, May 14, 2024

Performance Analysis of Float vs Byte vs Binary Vectors on OpenSearch

I've been working on an application where, given an input string, the objective is to recommend an output string that is similar to the input string, for some notion of similarity. A machine learning model, in this case a SentenceTransformers model, is taught this notion of similarity by showing it many examples of input-output pairs. The model is then used to encode the part to be recommended as a vector, which is written out into a search index, in this case OpenSearch. At inference time, the same model is used to encode the input string, and OpenSearch's vector search is used to find and recommend the nearest string to the input in vector space.

My dataset consisted of about 10,000 input-output pairs. I split it up into a 90% training set (approx 9,000 pairs) and a 10% test set (approx 1,000 pairs). I chose the sentence-transformers/all-MiniLM-L6-v2 model, a good pre-trained general-purpose model for vector search in its own right, which maps text into a dense 384-dimensional space. Sentence Transformer models use Contrastive Learning, which means I need both positive and negative pairs to train them, but my training set consists of positive pairs only, by definition. Rather than try to generate negative pairs on my own, I used the built-in MultipleNegativesRanking (MNR) loss, which takes positive pairs and generates negative pairs by sampling from the other pairs in the same batch. I trained for 10 epochs, using the AdamW optimizer (the default) with learning rate 2e-5 (also the default), saving the best checkpoint (based on similarity on the validation set).
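
Here is a minimal sketch of that training setup, assuming the pre-v3 Sentence Transformers fit() API; train_pairs, the batch size, and the evaluator (built from held-out validation pairs) are placeholders rather than my exact code.

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# each record is a positive (input, output) pair; the MNR loss generates
# negatives for each pair from the other pairs in the same batch
train_examples = [InputExample(texts=[inp, out]) for inp, out in train_pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=10,
    optimizer_params={"lr": 2e-5},
    evaluator=evaluator,          # e.g. a similarity evaluator over the validation pairs
    save_best_model=True,
    output_path="models/all-MiniLM-L6-v2-finetuned",
)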

To evaluate, I generated the top-100 nearest neighbors for each input of the test set pairs, and then computed Recall@k and MRR (Mean Reciprocal Rank)@k for k = 1, 3, 5, 10, 20, 50, 100. For recall, I score a match @k as successful if the output of the pair appears within the top k nearest neighbors returned from the vector search. This score is then averaged across all 1,000 test set pairs for each value of k. MRR is similar, except that the recall score for each test set pair and k is divided by the (1-based) position at which the match occurred (and is 0 if there is no match).
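
A minimal sketch of that evaluation, assuming expected_ids holds the correct output id for each test query and ranked_ids holds the ordered top-100 ids returned by the vector search for that query:

import numpy as np

K_VALUES = [1, 3, 5, 10, 20, 50, 100]

def recall_and_mrr_at_k(expected_ids, ranked_ids, k_values=K_VALUES):
    recalls, mrrs = {}, {}
    for k in k_values:
        rec_scores, mrr_scores = [], []
        for expected, ranked in zip(expected_ids, ranked_ids):
            top_k = list(ranked[:k])
            if expected in top_k:
                rank = top_k.index(expected) + 1   # 1-based position of the match
                rec_scores.append(1.0)
                mrr_scores.append(1.0 / rank)
            else:
                rec_scores.append(0.0)
                mrr_scores.append(0.0)
        recalls[k] = np.mean(rec_scores)
        mrrs[k] = np.mean(mrr_scores)
    return recalls, mrrs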

The baseline for the experiment was computed using an index of encodings of the input part created using the stock all-MiniLM-L6-v2 model.

I had also recently read Jo Kristian Bergum's blog posts Billion Scale Vector Search with Vespa, parts one and two, and more recently Matryoshka Binary Vectors: Slash Vector Search Costs with Vespa on the Vespa blog, where he compares the performance of vectors of different storage types, among other things. Vespa allows for many different storage types, but I am using OpenSearch, which offers support for float (float32), byte (int8) and binary (bool) storage types. I was curious to see (a) whether I could replace my float-based vectors with these, and (b) if so, how their performance would compare. That is what this post is about.

The index mappings for the float, byte and binary vector fields are shown below. These need to be set during index creation.

Float vector

{
    "type": "knn_vector",
    "dimension": 384,
    "method": {
        "name": "hnsw",
        "engine": "lucene",
        "space_type": "cosinesimil",
        "parameters": {
            "ef_construction": 128,
            "m": 24
        }
    }
}

Byte vector

{
    "type": "knn_vector",
    "dimension": 384,
    "data_type": "byte",
    "method": {
        "name": "hnsw",
        "engine": "lucene",
        "space_type": "cosinesimil",
        "parameters": {
            "ef_construction": 128,
            "m": 24,
        }
    }
}

Binary vector

{
    "type": "binary",
    "doc_values": "true"
}
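
For completeness, here is a sketch of how one of these mappings could be supplied at index creation time using the opensearch-py client. The index name (parts-byte), field name (part_vector), and localhost connection are made-up placeholders.

from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

client.indices.create(
    index="parts-byte",
    body={
        # enable the k-NN plugin for this index
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "part_vector": {
                    "type": "knn_vector",
                    "dimension": 384,
                    "data_type": "byte",
                    "method": {
                        "name": "hnsw",
                        "engine": "lucene",
                        "space_type": "cosinesimil",
                        "parameters": {"ef_construction": 128, "m": 24}
                    }
                }
            }
        }
    }
)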

To generate the vectors, I used the fine-tuned version of the all-MiniLM-L6-v2 model as my encoder, and post-processed the float32 vectors returned by the encoder into int8 and binary representations using the following functions.

import binascii
import numpy as np

def convert_to_bytes(arr: np.ndarray):
    # scale the (unit-normalized) float32 components into the int8 range
    return np.floor(arr * 128).astype(np.int8)

def convert_to_bits(arr: np.ndarray):
    # threshold each component at 0, pack the bits into bytes, and emit
    # one hex string per input vector (384 bits -> 48 bytes -> 96 hex chars)
    bits = np.packbits(
        np.where(arr > 0, 1, 0)).astype(np.int8)
    arr_b = bits.reshape(arr.shape[0], -1)
    hex_bits = []
    for row in arr_b:
        hex_bits.append(str(binascii.hexlify(row), "utf-8"))
    return hex_bits
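
A quick usage sketch, assuming model is the fine-tuned encoder and texts is the list of strings to index; normalizing the embeddings so each component stays within [-1, 1] before the int8 scaling is my assumption.

# encode once with the fine-tuned model, then derive all three representations
float_vecs = model.encode(texts, normalize_embeddings=True)   # float32, shape (n, 384)
byte_vecs = convert_to_bytes(float_vecs)                      # int8 values
binary_vecs = convert_to_bits(float_vecs)                     # one hex string per vector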

At the end of this indexing process, I had three separate indexes of approximately 10,000 records, one with one part of the pair encoded as a float32 vector, another encoded as a byte (int8) vector, and the last as a binary vector. To give you a rough idea of the storage requirements (rough because there are fields other than the vectors for each index), the sizes reported by /_cat/indices are shown below.

Vector Storage Type    Index Size
float                  184.9 MB
byte                   39.6 MB
binary                 15.6 MB

On the query side, I use the Script Score queries described in the Byte-quantized vectors in OpenSearch blog post and the Exact k-NN with scoring script documentation pages. The queries are shown below. The query for float and byte vectors is the same; the only difference is that the float vector is quantized down to int8 in the byte case.

Float and byte vector

{
    "script_score": {
        "query": {
            "match_all": {}
        },      
        "script": {
            "source": "knn_score",
            "lang": "knn",
            "params": {
                "field": "{field_name}",
                "query_value": "{float_or_byte_vector}",
                "space_type": "cosinesimil"
            }       
        }       
    }       
}    

Binary vector

{
    "script_score": {
        "query": {
            "match_all": {}
        },
        "script": {
            "source": "knn_score",
            "lang": "knn",
            "params": {
                "field": "{field_name}",
                "query_value": "{binary_vector}",
                "space_type": "hammingbit"
            }
        }
    }
}
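
A sketch of how such a query can be issued from Python with opensearch-py, where client is an OpenSearch client as created earlier and query_vec is the encoded query vector; the index and field names are placeholders.

def knn_script_query(field_name, query_value, space_type, k=100):
    # wraps the score script shown above in a search request returning the top k hits
    return {
        "size": k,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": "knn_score",
                    "lang": "knn",
                    "params": {
                        "field": field_name,
                        "query_value": query_value,
                        "space_type": space_type
                    }
                }
            }
        }
    }

response = client.search(
    index="parts-float",
    body=knn_script_query("part_vector", query_vec.tolist(), "cosinesimil"))
neighbor_ids = [hit["_id"] for hit in response["hits"]["hits"]]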

The charts below show Recall@k and MRR@k for the various values of k described above.

The first thing to note is that fine-tuning helps, or at least it helped a lot in this case, probably because the notion of similarity I was working with was more nuanced than plain cosine similarity. With respect to float vectors versus byte vectors, float vectors have a slight edge, as you can see from the charts below (look closely and you will see the float vector (orange line) and byte vector (green line) almost overlaid on each other). While binary vectors don't perform as well, they still perform better than the baseline, and they are much faster, so they can be useful as the first stage of a two-stage retrieval pipeline.

[Charts: Mean Reciprocal Rank @k and Recall @k for the baseline, float, byte and binary vectors]

In terms of response time, binary vectors are the fastest, followed by byte vectors, followed by float vectors. I measured the response time for each of the 1,000 test set queries to extract 100 nearest neighbors from OpenSearch, then calculated the mean and standard deviation for each vector storage type. Here are the results.

Vector Storage Type    Mean Response Time    Standard Deviation
float                  0.382 s               0.097 s
byte                   0.231 s               0.050 s
binary                 0.176 s               0.048 s
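
A minimal sketch of how such timings can be collected, wrapping the search call in wall-clock measurements (client and query bodies as above):

import time
import numpy as np

def time_queries(client, index_name, query_bodies):
    # run each query once and record the elapsed wall-clock time
    elapsed = []
    for body in query_bodies:
        start = time.perf_counter()
        client.search(index=index_name, body=body)
        elapsed.append(time.perf_counter() - start)
    return np.mean(elapsed), np.std(elapsed)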

I originally brought this up in the Relevance Slack channel because I had made some mistakes in my evaluation and was getting results that did not agree with common sense. Thanks to all my "non-work colleagues" on the channel who helped me out by validating my observations and being sounding boards, which eventually helped me find the mistakes in my analysis. In any case, I figured that the final results may be useful to the broader community of vector search enthusiasts and professionals, so I am sharing this. Hope you find it useful.

Tuesday, May 07, 2024

KGC/HCLS 2024 Trip Report

I was at KGC (Knowledge Graph Conference) 2024, which is happening May 6-10 at Cornell Tech. I was presenting (virtually) at their Health Care and Life Sciences (HCLS) workshop, so my speaker's pass was only valid for today, the HCLS portion of KGC. My trip report covers a few of the talks that I attended. Attending virtually was a bit chaotic, as sessions sometimes ran over, so you might leave a session to attend another, only to find that it hadn't started yet. This is hard to foresee; we faced the same issue ourselves the first time we moved an internal conference from in-person to hybrid.

KGs in RAG (Tom Smoker, WhatWhyHow.AI)

I have been working with Large Language Models (LLMs) and Retrieval Augmented Generation (RAG) for almost a year now, and I went to this talk hoping for insights on how to use graphs as input to RAG systems. Understandably, the speaker spent some time covering the basics, which I personally did not find very fruitful. However, there were some nuggets of wisdom I got out of the talk. First, RAG pipelines can lower the risk of hallucination by using LLMs for planning and reasoning, but without delegating to them for factual information. And second, an agent architecture can more efficiently use smaller sub-graphs, which can often be generated dynamically in Closed World models.

A side discussion in chat also yielded a paper reference: Getting from Generative AI to Trustworthy AI: what LLMs may learn from Cyc (Lenat and Marcus, 2023). The paper looks really interesting on an initial skim, and I plan to read it in more detail later.

Knowledge Graphs for Precision Oncology (Krishna Bulusu, AstraZeneca)

A nice overview of applications of Knowledge Graphs (KGs) to Drug Discovery (DD). DD attempts to apply KGs to solve 3 main problems: (1) finding the gene causing a disease, (2) matching drugs with diseases, and (3) modeling (drug, gene, disease) as a fundamental relationship in DD. The speaker pointed out that the big advantage of KGs is Explainability. He also mentioned the use of graph clustering for node stratification.

Combining graph and vector representation for efficient information retrieval (Peio Popov, Ontotext)

This was a presentation from Ontotext where they demonstrated new features built into their GraphDB database. This was of interest to me personally since our KG is also built using GraphDB. Specifically, they have integrated LLM and vector search support into their products so that these can be invoked from a SPARQL query. This gives GraphDB users the power to combine these techniques in the same call rather than build multi-stage pipelines.

I also learned the distinction between Semantic, Full-text and Vector Search as being based on KGs, Lucene (or Lucene-like) indexes, and vector search platforms, respectively; I had previously conflated the first and third.

Knowledge Engineering in Clinical Decision Support: When a Graph Representational Model is not enough (Maulik Kamdar, Optum)

This was a presentation from my ex-colleague Maulik Kamdar. He talked about challenges in Clinical Decision Support (CDS) where a KG alone is insufficient, specifically the case where multiple third-party ontologies need to be aligned into one KG. In this situation, similar concepts are combined into ValueSets, which are then composed with bare concepts or with each other to form Clinical Rules. Clinical Rules are further combined to form Clinical Calculators or Questionnaires, which are then combined to form Decision Trees and Flowcharts, which are in turn combined into Clinical Guidelines. I am probably biased given our common history, but I found this talk to be the most educational one for me.

Knowledge Graphs, Theorem Provers and Language Models (Vijay Saraswat and Nikolaos Vasiloglou)

The speakers discussed the role of self-discovery, In-Context Learning (ICL), symbiotic integration of KGs with search, and Graph RAG in reasoning engines powered by KGs and LLMs. They characterize an Agent as an LLM-based black box that is provided with pairs of input-output instances to learn some unknown function (similar to ML models). They describe ICL as learning through few shot and many shot examples. They also talk about using the output of a KG to fact-check / enhance LLMs, and using LLMs to generate assertions that can be used to create a KG. Their demo shows how an LLM is able to learn to generate a Datalog-like graph query language from text prompts using few-shot examples.

The speaker made reference to the following three papers in support of the techniques he was describing, which I have duly added to my reading list.

A Scalable and Robust Named Entity Recognition and Linking System for a Clinical Healthcare Knowledge Graph (Sujit Pal, Elsevier Health)

This was my talk. I had originally intended to attend in person but it seemed wasteful to fly across the country to deliver a 5-minute presentation. It did take a bit of planning to present remotely but I learned two useful life lessons.

  1. You can generate a presentation video from MS PowerPoint. Simply create your slides and record a slideshow, narrating the presentation as you go. Once done, export it as an MP4 and upload it to YouTube or another video service.
  2. You can print posters online and have them delivered to someone else.

Huge thanks to my colleague Tom Woodcock who attended in person, and who was kind enough to carry and hang my poster at the conference for me, and who also agreed to present my slideshow for me (although I think that in the end he did not have to). Many thanks also to my ex-colleague Helena Deus (part of the HCLS organizing team), who helped walk me through to a workable solution and was instrumental in my talk being delivered successfully. Also thanks to Leah Walton from the HCLS organizing team, for supporting me in my attempt to present remotely.

Here is the YouTube video for my 5-minute presentation in case you are interested. It’s a bit high-level since I had only 5 minutes to cover everything, but there is a little more information in the poster below.

Graphs for good – Hypothesis generation for Rare Disease Treatment (Brian Martin, AbbVie)

This presentation revolved around a graph that connects diseases to drugs via disease variant, gene, pathway, and compound entities. This was used to find a cure for a rare disease using existing medications. It was later extended to find candidate cures for a group of the 20 most neglected diseases worldwide. The speakers verified that the results for Dengue fever correlate well with previously known information, supporting the veracity of the approach. The paper describing this work is Leveraging a Billion-Edge Knowledge Graph for Drug Re-purposing and Target Prioritization using Genomically-Informed Subgraphs (Martin et al, 2022).

Generating and Querying Graphs with LLM (Brian Martin, Subha Madhavan, Berenice Wulbrecht)

A panel discussion where various strategies for generating and querying graphs using LLMs were discussed. There were entertaining (and somewhat predictable) comparisons of Property Graphs vs RDF graphs to Ford and Ferrari automobiles, and of how LLMs transform them into Teslas (with their self-driving technology). The panelists also talked about extracting assertions from a corpus of documents to create a KG customized for the corpus, and then using the KG to fact-check the output of the LLM for RAG queries against that corpus.

Overall, I think it was a great conference. I learned a lot, and would love to go back and present there in the future, hopefully in person next time.