I recently came across Prompt Compression (in the context of Prompt Engineering on Large Language Models) in the short course on Prompt Compression and Query Optimization from DeepLearning.AI. Essentially, it involves compressing the prompt text using a trained model to drop non-essential tokens. The resulting prompt is shorter (and, in cases where the original context is longer than the LLM's context limit, not truncated) but retains the original semantic meaning. Because the compressed prompt is shorter, the LLM can process it faster and more cheaply, and in some cases get around the Lost In the Middle problems observed with long contexts.
The course demonstrated Prompt Compression using the LLMLingua library (paper) from Microsoft. I had heard about LLMLingua previously from my ex-colleague Raahul Dutta, who blogged about it in his Edition 26: LLMLingua - A Zip Technique for Prompt post, but at the time I thought maybe it was more in the realm of research. Seeing it mentioned in the DeepLearning.AI course made it feel more mainstream, so I tried it out on a single query from my domain using their Quick Start example, compressing the prompt with the small llmlingua-2-bert-base-multilingual-cased-meetingbank model, and using Anthropic's Claude-v2 on AWS Bedrock as the LLM.
Compressing the prompt for the single query gave me a better answer than without compression, at least going by inspecting the answer produced by the LLM before and after compression. Encouraged by these results, I decided to evaluate the technique using a set of around 50 queries I had lying around (along with a vector search index) from a previous project. This post describes the evaluation process and the results I obtained from it.
My baseline was a naive RAG pipeline, with the context retrieved by vector matching the query against the corpus, and then incorporated into a prompt that looks like this. The index is an OpenSearch index containing vectors of document chunks, vectorization was done using the pre-trained all-MiniLM-L6-v2 SentenceTransformers encoder, and the LLM is Claude-2 (on AWS Bedrock, as mentioned previously).
```
Human: You are a medical expert tasked with answering questions
expressed as short phrases. Given the following CONTEXT, answer the QUESTION.

CONTEXT:
{context}

QUESTION: {question}

Assistant:
```
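For reference, the retrieval step of this baseline pipeline looks roughly like the sketch below. The index name, field names and example query are placeholders for illustration; the encode-then-k-NN-search flow is what matters.

```python
from opensearchpy import OpenSearch
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

def retrieve_context(query, top_k=10):
    """Vector-match the query against the chunk index and return the top chunks."""
    query_vector = encoder.encode(query).tolist()
    # "qa-chunks", "chunk_vec" and "chunk_text" are placeholder index / field names
    response = client.search(
        index="qa-chunks",
        body={
            "size": top_k,
            "query": {"knn": {"chunk_vec": {"vector": query_vector, "k": top_k}}},
        },
    )
    return [hit["_source"]["chunk_text"] for hit in response["hits"]["hits"]]

chunks = retrieve_context("adverse effects of metformin")  # illustrative query
context = "\n\n".join(chunks)
```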
While the structure of the prompt is pretty standard, LLMLingua explicitly requires the prompt to be composed of an instruction (the System prompt beginning with Human:), the demonstration (the {context}), and the question (the actual query to the RAG pipeline). The LLMLingua PromptCompressor's compress_prompt function expects these to be passed separately as parameters. Presumably, it compresses the demonstration with respect to the instruction and the question, i.e. context tokens that are non-essential given the instruction and question are dropped during the compression process.
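In other words, the single prompt template above has to be decomposed before compression. A minimal sketch of that decomposition, reusing the chunks retrieved earlier (the variable names match the compress_prompt call shown further down):

```python
# instruction -- the system prompt portion of the template above
instruction = (
    "You are a medical expert tasked with answering questions expressed as "
    "short phrases. Given the following CONTEXT, answer the QUESTION."
)

# demonstration -- the list of retrieved context chunks
contexts = chunks

# question -- the raw query posed to the RAG pipeline
query = "..."  # placeholder for the actual user query
```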
The baseline for the experiment uses the context as retrieved from the vector store without compression, and we evaluate the effects of prompt compression using the two models listed in LLMLingua's Quick Start -- llmlingua-2-bert-base-multilingual-cased-meetingbank (the small model) and llmlingua-2-xlm-roberta-large-meetingbank (the large model). The three pipelines -- baseline, compression using the small model, and compression using the large model -- are run against my 50 query dataset. The examples imply that the compressed prompt can be provided as-is to the LLM, but I found that (at least with the small model) the resulting compressed prompt generates answers that do not always capture all of the question's nuance. So I ended up substituting only the {context} part of the prompt with the generated compressed prompt in my experiments.
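Concretely, the compressed context just replaces the {context} slot in the original template before the call to Claude-2 on AWS Bedrock. The sketch below uses the boto3 bedrock-runtime client with the Anthropic text-completion request body that Claude-v2 expects; the max_tokens_to_sample and temperature values are illustrative.

```python
import json
import boto3

bedrock = boto3.client("bedrock-runtime")

PROMPT_TEMPLATE = """Human: You are a medical expert tasked with answering questions
expressed as short phrases. Given the following CONTEXT, answer the QUESTION.

CONTEXT:
{context}

QUESTION: {question}

Assistant:"""

def answer_question(question, context):
    # substitute only the {context} part with the compressed context,
    # keeping the instruction and question intact
    prompt = PROMPT_TEMPLATE.format(context=context, question=question)
    response = bedrock.invoke_model(
        modelId="anthropic.claude-v2",
        body=json.dumps({
            "prompt": prompt,
            "max_tokens_to_sample": 1024,   # illustrative value
            "temperature": 0.0,
        }),
    )
    return json.loads(response["body"].read())["completion"]
```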
Our evaluation metric is Answer Relevance as defined by the RAGAS project. It is a measure of how relevant the generated answer is given the question. To calculate it, we prompt the LLM to generate a number of questions (in our case, up to 10) from the generated answer. We then compute the cosine similarity of the vector of each generated question with the vector of the actual question. The average of these cosine similarities is the Answer Relevance. Question generation from the answer is done by prompting Claude-2, and vectorization of the original and generated questions is done using the same SentenceTransformers encoder we used for retrieval.
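Here is a sketch of that computation. The question-generation prompt wording is my own paraphrase rather than the exact RAGAS prompt, and llm_complete is assumed to be a thin wrapper around the same Claude-2 on Bedrock call used in the rest of the pipeline.

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

QGEN_PROMPT = """Human: Generate up to 10 short questions that the following ANSWER
would be a good answer to. Write one question per line.

ANSWER:
{answer}

Assistant:"""

def answer_relevance(question, answer, llm_complete):
    """Mean cosine similarity between the original question and the
    questions generated from the answer (llm_complete wraps the LLM call)."""
    completion = llm_complete(QGEN_PROMPT.format(answer=answer))
    gen_questions = [line.strip() for line in completion.splitlines() if line.strip()]
    q_vec = encoder.encode([question])          # shape: (1, dim)
    g_vecs = encoder.encode(gen_questions)      # shape: (num_generated, dim)
    sims = util.cos_sim(q_vec, g_vecs)          # shape: (1, num_generated)
    return sims.mean().item()
```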
Contrary to what I saw in my first example, the results were mixed when run against the 50 queries. Prompt Compression does result in faster response times, but it degraded the Answer Relevance scores more often than it improved them. This is true for both the small and large compression models. Here are plots of the difference in Answer Relevance score between the compressed prompt and the baseline uncompressed prompt, for each compression model. The vertical red line separates the cases where compression hurts answer relevance (left side) from those where it improves answer relevance (right side). In general, it seems like compression helps when the input prompt is longer, which intuitively makes sense. But there doesn't seem to be a simple way to know up front whether prompt compression is going to help or hurt.
I used the following parameters to instantiate LLMLingua's PromptCompressor object and to call its compress_prompt function. These are the same parameters that were shown in the Quick Start. I might have gotten different / better results if I had experimented a bit with the parameters.
```python
from llmlingua import PromptCompressor

compressor = PromptCompressor(model_name=model_name, use_llmlingua2=True)

compressed = compressor.compress_prompt(
    contexts, instruction=instruction, question=query,
    target_token=500, condition_compare=True, condition_in_question="after",
    rank_method="longllmlingua", use_sentence_level_filter=False,
    context_budget="+100", dynamic_context_compression_ratio=0.4,
    reorder_context="sort")

compressed_context = compressed["compressed_prompt"]
```
A few observations about the compressed context. The number of context documents changes before and after compression. In my case, all input contexts had 10 chunks, and the output would vary between 3-5 chunks, which probably leads to the elimination of Lost in the Middle side effects, as claimed in LLMLingua's documentation. Also, the resulting context chunks are shorter and seem to be strings of keywords rather than coherent sentences, basically unintelligible to human readers but intelligible to the LLM.
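One quick way to check this is to count and eyeball the chunks on either side of the compression step. The double-newline separator below is an assumption about how chunks are delimited in the compressed output; adjust it if your output is delimited differently.

```python
# contexts: list of 10 retrieved chunks; compressed: dict returned by compress_prompt
compressed_chunks = [c for c in compressed["compressed_prompt"].split("\n\n") if c.strip()]

print(f"chunks before compression : {len(contexts)}")
print(f"chunks after compression  : {len(compressed_chunks)}")

# eyeball a compressed chunk -- typically a string of keywords rather than
# coherent sentences
print(compressed_chunks[0])
```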
Overall, Prompt Compression seems like an interesting and very powerful technique that can result in savings of time and money if used judiciously. Their paper shows very impressive results on some standard benchmark datasets, with supervised learning style metrics, using a variety of compression ratios. I used Answer Relevance because it can be computed without needing domain experts to grade the generated answers. But it is likely that I am missing some important optimization, so I am curious if any of you have tried it, and if your results are different from mine. If so, I would appreciate any pointers to things you think I might be missing.