Sunday, December 08, 2024

Trip Report - PyData Global 2024

I attended PyData Global 2024 last week. It's a virtual conference, so I was able to attend it from the comfort of my home, although presentations seemed to be scheduled for maximum convenience, time-wise, for folks on the US East Coast and in Western Europe, so some of them were a bit early for me. There were four main tracks -- the General Track, the Data / Data Science Track, the AI / ML track and the LLM track -- where talks were presented in parallel. Fortunately, because it was virtual, there were recordings, which were made available almost immediately following the actual talk. So I was able to watch recordings of some of the talks I would have missed otherwise, and even squeeze in a few urgent work-related meetings during the conference. So anyway, it's not like I watched every presentation, but I did get to watch quite a few based on my interests. Some were genuinely groundbreaking and / or new to me (and hence useful), and some others less so. But I enjoyed being there and being part of the awesome PyData community, so overall it was a net positive for me. Here is a Trip Report of the talks I attended, hope you find it useful.

Day 1 -- 03-Dec-2024

Understanding API Dispatching in NetworkX

The presenter describes how the NetworkX library seamlessly interfaces with faster algorithms from more modern, high-performance libraries, while exposing the same (or almost the same) API to the user. The additional information is usually passed in the form of extra parameters, or custom subclasses of the original parameter. One cool idea is that a new backend must, at a minimum, also pass the tests written for the original NetworkX implementation. I am probably never going to be a PyData library maintainer, but I thought this was a useful technique one could use to hook up legacy code, which most of us probably have a lot of in our own applications, to newer backends with minimal risk.
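
As a concrete illustration of the dispatch mechanism, here is roughly what the user-facing API looks like; this is my own sketch, and the backend= keyword and the "cugraph" backend name reflect my understanding of the NetworkX 3.x dispatch docs rather than anything shown in the talk.

import networkx as nx

G = nx.erdos_renyi_graph(1000, 0.01, seed=42)

# default pure-Python implementation
pr = nx.pagerank(G)

# same call, dispatched to a faster backend if (and only if) it is installed;
# this fails if the nx-cugraph backend package is missing
pr_fast = nx.pagerank(G, backend="cugraph")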

Streamlining AI Development and Deployment with KitOps

The presentation provides a tutorial for using KitOps, a standards-based packaging and versioning system for AI / ML projects. It is definitely more integrated and feature-rich than a strategy of saving your code with Git and your data with DVC, but it also requires you to learn a new command (kit) with an extensive set of subcommands that do almost anything you can dream of doing with AI / ML deployment.

Enabling Multi-Language Programming in Data Engineering Workflows

The presentation demonstrates the use of Snakemake, an open-source, Python-based command-line orchestration tool, to orchestrate a Clinical Trials Data Engineering workflow containing code written in Python, R and SAS. An interesting (probably innovative) twist was the use of Jinja2 to generate Snakemake files from workflow-specific templates. Snakemake seems very similar to Makefiles, which I used earlier, before my Java / Scala days, when we switched to more JVM-friendly alternatives like Ant, Maven and SBT. More recently, I see some (Python) projects using Makefiles as well, although Jenkins and Airflow seem more popular. I think Snakemake is likely to be useful for the kind of work I do, which may not be able to justify the costs associated with Airflow or similar, but which would benefit from orchestration functionality nonetheless.
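
A minimal sketch (not from the talk) of the Jinja2-to-Snakemake idea; the template, rule names, commands and file paths below are all made up.

from jinja2 import Template

SNAKEFILE_TEMPLATE = Template("""
rule all:
    input: "{{ final_output }}"

{% for step in steps %}
rule {{ step.name }}:
    input: "{{ step.input }}"
    output: "{{ step.output }}"
    shell: "{{ step.command }} {input} {output}"
{% endfor %}
""")

steps = [
    {"name": "clean", "input": "data/raw.csv", "output": "data/clean.csv",
     "command": "python scripts/clean.py"},
    {"name": "fit_model", "input": "data/clean.csv", "output": "results/fit.rds",
     "command": "Rscript scripts/fit.R"},
]
with open("Snakefile", "w") as f:
    f.write(SNAKEFILE_TEMPLATE.render(steps=steps, final_output="results/fit.rds"))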

Keynote -- Embrace the Unix Command Line and Supercharge your PyData Workflow

The speaker describes various Unix commands (only some of which I was aware of, I am sorry to say, despite my relatively long association with Unix) that can make your life as a Data Scientist / Engineer easier. I am also very envious of his very colorful and information-rich command prompt. That said, there is some intersection between the tools he describes and the ones I use, and I have a few of my own that I swear by that he doesn't cover. But definitely a good presentation to watch if you use Unix / Linux; you will probably pick up a few useful new commands.

akimbo: vectorized processing of nested / ragged dataframe columns

The presenter describes akimbo, a DataFrame accessor for nested, ragged and otherwise awkward non-tabular data. Using the akimbo accessor allows for vector-speed compute on structures that are hard to express in Numpy form. Akimbo can be used from within Pandas, Polars, cuDF and Dask, as long as they use the Arrow backend.

Cost-effective data annotation with Bayesian experimental design

As the title implies, this talk is more about experimental design rather than a specific DS / ML framework. It describes techniques for identifying the most informative data points for human labeling, which in turn is likely to be most useful for model training. It reminded me a bit of Active Learning, where you identify high confidence predictions from an intermediate model to train future models. The presenter also relates this approach to binary search, which has similar characteristics. He also references OptBayesExpt, a package for Optimal Bayesian Experiment Design.

Effective GenAI Evaluations: Mitigate Hallucinations and Ship Fast

The presenter is one of the founders of Galileo, a company I follow for their cutting-edge research in areas related to Generative AI. Among their innovations is ChainPoll, a technique that uses Chain of Thought (CoT) reasoning to determine if an LLM is hallucinating. He then describes Luna-8B (based on the BERT-class DeBERTa-v3-large model), a model now offered as part of the Galileo software, which is capable of detecting hallucinations without CoT. He also talks about LunaFlow, another part of the Galileo software, which wraps the Luna-8B model.

Holistic Evaluation of Large Language Models

The presentation talks about NLP metrics such as BLEU and ROUGE, and how they are not really suitable for evaluating the generated output of LLMs. It then goes on to introduce more advanced metrics such as BERTScore and perplexity. Overall, a good overview of NLP metrics for folks who are new to NLP.
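
For readers who haven't used these metrics, here is a small sketch of computing BERTScore via the Hugging Face evaluate wrapper; the example strings are made up, and this is my own illustration rather than anything from the talk.

import evaluate

bertscore = evaluate.load("bertscore")
results = bertscore.compute(
    predictions=["the cat sat on the mat"],
    references=["a cat was sitting on the mat"],
    lang="en",
)
# embedding-based (semantic) overlap rather than the n-gram overlap of BLEU / ROUGE
print(results["precision"], results["recall"], results["f1"])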

Let's get you started with asynchronous programming

I got my own start in asynchronous programming via LangChain's ainvoke call, mostly by following examples prescriptively and acting on suggestions from the Python interpreter's error messages. I found this session useful as it gave me a more holistic understanding of asynchronous programming in Python, including learning what a Python coroutine is.
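
For reference, here is a minimal example of the kind of coroutine basics the session covered (my own sketch, not the presenter's code).

import asyncio

async def fetch(name: str, delay: float) -> str:
    # calling fetch() just creates a coroutine object; nothing runs
    # until it is awaited or scheduled as a task on the event loop
    await asyncio.sleep(delay)   # yields control back to the event loop
    return f"{name} done"

async def main():
    # run both coroutines concurrently and collect their results
    results = await asyncio.gather(fetch("a", 1.0), fetch("b", 0.5))
    print(results)

asyncio.run(main())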

Fairness Tales: Measure / Mitigate Unfair Bias in ML

This presentation describes various fairness metrics that use the distribution of features and labels in the training data itself to determine whether the data (and thence the model) is biased or not. The metrics are illustrated in the context of a recruitment application.

Understanding Polars Data Types

A good general overview of data types used in Polars and what each is good for. I am trying to move off Pandas and on to Polars for new projects, so I thought this was useful.
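
A small sketch (mine, not the presenter's) of inspecting and adjusting Polars data types; the column names and casts are arbitrary.

import polars as pl

df = pl.DataFrame({
    "id": [1, 2, 3],
    "price": [9.99, 14.50, None],
    "label": ["a", "b", "a"],
})
print(df.schema)   # column name -> data type mapping

df = df.with_columns(
    pl.col("label").cast(pl.Categorical),   # dictionary-encoded strings
    pl.col("price").cast(pl.Float32),       # narrower float saves memory
)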

Build simple and scalable data pipelines with Polars and DeltaLake

This was a very interesting presentation that showed the challenges of building a pipeline over data which may need to be updated retroactively and whose format may change over time. The presenter shows that using Polars (which uses the Parquet file format by default) and Pandas (with the Parquet file format) along with DeltaLake (via delta-rs, a standalone Rust-based implementation) can address all these problems very effectively, as well as provide ACID query and update guarantees on the data. I also learned that DeltaLake does not imply Spark or Databricks, as I had previously thought.
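
A minimal sketch of what this looks like from Polars, assuming the deltalake (delta-rs) package is installed; the table path and schema are made up.

import polars as pl

df = pl.DataFrame({"id": [1, 2], "value": [10.0, 20.0]})

# creates the Delta table on first write, appends on subsequent writes
df.write_delta("data/metrics", mode="append")

# readers see a consistent snapshot even while writers append or rewrite rows
snapshot = pl.read_delta("data/metrics")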

Measuring the User Experience and the impact of Effort on Business Outcomes

Another presentation that is not about libraries or application development. The presenter describes the defining features of user experience within an application, and shows that User Effort, i.e. how much effort the user has to expend to achieve their goals, is the most meaningful success metric. She then describes some possible approaches, both statistical and domain-driven, for deriving the User Effort metric for a given application.

Day 2 -- 04-Dec-2024

Boosting AI Reliability: Uncertainty Quantification with MAPIE

This presentation describes the MAPIE library, a Model Agnostic Prediction Interval Estimator used for quantifying the uncertainty and risk of ML models. It can be used to compute conformal prediction intervals (similar to confidence intervals, but predicting a range of values for future observations) and to calibrate models (transform model scores into probabilities). It can be called via a wrapper around any Scikit-Learn (or compatible) model.
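
A rough sketch of the wrapper-style usage as I understand it; the MAPIE API has been evolving, so treat the class and argument names below as assumptions and check the docs.

from mapie.regression import MapieRegressor
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=42)

mapie = MapieRegressor(estimator=RandomForestRegressor(), method="plus", cv=5)
mapie.fit(X, y)

# y_pred: point predictions; y_pis: lower / upper bounds of the 90% prediction interval
y_pred, y_pis = mapie.predict(X, alpha=0.1)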

The art of wrangling your GPU Python Environments

This presentation discusses the challenges of correctly configuring GPU environments, given the myriad dependencies across hardware, drivers, CUDA, C++ and Python. The presenters describe how the Conda package manager handles this via virtual packages, which let it detect and depend on GPU capabilities provided by the system rather than by Conda itself. They also describe RAPIDS (they are from NVidia) and Rapids Doctor (also from NVidia), a new tool that allows users to quickly resolve GPU issues.

Extraction Pipelines: ColPali's Vision Powered RAG for Enterprise Documents

ColPali is a recent and encouraging approach to "Multimodal RAG". Effectively, it cuts up an input PDF into patches and encodes them with a specialized multimodal-aware embedding, then uses ColBERT-style late interaction to find the parts of the input that best satisfy the query. This presentation covers how ColPali works, effectively enabling the pipeline to "see" and reason over documents.

Fast, Intuitive Feature Selection via regression on Shapley Values

This presentation describes a novel approach to feature selection. Ordinarily, one would detect the most important features by adding or removing features one at a time and training a model for a few epochs each time. This approach instead computes the Shapley values once, fits a linear or logistic regression of the target on the Shapley values of the features, and uses the results to implement a feature selection heuristic that is competitive with the earlier, more heavyweight approaches. They provide an open source library, shap-select, that implements this approach.
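
This is not the shap-select library itself, just a sketch of the idea as I understood it, using shap and scikit-learn; the model and data variables are placeholders.

import shap
from sklearn.linear_model import LinearRegression

# trained_model is any tree-based regressor (e.g. an XGBoost or LightGBM model)
# already fit on X_train / y_train; X_valid / y_valid is a held-out DataFrame / target
explainer = shap.TreeExplainer(trained_model)
shap_values = explainer.shap_values(X_valid)   # shape: n_samples x n_features

# regress the target on the Shapley values of the features; features whose
# coefficients are indistinguishable from zero become candidates for removal
selector = LinearRegression().fit(shap_values, y_valid)
feature_scores = dict(zip(X_valid.columns, selector.coef_))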

Keynote: Do Python and Data Science Matter in our AI Future?

Not sure if the presenter ended up answering the question (he likely did, I might have missed it). But he raised some very important issues about software (especially Open Source software) being more about relationships than property, and how collaboration is bigger than capitalism. One of his observations that resonated with me was that Open Source is a path to permission-less innovation. Another interesting observation was that a dataset is just a quantized, frozen model.

GraphRAG: Bringing together graph and vector search to empower retrieval

This presentation posits that vector search can be augmented by graph-based search, and then demonstrates this by augmenting a Naive RAG pipeline (query -(retriever)-> context, query + context -(LLM)-> answer) with a Kuzu-backed graph DB. I learned several things from this presentation -- first, it is probably more convenient to use Kuzu instead of Neo4j Community Edition for my graph POCs, and second, more than just the entity-relationship paths, it may be worth returning representative content for entities along these paths. Definitely something to try out in the future.

Rapid deduplication and fuzzy matching of large datasets using Splink

This presentation describes Splink, a data linkage library for medium to large datasets. Splink is available on Databricks, where it is suitable for deduplicating datasets with 100 million+ records. Interestingly, when we find duplicates within the same dataset it is called deduplication, but when we do this across multiple datasets it is called record linkage.

Statically Compiled Julia for Library Development

Julia is a JIT-compiled language and it can be called from Python. When called from Python, the Julia functionality is statically compiled down to high-performing native code. Unfortunately, this currently means that the entire Julia runtime is statically linked. This presentation describes work in the Julia community to modify this behavior, so it restricts the modules linked to only those referenced from the exposed entry-points, resulting in smaller and lighter-weight executables.

Let our Optima Combine

This presentation introduces Constraint Optimization and the OR-Tools package from Google. It's been a while since I used Linear Programming or similar tools, so it was nice to know they exist for Python. If I ever end up doing this kind of thing for work or a hobby, I might look at OR-Tools.
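
For the curious, here is a tiny constraint optimization example using the CP-SAT solver from OR-Tools (my own toy problem, not from the talk).

from ortools.sat.python import cp_model

model = cp_model.CpModel()
x = model.NewIntVar(0, 10, "x")
y = model.NewIntVar(0, 10, "y")
model.Add(x + 2 * y <= 14)       # a constraint
model.Maximize(x + y)            # the objective

solver = cp_model.CpSolver()
status = solver.Solve(model)
if status in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    print("x =", solver.Value(x), "y =", solver.Value(y))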

Unlocking the Power of Hybrid Search: A Deep Dive into Python powered Precision and Scalability

This presentation described a Hybrid RAG pipeline that combines vector and lexical search, with an RRF (Reciprocal Rank Fusion) step to merge the results, and showed that the merged results end up being generally more useful for answer generation since they combine the best of both worlds.
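
RRF itself is simple enough to sketch in a few lines; this is a generic implementation of the merge step, not the presenter's code, and the document ids are made up.

def reciprocal_rank_fusion(result_lists, k=60):
    # each element of result_lists is a ranked list of document ids,
    # e.g. one from vector search and one from lexical (BM25) search
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

merged = reciprocal_rank_fusion([["d3", "d1", "d7"], ["d1", "d9", "d3"]])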

Automatic Differentiation, a tale of two languages

The presentation looks at the differences between Python and Julia with respect to how AutoDiff functionality is implemented. With Python, it is part of external frameworks like Pytorch / Tensorflow / JAX, whereas with Julia it is part of the language. Julia has multiple pluggable AutoDiff implementations that can be used in different situations. This talk also helped address some questions about the differences between the Pytorch and Tensorflow AutoDiff implementations that came up in our Deep Learning book reading group on TWIML.
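
As a point of reference for the Python side, reverse-mode AutoDiff in Pytorch looks like this (a trivial example of my own).

import torch

x = torch.tensor([2.0, 3.0], requires_grad=True)
y = (x ** 2).sum()    # y = x1^2 + x2^2
y.backward()          # reverse-mode automatic differentiation
print(x.grad)         # tensor([4., 6.]) == dy/dx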

Navigating Cloud Expenses in Data and AI: Strategies for Scientists and Engineers

The presentation describes the Open Source Metaflow library and its managed version Outerbounds, meant to help with development and deployment of DS / ML / AI projects. An interesting observation from the presenter is the complementarity of requirements from the Data Scientist versus the Operations Engineer. The presenter identifies issues such as GPU rent-vs-buy decisions, the human-vs-infra cost tradeoff and the importance of choosing the right instance type for the problem being solved, and shows how Outerbounds helps to identify and solve these issues.

Julia ML Ecosystem

Last time I looked at Julia, it was just starting out as a "Data Science language" and had nowhere close to the ecosystem that Python had (and continues to have). This presentation showed me a different (and much improved) picture, where it has already implemented equivalents for linear algebra (similar to Numpy / Pytorch / JAX), dataframe processing (DataFrames.jl and Tidier.jl, analogous to Pandas / Polars), visualization (Makie.jl, JuliaPlots.jl and AlgebraOfGraphics.jl, analogous to Matplotlib / Seaborn), Machine Learning (MLJ.jl, analogous to Scikit-Learn) and Deep Learning (Flux.jl, analogous to Keras), etc. In addition, it is possible to call Python from Julia (and vice versa), so they can take advantage of each other's ecosystems.

Pytorch Workflow Mastery: A Guide to Track and Optimize Model Performance

This presentation is a good introduction to using Pytorch, demonstrating how to build a basic Convolutional Neural Network and train it with images from the CIFAR-10 dataset. It covers a few things that have become available / popular since I started working with Pytorch, so these parts were useful. Among them are the use of model.compile to generate a compiled model (similar to Tensorflow's Data Flow Graph), the use of canned metrics via the torchmetrics package, and integration with Weights and Biases (wandb.init()) and Optuna for Bayesian Hyperparameter optimization.
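
A compressed sketch of some of the pieces mentioned above (the compiled model, torchmetrics and Weights and Biases); note that the compiled-model feature is usually written as torch.compile(model) in the Pytorch 2.x docs, and MyConvNet, train_loader and optimizer below are placeholders rather than anything from the talk.

import torch
import torchmetrics
import wandb

wandb.init(project="cifar10-demo")                 # experiment tracking
model = torch.compile(MyConvNet())                 # compiled model (Pytorch 2.x)
metric = torchmetrics.Accuracy(task="multiclass", num_classes=10)

for images, labels in train_loader:
    logits = model(images)
    loss = torch.nn.functional.cross_entropy(logits, labels)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    metric.update(logits, labels)

wandb.log({"train_acc": metric.compute().item()})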

New Features in Apache Spark 4.0

I attended this presentation because I am a former Spark user. I haven't used it (at least not heavily) for the last couple of years since the data I need is now more conveniently available in Snowflake, but I was curious about what new functionality it has gained since I last used it. The presentation covers the ANSI SQL mode, the VARIANT data type that allows JSON and XML data to be parsed natively (up to 8x faster), the changes in Spark Connect to decouple the client from the server, making it possible to build Spark connectors in languages such as Rust and Go, parameterized queries, and User Defined Table Functions.

Day 3 -- 05-Dec-2024

The LEGO Approach to designing PyData workflows

The presenter describes her ideas for designing application systems with components built to interlock with each other like Lego bricks, and her implementation of these ideas in the DataJourney framework.

Time Series Analysis with StatsModels

This was a workshop conducted by Allen Downey, the author of Think Stats. Specifically, this workshop covered Chapter 12 of the book, applying the statsmodels library to Time Series analysis. The workshop uses statsmodels to decompose a time series representing electricity generation over the last 20+ years into trend, seasonal and random components, using additive and multiplicative decompositions to predict future data points from past data, and using ARIMA (Auto-Regressive Integrated Moving Average) models. I feel like I understand time series and ARIMA better than I used to, although I am sure I have just scratched the surface of this topic.
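
The statsmodels calls involved look roughly like this (a sketch based on my notes, assuming a monthly pd.Series called series rather than the workshop's actual dataset).

from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima.model import ARIMA

# split the series into trend, seasonal and residual components
decomposition = seasonal_decompose(series, model="additive", period=12)
trend, seasonal, resid = (decomposition.trend, decomposition.seasonal,
                          decomposition.resid)

# fit a simple ARIMA model and forecast the next 12 months
arima = ARIMA(series, order=(1, 1, 1)).fit()
forecast = arima.forecast(steps=12)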

Building an AI Travel Agent that never Hallucinates

Hallucination is a feature of LLMs rather than a bug. So it seems like a tall order to build an AI Travel Agent (or any LLM based agent in general) that never hallucinates. However, and somewhat obviously in hindsight, one way to address the problem is to severely limit the agent's capability to make decisions. The CALM (Conversational AI with Language Models) framework from Rasa implements this by setting up the equivalent of a phone tree and giving the LLM only the capability to jump from node to node in the tree. I thought this was brilliant, because for most applications where you want an agent, you don't need (or want) full-blown AGI.

Evaluating RAGs: On the correctness and coherence of Open Source eval metrics

This presentation is a bit meta: it evaluates LLM evaluation metrics available from Open Source frameworks such as RAGAS and TruLens, across different LLMs like Claude Sonnet, GPT-3.5, GPT-4, Llama2-70B and Llama3-70B. The results show that these metrics yield wildly different values for the same content. The presenters indicate that future work should evaluate these results against human judgment.

Building Knowledge Graph based Agents with Structured Text Generation and Open-Weights Models

This was a great presentation on using Structured Text Generation (via the outlines library) to build a Knowledge Graph from content. Structured text output also makes it convenient to model Agents that execute actions through function calls. The presenter uses these ideas to first generate a Knowledge Graph from a dataset, then implements an Agentic Query pipeline that queries this Knowledge Graph.

From Features to Inference: Build a Core ML Platform from Scratch

This is a very impressive live-coding presentation where the presenter sets up an ML pipeline from scratch, including an Inference Engine, Model Registry, Feature Store and an Event Bus to connect them all together using an Event Driven design. One good piece of advice here was to align the software with the language of the business, i.e. domain-driven design. Another was to build "default" implementations that you can write tests against, and replace them with "real" components as and when they come up. Expectations for these components are already codified in the unit tests, so the new components must satisfy the same expectations. There are some very interesting (dependency-injection-like) code patterns, some of which reminded me of my Java / Spring days.
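
The "default implementation" idea is easy to illustrate; this sketch is my own reconstruction of the pattern, not the presenter's code.

from typing import Protocol

class ModelRegistry(Protocol):
    def save(self, name: str, model: object) -> None: ...
    def load(self, name: str) -> object: ...

class InMemoryModelRegistry:
    # the "default" implementation: good enough to codify expectations in tests,
    # later swapped for a real registry (database, MLflow, etc.) that must pass
    # the same tests
    def __init__(self):
        self._models = {}
    def save(self, name: str, model: object) -> None:
        self._models[name] = model
    def load(self, name: str) -> object:
        return self._models[name]

class InferenceEngine:
    # the registry is injected rather than constructed internally, so swapping
    # implementations requires no changes to the engine
    def __init__(self, registry: ModelRegistry):
        self.registry = registry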

Putting the Data Science back into LLM Evaluation

This presentation covers a lot of familiar ground for folks who have worked with LLMs for some time. However, there are some new ideas here as well. One of them is the use of heuristic-based guardrails, such as checking the length of the output, matching patterns in the output using regexes, and using computed metrics such as Flesch-Kincaid scores. Another is the use of chatbot-arena style scoring to evaluate relative improvements. The presenters have created Parlance, an open source LLM evaluation tool that implements such a chatbot-arena style model-to-model comparison metric.

Making Gaussian Processes Useful

The presentation is about Gaussian Processes, but because these are used here as part of hierarchical probabilistic models, which most people are not that familiar with, the first part introduces PyMC and hierarchical models, and the second part covers how Gaussian Processes can model the effect of continuous variables as a family of functions rather than as a single variable. I watched this presentation because I was familiar with probabilistic hierarchical models, having used PyMC3 in the past, when it was backed by the forked version of Theano and NUTS was the state of the art sampler. Now it is backed by JAX and there is an even faster sampler based on Rust. But GPs were new to me, so I learned something new.

I might watch a few more presentations when I have time. PyData / NumFOCUS are generally very good about sharing the presentations openly, but it is likely to be 1-2 months before that happens. I will watch for the announcement and update this post with the information, but in the meantime, that's all I have to say about PyData Global 2024. I hope you found it interesting and useful.

Saturday, October 05, 2024

Using Knowledge Graphs to enhance Retrieval Augmented Generation

Retrieval Augmented Generation (RAG) has become a popular approach to harness LLMs for question answering using your own corpus of data. Typically, the context to augment the query that is passed into the Large Language Model (LLM) to generate an answer comes from a database or search index containing your domain data. When it is a search index, the trend is to use Vector search (HNSW ANN based) over Lexical (BM25/TF-IDF based) search, often combining both Lexical and Vector searches into Hybrid search pipelines.

In the past, I have worked on Knowledge Graph (KG) backed entity search platforms, and observed that for certain types of queries, they produce results that are superior / more relevant compared to those produced by a standard lexical search platform. The GraphRAG framework from Microsoft Research describes a comprehensive technique for leveraging a KG for RAG. GraphRAG helps produce better quality answers in the following two situations.

  • the answer requires synthesizing insights from disparate pieces of information through their shared attributes
  • the answer requires understanding summarized semantic concepts over part of or the entire corpus

The full GraphRAG approach consists of building a KG out of the corpus, and then querying the resulting KG to augment the context in Retrieval Augmented Generation. In my case, I already had access to a medical KG, so I focused on building out the inference side. This post describes what I had to do to get that to work. It is based in large part on the ideas described in this Knowledge Graph RAG Query Engine page from the LlamaIndex documentation.

At a high level, the idea is to extract entities from the question, and then query a KG with these entities to find and extract relationship paths, single or multi-hop, between them. These relationship paths are used, in conjunction with context extracted from the search index, to augment the query for RAG. The relationship paths are the shortest paths between pairs of entities in the KG, and we only consider paths up to 2 hops in length (since longer paths are likely to be less interesting).

Our medical KG is stored in an Ontotext RDF store. I am sure we can compute shortest paths in SPARQL (the standard query language for RDF) but Cypher seems simpler for this use case, so I decided to dump out the nodes and relationships from the RDF store into flat files that look like the following, and then upload them to a Neo4j graph database using neo4j-admin database import full.

# nodes.csv
cid:ID,cfname,stygrp,:LABEL
C8918738,Acholeplasma parvum,organism,Ent
...

# relationships.csv
:START_ID,:END_ID,:TYPE,relname,rank
C2792057,C8429338,Rel,HAS_DRUG,7
...

The first line in each CSV file is the header that informs Neo4j about the schema. Here our nodes are of type Ent and relationships are of type Rel, cid is an ID attribute that is used to connect nodes, and the other elements are (scalar) attributes of each node. Entities were extracted using our Dictionary-based Named Entity Recognizer (NER), based on the Aho-Corasick algorithm, and shortest paths are computed between each pair of extracted entities (indicated by the placeholders _LHS_ and _RHS_) using the following Cypher query.

MATCH p = allShortestPaths((a:Ent {cid:'_LHS_'})-[*..]-(b:Ent {cid:'_RHS_'}))
RETURN p, length(p)

Shortest paths returned by the Cypher query that are more than 2 hops long are discarded, since these don't indicate strong / useful relationships between the entity pairs. The resulting list of relationship paths are passed into the LLM along with the search result context to produce the answer.
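
For completeness, here is roughly how the query above can be run from Python with the neo4j driver, with driver parameters standing in for the _LHS_ / _RHS_ placeholders; the connection details are made up, and the property names follow the CSV schema shown earlier.

from neo4j import GraphDatabase

CYPHER = """
MATCH p = allShortestPaths((a:Ent {cid: $lhs})-[*..]-(b:Ent {cid: $rhs}))
RETURN p, length(p) AS hops
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "secret"))
rel_paths = []
with driver.session() as session:
    for record in session.run(CYPHER, lhs="C2792057", rhs="C8429338"):
        if record["hops"] > 2:
            continue   # discard paths longer than 2 hops
        path = record["p"]
        node_names = [node["cfname"] for node in path.nodes]
        rel_names = [rel["relname"] for rel in path.relationships]
        rel_paths.append((node_names, rel_names))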

We evaluated this implementation against the baseline RAG pipeline (our pipeline minus the relation paths) using the RAGAS metrics Answer Correctness and Answer Similarity. Answer Correctness measures the factual similarity between the ground truth answer and the generated answer, and Answer Similarity measures the semantic similarity between these two elements. Our evaluation set was a set of 50 queries where the ground truth was assigned by human domain experts. The LLM used to generate the answer was Claude-v2 from Anthropic while the one used for evaluation was Claude-v3 (Sonnet). The table below shows the averaged Answer Correctness and Similarity over all 50 queries, for the Baseline and my GraphRAG pipeline respectively.

Pipeline Answer Correctness Answer Similarity
Baseline 0.417 0.403
GraphRAG (inference) 0.737 0.758

As you can see, the performance gain from using the KG to augment the query for RAG seems to be quite impressive. Since we already have the KG and the NER available from previous projects, this is a very low effort addition to make to our pipeline. Of course, we would need to verify these results using further human evaluations.

I recently came across the paper Knowledge Graph based Thought: A Knowledge Graph enhanced LLM Framework for pan-cancer Question Answering (Feng et al, 2024). In it, the authors identify four broad classes of triplet patterns that their questions (i.e., in their domain) can be decomposed into, and addressed using reasoning approaches backed by Knowledge Graphs -- One-hop, Multi-hop, Intersection and Attribute problems. The idea is to use an LLM prompt to identify the entities and relationships in the question, then use an LLM to determine which of these templates should be used to address the question and produce an answer. Depending on the path chosen, an LLM is used to generate a Cypher query (an industry-standard query language for graph databases, originally introduced by Neo4j) to extract the missing entities and relationships in the template and answer the question. An interesting future direction for my GraphRAG implementation would be to incorporate some of the ideas from this paper.

Monday, July 29, 2024

Experiments with Prompt Compression

I recently came across Prompt Compression (in the context of Prompt Engineering with Large Language Models) in this short course on Prompt Compression and Query Optimization from DeepLearning.AI. Essentially it involves compressing the prompt text using a trained model to drop non-essential tokens. The resulting prompt is shorter (and, in cases where the original context is longer than the LLM's context limit, not truncated) but retains the original semantic meaning. Because it is shorter, the LLM can process it faster and more cheaply, and in some cases get around the Lost In the Middle problems observed with long contexts.

The course demonstrated Prompt Compression using the LLMLingua library (paper) from Microsoft. I had heard about LLMLingua previously from my ex-colleague Raahul Dutta, who blogged about it in his Edition 26: LLMLingua - A Zip Technique for Prompt post, but at the time I thought it was more in the realm of research. Seeing it mentioned in the DeepLearning.AI course made it feel more mainstream, so I tried it out on a single query from my domain using their Quick Start example, compressing the prompt with the small llmlingua-2-bert-base-multilingual-cased-meetingbank model, and using Anthropic's Claude-v2 on AWS Bedrock as the LLM.

Compressing the prompt for the single query gave me a better answer than without compression, at least going by inspecting the answer produced by the LLM before and after compression. Encouraged by these results, I decided to evaluate the technique using a set of around 50 queries I had lying around (along with a vector search index) from a previous project. This post describes the evaluation process and the results I obtained from it.

My baseline was a naive RAG pipeline, with the context retrieved by vector matching the query against the corpus, and then incorporated into a prompt that looks like this. The index is an OpenSearch index containing vectors of document chunks, vectorization was done using the all-MiniLM-L6-v2 pre-trained SentenceTransformers encoder, and the LLM is Claude-2 (on AWS Bedrock as mentioned previously).

Human: You are a medical expert tasked with answering questions
expressed as short phrases. Given the following CONTEXT, answer the QUESTION.

CONTEXT:
{context}

QUESTION: {question}

Assistant:

While the structure of the prompt is pretty standard, LLMLingua explicitly requires the prompt to be composed of an instruction (the System prompt beginning with Human:), the demonstration (the {context}) and the question (the actual query to the RAG pipeline). The LLMLingua Compressor's compress function expects these to be passed separately as parameters. Presumably, it compresses the demonstration with respect to the instruction and the question, i.e. context tokens that are non-essential given the instruction and question are dropped during the compression process.

The baseline for the experiment uses the context as retrieved from the vector store without compression, and we evaluate the effects of prompt compression using the small and large llmlingua-2 models listed in LLMLingua's Quick Start (the small one being llmlingua-2-bert-base-multilingual-cased-meetingbank). The three pipelines -- baseline, compression using the small model, and compression using the large model -- are run against my 50 query dataset. The examples imply that the compressed prompt can be provided as-is to the LLM, but I found that (at least with the small model) the resulting compressed prompt generates answers that do not always capture all of the question's nuance. So I ended up substituting only the {context} part of the prompt with the generated compressed prompt in my experiments.

Our evaluation metric is Answer Relevance as defined by the RAGAS project. It is a measure of how relevant the generated answer is given the question. To calculate this, we prompt the LLM to generate a number of (in our case, up to 10) questions from the generated answer. We then compute the cosine similarity of the vector of each generated question with the vector of the actual question. The average of these cosine similarities is the Answer Relevance. Question generation from the answer is done by prompting Claude-2, and vectorization of the original and generated questions is done using the same SentenceTransformer encoder we used for retrieval.
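
The final averaging step looks roughly like this (a sketch of my implementation, with the generated questions assumed to come back from the LLM as a list of strings).

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def answer_relevance(question: str, generated_questions: list) -> float:
    q_vec = encoder.encode([question], normalize_embeddings=True)
    g_vecs = encoder.encode(generated_questions, normalize_embeddings=True)
    sims = util.cos_sim(q_vec, g_vecs)    # 1 x N matrix of cosine similarities
    return float(sims.mean())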

Contrary to what I saw in my first example, the results were mixed when run against the 50 queries. Prompt Compression does result in faster response times, but it degraded the Answer Relevance scores more often than it improved them. This is true for both the small and large compression models. Here are plots of the difference in the Answer Relevance score between the compressed prompt and the baseline uncompressed prompt, for each compression model. The vertical red line separates the cases where compression hurts answer relevance (left side) from those where it improves answer relevance (right side). In general, it seems like compression helps when the input prompt is longer, which intuitively makes sense. But there doesn't seem to be a simple way to know up front whether prompt compression is going to help or hurt.

I used the following parameters to instantiate LLMLingua's PromptCompressor object and to call its compress_prompt function. These are the same parameters that were shown in the Quick Start. It is possible I may have gotten different / better results if I had experimented a bit with the parameters.

from llmlingua import PromptCompressor

# model_name is one of the two LLMLingua-2 compression models mentioned above
compressor = PromptCompressor(model_name=model_name, use_llmlingua2=True)

compressed = compressor.compress_prompt(contexts, instruction=instruction, question=query,
    target_token=500, condition_compare=True, condition_in_question="after",
    rank_method="longllmlingua", use_sentence_level_filter=False, context_budget="+100",
    dynamic_context_compression_ratio=0.4, reorder_context="sort")
compressed_context = compressed["compressed_prompt"]

A few observations about the compressed context. The number of context documents changes before and after compression. In my case, all input contexts had 10 chunks, and the output would vary between 3-5 chunks, which probably leads to the elimination of Lost in the Middle side-effects, as claimed in LLMLingua's documentation. Also, the resulting context chunks are shorter and seem to be strings of keywords rather than coherent sentences, basically unintelligible to human readers, but intelligible to the LLM.

Overall, Prompt Compression seems like an interesting and very powerful technique that can result in savings of time and money if used judiciously. Their paper shows very impressive results on some standard benchmark datasets with supervised-learning style metrics, using a variety of compression ratios. I used Answer Relevance because it can be computed without needing domain experts to grade additional answers. But it is likely that I am missing some important optimization, so I am curious if any of you have tried it, and if your results are different from mine. If so, I would appreciate any pointers to things you think I might be missing.

Sunday, June 30, 2024

Table Extraction from PDFs using Multimodal (Vision) LLMs

A couple of weeks ago a colleague and I participated in an internal hackathon where the task was to come up with an interesting use case for the recent multi-modal Large Language Models (LLMs). Multi-modal LLMs take not only text inputs via their prompt like earlier LLMs, but can also accept non-text modalities such as images and audio. Some examples of multi-modal LLMs are GPT-4o from OpenAI, Gemini 1.5 from Google, and Claude-3.5 Sonnet from Anthropic. The hackathon provided access to GPT-4o through Azure, Microsoft's cloud computing platform. We did not win; there were other entries that were better than ours, both in terms of the originality of their idea and the quality of their implementations. However, we learned some cool new things during the hackathon, and figured that these might be of general interest to others as well, hence this post.

Our idea was to use GPT-4o to extract and codify tables found in academic papers as semi-structured data (i.e. JSON). We could then either query the JSON data for searching within tables, or convert it to Markdown for downstream LLMs to query them easily via their text interface. We had originally intended to extend the idea to figures and charts, but we could not get that pipeline working end to end.

Here is what our pipeline looked like.

  1. Academic papers are usually available as PDFs. We use the PyMuPDF library to split the PDF file into a set of image files, where each image file corresponds to a page in the paper.
  2. We then send each page image through the Table Transformer, which returns bounding box information for each table it detects in the page, as well as a confidence score. The Table Transformer model we used was microsoft/table-transformer-detection.
  3. We crop out each table from the pages using the bounding box information, and then send each table to GPT-4o as part of a prompt asking to convert it to a JSON structure. GPT-4o responds with a JSON structure representing the table.

This pipeline was based on my colleague's idea. I like how it progressively simplifies the task by splitting each page of the incoming PDF into its own image, then uses a pre-trained Table Transformer to crop out the tables from them, and only then passes each table to GPT-4o to convert to JSON. The table image is passed into the prompt as a "data URL", which is just the base-64 encoding of the image formatted as "data:{mime_type};base64,{base64_encoded_data}". The Table Transformer, while not perfect, proved remarkably successful at identifying tables in the text. I say remarkable because we used a pre-trained model, but perhaps it is not that remarkable once you consider that it was probably trained on tables in academic papers as well.
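
Constructing the data URL itself is a one-liner apart from the base-64 encoding; here is a small sketch of the kind of helper we used (reconstructed from memory, so the exact details may differ).

import base64
import mimetypes

def to_data_url(image_path: str) -> str:
    mime_type, _ = mimetypes.guess_type(image_path)   # e.g. "image/png"
    with open(image_path, "rb") as f:
        b64_data = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime_type};base64,{b64_data}"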

Our prompt for GPT-4o looked something like this:

System: You are an AI model that specializes in detecting the tables and extracting, interpreting table content from images. Follow below instruction step by step:
1. Recognize whether the given image is table or not, if it's not a table print "None". if it's a table go to next step.
2. accurately convert the table's content into a structured structured Json format

general instruction:
1. do not output anything extra. 
2. a table must contains rows and columns

User: Given the image, detect whether it's a table or not, if it's a table then convert it to Json format
{image_data_url}

For the figure pipeline, I tried to use an OWL-ViT (Vision Transformer for Open World Localization) model in place of the Table Transformer. But it was not as successful at detecting figures in the text, probably since SAM seems to be fine-tuned to detect objects in natural images. Unfortunately we couldn't find a pre-trained model that would work for this particular case. Another issue was converting the figure into a semi-structured JSON representation; we ended up asking GPT-4o to describe the image as text instead.

One suggestion from some of my non-work TWIML colleagues was to ask GPT-4o to return the bounding boxes for the figures it finds in the page, and then use those to extract the figures to send back to GPT-4o for describing. It didn't work, unfortunately, but it was definitely worth trying. As LLMs get more and more capable, I think it makes sense to rethink our pipelines to delegate more and more work to the LLM. Or at least verify that it can't do something before moving on to older (and harder to implement) solutions.

Sunday, June 23, 2024

Book Report: Pandas Workout

Unlike many Data Scientists, I didn't automatically reach for Pandas when I needed to analyze data. I came upon this discipline (Data Science) as a Java Software Engineer who used Python for scripting, so I was quite comfortable operating on JSON / CSV / text files directly, loading data into relational databases and running SQL against them, and building visualizations with Matplotlib. So when Pandas first hit the scene, I thought it was a nice library, but I just didn't see the logic in spending time to learn another interface to do the same things I could already do. Of course, Pandas has matured since then (and so have I, hopefully), and when faced with a data analysis / preparation / cleanup task, I now often reach not only for Pandas, but depending on the task, also for its various incarnations such as PySpark, Dask DataFrames and RAPIDS cuDF. When I use Pandas (and its various incarnations), I often find myself depending heavily on Stack Overflow (and lately Github Copilot) for things I know can be done but not how. To some extent I blame this on never having spent the time to understand Pandas in depth. So when I was offered the chance to review Pandas Workout by Reuven Lerner, I welcomed it as a way to remedy this gap in my knowledge.

The book is about Pandas fundamentals rather than solving specific problems with Pandas. For that you will still want to look up Stack Overflow :-). In fact, in the foreword, the author specifically targets my demographic (people who need to look up Stack Overflow when solving problems with Pandas). But he promises that after reading the book you will understand why some solutions are better than others.

Pandas started as an open source project by Wes McKinney, and has grown somewhat organically into the top Data Science toolkit that it is today. As a result, there are often multiple ways to do something in Pandas. While all these ways may produce identical results, their performance characteristics may be different, so there is usually an implicit "right" way. The book gives you the mental model to decide which among the different approaches is the "right" one.

The book is organized into the following chapters. Each chapter covers a particular aspect of Pandas usage. I have included a super-short TLDR style abstract for each chapter for your convenience.

  1. Series -- Pandas Series objects are the basic building block of Pandas and represent a typed sequence of data, that are used to construct DataFrames and Indexes. Many methods on the Series object apply in a similar way to DataFrames as well. This is a foundational chapter, understanding this will help with future chapters.
  2. Data Frames -- DataFrames represent tabular data as a sequence of Series, where each Series object represents a column in the table. Pandas inherits the idea of DataFrames from R, and the incarnations I listed (and a few that I didn't) use the DataFrame as a basic abstraction as well. This chapter teaches you how to select from and manipulate DataFrames. Unless you've used Pandas extensively before, there is a high chance you will learn some useful new tricks here (I did, several of them).
  3. Import and Export -- covers reading and writing CSV and JSON formats to and from DataFrames. Covers some simple sanity checks you can run to verify that the import or export worked correctly. I learned about the pd.read_html method here, probably not that useful, but interesting to know!
  4. Indexes -- Indexes are used by Pandas to efficiently find data in DataFrames. While it may be possible to get by without Indexes, your Pandas code would take longer to run and consume more resources. The chapter deals with indexing techniques. I happened to know a lot of them, but there were a few that I didn't, especially the techniques around pivot tables.
  5. Cleaning -- this chapter teaches a skill that is very fundamental to (and maybe even the bane of) a Data Scientist's job. Statistics indicate that we spend 80% of our time cleaning data. Along with the techniques themselves (remove / interpolate / ignore), this chapter contains commentary that will help you frame these decisions on your own data cleaning tasks.
  6. Grouping, Joining and Sorting -- these three operations are so central to data analysis, so much so that SQL has special keywords for each operation (JOIN, GROUP BY and ORDER BY). This chapter covers various recipes to do these operations efficiently and correctly in Pandas.
  7. Advanced Grouping, Joining and Sorting -- this chapter goes into greater detail on how to combine these operations to deal with specific use-cases, the so-called "split-apply-combine" technique, including the concept of a general aggregation function agg. It also shows how to do method chaining using assign (see the sketch after this chapter list).
  8. Midway Project -- describes a project and asks questions that you should be able to answer from the data using the techniques you have learned so far. Comes with solutions.
  9. Strings -- one reason I don't have much experience with Pandas is because it is focused on numeric tables for the most part. However, Pandas also has impressive string handling facilities via the str accessor. This chapter was something of an eye-opener for me, showing me how to use Pandas for text analysis and pre-processing.
  10. Dates -- this chapter describes Pandas date and time handling capabilities. This can be useful when trying to work with time series or when trying to derive numerical features from columns containing datetime objects to combine with other numeric or text data.
  11. Visualizations -- this chapter describes visualization functionality you can invoke from within Pandas, which is powered either by Matplotlib or Seaborn. This is more convenient than exporting the data to Numpy and using the two packages to draw the charts.
  12. Performance -- performance has been a focus for most of the preceding chapters in this book. However, the recipes in this chapter are in the advanced-tricks category, and include converting strings to categorical values, optimizing reads and writes using Apache Arrow backed formats, and using fast special-purpose functions for specific situations.
  13. Final Project -- describes a project similar to the Midway project with questions that you should be able to answer from the data using the techniques you have learned so far.
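
As promised above, here is a tiny illustration (mine, not the book's) of the method chaining and named aggregation ideas from chapter 7, on a made-up taxi-ride style DataFrame.

import pandas as pd

rides = pd.DataFrame({
    "borough": ["Queens", "Queens", "Bronx", "Bronx"],
    "fare": [12.5, 30.0, 8.0, 22.0],
    "distance": [3.1, 9.0, 1.5, 6.2],
})

summary = (
    rides
    .assign(fare_per_mile=lambda df: df["fare"] / df["distance"])   # method chaining
    .groupby("borough")                                             # split
    .agg(mean_fare=("fare", "mean"),                                # apply / combine
         max_fare_per_mile=("fare_per_mile", "max"))
    .sort_values("mean_fare", ascending=False)
)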

I think the book has value beyond just teaching Pandas fundamentals though. The author sprinkles insights about Data Analysis and Data Science throughout the book, around learning to structure the problem and planning the sequence of steps that are best suited for the tools at hand, the importance of critical thinking, the importance of knowing the data and interpreting the results of the analysis, etc.

Each exercise (there are 50 in all) involves downloading some dataset, dealing with subjects as diverse as tourism, taxi rides, SAT scores, parking tickets, olympic games, oil prices, etc. I think the information about the availability of such datasets (and possibly related datasets) can also be very valuable to Data Scientists for their future projects.

I think the popularity of Pandas is due to the same reason as the popularity of Jupyter Notebooks. It is a nice, self-contained platform that allows a Data Scientist to demonstrate a series of data transformations from problem to solution in a clear, concise and standard manner, not only to customers, but to other Data Scientists as well. More than any other reason, I feel that this will continue to drive the popularity of Pandas and its various incarnations, and as a Data Scientist, it makes sense to learn how to use it properly. And the book definitely fulfils its promise of teaching you how to do that.

Saturday, May 18, 2024

Finetuning RAGAS Metrics using DSPy

Last month, I decided to sign up for the Google AI Hackathon, where Google provided access to their Gemini Large Language Model (LLM) and tasked participants with building a creative application on top of it. I have worked with Anthropic's Claude and OpenAI's GPT-3 at work previously, and I was curious to see how Gemini stacked up against them. I was joined in this effort by David Campbell and Mayank Bhaskar, my non-work colleagues from the TWIML (This Week In Machine Learning) Slack. Winners for the Google AI Hackathon were declared last Thursday, and while our project sadly did not win anything, the gallery provides examples of some very cool applications of LLMs (and Gemini in particular) for both business and personal tasks.

Our project was to automate the evaluation of RAG (Retrieval Augmented Generation) pipelines using LLMs. I have written previously about the potential of LLMs to evaluate search pipelines, but the scope of this effort is broader in that it attempts to evaluate all aspects of the RAG pipeline, not just search. We were inspired by the RAGAS project, which defines 8 metrics that cover various aspects of the RAG pipeline. Another inspiration for our project was the ARES paper, which shows that fine-tuning the LLM judges on synthetically generated outputs improves evaluation confidence.

Here is a short (3 minute) video description of our project on Youtube. This was part of our submission for the hackathon. We provide some more information about our project below.

We re-implemented the RAGAS metrics using LangChain Expression Language (LCEL) and applied them to (question, answer, context and ground truth) tuples from the AmnestyQA dataset to generate the scores for these metrics. My original reason for doing this, rather than using what RAGAS provided directly, was that I couldn't make it work properly with Claude. This was because Claude cannot read and write JSON as well as GPT-3 (it works better with XML), and RAGAS was developed using GPT-3. All the RAGAS metrics are prompt-based and transferable across LLMs with minimal change, and the code is quite well written. I wasn't sure if I would encounter similar issues with Gemini, so it seemed easier to just re-implement the metrics from the ground up for Gemini using LCEL than to figure out how to make RAGAS work with Gemini. However, as we will see shortly, it ended up being a good decision.

Next we re-implemented the metrics with DSPy. DSPy is a framework for optimizing LLM prompts. Unlike RAGAS, where we tell the LLM how to compute the metrics, with DSPy the general approach is to have very generic prompts and show the LLM what to do using few-shot examples. The distinction is reminiscent of doing prediction using Rules Engines versus using Machine Learning. Extending the analogy a bit further, DSPy provides its BootstrapFewShotWithRandomSearch optimizer that allows you to search through its "hyperparameter space" of few-shot examples, to find the best subset of examples to optimize the prompt with, with respect to some score metric you are optimizing for. In our case, we built the score metric to minimize the difference between the score reported by the LCEL version of the metric and the score reported by the DSPy version. The result of this procedure is a set of prompts to generate the 8 RAG evaluation metrics that are optimized for the given domain.
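
A very rough sketch of what that optimization loop looked like; the import path, metric signature and field names below reflect my possibly-outdated understanding of DSPy, and faithfulness_program / trainset are placeholders.

from dspy.teleprompt import BootstrapFewShotWithRandomSearch

def agreement_metric(example, pred, trace=None):
    # reward prompts whose score is close to the LCEL-computed score
    return 1.0 - abs(example.lcel_score - float(pred.score))

optimizer = BootstrapFewShotWithRandomSearch(
    metric=agreement_metric, max_bootstrapped_demos=4)
optimized_faithfulness = optimizer.compile(
    faithfulness_program, trainset=trainset)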

To validate this claim, we generated histograms of scores for each metric using the LCEL and DSPy prompts, and compared how bimodal, or how tightly clustered around 0 and 1, they were. The intuition is that the more confident the LLM is about the evaluation, the more it will tend to deliver a confident judgment clustered around 0 or 1. In practice, we do see this happening with the DSPy prompts for all but 2 of the metrics, although the differences are not very large. This may be because the AmnestyQA dataset is very small, at only 20 questions.

To address the size of AmnestyQA dataset, Dave used the LLM to generate some more (question, context, answer, ground_truth) tuples given a question and answer pair from AmnestyQA and a Wikipedia retriever endpoint. The plan was for us to use this larger dataset for optimizing the DSPy prompts. However, rather than doing this completely unsupervised, we wanted to have a way for humans to validate and score the LCEL scores from these additional questions. We would then use these validated scores as the basis for optimizing the DSPy prompts for computing the various metrics.

This would require a web based tool that allows humans to examine the output of each step of the LCEL metric scoring process. For example, the Faithfulness metric has two steps: the first is to extract facts from the answer, and the second is to provide a binary judgment of whether the context contains each fact. The score is computed by adding up the individual binary scores. The tool would allow us to view and update which facts were extracted in the first stage, and the binary output for each of the fact-context pairs. This is where implementing the RAGAS metrics on our own helped us: we refactored the code so the intermediate results were also available to the caller. Once the tool was in place, we would use it to validate our generated tuples and attempt to re-optimize the DSPy prompts. Mayank and Dave had started on this, but unfortunately we ran out of time before we could complete this step.

Another thing we noticed is that the calculation of most of the metrics involves one or more subtasks that make some kind of binary (true / false) decision about a pair of strings. This is something that a smaller model, such as a T5 or a Sentence Transformer, could do quite easily, more predictably, faster, and at lower cost. As before, we could extract the intermediate outputs from the LCEL metrics to create training data for this. We could use DSPy and its BootstrapFinetune optimizer to fine-tune these smaller models, or fine-tune Sentence Transformers or BERT models for binary classification and hook them up into the evaluation pipeline.

Anyway, that was our project. Obviously, there is quite a bit of work remaining to make it into a viable product for LLM based evaluation using the strategy we laid out. But we believe we have demonstrated that this can be viable, that given sufficient training data (about 50-100 examples for the optimized prompt, and maybe 300-500 each for the binary classifiers), it should be possible to build metrics that are tailored to one's domain and that can deliver evaluation judgments with greater confidence than those built using simple prompt engineering. In case you are interested in exploring further, you can find our code and preliminary results at sujitpal/llm-rag-eval on GitHub.

Tuesday, May 14, 2024

Performance Analysis of Float vs Byte vs Binary Vectors on OpenSearch

I've been working on an application where, given an input string, the objective is to recommend an output string that is similar to the input string, for some notion of similarity. A machine learning model, in this case a SentenceTransformers model, is taught this notion of similarity by showing it many examples of input-output pairs. The model's weights are then used to encode the part to be recommended as a vector, and written out into a search index, in this case OpenSearch. At inference time, the same model is used to encode the input string, and OpenSearch's vector search is used to find and recommend the nearest string to the input in vector space.

My dataset consisted of about 10,000 input-output pairs. I split it up into a 90% training set (approx 9,000 pairs) and a 10% test set (approx 1,000 pairs). I chose the sentence-transformers/all-MiniLM-L6-v2 model, a good pre-trained general-purpose model for vector search in its own right, which maps strings into a dense 384-dimensional space. Sentence Transformer models use Contrastive Learning, which means I need both positive and negative pairs to train it, but my training set consists of all positive pairs by definition. Rather than try to generate negative pairs on my own, I used the built-in MultipleNegativesRanking (MNR) loss, which takes positive pairs and generates negative pairs by sampling from the batches. I trained for 10 epochs, using the AdamW optimizer (the default) with learning rate 2e-5 (also the default), saving the best checkpoint (based on similarity on a validation set).
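
The fine-tuning loop looked roughly like the following sketch (train_pairs is a placeholder for my list of positive input-output pairs).

from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# train_pairs: list of (input_string, output_string) positive pairs
train_examples = [InputExample(texts=[inp, out]) for inp, out in train_pairs]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)   # in-batch negatives

model.fit(
    train_objectives=[(train_loader, train_loss)],
    epochs=10,
    warmup_steps=100,
    output_path="models/finetuned-all-MiniLM-L6-v2",
)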

To evaluate, I generated the top-100 nearest neighbors for each input of the test set pairs, and then computed Recall @k and MRR (Mean Reciprocal Rank) @k for k = 1, 3, 5, 10, 20, 50, 100. For recall, I would score a match @k as successful if the output of the pair appeared within the top k nearest neighbors returned from the vector search. This score is then averaged across all the 1,000 test set pairs for each value of k. MRR is similar, except the recall score for each test set pair and k is divided by the (1-based) position that matched (and is 0 if there was no match).
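
The per-query scoring is simple enough to show directly; this is a simplified version of what I used.

def recall_and_mrr_at_k(expected_id, ranked_ids, ks=(1, 3, 5, 10, 20, 50, 100)):
    # ranked_ids: ids of the top-100 nearest neighbors, best first
    recall, mrr = {}, {}
    for k in ks:
        top_k = list(ranked_ids[:k])
        if expected_id in top_k:
            recall[k] = 1.0
            mrr[k] = 1.0 / (top_k.index(expected_id) + 1)   # 1-based position
        else:
            recall[k] = 0.0
            mrr[k] = 0.0
    return recall, mrr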

The baseline for the experiment was computed using an index of encodings of the input part created using the stock all-MiniLM-L6-v2 model.

I had also recently read Jo Kristian Bergum's blog posts Billion Scale Vector Search with Vespa, part one and two, and more recently Matryoshka Binary Vectors: Slash Vector Search Costs with Vespa on the Vespa blog, where he compares the performance of vectors of different storage types, among other things. Vespa allows for many different storage types, but I am using OpenSearch, which offers support for float (float32), byte (int8) and binary (bool) storage types. I was curious to see (a) if I could replace my float-based vectors with these, and (b) if so, how their performance would compare. This is what this post is about.

The index mappings for the float, byte and binary vector fields are shown below. These need to be set during index creation.

Float Vector

{
    "type": "knn_vector",
    "dimension": 384,
    "method": {
        "name": "hnsw",
        "engine": "lucene",
        "space_type": "cosinesimil",
        "parameters": {
            "ef_construction": 128,
            "m": 24
        }
    }
}

Byte vector

{
    "type": "knn_vector",
    "dimension": 384,
    "data_type": "byte",
    "method": {
        "name": "hnsw",
        "engine": "lucene",
        "space_type": "cosinesimil",
        "parameters": {
            "ef_construction": 128,
            "m": 24,
        }
    }
}

Binary vector

{
    "type": "binary",
    "doc_values": "true"
}
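
To make the setup concrete, here is roughly how one of these indexes might be created with the opensearch-py client. This is a sketch with placeholder host, index and field names, not my actual setup code.

from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

float_index = {
    "settings": {"index": {"knn": True}},    # enable k-NN for this index
    "mappings": {
        "properties": {
            "part": {"type": "text"},
            "part_vector": {                 # the float vector mapping from above
                "type": "knn_vector",
                "dimension": 384,
                "method": {
                    "name": "hnsw",
                    "engine": "lucene",
                    "space_type": "cosinesimil",
                    "parameters": {"ef_construction": 128, "m": 24}
                }
            }
        }
    }
}
client.indices.create(index="parts-float", body=float_index)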

To generate the vectors, I used the fine-tuned version of the all-MiniLM-L6-v2 model as my encoder, and post-processed the float32 vectors returned from the encoder into int8 and binary representations using the following functions.

import binascii
import numpy as np

def convert_to_bytes(arr: np.ndarray):
    # scale float32 components (roughly in [-1, 1]) into the int8 range
    # expected by the byte vector field
    return np.floor(arr * 128).astype(np.int8)

def convert_to_bits(arr: np.ndarray):
    # binarize each component (1 if positive, else 0), pack every 8 bits
    # into a byte, and hex-encode each row for the binary field
    bits = np.packbits(
        np.where(arr > 0, 1, 0)).astype(np.int8)
    arr_b = bits.reshape(arr.shape[0], -1)
    hex_bits = []
    for row in arr_b:
        hex_bits.append(str(binascii.hexlify(row), "utf-8"))
    return hex_bits
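
The encoded vectors then get written into their respective indexes, roughly like the sketch below. Again, this is a simplified reconstruction: output_strings is placeholder data, the index and field names are placeholders, and the model path points at the fine-tuned SentenceTransformer saved earlier.

from opensearchpy import OpenSearch
from sentence_transformers import SentenceTransformer

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
model = SentenceTransformer("minilm-finetuned")            # placeholder path
output_strings = ["output string 1", "output string 2"]    # placeholder data

vecs = model.encode(output_strings)     # float32 numpy array, shape (N, 384)
byte_vecs = convert_to_bytes(vecs)
bin_vecs = convert_to_bits(vecs)

for i, text in enumerate(output_strings):
    client.index(index="parts-float",
                 body={"part": text, "part_vector": vecs[i].tolist()})
    client.index(index="parts-byte",
                 body={"part": text, "part_vector": byte_vecs[i].tolist()})
    client.index(index="parts-binary",
                 body={"part": text, "part_vector": bin_vecs[i]})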

At the end of this indexing process, I had three separate indexes of approximately 10,000 records, one with one part of the pair encoded as a float32 vector, another encoded as a byte (int8) vector, and the last as a binary vector. To give you a rough idea of the storage requirements (rough because there are fields other than the vectors for each index), the sizes reported by /_cat/indices are shown below.

Vector Storage Type Index Size
float 184.9 MB
byte 39.6 MB
binary 15.6 MB

On the query side, I use the Script Score queries described in the Byte-quantized vectors in OpenSearch blog post and the Exact k-NN with scoring script documentation page. The queries are all score scripts, as shown below. The queries for float and byte vectors are identical; the only difference is that the query vector is quantized down to int8 in the byte case.

Float and byte vector

{
    "script_score": {
        "query": {
            "match_all": {}
        },      
        "script": {
            "source": "knn_score",
            "lang": "knn",
            "params": {
                "field": "{field_name}",
                "query_value": "{float_or_byte_vector}",
                "space_type": "cosinesimil"
            }       
        }       
    }       
}    

Binary Vector

{
    "script_score": {
        "query": {
            "match_all": {}
        },
        "script": {
            "source": "knn_score",
            "lang": "knn",
            "params": {
                "field": "{field_name}",
                "query_value": "{binary_vector}",
                "space_type": "hammingbit"
            }
        }
    }
}
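
At query time these templates are filled in and sent via the search API. A minimal sketch for the float case (with the same placeholder index, field and host names as before) might look like this:

from opensearchpy import OpenSearch
from sentence_transformers import SentenceTransformer

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])
model = SentenceTransformer("minilm-finetuned")            # placeholder path

query_vec = model.encode(["some input string"])[0]
body = {
    "size": 100,                     # top-100 nearest neighbors
    "query": {
        "script_score": {
            "query": {"match_all": {}},
            "script": {
                "source": "knn_score",
                "lang": "knn",
                "params": {
                    "field": "part_vector",
                    "query_value": query_vec.tolist(),
                    "space_type": "cosinesimil"
                }
            }
        }
    }
}
resp = client.search(index="parts-float", body=body)
neighbors = [hit["_source"]["part"] for hit in resp["hits"]["hits"]]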

The charts below show Recall@k and MRR@k for the various values of k described above.

The first thing to note is that fine-tuning helps, or at least it helped a lot in this case, probably because the notion of similarity I was working with was more nuanced than plain cosine similarity. Comparing float vectors against byte vectors, float vectors have a slight edge, as you can see from the charts below (if you look closely, you can also see the float vector (orange line) and byte vector (green line) curves almost overlaid on each other). While binary vectors don't perform as well, they still perform better than the baseline, and they are much faster, so they can be useful as the first stage of a two-stage retrieval pipeline.

[Charts: Mean Reciprocal Rank @k and Recall @k]

In terms of response time, binary vectors are the fastest, followed by byte vectors and then float vectors. I measured the response time for each of the approximately 1,000 test set queries to extract the top-100 nearest neighbors from OpenSearch, then calculated the mean and standard deviation for each vector storage type. Here are the results.

Vector Storage Type Mean Response Time Standard Deviation
float 0.382 s 0.097 s
byte 0.231 s 0.050 s
binary 0.176 s 0.048 s
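
A sketch of how such timings can be collected is shown below; test_inputs is a placeholder for the test set inputs, build_query is a hypothetical helper that builds the score-script query shown earlier, and client is the opensearch-py client from the previous sketches.

import time
import numpy as np

timings = []
for query_string in test_inputs:          # ~1,000 test set inputs
    start = time.perf_counter()
    client.search(index="parts-float", body=build_query(query_string))
    timings.append(time.perf_counter() - start)

print(f"mean: {np.mean(timings):.3f} s, stddev: {np.std(timings):.3f} s")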

I originally brought this up in the Relevance Slack channel because I had made some mistakes in my evaluation and was getting results that did not agree with common sense. Thanks to all my "non-work colleagues" on the channel who helped me out by validating my observations and acting as sounding boards, which eventually helped me find the mistakes in my analysis. In any case, I figured that the final results might be useful to the broader community of vector search enthusiasts and professionals, so I am sharing them here. Hope you find it useful.

Tuesday, May 07, 2024

KGC/HCLS 2024 Trip Report

I was at KGC (Knowledge Graph Conference) 2024, which is being held May 6-10 at Cornell Tech. I was presenting (virtually) at their Health Care and Life Sciences (HCLS) workshop, so my speaker's pass was only valid for today's HCLS portion of KGC. My trip report covers a few of the talks I attended. Attending virtually was a bit chaotic, as sessions sometimes ran over, so you might leave one session to attend another, only to find that it hadn't started yet. This is hard to foresee; we faced the same issue ourselves the first time we moved an internal conference from in-person to hybrid.

KGs in RAG (Tom Smoker, WhatWhyHow.AI)

I have been working with Large Language Models (LLMs) and Retrieval Augmented Generation (RAG) for almost a year now, and I went to this talk hoping for insights on how to use graphs as input to RAG systems. Understandably, the speaker spent some time covering the basics, which I personally did not find very fruitful. However, there were some nuggets of wisdom I got out of the talk. First, RAG pipelines can lower the risk of hallucination by using LLMs for planning and reasoning, without delegating to them for factual information. And second, an agent architecture can more efficiently use smaller sub-graphs, which can often be generated dynamically in Closed World models.

A side discussion in chat also yielded a paper reference: Getting from Generative AI to Trustworthy AI: What LLMs might learn from Cyc (Lenat and Marcus, 2023). The paper looks really interesting on an initial skim and I plan to read it in more detail later.

Knowledge Graphs for Precision Oncology (Krishna Bulusu, AstraZeneca)

A nice overview of applications of Knowledge Graphs (KGs) to Drug Discovery (DD). DD attempts to apply KGs to solve three main problems: (1) finding the gene causing a disease, (2) matching drugs with diseases, and (3) modeling (drug, gene, disease) as a fundamental relationship in DD. The speaker pointed out that the big advantage of KGs is explainability. He also mentioned the use of graph clustering for node stratification.

Combining graph and vector representation for efficient information retrieval (Peio Popov, Ontotext)

This was a presentation from Ontotext, where they demonstrated new features built into their GraphDB database. It was of interest to me personally since our KG is also built on GraphDB. Specifically, they have integrated LLM and vector search support into their product, so these can be invoked from a SPARQL query. This gives GraphDB users the power to combine these techniques in a single call rather than building multi-stage pipelines.

I also learned the distinction between Semantic, Full-text and Vector Search, as being based on KGs, Lucene (or Lucene-like) indexes, and vector search platforms respectively; I had previously conflated the first and the third.

Knowledge Engineering in Clinical Decision Support: When a Graph Representational Model is not enough (Maulik Kamdar, Optum)

This was a presentation from my ex-colleague Maulik Kamdar. He talked about challenges in Clinical Decision Support (CDS) where a KG alone is insufficient, specifically the case where multiple third party ontologies need to be aligned into one KG. In this situation, similar concepts are combined into ValueSets, which are then composed with bare concepts or with each other to form Clinical Rules. Clinical Rules are further combined to form Clinical Calculators or Questionnaires, which are in turn combined into Decision Trees and Flowcharts, and finally into Clinical Guidelines. I am probably biased given our common history, but I found this talk the most educational one for me.

Knowledge Graphs, Theorem Provers and Language Models (Vijay Saraswat and Nikolaos Vasiloglou)

The speakers discussed the role of self-discovery, In-Context Learning (ICL), symbiotic integration of KGs with search, and Graph RAG in reasoning engines powered by KGs and LLMs. They characterize an Agent as an LLM based black box that is provided with pairs of input-output instances to learn some unknown function (similar to ML models). They describe ICL as learning through few-shot and many-shot examples. They also talk about using the output of a KG to fact-check / enhance LLMs, and using LLMs to generate assertions that can be used to create a KG. Their demo shows how an LLM is able to learn to generate a Datalog-like graph query language from text prompts using few-shot examples.

The speakers referenced three papers in support of the techniques they described, which I have duly added to my reading list.

A Scalable and Robust Named Entity Recognition and Linking System for a Clinical Healthcare Knowledge Graph (Sujit Pal, Elsevier Health)

This was my talk. I had originally intended to attend in person, but it seemed wasteful to fly across the country to deliver a 5-minute presentation. It did take a bit of planning to present remotely, but I learned two useful life lessons.

  1. You can generate a presentation video from MS PowerPoint. Simply create your slides, record a slideshow narrating your presentation, then export it as an MP4 and upload it to YouTube or another video service.
  2. You can print posters online and have them delivered to someone else.

Huge thanks to my colleague Tom Woodcock who attended in person, and who was kind enough to carry and hang my poster at the conference for me, and who also agreed to present my slideshow for me (although I think that in the end he did not have to). Many thanks also to my ex-colleague Helena Deus (part of the HCLS organizing team), who helped walk me through to a workable solution and was instrumental in my talk being delivered successfully. Also thanks to Leah Walton from the HCLS organizing team, for supporting me in my attempt to present remotely.

Here is the Youtube video for my 5-minute presentation in case you are interested. It’s a bit high-level since I had only 5 minutes to cover everything, but there is a little more information in the poster below.

Graphs for good – Hypothesis generation for Rare Disease Treatment (Brian Martin, AbbVie)

This presentation revolved around a graph that connects diseases to drugs via disease variant, gene, pathway and compound entities. It was used to find a cure for a rare disease using existing medications, and was later extended to find candidate cures for a group of the 20 most neglected diseases worldwide. The speakers verified that the results for Dengue fever correlate well with previously known information, thus supporting the veracity of the approach. The paper describing this work is Leveraging a Billion-Edge Knowledge Graph for Drug Re-purposing and Target Prioritization using Genomically-Informed Subgraphs (Martin et al, 2022).

Generating and Querying Graphs with LLM (Brian Martin, Subha Madhavan, Berenice Wulbrecht)

A panel discussion where various strategies for generating and querying graphs using LLMs were discussed, with entertaining (and somewhat predictable) comparisons of Property Graphs and RDF graphs to Ford and Ferrari automobiles, and of LLMs transforming them into Teslas (with their self-driving technology). The panelists also talked about extracting assertions from a corpus of documents to create a KG customized for the corpus, and then using that KG to fact-check the output of the LLM for RAG queries against the same corpus.

Overall, I think it was a great conference. Learned a lot, would love to go back and present here in the future, hopefully this time in person.

Saturday, March 23, 2024

Book Report: Machine Learning for Drug Discovery

Drug Discovery is a field where biochemists (and more recently computer scientists) turn ideas into potential medications. I first came across a few applications in this area when checking out how to build Graph Neural Networks (GNN) as part of auditing the CS224W: Machine Learning with Graphs course from Stanford, some learnings from which I recycled into my Deep Learning with Graphs tutorial at ODSC 2021. Of course, drug discovery is much more than just GNNs; I mention this only because it happened to be my entry point into this fascinating world. However, I will hasten to add that despite having made an entrance, I am still parked pretty solidly close to the entrance (or exit, depending on your point of view).

But I am always looking to learn more about stuff I find interesting, so when I was offered a chance to review Dr Noah Flynn's Machine Learning for Drug Discovery, published by Manning, I jumped on it. The book is currently in MEAP (Manning Early Access Program), so only 5 chapters are available at the moment, but the completed book will have 15 chapters in all. The intended audience, as the title suggests, is computational biochemists, i.e. those who attempt to solve Drug Discovery problems using Machine Learning. There are two main paths to becoming a computational biochemist -- either you are a biochemist who learns the ML, or you are an ML person who learns the biochemistry. The book is aimed at both categories of readers.

As someone in the latter category, I had to spend much more time on the biochemistry aspects. I suspect that most readers of this review would also fall into this category. For them, I would say that while the ML is sophisticated enough to solve the problems at hand, the methods and practices should already be familiar to most ML people. The most useful things I think you would get out of this book are the following:

  • Framing the Drug Discovery problem as a ML problem
  • Preprocessing and Encoding inputs
  • Getting data to train your ML model

For the first one, you either need to have a biochemistry background yourself, or you need to pair with someone who does. I suppose you could get by with a life sciences or chemistry background as well, or acquire enough biochemistry knowledge over time in this field, and this book may even take you part of the way there, but be aware that the learning curve is steep.

For the second and the third items, I thought the book was super useful. Most chapters are built as case studies around a Drug Discovery problem, so as you go through the chapters, you will learn about the sites to acquire your datasets from, and the techniques to preprocess the data from these sites into a form suitable for consumption by your ML model. At least the first 5 chapters deal with fairly simple ML models, but which may or may not be familiar to you depending on your industry, so you might also learn a few things about evaluating or tuning these models that you didn't know before (I did).

The first chapter introduces the reader to the domain and talks about the need for computational approaches to Drug Discovery. It introduces the terminology and the RDKit software library, an open-source cheminformatics toolkit that provides implementations of many common operations needed for computational Drug Discovery (sort of like a specialized supplement to Scikit-Learn for general ML). It also covers high-level rules of thumb for identifying drug-like compounds, such as Lipinski's rule of 5. It then covers some use cases common in Drug Discovery, ranging from Virtual Screening to Generative and Synthetic Chemistry, as well as some popular (and public) repositories of chemistry data, such as ChEMBL, PubChem and the Protein Data Bank (PDB).

The second chapter demonstrates Ligand based Screening, where you already have a reference molecule with some of the desired properties, and you want to search the chemical space for molecules similar to it, with the objective of finding more drugs like the one you started with. The case study here is to identify potential anti-malarial compounds. The dataset for this comes packaged with RDKit itself as Structure Definition Files (SDF), which describe each molecule using a SMILES (Simplified Molecular Input Line Entry System) string. The chapter walks us through converting the SMILES to MOL format, then using RDKit to extract specialized chemical features from the MOL and SMILES, and preprocessing to filter out uninteresting molecules based on rule-based thresholds such as bio-availability and molecular weight, structure-based thresholds such as toxicity, and specific substructural patterns (similar to subgraph motifs). It then uses RDKit to generate Morgan fingerprints from the remaining molecules (MOL). Morgan (and other) fingerprints are similar to embeddings in NLP, except that they encode structural information through a more deterministic process, and are hence more explainable than embeddings. Finally, these fingerprints are compared with the reference molecule using Tanimoto similarity, and the nearest neighbors are found.
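
To give a flavor of what the fingerprint and similarity step looks like in RDKit, here is a generic sketch (my own, not code from the book; the SMILES strings are arbitrary placeholders):

from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

# reference and candidate molecules (placeholder SMILES strings)
ref = Chem.MolFromSmiles("CCO")
cand = Chem.MolFromSmiles("CCN")

# Morgan fingerprints: deterministic, structure-derived bit vectors
ref_fp = AllChem.GetMorganFingerprintAsBitVect(ref, 2, nBits=2048)
cand_fp = AllChem.GetMorganFingerprintAsBitVect(cand, 2, nBits=2048)

# Tanimoto similarity between the two fingerprints, in [0, 1]
print(DataStructs.TanimotoSimilarity(ref_fp, cand_fp))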

Chapter 3 continues with the problem of Ligand based screening, but tries to predict the cardiotoxicity of the anti-malarial compounds found in the previous chapter using a linear model. This is done indirectly: if the compound blocks the hERG potassium channel, it is predicted to be cardiotoxic, and vice versa. A linear model (Scikit-Learn's SGDClassifier) is trained using the hERG dataset from the Therapeutic Data Commons (TDC). The chapter shows some Exploratory Data Analysis (EDA) on the data, using the standard preprocessing described in the previous chapter. An additional step here is to standardize the data for classification. The author provides the biochemistry reasoning behind this step, but uses the implementation already provided by RDKit. Finally, Morgan fingerprints are used to train the SGDClassifier. Because the elements of Morgan fingerprints have meaning, the weights of the resulting SGD model can be used to determine feature importances. There is also some discussion here of cross validation, L1/L2 regularization, removing collinearity, adding interaction terms and hyperparameter sweeps.
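
A stripped-down sketch of that modeling step is shown below. This is generic code, not the book's; train_smiles and labels are placeholder data standing in for the hERG dataset.

import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.linear_model import SGDClassifier

def featurize(smiles):
    # convert a SMILES string into a 2048-bit Morgan fingerprint array
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    return np.array([int(b) for b in fp.ToBitString()], dtype=np.int8)

# placeholder training data: SMILES strings and hERG blocker labels (0 / 1)
train_smiles = ["CCO", "CCN"]
labels = [0, 1]

X = np.array([featurize(s) for s in train_smiles])
clf = SGDClassifier(random_state=42).fit(X, labels)

# because each fingerprint bit corresponds to a substructure, the largest
# absolute weights point at the most influential features
top_bits = np.argsort(np.abs(clf.coef_[0]))[::-1][:10]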

Chapter 4 explores building a linear regression model to predict solubility, i.e. how much of the drug would be absorbed by the system. The dataset used to train the regressor is AqSolDB, also from TDC. This chapter introduces the idea of scaffold splitting, a technique common with biochemical datasets that preserves structural / chemical similarity within each split (see the sketch below). It also briefly describes outlier removal at the extremes, which requires chemistry knowledge. The RDKit library is used to extract features from the dataset, and the model is trained to minimize the Mean Squared Error loss. The RANSAC (RANdom SAmple Consensus) technique, which makes models more robust to outliers, is also introduced. On the ML side, there is some discussion of the bias-variance tradeoff and Learning / Validation curves.
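
Scaffold splitting is worth a quick illustration since it is less familiar to ML folks from other domains. Here is a rough sketch (my own, not the book's) using RDKit's Bemis-Murcko scaffolds:

from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, test_frac=0.1):
    # group molecules by their Bemis-Murcko scaffold so that structurally
    # similar molecules end up on the same side of the split
    groups = defaultdict(list)
    for i, smi in enumerate(smiles_list):
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=smi)
        groups[scaffold].append(i)
    # assign whole scaffold groups (largest first) to train until the
    # remaining groups are small enough to form the test set
    train_idx, test_idx = [], []
    n_train = int(len(smiles_list) * (1 - test_frac))
    for _, idxs in sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True):
        (train_idx if len(train_idx) < n_train else test_idx).extend(idxs)
    return train_idx, test_idx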

The fifth and last chapter of the MEAP (at the time of writing this review) deals with predicting how well the body will metabolize the drug. Typically, drugs are broken down by enzymes in the liver, a large proportion of which are collectively known as the Cytochrome P450 superfamily. As before, metabolism is predicted indirectly, by whether the drug inhibits Cytochrome P450 -- if it does, then it will not get metabolized easily, and vice versa. The dataset used to train the model is the CYP3A4 dataset, also from TDC. Data is prepared using the same (by now standard) pipeline, and a classifier is trained to make binary predictions of whether the input inhibits Cytochrome P450 or not. The chapter discusses the utility of Reliability Plots in performance evaluation and Platt scaling for calibrating probabilities. It also talks about how to deal with imbalanced datasets using Data Augmentation, Class Weights and other approaches. Various models are trained and evaluated, and their important features identified and visualized with RDKit Similarity Maps. The chapter ends with a short discussion on Multi-label classification.

The pandemic and the rapid discovery of the COVID vaccine gave a lot of us (at least those of us that were watching) a ringside view into the fascinating world of drug discovery. This book provides yet another peek into this world, with its carefully crafted case studies and examples. Overall, I think you will learn a lot about drug discovery if you go through this book, both on the biochemistry side and the ML side. There are exercises at the end of each chapter, doing these would help you get more familiar with RDKit and hopefully more effective at computational drug discovery.