Sunday, June 30, 2024

Table Extraction from PDFs using Multimodal (Vision) LLMs

A couple of weeks ago, a colleague and I participated in an internal hackathon where the task was to come up with an interesting use case using the recent multi-modal Large Language Models (LLMs). Multi-modal LLMs take not only text inputs via their prompt like earlier LLMs, but can also accept non-text modalities such as images and audio. Some examples of multi-modal LLMs are GPT-4o from OpenAI, Gemini 1.5 from Google, and Claude-3.5 Sonnet from Anthropic. The hackathon provided access to GPT-4o through Azure, Microsoft's Cloud Computing Platform. We did not win; there were other entries that were better than ours, both in the originality of their ideas and in the quality of their implementations. However, we learned some cool new things during the hackathon, and figured that these might be of general interest to others as well, hence this post.

Our idea was to use GPT-4o to extract and codify tables found in academic papers as semi-structured data (i.e. JSON). We could then either query the JSON data for searching within tables, or convert it to Markdown for downstream LLMs to query them easily via their text interface. We had originally intended to extend the idea to figures and charts, but we could not get that pipeline working end to end.

Here is what our pipeline looked like.

  1. Academic papers are usually available as PDFs. We use the PyMuPDF library to split the PDF file into a set of image files, where each image file corresponds to a page in the paper.
  2. We then send each page image through the Table Transformer, which returns bounding box information for each table it detects in the page, as well as a confidence score. The Table Transformer model we used was microsoft/table-transformer-detection.
  3. We crop out each table from the pages using the bounding box information, and then send each table to GPT-4o as part of a prompt asking to convert it to a JSON structure. GPT-4o responds with a JSON structure representing the table.
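
Here is a minimal sketch of steps 1-3 above (up to the cropping), assuming the HuggingFace transformers implementation of the Table Transformer; the DPI, confidence threshold and file names are illustrative choices, not necessarily the exact values we used.

import fitz  # PyMuPDF
import torch
from PIL import Image
from transformers import AutoImageProcessor, TableTransformerForObjectDetection

processor = AutoImageProcessor.from_pretrained("microsoft/table-transformer-detection")
model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-detection")

doc = fitz.open("paper.pdf")
for page in doc:
    pix = page.get_pixmap(dpi=150)                   # render the page to an image
    page_path = f"page_{page.number}.png"
    pix.save(page_path)

    image = Image.open(page_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
    detections = processor.post_process_object_detection(
        outputs, threshold=0.7, target_sizes=target_sizes)[0]

    for i, box in enumerate(detections["boxes"]):
        x0, y0, x1, y1 = box.tolist()
        table_image = image.crop((x0, y0, x1, y1))   # cropped table to send to GPT-4o
        table_image.save(f"page_{page.number}_table_{i}.png")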

This pipeline was based on my colleague's idea. I like how it progressively simplifies the task by splitting each page of the incoming PDF into its own image, then uses a pre-trained Table Transformer to crop out the tables from them, and only then passes each table to GPT-4o to convert to JSON. The table image is passed into the prompt as a "data URL", which is just the base-64 encoding of the image formatted as "data:{mime_type};base64,{base64_encoded_data}". The Table Transformer, while not perfect, proved remarkably successful at identifying tables in the text. I say remarkable because we used a pre-trained model, but perhaps it is not that remarkable once you consider that it was probably trained on tables from academic papers as well.

Our prompt for GPT-4o looked something like this:

System: You are an AI model that specializes in detecting tables and extracting and interpreting table content from images. Follow the instructions below step by step:
1. Recognize whether the given image is a table or not. If it is not a table, print "None". If it is a table, go to the next step.
2. Accurately convert the table's content into a structured JSON format.

General instructions:
1. Do not output anything extra.
2. A table must contain rows and columns.

User: Given the image, detect whether it is a table or not; if it is a table, convert it to JSON format
{image_data_url}
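
A sketch of how the cropped table image and the prompt above can be sent to GPT-4o, assuming the openai Python client against an Azure OpenAI deployment; the endpoint, key, API version and deployment name are placeholders, and SYSTEM_PROMPT is assumed to hold the system prompt shown above.

import base64
import mimetypes
from openai import AzureOpenAI

def to_data_url(image_path: str) -> str:
    mime_type = mimetypes.guess_type(image_path)[0] or "image/png"
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime_type};base64,{b64}"

client = AzureOpenAI(azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",
                     api_key="YOUR-API-KEY", api_version="2024-02-15-preview")

response = client.chat.completions.create(
    model="gpt-4o",  # name of your Azure deployment
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": [
            {"type": "text", "text": "Given the image, detect whether it is a table or not; "
                                     "if it is a table, convert it to JSON format"},
            {"type": "image_url", "image_url": {"url": to_data_url("page_0_table_0.png")}},
        ]},
    ],
)
print(response.choices[0].message.content)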

For the figure pipeline, I tried to use an OWL-ViT (Vision Transformer for Open World Localization) model in place of the Table Transformer. But it was not as successful at detecting figures in the text, probably because it seems to be trained to detect objects in natural images. Unfortunately, we couldn't find a pre-trained model that would work for this particular case. Another issue was converting the figure into a semi-structured JSON representation; we ended up asking GPT-4o to describe the image as text instead.

One suggestion from some of my TWIML non-work colleagues was to ask GPT-4o to return the bounding boxes for the figures it finds in the page, and then use those to extract the figures to send back to GPT-4o for describing. It didn't work unfortunately, but it was definitely worth trying. As LLMs get more and more capable, I think it makes sense to rethink our pipelines to delegate more and more work to the LLM, or at least to verify that it can't do something before moving on to older (and harder to implement) solutions.

Sunday, June 23, 2024

Book Report: Pandas Workout

Unlike many Data Scientists, I didn't automatically reach for Pandas when I needed to analyze data. I came upon this discipline (Data Science) as a Java Software Engineer who used Python for scripting, so I was quite comfortable operating on JSON / CSV / text files directly, loading data into relational databases and running SQL against them, and building visualizations with Matplotlib. So when Pandas first hit the scene, I thought it was a nice library, but I just didn't see the logic in spending time to learn another interface for things I could already do. Of course, Pandas has matured since then (and so have I, hopefully), and when faced with a data analysis / preparation / cleanup task, I now often reach not only for Pandas but, depending on the task, also for its various incarnations such as PySpark, Dask DataFrames and RAPIDS cuDF. When I use Pandas (and its various incarnations) I often find myself depending heavily on Stack Overflow (and lately GitHub Copilot) for things I know can be done but not how. To some extent I blame this on never having spent the time to understand Pandas in depth. So when I was offered the chance to review Pandas Workout by Reuven Lerner, I welcomed it as a way to remedy this gap in my knowledge.

The book is about Pandas fundamentals rather than solving specific problems with Pandas. For that you will still want to look up Stack Overflow :-). In fact, in the foreword the author specifically targets my demographic (people who need to look up Stack Overflow when solving problems with Pandas). But he promises that after reading the book, you will understand why some solutions are better than others.

Pandas started as an open source project by Wes McKinney, and has grown somewhat organically into the top Data Science toolkit that it is today. As a result, there are often multiple ways to do something in Pandas. While all these ways may produce identical results, their performance characteristics can differ, so there is usually an implicit "right" way. The book gives you the mental model to decide which among the different approaches is the "right" one.

The book is organized into the following chapters. Each chapter covers a particular aspect of Pandas usage. I have included a super-short TLDR style abstract for each chapter for your convenience.

  1. Series -- Pandas Series objects are the basic building block of Pandas and represent a typed sequence of data; they are used to construct DataFrames and Indexes. Many methods on the Series object apply in a similar way to DataFrames as well. This is a foundational chapter; understanding it will help with future chapters.
  2. Data Frames -- DataFrames represent tabular data as a sequence of Series, where each Series object represents a column in the table. Pandas inherits the idea of DataFrames from R, and the incarnations I listed (and a few that I didn't) use the DataFrame as a basic abstraction as well. This chapter teaches you how to select from and manipulate DataFrames. Unless you've used Pandas extensively before, there is a high chance you will learn some useful new tricks here (I did, several of them).
  3. Import and Export -- covers reading and writing CSV and JSON formats to and from DataFrames. Covers some simple sanity checks you can run to verify that the import or export worked correctly. I learned about the pd.read_html method here, probably not that useful, but interesting to know!
  4. Indexes -- Indexes are used by Pandas to efficiently find data in DataFrames. While it may be possible to get by without Indexes, your Pandas code would take longer to run and consume more resources. The chapter deals with indexing techniques. I happened to know a lot of them, but there were a few that I didn't, especially the techniques around pivot tables.
  5. Cleaning -- this chapter teaches a skill that is very fundamental to (and maybe even the bane of) a Data Scientist's job. Statistics indicate that we spend 80% of our time cleaning data. Along with the techniques themselves (remove / interpolate / ignore), this chapter contains commentary that will help you frame these decisions on your own data cleaning tasks.
  6. Grouping, Joining and Sorting -- these three operations are so central to data analysis that SQL has special keywords for each of them (JOIN, GROUP BY and ORDER BY). This chapter covers various recipes to do these operations efficiently and correctly in Pandas.
  7. Advanced Grouping, Joining and Sorting -- this chapter goes into greater detail on how to combine these operations to deal with specific use-cases, the so-called "split-apply-combine" technique, including the concept of a general aggregation function agg. It also shows how to do method chaining using assign.
  8. Midway Project -- describes a project and asks questions that you should be able to answer from the data using the techniques you have learned so far. Comes with solutions.
  9. Strings -- one reason I don't have much experience with Pandas is because it is focused on numeric tables for the most part. However, Pandas also has impressive string handling facilities via the str accessor. This chapter was something of an eye-opener for me, showing me how to use Pandas for text analysis and pre-processing.
  10. Dates -- this chapter describes Pandas date and time handling capabilities. This can be useful when trying to work with time series or when trying to derive numerical features from columns containing datetime objects to combine with other numeric or text data.
  11. Visualizations -- this chapter describes visualization functionality you can invoke from within Pandas, that are powered either by Matplotlib or Seaborn. This is more convenient than exporting the data to Numpy and using the two packages to draw the charts.
  12. Performance -- performance has been a focus for most of the preceding chapters in this book. However, the recipes in this chapter are in the advanced-tricks category, and include converting strings to categorical values, optimizing reads and writes using Apache Arrow backed formats, and using fast special-purpose functions where they apply (a small illustrative snippet follows this list).
  13. Final Project -- describes a project similar to the Midway project with questions that you should be able to answer from the data using the techniques you have learned so far.
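
As an illustration of the kind of performance trick the Performance chapter discusses, here is a small example of converting a low-cardinality string column to the categorical dtype; the data is synthetic and not from the book.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "city": np.random.choice(["NYC", "SF", "Boston"], size=100_000),
    "fare": np.random.uniform(5, 60, size=100_000),
})
print(df["city"].memory_usage(deep=True))   # object dtype: one Python string per row
df["city"] = df["city"].astype("category")  # small integer codes plus 3 labels
print(df["city"].memory_usage(deep=True))   # typically an order of magnitude smaller
print(df.groupby("city", observed=True)["fare"].agg(["mean", "count"]))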

I think the book has value beyond just teaching Pandas fundamentals though. The author sprinkles insights about Data Analysis and Data Science throughout the book, around learning to structure the problem and planning the sequence of steps that are best suited for the tools at hand, the importance of critical thinking, the importance of knowing the data and interpreting the results of the analysis, etc.

Each exercise (there are 50 in all) involves downloading some dataset, dealing with subjects as diverse as tourism, taxi rides, SAT scores, parking tickets, olympic games, oil prices, etc. I think the information about the availability of such datasets (and possibly related datasets) can also be very valuable to Data Scientists for their future projects.

I think the popularity of Pandas is due to the same reason as the popularity of Jupyter Notebooks. It is a nice, self-contained platform that allows a Data Scientist to demonstrate a series of data transformations from problem to solution in a clear, concise and standard manner, not only to customers, but to other Data Scientists as well. More than any other reason, I feel that this will continue to drive the popularity of Pandas and its various incarnations, and as a Data Scientist, it makes sense to learn how to use it properly. And the book definitely fulfils its promise of teaching you how to do that.

Saturday, May 18, 2024

Finetuning RAGAS Metrics using DSPy

Last month, I decided to sign up for the Google AI Hackathon, where Google provided access to their Gemini Large Language Model (LLM) and tasked participants with building a creative application on top of it. I have worked with Anthropic's Claude and OpenAI's GPT-3 at work previously, and I was curious to see how Gemini stacked up against them. I was joined in that effort by David Campbell and Mayank Bhaskar, my non-work colleagues from the TWIML (This Week In Machine Learning) Slack. Winners for the Google AI Hackathon were declared last Thursday, and while our project sadly did not win anything, the gallery provides examples of some very cool applications of LLMs (and Gemini in particular) for both business and personal tasks.

Our project was to automate the evaluation of RAG (Retrieval Augmented Generation) pipelines using LLMs. I have written previously about the potential of LLMs to evaluate search pipelines, but the scope of this effort is broader in that it attempts to evaluate all aspects of the RAG pipeline, not just search. We were inspired by the RAGAS project, which defines 8 metrics that cover various aspects of the RAG pipeline. Another inspiration for our project was the ARES paper, which shows that fine-tuning the LLM judges on synthetically generated outputs improves evaluation confidence.

Here is a short (3 minutes) video description of our project on Youtube. This was part of our submission for the hackathon. We provide some more information about our project in our blog post below.

We re-implemented the RAGAS metrics using LangChain Expression Language (LCEL) and applied them to (question, answer, context and ground truth) tuples from the AmnestyQA dataset to generate the scores for these metrics. My original reason for doing this, rather than using what RAGAS provided directly, was that I couldn't make them work properly with Claude. This was because Claude cannot read and write JSON as well as GPT-3 (it works better with XML), and RAGAS was developed using GPT-3. All the RAGAS metrics are prompt-based and transferrable across LLMs with minimal change, and the code is quite well written. I wasn't sure if I would encounter similar issues with Gemini, so it seemed easier to just re-implement the metrics from the ground up for Gemini using LCEL than to try to figure out how to make RAGAS work with Gemini. However, as we will see shortly, it ended up being a good decision.

Next we re-implemented the metrics with DSPy. DSPy is a framework for optimizing LLM prompts. Unlike RAGAS, where we tell the LLM how to compute the metrics, with DSPy the general approach is to have very generic prompts and show the LLM what to do using few shot examples. The distinction is reminiscent of doing prediction using Rules Engines versus using Machine Learning. Extending the analogy a bit further, DSPy provides its BootstrapFewShotWithRandomSearch optimizer that allows you to search through its "hyperparameter space" of few shot examples, to find the best subset of examples to optimize the prompt with, with respect to some score metric you are optimizing for. In our case, we built the score metric to minimize the difference between the score reported by the LCEL version of the metric and the score reported by the DSPy version. The result of this procedure is a set of prompts to generate the 8 RAG evaluation metrics that are optimized for the given domain.
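
A heavily simplified sketch of what this looks like for a single metric, using DSPy's Signature / Module / Predict APIs; the Faithfulness signature, its fields, and the trainset of dspy.Example objects (which carry the LCEL-computed score as the reference) are illustrative stand-ins for our actual code.

import dspy
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

class Faithfulness(dspy.Signature):
    """Rate how faithful the answer is to the given context, as a number between 0.0 and 1.0."""
    question = dspy.InputField()
    context = dspy.InputField()
    answer = dspy.InputField()
    score = dspy.OutputField(desc="a float between 0.0 and 1.0")

class FaithfulnessJudge(dspy.Module):
    def __init__(self):
        super().__init__()
        self.rate = dspy.Predict(Faithfulness)

    def forward(self, question, context, answer):
        return self.rate(question=question, context=context, answer=answer)

def agreement_metric(example, pred, trace=None):
    # reward DSPy outputs that agree with the LCEL-computed reference score
    try:
        return 1.0 - abs(float(example.score) - float(pred.score))
    except ValueError:
        return 0.0

# trainset: a list of dspy.Example objects carrying question / context / answer
# plus the score produced by the LCEL implementation as the reference label
optimizer = BootstrapFewShotWithRandomSearch(metric=agreement_metric, max_bootstrapped_demos=4)
optimized_judge = optimizer.compile(FaithfulnessJudge(), trainset=trainset)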

To validate this claim, we generated histograms of scores for each metric using the LCEL and DSPy prompts, and compared how bimodal, or how tightly clustered around 0 and 1, they were. The intuition is that the more confident the LLM is about the evaluation, the more it will tend to deliver a confident judgment clustered around 0 or 1. In practice, we do see this happening with the DSPy prompts for all but 2 of the metrics, although the differences are not very large. This may be because the AmnestyQA dataset is very small, only 20 questions.

To address the small size of the AmnestyQA dataset, Dave used the LLM to generate some more (question, context, answer, ground_truth) tuples given a question and answer pair from AmnestyQA and a Wikipedia retriever endpoint. The plan was for us to use this larger dataset for optimizing the DSPy prompts. However, rather than doing this completely unsupervised, we wanted to have a way for humans to validate and score the LCEL scores from these additional questions. We would then use these validated scores as the basis for optimizing the DSPy prompts for computing the various metrics.

This would require a web based tool that would allow humans to examine the output of each step of the LCEL metric scoring process. For example, the Faithfulness metric has two steps: the first is to extract facts from the answer, and the second is to provide a binary judgment of whether the context contains each fact. The score is computed by adding up the individual binary scores. The tool would allow us to view and update what facts were extracted in the first stage, and the binary output for each of the fact-context pairs. This is where implementing the RAGAS metrics on our own helped us; we refactored the code so the intermediate results were also available to the caller. Once the tool was in place, we would use it to validate our generated tuples and attempt to re-optimise the DSPy prompts. Mayank and Dave had started on this, but unfortunately we ran out of time before we could complete this step.

Another thing we noticed is that the calculation of most of the metrics involves one or more subtasks that make some kind of binary (true / false) decision about a pair of strings. This is something that a smaller model, such as a T5 or a Sentence Transformer, could do quite easily, more predictably, faster, and at lower cost. As before, we could extract the intermediate outputs from the LCEL metrics to create training data for this. We could use DSPy and its BootstrapFinetune optimizer to fine-tune these smaller models, or fine-tune Sentence Transformers or BERT models for binary classification and hook them up into the evaluation pipeline.

Anyway, that was our project. Obviously, there is quite a bit of work remaining to make it into a viable product for LLM based evaluation using the strategy we laid out. But we believe we have demonstrated that this can be viable, that given sufficient training data (about 50-100 examples for the optimized prompt, and maybe 300-500 each for the binary classifiers), it should be possible to build metrics that are tailored to one's domain and that can deliver evaluation judgments with greater confidence than those built using simple prompt engineering. In case you are interested in exploring further, you can find our code and preliminary results at sujitpal/llm-rag-eval on GitHub.

Tuesday, May 14, 2024

Performance Analysis of Float vs Byte vs Binary Vectors on OpenSearch

I've been working on an application where, given an input string, the objective is to recommend an output string that is similar to the input string, for some notion of similarity. A machine learning model, in this case a SentenceTransformers model, is taught this notion of similarity by showing it many examples of input-output pairs. The model's weights are then used to encode the part to be recommended as a vector, and written out into a search index, in this case OpenSearch. At inference time, the same model is used to encode the input string, and OpenSearch's vector search is used to find and recommend the nearest string to the input in vector space.

My dataset consisted of about 10,000 input-output pairs. I split it up into a 90% training set (approx 9,000 pairs) and a 10% test set (approx 1,000 pairs). I chose the sentence-transformers/all-MiniLM-L6-v2 model, a good pre-trained general-purpose model for vector search in its own right, which maps text into a dense 384-dimensional space. Sentence Transformer models use Contrastive Learning, which means I need both positive and negative pairs to train them, but my training set consists of all positive pairs by definition. Rather than try to generate negative pairs on my own, I used the built-in MultipleNegativesRankingLoss (MNR), which takes positive pairs and generates negative pairs by sampling from the batches. I trained for 10 epochs, using the AdamW optimizer (the default) with learning rate 2e-5 (also the default), saving the best checkpoint (based on similarity on a validation set).
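
A sketch of the fine-tuning loop, assuming the sentence-transformers fit API that was current at the time; train_pairs, the batch size, warmup steps and output path are placeholders, and the validation evaluator used to pick the best checkpoint is omitted.

from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# train_pairs is a list of (input_string, output_string) positive pairs
train_examples = [InputExample(texts=[inp, out]) for inp, out in train_pairs]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)
train_loss = losses.MultipleNegativesRankingLoss(model)   # in-batch negatives

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=10,
    warmup_steps=100,
    output_path="all-MiniLM-L6-v2-finetuned",
)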

To evaluate, I generated the top-100 nearest neighbors for each input of the test set pairs, and then computed Recall @k and MRR (Mean Reciprocal Rank) @k for k = 1, 3, 5, 10, 20, 50, 100. For recall, I would score a match @k as successful if the output of the pair appeared within the top k nearest neighbors returned from the vector search. This score is then averaged across all the 1,000 test set pairs for each value of k. MRR is similar, except that the score for each test set pair and k is divided by the (1-based) position of the match (and is 0 if there is no match).
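
The per-query evaluation logic boils down to something like the sketch below, where ranked_ids is the ordered list of neighbor identifiers returned by the vector search and expected_id is the output half of the test pair (both names are placeholders).

def recall_and_mrr_at_k(ranked_ids, expected_id, ks=(1, 3, 5, 10, 20, 50, 100)):
    recalls, mrrs = {}, {}
    for k in ks:
        top_k = ranked_ids[:k]
        if expected_id in top_k:
            recalls[k] = 1.0
            mrrs[k] = 1.0 / (top_k.index(expected_id) + 1)   # 1-based rank of the match
        else:
            recalls[k] = 0.0
            mrrs[k] = 0.0
    return recalls, mrrs

# the per-query scores are then averaged over the ~1,000 test set pairs for each k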

The baseline for the experiment was computed using an index of encodings of the input part created using the stock all-MiniLM-L6-v2 model.

I had also recently read Jo Kristian Bergum's blog posts Billion Scale Vector Search with Vespa, part one and two, and more recently Matryoshka Binary Vectors: Slash Vector Search Costs with Vespa on the Vespa blog, where he compares the performance of vectors of different storage types, among other things. Vespa allows for many different storage types, but I am using OpenSearch, which offers support for float (float32), byte (int8) and binary (bool) storage types. I was curious to see (a) if I could replace my float-based vectors with these, and (b) if so, how their performance would compare. This is what this post is about.

The index mappings for the float, byte and binary vector fields are as follows. These need to be set during index creation.

Float Vector

{
    "type": "knn_vector",
    "dimension": 384,
    "method": {
        "name": "hnsw",
        "engine": "lucene",
        "space_type": "cosinesimil",
        "parameters": {
            "ef_construction": 128,
            "m": 24
        }
    }
}

Byte vector

{
    "type": "knn_vector",
    "dimension": 384,
    "data_type": "byte",
    "method": {
        "name": "hnsw",
        "engine": "lucene",
        "space_type": "cosinesimil",
        "parameters": {
            "ef_construction": 128,
            "m": 24,
        }
    }
}

Binary vector

{
    "type": "binary",
    "doc_values": "true"
}

To generate the vectors, I used the fine-tuned version of the all-MiniLM-L6-v2 model as my encoder, and post-processed the float32 vector returned from the encoder to int8 and binary using the following functions.

import binascii

import numpy as np

def convert_to_bytes(arr: np.ndarray):
    # quantize float32 components (assumed to lie in [-1, 1)) down to int8
    return np.floor(arr * 128).astype(np.int8)

def convert_to_bits(arr: np.ndarray):
    # binarize each component (1 if positive, else 0), pack the bits into bytes,
    # and hex-encode each row for OpenSearch's binary field
    bits = np.packbits(
        np.where(arr > 0, 1, 0)).astype(np.int8)
    arr_b = bits.reshape(arr.shape[0], -1)
    hex_bits = []
    for row in arr_b:
        hex_bits.append(str(binascii.hexlify(row), "utf-8"))
    return hex_bits

At the end of this indexing process, I had three separate indexes of approximately 10,000 records, one with one part of the pair encoded as a float32 vector, another encoded as a byte (int8) vector, and the last as a binary vector. To give you a rough idea of the storage requirements (rough because there are fields other than the vectors for each index), the sizes reported by /_cat/indices are shown below.

Vector Storage Type    Index Size
float                  184.9 MB
byte                    39.6 MB
binary                  15.6 MB

On the query side, I use the following Script Score queries as described in the Byte-quantized vectors in OpenSearch blog post and the Exact k-NN with scoring script documentation pages. The queries are all script scores as shown below. The queries for float and byte vectors are identical; the only difference is that the float vector is quantized down to int8 in the byte case.

Float and byte vector

{
    "script_score": {
        "query": {
            "match_all": {}
        },      
        "script": {
            "source": "knn_score",
            "lang": "knn",
            "params": {
                "field": "{field_name}",
                "query_value": "{float_or_byte_vector}",
                "space_type": "cosinesimil"
            }       
        }       
    }       
}    

Binary Vector

{
    "script_score": {
        "query": {
            "match_all": {}
        },
        "script": {
            "source": "knn_score",
            "lang": "knn",
            "params": {
                "field": "{field_name}",
                "query_value": "{binary_vector}",
                "space_type": "hammingbit"
            }
        }
    }
}

The chart below shows the Recall and MRR @k for various values of k as described above.

The first thing to note is that fine-tuning helps, or at least it helped a lot in this case, probably because the notion of similarity I was working with was more nuanced than Cosine similarity. With respect to float vectors versus byte vectors, float vectors have a slight edge, as you can see from the table below (if you look really hard, you can also see the float vector (orange line) and byte vector (green line) almost overlaid on each other in the chart). While binary vectors don't perform as well, they still perform better than the baseline, and they are much faster, so they can be useful as the first stage of a two-stage retrieval pipeline.

[Charts: Mean Reciprocal Rank @k and Recall @k for the baseline, float, byte and binary vectors]

In terms of response time, binary vectors are the fastest, followed by byte vectors, followed by float vectors. I measured the response time for each of the 1,000 test set queries to extract 100 nearest neighbors from OpenSearch, then calculated the mean and standard deviation for each vector storage type. Here are the results.

Vector Storage Type    Mean Response Time    Standard Deviation
float                  0.382 s               0.097 s
byte                   0.231 s               0.050 s
binary                 0.176 s               0.048 s

I originally brought this up in the Relevance Slack channel because I had made some mistakes in my evaluation and was getting results that did not agree with common sense. Thanks to all my "non-work colleagues" on the channel who helped me out by validating my observations and being sounding boards which eventually helped me find the mistakes in my analysis. In any case, I figured that the final results may be useful to the broader community of vector search enthusiasts and professionals, so I am sharing this. Hope you find it useful.

Tuesday, May 07, 2024

KGC/HCLS 2024 Trip Report

I was at KGC (Knowledge Graph Conference) 2024, which is happening May 6-10 at Cornell Tech. I was presenting (virtually) at their Health Care and Life Sciences (HCLS) workshop, so my speaker's pass was only valid for today, for the HCLS portion of KGC. My trip report covers a few talks that I attended here. Attending virtually was a bit chaotic, as sessions sometimes ran over, so you might leave a session to attend another, only to find that it hadn't started yet. This is hard to foresee; we faced this issue ourselves the first time we moved an internal conference from in-person to hybrid.

KGs in RAG (Tom Smoker, WhatWhyHow.AI)

I have been working with Large Language Models (LLMs) and Retrieval Augmented Generation (RAG) for almost a year now, and I went to this talk hoping for insights on how to use graphs as input to RAG systems. Understandably, the speaker spent some time covering the basics, which I personally did not find very fruitful. However, there were some nuggets of wisdom I got out of the talk. First, RAG pipelines can lower the risk of hallucination by using LLMs for planning and reasoning, but not delegating to them for factual information. And second, an agent architecture can more efficiently use smaller sub-graphs, which can often be generated dynamically in Closed World models.

A side discussion on chat also yielded a paper reference Getting from Generative AI to Trustworthy AI: what LLMs may learn from Cyc (Lenat and Marcus, 2023). The paper looks really interesting on an initial skim and I plan to read in more detail later.

Knowledge Graphs for Precision Oncology (Krishna Bulusu, AstraZeneca)

A nice overview of applications of Knowledge Graphs (KGs) to Drug Discovery (DD). DD attempts to apply KGs to solve three main problems: (1) finding the gene causing a disease, (2) matching drugs with diseases, and (3) treating (drug, gene, disease) as a fundamental relationship in DD. The speaker pointed out that the big advantage of KGs is Explainability. He also mentioned the use of graph clustering for node stratification.

Combining graph and vector representation for efficient information retrieval (Peio Popov, Ontotext)

This was a presentation from OntoText where they demonstrated new features built into their GraphDB database. This was of interest to me personally since our KG is also built using GraphDB. Specifically they have integrated LLM and vector search support into their products so they can be invoked from a SPARQL query. This gives GraphDB users the power to combine these techniques in the same call rather than build multi-stage pipelines.

I also learned the distinction between Semantic, Full text and Vector Search as being based on KGs, Lucene (or Lucene-like) indexes, and vector search platforms respectively; I had previously conflated the first and third.

Knowledge Engineering in Clinical Decision Support: When a Graph Representational Model is not enough (Maulik Kamdar, Optum)

This was a presentation from my ex-colleague Maulik Kamdar. He talks about challenges in Clinical Decision Support (CDS) where a KG alone is insufficient. Specifically, he considers the case where multiple third-party ontologies need to be aligned into one KG. In this situation, similar concepts are combined into ValueSets, which are then composed with bare concepts or with each other to form Clinical Rules. Clinical Rules are further combined to form Clinical Calculators or Questionnaires, which are then combined to form Decision Trees and Flowcharts, which are in turn combined into Clinical Guidelines. I am probably biased given our common history, but I found this talk to be the most educational for me.

Knowledge Graphs, Theorem Provers and Language Models (Vijay Saraswat and Nikolaos Vasiloglou)

The speakers discussed the role of self-discovery, In-Context Learning (ICL), symbiotic integration of KG with search, and Graph RAG in reasoning engines powered by KG and LLM. They characterize an Agent as an LLM based black box that is provided with pairs of input-output instances to learn some unknown function (similar to ML models). They describe ICL as learning through few shot and many shot examples. They also talk about using the output of KG to fact-check / enhance LLMs and using LLMs to generate assertions that can be used to create a KG. Their demo shows how an LLM is able to learn to generate a Datalog like graph query language from text prompts using few-shot examples.

The speaker made reference to the following three papers in support of the techniques he was describing, which I have duly added to my reading list.

A Scalable and Robust Named Entity Recognition and Linking System for a Clinical Healthcare Knowledge Graph (Sujit Pal, Elsevier Health)

This was my talk. I had originally intended to attend in person but it seemed wasteful to fly across the country to deliver a 5-minute presentation. It did take a bit of planning to present remotely but I learned two useful life lessons.

  1. You can generate a presentation video from MS Powerpoint. Simply create your slides and record a slideshow where you record yourself narrating your presentation. Once done, export as an MP4 and upload to Youtube or other video service.
  2. You can print posters online and have them delivered to someone else.

Huge thanks to my colleague Tom Woodcock who attended in person, and who was kind enough to carry and hang my poster at the conference for me, and who also agreed to present my slideshow for me (although I think that in the end he did not have to). Many thanks also to my ex-colleague Helena Deus (part of the HCLS organizing team), who helped walk me through to a workable solution and was instrumental in my talk being delivered successfully. Also thanks to Leah Walton from the HCLS organizing team, for supporting me in my attempt to present remotely.

Here is the Youtube video for my 5-minute presentation in case you are interested. It’s a bit high-level since I had only 5 minutes to cover everything, but there is a little more information in the poster below.

Graphs for good – Hypothesis generation for Rare Disease Treatment (Brian Martin, AbbVie)

This presentation revolves around a graph that connects diseases to drugs via disease variant, gene, pathway and compound entities. This was used to find a cure for a rare disease using existing medications. It was later extended to find candidate cures for a group of the 20 most neglected diseases worldwide. The speakers verified that the results for Dengue fever correlate well with previously known information, thus supporting the veracity of the approach. The paper describing this work is Leveraging a Billion-Edge Knowledge Graph for Drug Re-purposing and Target Prioritization using Genomically-Informed Subgraphs (Martin et al, 2022).

Generating and Querying Graphs with LLM (Brian Martin, Subha Madhavan, Berenice Wulbrecht)

This was a panel discussion where various strategies for generating and querying graphs using LLMs were discussed. There were entertaining (and somewhat predictable) comparisons of Property Graphs vs RDF graphs to Ford and Ferrari automobiles, and of how LLMs transform them into Teslas (with their self-driving technology). The panelists also talked about extracting assertions from a corpus of documents to create a KG customized for the corpus, and then using the KG to fact-check the output of the LLM for RAG queries against that corpus.

Overall, I think it was a great conference. Learned a lot, would love to go back and present here in the future, hopefully this time in person.

Saturday, March 23, 2024

Book Report: Machine Learning for Drug Discovery

Drug Discovery is a field where biochemists (and more recently computer scientists) turn ideas into potential medications. I first came across a few applications in this area when checking out how to build Graph Neural Networks (GNN) as part of auditing the CS224W: Machine Learning with Graphs course from Stanford, some learnings of which I recycled into my Deep Learning with Graphs tutorial at ODSC 2021. Of course, drug discovery is much more than just GNNs, I mention this only because this happened to be my entry point into this fascinating world. However, I will hasten to add that despite having made an entrance, I am still parked pretty solidly close to the entrance (or exit, depending on your point of view).

But I am always looking to learn more about stuff I find interesting, so when I was offered a chance to review Dr Noah Flynn's Machine Learning for Drug Discovery, published by Manning, I jumped on it. The book is currently in MEAP (Manning Early Access Program), so only 5 chapters are available at the moment, but the completed book will have 15 chapters in all. The intended audience of the book, as the title suggests, is computational biochemists, i.e. those who attempt to solve Drug Discovery problems using Machine Learning. There are two main ways to become a computational biochemist -- either you are a biochemist and you learn the ML, or you are an ML person and you learn the biochemistry. The book is aimed at both categories of readers.

As someone in the latter category, I had to spend much more time on the biochemistry aspects. I suspect that most readers of this review would also fall into this category. For them, I would say that while the ML part is sophisticated enough to solve the problem at hand, the methods and practices should be familiar to most ML people already. The most useful things that I think you would get out of this book are as follows:

  • Framing the Drug Discovery problem as a ML problem
  • Preprocessing and Encoding inputs
  • Getting data to train your ML model

For the first one, you either need to have a biochemistry background yourself, or you need to pair with someone who does. I suppose you could get by with a life sciences or chemistry background as well, or acquire enough biochemistry knowledge over time in this field, and this book may even take you part of the way there, but be aware that the learning curve is steep.

For the second and the third items, I thought the book was super useful. Most chapters are built as case studies around a Drug Discovery problem, so as you go through the chapters, you will learn about the sites to acquire your datasets from, and the techniques to preprocess the data from these sites into a form suitable for consumption by your ML model. At least the first 5 chapters deal with fairly simple ML models, but which may or may not be familiar to you depending on your industry, so you might also learn a few things about evaluating or tuning these models that you didn't know before (I did).

The first chapter introduces the reader to the domain and talks about the need for computational approaches to Drug Discovery. It introduces the terminology and the RDKit software library, an open-source cheminformatics toolkit that provides implementations of many common operations needed for computational Drug Discovery (sort of like a specialized supplement to Scikit-Learn for general ML). It also covers high-level rules of thumb for identifying drug-like compounds, such as Lipinski's rule of 5. It then covers some use cases common in Drug Discovery, ranging from Virtual Screening to Generative and Synthetic Chemistry. It also covers some popular (and public) repositories of Chemistry data, such as ChEMBL, PubChem, the Protein Data Bank (PDB), etc.

The second chapter demonstrates Ligand based Screening, where you already have a reference molecule with some of the desired properties, and you want to search the chemical space for molecules similar to that one, with the objective of finding more drugs like the one you started with. The case study here is to identify potential anti-malarial compounds. The dataset for this comes packaged with RDKit itself as Structure Definition Files (SDF), which describe each molecule using a SMILES (Simplified Molecular Input Line Entry System) string. The chapter walks us through converting the SMILES to MOL format, then using RDKit to extract specialized chemical features from the MOL and SMILES, preprocessing to filter out uninteresting molecules based on rule-based thresholds such as bio-availability and molecular weight, structure-based thresholds such as toxicity, and specific substructural patterns (similar to subgraph motifs). It then uses RDKit to generate Morgan fingerprints from the remaining molecules (MOL). Morgan (and other) fingerprints are similar to embeddings in NLP, except that they encode structural information through a more deterministic process, and are hence more explainable than embeddings. Finally, these fingerprints are compared with the reference molecule using Tanimoto similarity and the nearest neighbors are found.
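
A condensed sketch of that workflow using RDKit; the reference SMILES, the candidate_smiles list and the molecular-weight cutoff are placeholders for the chapter's actual data and filters.

from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem, Descriptors

reference = Chem.MolFromSmiles("CCOC(=O)N")            # placeholder reference molecule
candidates = [Chem.MolFromSmiles(s) for s in candidate_smiles]

# rule-based filtering, e.g. on molecular weight (threshold illustrative)
candidates = [m for m in candidates if m is not None and Descriptors.MolWt(m) < 500]

ref_fp = AllChem.GetMorganFingerprintAsBitVect(reference, 2, nBits=2048)
scored = []
for mol in candidates:
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    scored.append((DataStructs.TanimotoSimilarity(ref_fp, fp), Chem.MolToSmiles(mol)))
scored.sort(reverse=True)                              # nearest neighbors of the reference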

Chapter 3 continues with the problem of Ligand based screening, but tries to predict the cardiotoxicity of the anti-malarial compounds found in the previous chapter using a linear model. This is done indirectly: if the compound blocks the hERG potassium channel, then it is cardiotoxic, and vice versa. A linear model (Scikit-Learn SGD Classifier) is trained using the hERG dataset from the Therapeutic Data Commons (TDC). The chapter shows some Exploratory Data Analysis (EDA) on the data, using the standard preprocessing described in the previous chapter. An additional step here is to standardize (regularize) the data for classification. The author provides the biochemistry reasoning behind this step, but uses the implementation already provided by RDKit. Finally, Morgan fingerprints are used to train the SGD Classifier. Because the elements of Morgan fingerprints have meaning, the weights of the resulting SGD model can be used to determine feature importances. There is also some discussion here of cross validation, L1/L2 regularization, removing collinearity, adding interaction terms and hyperparameter sweeps.

Chapter 4 explores building a linear regression model to predict solubility, i.e. how much of the drug would be absorbed by the system. The dataset used to train the regressor is AqSolDB, also from TDC. This chapter introduces the idea of scaffold splitting, a technique common with biochemical datasets that preserves the structural / chemical similarity within each split. It also briefly describes outlier removal at the extremes, which requires chemistry knowledge. The RDKit library is used to extract features from the dataset, and the model is trained to minimize the Mean Squared Error loss. The RANSAC (RANdom SAmple Consensus) technique, which makes models more robust to outliers, is also introduced. On the ML side, there is some discussion of the bias-variance tradeoff and Learning / Validation curves.

The fifth and last chapter of the MEAP (at the time of writing this review) deals with predicting how well the body will metabolize the drug. Typically, drugs are broken down by enzymes in the liver, a large proportion of which are collectively known as the Cytochrome P450 superfamily. As before, metabolism is predicted indirectly by whether the drug inhibits Cytochrome P450 -- if it does, then it will not get metabolized easily, and vice versa. The dataset used to train the model is the CYP3A4 dataset, also from TDC. Data is prepared using the same (by now) standard pipeline, and the classifier is trained to make a binary prediction of whether the input inhibits Cytochrome P450 or not. The chapter discusses the utility of Reliability Plots in Performance Evaluation and Platt scaling for calibrating probabilities. It also talks about how to deal with imbalanced datasets using Data Augmentation, Class Weights and other approaches. Various models are trained and evaluated, and their important features identified and visualized with RDKit Similarity Maps. The chapter ends with a short discussion of Multi-label classification.

The pandemic and the rapid discovery of the COVID vaccine gave a lot of us (at least those of us that were watching) a ringside view into the fascinating world of drug discovery. This book provides yet another peek into this world, with its carefully crafted case studies and examples. Overall, I think you will learn a lot about drug discovery if you go through this book, both on the biochemistry side and the ML side. There are exercises at the end of each chapter, doing these would help you get more familiar with RDKit and hopefully more effective at computational drug discovery.

Sunday, March 17, 2024

Hierarchical (and other) Indexes using LlamaIndex for RAG Content Enrichment

At our weekly This Week in Machine Learning (TWIML) meetings, (our leader and facilitator) Darin Plutchok pointed out a LinkedIn blog post on Semantic Chunking that has recently been implemented in the LangChain framework. Unlike more traditional chunking approaches that use the number of tokens or separator tokens as a guide, this one chunks groups of sentences into semantic units by breaking them when the (semantic) similarity between consecutive sentences (or sentence-grams) falls below some predefined threshold. I had tried it earlier (pre-LangChain) and while results were reasonable, it needed a lot of processing, so I went back to what I was using before.

I was also recently exploring LlamaIndex as part of an effort to familiarize myself with the GenAI ecosystem. LlamaIndex supports hierarchical indexes natively, meaning it provides the data structures that make building them easier and more natural. Unlike the typical RAG index, which is just a sequence of chunks (and their vectors), hierarchical indexes cluster chunks into parent chunks, and parent chunks into grandparent chunks, and so on. A parent chunk would generally inherit or merge most of the metadata from its children, and its text would be a summary of its children's text contents. To illustrate my point about LlamaIndex data structures having natural support for this kind of setup, here are the definitions of the LlamaIndex TextNode (the LlamaIndex Document object is just a child of TextNode with an additional doc_id: str field) and the LangChain Document. Of particular interest is the relationships field, which allows pointers to other chunks using named relationships PARENT, CHILD, NEXT, PREVIOUS, SOURCE, etc. Arguably, the LlamaIndex TextNode can be represented more generally and succinctly by the LangChain Document, but the hooks do help to support hierarchical indexing more naturally.

# this is a LlamaIndex TextNode
class TextNode:
  id_: str = None
  embedding: Optional[List[float]] = None
  extra_info: Dict[str, Any]
  excluded_embed_metadata_keys: List[str] = None
  excluded_llm_metadata_keys: List[str] = None
  relationships: Dict[NodeRelationship, Union[RelatedNodeInfo, List[RelatedNodeInfo]]] = None
  text: str
  start_char_idx: Optional[int] = None
  end_char_idx: Optional[int] = None
  text_template: str = "{metadata_str}\n\n{content}"
  metadata_template: str = "{key}: {value}"
  metadata_separator: str = "\n"

# and this is a LangChain Document
class Document:
  page_content: str
  metadata: Dict[str, Any]

In any case, having discovered the hammer that is LlamaIndex, I began to see a lot of potential hierarchical-index nails. One such nail that occurred to me was to use Semantic Chunking to cluster consecutive chunks rather than sentences (or sentence-grams), and then create parent nodes from these chunk clusters. Instead of computing cosine similarity between consecutive sentence vectors to build up chunks, we compute cosine similarity across consecutive chunk vectors and split them up into clusters based on some similarity threshold, i.e. if the similarity drops below the threshold, we terminate the cluster and start a new one.

Both LangChain and LlamaIndex have implementations of Semantic Chunking (for sentence clustering into chunks, not chunk clustering into parent chunks). LangChain's Semantic Chunking allows you to set the threshold using percentiles, standard deviation and inter-quartile range, while the LlamaIndex implementation supports only the percentile threshold. But intuitively, here's how you could get an idea of the percentile threshold to use -- thresholds for the other methods can be computed similarly. Assume your content has N chunks and K clusters (based on your understanding of the data or from other estimates), then assuming a uniform distribution, there would be N/K chunks in each cluster. If N/K is approximately 20%, then your percentile threshold would be approximately 80.
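
In code, the splitting logic is essentially the sketch below, where sims holds the cosine similarities between consecutive chunk vectors (assumed precomputed) and breakpoint_percentile is chosen using the intuition above.

import numpy as np

threshold = np.percentile(sims, breakpoint_percentile)   # e.g. 80 -> split at the lowest ~20% of similarities
clusters, current = [], [0]
for i, sim in enumerate(sims):
    if sim < threshold:          # similarity drop: close the current cluster
        clusters.append(current)
        current = []
    current.append(i + 1)        # chunk i+1 follows the boundary at similarity sims[i]
clusters.append(current)         # clusters is now a list of lists of chunk indices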

LlamaIndex provides an IngestionPipeline which takes a list of TransformComponent objects. My pipeline looks something like the one below. The last component is a custom subclass of TransformComponent; all you need to do is override its __call__ method, which takes a List[TextNode] and returns a List[TextNode].

# import paths assume recent (>= 0.10) llama_index packaging
from llama_index.core import SimpleDirectoryReader
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

transformations = [
    SentenceSplitter(),                      # text_splitter
    HuggingFaceEmbedding(),                  # embedding_generator
    SemanticChunkingSummaryNodeBuilder(),    # summary_node_builder (custom component, described below)
]
ingestion_pipeline = IngestionPipeline(transformations=transformations)
docs = SimpleDirectoryReader("/path/to/input/docs").load_data()
nodes = ingestion_pipeline.run(documents=docs)

My custom component takes the desired cluster size K during construction. It uses the vectors computed by the (LlamaIndex provided) HuggingFaceEmbedding component to compute similarities between consecutive vectors, and uses K to compute a threshold to use. It then uses the threshold to cluster the chunks, resulting in a list of lists of chunks List[List[TextNode]]. For each cluster, we create a summary TextNode and set its CHILD relationships to the cluster nodes, and the PARENT relationship of each child in the cluster to this new summary node. The texts of the child nodes are first condensed using extractive summarization, then these condensed summaries are further summarized into one final summary using abstractive summarization. I used bert-extractive-summarizer with bert-base-uncased for the first and a HuggingFace summarization pipeline with facebook/bart-large-cnn for the second. I suppose I could have used an LLM for the second step, but it would have taken more time to build the index, and I have been experimenting with ideas described in the DeepLearning.AI course Open Source Models with HuggingFace.

Finally, I recalculate the embeddings for the summary nodes -- I ran the summary node texts through the HuggingFaceEmbedding, but I guess I could have done some aggregation (mean-pool / max-pool) on the child vectors as well.
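
A skeletal sketch of such a component is shown below; cluster_consecutive_chunks, summarize and merge_metadata are placeholders for the clustering, two-stage summarization and metadata-merging logic described above, and the import paths assume a recent llama_index.

from typing import List

from llama_index.core.schema import (NodeRelationship, RelatedNodeInfo,
                                     TextNode, TransformComponent)

class SemanticChunkingSummaryNodeBuilder(TransformComponent):
    num_clusters: int = 90   # desired K

    def __call__(self, nodes: List[TextNode], **kwargs) -> List[TextNode]:
        clusters = cluster_consecutive_chunks(nodes, self.num_clusters)
        summary_nodes = []
        for cluster in clusters:
            summary_text = summarize([n.text for n in cluster])
            parent = TextNode(text=summary_text, metadata=merge_metadata(cluster))
            parent.relationships[NodeRelationship.CHILD] = [
                RelatedNodeInfo(node_id=n.node_id) for n in cluster]
            for n in cluster:
                n.relationships[NodeRelationship.PARENT] = RelatedNodeInfo(node_id=parent.node_id)
            summary_nodes.append(parent)
        return nodes + summary_nodes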

Darin also pointed out another instance of a Hierarchical Index, proposed in the RAPTOR paper (Recursive Abstractive Processing for Tree-Organized Retrieval) and described in detail by the authors in this LlamaIndex webinar. This is a bit more radical than my idea of using semantic chunking to cluster consecutive chunks, in that it allows clustering of chunks across the entire corpus. One other significant difference is that it allows for soft clustering, meaning a chunk can be a member of more than one cluster. They first reduce the dimensionality of the vector space using UMAP (Uniform Manifold Approximation and Projection) and then apply a Gaussian Mixture Model (GMM) to do the soft clustering. To find the optimum number of clusters K for the GMM, one can use a combination of AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion).
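
A sketch of this clustering step, assuming umap-learn and scikit-learn; chunk_vectors is the matrix of chunk embeddings (assumed precomputed), and the reduced dimensionality, candidate K range and membership threshold are illustrative.

import numpy as np
import umap
from sklearn.mixture import GaussianMixture

reduced = umap.UMAP(n_components=10, metric="cosine").fit_transform(chunk_vectors)

candidate_ks = list(range(2, 120))
aics, bics = [], []
for k in candidate_ks:
    gmm = GaussianMixture(n_components=k, random_state=42).fit(reduced)
    aics.append(gmm.aic(reduced))
    bics.append(gmm.bic(reduced))

best_k = candidate_ks[int(np.argmin(bics))]
gmm = GaussianMixture(n_components=best_k, random_state=42).fit(reduced)
probs = gmm.predict_proba(reduced)     # soft cluster memberships
members = probs > 0.1                  # a chunk can belong to more than one cluster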

In my case, when training the GMM, the AIC kept decreasing as the number of clusters increased, and the BIC had its minimum value for K=10, which corresponds roughly to the 12 chapters in my Snowflake book (my test corpus). But there was a lot of overlap, which would force me to implement some sort of logic to take advantage of the soft clustering, which I didn't want to do, since I wanted to reuse code from my earlier Semantic Chunking node builder component. Ultimately, I settled on 90 clusters by using my original intuition to compute K, and the resulting clusters seem reasonably well separated as seen below.

Using the results of the clustering, I built this also as another custom LlamaIndex TransformComponent for hierarchical indexing. This implementation differs from the previous one only in the way it assigns nodes to clusters, all other details with respect to text summarization and metadata merging are identical.

For both these indexes, we have a choice to maintain the index as hierarchical, and decide which layer(s) to query based on the question, or add the summary nodes into the same level as the other chunks, and let vector similarity surface them when queries deal with cross-cutting issues that may be found together in these nodes. The RAPTOR paper reports that they don't see a significant gain using the first approach over the second. Because my query functionality is LangChain based, my approach has been to generate the nodes and then reformat them into LangChain Document objects and use LCEL to query the index and generate answers, so I haven't looked into querying from a hierarchical index at all.

Looking back on this work, I am reminded of similar choices when designing traditional search pipelines. Often there is a choice between building functionality into the index to support a cheaper query implementation, or building the logic into the query pipeline that may be more expensive but also more flexible. I think LlamaIndex started with the first approach (as evidenced by their blog posts Chunking Strategies for Large Language Models Part I and Evaluating Ideal Chunk Sizes for RAG Systems using LlamaIndex) while LangChain started with the second, even though nowadays there is a lot of convergence between the two frameworks.

Saturday, February 24, 2024

Thoughts on using LangChain LCEL with Claude

I got into Natural Language Processing (NLP) and Machine Learning (ML) through Search. And this led me into Generative AI (GenAI), which led me back to Search via Retrieval Augmented Generation (RAG). RAG started out relatively simple -- take a query, generate search results, use search results as context for a Large Language Model (LLM) to generate an abstractive summary of the results. Back when I started on my first "official" GenAI project in the middle of last year, there were not too many frameworks to support building GenAI components (at least not the prompt based ones), except maybe LangChain, which was just starting out. But prompting as a concept is not too difficult to understand and implement, so that's what we did at the time.

I did have plans to use LangChain in my project once it became more stable, so I started out building my components to be "langchain compliant". But that turned out to be a bad idea as LangChain continued its exponential (and from the outside at least, somewhat haphazard) growth and showed no signs of stabilizing. At one point, LangChain users were advised to make pip install -U langchain part of their daily morning routine! So anyway, we ended up building up our GenAI application by hooking up third party components with our own (non-framework) code, using Anthropic's Claude-v2 as our LLM, ElasticSearch as our lexical / vector document store and PostgreSQL as our conversational buffer.

While I continue to believe that the decision to go with our own code made more sense than trying to jump on the LangChain (or Semantic Kernel, or Haystack, or some other) train, I do regret it in some ways. A collateral benefit for people who adopted and stuck with LangChain were the ready-to-use implementations of cutting-edge RAG and GenAI techniques that the community implemented at almost the same pace as they were being proposed in academic papers. For the subset of these people that were even slightly curious about how these implementations worked, this offered a ringside view into the latest advances in the field and a chance to stay current with it, with minimal effort.

So anyway, in an attempt to replicate this benefit for myself (going forward at least), I decided to learn LangChain by doing a small side project. Earlier I needed to learn to use Snowflake for something else and had their free O'Reilly book on disk, so I converted it to text, chunked it, and put it into a Chroma vector store. I then tried to implement examples from the DeepLearning.AI courses LangChain: Chat with your Data and LangChain for LLM Application Development. The big difference is that the course examples use OpenAI's GPT-3 as their LLM whereas I use Claude-2 on AWS Bedrock in mine. In this post, I share the issues I faced and my solutions, hopefully this can help guide others in similar situations.

A couple of observations here. First, the granularity of GenAI components is necessarily larger than that of traditional software components, which means that application details the developer of the component was working on can leak into the component itself (mostly through the prompt). To a user of the component, this can manifest as subtle bugs. Fortunately, LangChain developers seem to have noticed this as well, and have come up with the LangChain Expression Language (LCEL), a small set of reusable components that can be composed to create chains from the ground up. They have also marked a large number of Chains as Legacy Chains (to be converted to LCEL chains in the future).

Second, most of the components (or chains, since that is LangChain's central abstraction) are developed against OpenAI GPT-3 (or its chat version GPT-3.5 Turbo), whose strengths and weaknesses may be different from those of your LLM. For example, OpenAI is very good at generating JSON output, whereas Claude is better at generating XML. I have also seen that Claude can terminate XML / JSON output mid-output unless forced to complete using stop_sequences. This doesn't seem to be a problem GPT-3 users have observed -- when I mentioned this problem and the fix, I drew a blank on both counts.

To address the first issue, my general approach in trying to re-implement these examples has been to use LCEL to build my chains from scratch. I attempt to leverage the expertise available in LangChain by looking in the code, or by running the existing LangChain chain with langchain.debug set to True. Doing this lets me see the prompt being used and the flow, which I can then adapt for my own LCEL chain. To address the second issue, I play to Claude's strengths by specifying XML output format in my prompts and parsing the outputs into Pydantic objects for data transfer across chains.
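For example, to see the prompt behind the legacy QAEvalChain used in the lesson, I can run it once with debugging turned on; LangChain then dumps the fully rendered prompt and every intermediate step to the console, which I can adapt into my own LCEL chain. The toy inputs below are made up, and the key names (query / answer / result) are the defaults that QAEvalChain.evaluate() expects.

import langchain
from langchain.evaluation.qa import QAEvalChain
from langchain_community.llms import Bedrock

# print every rendered prompt and intermediate step to the console
langchain.debug = True

llm = Bedrock(model_id="anthropic.claude-v2")
legacy_chain = QAEvalChain.from_llm(llm)

# run the legacy chain once on a toy example and study the prompt it uses
graded = legacy_chain.evaluate(
    examples=[{"query": "What is Snowflake?",
               "answer": "A cloud-based data platform"}],
    predictions=[{"result": "Snowflake is a cloud data warehouse"}])

langchain.debug = False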

The example application I will use to illustrate these techniques here is derived from the Evaluation lesson from the LangChain for LLM Application Development course, and is illustrated in the diagram below. The application takes a chunk of text as input, and uses the Question Generation chain to generate multiple question-answer pairs from it. The questions and the original content are fed into the Question Answering chain, which uses the question to generate additional context from a vector retriever, and uses all three to generate an answer. The answer generated from the Question Generation chain and the answer generated from the Question Answering chain are fed into a Question Generation Evaluation chain, where the LLM grades one against the other, and generates an aggregate score for the questions generated from the chunk.
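To make the data flow concrete, here is roughly how the three chains hang together for a single chunk. The chain names, input keys, and attributes on the parsed outputs are placeholders of mine for this sketch -- each chain is assumed to wrap the prompt | model | parser pattern described below, followed by the XML-to-Pydantic parsing shown later.

def evaluate_chunk(chunk, qg_chain, qa_chain, eval_chain):
    """Generate QA pairs from a chunk, answer them, and grade the answers."""
    # Question Generation chain: (question, answer) pairs from the chunk
    qa_pairs = qg_chain.invoke({"content": chunk})
    scores = []
    for pair in qa_pairs:
        # Question Answering chain: retrieve additional context and answer
        predicted = qa_chain.invoke({"question": pair.question, "content": chunk})
        # Evaluation chain: grade the QA chain's answer against the QG chain's
        qa_eval = eval_chain.invoke({
            "question": pair.question,
            "context": chunk,
            "predicted_answer": predicted.answer,
            "generated_answer": pair.answer,
        })
        scores.append(1.0 if qa_eval.grade == "CORRECT" else 0.0)
    # aggregate score for the questions generated from this chunk
    return sum(scores) / len(scores) if scores else 0.0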

Each chain in this pipeline is actually quite simple: each takes one or more inputs and generates a block of XML. All the chains are structured as follows:

from langchain_core.output_parsers import StrOutputParser

chain = prompt | model | StrOutputParser()
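Filled out for my setup, the model and prompt pieces might be wired up as below. This is a sketch under a few assumptions -- it uses the Bedrock LLM integration from langchain_community for Claude-v2, EVAL_PROMPT_TEMPLATE is a placeholder for the prompt text shown below, and model_kwargs is where the stop_sequences fix mentioned earlier gets configured (the exact stop value depends on your prompt).

from langchain_community.llms import Bedrock
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate

# Claude-v2 on AWS Bedrock
model = Bedrock(
    model_id="anthropic.claude-v2",
    model_kwargs={
        "temperature": 0.0,
        "max_tokens_to_sample": 1024,
        "stop_sequences": ["\n\nHuman:"],
    })

# EVAL_PROMPT_TEMPLATE holds the Human / Assistant prompt text shown below,
# with {question}, {context}, etc. as its input variables
prompt = PromptTemplate.from_template(EVAL_PROMPT_TEMPLATE)

chain = prompt | model | StrOutputParser()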

And all our prompts follow the same general format. Here is the prompt for the Evaluation chain (the third one) which I adapted from the QAEvalChain used in the lesson notebook. Developing from scratch using LCEL gives me the chance to use Claude's Human / Assistant format (see LangChain Guidelines for Anthropic) rather than depend on the generic prompt that happens to work well for GPT-3.

Human: You are a teacher grading a quiz.

You are given a question, the context the question is about, and the student's 
answer.

QUESTION: {question}
CONTEXT: {context}
STUDENT ANSWER: {predicted_answer}
TRUE ANSWER: {generated_answer}

You are to score the student's answer as either CORRECT or INCORRECT, based on the 
context.

Write out in a step by step manner your reasoning to be sure that your conclusion 
is correct. Avoid simply stating the correct answer at the outset.

Please provide your response in the following format:

<result>
    <qa_eval>
        <question>the question here</question>
        <student_answer>the student's answer here</student_answer>
        <true_answer>the true answer here</true_answer>
        <explanation>step by step reasoning here</explanation>
        <grade>CORRECT or INCORRECT here</grade>
    </qa_eval>
</result>

Grade the student answers based ONLY on their factual accuracy. Ignore differences in 
punctuation and phrasing between the student answer and true answer. It is OK if the 
student answer contains more information than the true answer, as long as it does not 
contain any conflicting statements.

Assistant:

In addition, I specify the formatting instructions explicitly in the prompt instead of using the canned ones from XMLOutputParser or PydanticOutputParser via get_format_instructions(), which are comparatively quite generic and sub-optimal. By convention, the outermost tag in my format is always <result>...</result>. The qa_eval tag inside result has a corresponding Pydantic class analog declared in the code as follows:

from pydantic import BaseModel, Field

class QAEval(BaseModel):
    question: str = Field(alias="question", description="question text")
    student_answer: str = Field(alias="student_answer",
                                description="answer predicted by QA chain")
    true_answer: str = Field(alias="true_answer",
                             description="answer generated by QG chain")
    explanation: str = Field(alias="explanation",
                             description="chain of thought for grading")
    grade: str = Field(alias="grade",
                       description="LLM grade CORRECT or INCORRECT")

After the StrOutputParser extracts the LLM output into a string, it is first passed through a regular expression to remove any content outside the <result>...</result> tags, and then converted into the QAEval Pydantic object using the following code. This keeps object manipulation between chains independent of the output format, and removes the need for format-specific parsing downstream.

import re
import xmltodict

from pydantic import BaseModel, Field
from typing import Generic, TypeVar

T = TypeVar("T")

# generic wrapper around the contents of the <result>...</result> tag
# (Pydantic v2 drops GenericModel, so we subclass BaseModel with typing.Generic)
class Result(BaseModel, Generic[T]):
    value: T = Field(alias="result")

def parse_response(response):
    response = response.strip()
    start_tag, end_tag = "<result>", "</result>"
    is_valid = response.startswith(start_tag) and response.endswith(end_tag)
    if not is_valid:
        # strip any extra commentary the LLM added outside the result tags
        pattern = f"(?:{start_tag})(.*)(?:{end_tag})"
        p = re.compile(pattern, re.DOTALL)
        m = p.search(response)
        if m is not None:
            response = start_tag + m.group(1) + end_tag
    resp_dict = xmltodict.parse(response)
    result = Result(**resp_dict)
    return result

# example call
response = chain.invoke({
    "question": "the question",
    "context": "the context",
    "predicted_answer": "the predicted answer",
    "generated_answer": "the generated answer"
})
result = parse_response(response)
qa_eval = QAEval(**result.value["qa_eval"])

One downside to this approach is that it uses the current version of the Pydantic toolkit (v2) whereas LangChain still uses Pydantic v1 internally, as described in LangChain's Pydantic compatibility page. This is why this conversion needs to live outside LangChain, in the application code. Ideally, I would like this to be part of a subclass of PydanticOutputParser where the format instructions could be generated from the class definition as a nice side effect, but that would mean more work than I am prepared to do at this point :-). Meanwhile, this seems like a decent compromise.
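For what it is worth, the format-instructions half of that idea does not strictly need a parser subclass -- a small helper can derive the XML skeleton from the Pydantic field descriptions. The helper below is just a sketch of that idea (the function name and tag arguments are mine, not LangChain's).

def xml_format_instructions(model_cls, root_tag="result", inner_tag="qa_eval"):
    """Build the <result>...</result> block for the prompt from a Pydantic v2 model."""
    lines = [f"<{root_tag}>", f"    <{inner_tag}>"]
    for name, field in model_cls.model_fields.items():
        desc = field.description or name
        lines.append(f"        <{name}>{desc} here</{name}>")
    lines.append(f"    </{inner_tag}>")
    lines.append(f"</{root_tag}>")
    return "\n".join(lines)

# for QAEval this produces a block very close to the one hand-written into
# the Evaluation prompt above
print(xml_format_instructions(QAEval))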

That's all I had for today. Thank you for staying with me so far, and hope you found this useful!

Saturday, February 03, 2024

Book Report: Allen B Downey's Probably Overthinking It

I have read Allen Downey's books on statistics in the past, when trying to turn myself from a Software Engineer into what Josh Wills says a Data Scientist is -- someone who is better at statistics than a Software Engineer and better at software than a statistician (with somewhat limited success in the first area, I will hasten to add). Last year, I had the good fortune to present at PyData Global 2023 (the video is out finally!) so had a free ticket to attend, and one of the talks I really enjoyed there was Allen Downey's talk Extremes, Outliers and GOATs: on life in a lognormal world. In it, he mentions that this is essentially the material from Chapter 4 of his book Probably Overthinking It. I liked his talk enough to buy the book, and I wanted to share my understanding of this book with you all, hence this post.

The book is not as dense as a "real" book on stats like, say, The Elements of Statistical Learning, but it is definitely not light reading. I tried reading it on a flight from San Francisco to Philadelphia (and back) and found it pretty heavy going. While the writing is lucid and illustrated with tons of well-explained and easy to understand examples, most of these were new concepts to me, and I wished I had taken notes after each chapter so I could relate all these concepts together well enough to reason about them rather than just learn about them. So I did another pass through the book, this time with pen and paper, and I now feel more confident about talking to other people about it. Hopefully, this is also helpful for folks who have done (or are planning to do) the first pass on the book but not the second.

Most people who are new to statistics (me included) lay great store in the Gaussian (Normal) distribution to explain or model various datasets. Chapter 1 challenges this idea and demonstrates that while individual traits may follow a Gaussian distribution, a combination of such traits can be a very restrictive filter. In other words, almost all of us are weird (i.e. not normal). For me, it also introduces the Cumulative Distribution Function (CDF) as a modeling tool.

The second chapter introduces the Inspection Paradox, which explains why it always seems like our wait time for the next train is longer than the average wait time between trains, among other things. The explanation lies in the sampling strategy -- when we sample by individual rather than by group, over-represented groups get sampled more often (a rider is more likely to arrive during a long gap between trains than a short one), skewing the observed average upward. It also describes a practical use case of this paradox to detect COVID superspreaders.
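As a quick illustration of the train example (my own toy simulation, not one from the book): if the gaps between trains are exponentially distributed with a mean of 10 minutes, the gap that a randomly arriving rider lands in averages about 20 minutes, because riders are more likely to arrive during long gaps than short ones.

import numpy as np

rng = np.random.default_rng(42)

# gaps between consecutive trains: exponential, mean 10 minutes
gaps = rng.exponential(scale=10.0, size=100_000)
train_times = np.cumsum(gaps)

# riders arrive uniformly over the day and experience the gap they land in
riders = rng.uniform(0, train_times[-1], size=100_000)
experienced_gaps = gaps[np.searchsorted(train_times, riders)]

print(gaps.mean())              # ~10 minutes (the operator's view)
print(experienced_gaps.mean())  # ~20 minutes (the rider's view, length-biased)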

The third chapter describes what the author calls Preston's Paradox, based on a 1976 paper by Samuel Preston. The paradox is that even if every woman has fewer children than her mother, the average family size can increase over time. The paradox is explained by an idea similar to the Inspection Paradox: because there are more women in existence from large families than from small ones, a larger proportion of women end up having large families, and overall that contributes to an increase in family size. The opposite can hold true as well, which is demonstrated by the loosening of reproductive restrictions in the aftermath of China's one-child policy not having the desired effect of boosting family sizes.

Chapter 4 is the one the author talked about in the PyData Global talk. In it, he demonstrates that certain attributes are better explained by a log-normal distribution -- one where the logarithm of the values follows a Gaussian -- rather than our familiar Gaussian distribution itself. This is especially true for outlier-type distributions, such as the performance numbers of GOAT (Greatest Of All Time) athletes compared to the general population. The explanation is that GOAT performance is almost always a multiplicative combination of innate human prowess (nature), these skills being effectively harnessed and trained (nurture), and a whole lot of other factors that all have to line up just so for the event to happen. Their contributions to the outcome are therefore multiplicative rather than additive, hence the effectiveness of the log-normal distribution over the normal one.
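A quick way to convince yourself of this (again, my own toy simulation rather than the book's): adding up many independent positive factors gives the familiar bell curve, but multiplying them gives a skewed, long-tailed distribution whose logarithm is approximately Gaussian, i.e. a log-normal.

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)

# 20 independent positive factors per "athlete", each a little above or below 1
factors = rng.uniform(0.5, 1.5, size=(100_000, 20))

additive = factors.sum(axis=1)         # factors combine by adding
multiplicative = factors.prod(axis=1)  # factors combine by multiplying

print(skew(additive))                # ~0: symmetric, roughly Gaussian (CLT)
print(skew(multiplicative))          # large and positive: long right tail
print(skew(np.log(multiplicative)))  # ~0 again: the log is roughly Gaussian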

Chapter 5 explores the different survival characteristics of different populations and classifies them as either NBUE (New Better than Used in Expectation) or NWUE (New Worse than Used in Expectation). The former would apply for predicting the remaining life of lightbulbs with use, and the latter for predicting cancer survivability and child mortality over time. Using child mortality statistics, the author shows that as healthcare improves and becomes more predictable across age categories, the NWUE distribution changes to resemble more closely a NBUE distribution.

Chapter 6 explores Berkson's Paradox, where a sub-sample selected from a population using some selection criteria can exhibit correlations that did not exist in the population, or correlations that are opposite to those observed in the population. Berkson originally pointed out the paradox as a warning about using hospital data (a sub-sample) to draw conclusions about the general population. The selection criteria restrict the general population in specific ways, changing the composition of the traits in the sub-sample and giving rise to the paradox.

Chapter 7 warns about the dangers of interpreting correlation as causation, something most of us have probably read or heard about many times in the popular Data Science literature. The main case study here is mothers who smoke (or don't smoke) and their low birth weight (LBW) babies. A study concluded that while smokers were more likely to give birth to LBW babies, and LBW babies had a higher mortality rate, the mortality rate of LBW babies whose mothers smoked was 48% lower than that of LBW babies whose mothers didn't smoke. Further, LBW babies of non-smokers also had a higher rate of birth defects. If we interpret this correlation as causation, i.e. do not heed the warning, it seems like maternal smoking is beneficial for LBW babies, protecting them from mortality and birth defects. The actual explanation is that maternal smoking is not the only cause of LBW, and birth defects may be congenital and not linked to smoking; these two factors mean that there are biological explanations for LBW other than maternal smoking. This and a few other examples segue naturally into a brief high-level introduction to Causal Reasoning, which I also found useful.

Following on from GOAT events being better represented by log-normal rather than normal distributions, Chapter 8 describes applying this idea to model extremely rare events (such as earthquakes and stock market crashes), and concludes that while the log-normal distribution is more "long-tailed" than a Gaussian, rare events have an even longer tail that is better modeled by the log-Student-t (or Log-t) distribution (Student-t is a Gaussian with longer / fatter tails). It also introduces the idea of a Tail distribution (the complement of a CDF; a survival chart is a tail distribution chart). The author also makes a brief reference to Nassim Taleb's Black Swan events, saying that the ability to model and predict them makes them more like Gray Swans.

Chapter 9 talks about the challenges of ensuring that an algorithm's predictions are fair to all groups of recipients, which is very relevant given the many paradoxes the book has already covered. In this chapter, the author describes Bayes rule without mentioning it by name, referring to the prior as the "base rate" and to the difference between the prior and posterior probabilities as the "base rate fallacy". He also covers other aspects of fairness, citing differences across groups that an algorithm often does not see. This last part seemed to me to be related to the Inspection Paradox described earlier in the book.

Chapter 10 describes Simpson's Paradox, where a correlation that holds within each of several sub-populations can be reversed (anti-correlated) when the sub-populations are combined. To some extent, this seems related to Berkson's Paradox. Among the examples cited, there is one about penguins, where within each species the beak size and body size are correlated, but across species they are anti-correlated. The explanation here is that there is a biological reason for the correlation within a species, but the anti-correlation across species is just a statistical artifact (correlation != causation in action, I guess?).

Chapter 11 is about how certain instances of Simpson's Paradox can be explained as a combination of other underlying factors. It is a truism that people get more conservative as they get older (i.e. if you are not a liberal when you are young, you have no heart, and if you are not a conservative when old, you have no brain). However, within each age group, it is observed that people actually get more liberal over time. This is explained as a combination of the age effect, the cohort effect, and the period effect. The age effect shows a positive correlation between adherence to traditional beliefs (conservativeness) and age. The cohort effect is the observation that, within each cohort, people get more liberal over time. Finally, the period effect deals with specific events during the time period under consideration, and also covers older (more conservative) people dying out and being replaced by younger (and more liberal) ones.

Chapter 12 continues the discussion from the previous chapter and brings in the idea of the Overton Window, which dictates what views are considered acceptable at any particular point in time, and which changes over time as well. What was thought to be liberal in decades past is now considered more conservative, so while an individual may get more liberal with time, the Overton Window has shifted towards liberalism even faster. This can explain why an individual may find themselves getting more conservative as they age, relative to the world around them.

Overall, I enjoyed this book. The most impressive thing about it was its use of generally available datasets to model physical and social environments, and its use of simulations to control for certain aspects of these data experiments. I also learned a few things about corner cases in Statistics which I think will be useful when reasoning about them in the future. I hope I have sparked your curiosity about this book as well.