Saturday, February 24, 2024

Thoughts on using LangChain LCEL with Claude

I got into Natural Language Processing (NLP) and Machine Learning (ML) through Search. And this led me into Generative AI (GenAI), which led me back to Search via Retrieval Augmented Generation (RAG). RAG started out relatively simple -- take a query, generate search results, use search results as context for a Large Language Model (LLM) to generate an abstractive summary of the results. Back when I started on my first "official" GenAI project middle of last year, there were not too many frameworks to support building GenAI components (at least not the prompt based ones), except maybe LangChain, which was just starting out. But prompting as a concept is not too difficult to understand and implement, so thats what we did at the time.

I did have plans to use LangChain in my project once it became more stable, so I started out building my components to be "langchain compliant". But that turned out to be a bad idea as LangChain continued its exponential (and from the outside at least, somewhat haphazard) growth and showed no signs of stabilizing. At one point, LangChain users were advised to make pip install -U langchain part of their daily morning routine! So anyway, we ended up building up our GenAI application by hooking up third party components with our own (non-framework) code, using Anthropic's Claude-v2 as our LLM, ElasticSearch as our lexical / vector document store and PostgreSQL as our conversational buffer.

While I continue to believe that the decision to go with our own code made more sense than trying to jump on the LangChain (or Semantic Kernel, or Haystack, or some other) train, I do regret it in some ways. A collateral benefit for people who adopted and stuck with LangChain were the ready-to-use implementations of cutting-edge RAG and GenAI techniques that the community implemented at almost the same pace as they were being proposed in academic papers. For the subset of these people that were even slightly curious about how these implementations worked, this offered a ringside view into the latest advances in the field and a chance to stay current with it, with minimal effort.

So anyway, in an attempt to replicate this benefit for myself (going forward at least), I decided to learn LangChain by doing a small side project. Earlier I needed to learn to use Snowflake for something else and had their free O'Reilly book on disk, so I converted it to text, chunked it, and put it into a Chroma vector store. I then tried to implement examples from the DeepLearning.AI courses LangChain: Chat with your Data and LangChain for LLM Application Development. The big difference is that the course examples use OpenAI's GPT-3 as their LLM whereas I use Claude-2 on AWS Bedrock in mine. In this post, I share the issues I faced and my solutions, hopefully this can help guide others in similar situations.

Couple of observations here. First, the granularity of GenAI components is necessarily larger than traditional software components, and this means application details that the developer of the component was working on can leak into the component itself (mostly through the prompt). To a user of the component, this can manifest as subtle bugs. Fortunately, LangChain developers seem to have also noticed this and have come up with the LangChain Expression Language (LCEL), a small set of reusable components that can be composed to create chains from the ground up. They have also marked a large number of Chains as Legacy Chains (to be converted to LCEL chains in the future).

Second, most of the components (or chains, since that is LangChain's central abstraction) are developed against OpenAI GPT-3 (or its chat version GPT-3.5 Turbo) whose strengths and weaknesses may be different from those of your LLM. For example, OpenAI is very good at generating JSON output, whereas Claude is better at generating XML. I have also seen that Claude can terminate XML / JSON output mid-output unless forced to complete using stop_sequences. Yhis doesn't seem to be a problem GPT-3 users have observed -- when I mentioned this problem and the fix, I drew a blank on both counts.

To address the first issue, my general approach in trying to re-implement these examples has been to use LCEL to build my chains from scratch. I attempt to leverage the expertise available in LangChain by looking in the code or running the existing LangChain chain with langchain.debug set to True. Doing this helps me see the prompt being used and the flow, which I can use to adapt the prompt and flow for my LCEL chain. To address the second issue, I play to Claude's strengths by specifying XML output format in my prompts and parsing them as Pydantic objects for data transfer across chains.

The example application I will use to illustrate these techniques here is derived from the Evaluation lesson from the LangChain for LLM Application Development course, and is illustrated in the diagram below. The application takes a chunk of text as input, and uses the Question Generation chain to generate multiple question-answer pairs from it. The questions and the original content are fed into the Question Answering chain, which uses the question to generate additional context from a vector retriever, and uses all three to generate an answer. The answer generated from the Question Generation chain and the answer generated from the Question Answering chain are fed into a Question Generation Evaluation chain, where the LLM grades one against the other, and generates an aggregate score for the questions generated from the chunk.

Each chain in this pipeline is actually quite simple, they take one or more inputs and generates a block of XML. All the chains are structured as follows:

1
2
3
from langchain_core.output_parsers import StrOutputParser

chain = prompt | model | StrOutputParser()

And all our prompts follow the same general format. Here is the prompt for the Evaluation chain (the third one) which I adapted from the QAEvalChain used in the lesson notebook. Developing from scratch using LCEL gives me the chance to use Claude's Human / Assistant format (see LangChain Guidelines for Anthropic) rather than depend on the generic prompt that happens to work well for GPT-3.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
Human: You are a teacher grading a quiz.

You are given a question, the context the question is about, and the student's 
answer.

QUESTION: {question}
CONTEXT: {context}
STUDENT ANSWER: {predicted_answer}
TRUE ANSWER: {generated_answer}

You are to score the student's answer as either CORRECT or INCORRECT, based on the 
context.

Write out in a step by step manner your reasoning to be sure that your conclusion 
is correct. Avoid simply stating the correct answer at the outset.

Please provide your response in the following format:

<result>
    <qa_eval>
        <question>the question here</question>
        <student_answer>the student's answer here</student_answer>
        <true_answer>the true answer here</true_answer>
        <explanation>step by step reasoning here</explanation>
        <grade>CORRECT or INCORRECT here</grade>
    </qa_eval>
</result>

Grade the student answers based ONLY on their factual accuracy. Ignore differences in 
punctuation and phrasing between the student answer and true answer. It is OK if the 
student answer contains more information than the true answer, as long as it does not 
contain any conflicting statements.

Assistant:

In addition, I specify the formatting instructions explicitly in the prompt instead of using the canned ones from XMLOutputParser or PydanticOutputParser via get_formatting_instructions(), which are comparatively quite generic and sub-optimal. By convention, the outermost tag in my format is always <result>...</result>. The qa_eval tag inside result has a corresponding Pydantic class analog declared in the code as follows:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
from pydantic import BaseModel, Field

class QAEval(BaseModel):
    question: str = Field(alias="question", description="question text")
    student_answer: str = Field(alias="student_answer",
                                description="answer predicted by QA chain")
    true_answer: str = Field(alias="true_answer",
                             description="answer generated by QG chain")
    explanation: str = Field(alias="explanation",
                             description="chain of thought for grading")
    grade: str = Field(alias="grade",
                       description="LLM grade CORRECT or INCORRECT")

After the StrOutputParser extracts the LLM output into a string, it is first passed through a regular expression to remove any content outside the <result>...</result>, then convert it into the QAEval Pydantic object using the following code. This allows us to keep object manipulation between chains independent of the output format, as well as negate any need for format specific parsing.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
import re
import xmltodict

from pydantic import Field
from pydantic.generics import GenericModel
from typing import Generic, List, Tuple, TypeVar

T = TypeVar("T")

class Result(GenericModel, Generic[T]):
    value: T = Field(alias="result")

def parse_response(response):
    response = response.strip()
    start_tag, end_tag = "<result>", "</result>"
    is_valid = response.startswith(start_tag) and response.endswith(end_tag)
    if not is_valid:
        pattern = f"(?:{start_tag})(.*)(?:{end_tag})"
        p = re.compile(pattern, re.DOTALL)
        m = p.search(response)
        if m is not None:
            response = start_tag + m.group(1) + end_tag
    resp_dict = xmltodict.parse(response)
    result = Result(**resp_dict)
    return result

# example call
response = chain.invoke(
    "question": "the question",
    "context": "the context",
    "predicted_answer": "the predicted answer",
    "generated_answer": "the generated answer"
})
result = parse_response(response)
qa_eval = result.value["qa_eval"]

One downside to this approach is that it uses the current version of the Pydantic toolkit (v2) whereas LangChain still uses Pydantic V1 internally, as descibed in LangChain's Pydantic compatibility page. This is why this conversion needs to be outside LangChain and in the application code. Ideally, I would like this to be part of a subclass of PydanticOutputParser where the formatting_instructions could be generated from the class definition as a nice side effect, but that would mean more work than I am prepared to do at this point :-). Meanwhile, this seems like a decent compromise.

Thats all I had for today. Thank you for staying with me so far, and hope you found this useful!

Saturday, February 03, 2024

Book Report: Allen B Downey's Probably Overthinking It

I have read Allen Downey's books on statistics in the past, when trying to turn myself from a Software Engineer into what Josh Wills says a Data Scientist is -- someone who is better at statistics than a Software Engineer and better at software than a statistician (with somewhat limited success in the first area, I will hasten to add). Last year, I had the good fortune to present at PyData Global 2023 (the video is out finally!) so had a free ticket to attend, and one of the talks I really enjoyed there was Allen Downey's talk Extremes, Outliers and GOATs: on life in a lognormal world. In it, he mentions that this is essentially the material from Chapter 4 of his book Probably Overthinking It. I liked his talk enough to buy the book, and I wanted to share my understanding of this book with you all, hence this post.

The book is not as dense as a "real" book on stats like say The Elements of Statistical Learning but is definitely not light reading. I tried reading it on a flight from San Francisco to Philadelphia (and back) and found it pretty heavy going. While the writing is lucid and illustrated with tons of well-explained and easy to understand examples, most of these were new concepts to me, and I wished I took notes after each chapter so I could relate all these concepts together enough to reason about them rather than just learn about them. So I did another pass through the book, this time with pen and paper, and I now feel more confident about talking to other people about it. Hopefully, this is also helpful for folks who have done (or planning to do) the first pass on the book but not the second.

Most people who are new to statistics (me included) lay great store in the Gaussian (Normal) distribution to explain or model various datasets. Chapter 1 challenges this idea and demonstrate that while individual traits may follow a Gaussian distribution, a combination of such traits can be a very restrictive filter. In other words, almost all of us are weird (i.e. not normal). For me, it also introduces the Cumulative Distribution Function (CDF) as a modeling tool.

The second chapter introduces the Inspection Paradox, which explains why it always seems like our wait time for the next train is longer then the average wait time between trains, among other things. The explanation lies in the sampling strategy -- if we sample our data from the population, we may get a skew from oversampling from over-represented populations. It also describes a practical use case of this paradox to detect COVID superspreaders.

The third chapter describes what the author calls Preston's paradox, based on a 1976 paper by Samuel Preston. The paradox is that even if every woman has fewer children than her mother, the average family size can increase over time. The paradox is explained by an idea similar to the Inspection Paradox, i.e. because there are more women in existence from large families than small ones, a larger proportion of women would end up having large families than small ones, and overall that contributes to an increase in family size. The opposite can hold true as well, as demonstrateed by the loosening of reproductive restrictions in China in the aftermath of China's one-child policy not having the desired effect in boosting family sizes.

Chapter 4 is the one the author talked about in the PyData Global talk. In it, he demonstrates that certain attributes are better explained by a log-normal distribution, i.e. taking the log of the values in the distribution, rather than our familiar Gaussian distribution. This is especially true for outlier type distributions, such as performance numbers of GOAT (Greatest Of All Time) athletes compared to the general population. The explanation for this is that GOAT performance is almost always a multiplicative combination of innate human prowess (nature) and these skills being effectively harnessed and trained (nurture) plus a whole lot of other factors that all have to line up just so for the event to happen, and whose contributions to the target are therefore multiplicative rather than additive, hence the effectiveness of the log-normal distribution over the normal one.

Chapter 5 explores different survival characterstics of different populations and classifies them as either NBUE (New Better than Used in Expectation) and NWUE (New Worse than Used in Expectation). The former would apply for predicting the remaining life of lightbulbs with use, and the latter would apply for predicting cancer survivability and child mortality over time. Using child mortality statistics, the author shows that as healthcare improves and becomes more predictable across age categories, the NWUE distribution changes to resemble more closely a NBUE distribution.

Chapter 6 explores Berkson's Paradox, where a sub-sample selected from a population using some selection criteria can create correlations that did not exist in the population, or correlations that are opposite to that observed in the population. Berkson originally pointed out the paradox as a warning about using hospital data (sub-sample) to make conclusions about the general population. The selection criteria restrict the general population in specific ways, leading to a change in composition of the traits in the sub-sample, thus leading to the paradox.

Chapter 7 warns about the dangers of interpreting correlation as causation, something most of us have probably read or heard about many many times in the popular Data Science literature. The main case study here are moms who smoke (or don't smoke) and their low birth weight (LBW) babies. A study concluded that while smoker's were more likely to give birth to LBW babies, and LBW babies had a higher mortality rate, the mortality rate of LBW babies whose mothers smoked was 48% lower than those whose mothers didn't smoke. Further LBW babies of non-smokers also had higher rate of birth defects. Interpreting this correlation as causation, i.e. not heeding the warning, it seems like maternal smoking is beneficial for LBW babies, protecting them from mortality and birth defects. The explanation is that maternal smoking is not the only cause of LBW babies, and birth defects may be congenital and not linked to smoking. These two factors mean that there are biological explanations for LBW other than maternal smoking. This and a few other examples segue naturally into a brief high-level introduction to Causal Reasoning, which I also found useful.

Following on from GOAT events being better represented by log-normal rather than normal distributions, Chapter 8 describes applying this to model extremely rare events (such as earthquakes and stock market crashes), and concludes that while the log-normal distribution is more "long-tailed" than a Gaussian, rare events have an even longer tail that is better modeled by log-Student-t (or Log-t) distibution (Student-t is a Gaussian with longer / fatter tails). It also introduces the idea of a Tail distribution (the inverse of a CDF, a survival chart is a tail distribution chart). The author also makes a brief reference to Nassim Taleb's Black Swan events, saying that the ability to model and predict them make them more of Gray Swans.

Chapter 9 talks about the challenges in ensuring algorithmic fairness to all recipients of its predictions, which is very relevant given the many paradoxes the book has already covered. In this chapter, the author describes Bayes rule without mentioning it by name, calling it the "base rate" and the difference between the prior and posterior probabilities the "base rate fallacy". He also covers other aspects of fairness, citing differences across groups that an algorithm often does not see. This last part seemed to me to be related to the Inspection Paradox described earlier in the book.

Chapter 10 describes Simpson's Paradox, where sub-populations can exhibit similar correlations across the sub-populations but where the same traits are anti-correlated in the conbined population. To some extent, this seems related to Berkson's law. Among the examples cited, there is one about penguins, where within each species, the beak size and body size are correlated, but across species, they are anti-correlated. The explanation here is that there is a biological reason for the correlation within the species, but the anti-correlation is just a statistical artifact (correlation != causation in action I guess?).

Chapter 11 is about how certain instances of Simpson's Paradox can be explained as a combination of other underlying factors. It is a trusim that people get more conservative as they get older (i.e. if you are not a liberal when you are young, you have no heart, and if you are not a conservative when old, you have no brain). However, within each age group, it is observed that people actually get more liberal over time. This is explained as a combination of the age effect, the period effect, and the cohort effect. The age effect shows a positive correlation between adherence to traditional beliefs (conservativeness) and age. However, within each age group, it is observed that people get more liberal over time, i.e. the cohort effect. Finally the period effect deals with specific events during the time period under consideration, and this covers older people dying out and being replaced with younger (and more liberal) people.

Chaoter 12 continues the discussion from the previous chapter and brings in the idea of the Overton Window, which dictates what views are considered acceptable at any particular point in time, and which changes over time as well. So what was thought to be liberal in decades past is now considered more conservative. So while an individual may get more liberal with time, the Overtom Window has shifted faster towards liberalism. This can explain why an individual may find themselves getting more conservative as they age, relative to the world around them.

Overall, I enjoyed this book. I think the most impressive thing about this book was its use of generally available datasets to model physical and social environments, and using simulations to control for certain aspects of these data experiments. Also, I think I learned a few things about corner cases in Statistics which I think may be useful when reasoning about them in future. I hope I have sparked your curiosity about this book as well.