Salmon Run: Using Knowledge Graphs to enhance Retrieval Augmented Generation

Saturday, October 05, 2024

Using Knowledge Graphs to enhance Retrieval Augmented Generation

Retrieval Augmented Generation (RAG) has become a popular approach to harness LLMs for question answering using your own corpus of data. Typically, the context to augment the query that is passed into the Large Language Model (LLM) to generate an answer comes from a database or search index containing your domain data. When it is a search index, the trend is to use Vector search (HNSW ANN based) over Lexical (BM25/TF-IDF based) search, often combining both Lexical and Vector searches into Hybrid search pipelines.

In the past, I have worked on Knowledge Graph (KG) backed entity search platforms, and observed that for certain types of queries, they produce results that are superior / more relevant compared to that produced from a standard lexical search platform. The GraphRAG framework from Microsoft Research describes a comprehensive technique to leverage KG for RAG. GraphRAG helps produce better quality answers in the following two situations.

the answer requires synthesizing insights from disparate pieces of information through their shared attributes
the answer requires understanding summarized semantic concepts over part of or the entire corpus

The full GraphRAG approach consists of building a KG out of the corpus, and then querying the resulting KG to augment the context in Retrieval Augmented Generation. In my case, I already had access to a medical KG, so I focused on building out the inference side. This post describes what I had to do to get that to work. It is based in large part on the ideas described in this Knowledge Graph RAG Query Engine page from the LlamaIndex documentation.

At a high level, the idea is to extract entities from the question, and then query a KG with these entities to find and extract relationship paths, single or multi-hop, between them. These relationship paths are used, in conjunction with context extracted from the search index, to augment the query for RAG. The relationship paths are the shortest paths between pairs of entities in the KG, and we only consider paths upto 2 hops in length (since longer paths are likely to be less interesting).

Our medical KG is stored in an Ontotext RDF store. I am sure we can compute shortest paths in SPARQL (the standard query language for RDF) but Cypher seems simpler for this use case, so I decided to dump out the nodes and relationships from the RDF store into flat files that look like the following, and then upload them to a Neo4j graph database using neo4j-admin database import full.

# nodes.csv
cid:ID,cfname,stygrp,:LABEL
C8918738,Acholeplasma parvum,organism,Ent
...

# relationships.csv
:START_ID,:END_ID,:TYPE,relname,rank
C2792057,C8429338,Rel,HAS_DRUG,7
...

The first line in both CSV files are the headers that inform Neo4j about the schema. Here our nodes are of type Ent and relationships are of type Rel, cid is an ID attribute that is used to connect nodes, and the other elements are (scalar) attributes of each node. Entities were extracted using our Dictionary-based Named Entity Recognizer (NER) based on the Aho-Corasick algorithm, and shortest paths are computed between each pair of entities (indicated by placeholders _LHS_ and _RHS_) extracted using the following Cypher query.

1 2	MATCH p = allShortestPaths((a:Ent {cid:'_LHS_'})-[*..]-(b:Ent {cid:'_RHS_'})) RETURN p, length(p)

Shortest paths returned by the Cypher query that are more than 2 hops long are discarded, since these don't indicate strong / useful relationships between the entity pairs. The resulting list of relationship paths are passed into the LLM along with the search result context to produce the answer.

We evaluated this implementation against the baseline RAG pipeline (our pipeline minus the relation paths) using the RAGAS metrics Answer Correctness and Answer Similarity. Answer Correctness measures the factual similarity between the ground truth answer and the generated answer, and Answer Similarity measures the semantic similarity between these two elements. Our evaluation set was a set of 50 queries where the ground truth was assigned by human domain experts. The LLM used to generate the answer was Claude-v2 from Anthropic while the one used for evaluation was Claude-v3 (Sonnet). The table below shows the averaged Answer Correctness and Similarity over all 50 queries, for the Baseline and my GraphRAG pipeline respectively.

Pipeline	Answer Correctness	Answer Similarity
Baseline	0.417	0.403
GraphRAG (inference)	0.737	0.758

As you can see, the performance gain from using the KG to augment the query for RAG seems to be quite impressive. Since we already have the KG and the NER available from previous projects, it is a very low effort addition to make to our pipeline. Of course, we would need to verify these results using Further human evaluations.

I recently came across the paper Knowledge Graph based Thought: A Knowledge Graph enhanced LLM Framework for pan-cancer Question Answering (Feng et al, 2024). In it, the authors identify four broad classes of triplet patterns that their questions (i.e, in their domain) can be decomposed to, and addressed using reasoning approaches backed by Knowledge Graphs -- One hop, Multi-hop, Intersection and Attribute problems. The idea is to use an LLM prompt to identify the entities and relationships in the question, then use an LLM to determine which of these templates should be used to address the question and produce an answer. Depending on the path chosen, an LLM is used to generate a Cypher query (an industry standard query language for graph databases originally introduced by Neo4j) to extract the missing entities and relationships in the template and answer the question. An interesting future direction for my GraphRAG implementation would be to incorporate some of the ideas from this paper.