At our weekly This Week in Machine Learning (TWIML) meetings, our leader and facilitator Darin Plutchok pointed out a LinkedIn blog post on Semantic Chunking that has recently been implemented in the LangChain framework. Unlike more traditional chunking approaches that use the number of tokens or separator tokens as a guide, this one groups sentences into semantic units by breaking them apart wherever the (semantic) similarity between consecutive sentences (or sentence-grams) falls below some predefined threshold. I had tried it earlier (pre-LangChain), and while the results were reasonable, it needed a lot of processing, so I went back to what I was using before.
I was also recently exploring LlamaIndex as part of an effort to familiarize myself with the GenAI ecosystem. LlamaIndex supports hierarchical indexes natively, meaning it provides the data structures that make building them easier and more natural. Unlike the typical RAG index, which is just a sequence of chunks (and their vectors), a hierarchical index clusters chunks into parent chunks, parent chunks into grandparent chunks, and so on. A parent chunk generally inherits or merges most of the metadata from its children, and its text is a summary of its children's text contents. To illustrate my point about LlamaIndex data structures having natural support for this kind of setup, here are the definitions of the LlamaIndex TextNode (the LlamaIndex Document object is just a child of TextNode with an additional doc_id: str field) and the LangChain Document. Of particular interest is the relationships field, which allows pointers to other chunks using named relationships such as PARENT, CHILD, NEXT, PREVIOUS, SOURCE, etc. Arguably, the LlamaIndex TextNode could be represented more generally and succinctly by the LangChain Document, but the hooks do help to support hierarchical indexing more naturally.
# this is a LlamaIndex TextNode
class TextNode:
    id_: str = None
    embedding: Optional[List[float]] = None
    extra_info: Dict[str, Any]
    excluded_embed_metadata_keys: List[str] = None
    excluded_llm_metadata_keys: List[str] = None
    relationships: Dict[NodeRelationship, Union[RelatedNodeInfo, List[RelatedNodeInfo]]] = None
    text: str
    start_char_idx: Optional[int] = None
    end_char_idx: Optional[int] = None
    text_template: str = "{metadata_str}\n\n{content}"
    metadata_template: str = "{key}: {value}"
    metadata_separator: str = "\n"

# and this is a LangChain Document
class Document:
    page_content: str
    metadata: Dict[str, Any]
In any case, having discovered the hammer that is LlamaIndex, I began to see a lot of potential hierarchical-index nails. One such nail that occurred to me was to use Semantic Chunking to cluster consecutive chunks rather than sentences (or sentence-grams), and then create parent nodes from these chunk clusters. Instead of computing cosine similarity between consecutive sentence vectors to build up chunks, we compute cosine similarity across consecutive chunk vectors and split them up into clusters based on some similarity threshold, i.e., if the similarity drops below the threshold, we terminate the current cluster and start a new one, as sketched below.
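Here is a minimal sketch of that clustering step, assuming we already have one embedding vector per chunk (the function name and threshold handling are mine, not from either library):

import numpy as np

def cluster_consecutive_chunks(chunk_vectors, threshold):
    """Group consecutive chunks into clusters, breaking wherever the cosine
    similarity between neighboring chunk vectors drops below the threshold."""
    # normalize so the dot product of neighbors equals their cosine similarity
    vecs = np.asarray(chunk_vectors, dtype=np.float32)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = np.sum(vecs[:-1] * vecs[1:], axis=1)
    clusters, current = [], [0]
    for i, sim in enumerate(sims, start=1):
        if sim < threshold:
            clusters.append(current)   # similarity dipped below threshold: start a new cluster
            current = []
        current.append(i)
    clusters.append(current)
    return clusters                    # list of lists of chunk indices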
Both LangChain and LlamaIndex have implementations of Semantic Chunking (for sentence clustering into chunks, not chunk clustering into parent chunks). LangChain's Semantic Chunking allows you to set the threshold using percentiles, standard deviation and inter-quartile range, while the LlamaIndex implementation supports only the percentile threshold. Intuitively, here is how you could get an idea of the percentile threshold to use -- thresholds for the other methods can be computed similarly. Assume your content has N chunks and K clusters (based on your understanding of the data or from other estimates); then, assuming a uniform distribution, there would be N/K chunks in each cluster, so roughly a fraction K/N of the consecutive-pair similarities would be cluster breakpoints. If K/N is approximately 20% (i.e., about 5 chunks per cluster), then your percentile threshold would be approximately 80.
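As a concrete sketch of that back-of-the-envelope calculation (the variable names and numbers are just illustrative):

num_chunks = 250      # N: number of chunks in the corpus
num_clusters = 50     # K: expected number of clusters
break_fraction = num_clusters / num_chunks           # ~0.2, fraction of breakpoints
percentile_threshold = 100 * (1 - break_fraction)    # ~80

# Both libraries apply the percentile to the dissimilarity (1 - similarity)
# between neighbors: a pair is treated as a breakpoint when its dissimilarity
# exceeds that percentile of all consecutive-pair dissimilarities.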
LlamaIndex provides an IngestionPipeline which takes a list of TransformComponent objects. My pipeline looks something like the one below. The last component is a custom subclass of TransformComponent; all you need to do is override its __call__ method, which takes a List[TextNode] and returns a List[TextNode].
transformations = [
    text_splitter,        # SentenceSplitter
    embedding_generator,  # HuggingFaceEmbedding
    summary_node_builder  # SemanticChunkingSummaryNodeBuilder (custom)
]
ingestion_pipeline = IngestionPipeline(transformations=transformations)
docs = SimpleDirectoryReader("/path/to/input/docs").load_data()
nodes = ingestion_pipeline.run(documents=docs)
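For reference, a bare-bones skeleton of such a custom component might look like the following (the class name comes from my pipeline above; the body is just a placeholder for the logic described in the next paragraph):

from typing import Any, List
from llama_index.core.schema import TextNode, TransformComponent

class SemanticChunkingSummaryNodeBuilder(TransformComponent):
    """Clusters consecutive chunks and adds one summary (parent) node per cluster."""

    def __call__(self, nodes: List[TextNode], **kwargs: Any) -> List[TextNode]:
        # 1. compute similarities between consecutive node embeddings
        # 2. split the nodes into clusters using the percentile threshold derived from K
        # 3. build a summary TextNode per cluster and wire up PARENT/CHILD relationships
        summary_nodes = []  # built in step 3
        return nodes + summary_nodes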
My custom component takes the desired cluster size K during construction. It uses the vectors computed by the (LlamaIndex provided) HuggingFaceEmbedding component to compute similarities between consecutive vectors, and uses K to compute a threshold. It then uses the threshold to cluster the chunks, resulting in a list of lists of chunks List[List[TextNode]]. For each cluster, we create a summary TextNode, set its CHILD relationships to the cluster nodes, and set the PARENT relationship of each child in the cluster to this new summary node. The texts of the child nodes are first condensed using extractive summarization, then these condensed summaries are further summarized into one final summary using abstractive summarization. I used bert-extractive-summarizer with bert-base-uncased for the first step, and a HuggingFace summarization pipeline with facebook/bart-large-cnn for the second. I suppose I could have used an LLM for the second step, but it would have taken more time to build the index, and I have been experimenting with ideas described in the DeepLearning.AI course Open Source Models with HuggingFace.
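A sketch of the per-cluster summary node construction, assuming the summary text has already been produced by the two summarization steps (the relationship wiring uses the standard LlamaIndex NodeRelationship / RelatedNodeInfo types; build_summary_node is my own helper name):

from typing import List
from llama_index.core.schema import NodeRelationship, RelatedNodeInfo, TextNode

def build_summary_node(cluster: List[TextNode], summary_text: str) -> TextNode:
    """Create a parent node summarizing a cluster of child chunks."""
    summary_node = TextNode(text=summary_text)
    # the parent points at all of its children ...
    summary_node.relationships[NodeRelationship.CHILD] = [
        RelatedNodeInfo(node_id=child.node_id) for child in cluster
    ]
    # ... and every child points back at the parent
    for child in cluster:
        child.relationships[NodeRelationship.PARENT] = RelatedNodeInfo(
            node_id=summary_node.node_id
        )
    return summary_node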
Finally, I recalculate the embeddings for the summary nodes -- I ran the summary node texts through the HuggingFaceEmbedding component, but I guess I could have done some aggregation (mean-pool / max-pool) on the child vectors as well.
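The aggregation alternative would look something like this (a sketch of the mean-pooling option, not what I actually did):

from typing import List
import numpy as np
from llama_index.core.schema import TextNode

def mean_pool_embedding(children: List[TextNode]) -> List[float]:
    """Average the child chunk embeddings to approximate the parent embedding."""
    vectors = np.array([child.embedding for child in children])
    return vectors.mean(axis=0).tolist()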
Darin also pointed out another instance of a hierarchical index, proposed in the paper RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval and described in detail by the authors in this LlamaIndex webinar. This is a bit more radical than my idea of using semantic chunking to cluster consecutive chunks, in that it allows clustering of chunks across the entire corpus. Another significant difference is that it allows for soft clustering, meaning a chunk can be a member of more than one cluster. They first reduce the dimensionality of the vector space using UMAP (Uniform Manifold Approximation and Projection) and then apply a Gaussian Mixture Model (GMM) to do the soft clustering. To find the optimum number of clusters K for the GMM, one can use a combination of the AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion).
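A sketch of that model-selection loop, assuming the umap-learn and scikit-learn packages (the parameter values are placeholders, not the settings from the RAPTOR paper):

import numpy as np
import umap
from sklearn.mixture import GaussianMixture

def select_num_clusters(chunk_vectors, max_k=50, n_components=10, random_state=42):
    """Reduce dimensionality with UMAP, then pick K by AIC/BIC over GMM fits."""
    reduced = umap.UMAP(n_components=n_components,
                        random_state=random_state).fit_transform(chunk_vectors)
    aics, bics = [], []
    for k in range(2, max_k + 1):
        gmm = GaussianMixture(n_components=k, random_state=random_state).fit(reduced)
        aics.append(gmm.aic(reduced))
        bics.append(gmm.bic(reduced))
    # lower is better for both criteria; here we go with the BIC minimum
    best_k = int(np.argmin(bics)) + 2
    return best_k, aics, bics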
In my case, when training the GMM, the AIC kept decreasing as the number of clusters increased, and the BIC had its minimum value at K=10, which corresponds roughly to the 12 chapters in my Snowflake book (my test corpus). But there was a lot of overlap, which would have forced me to implement some sort of logic to take advantage of the soft clustering, which I didn't want to do since I wanted to reuse code from my earlier Semantic Chunking node builder component. Ultimately, I settled on 90 clusters by using my original intuition to compute K, and the resulting clusters seem reasonably well separated, as seen below.
Using the results of the clustering, I also built this as another custom LlamaIndex TransformComponent for hierarchical indexing. This implementation differs from the previous one only in the way it assigns nodes to clusters; all other details with respect to text summarization and metadata merging are identical.
For both of these indexes, we have a choice: maintain the index as hierarchical and decide which layer(s) to query based on the question, or add the summary nodes into the same level as the other chunks and let vector similarity surface them when queries deal with cross-cutting issues that may be found together in these nodes. The RAPTOR paper reports that they don't see a significant gain using the first approach over the second. Because my query functionality is LangChain based, my approach has been to generate the nodes, reformat them into LangChain Document objects, and use LCEL to query the index and generate answers, so I haven't looked into querying from a hierarchical index at all.
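That reformatting step is essentially a field-by-field mapping, something like this (a sketch; metadata merging details are elided):

from typing import List
from langchain_core.documents import Document
from llama_index.core.schema import TextNode

def to_langchain_documents(nodes: List[TextNode]) -> List[Document]:
    """Convert LlamaIndex TextNodes into LangChain Documents for the query pipeline."""
    return [
        Document(page_content=node.get_content(), metadata=dict(node.metadata))
        for node in nodes
    ]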
Looking back on this work, I am reminded of similar choices when designing traditional search pipelines. Often there is a choice between building functionality into the index to support a cheaper query implementation, or building the logic into the query pipeline, which may be more expensive but also more flexible. I think LlamaIndex started with the first approach (as evidenced by their blog posts Chunking Strategies for Large Language Models Part I and Evaluating Ideal Chunk Sizes for RAG Systems using LlamaIndex), while LangChain started with the second, even though nowadays there is a lot of convergence between the two frameworks.