From the course: Hands-On AI: RAG using LlamaIndex

Document summary index

- [Instructor] By this point in the course, I hope you're familiar with how a retrieval augmented generation pipeline works. We have source documents, which we split up into text chunks. These text chunks get embedded using some embedding model and stored in a vector database. At query time, we retrieve chunks by looking at the embedding similarity between what is in our vector database and the user's query. Then we synthesize the response by packaging up that retrieved context, putting it into a prompt, and sending it to our LLM. All of this, as we have seen, works quite well and we get some decent responses, but there is a problem here: the best way to represent text for retrieval might not be the best way to represent it for synthesis. For example, a raw text chunk might have some really important details that the LLM needs to synthesize a good response. However, it could also contain irrelevant information that will bias the embedding representation, or it might lack context, which would make it harder to retrieve.

We've already discussed one way to address this challenge, and that's small-to-big retrieval, where we embed a sentence and then link it to a window of surrounding text. This way, we get more precise retrieval of the relevant context and we still have enough context for the LLM to synthesize a good answer. What I'm going to talk about in this lesson is the document summary index, or document summary based retrieval. In this technique, we embed a document summary and then link it to the associated text chunks. This is advantageous because the summaries can provide more context than the individual chunks. Also, the LLM can reason over summaries before accessing the full document, which allows for better representations for both retrieval and synthesis. At a high level, this approach extracts a summary for each document in the hopes of improving retrieval performance over the traditional semantic search on the text chunk alone that we have grown accustomed to over the last several modules. It uses a concise summary generated by an LLM, which can use its reasoning capabilities to enhance retrieval before synthesizing over the referenced chunks. There are two key techniques we can leverage here. One is to embed summaries that link to document chunks. The other is to retrieve summaries and then replace them with the full document context.

This technique has some really great benefits. One is that summaries allow for initial filtering at the document level, which is useful when the query focuses on the overall theme of a document rather than specific details. Two is that the system can first retrieve summaries, which are shorter and faster to process; this way, only the relevant documents are analyzed in detail, which saves time and resources. Another advantage is that summaries provide a distilled view of the document's content, capturing the essence and key themes of that document. Finally, the document summary index can be combined with chunking and metadata-based retrieval strategies, so you can create a layered approach that balances accuracy and performance.

Let's see this all in action. We'll begin as we normally do by setting up our LLM, setting up our embedding model, and instantiating a vector store. Again, we are using the in-memory vector store.
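Here is a minimal sketch of that setup in LlamaIndex, assuming OpenAI models and the default in-memory SimpleVectorStore; the exact model names and vector store used in the course notebook may differ.

```python
from llama_index.core import Settings
from llama_index.core.vector_stores import SimpleVectorStore
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

# Global defaults: the LLM used for summarization/synthesis and the embedding model.
Settings.llm = OpenAI(model="gpt-4o", temperature=0.1)
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# A simple in-memory vector store (no external database required).
vector_store = SimpleVectorStore()
```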
We're also going to be using GPT-4o as our language model. And remember that we're only taking a subset of all our documents. Now, with all that in place, we can start to use the DocumentSummaryIndex. The DocumentSummaryIndex is an abstraction in LlamaIndex that allows us to build an index from a set of documents, generate a summary for each document using a response synthesizer, and then store the summaries and the corresponding document nodes in the index. The DocumentSummaryIndex supports two retrieval modes: an embedding-based retrieval mode and an LLM-based retrieval mode. The embedding-based retrieval mode embeds the summaries using an embedding model, and at query time it retrieves relevant summaries based on their similarity to the query embedding. LLM-based retrieval, on the other hand, uses an LLM to select the relevant summaries based on the query. Both modes focus on indexing documents, generating summaries, and then providing efficient retrieval based on either LLMs or embeddings.

There are two ways you can create a document summary index: one is using the high-level API and the other is using the lower-level APIs. I'm going to start by showing you the high-level API. We'll begin by instantiating a text splitter and a response synthesizer; we'll use the tree_summarize response synthesizer. Then we'll instantiate the DocumentSummaryIndex from our documents. This is a different pattern than what we've seen previously, because the DocumentSummaryIndex is different from a vector store index, as we'll see in a little bit. We'll pass in the documents and the other arguments: we need to tell the DocumentSummaryIndex what LLM we want to use, the embedding model, the transformations, how to synthesize the response, and what our vector store is going to be. Then we can set up our query engine and pipeline, which is similar to what we have seen before; there's a sketch of this high-level pattern below. I won't run inference with this, because I'm going to show you the lower-level APIs for embedding-based retrieval. So let's go ahead and talk about those.

Now that we've created a document summary index, here's how retrieval works. There are two ways you can retrieve. The default retrieval method for the document summary index uses embeddings, so we retrieve relevant summaries from the index using embedding similarity. You can also configure a DocumentSummaryIndexLLMRetriever, which retrieves the relevant summaries from the index using LLM calls. So the LLM-based retriever uses a language model to select relevant summaries based on the user query, while the embedding-based retriever uses embedding similarity to find relevant summaries.

First, I want to talk about the DocumentSummaryIndexLLMRetriever. The DocumentSummaryIndexLLMRetriever retrieves relevant summaries from the index using LLMs, and here we can customize a number of things. So let's pull up the code. First, we need to create a response synthesizer, and then we can instantiate the DocumentSummaryIndexLLMRetriever. What this does is, again, retrieve summaries from the index using an LLM call. There are a number of things we can customize: we can customize the prompt for selecting relevant summaries, and we can process summary nodes in batches, as shown in the sketch below.
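The following is a sketch of the two pieces just described: the high-level DocumentSummaryIndex.from_documents build and a DocumentSummaryIndexLLMRetriever with its main knobs. It relies on the global Settings from the earlier block and LlamaIndex's default in-memory storage, whereas the course notebook also passes an explicit LLM, embedding model, and vector store; the `documents` variable, chunk sizes, and batch/top-k values are placeholders, not the course's exact settings.

```python
from llama_index.core import DocumentSummaryIndex, get_response_synthesizer
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.indices.document_summary import DocumentSummaryIndexLLMRetriever

# Text splitter plus a tree_summarize response synthesizer for generating the summaries.
splitter = SentenceSplitter(chunk_size=1024, chunk_overlap=64)
summary_synthesizer = get_response_synthesizer(
    response_mode="tree_summarize", use_async=True
)

# High-level API: build the index from documents. A summary is generated per document
# and stored alongside that document's nodes. `documents` is whatever subset you loaded.
doc_summary_index = DocumentSummaryIndex.from_documents(
    documents,
    transformations=[splitter],
    response_synthesizer=summary_synthesizer,
    show_progress=True,
)

# High-level query engine over the index (uses the default embedding-based retrieval).
high_level_query_engine = doc_summary_index.as_query_engine(
    response_mode="tree_summarize"
)

# Lower-level API: an LLM-based retriever that asks the LLM to pick relevant summaries.
llm_retriever = DocumentSummaryIndexLLMRetriever(
    doc_summary_index,
    # choice_select_prompt=...,  # customize the prompt for selecting relevant summaries
    choice_batch_size=10,        # how many summary nodes to send to the LLM per call
    choice_top_k=3,              # how many summaries to keep based on the LLM's scoring
)
```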
We can retrieve the top-k summary nodes based on the LLM's relevance scoring, and we can choose which LLM is used to select the relevant summaries. So there are a few different arguments we can pass here; these are all configurations, all things that you can choose. Once we've instantiated this retriever, we can create a query engine from it. You've seen query engines before; this is just a lower-level API. Here we create our query engine, passing the DocumentSummaryIndexLLMRetriever as the retriever, along with the response synthesizer. We can go ahead and update the prompts, in this case using the hype answer prompt, and then run a query and get a response. You can also create a query pipeline with this. Above, I just showed you query_engine.query, but you can also pass the query engine into a pipeline, just like we've been doing throughout the course.

You can also create the DocumentSummaryIndexEmbeddingRetriever. Again, this retrieves relevant summaries from the index using embedding similarity: we retrieve the top-k summary nodes based on embedding similarity, using an embedding model, of course, to embed the query. We then query the vector store to find similar summaries and inject that context into the LLM's context window. Here, there are two things we need to pass: the index and similarity_top_k. So we can instantiate a DocumentSummaryIndexEmbeddingRetriever, create the doc_embed_query_engine from that retriever, and then, of course, create our query chain and pipeline and get a response from our LLM; there's a sketch of this embedding-based path at the end of this lesson.

There you have it. You've learned a new technique for advanced RAG, a new pre-retrieval and indexing technique. In the next lesson, we're going to talk about query transformation. So I'll see you there.
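As promised, here is a sketch of the embedding-based path, reusing `doc_summary_index` from the earlier sketch; the same RetrieverQueryEngine pattern works for the LLM-based retriever. The query string and top-k value are placeholders, and the course notebook layers its own prompts and query pipeline on top of this.

```python
from llama_index.core import get_response_synthesizer
from llama_index.core.indices.document_summary import (
    DocumentSummaryIndexEmbeddingRetriever,
)
from llama_index.core.query_engine import RetrieverQueryEngine

# Embedding-based retriever: picks the top-k document summaries by embedding similarity.
embed_retriever = DocumentSummaryIndexEmbeddingRetriever(
    doc_summary_index,
    similarity_top_k=3,
)

# Lower-level query engine: retriever + response synthesizer.
doc_embed_query_engine = RetrieverQueryEngine(
    retriever=embed_retriever,
    response_synthesizer=get_response_synthesizer(response_mode="tree_summarize"),
)

# Placeholder query; swap in a question about your own documents.
response = doc_embed_query_engine.query("What are the key findings in these documents?")
print(response)
```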
