From the course: Hands-On AI: RAG using LlamaIndex
Small to big retrieval
- [Instructor] Ever felt like you're finding a needle in a haystack? That's what basic RAG pipelines can feel like when they embed and retrieve huge chunks of text. You end up using the same big chunks for synthesis, and that isn't always ideal, because there's often a lot of filler text that muddles the important parts, which makes retrieval less effective. Now imagine that instead of searching through the whole haystack, you could find the few needles first and then pull out only the relevant bits of straw around them. That's what small to big retrieval does. We start by fetching smaller, more focused chunks of text that directly answer your query. Then we use those chunks to guide us to larger parent chunks, which provide the broader context used for synthesis. That way, you get the best of both worlds: precise retrieval and comprehensive context.

Let's go ahead and get right into it with our code. This is setup that you're hopefully familiar with by now, stuff that we've done a hundred times, so let's jump right into small to big retrieval. A bit of a warning here: all the notebooks that I've got for you are very heavy on text. There's a lot of great information in them. I'm going to skim over it and leave it to you to pause, read, and go more in depth.

So let's talk at a high level about small to big retrieval. This is also known as recursive retrieval, and it's an awesome feature of LlamaIndex. The whole purpose is to efficiently retrieve relevant context for a query. The recursive retriever starts by retrieving smaller, query-specific chunks, and then we follow references to larger contextual chunks. We also do a bit of node postprocessing, where we transform what is sent to the language model so that we can enhance the quality and relevance of our response. There's also a response synthesizer that we need to make use of, which combines the retrieved chunks with the user query to generate a more coherent response.

The first thing I want to talk about is the SentenceWindowNodeParser. The SentenceWindowNodeParser focuses on individual sentences while still capturing the surrounding context, which is useful for scenarios that require broad context. The way it works is, first, we do a bit of sentence splitting: we divide our corpus into sentences using a tokenizer; in this case, the tokenizer used is the PunktSentenceTokenizer from nltk. For each of those sentences, we create a window of surrounding sentences, and we store this window in the node's metadata. That means we also need to do some metadata management: we store the original sentence text in the metadata, and we exclude the window and original text from being embedded. Some key arguments that you need to know are window_size, window_metadata_key, original_text_metadata_key, and sentence_splitter. The usage pattern is similar to everything we've seen so far, and we're going to see it in action in just a second. If you're wondering when you should use this: use it for tasks that require sentence-level understanding but can benefit from broader context, or when you want fine-grained control over the embedding space, because it can focus on a specific sentence's meaning within its local context. We're also going to combine this with the MetadataReplacementPostProcessor, which is honestly a mouthful to say, and I'll talk about that in a little bit.
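For reference, here's a minimal sketch of that usage pattern. It isn't shown verbatim in the video; the sample document, window_size, and node index are illustrative, and the import paths assume a recent llama_index.core layout.

```python
from llama_index.core import Document
from llama_index.core.node_parser import SentenceWindowNodeParser

# Hypothetical document standing in for the course corpus
doc = Document(
    text=(
        "Strength is built with basics. Squats train the legs. "
        "Presses train the shoulders. Rows train the back. "
        "Consistency beats complexity. Progress comes from steady practice."
    )
)

# Parse into one node per sentence, each carrying a window of
# surrounding sentences in its metadata
parser = SentenceWindowNodeParser.from_defaults(
    window_size=2,                               # sentences on each side
    window_metadata_key="window",                # where the window is stored
    original_text_metadata_key="original_text",  # where the sentence is stored
)

nodes = parser.get_nodes_from_documents([doc])

# The window and original_text keys are excluded from embedding and LLM text
print(nodes[2].metadata["original_text"])
print(nodes[2].metadata["window"])
print(nodes[2].excluded_embed_metadata_keys)
```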
Essentially, what this thing does is replace the original sentence with its window, for broader context and consideration. So let's see this in action. Here is the original_text node. What we're going to do is take this text node and send it through the SentenceWindowNodeParser using a window_size of two. If you look at what we end up with, we now have metadata. What's in the metadata? Well, we have the original text, and we also have this window. The window has two sentences before and two sentences after that original sentence. I want to draw your attention to two keys here: excluded_embed_metadata_keys and excluded_llm_metadata_keys. What these say is, when we embed this node, ignore the window, ignore the original_text, and just embed the sentence itself. I also want to note that the node has this relationships information; all it gives us is a path from the original_text to the window around it. Here's another example, in this case with a window size of three, which you can examine on your own.

All right, so now let's talk about the MetadataReplacementPostProcessor and the SentenceWindowNodeParser. And look, I am just as frustrated with the names of some of these things as you are. These are long names and they're a mouthful, but at least they're descriptive. So just to recap, what is it that the SentenceWindowNodeParser does? We parse the document into nodes, each node has a single sentence, and then we create a contextual window around that sentence, so that each node has a window of sentences surrounding its core sentence, which gives us added context. But there's also the MetadataReplacementPostProcessor. What this does is replace the sentence with its window for broader context. That's pretty much all it does. This way, the language model has the full context to reason over. So let's quickly talk about how this actually works. A user query comes in, and we find the sentence that's most similar to that query. But at the same time, we send the language model a window of sentences before it and a window of sentences after it. That all gets sent to the LLM, and the LLM then uses it to generate a response.

We're going to go ahead and instantiate a sentence_window_splitter, just a wrapper function here. I'm going to create nodes, each of which has a window_size of five, and here's an example of what one of those looks like. Here, I want to point out metadata mode. I showed you earlier that the nodes have these excluded_embed_metadata_keys and excluded_llm_metadata_keys, so you can manipulate what metadata gets sent to the LLM; we can remove or add keys here. But this just illustrates what all the metadata associated with this node is. Here's the metadata_mode: when it's LLM, this is everything that gets sent to the LLM, so this is what the LLM is going to see at generation time. And of course we can change this. You'll note that, for example, we're looking at node five, so the LLM, in this case, is just going to get that one bit of context. But we can change and manipulate this, and that is what the postprocessor does.
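Here's a small sketch of that replacement step, using a hand-built node shaped like the parser's output. The sentence text, window contents, and score are made up for illustration; the class and its target_metadata_key argument follow the LlamaIndex postprocessor API.

```python
from llama_index.core.postprocessor import MetadataReplacementPostProcessor
from llama_index.core.schema import MetadataMode, NodeWithScore, TextNode

# A hand-built node shaped like what SentenceWindowNodeParser produces:
# the text is one sentence, the surrounding window lives in metadata
node = TextNode(
    text="Squats train the legs.",
    metadata={
        "window": (
            "Strength is built with basics. Squats train the legs. "
            "Presses train the shoulders."
        ),
        "original_text": "Squats train the legs.",
    },
    excluded_embed_metadata_keys=["window", "original_text"],
    excluded_llm_metadata_keys=["window", "original_text"],
)

# What the LLM would see before postprocessing (window keys excluded)
print(node.get_content(metadata_mode=MetadataMode.LLM))

# The postprocessor swaps the single sentence for its window, so the
# LLM receives the surrounding sentences at generation time
postprocessor = MetadataReplacementPostProcessor(target_metadata_key="window")
replaced = postprocessor.postprocess_nodes([NodeWithScore(node=node, score=1.0)])
print(replaced[0].node.get_content())
```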
All right, so next we are going to ingest this into Qdrant and build an index. We instantiate the vector store, like so. You'll notice that in the transformations I am not including the actual SentenceWindowNodeParser; that's because I've already handled that above, so I can just pass the documents themselves. The only transformation I need is the embedding model. We'll go ahead and ingest this into Qdrant, and you'll see that it takes quite a bit of time, roughly 13 and a half minutes. If you go to Qdrant (again, remember: go to Overview, click on Clusters, click the down caret, open the dashboard), you can see that we have our collection, small-to-big-sentence-window, right there for us.

With that in place, we now build a query engine. I'm going to switch up the prompt a little bit, so let me show you what the prompt is. If we go to the helper functions and look at the prompts, specifically the HYPE_ANSWER_GEN_PROMPT: "You're a trusted mentor to an adult mentee. Your mentee is seeking advice in the question. Here are some raw thoughts..." and then I tell it how to respond, in this case in a HYPE tone, and to be straight up. And here is where we do our postprocessing. We're going to use a different pattern than what we've seen for constructing the query engine up until this point. I need to instantiate a node_postprocessor; this is what is going to inject that sentence window into the context of the LLM. I'll go ahead and build my index, build my query engine, and we'll update our prompt as well, of course. Then we set up our query_pipeline: we have an input component and then the sentence_window_query engine. You'll notice that I passed the node_postprocessor as an argument to the query engine, so it knows how to process what I'm going to be sending to it. Now we can go ahead and run our query_pipeline; we've done this before several times. Here is a question that I've run through it: essentially, how can I effectively build strength across multiple facets of real life without relying on complicated machines? And you can see that we get a decent response from the language model. Of course, you could try this question again yourself if you'd like.

Now let's dive a bit deeper into the mechanisms behind small to big retrieval. I want to focus on how smaller child chunks refer to bigger parent chunks, and how postprocessing of these nodes helps enhance our data retrieval and response generation. So I want to demystify a concept here: smaller child chunks point to a larger parent chunk, and this helps us manage and access data in a more structured manner. What happens is, at query time, the user query comes in, that query is searched against the vector database, and we retrieve the smaller chunks, which then point us back to their larger parent chunks.
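For reference, here's a rough end-to-end sketch of this flow under a recent LlamaIndex release. The Qdrant URL, API key, placeholder documents, and similarity_top_k value are assumptions, the embedding model is whatever is configured globally, and the HYPE_ANSWER_GEN_PROMPT update from the helper functions is omitted; the collection name matches the one shown in the video.

```python
import qdrant_client
from llama_index.core import Document, StorageContext, VectorStoreIndex
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor
from llama_index.core.query_pipeline import InputComponent, QueryPipeline
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Nodes from the sentence-window step (window_size of five, as in the video)
docs = [Document(text="...your corpus here...")]  # placeholder documents
parser = SentenceWindowNodeParser.from_defaults(window_size=5)
nodes = parser.get_nodes_from_documents(docs)

# Placeholders: point these at your own Qdrant cluster
client = qdrant_client.QdrantClient(url="https://<your-cluster>", api_key="<key>")
vector_store = QdrantVectorStore(
    client=client,
    collection_name="small-to-big-sentence-window",
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# The parsing is already done, so the only remaining transformation is the
# embedding model, picked up from the global Settings configured earlier
index = VectorStoreIndex(nodes, storage_context=storage_context)

# The node postprocessor injects each sentence's window into the LLM context
node_postprocessor = MetadataReplacementPostProcessor(target_metadata_key="window")
sentence_window_query_engine = index.as_query_engine(
    similarity_top_k=5,  # illustrative value
    node_postprocessors=[node_postprocessor],
)

# Input component chained into the query engine, as in earlier lessons
query_pipeline = QueryPipeline(chain=[InputComponent(), sentence_window_query_engine])

response = query_pipeline.run(
    input="How can I effectively build strength across multiple facets of "
    "real life without relying on complicated machines?"
)
print(response)
```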