From the course: Hands-On AI: RAG using LlamaIndex
Hybrid retrieval
- [Instructor] We're going to kick off our discussion of the modular RAG paradigm with hybrid retrievers. Arguably, hybrid retrievers could have fit in the previous section, where we were talking about post-retrieval techniques, but I'm including them here because they don't really belong to pre-retrieval or post-retrieval; they operate at retrieval time. So let's start by installing the BM25 retriever and doing our necessary imports. Note that we're using Qdrant Cloud for this, and I'll explain why in just a second. We'll set up our LLM and our embedding model.

We'll create our document store, but we're doing something a little different here: we're instantiating the document store directly. We load it from disk into a Python variable, and then, with that variable holding all the documents we've brought in, we create a LlamaIndex SimpleDocumentStore object. Next we create our Qdrant vector store, and note that for this we're making use of the storage context. A hybrid retrieval system combines a document store with a vector index, which is why we needed to define a document store above. The two have complementary purposes: the vector index is great at retrieving documents based on semantic similarity using embeddings, while the document store can be used to fetch structured data by document, such as metadata like the author or date, or for simple text-based search. So we use a storage context here and set its document store equal to the document store we defined above. Now we ingest into Qdrant using both the document store and the vector store, which is a bit of a departure from what you've seen previously: I've forgone the ingestion abstraction and used the VectorStoreIndex directly. (I've sketched this setup below.)

Now let's see this in action, but first I want to talk to you about vector store query modes. We've touched on these briefly throughout the course, but let's dig a little deeper. There are several query modes you can use. We've primarily been using the default query mode, which performs strict vector search: we retrieve the most similar vectors based on the query vector. But there are several other query modes at our disposal. One of them is hybrid mode; hybrid search combines traditional text-based search with vector search. When we use hybrid search, the vector index finds the subset of relevant documents based on the user's query, but we can also query the document store by text. So we have two systems here: one matching in the vector index, and one using plain text-based search. There's also semantic hybrid, which is similar to what we just described, text search combined with vector embeddings, but with a semantic re-ranker added to improve relevance.
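Before we look at the remaining query modes, here's a minimal sketch of the setup just described, assuming OpenAI models and a Qdrant Cloud cluster. The model name, persist path, collection name, and credentials are placeholders, not the course's exact values.

```python
import qdrant_client
from llama_index.core import Settings, StorageContext, VectorStoreIndex
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.vector_stores.qdrant import QdrantVectorStore

# LLM and embedding model (placeholder choices).
Settings.llm = OpenAI(model="gpt-4o-mini")
Settings.embed_model = OpenAIEmbedding()

# Load the persisted documents from disk into a SimpleDocumentStore.
docstore = SimpleDocumentStore.from_persist_path("docstore.json")  # placeholder path

# Connect to Qdrant Cloud (placeholder URL and key).
client = qdrant_client.QdrantClient(
    url="https://YOUR-CLUSTER.qdrant.io",
    api_key="YOUR_QDRANT_API_KEY",
)
vector_store = QdrantVectorStore(client=client, collection_name="hybrid-demo")

# The storage context ties the document store and vector store together.
storage_context = StorageContext.from_defaults(
    docstore=docstore,
    vector_store=vector_store,
)

# Ingest with VectorStoreIndex directly, skipping the ingestion pipeline.
index = VectorStoreIndex(
    nodes=list(docstore.docs.values()),
    storage_context=storage_context,
)
```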
Before I talk to you about sparse search, I want to point out that throughout this entire course we've only been working with dense vectors. Dense vectors are created with dense embedding models: we get a numerical representation of our chunk of text as a long list of floating-point numbers. Dense vectors are great at capturing rich semantic meaning across an entire piece of text. A sparse vector is different: its values are mostly zeros, which is why we call it sparse, and it's good at capturing specific keywords and similar details. You can review the notes here if you're interested in how to use sparse vectors, but I haven't used one anywhere in this course and I won't use one here either.

Another query mode is text search, which is just what it sounds like: looking for exact keyword matches. Similarity top-k you're familiar with; we've touched on it several times in the course. And hybrid top-k just returns the top-k results from hybrid search. To use these different query modes, you pass them as arguments to the as_retriever method of the index, and that's what I'll show you in action here. We define a query string and a function. This function takes in our query string, our index, and some keyword arguments. We build the retriever engine using the as_retriever method of our index, passing in the keyword arguments, then retrieve our documents with the retriever engine's retrieve method using our query. Finally we print the retrieved documents, iterating through them and printing the score along with the actual text. To try the different query modes, I define a dictionary of modes, iterate through it, and pass each one to our function; you can see the results here, and there's a sketch of this helper after this paragraph.

And this brings me to hybrid fusion retrieval. At a high level, a hybrid search system works by, one, creating a subquery that uses vector search to find semantically similar documents, even if they don't contain the exact keywords in the query. Then we have another subquery that uses full-text search, like BM25, to find documents that do contain the query keywords. We then join the results of the two subqueries and produce a final ranked result that combines the vector-search similarity scores with the full-text-search relevance scores.

I've linked here to BM25 on Wikipedia. This is an old technique, BM standing for "best matching," with roots in the probabilistic retrieval work of the 1970s and '80s, and it's been used in information retrieval for years. You can look through the equations if you'd like; I'm not going to walk through them. They're pretty straightforward, and if you know enough about probability, you should be able to follow them. Either way, you don't need to know the mathematics behind BM25 to make use of it, so I won't dwell on that. I've also linked to the implementation in LlamaIndex. Note that we're making use of a tokenizer here that removes stop words, which is an important thing to do; these are words like "the," "of," "and," and so on. You can take a look at the stop words by following the source code to where they're retrieved from; the tokenizer here also applies a Porter stemmer. A lot of the arguments are things you're already familiar with, but if you're curious, the source code will give you more clarity on what's happening under the hood.
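Here's a rough sketch of that helper and the query-mode loop. The function name `retrieve_and_print` and the exact set of modes shown are my own choices, and which modes actually work depends on your vector store backend (Qdrant, for instance, requires the vector store to be created with enable_hybrid=True before hybrid mode will work).

```python
def retrieve_and_print(query: str, index, **kwargs):
    """Build a retriever with the given kwargs, run the query, print results."""
    retriever = index.as_retriever(**kwargs)
    nodes = retriever.retrieve(query)
    for n in nodes:
        print(f"Score: {n.score}")
        print(n.node.get_text()[:300])
        print("-" * 40)

query_str = "..."  # your query string here

# Each entry maps a label to the kwargs passed through to as_retriever.
query_modes = {
    "default": {"vector_store_query_mode": "default"},
    "hybrid": {"vector_store_query_mode": "hybrid"},
    "text_search": {"vector_store_query_mode": "text_search"},
}

for name, kwargs in query_modes.items():
    print(f"\n=== {name} ===")
    retrieve_and_print(query_str, index, **kwargs)
```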
Now I want to point out some useful information to have as a RAG practitioner: the fusion modes this retriever supports. We're setting the mode to reciprocal re-rank, which merges the results of the vector index with the BM25-based retriever. This way we capture both the semantic relationships, that is, the meaningful connections between words, and the exact keywords in the input queries. There are other modes as well, for example relative score, distance-based score, and simple. I've detailed the different modes here; if you're interested, by all means click through to the source code and read more, or just pause here and read this.

Next, I want to talk to you about the reciprocal rank fusion algorithm. I've linked to the research paper that introduced this methodology; it's a short read, just a couple of pages, without too much math. But again, like I said before, you don't need to understand the math behind the thing to use the thing effectively. I'll briefly describe the algorithm. Imagine you've got a bunch of different search engines, for example Google, Bing, and DuckDuckGo, and each one ranks the results for a query differently. What reciprocal rank fusion does is combine those rankings to get an even better overall ranking. For each search result, the algorithm looks at its rank in each list and assigns it a score based on the reciprocal of that rank: if the result is number one, its score is one over one, which is one; if it's number two, it's one over two, which is 0.5; if it's number three, one over three, and so on. (The paper actually adds a smoothing constant k to the rank, scoring each result as 1/(k + rank), but the intuition is the same.) We then add up these reciprocal rank scores for each result across the different search engines and re-rank all the results by their total scores, so the results with the highest totals rise to the top. Again, at a high level it's three steps: calculate the ranks, aggregate the scores, and reorder. This is essentially a way of combining rankings from many different systems so that more weight goes to results that rank highly in many of them; results that consistently show up near the top of different rankings bubble up to the top of the combined list. The cool thing is that this simple method works really well in practice and can beat out more complex approaches. Whenever possible, use a combination of full-text search and vector search with reciprocal rank fusion, and you'll be surprised at how efficient it is and how much better your results get without a ton of computation or a bunch of LLM calls. It's a very powerful technique.

To make use of it, we'll see it in the code here. First, we instantiate a vector retriever using the as_retriever method of our index. Then we define a BM25 retriever, instantiating it with our document store. So now we have one retriever that retrieves based only on embeddings and one that retrieves based only on full text, which is exactly what BM25 does. You can set a query-generation prompt template if you'd like; I'm instantiating a custom one here just to show you how it's done. You don't have to use it; the default one works fine. (Both retrievers and a sample prompt are sketched below.)
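Here's roughly what that might look like. The top-k values are illustrative, and the prompt wording is modeled on LlamaIndex's default query-generation prompt rather than copied from the course.

```python
from llama_index.retrievers.bm25 import BM25Retriever

# Dense retriever: semantic similarity over embeddings.
vector_retriever = index.as_retriever(similarity_top_k=5)

# Sparse/full-text retriever: BM25 over the nodes in the document store.
bm25_retriever = BM25Retriever.from_defaults(
    docstore=index.docstore,
    similarity_top_k=5,
)

# Optional custom query-generation prompt; the default works fine too.
QUERY_GEN_PROMPT = (
    "You are a helpful assistant that generates multiple search queries "
    "based on a single input query. Generate {num_queries} search queries, "
    "one on each line, related to the following input query:\n"
    "Query: {query}\n"
    "Queries:\n"
)
```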
And just to point out what the actual default prompt is: you can go to the source code for the QueryFusionRetriever and see the query-generation prompt there. Of course, you can always use the get_prompts method of any retriever to see its prompts. Now we'll instantiate the QueryFusionRetriever, passing it a list of retrievers. If we look at the source code, it accepts a list of retrievers; in this case we're passing just two. You could pass several retrievers if you wanted to make this even more interesting and complex, but for hybrid search all we need is the vector retriever plus the full-text retriever. You set a similarity top-k and the number of queries; if you don't want to generate additional queries, just set the number of queries to one. You can also change the mode; in this case I'm using reciprocal re-rank. use_async and verbose are just parameters you don't need to worry too much about. We get back NodeWithScore objects by passing a query to the retriever; you can see the queries it generates, and then we print out the actual nodes with their new scores.

Of course, you can use this in a query engine with a query pipeline chain, and I've shown you how to do so here. We do need to add something, though: the input component. So make sure you import the InputComponent from the query pipeline module, instantiate it, and put it into the chain. That way you can use your query pipeline as you normally would.

So this hybrid methodology is amazing; it works really well. Full-text search quickly retrieves documents based on exact keyword matches, while vector search finds relevant documents that don't necessarily contain the keywords but are semantically similar. Combine the two and you get a more comprehensive and accurate search result than either could produce on its own. The re-ranking can also be adjusted so that you place more weight on full-text versus vector search via the alpha parameter; I didn't pass it here, but you would just pass alpha equal to whatever value you want between zero and one. In the end, what you have is a combination of full-text search for keyword matching and vector search for semantic similarity. Put them together and you've got a hybrid system that is really, really powerful. I can't overstate how powerful it is and how good a technique it is. It's probably one of the best techniques you can use when building a RAG system, so I highly recommend making use of it. I've put a sketch of the full fusion setup below. So that's it for this lesson, and I'll see you in the next one.
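As a reference, here's a minimal sketch pulling the pieces together: the fusion retriever with reciprocal rank fusion, then a query pipeline with the input component in the chain. It assumes the `vector_retriever` and `bm25_retriever` from the earlier sketch; the query text and parameter values are illustrative.

```python
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.query_pipeline import InputComponent, QueryPipeline
from llama_index.core.retrievers import QueryFusionRetriever

fusion_retriever = QueryFusionRetriever(
    [vector_retriever, bm25_retriever],   # any number of retrievers is allowed
    similarity_top_k=5,
    num_queries=4,                        # set to 1 to skip query generation
    mode="reciprocal_rerank",             # reciprocal rank fusion
    use_async=True,
    verbose=True,
    # query_gen_prompt=QUERY_GEN_PROMPT,  # optional custom prompt from above
)

# Inspect the fused, re-ranked nodes directly...
for n in fusion_retriever.retrieve("What is hybrid retrieval?"):
    print(f"{n.score:.4f} -> {n.node.get_text()[:100]}")

# ...or wrap the retriever in a query engine inside a pipeline.
query_engine = RetrieverQueryEngine.from_args(fusion_retriever)
pipeline = QueryPipeline(chain=[InputComponent(), query_engine], verbose=True)
response = pipeline.run(input="What is hybrid retrieval?")
print(response)
```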