From the course: Hands-On AI: RAG using LlamaIndex
Self-correcting
- [Instructor] Naive RAG methods, and even some of the advanced ones we've seen throughout this course, present a few issues. One is that they tend to retrieve information for any query, even if the retrieved information is irrelevant or unnecessary. Not only that, when the system retrieves that context, it never actually checks whether what was retrieved is relevant to the user's query. It just haphazardly injects it into the context. And finally, there's no mechanism in place to make sure the response the language model generates is actually accurate and relevant to the user's query. So we need a way to consistently get the best responses and best answers from our query engine and, by extension, our language model. And this is where we look to self-correcting RAG.

Specifically, we are going to discuss three types of self-correcting techniques in this lesson. One is the retry query engine, where we use an evaluator to improve the response from a base query engine. Two is the retry source query engine. Here we modify the query's source nodes by filtering the existing source nodes based on an LLM node evaluation. And finally, there's the retry guideline query engine. This engine uses guidelines to direct the evaluator's behavior. It can be customized with your own guidelines via prompts; the engine evaluates the response against those guidelines, and if the response doesn't meet them, it transforms the query and retries. So let's go ahead and get into this in detail.

I'll start with the retry query engine. The whole purpose of this is to enhance the query response by retrying queries that fail to meet some evaluation criteria. The hope is that the quality of answers improves iteratively with each retry. The main purpose of this query engine is simply to automatically retry queries to improve the accuracy and comprehensiveness of the response. It evaluates the initial response and then retries with modifications if it doesn't meet some predefined criteria. We'll see how this happens in just a moment; it's a ton of prompts happening under the hood.

Here are the arguments you need to know about for this query engine. First, we need to pass in an actual query engine, a base query engine, so this is a query engine that uses another query engine. We also need to instantiate an evaluator that will evaluate response quality using its evaluate_response method under the hood. max_retries is the number of retry attempts to undertake, and there's also a callback manager, which manages callbacks during execution. If you're curious about how all of this works under the hood, I've linked to the source code. And like I've said several times before, the best way to figure out how these abstractions work is to read the source code, go through it, and demystify it for yourself. Don't let it just be magic.

So what is happening under the hood? First, we get a query. That initial query gets sent to the base query engine, which generates a response. Then an evaluator checks: does this response meet some criteria? If it does, great, we can return it. If it doesn't, we transform the query based on some feedback and retry: we create a new instance with the same query_engine and evaluator but a reduced max_retries, and we repeat this process until we get a satisfactory response or until max_retries is reached. A minimal sketch of the setup is below before we walk through it.
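Here's a minimal sketch of that setup, assuming a default LLM is configured (for example, an OpenAI key picked up by LlamaIndex's Settings) and a hypothetical "./data" folder of documents; in the lesson notebook you'd use whatever index and documents we built earlier in the course, and the example query is just a placeholder.

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.evaluation import RelevancyEvaluator
from llama_index.core.query_engine import RetryQueryEngine

# Build a base query engine over some documents
# ("./data" is a placeholder path for whatever corpus you're using).
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
base_query_engine = index.as_query_engine()

# Evaluator that checks whether a response is relevant to the query.
query_response_evaluator = RelevancyEvaluator()

# Wrap the base engine: evaluate each response and, if it fails evaluation,
# transform the query and retry (up to max_retries times).
retry_query_engine = RetryQueryEngine(
    base_query_engine,
    query_response_evaluator,
    max_retries=3,
)

response = retry_query_engine.query("What did the author do growing up?")
print(response)
```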
So under the hood here, we instantiate a base_query_engine, then we instantiate a relevancy evaluator, which comes from llama_index.core.evaluation. And then finally, we can instantiate the retry_query_engine, which takes as arguments the base_query_engine as well as the query_response_evaluator. We'll also go ahead and instantiate a query_chain and the query_pipeline.

Now let's take a look at what this thing is actually doing. Again, under the hood there are a bunch of calls and prompts to a language model to help iteratively improve the response. We've seen this response_synthesizer before, but what's unique here is not the response synthesizer, it's the evaluation templates. You can see the evaluator has two templates that it uses: one to evaluate the response and another to refine it. So there are two prompts being leveraged under the hood here. Again, you can always look at the source code to see how these prompts are being used and how they're being parsed and passed. With the prompts in place, we can run the query engine. Here I've run the query engine, and you can see we get a response object, which also has the source nodes. And of course, we can run this as a query_pipeline too.

Next up is the RetrySourceQueryEngine. This query engine retries a query with a subset of source nodes if the initial response fails evaluation. So again, we are improving response quality, this time by selectively using the source nodes that pass evaluation. We then create a new index with those nodes and retry the query against the refined index. The arguments you need to know here are the query_engine, which is the base query engine for executing queries, the evaluator, and the number of retries.

What's happening under the hood? First, there's an initial query to the base query engine. Then the evaluator looks at the response. If the response passes the evaluation, we return it. If it fails, we evaluate each source node used in the response and go through a refinement process: we create a new index with the source nodes that pass evaluation, we create a new instance of the RetrieverQueryEngine with that new index, and we create a new instance of the RetrySourceQueryEngine with that RetrieverQueryEngine and a reduced number of maximum retries. Then we retry with the new RetrySourceQueryEngine, and we repeat this until we get a satisfactory response. And of course, if you're curious about the details, you have the source code here that you can look into.

So here we follow a very familiar pattern. We instantiate the retry_source_query_engine, and of course we pass in the base_query_engine, the same query engine we defined above. Then we can take a look at the prompts for this query engine, and you can see we get the familiar response synthesizer prompts, but what we're interested in again are the evaluator and evaluator refine prompts. Again, feel free to look at the source code for how this is working under the hood. We'll go ahead and create a query_pipeline, and then you can run the query; a sketch of this setup follows.
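Here's a minimal sketch of that pattern. It reuses the base_query_engine and query_response_evaluator from the sketch above, so it assumes those have already been created; the query is again a placeholder.

```python
from llama_index.core.query_engine import RetrySourceQueryEngine

# Wrap the same base engine. If the first response fails evaluation, the engine
# evaluates each source node, builds a new index from the nodes that pass,
# and retries the query against that refined index (up to max_retries times).
retry_source_query_engine = RetrySourceQueryEngine(
    base_query_engine,
    query_response_evaluator,
    max_retries=3,
)

response = retry_source_query_engine.query("What did the author do growing up?")
print(response)
```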
Next up is the RetryGuidelineQueryEngine. This one evaluates responses against some predefined guidelines, then transforms the query based on the feedback it gets from the language model, and retries until a satisfactory response is achieved or the maximum number of retries is reached. The whole purpose of this is to improve response quality through iterative evaluation and query transformation: we evaluate the initial response, modify the query if necessary, and then retry.

Here are the arguments you need to know about. You already know what a query_engine is. The main differences here are the guideline_evaluator and resynthesize_query. The guideline_evaluator evaluates the response against some predefined guidelines. resynthesize_query is a flag that indicates whether the query should be resynthesized based on the feedback. We also have a query_transformer, which transforms the query based on that feedback. And of course, max_retries is how many times we're going to retry.

What's happening under the hood? Again, we start with an initial query and use the base query engine to get a response. If max_retries is zero or less, we just return that response. If max_retries is greater than zero, we evaluate the response using a guideline evaluator and check whether it passes or fails. If it passes, great, we give that response to the user. If it fails, we go through the iterative loop again: we create a new RetryGuidelineQueryEngine with the same settings but a reduced max_retries, a transformation modifies the original query based on the feedback from the evaluator, and that transformed query gets passed to the new RetryGuidelineQueryEngine. We repeat this process until we either get a satisfactory response or reach the maximum number of retries.

All right, so let's go ahead now and see this all in action. First, we instantiate the guideline evaluator. The guideline evaluator takes in some default guidelines, and to those we're just going to add a couple of others. What are the default guidelines? Simple: the response should fully answer the query, it should not be vague, and it should be specific and use numbers or statistics when possible. To that, we're also adding: don't make the response too long, and summarize it when possible. So here we have a query, and the setup to see this in action looks something like the sketch below.
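Here's a minimal sketch of that setup, again reusing the base_query_engine from earlier. The extra guideline text and the example query are placeholders; the exact wording in the lesson notebook may differ.

```python
from llama_index.core.evaluation import GuidelineEvaluator
from llama_index.core.evaluation.guideline import DEFAULT_GUIDELINES
from llama_index.core.query_engine import RetryGuidelineQueryEngine

# Start from the default guidelines (fully answer the query, avoid vagueness,
# be specific / use numbers or statistics where possible) and append our own.
guideline_eval = GuidelineEvaluator(
    guidelines=DEFAULT_GUIDELINES
    + "\nThe response should not be overly long.\n"
    "The response should try to summarize the answer where possible.\n"
)

# resynthesize_query=True lets the engine rewrite the query from the
# evaluator's feedback before retrying.
retry_guideline_query_engine = RetryGuidelineQueryEngine(
    base_query_engine,
    guideline_eval,
    resynthesize_query=True,
    max_retries=3,
)

response = retry_guideline_query_engine.query("What did the author do growing up?")
print(response)
```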