From the course: Hands-On AI: RAG using LlamaIndex
Introduction to RAG evaluation
- [Instructor] Before we start talking about RAG evaluation, let's recap what a Naive RAG pipeline looks like. We first create an index: we take some source data, clean it up, chunk it, embed it, push it to the vector database, and create an index over it. Then, when a user query comes in, that query goes to the embedding model, the same embedding model we used to build our vector store, which transforms the query into an embedding representation and searches the vector store for the nodes or documents most similar to the user query. Those documents get retrieved and injected into a prompt template. The prompt template gets constructed, packaged, and sent to a large language model. The large language model synthesizes all of this and produces a response.

Now, imagine we have this RAG system set up for a high-stakes environment. What if the RAG system is used to answer questions for medical diagnosis support, or to provide background knowledge for some type of policy decision-making? In applications like these, generating outputs that are not just fluent but factually correct is absolutely essential. We need a high degree of confidence in our RAG system before we put it into production, before we deploy it. That means we need robust and insightful evaluation.

Evaluation is critical for pretty much everything in deep learning and classical machine learning. Without an evaluation system, you won't be able to compare different models, prompts, contexts, and retrieval strategies to determine what works best. You also won't be able to track the quality of your RAG pipeline over time. In addition, evaluation gives you a concrete number that tells you how accurate your system is, how relevant its answers are, and how well it's working overall. It also helps you determine which part of the pipeline needs improving and how to improve it.

So we've gone over the basic steps in a Naive RAG pipeline. If you think about it, there are essentially two components. There are a number of different sub-components, but it boils down to two main things: retrieval and generation.

The retrieval component, as we mentioned, fetches relevant information from an external knowledge source and informs the generation process. This is itself two phases: indexing and searching. In indexing, the documents are organized for retrieval. In searching, the index is used to fetch relevant documents based on the user's query. Here we might run into some challenges. One is evaluating how effectively we filter and select the most pertinent information. Another is assessing the relevance and usefulness of the data we actually retrieve.

The generation component takes the retrieved context and the original query as input to generate a coherent, contextually relevant, and appropriate output. In this component we face a couple of major challenges as well. First, we need to make sure the LLM is utilizing the retrieved context effectively. We also need to assess the factual correctness, relevance, and coherence of what the language model generates.
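To make those two components concrete, here is a minimal LlamaIndex sketch that keeps the retrieval step and the generation (synthesis) step separate, mirroring the pipeline recap above. The data directory, the query string, and the default embedding model and LLM (which assume an OpenAI API key is configured) are placeholder assumptions for illustration, not the course's specific setup.

```python
from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    get_response_synthesizer,
)

# Indexing: load, chunk, and embed the source data, then build an index over it.
# "data/" is a placeholder directory; the defaults assume OpenAI models are configured.
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# Retrieval: embed the query and fetch the most similar nodes from the index.
retriever = index.as_retriever(similarity_top_k=3)
query = "What treatment options does the guideline recommend?"  # hypothetical query
retrieved_nodes = retriever.retrieve(query)

# Generation: pack the retrieved context and the query into a prompt
# and have the LLM synthesize a response.
synthesizer = get_response_synthesizer()
response = synthesizer.synthesize(query, nodes=retrieved_nodes)
print(response)
```

In everyday use, `index.as_query_engine()` wraps both steps for you, but keeping them distinct like this makes it easier to evaluate retrieval and generation separately along the challenges just described.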
A RAG system can go wrong in a number of ways. The information we retrieve might be irrelevant or unreliable. We might ignore the retrieved context in favor of the language model's priors. We might generate inconsistent or contradictory statements. We might fail to recognize when there's insufficient context to answer the question. We might retrieve so many results that we can't synthesize them together. We'll also face factual inaccuracies and hallucinations that are not grounded in the retrieved context. Carefully evaluating RAG systems along these dimensions is key to understanding failure modes and improving robustness and reliability.

Really, there are two main aspects of evaluation: quality and ability. Quality is measured via relevance and faithfulness. Relevance means the retrieved context is precise and the generated answers directly relate to the user's query. Faithfulness means the generated responses are consistent with the retrieved context and do not contain contradictions or inconsistencies. When we measure the ability of our RAG system, we look at how well it handles noisy context and whether it knows when to admit that it lacks the knowledge or context to answer. We also need the system to combine information from various sources, and to identify and ignore misinformation.

It's important to develop evaluation schemas and metrics that address these issues, and this remains an important open problem for the field. We need techniques that go beyond surface-level metrics to really probe the ability of a RAG system to produce coherent, contextually relevant, grounded outputs. In the next section, I'm going to give you an overview of three core evaluation metrics.
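As a preview of what that evaluation can look like in practice, here is a minimal sketch using LlamaIndex's built-in LLM-based evaluators for faithfulness and relevancy. The judge model, the data directory, and the query are assumptions chosen for illustration, not a definitive setup.

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.evaluation import FaithfulnessEvaluator, RelevancyEvaluator
from llama_index.llms.openai import OpenAI  # assumes llama-index-llms-openai and an API key

# A judge LLM scores the responses; gpt-4o is just an example choice.
judge_llm = OpenAI(model="gpt-4o")

faithfulness = FaithfulnessEvaluator(llm=judge_llm)
relevancy = RelevancyEvaluator(llm=judge_llm)

# Build a simple query engine over placeholder data for the sake of the example.
documents = SimpleDirectoryReader("data").load_data()
query_engine = VectorStoreIndex.from_documents(documents).as_query_engine()

query = "What does the guideline recommend for follow-up visits?"  # hypothetical query
response = query_engine.query(query)

# Faithfulness: is the generated answer grounded in the retrieved context?
faith_result = faithfulness.evaluate_response(response=response)

# Relevancy: do the answer and the retrieved context actually address the query?
rel_result = relevancy.evaluate_response(query=query, response=response)

print("faithful:", faith_result.passing, "| relevant:", rel_result.passing)
```

These two checks map directly onto the quality dimensions above: relevancy probes whether the system answered the question that was asked, and faithfulness probes whether the answer stays grounded in the retrieved context rather than the model's priors.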