From the course: Hands-On AI: RAG using LlamaIndex
Comparative analysis of retrieval-augmented generation techniques
- [Instructor] So throughout this course, you've learned a lot about retrieval augmented generation. You've learned about the different paradigms, the different techniques, and really how complex all of this is. A natural question to ask at this point is, okay, what am I supposed to use? I've got all these tools at my disposal; how am I supposed to put them together to build something useful? Unfortunately, I don't have the answer for you. I can't tell you exactly what the best methodology is, but I do want to give you a little bit of hope by talking about a paper called "Advanced RAG Output Grading." Now, I mentioned this paper at the very beginning of the course, and hopefully you took time to look into it when we were discussing evaluation. If not, don't worry. I'm going to go over it here in a little bit of detail and talk about its findings, because I feel it's an important, interesting paper. At the very least, it'll give you a framework for how you can go about evaluating your RAG system.

The main purpose of the paper is to evaluate different RAG techniques, measure their performance specifically with respect to retrieval precision and answer similarity, find the best performing methods, and then provide some recommendations for how to go forward. The first thing I want to talk about is their data collection and dataset construction. They looked at arXiv, focusing on 423 research papers related to AI and LLMs. From those, they generated 107 QA pairs using GPT-4, I believe, and validated them with human reviewers. They then selected 13 papers for further detailed analysis, added more papers as noise to simulate a real-world environment, and used a bunch of different chunking strategies to create vector databases.

The methods they evaluated were sentence window retrieval, the document summary index, hypothetical document embeddings (HyDE), multi-query, maximal marginal relevance (MMR), the Cohere reranker, and LLM-based reranking. They used the LlamaIndex abstractions for all of these. If you recall, sentence window retrieval optimizes retrieval and generation by tailoring the text chunk size. The document summary index indexes document summaries for efficient retrieval while using the full text for generation. HyDE uses an LLM to generate hypothetical answers to improve document retrieval. Multi-query expands the user query into multiple smaller queries to help broaden the search scope. MMR balances relevance and diversity in the retrieved documents. The Cohere reranker uses Cohere's rerank model to prioritize the most relevant documents within the context. The LLM reranker uses a general-purpose LLM, rather than Cohere's model, to rerank the retrieved documents based on contextual understanding.

They conducted 10 runs for each of these techniques to help ensure some statistical reliability. The metrics they looked at were retrieval precision, which measures the relevance of the retrieved context to the question on a scale from zero to one, and answer similarity, which assesses the alignment of the system's answer with a reference response, scored from zero to five. They also looked at some other metrics as well.
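To make a couple of those techniques a little more concrete, here is a minimal sketch, not the paper's code, of wiring HyDE and LLM-based reranking together with LlamaIndex abstractions. The "papers/" directory, the top-k values, and the query are illustrative placeholders, and the sketch assumes an LLM and embedding model are already configured (for example, via an OpenAI API key).

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.indices.query.query_transform import HyDEQueryTransform
from llama_index.core.postprocessor import LLMRerank
from llama_index.core.query_engine import TransformQueryEngine

# Load documents from a local folder (the "papers/" path is illustrative).
documents = SimpleDirectoryReader("papers/").load_data()
index = VectorStoreIndex.from_documents(documents)

# Retrieve a wide candidate set, then have an LLM rerank it down to the top 3.
base_engine = index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[LLMRerank(top_n=3)],
)

# Wrap the engine so each query is first expanded into a hypothetical answer
# (HyDE); that hypothetical document's embedding drives the retrieval step.
hyde_engine = TransformQueryEngine(
    base_engine, HyDEQueryTransform(include_original=True)
)

print(hyde_engine.query("How does retrieval-augmented generation reduce hallucinations?"))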
They rightfully acknowledge that there's a need for further metrics specifically designed to assess retrieval precision and answer similarity beyond these simple scoring metrics. With these metrics in hand, they then did a statistical analysis. They calculated the average scores for retrieval precision and answer similarity across the 10 runs for each technique. They used statistical tests, specifically ANOVA (analysis of variance), to detect overall significant differences among the techniques, and then Tukey's HSD, a post hoc test run after ANOVA, to do pairwise comparisons.

All right, so now that you know the tests they employed, you can look at the results. They present results for the classical vector database, for sentence window retrieval, for the document summary index, and so on, and they also point out some limitations. Here are their findings. Looking at the box plots of performance with respect to retrieval precision, they found that HyDE plus LLM reranking has the highest retrieval precision. They found that sentence window retrieval also performed well in terms of retrieval precision. There are some moderately performing techniques, for example the document summary index and multi-query. Lower performing techniques were MMR and the Cohere reranker. For answer similarity, they found again that HyDE and LLM reranking tended to work the best: they improved retrieval precision and led to higher answer similarity. Sentence window retrieval actually ended up showing a lot of variability; while it achieved high precision, it was sometimes inconsistent at effectively leveraging the retrieved context for generation. The document summary index, multi-query, and MMR were moderately performing techniques.

So we can look at the conclusions, and my biggest takeaways are these. Retrieval precision matters: methods like HyDE plus LLM reranking and sentence window retrieval plus HyDE have high precision, and this is a good indicator for at least the faithfulness metrics. There's an interesting discrepancy with sentence window retrieval: despite having high precision, it showed a lot of inconsistency in answer similarity. The paper also highlights that answer similarity is influenced by various factors beyond retrieval, because we're using an LLM to judge the answer; beyond retrieval, we're influenced by the LLM's capabilities and by prompt engineering as well. But overall, if we prioritize RAG methods with high retrieval precision, like HyDE and LLM reranking, we'll be able to generate faithful and relevant answers. Faithfulness requires going beyond keyword matching and ensuring that the language model understands the meaning of the context that is retrieved. There's a lot of future work to be done on evaluating RAG, so I encourage you to take a look at this paper; it's 14 or 15 pages well spent. Again, the paper is called "Advanced RAG Output Grading." So although I can't tell you what the best methodology is for your specific use case, I have provided you with a toolkit of techniques and a framework you can use to evaluate your responses. Now, "Advanced RAG Output Grading" does have a GitHub repository associated with it; you can find it under the predlico ARAGOG repo.
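Circling back to the statistical analysis for a second: if you want to run the same kind of procedure on your own evaluation numbers, here is a minimal sketch in Python using SciPy and statsmodels. The scores below are synthetic placeholders, not the paper's data; only the procedure, a one-way ANOVA followed by Tukey's HSD for pairwise comparisons, mirrors what the paper describes.

import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Pretend each technique produced 10 retrieval-precision scores (one per run).
# These numbers are made up for illustration, not the paper's results.
rng = np.random.default_rng(42)
scores = {
    "hyde_llm_rerank": rng.normal(0.85, 0.03, 10),
    "sentence_window": rng.normal(0.82, 0.05, 10),
    "naive_vector_db": rng.normal(0.74, 0.04, 10),
}

# One-way ANOVA: is there any significant difference among the techniques?
f_stat, p_value = f_oneway(*scores.values())
print(f"ANOVA: F={f_stat:.2f}, p={p_value:.4f}")

# Tukey HSD: which specific pairs of techniques differ?
values = np.concatenate(list(scores.values()))
labels = np.repeat(list(scores.keys()), 10)
print(pairwise_tukeyhsd(values, labels))

The ANOVA only tells you whether any technique differs at all; Tukey's HSD then identifies which specific pairs differ while accounting for the multiple comparisons.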
There they've got all the code they used for the evaluation as well. So whether or not this ends up being the definitive way to evaluate RAG pipelines, I do hope that it inspires and motivates you, or at least gives you a framework for thinking about how to evaluate all the different methods and techniques that you've learned throughout this course.
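If you want a starting point for your own evaluation framework, here is a minimal sketch of grading an answer against a reference response, in the spirit of the paper's answer similarity metric. It uses LlamaIndex's CorrectnessEvaluator, which has an LLM judge score the response on a one-to-five scale; this is a rough analogue rather than the paper's exact grading setup, and the query, response, and reference below are made up for illustration.

from llama_index.core.evaluation import CorrectnessEvaluator
from llama_index.llms.openai import OpenAI

# The judge model is an assumption; swap in whichever LLM you prefer.
evaluator = CorrectnessEvaluator(llm=OpenAI(model="gpt-4o"))

result = evaluator.evaluate(
    query="What does HyDE do?",
    response="HyDE has an LLM write a hypothetical answer and embeds it to retrieve real documents.",
    reference="HyDE generates a hypothetical document for the query and uses its embedding to retrieve relevant documents.",
)
print(result.score, result.passing)  # score is on a 1-to-5 scale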