From the course: Hands-On AI: RAG using LlamaIndex
Components of a RAG system
- [Instructor] In this module, we're going to get an introduction to retrieval-augmented generation. Specifically, I'm going to teach you the concept of a Naive RAG pipeline, but we have a couple of modules to go before we get there. So I want to set the tone for what this chapter is going to be. I'll begin by giving you a high-level overview of the components in a RAG system. Then, in the next two videos, I'm going to talk about an ingestion pipeline and a query pipeline. These pipelines are abstractions over what we saw in the previous chapter and in the previous modules. What we saw before were the low-level APIs. I think it's important for you to understand how the low-level APIs work so that the high-level stuff isn't confusing to you. So we're going to first talk about the components of a RAG system, then the ingestion pipeline and the query pipeline. I'll touch on prompt engineering for RAG, and then we'll talk about data preparation. Finally, we'll put all of that together into an end-to-end pipeline, and this is going to be called the Naive RAG pipeline. I'll close off this section by talking about the drawbacks of Naive RAG. From there, we're going to build on RAG, and we'll talk about advanced techniques and modular techniques as well. So let's go ahead and get right to it. Let's talk about the components of a RAG system.

At a high level, a RAG system is a three-step process. We start with a user query that comes in. We take that user query and search the knowledge base for context that's related to it. We then take that retrieved context and add it to a prompt, and this gets sent to the language model. The language model does what it does: it takes in that prompt, that string of text, and generates an output. This three-step process has several different components that we've been introduced to over the previous modules. There's the language model, the prompt, document loaders, document chunkers (or node parsers), the embedding model, the vector store, the vector store retriever, and the user input. So even though it's roughly a three-step process, there are a lot of different moving pieces. We've touched on all of these in the previous chapter. And as we'll see throughout the course, there is a wide variety of configurations for each one of these core components of a RAG system, and we'll learn how we can manipulate those components to define several different types of retrieval strategies.

So now that you're familiar with the core concepts in LlamaIndex and the components that you need for a RAG system, let's talk about how we put these together into several different subsystems. These subsystems, the indexing, retrieval, and augment subsystems, work together in an orchestrated manner, transforming a user's query into a contextually rich and accurate response from the LLM.

Let's start by discussing the indexing system. The indexing system prepares and organizes data for retrieval, and there are several steps that happen here. First, we load documents. Those documents then get split up into smaller, more manageable chunks. Those document chunks get sent to an embedding model, which creates the embeddings and pushes them to a vector database, and that is where we store our embeddings.
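To make that indexing flow concrete, here is a minimal sketch of what those steps might look like in LlamaIndex. It assumes your documents live in a local "data/" folder and that a default embedding model (for example, OpenAI's, via an API key) is configured; the folder name and chunk sizes are purely illustrative.

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

# 1. Load documents from disk ("data/" is a placeholder path)
documents = SimpleDirectoryReader("data").load_data()

# 2. Split the documents into smaller, more manageable chunks (nodes)
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=64)
nodes = splitter.get_nodes_from_documents(documents)

# 3. Embed the nodes and store them; by default, VectorStoreIndex uses the
#    configured embedding model and an in-memory vector store
index = VectorStoreIndex(nodes)
```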
Next, we have the retrieval system. The retrieval system fetches the most relevant information and context based on the user's query. We start with the user's query, and we embed that query using the same embedding model that we used to embed our documents in the indexing stage. With that embedding, we perform a vector search, looking for stored embeddings that closely match the user's query, and the matching chunks get returned to us. These returned snippets, these chunks of context, are then sent to the augment system so that we can generate a response.

The augment system is actually the second half of this image here. The augment system enhances, or augments, the prompt that gets sent to the LLM with the retrieved context, and this ensures that our model has the necessary information to generate a response. We create an initial prompt, which starts with the user's original question or query, and we augment this prompt with the additional context that we retrieved from the vector store. Now we have an enriched input for the LLM. This gets sent to the LLM, and the LLM generates a response, which is then sent to the user.

So it's these subsystems, the indexing, retrieval, and augment systems, that make up the whole RAG system, and this helps us get more accurate, credible, and contextually relevant outputs. Now, with this in mind, we're going to talk about the higher-level abstractions in LlamaIndex, the ingestion pipeline and the query pipeline, and we'll see how we can start to build an actual end-to-end RAG system.
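Continuing from the `index` built in the indexing sketch above, here is one hedged way the retrieval and augment steps could look in LlamaIndex. The query text, the `similarity_top_k` value, and the prompt wording are illustrative, and it assumes a default LLM is configured.

```python
from llama_index.core import PromptTemplate

query_str = "What is retrieval-augmented generation?"  # example user query

# Retrieval: embed the query and fetch the closest-matching chunks
retriever = index.as_retriever(similarity_top_k=3)
retrieved_nodes = retriever.retrieve(query_str)

# Augment: add the retrieved chunks to a prompt for the LLM
context_str = "\n\n".join(node.get_content() for node in retrieved_nodes)
qa_prompt = PromptTemplate(
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Given the context information, answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)
enriched_prompt = qa_prompt.format(context_str=context_str, query_str=query_str)

# In practice, a query engine bundles retrieval, augmentation, and generation
query_engine = index.as_query_engine(similarity_top_k=3)
response = query_engine.query(query_str)
```

The query engine at the end performs all three steps, retrieve, augment, and generate, in a single call, which is essentially what the end-to-end Naive RAG pipeline later in this chapter will do.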