From the course: Hands-On AI: RAG using LlamaIndex
Prompt engineering for RAG
- [Instructor] In this video, I'm going to show you how to customize the prompt template associated with your query engine. Before we proceed, you should install the following packages in your environment, because we're going to import some helper functions that require them. A heads-up that I won't be using FastEmbed or Mistral AI throughout the course, but I have made the functions extensible enough that, should you choose a free option like FastEmbed for an embedding model, or if you want to hack around with Mistral AI, it will be easy for you to do so.

So let's go ahead and get started. We start with our imports, and this sys.path append just makes it so that we can import our helper functions. We instantiate our API keys, and now we can use our helper functions to set up our LLM, embedding model, and vector store, and to create an index. You can see that we save quite a few lines of code because everything has been abstracted away in the helper files. If you ever need to remind yourself what's happening under the hood, you can go to the helpers' utils module and see all the code. By this point in the course, we've seen these patterns so many times that I think we can save ourselves the hassle of notebooks filled with repetitive code. You'll be seeing these helper functions over and over again, but overall I'm saving you the hassle of reading a ton of boilerplate. So let's set up our LLM, set up our embedding model, create our vector store, instantiate the storage context, and build our index (a rough sketch of what these helpers wrap appears after this section).

Now I want to import another couple of helper functions. One of them is the create_query_engine helper. If you look at the source code for utils and do a Ctrl+F for create_query_engine, you'll see there are a few different modes you can pass: chat, query, or retrieve. This just makes it easy for us to use these different abstractions on top of the index. I'm going to use the query mode, so let's instantiate that. You'll see that the query engine actually has some default prompts associated with it, and we can update those defaults in a fairly straightforward manner. All we have to do is define the custom prompt as a string with the variable names as placeholders, then create a prompt template and update the prompts like so. We can now see that our prompt has been updated (a sketch of this step is also included below). Awesome.

So now let's go ahead and build out a query pipeline. I've got a helper function that is an abstraction over the query pipeline. We need to create an input component, and then we can construct our chain in the following way: we have an input component, and that input is passed to the query engine. You might be wondering why I did not pass an LLM here. That's because we're using Settings.llm; the query engine itself has an argument for an LLM, but because we're using Settings.llm, it gets picked up implicitly. Let's go ahead and construct our query pipeline. We can run the pipeline, and here I'm just saying: remix the Rudyard Kipling poem, "If," and we can see the output. Interestingly enough, it's a short haiku: "Terms so strict, yet Gutenberg's license shines a path to knowledge." That's interesting.
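In case it helps to see what those setup helpers roughly boil down to, here is a minimal sketch, assuming an OpenAI LLM and embedding model, an in-memory Qdrant vector store, and a toy document. The actual helper names, models, and data in the course's utils module may differ.

```python
# Minimal sketch of what the setup helpers roughly wrap (assumption: OpenAI models and an
# in-memory Qdrant vector store; the course's utils module may use different models/config).
# Assumes OPENAI_API_KEY is set in the environment.
import qdrant_client
from llama_index.core import Document, Settings, StorageContext, VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.vector_stores.qdrant import QdrantVectorStore

# Set the global LLM and embedding model once; downstream components that are not given
# an explicit model fall back to these Settings implicitly.
Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# Vector store and storage context, then build the index over the documents.
client = qdrant_client.QdrantClient(location=":memory:")
vector_store = QdrantVectorStore(client=client, collection_name="prompt_engineering_demo")
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Toy document standing in for the course corpus.
documents = [Document(text="If you can keep your head when all about you are losing theirs...")]
index = VectorStoreIndex.from_documents(documents, storage_context=storage_context)
```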
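Next, a sketch of creating the query engine and swapping in a custom prompt template. The template text and similarity_top_k value below are illustrative, not the course's exact prompt, but get_prompts, PromptTemplate, and update_prompts are the standard LlamaIndex calls for this.

```python
from llama_index.core import PromptTemplate

# Roughly what create_query_engine(mode="query") does under the hood.
query_engine = index.as_query_engine(similarity_top_k=2)

# Inspect the default prompts attached to the query engine.
for name, prompt in query_engine.get_prompts().items():
    print(name)

# Define a custom prompt as a string with {context_str} and {query_str} placeholders,
# wrap it in a PromptTemplate, and update the engine's text QA template.
custom_qa_prompt_str = (
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Using only the context above and no prior knowledge, answer the query.\n"
    "Query: {query_str}\n"
    "Answer: "
)
custom_qa_prompt = PromptTemplate(custom_qa_prompt_str)
query_engine.update_prompts(
    {"response_synthesizer:text_qa_template": custom_qa_prompt}
)
```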
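And a sketch of the query pipeline chain described above: an input component feeding the query engine, with no explicit LLM in the chain because Settings.llm is used inside the query engine implicitly. The query string is just the remix example from the video.

```python
from llama_index.core.query_pipeline import InputComponent, QueryPipeline

# Input component -> query engine; no LLM module needed since Settings.llm is implicit.
input_component = InputComponent()
pipeline = QueryPipeline(chain=[input_component, query_engine], verbose=True)

response = pipeline.run(input="Remix the Rudyard Kipling poem 'If' as a haiku.")
print(response)
```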
Now I want to see what happens if we pass Settings.llm into the chain: does that change the response? You can see that it does, because we're taking the haiku and passing it back to another LLM instance, so it switches up the response that we get. I just wanted to point that out. We can remove that, and maybe we can even try another query here, something along the lines of "If you could keep your head when all around you are losing their cool and blaming it on you," a line from Rudyard Kipling's poem, "If," and let's see what happens. You can see the query engine running; let's see what the response looks like. There you go. Actually, I need to update this first, so let's rerun that and reinstantiate the query pipeline. Now that the query pipeline is reinstantiated with the correct sequence in the chain, we can probably get a better response. There we go. (A sketch of the chain with an explicit LLM step is included at the end of this section.)

Now I want to talk about response synthesizers. There's a ton of documentation in LlamaIndex about response synthesizers, and I've linked to it here. You can go to the component guide under Querying, and under Response Synthesis you'll get an in-depth walkthrough of the response synthesizers. Let's go ahead and see one in action. I'm going to import the response mode, and you can see there are a few different response modes we can use. I've gone ahead and described them here for you, and you can also find more detailed notes in the documentation. Very briefly, we have accumulate, compact, compact_accumulate, generation, no_text, refine, simple_summarize, and tree_summarize. I'm going to breeze through these and be respectful of your time; you can always pause the video, go look at the notebook, and read this yourself. But just to touch on a few: refine is an iterative method that generates a response by going through the retrieved chunks one at a time and refining the answer as it goes. Compact is similar, but it first combines the text chunks into larger consolidated chunks, so fewer LLM calls are needed. Simple_summarize merges all the text chunks into one and summarizes them in a single call. Tree_summarize, which we've seen used before, builds a tree of summaries over the candidate nodes in a bottom-up manner. Generation mode just ignores the context and generates a response from the query alone. No_text, accumulate, and compact_accumulate I'll let you read about on your own.

So let's go ahead now and build out our response synthesizer. We've instantiated it here, we create our query engine, and we run the cell to get a response (a sketch of this setup also follows at the end of this section). You can see it's cranking away, and here's the response: success and failure are relative concepts, and it's all about how we choose to perceive and interpret them, so on and so forth. And this is just a glimpse into a little bit of prompt engineering; really, it's just updating the prompt templates for the query engines. We'll see more about prompt engineering and manipulating prompts later in the course when we look at advanced RAG and modular RAG. I'll see you in the next video, where we'll prepare the PDFs we downloaded oh-so-long-ago and get them ready to be ingested into a Qdrant collection.
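Here is a sketch of the variant discussed above, where Settings.llm is appended to the chain so the query engine's answer gets passed through the LLM a second time, which is why the output changes. The query string is just the earlier remix example; removing the trailing LLM and rebuilding the pipeline restores the original behavior.

```python
from llama_index.core import Settings
from llama_index.core.query_pipeline import InputComponent, QueryPipeline

# Chain variant: the query engine's answer is fed into Settings.llm as a second pass.
pipeline_with_llm = QueryPipeline(
    chain=[InputComponent(), query_engine, Settings.llm], verbose=True
)
print(pipeline_with_llm.run(input="Remix the Rudyard Kipling poem 'If' as a haiku."))

# Dropping the trailing LLM goes back to the original behavior; remember to re-run the
# cell that builds the pipeline after editing the chain, or you'll keep querying the old one.
pipeline = QueryPipeline(chain=[InputComponent(), query_engine], verbose=True)
```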
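And a sketch of building a response synthesizer and a query engine on top of it. ResponseMode.REFINE, the similarity_top_k value, and the query string are illustrative choices here, not necessarily what the course notebook uses; swap in COMPACT, TREE_SUMMARIZE, and the others to compare the outputs.

```python
from llama_index.core import get_response_synthesizer
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.response_synthesizers import ResponseMode

# Build a response synthesizer with an explicit response mode, then a query engine on top of it.
response_synthesizer = get_response_synthesizer(response_mode=ResponseMode.REFINE)

query_engine_refine = RetrieverQueryEngine(
    retriever=index.as_retriever(similarity_top_k=3),
    response_synthesizer=response_synthesizer,
)

response = query_engine_refine.query("What does the text say about success and failure?")
print(response)
```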