From the course: Hands-On AI: RAG using LlamaIndex

Prompt compression - LlamaIndex Tutorial


- [Instructor] Suppose you're interacting with a RAG system, an AI system, or a large language model in general, and you're asking it complex questions that require it to draw upon a large amount of background information. Typically, this would require sending a very long prompt to the language model. That, of course, can be slow, it can be expensive, and you might even exceed the model's context window. This is where prompt compression comes in. In this lesson, we're going to talk about a technique called LongLLMLingua. It uses a prompt compression method to drastically shorten the prompt while retaining the most relevant information needed to answer the question. That way, we get faster and more cost-effective generation while still getting high-quality answers.

The key components of LongLLMLingua are question-aware, coarse-grained prompt compression, which evaluates the relevance between the context and the question based on a measure called perplexity; question-aware, fine-grained prompt compression, which uses contrastive perplexity to extract the key tokens that are relevant to the question; and adaptive granular control, which dynamically allocates different compression ratios to different documents based on their rank information. There's also a recovery step that maps the response back from the compressed prompt to the original prompt, so the original content can be recovered.

A number of experiments have been done using LongLLMLingua. It's been shown to improve performance by 21.4 points at a 4x compression rate in RAG scenarios, and to outperform retrieval-based and compression-based methods on long-context benchmarks such as LongBench and ZeroSCROLLS.

So, let's talk about what's going on under the hood with the LongLLMLinguaPostprocessor. Of course, I've linked to the source code, so if you have any questions or are wondering how this works in great detail, look at the source code. But I'll tell you about the arguments you need to be aware of. This post processor optimizes nodes by compressing their context using the LongLLMLingua method we just talked about: it shortens the node text based on the query, with the goal of improving efficiency and reducing computational cost.

The first argument is the model name, the pre-trained language model used for compression. By default, it uses an open-source model from Hugging Face, specifically a version of Llama-2-7B released by NousResearch. Now, we're not going to run this method here because it requires a GPU. But if you're running this in a Colab environment, or anywhere else where you have access to a GPU, you'll also need to set a device map, and there are additional model configurations you can pass as well. There's a configuration for the OpenAI API key if you choose to use that. There's the metadata mode that we've talked about before. There's an instruction string, which is the instruction used for the context compression. You've got a target token count, which is the number of tokens you want to compress down to, and a ranking method used to rank content for compression, which defaults to longllmlingua. And then there are some additional keyword arguments for the compression.
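To make those arguments concrete, here's a minimal sketch of instantiating the post processor. The import path, package name, and default values shown here are assumptions based on recent versions of LlamaIndex and its separate longllmlingua integration package, so check the linked source code for your version; the parameters mirror the arguments described above.

```python
# Assumed setup: pip install llama-index llama-index-postprocessor-longllmlingua llmlingua
# Loading the compression model requires a GPU (e.g., a Colab runtime).
from llama_index.postprocessor.longllmlingua import LongLLMLinguaPostprocessor

node_postprocessor = LongLLMLinguaPostprocessor(
    model_name="NousResearch/Llama-2-7b-hf",  # pre-trained model used for compression (default)
    device_map="cuda",                        # device map for running on a GPU
    instruction_str="Given the context, please answer the final question",
    target_token=300,                         # number of tokens to compress down to
    rank_method="longllmlingua",              # ranking method for coarse-grained compression
    additional_compress_kwargs={
        "condition_compare": True,
        "condition_in_question": "after",
        "context_budget": "+100",
        "reorder_context": "sort",            # reorder documents by rank before compressing
    },
)
```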
Under the hood, the post processor extracts the content of each node based on the metadata mode, splits the context text by new lines, and calls a compress-prompt method, passing in the context texts, the instruction, the query, the target token count, the ranking method, and the additional keyword arguments. It then splits the compressed prompt back into separate compressed context texts, removes the question and the instruction, and finally uses the remaining compressed text to create new, optimized nodes. Note that of the two compression methods we talked about, the question-aware, fine-grained compression is not yet implemented in LlamaIndex. At the moment, compression is primarily based on the coarse-grained approach with the ranking method you pass in.

This is, of course, a post processor, so we'll go ahead and instantiate a query engine, a response synthesizer, and the compressor itself. We'll instantiate the node post processor, passing in all the arguments we want to set. It has the same usage pattern as what we've seen over the last couple of videos, so it should look very familiar: instantiate your post processor, pass it to your query engine, and then get your response (there's a short sketch of this pattern below). And of course, you've seen how we can use this in a pipeline as well.

So, to wrap it up, LongLLMLingua is a very powerful technique that will help improve the efficiency and performance of your RAG system, especially when you have long contexts or complex queries. We're intelligently compressing the prompts while preserving key information, so we get faster, cheaper generation while maintaining high-quality responses. I encourage you to check out the LLMLingua paper, as well as dive into the source code, so you can fully grasp the details and think about how you might apply this in your own work. I'll see you in the next video, where we're going to talk about self-correcting RAG.
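Here's the rough sketch of the usage pattern mentioned above, wiring the compressor into a retriever query engine alongside a response synthesizer. It assumes an `index` built in the earlier lessons and the `node_postprocessor` from the sketch above; the question string is a placeholder, and exact import paths may differ by LlamaIndex version.

```python
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.response_synthesizers import CompactAndRefine

# `index` is assumed to be a vector index built in an earlier lesson;
# `node_postprocessor` is the LongLLMLinguaPostprocessor instantiated above.
retriever = index.as_retriever(similarity_top_k=5)
synthesizer = CompactAndRefine()

query_engine = RetrieverQueryEngine.from_args(
    retriever,
    response_synthesizer=synthesizer,
    node_postprocessors=[node_postprocessor],  # compression runs on the retrieved nodes
)

response = query_engine.query("Your complex, long-context question here")
print(response)
```

Because it's a standard node post processor, the same object can also be dropped into a query pipeline, just as in the previous videos.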
