Using LLMs - LlamaIndex Tutorial

From the course: Hands-On AI: RAG using LlamaIndex

- [Instructor] Let's begin our deep dive into the core components of LlamaIndex, starting with how to use an LLM. If you're running this on Codespaces, make sure you select the kernel and connect to the environment we set up. If you're running this on Google Colab, make sure you run this cell to install the appropriate libraries. We'll go ahead and do a couple of imports here and load our dotenv file. This line of code is essentially saying: if the environment variable is present, grab it; if not, we'll be prompted to enter our API key.

All right, so let's talk now about using LLMs. When you're building an LLM-based application, one of the first decisions is which LLM to use, and you can actually use more than one if you wish. The LLM is used at different stages of the pipeline: during indexing and during querying. During indexing, we can use it to judge the relevance of the data, whether we should index it or not, and we can also summarize data and index based on those summaries. During querying, an LLM is used for retrieval and response synthesis. When it's used for retrieval, it's fetching data from the index, choosing the best data source from the options, and maybe even using some tools to fetch data. LlamaIndex gives us a nice single interface to connect to various LLMs, so you can easily pass in any LLM you choose at any stage of the pipeline. In this course, we're primarily using OpenAI and Cohere. Remember that if you want to see the full list of LLM integrations, it's available in the docs as well as on GitHub, so you can check that out if you'd like.

For this lesson, we're going to use Cohere. The first thing we do is import our language model from Cohere. We're going to use the command-r-plus model and set the temperature to 0.2. You can experiment with a variety of different parameters if you'd like. For example, if you're curious what you can pass as an argument to Cohere, go to the LlamaIndex GitHub, go down to the integrations, look at the LLM integrations, click on Cohere, go to base, and scroll down until you see the Cohere LLM class; there you'll find all the different arguments you can pass to the language model. For example, you could pass temperature, max tokens, and some other keyword arguments that come from the Cohere API. You can also go to the Cohere documentation and look under the models or the text generation API to see what different arguments are available. I'm only ever really going to manipulate the temperature.

To get a response from the language model, we just call llm.complete. In this case, I'm asking the language model to complete a sentence, so we can run that and we'll get a response: Alexander the Great was a king of the ancient Greek kingdom of Macedon and a legendary military commander. You see here all we did was pass a string, but you can also use a prompt template in LlamaIndex. A prompt template can also be used to build an index, perform insertions, traverse during querying, and synthesize the final answer. There are several built-in prompt templates in LlamaIndex, but what I'm going to show you is how to create one from scratch. We'll start by importing PromptTemplate from llama_index.core.
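Here is a minimal sketch of the setup described so far. It assumes the llama-index-llms-cohere integration package is installed, that the key lives in an environment variable named COHERE_API_KEY (the notebook's variable name may differ), and the particular "thing" and "style" values are just illustrative placeholders:

```python
import os
from getpass import getpass

from dotenv import load_dotenv
from llama_index.core import PromptTemplate
from llama_index.llms.cohere import Cohere

# Load variables from a .env file; fall back to prompting for the key.
load_dotenv()
api_key = os.environ.get("COHERE_API_KEY") or getpass("Enter your Cohere API key: ")

# Instantiate the Cohere LLM with the command-r-plus model and a low temperature.
llm = Cohere(model="command-r-plus", temperature=0.2, api_key=api_key)

# Simple completion: pass a plain string and get a completion back.
response = llm.complete("Alexander the Great was ")
print(response)

# A prompt template with placeholders that we format before sending to the LLM.
template = PromptTemplate("Write a song about {thing} in the style of {style}.")
prompt = template.format(thing="a lazy Sunday morning", style="a rapper")
print(llm.complete(prompt))
```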
I'm going to set up a template here: write a song about, with a placeholder for the thing, in the style of, with a placeholder for the style. To construct my prompt, I take that template and format it with the thing and the style. We can go ahead and run that and we'll get a response. You can see here it gives us a rap in that particular style.

You can also use chat messages in LlamaIndex. Chat messages are essentially just a list of messages, the back and forth between the system, or the language model, and the user. Here, I've got a list of chat messages that I'm passing to the language model, and I'm going to ask it to create a response. Notice that instead of calling .complete like I did before, I'm using .chat, and you can see here that we have the response like so.

You can also create a chat prompt template, and it's done in much the same way as a regular prompt template. We instantiate our language model, we have a list of chat messages, and the content of one of the chat messages is actually a prompt template. We construct our chat prompt template from the list of chat messages and then format the prompt variable like so. And you can see here, we get an answer.

Now, sitting around waiting for the answer is not the best user experience. LlamaIndex allows you to stream the response from the LLM provider as well, and that's what we're going to do here. Everything is the same, except I'm calling the stream_chat method of the language model. You can see we get a much better user experience: instead of sitting there waiting for the response to complete, we're seeing it happen in real time.

LlamaIndex also has a chat engine. Instead of just sending messages to an LLM and getting something back, we can have a back-and-forth, chat-style conversation with the LLM. When you run this, you'll notice something pop up here, where we can go ahead and type a message. Maybe the message could be something like, "Hey, how do I learn something new?" Hit Command + Enter and we'll eventually get a response. You can see here we get the response, and we can continue the conversation if we'd like, or just hit Exit to exit.

Note that there's actually another method we can call on the chat engine. If you're ever curious about what methods are available to you for any LlamaIndex module, notice here that the chat engine is instantiated as a SimpleChatEngine using from_defaults. If I want to see the different methods available on the SimpleChatEngine, I can just call dir on it, and we can see that there's a streaming_chat_repl method we can use as well. So we can do chat_engine.streaming_chat_repl(), and you'll see the same box pop up. I'll ask it how to learn something new, and you can see it's a streaming output, so it makes for a much better user experience.

Great. So that is the basics of working with language models in LlamaIndex. This is an important component of a RAG system because this is how we generate our response. In the next lesson, I'll show you how to load data using LlamaIndex.
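As a rough sketch of the chat-oriented pieces covered above, reusing the same Cohere llm object from the earlier snippet; the system prompt, the questions, and the {topic} placeholder are illustrative assumptions rather than the exact values used in the course notebook:

```python
from llama_index.core import ChatPromptTemplate
from llama_index.core.chat_engine import SimpleChatEngine
from llama_index.core.llms import ChatMessage

# A chat is a list of ChatMessage objects passed to llm.chat instead of llm.complete.
messages = [
    ChatMessage(role="system", content="You are a helpful assistant."),
    ChatMessage(role="user", content="How do I learn something new?"),
]
print(llm.chat(messages))

# A chat prompt template: one of the messages contains a {topic} placeholder
# that we fill in with format_messages before sending to the LLM.
chat_template = ChatPromptTemplate(
    message_templates=[
        ChatMessage(role="system", content="You are a helpful assistant."),
        ChatMessage(role="user", content="Tell me a fun fact about {topic}."),
    ]
)
print(llm.chat(chat_template.format_messages(topic="the Roman Empire")))

# Streaming: iterate over the generator and print each delta as it arrives.
for chunk in llm.stream_chat(messages):
    print(chunk.delta, end="", flush=True)

# A simple chat engine for a back-and-forth, REPL-style conversation.
chat_engine = SimpleChatEngine.from_defaults(llm=llm)
print([m for m in dir(chat_engine) if "repl" in m])  # e.g. chat_repl, streaming_chat_repl
chat_engine.chat_repl()            # interactive loop; type "exit" to quit
chat_engine.streaming_chat_repl()  # same loop, but streams the responses
```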
