From the course: Hands-On AI: RAG using LlamaIndex

Optimizing chunk size

- [Instructor] By this point in the course, you are an expert in the Naive RAG paradigm. And now it's time to go deeper with advanced RAG. We're going to start by talking about pre-retrieval and indexing techniques. The lowest-hanging fruit among these is the chunk size. And I know you're probably asking yourself, does chunk size matter for a RAG system? The short answer is yes, it does. But why? Well, think about what happens when you're building a RAG system. We have a bunch of documents that we need our LLM to have access to so that it can generate a response. In order for us to inject context into the prompt that we send to an LLM, we need to do retrieval. And in the retrieval step, we're trying to find the right context to give to the model so that it can generate a good response. And so this process of chunking is where we break our documents into smaller pieces to make vector search easier and more efficient.

But why is chunk size important? Well, we can look at it from the perspective of some metrics. For example, chunk size has a direct impact on average response time, average faithfulness, and average relevancy. Poor performance on these metrics often means the retrieval step isn't finding the right context, and you can improve the retrieval step by adjusting the chunk size. It turns out that having the right chunk size will help the system find and use the most relevant information. In this lesson, we are going to dive deep into chunking. I'm going to discuss how it impacts indexing and retrieval, and how we can manipulate the chunk size using LlamaIndex. But remember, the goal is not to chunk for chunking's sake, but to get your data into a format where it can be retrieved effectively.

So let's go ahead and overview this concept of chunking. What is it? What does it mean? In LlamaIndex, the process of chunking means that we split a document into smaller pieces. These are called chunks. The default chunk size in LlamaIndex is 1024 tokens, and the default chunk overlap is 20 tokens. And here we have some jargon to clear out of the way. There's chunk size: chunk size is just the maximum number of tokens in each chunk. Chunk overlap is the number of tokens shared between adjacent chunks; this helps maintain context and prevents information loss. If you opt for a small chunk size, you'll end up with more precise, focused embeddings, and this is good when you're looking for really specific information. If you opt for a larger chunk size, you're going to end up with more general, broader context, and this is great for, for example, summary or overview situations.

So chunk size has a huge impact on your RAG system. I've got a blog here that I recommend checking out, "Chunk Size Matters" by the LlamaIndex team. It's a great read, so go ahead, check it out; I've linked it here. But to highlight the impact that chunk size has, let's consider relevance and granularity. As I just discussed, smaller chunks mean higher granularity. With higher granularity, we might miss the overall context. With a larger chunk, we might retain some of that overall context, but we run the risk of including irrelevant information. As I discussed previously, this directly impacts the faithfulness and relevancy metrics. When selecting a chunk size, you also want to consider the fact that chunk size is going to impact the response generation time.
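To make the knob we're talking about concrete, here's a minimal sketch of setting the chunk size and overlap in LlamaIndex, either globally through Settings or on an explicit splitter. The module paths assume a recent llama-index release, and the values 512 and 50 are purely illustrative.

```python
# A minimal sketch, assuming a recent llama-index release
# (defaults: chunk_size=1024, chunk_overlap=20).
from llama_index.core import Settings
from llama_index.core.node_parser import SentenceSplitter

# Option 1: set the defaults globally; anything that builds nodes from
# documents will pick these values up.
Settings.chunk_size = 512    # maximum tokens per chunk (illustrative value)
Settings.chunk_overlap = 50  # tokens shared between adjacent chunks

# Option 2: configure an explicit splitter and pass it in as a
# transformation wherever nodes are created.
splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
```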
Larger chunks mean that you have more context, which could mean a slower response time from the LLM. So, what should you consider when you're trying to find the right chunk size? One is the data characteristics. If you have long, detailed documents, then you might need a larger chunk size. If your documents are a lot of short, focused passages, then a smaller chunk size might be better. Think about what you need for retrieval. If you need specific details, then you might want smaller chunks. If you want more general, broad context, then maybe you would go for a larger chunk size.

It's also extremely important to talk about the relationship between chunk size and similarity_top_k. Remember that chunk size is going to affect the specificity of your embeddings. A smaller chunk size means more precise embeddings. A larger chunk size will capture more information, but you're going to miss a lot of the finer details. When you reduce the chunk size, your embeddings get more specific, which means you might have more relevant chunks that match a user's query. And so to capture this increased granularity, you want to increase the similarity_top_k parameter. For example, if your chunk size is being halved from 1024 to 512, then you might want to increase similarity_top_k from whatever your value was before. This will ensure that you have comprehensive retrieval and that you're fetching the most relevant results from the vector database.

There are a number of methods you can use to chunk your text in LlamaIndex. You can look at the documentation here. If you go to Component Guides, Loading, Node Parsers and Text Splitters, you'll see all of them here on the side. I've summarized them for you here, but I'm not going to go over all of them with you. Feel free to read through this on your own, because throughout this course I'm only going to focus on a couple of different strategies: the TokenTextSplitter and the SentenceSplitter, which we're going to cover in this lesson. And then in a later lesson, I'll talk to you about the SentenceWindowNodeParser and the SemanticSplitterNodeParser.

Now, before I go into the TokenTextSplitter, I just want to take a moment here to show you this chunk_visualizer. This chunk_visualizer actually uses LangChain's CharacterTextSplitter and RecursiveCharacterTextSplitter under the hood. Don't worry too much about what that is, but it's just a good thing to look at so you can kind of understand the impact of chunk size and overlap on a source text. So I recommend playing around with this so you can get a sense of how chunking works.

Now, let's talk about the TokenTextSplitter. The TokenTextSplitter just chunks up a given string of text into smaller chunks in such a way that each chunk stays within a specified token limit. There are a few things happening under the hood of the TokenTextSplitter. One is the tokenization process: we're utilizing a tokenizer to break down the text into individual tokens, which are just words or subwords. The default tokenizer used in the TokenTextSplitter is the tokenizer for GPT-3.5-Turbo, which is the cl100k_base tokenizer. This is the same tokenizer that's used for the text-embedding-3-small and text-embedding-3-large embedding models. If you are using a different embedding model, then make sure that you are using that model's tokenizer to count the tokens in LlamaIndex. Why is that?
Well, because an embedding model also has a context window, and that context window has some length associated with it. So if you want to ensure that you are not exceeding the length of that context window, use the tokenizer associated with that embedding model to count the tokens for splitting. It's just a good idea. Once the tokenization is completed, the splitter then groups the tokens into chunks so that each chunk stays within that specified chunk size limit. And then there's some handling for the overlap, which maintains context and coherence between chunks in such a way that the last few tokens of one chunk are repeated at the beginning of the next chunk.

If we look into the source code of the TokenTextSplitter, you'll see that there are a few arguments available to you. The ones you need to worry about most are chunk_size, which controls the maximum token count for each chunk (recall that the DEFAULT_CHUNK_SIZE is 1024), and chunk_overlap, which determines the number of overlapping tokens between consecutive chunks and defaults to 20. There is also the separator: you can specify a character to use to split the text into tokens, and the default here is a blank space. There's also backup_separator, which gives some additional characters for splitting if the primary separator isn't enough. There's also include_metadata, which enables or disables the inclusion of metadata within each chunk, and include_prev_next_rel, which enables or disables tracking the relationship between nodes. A point on the order of splitting with the separator and backup separator: first we split by the separator, then by the backup separator, and finally by individual characters. So that's the order of splitting. To use the TokenTextSplitter, you follow the basic usage pattern that you see here.

Now, let's go ahead and see this in action. We'll start by looking at a random piece of text here. This random piece of text is just this string. Let's go ahead and split it into a chunk size of 64 with a 16-token overlap, and take a look at what we have. You'll see that we have taken this one string of text and it has now been split into several strings of text. For example, if we scroll toward the end of one chunk, we see, "I said, for myself, even before I had the money." And at the start of the next chunk, we see again, "I said, for myself, even before I had the money." That's the chunk overlap.

Now, let's look at something interesting here. I can count the number of words in one of those chunks. Let's just take the first element of that list, and this has 53 words in it. All right, but we had said that we wanted a chunk size of 64 tokens, right? Well, let's see how many tokens that string of text turns into. This is just a small function here using tiktoken, which is the tokenizer library from OpenAI. Don't worry too much about the code here. We're just going to run that and instantiate the function. And what we're going to do now is first look at the tokens that the string of text has been split up into. You'll see each token has a b prefix, which just means it's a byte string, and a leading blank space inside a token means that token starts with a space. And you see here the word aspirational has been split up into aspir and ational. Absurdly has been split into absurd and ly. So that's how the tokenizer works. Tokenization is a learned process; I'm not going to go in depth into it now.
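Here's a rough sketch of that walkthrough in code. The chunk size of 64 and the 16-token overlap mirror the lesson, but the sample text and variable names are placeholders rather than the exact notebook contents.

```python
# A sketch of the TokenTextSplitter walkthrough; the text is a placeholder,
# not the exact string used in the course notebook.
import tiktoken
from llama_index.core.node_parser import TokenTextSplitter

text = "..."  # any long string of text you want to chunk

splitter = TokenTextSplitter(
    chunk_size=64,     # maximum tokens per chunk
    chunk_overlap=16,  # tokens repeated at the start of the next chunk
    separator=" ",     # primary split character
)
chunks = splitter.split_text(text)

# Count tokens the same way the splitter does: cl100k_base is the tokenizer
# for GPT-3.5-Turbo and the text-embedding-3-* models.
encoding = tiktoken.get_encoding("cl100k_base")

def count_tokens(s: str) -> int:
    return len(encoding.encode(s))

print(len(chunks[0].split()))   # word count of the first chunk
print(count_tokens(chunks[0]))  # token count (at most 64)

# Inspect how individual words are split into subword tokens (shown as byte strings)
print([encoding.decode_single_token_bytes(t) for t in encoding.encode(chunks[0])])
```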
If you're interested in learning more about tokenizers, go to YouTube and type in Jay Alammar and then tokenization, and you'll see this video here. It's a really, really good video by Jay Alammar that goes into tokenization. Highly recommend it. We can count the number of tokens in that string and see that we have 64 tokens. Awesome. Here, I'm just going to define a function that's going to split some text up for us. It's just wrapping the TokenTextSplitter, because I'm going to use this in a dictionary where I look at a few different chunk_sizes. So let's just go ahead and do that here. You can see here we get a warning that the metadata length is close to the chunk size and the resulting chunks are less than 50 tokens, so consider increasing the chunk size or decreasing the size of your metadata. This is just a warning from LlamaIndex because we're including the metadata as well, so we're going to end up with some chunks that have fewer tokens than the number we specified up here. That's fine. And so now we can see that with token_split_chunk_size_64, we get 61,102 chunks. With a chunk size of 512, we get 1,938 chunks. So you can see that the smaller the chunk size, the more chunks we have.

Now, let's talk about the SentenceSplitter. The SentenceSplitter, as the name suggests, splits text while trying to keep complete sentences and paragraphs together. This is different from the TokenTextSplitter, which focuses on token limits. So, how does it work? First, we do some initial splitting: we divide the text into paragraphs, if you will, using the paragraph_separator. Each one of those is then split using a chunking tokenizer, in this case the PunktSentenceTokenizer from the nltk library, and if those don't yield enough splits, then we have a backup_tokenizer. Then we chunk with sentence awareness: the resulting splits are grouped into chunks in such a way that we keep sentences together. Then we handle overlap as well. And if we look at the source code for the SentenceSplitter, you'll see that there are a few arguments you need to know, and they're actually very similar to the arguments for the TokenTextSplitter. This is the beautiful thing about LlamaIndex as an orchestration framework: even though we have a wide range of choices when it comes to splitters, the usage pattern is pretty much the same for each one.

If you're wondering when to use the SentenceSplitter, here are some points to consider. If you want to preserve complete sentences and paragraphs, use the SentenceSplitter. If you're dealing with text where sentence boundaries are meaningful, use the SentenceSplitter. If you want to avoid having broken sentences at the beginning or end of a chunk, then use the SentenceSplitter.

So let's go ahead and see the impact of the SentenceSplitter. You can see here that the chunk counts differ from what we got with the TokenTextSplitter, the chunks themselves are actually different, and they're of varying size. This is another important point to make: your chunk size does not have to be the same for every single chunk, because these chunks go into the embedding model, and the embedding model, no matter what, will produce a fixed-length vector. We saw this way back at the beginning of the course when I was talking about indexing. So it's okay to have different chunk sizes, because it's all going into an embedding model, and the embedding model is going to produce a fixed-length vector.
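As a rough illustration of that shared usage pattern, here's a sketch comparing chunk counts across a few chunk sizes with the SentenceSplitter. The data directory and the chunk sizes are placeholders, so the counts you get will differ from the ones quoted in the course.

```python
# A sketch comparing chunk counts across chunk sizes with the SentenceSplitter.
# The "./data" directory is a hypothetical placeholder for your own documents.
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

documents = SimpleDirectoryReader("./data").load_data()

chunk_counts = {}
for chunk_size in (64, 128, 256, 512):
    splitter = SentenceSplitter(chunk_size=chunk_size, chunk_overlap=16)
    nodes = splitter.get_nodes_from_documents(documents)
    chunk_counts[f"sentence_split_chunk_size_{chunk_size}"] = len(nodes)

print(chunk_counts)  # smaller chunk sizes yield more (and shorter) chunks
```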
So let's go ahead and do this. We're creating some splits here, and again we're seeing the same warning; don't worry too much about that. And we can see the impact like so: with sentence_split_chunk_size_64, we get 36,583 chunks, which is a little more than half the number of chunks we got with the TokenTextSplitter. So just to recap the difference between the TokenTextSplitter and the SentenceSplitter: the TokenTextSplitter splits the text into chunks based on a specified number of tokens, using the tokenizer to break the text down into individual tokens that are then grouped into chunks of a specified size. The SentenceSplitter, on the other hand, splits the text into chunks based on sentences, looking at sentence boundaries to identify where a sentence begins or ends and then grouping those sentences into chunks.

So I'm going to randomly select a strategy for ingestion, and in this case we're going to do the sentence split with a chunk size of 256. Let's go ahead and copy that. And I'm going to name the collection; let's just call it words of the senpai: wots_sentence_split_chunk_size_256. Just give it a really descriptive name. We can now go through the same pattern that we've seen in the video where we put it all together for Naive RAG. We're going to set up the LLM, set up our embedding model (we're going to use text-embedding-3-small), and then we'll go ahead and set up our vector_store.

Now, let's go ahead and set up our ingestion pipeline. Remember, the splitter that we randomly chose was the SentenceSplitter with a chunk size of 256. We'll instantiate our transformations, so we'll set up the SentenceSplitter and our embedding model. We'll split the nodes using the ingest abstraction that we have from our helper files and go ahead and run that. Ingestion took just under two minutes. Now, if we go to Qdrant, click on Clusters, go down to our cluster here, and open the dashboard, you see that we have our collection here. Awesome. Now we can go ahead and build an index over our vector_store.

All right, let's go ahead now and create the query engine. Note that the response mode by default is refine, and this is going to create and refine an answer by sequentially going through each retrieved text chunk. This means we're making a separate LLM call per retrieved node. So I'm going to change this to the compact mode, which is similar to refine, but we're concatenating the chunks together, and as a result we end up with fewer LLM calls. You can of course visit the LlamaIndex docs, and in the documentation you can see the different response synthesizers that you have available to you. We've touched on this previously as well, but it's here just as a refresher.

We're also going to change the value of similarity_top_k from its default of 2 to 5. This is an arbitrary choice just to keep it simple and illustrate to you that, again, this is a hyperparameter that's under your control. And as I discussed previously, there is that relationship between similarity_top_k and chunk size, so just be mindful of that. Now, another thing to bring up here is the query mode. The query engine has a parameter for vector_store_query_mode, and if you look at the source documentation, you can see that there are several modes that you can choose from.
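Before we pick a query mode, here's a rough sketch of the setup so far, standing in for the helper wrappers from the course repo (the ingest abstraction and create_query_engine in the helpers folder). The Qdrant URL, API key, data directory, and LLM model name are placeholders, not the course's actual values.

```python
# A sketch of the ingestion + query engine setup, standing in for the
# course's helper wrappers. URL, API key, data path, and model names
# are placeholders.
import qdrant_client
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.vector_stores.qdrant import QdrantVectorStore

Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

client = qdrant_client.QdrantClient(url="https://YOUR-CLUSTER.qdrant.io", api_key="YOUR-KEY")
vector_store = QdrantVectorStore(client=client, collection_name="wots_sentence_split_chunk_size_256")

documents = SimpleDirectoryReader("./data").load_data()

# Ingestion: split into sentence-aware chunks, embed, and write to Qdrant.
pipeline = IngestionPipeline(
    transformations=[SentenceSplitter(chunk_size=256), Settings.embed_model],
    vector_store=vector_store,
)
pipeline.run(documents=documents)

# Build an index over the populated vector store, then a query engine.
index = VectorStoreIndex.from_vector_store(vector_store=vector_store)
query_engine = index.as_query_engine(
    response_mode="compact",  # concatenate chunks -> fewer LLM calls than "refine"
    similarity_top_k=5,       # default is 2
)
```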
I'm going to use Maximal Marginal Relevance (MMR) here to illustrate that this is, again, a choice that you have at your disposal. MMR balances relevance and diversity when selecting a subset of items from a larger set. The key idea is to iteratively select items that are highly relevant to the query but different from everything else that's already been selected. This is done by maximizing a score that's composed of two components: relevance and diversity. I'm going to let you read this on your own; I don't want to spend too much time on it, but just know that this is a choice available to you. Meaning, first of all, you can select the vector_store_query_mode, and if you choose MMR, you have the lambda hyperparameter that you can toggle and play around with. And consider the fact that MMR is also going to be impacted by your chunk size, right? So already, just by looking at chunk size and query mode, you can see that there are a lot of moving pieces to a RAG system, because MMR is calculated from the embeddings, and those embeddings, as we discussed previously, are directly impacted by your chunking strategy. So again, I'm going to leave it up to you to experiment with MMR, or not use it at all, or experiment with different lambda values, but the pattern is there for you to see.

So here, we just pass these into our query_engine. Remember that this create_query_engine is a wrapper that I built in our utils folder. You can always go back to the source code here: at the top level of the repository, under helpers, you'll see utils, and you can look at the create_query_engine source code if you want to see what's happening under the hood. But the things you need to look at are this and this: these are the parameters that we use to change up the query mode. And then the query mode itself will have some argument, in this case the mmr_threshold, which I'll set to 0.42.

So now that we have created our query_engine, let's look at the prompt. You can see here that we have a standard prompt, and I'm going to modify it. If we go back here to the helpers and look at prompt, you can see I've defined some prompts here. I've created an ANSWER_GEN_PROMPT, which you can take a look at. It's just saying, you know, "You're a trusted mentor to an adult mentee; they're coming to you with a challenging question. Here's the question, and here are some raw thoughts that you can use to formulate an answer," so on and so forth. So this is the prompt that we're going to use to generate an answer, and of course it's printed out here as well. We're going to make that an actual template and then update the query engine accordingly. And then we can look at the prompt dictionary for the query_engine and see that we have the prompt that we just defined.

All right, next we're going to create our query pipeline. In order to create our query pipeline, we need to have an InputComponent and then create a chain. And remember that the query_engine itself implicitly has the LLM associated with it, so we don't actually need to add another LLM to the chain, because the query engine already has the LLM implicitly associated with it. And if you look at the documentation for as_query_engine, there's this argument here for the LLM, and it gets resolved like this: it's pulling it from the global settings.
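Putting those pieces together, here's a sketch of the MMR query engine, the prompt override, and the query pipeline, using plain LlamaIndex calls rather than the course's create_query_engine wrapper. The ANSWER_GEN_PROMPT text below is a paraphrased stand-in for the one in the course helpers, and `index` is the VectorStoreIndex from the previous sketch.

```python
# A sketch of the MMR query engine, prompt override, and query pipeline.
# ANSWER_GEN_PROMPT is a paraphrased stand-in for the course's prompt;
# `index` is the VectorStoreIndex built in the previous sketch.
from llama_index.core import PromptTemplate
from llama_index.core.query_pipeline import InputComponent, QueryPipeline

query_engine = index.as_query_engine(
    response_mode="compact",
    similarity_top_k=5,
    vector_store_query_mode="mmr",                # Maximal Marginal Relevance
    vector_store_kwargs={"mmr_threshold": 0.42},  # relevance/diversity trade-off
)

ANSWER_GEN_PROMPT = (
    "You're a trusted mentor to an adult mentee, and they're coming to you "
    "with a challenging question.\n\n"
    "Question: {query_str}\n\n"
    "Raw thoughts you can use to formulate an answer:\n{context_str}\n\n"
    "Answer:"
)
query_engine.update_prompts(
    {"response_synthesizer:text_qa_template": PromptTemplate(ANSWER_GEN_PROMPT)}
)

# The query_engine resolves its LLM from the global Settings, so the chain
# only needs an input component followed by the engine itself.
pipeline = QueryPipeline(chain=[InputComponent(), query_engine])
response = pipeline.run(input="How can I become the best in the world at what I do?")
print(response)
```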
So we go ahead and build our query pipeline, and you can verify that the LLM in the settings is coming from the global context, like so. And we can now start to run some queries. Let's see what responses we get. You see the error here; in order to overcome that error, I just have to pass the query as input=. There you go, the error is gone, and we get a response right here. We see some advice: "To become the best in the world at what you do, keep redefining what you do until this is true. Find or build specific knowledge. Sales skills, for example, are a form of specific knowledge." And you can see all of this awesome information here. And of course we can try it again here, making sure we put input=. I'll run that. The question is, how can I set up systems to become the most successful version of myself? Looks like we got a connection timeout; that can happen periodically, I guess. We can rerun it, and we see a response here: "Set up systems instead of specific goals. Use judgment to determine the environment where you can thrive. Create those environments around you." There you go.

And there you have it. You've seen an end-to-end advanced RAG pipeline where all we did was switch up the chunk size, the vector store query mode, and the response mode. There's a lot more to it than this, and we're going to learn more about what we can change in the next video, where I talk about small-to-big retrieval. I'll see you there.
