From the course: Advanced RAG Applications with Vector Databases

Chunking considerations

- [Instructor] The first pre-processing step to discuss is chunking. Chunking is how we make documents consumable for generative AI use cases like RAG. It is the process of splitting documents into smaller chunks of text. These chunks need to be small enough to be consumable, coherent, and contextual, so let's break that down.

What makes a chunk consumable? From a technical perspective, a consumable chunk must fit into the context window of your chosen embedding model. On top of that, at least three of these chunks must fit into the context window of your chosen LLM. The exact number of chunks that have to fit into your LLM's context window depends on the top K you've chosen for your vector database retrieval; we'll touch more on embedding models, as well as top K for retrievals and vector databases, later. From a common sense perspective, you want to ensure that you can consume, or read, your chunk in one go.

What makes a chunk coherent? A coherent chunk is one that makes sense. If you read a chunk of text and it makes you go, "Huh?", that is not a coherent chunk. From a technical perspective, you want to ensure your chunks don't start or stop in the middle of a word, clause, or sentence. From a common sense perspective, you want your chunks of text to be sets of complete thoughts. For example, "Curiosity killed the cat" is a coherent chunk; "killed the" is not.

The last C of chunking is contextual. What makes a chunk contextual? This one is a little different from the other two: the technical and common sense perspectives are more or less the same thing. The idea behind contextual chunking is that each chunk contains all the context necessary to answer a question. For example, "Curiosity killed the cat" may be a coherent chunk, but it is often taken out of context of the full saying, which is "Curiosity killed the cat, but satisfaction brought it back."
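To make the consumable requirement concrete, here's a small sketch of the arithmetic. The function and all of its default limits (embedding window, LLM window, top K, prompt overhead) are illustrative assumptions, not values from any particular model:

```python
# Sketch: checking that chunks are "consumable" for a RAG pipeline.
# All window sizes and the prompt overhead below are illustrative
# assumptions; substitute the limits of your actual models.

def fits_pipeline(chunk_tokens: int,
                  embed_context_window: int = 512,
                  llm_context_window: int = 4096,
                  top_k: int = 3,
                  prompt_overhead_tokens: int = 500) -> bool:
    """Return True if a chunk of `chunk_tokens` tokens fits the
    embedding model, and `top_k` retrieved chunks plus the prompt
    still fit the LLM's context window."""
    fits_embedder = chunk_tokens <= embed_context_window
    fits_llm = top_k * chunk_tokens + prompt_overhead_tokens <= llm_context_window
    return fits_embedder and fits_llm

print(fits_pipeline(400))  # True: 400 <= 512, and 3*400 + 500 = 1700 <= 4096
print(fits_pipeline(600))  # False: 600 exceeds the embedding window
```

The same check explains why top K matters: raising `top_k` shrinks the largest chunk size your LLM prompt can tolerate.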
If you take things out of context, you may not derive their full meaning. As Stewart Stafford said, "Separate text from context and all that remains is a con."

When it comes to chunking, there are three major considerations to take into account: the size of your chunks, how much consecutive chunks overlap with each other, and whether or not to use special characters to mark where to split chunks. Chunk size is a pretty self-explanatory term: it refers to the number of characters in a chunk. Picking your chunk size depends largely on the structure of your data, and we'll look at some examples later. For reference, most paragraphs are about 100 words, or 500 characters, and that's a good starting point for your chunk size. Depending on which method you use to chunk your data, your chunk size may be treated as a hard limit or just a guideline; more on this later.

Much like chunk size, chunk overlap is a relatively self-explanatory term: it refers to the number of characters repeated between consecutive chunks. Why would you want overlapping sections between different chunks of data? There are two ways to think about how this helps. First, you can think of chunk overlap as a way to preserve context between chunks: if you carry over the last sentence or paragraph of the previous chunk, you have extra context for the current chunk. Second, you can think of it as a tool to help reinforce the guidelines of chunks being consumable, coherent, and contextual.

Finally, special characters. Unfortunately, there is no industry standard for what these are called, and I couldn't really come up with a better name, but these are the characters you want to split your text on. They can be used in conjunction with chunk size and chunk overlap, letting you relax restrictions around size and overlap to create more coherent chunks.
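The size-and-overlap idea can be sketched in a few lines. This is a minimal character-based chunker, assuming the ~500-character starting size from the text and an illustrative 50-character overlap:

```python
# Sketch: fixed-size character chunking with overlap.
# chunk_size=500 follows the "one paragraph" rule of thumb above;
# chunk_overlap=50 is an illustrative choice, not a recommendation.

def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50):
    """Split `text` into chunks of at most `chunk_size` characters,
    where each chunk repeats the last `chunk_overlap` characters
    of the chunk before it."""
    step = chunk_size - chunk_overlap  # advance less than a full chunk
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

doc = "x" * 1200
chunks = chunk_text(doc)
print(len(chunks))               # 3 chunks, starting at offsets 0, 450, 900
print([len(c) for c in chunks])  # [500, 500, 300]
```

Note that the overlap means each character near a boundary appears in two chunks, which is exactly the context-preservation effect described above.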
For example, let's say you want to ensure all your splits land on complete sentences, but your chunk size, the number of characters per chunk, doesn't always end on a complete sentence. What do you do? You can use special characters to relax this restriction by allowing your chunks to run over the size limit and end at the next period, double newline, single newline, or any other special character you choose.
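That relaxation can be sketched as follows. This is a minimal illustration, assuming sentence-ending punctuation as the special characters; the separator set and the helper name are mine, not from any particular library:

```python
# Sketch: treating chunk_size as a guideline rather than a hard limit.
# A chunk grows past the target size until it reaches the next
# separator, so splits always land on complete sentences.
# The separators tuple is an illustrative assumption.

def chunk_on_separators(text: str, chunk_size: int = 500,
                        separators: str = ".!?") -> list:
    """Split `text` into chunks of roughly `chunk_size` characters,
    extending each chunk to the next separator character."""
    chunks, start = [], 0
    while start < len(text):
        end = start + chunk_size
        # Relax the size limit: walk forward to the next separator.
        while end < len(text) and text[end - 1] not in separators:
            end += 1
        chunks.append(text[start:end].strip())
        start = end
    return chunks

doc = "Curiosity killed the cat. But satisfaction brought it back."
print(chunk_on_separators(doc, chunk_size=10))
# ['Curiosity killed the cat.', 'But satisfaction brought it back.']
```

Even with a 10-character target, both chunks end on a period, trading strict size for coherence.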
