From the course: Advanced RAG Applications with Vector Databases

Metadata

- [Instructor] Metadata is the final piece of what makes vector databases useful. Without storing metadata, we would just be comparing a bunch of numbers. The term metadata encompasses all of the data that gets stored with your vector embeddings. When it comes to retrieval-augmented generation, we definitely need to store the actual text that the vector embedding was generated from, and we can also store many other types of metadata.

So what is metadata, other than the text itself? There are many different types of metadata, and you can think of it in many different ways. It's the data in your vector database that isn't the embeddings. A lot of this data falls into the category of data that gets stored in traditional databases, and we'll cover more examples later in this video. You also need to remember that metadata is critical for RAG. It's critical not just for performing basic RAG, by providing the text and unvectorized data, but also for advanced usage like filtering.

I would split metadata into two general types: chunking metadata and non-chunking metadata. This categorization is based on where the metadata is coming from. Chunking metadata is metadata that comes out of the chunking process. Examples of chunking metadata include the sentence number, the subtitle, or the section header. You can think about this type of metadata as the metadata that tells you where in the document the current chunk you're working with comes from. The main usage for this metadata is context and filtering. You can use chunking metadata to understand more about the context of a chunk, such as through the subtitles, as well as to filter the chunks. For example, you may want chunks only from a certain section.

The other type of metadata is non-chunking metadata. All this means is that the metadata was neither produced by nor tied to the chunking process. Examples of non-chunking metadata include the author, the last time an entry was updated, or the document title.
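To make the two categories concrete, here is a minimal sketch of a chunker that attaches both kinds of metadata to each chunk. Everything in it is an assumption for illustration: the `Chunk` class, the `chunk_document` function, the convention that lines starting with `# ` are section headers, and the sample author, title, and date. It is not the chunking method used in this course.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Chunk:
    """One chunk of text plus the metadata stored with it (hypothetical type)."""
    text: str
    metadata: dict = field(default_factory=dict)


def chunk_document(doc_text: str, doc_metadata: dict) -> list[Chunk]:
    """Treat '# ' lines as section headers and every other non-empty line as one chunk."""
    chunks = []
    section_header = None
    sentence_number = 0
    for line in doc_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("# "):
            section_header = line[2:]
            continue
        sentence_number += 1
        metadata = {
            # Chunking metadata: produced by the chunking process itself.
            "section_header": section_header,
            "sentence_number": sentence_number,
            # Non-chunking metadata: document-level facts copied onto every chunk.
            **doc_metadata,
        }
        chunks.append(Chunk(text=line, metadata=metadata))
    return chunks


doc = """# Intro
Vector databases store embeddings.
# Metadata
Metadata is stored alongside each embedding.
It enables filtering.
"""
# Assumed sample values, not taken from the course.
doc_meta = {"author": "Jane Doe", "title": "Vector DB Notes", "last_updated": date(2024, 5, 1)}
chunks = chunk_document(doc, doc_meta)

for chunk in chunks:
    print(chunk.metadata["section_header"], chunk.metadata["sentence_number"], chunk.text)
```

The section header and sentence number tell you where in the document a chunk comes from, while the author, title, and update date are the same for every chunk of that document.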
The main usage for non-chunking metadata is filtering your search. For example, you may only want data that was written by you or updated in the last month.

So how can we store metadata? As we mentioned before, a lot of metadata, and almost all of the non-chunking metadata, is data that was traditionally stored in a relational database. So one option is to link to where your metadata is stored. Another option, which is more popular for RAG applications, is to store your metadata directly in the vector store itself. It's easier and faster to store your metadata directly with your vectors and use it for information and filtering that way.
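As a rough illustration of storing metadata directly with the vectors and filtering on it, here is a toy in-memory store. The `TinyVectorStore` class, its `add` and `search` methods, the hand-written two-dimensional vectors, and the sample authors and dates are all assumptions made for this sketch. Real vector databases expose their own filter syntax and use real embeddings.

```python
import math
from datetime import date


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class TinyVectorStore:
    """Hypothetical store that keeps each vector, its text, and its metadata together."""

    def __init__(self):
        self.entries = []

    def add(self, vector, text, **metadata):
        self.entries.append({"vector": vector, "text": text, "metadata": metadata})

    def search(self, query_vector, top_k=3, where=None):
        # Apply the metadata filter first, then rank the survivors by similarity.
        candidates = [e for e in self.entries if where is None or where(e["metadata"])]
        ranked = sorted(
            candidates,
            key=lambda e: cosine_similarity(query_vector, e["vector"]),
            reverse=True,
        )
        return ranked[:top_k]


store = TinyVectorStore()
# Toy 2-D vectors stand in for real embeddings.
store.add([1.0, 0.0], "Intro to vector search", author="me", updated=date(2024, 5, 20))
store.add([0.9, 0.1], "Old notes on vector search", author="me", updated=date(2023, 1, 5))
store.add([1.0, 0.05], "Someone else's notes", author="other", updated=date(2024, 5, 25))

# Only entries written by "me" and updated on or after an assumed cutoff date.
results = store.search(
    [1.0, 0.0],
    where=lambda m: m["author"] == "me" and m["updated"] >= date(2024, 5, 1),
)
for entry in results:
    print(entry["text"], entry["metadata"])
```

Because the text and metadata travel with each vector, a single search call returns everything needed to build the prompt, with no second lookup in a separate database.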
