From the course: Hands-On AI: RAG using LlamaIndex

Metadata extraction - LlamaIndex Tutorial

- [Instructor] I spent the last few videos talking to you at great length about various chunking strategies, but I don't want you to think that chunking is the only advanced RAG or pre-indexing technique at your disposal. As we saw with some of those earlier examples, a chunk of text can sometimes lack the context necessary to disambiguate it from other, similar chunks. To combat this, we can use an LLM to extract contextual information relevant to our documents and nodes, which helps both retrieval and the language model disambiguate similar-looking passages of text. One way to do this is with metadata, so let's talk about metadata.

We'll start, as we normally do, with our imports. We'll set up our API keys; remember that from here on out we'll be working with an in-memory vector database. We'll set up our LLM and our embedding model. The next few lines of code just sample from our doc store. If you recall, we built a doc store where we cleaned, processed, and parsed all of those PDFs. We're going to sample from there, grouping by author and selecting, in this case, 25 samples per author, to keep the number of documents we're working with small. The whole point here is to illustrate the various patterns for RAG and advanced RAG, so we're not going to ingest everything from here on out. You've gained plenty of that experience in the videos up to this point; now we can focus on the patterns themselves, working strictly in memory, without getting distracted by ingesting into a vector database in the cloud.

So what is metadata? Metadata is just additional context or information about the nodes. During retrieval, we can leverage that additional context for more precise and relevant retrieval. But the effectiveness of this approach depends on the quality and relevance of the metadata tags that you use.

The simplest way to add metadata is to do it manually, and here's how (there's a sketch of this pattern just below). I'm going to add a metadata tag called known_for, which attaches to each chunk of text a note about what its author is best known for. For example, Naval Ravikant is best known for his insights on how to build wealth and achieve happiness; Bruce Lee offers profound wisdom on self-improvement and personal growth. We'll instantiate a dictionary containing the metadata we want to attach to our documents, add the known_for tag to the metadata of each document, and then look at the metadata for one random document. You can see that it now holds the page_number, file_name, title, author, and what this author is known for.

Manually adding metadata is great, and it's an important thing to do, but we can take it a step further: we can extract metadata from our nodes automatically. Metadata extraction automatically gathers information about our documents and nodes that we can use to enhance the organization, retrieval, and understanding of our corpus of documents.
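Before we turn to automatic extraction, here's a minimal sketch of the manual known_for pattern just described. The sample documents, file names, and blurbs are illustrative stand-ins for the doc store used in the lesson:

```python
from llama_index.core import Document

# Hypothetical sample documents; in the lesson these come from the doc store.
documents = [
    Document(
        text="Wealth is assets that earn while you sleep.",
        metadata={
            "page_number": 12,
            "file_name": "almanack_of_naval_ravikant.pdf",
            "title": "The Almanack of Naval Ravikant",
            "author": "Naval Ravikant",
        },
    ),
    Document(
        text="Absorb what is useful, discard what is not.",
        metadata={
            "page_number": 7,
            "file_name": "striking_thoughts.pdf",
            "title": "Striking Thoughts",
            "author": "Bruce Lee",
        },
    ),
]

# Map each author to a short blurb about what they're best known for.
known_for = {
    "Naval Ravikant": "Insights on how to build wealth and achieve happiness.",
    "Bruce Lee": "Profound wisdom on self-improvement and personal growth.",
}

# Attach the known_for tag to each document's metadata.
for doc in documents:
    doc.metadata["known_for"] = known_for.get(doc.metadata["author"], "")

# Inspect one document's metadata: page_number, file_name, title,
# author, and the new known_for tag.
print(documents[0].metadata)
```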
This becomes really useful when we have a ton of documents and manually adding metadata is untenable and impractical simply due to the volume of data. LlamaIndex has tools and abstractions that we can use to automatically identify and extract different types of metadata from a document. If you look at the source code, you'll see a number of different extractors. One of them is the SummaryExtractor, which automatically generates a concise summary of a document's or node's content. There's also the QuestionsAnsweredExtractor, which is a really cool one: it looks at a chunk of text, a node, and asks what questions that node could possibly answer, then uses those questions as metadata. There's a TitleExtractor, which simply gives the document a title. There's the EntityExtractor, which extracts named entities like people, places, or organizations. And there's the KeywordExtractor, which attaches unique keywords to a particular node.

So let's look at some of these extractors in action. All of them come from the llama_index.core.extractors module. An interesting thing to point out is that each of these extractors has a prompt template associated with it, because we're making use of a large language model for the automatic extraction. You can inspect the prompt for each extractor via its .prompt_template attribute. For the SummaryExtractor, we give it the content of a section and ask it to summarize the key topics. For the QuestionsAnsweredExtractor, we say: here's some context, generate some number of questions. The TitleExtractor says something like: give a title that summarizes the unique entities, titles, or themes of this content. The KeywordExtractor is structured a bit differently. It has a prompt template, but it's buried within the LLM call rather than exposed as an attribute the way we've seen here. If you look at the source code, which I've linked to here, you'll see the prompt template: given some context string, give some number of unique keywords for this document, formatted as comma-separated.

In my opinion, the two most powerful automated metadata extraction techniques are the QuestionsAnsweredExtractor and the SummaryExtractor. Recall that the QuestionsAnsweredExtractor generates some questions that a piece of text can answer. The SummaryExtractor extracts a summary not only of the current text but of some of the adjacent text as well. This leads to higher-quality retrieval and higher-quality answers.

So what we'll do now is define some metadata extractors using MetadataMode.EMBED, which tells LlamaIndex how we want to handle the metadata when generating embeddings for a document or node. Remember that when you call get_content on a document with MetadataMode.EMBED specified, it returns the content of the document along with the metadata that will be visible to the embedding model (there's a short sketch of this just below).
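As a hedged sketch of the two ideas just described, inspecting an extractor's prompt via .prompt_template and previewing what the embedding model sees with MetadataMode.EMBED, here's what that might look like. The sample document is hypothetical, and an OpenAI API key is assumed to be set:

```python
from llama_index.core import Document
from llama_index.core.extractors import (
    QuestionsAnsweredExtractor,
    SummaryExtractor,
)
from llama_index.core.schema import MetadataMode
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o")  # requires OPENAI_API_KEY in the environment

# Each LLM-backed extractor carries the prompt it sends to the model.
print(SummaryExtractor(llm=llm).prompt_template)
print(QuestionsAnsweredExtractor(llm=llm).prompt_template)

# MetadataMode.EMBED controls what the embedding model gets to see:
# get_content() returns the node text together with the visible metadata.
doc = Document(
    text="Specific knowledge is found by pursuing your genuine curiosity.",
    metadata={"author": "Naval Ravikant", "known_for": "Wealth and happiness."},
)
print(doc.get_content(metadata_mode=MetadataMode.EMBED))
```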
A note I wrote here says we'll use GPT-3.5 Turbo to generate the metadata, but I'm actually just going to use GPT-4o; if you'd rather use GPT-3.5 Turbo, you can swap that in. At the end of the day, this is all about experimentation, so I encourage you to try out the other metadata extractors to see what the results look like. I give you the patterns for the KeywordExtractor and the TitleExtractor, but we're just going to focus on using the QA extractor and the summary extractor.

So I've set the qa_llm to be GPT-4o. For the text splitter, I'm just going to use a token text splitter, and you can see here, again, that you've got another design choice to make. You could have used any other type of splitter: a sentence window node parser, a sentence splitter, a semantic splitter. There are so many design choices at your disposal, not to mention hyperparameters like chunk size and chunk overlap. The QA extractor here will generate two questions per node, and the summary extractor will produce summaries for the previous, the current, and the next node. So we go ahead and instantiate those, and then we'll set up our vector database. Note that if you were to run this on the entirety of the documents, all of the documents that we had cleaned, it could take up to 30 minutes; since we're only working with a small subset, it'll take a lot less time. We will set up our transforms: the text splitter, the QA extractor, and the summary extractor, sketched below.
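Here's a sketch of the setup just described, assuming the transforms are chained with an IngestionPipeline (one common way to apply them); the chunk_size and chunk_overlap values are illustrative assumptions, and `documents` stands for the sampled docs from earlier in the lesson:

```python
from llama_index.core.extractors import (
    QuestionsAnsweredExtractor,
    SummaryExtractor,
)
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import TokenTextSplitter
from llama_index.core.schema import MetadataMode
from llama_index.llms.openai import OpenAI

qa_llm = OpenAI(model="gpt-4o")

# Token-based splitting; chunk_size and chunk_overlap are illustrative.
text_splitter = TokenTextSplitter(chunk_size=512, chunk_overlap=64)

# Generate two questions per node; metadata is visible to the embedding model.
qa_extractor = QuestionsAnsweredExtractor(
    llm=qa_llm, questions=2, metadata_mode=MetadataMode.EMBED
)

# Summarize the previous, current, and next node.
summary_extractor = SummaryExtractor(
    llm=qa_llm,
    summaries=["prev", "self", "next"],
    metadata_mode=MetadataMode.EMBED,
)

# Chain the transforms: split first, then extract metadata per node.
pipeline = IngestionPipeline(
    transformations=[text_splitter, qa_extractor, summary_extractor]
)

# `documents` would be the sampled docs from earlier in the lesson:
# nodes = pipeline.run(documents=documents, show_progress=True)
```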
