From the course: Hands-On AI: RAG using LlamaIndex
Loading data
- [Instructor] So we know that we can't build a RAG system without external data, but in order to use that external data, we need to load it, and that's what we're going to learn how to do in this module. Note that this module makes use of html2text. You can run this cell to install it, or just install it from the command line like I've done here. Make sure that you've connected to the environment, and now we can get right to it. Preparing data for an LLM involves creating an ingestion pipeline. This is similar to traditional ML, where we have data cleaning or an ETL process, and ingestion happens in three stages: loading the data, transforming the data, and finally indexing and storing the data. So let's walk through how to load data. To use data with an LLM, we load it using a data connector, which is known as a reader in LlamaIndex. A reader formats data, in this case a text document, into a LlamaIndex document object that contains the data and metadata. We're going to use SimpleDirectoryReader, the most straightforward reader and loader that LlamaIndex has. SimpleDirectoryReader can read data in various formats: Markdown, PDFs, Word documents, PowerPoint decks, images, audio, and so on. And it does so for every file in a directory. I've linked to the source code of SimpleDirectoryReader here, and if you scroll to SimpleDirectoryReader, which we can find with a Ctrl+F, you can see the different arguments we can pass: we can pass in the input directory, which reads every single file in that directory into a document object, or alternatively, we can give it a list of input files. For illustrative purposes, I just want to work with one file, and in that case it's The Almanack of Naval Ravikant.
So I'm going to pass the input_files argument, which takes a list of the files we want to load, and then call load_data on the reader. So let's go ahead and do that, and see what we have. All right, we get back a list. This documents object that I've defined here is a list, but how many elements are in that list? Okay, it has 242 elements. What do these elements represent? We can look at the data itself, and unfortunately this PDF can't be rendered, so we'll need to get an extension. So let's go here and search for a PDF viewer. We'll install the PDF Viewer extension, because it's going to be helpful for us to be able to view the PDFs that we have. All right, the extension is installed, and we should be able to open this up and view it, and yes we can. Notice that this PDF has 242 pages, so the list of documents we have is also 242 elements long. Okay, that's interesting. What are the things in this list? Let's try to get the type of the first element. The first element is a LlamaIndex document object. So what does that look like? We can pick any element from the list and call __dict__ on it, and here is what a LlamaIndex document object looks like. It's essentially a dictionary: it has an ID that's used to identify the document object, it can hold the embedding, it has a spot for metadata, and a spot for metadata to exclude. Most importantly, though, it's got the actual raw text for that page. This LlamaIndex document object is what we're going to be working with primarily throughout this course. You can also manually create a document object, so let's see how that's done. I'm going to import just the Document class from llama_index.core and instantiate a manual document.
To do so, I need to pass in a string of text. When we do that, we can call __dict__ on it, and you can see that it's got much of the same information as what we saw above. We can also add metadata to the document object ourselves. So here we've got the manual document object with metadata: in this case I'm again giving it some text, and then I'm defining metadata with two elements, the file name and a category. And of course you can see that it shows up here; we have the metadata with the file name and category. You can also add metadata to a document object after it has been instantiated. You see here that we've got the same manual document as above, and right here the metadata is empty. So I'm going to go ahead and add some metadata, and you can see now that we have metadata associated with this document object. After the data has been loaded, we need to process and transform it for retrieval. The way we do that is by transforming a list of document objects into a list of node objects, and this can involve several steps: chunking, extracting metadata, and then embedding each chunk. The node object is a first-class citizen in LlamaIndex: you can define a node directly, or you can parse nodes from a document. A transformation takes nodes as input and produces nodes as output; what goes into the transformation is a node, and what comes out is a node as well. It's important to point out here that Document is actually a subclass of Node. Nodes are just chunks of documents; that could be text, images, metadata, or even relationships between nodes. And LlamaIndex has a number of node parsers; you can see here all the different node parsers that LlamaIndex has.
What a node parser does is convert documents into nodes with all the necessary attributes. The high-level API usage is just to call the get_nodes_from_documents method of a node parser, and this will automatically parse and chunk the document objects into nodes. What's happening under the hood is that we're splitting the documents into node objects, and this maintains the text and metadata along with a link to the parent document. So what we're going to do here is define a node parser, in this case a sentence splitter. We'll talk a lot about splitting later, but for now just take it for granted that we're splitting a body of text into chunks of 128 tokens, where each chunk has a 16-token overlap with the chunk before it. We're also defining a paragraph separator, in this case two newline characters. So we can go ahead and parse our nodes, and let's take a look real quick: what is the length of this nodes list? We have 890 nodes. So we started with 242 documents, we chunked those 242 documents using a sentence splitter, and now what we have is a list of 890 nodes. Each one of these nodes is a text node object, and this is what a node looks like. Again, it's essentially just a dictionary. We've got the page label of the node, the file name, and the file path. We've got the creation date and the modified date. We also have the relationships for this node, specifically what node came before it and what node came after it, and we have the text of the node as well. Of course, you can construct a node manually if you'd like. In this case, the node can be constructed using the TextNode class from LlamaIndex: we instantiate it with the text argument, and here I'm setting an ID for it as well. And you saw me mention node relationships, which are helpful because they assign connections between chunks of text. This is useful for documents that are organized in a hierarchical manner.
For example, books with chapters, sections, subsections, paragraphs, and sentences. If you have that kind of hierarchical structure in your data, then using node relationships would be beneficial. They also help maintain the sequential order of your nodes, which is extremely useful for content with complex relationships, like legal documents that link to clauses, other cases, or precedents. We can set node relationships just by assigning to the relationships attribute of a node object. Here you can see we have node one, and we're setting its next-node relationship to node two. Then on node two, we're setting its previous-node relationship to node one. And here I've got a list of nodes, and you can see I can define a parent-to-child type of relationship between nodes like so. If you look at node two and call __dict__ on it, the relationships between node one and node two are present here. So now that we know how to load data, next we're going to talk about indexing and why it's important for retrieval augmented generation. I'll see you in the next video.