From the course: Introduction to AI-Native Vector Databases

Solution: Working with vectors

Now that you've taken the time to work on this challenge, let me demonstrate my solution. Let's start with the question itself: we've been given two vectors that represent RGB colors, the color 40, 120, 60 and the color 60, 50, 90. Remember, these numbers say how much red, how much green, and how much blue you need for each color.
The first thing we're going to do is import the libraries we need. For this, we're going to use NumPy, and Matplotlib to visualize our data. Then we initialize the two vectors we've been given. I'm going to specify the first color here as the red intensity, the green intensity, and the blue intensity. The second color, similarly, is going to be 60, 50, and 90.
Now that we've initialized the two colors, we can add them to our graph. I've set up a graph figure to make it easy to visualize these in three dimensions, and we're going to pop the vectors in one by one. There are a couple of things we need to do to accomplish this. We add a scatter plot and specify its coordinates: the first coordinate is the red amount, the second is the green amount, and the third coordinate, the third dimension, is the blue amount. We also specify the orientation of the graph, with the z-axis pointing towards the top. To make it easier to read, we add a label so that we can identify which color we're referring to; here that's color one. Similarly, we're going to plot out the second color as well.
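The setup described so far might look roughly like the sketch below. The variable names (`color_1`, `color_2`, `fig`, `ax`) and the axis labels are my own choices for illustration, not necessarily what appears on screen in the video.

```python
import numpy as np
import matplotlib.pyplot as plt

# The two RGB colors from the challenge, stored as (R, G, B) vectors.
color_1 = np.array([40, 120, 60])
color_2 = np.array([60, 50, 90])

# A 3D figure whose axes stand for red, green, and blue intensity.
fig = plt.figure()
ax = fig.add_subplot(projection="3d")

# Plot color one as a single point; zdir="z" keeps the z-axis pointing up.
ax.scatter(color_1[0], color_1[1], color_1[2], zdir="z", label="color 1")

# Axis labels are an addition here, to make the dimensions explicit.
ax.set_xlabel("R")
ax.set_ylabel("G")
ax.set_zlabel("B")
ax.legend()
```

Outside a notebook, calling `plt.show()` afterwards displays the figure.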
The x coordinate for the second color just comes from the second vector, so now it's 60 instead of 40. We use color two for the other coordinates likewise, and we change the label so that this one reads color two. We're also going to plot out the origin, 0, 0, 0, because later on we're going to draw an arrow from this point to each of the two points we've plotted. To do that, I'll take the same code and, rather than specifying the coordinates one by one, just specify 0, 0, 0. Easy enough. We label this point as the origin, and we color it uniquely: black. You can run this. Notice that we now have the black point at the origin, plus color two and color one.
We're still missing a component, though: we don't have the arrow that represents the vector we saw earlier. So let's add that arrow in. To do that, we use the quiver method and specify where the arrow starts and where it ends. The arrow starts at the origin, 0, 0, 0, and for the first arrow it ends at the coordinates for color one: the first coordinate, the second coordinate, and the third coordinate. To differentiate it, we can also set a color, so this arrow is going to be black. And to make it look a little nicer, we can shrink the arrowhead by setting the arrow length ratio to 0.1. If I draw this out, you'll notice that one of the arrows is now drawn. We just have to draw the other vector, which is easy enough.
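As a rough sketch of the steps just described, the origin point and the first arrow could be added like this. The variable names are assumptions of mine. One detail worth knowing: Matplotlib's 3D `quiver` takes a start point plus the arrow's components rather than an end point, and the two coincide here only because the arrow starts at the origin.

```python
import numpy as np
import matplotlib.pyplot as plt

color_1 = np.array([40, 120, 60])
color_2 = np.array([60, 50, 90])

fig = plt.figure()
ax = fig.add_subplot(projection="3d")

# The two color points, then the origin in black.
ax.scatter(color_1[0], color_1[1], color_1[2], zdir="z", label="color 1")
ax.scatter(color_2[0], color_2[1], color_2[2], zdir="z", label="color 2")
ax.scatter(0, 0, 0, zdir="z", label="origin", color="black")

# quiver(x, y, z, u, v, w): the arrow starts at (x, y, z) and has
# components (u, v, w). Starting at the origin, it ends at color one.
# arrow_length_ratio sets the arrowhead size relative to the arrow.
arrow_1 = ax.quiver(
    0, 0, 0,
    color_1[0], color_1[1], color_1[2],
    color="black",
    arrow_length_ratio=0.1,
)

ax.legend()
```

The second arrow follows the same pattern with the components taken from `color_2`.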
We'll copy-paste this and just change the coordinates, so now I specify color two as the one we want to plot. Everything else stays pretty much the same. When we run this code, both vectors are visualized, starting at the origin and ending at the coordinates for the individual colors. So we've got color two and color one, each situated in vector space based on its three-dimensional coordinates.
The next thing we want to do is calculate how far apart these two vectors are. Because there are different ways to measure that, we'll go through each calculation one at a time. The first is the Manhattan distance. To calculate the Manhattan distance, we essentially start at one vector and walk along the axes, one by one, adding up how much distance we've covered until we get to the other vector. To do that mathematically, we start by asking: what is the difference between the individual color coordinates? We take color one and color two and compute the difference for each of the dimensions, looping over the indices zero, one, and two to access the R, the G, and the B. This code gives us the difference between the red intensities, the green intensities, and the blue intensities of our colors. Once we run it, we can see the differences here. This is still not the Manhattan distance, so we still need to modify it.
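The intermediate step described above, the signed difference in each dimension, can be sketched in a few lines. The names `color_1`, `color_2`, and `differences` are mine, and plain Python lists are used here only to keep the printed output easy to read.

```python
color_1 = [40, 120, 60]
color_2 = [60, 50, 90]

# Signed difference in each dimension: index 0 is R, 1 is G, 2 is B.
differences = [color_1[i] - color_2[i] for i in range(len(color_1))]

print(differences)  # [-20, 70, -30]
```

The negative entries are why this is not yet a distance: walking 20 units in one direction and 30 in another should add up, not cancel out.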
We take this output and run it through the absolute value function, and then we sum it all up. This is the exact same formula that we saw earlier for the Manhattan distance. Now, when we run this, we get the correct Manhattan distance: color one is 120 units away from color two, measuring from one vector to the other. An alternate approach to calculate the exact same distance is to use NumPy's built-in linear algebra library to calculate the norm of the difference between a and b, of order one. This is why the Manhattan distance is also known as the L1 distance. We can run this and verify that we get the exact same distance.
Next, we're going to look at the Euclidean distance. If we scroll up, the Euclidean distance is the shortest distance between this vector and this vector: it's literally the line that connects the orange dot to the blue dot. If we want to know how long that line is, the Euclidean distance gives us that measurement. We set it up similarly, extracting how different the colors are in each dimension. We take the first dimension of color one, subtract the first dimension of color two, and square the difference, that is, raise it to the power of two. We loop through all of the intensities, the R, the G, and the B, using range of the length so that we cover all the dimensions. Now we can print this out, and these are the squared differences between the two colors. You can verify that against the differences we had earlier: we had -20, and the square of -20 is 400.
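Both steps just covered, the finished Manhattan distance and the squared differences that begin the Euclidean calculation, can be sketched as follows. Variable names are assumptions of mine; the NumPy calls (`np.abs`, `np.sum`, `np.linalg.norm` with `ord=1`) are standard.

```python
import numpy as np

color_1 = np.array([40, 120, 60])
color_2 = np.array([60, 50, 90])
n_dims = len(color_1)

# Manhattan (L1) distance by hand: sum of absolute per-dimension differences.
differences = [color_1[i] - color_2[i] for i in range(n_dims)]
manhattan_by_hand = np.sum(np.abs(differences))

# The same distance as the order-1 norm of the difference vector.
manhattan_norm = np.linalg.norm(color_1 - color_2, ord=1)

# First step towards the Euclidean distance: square each difference.
squared_differences = [(color_1[i] - color_2[i]) ** 2 for i in range(n_dims)]

print(manhattan_by_hand, manhattan_norm)  # 120 120.0
print(squared_differences)
```

The first entry of `squared_differences` is the 400 used as the sanity check above.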
Checking one of the squares like that is a bit of a sanity check, but we're not done. The next thing we need to do is pass these squared differences into the rest of the formula we showed earlier. I turn this Python list into a NumPy array, sum across the dimensions, and take the square root. Now we get the Euclidean distance between these two vectors. One small thing to note here: the direct line between the orange and the blue data points is shorter than walking along the axes to get from the orange to the blue, or vice versa. As a consequence, the Manhattan distance is larger than the Euclidean distance: the Euclidean distance is only around 78.74, and the Manhattan distance is 120 units. An alternate way to do this is to use NumPy's built-in linear algebra library, where we ask for the norm of the difference between a and b, of order two. That's why the Euclidean distance is also known as the L2 distance. We can do that and verify that we get the exact same answer.
The next distance we're going to cover is the cosine distance. The cosine distance measures the difference in direction between the two vectors. For example, if this vector were pushed slightly over here, the cosine distance would increase because the directions would differ more. So how do we measure it? As we discussed earlier, the cosine distance has to do with direction and has nothing to do with the actual length of the vectors. We start by calculating the dot product between the two vectors, and then we normalize that by the lengths of the individual vectors. We can calculate those lengths using the linear algebra library that we've been using.
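Before the cosine calculation is finished, here is a sketch of the Euclidean steps described above, done by hand and then with NumPy's norm. As before, the variable names are my own.

```python
import numpy as np

color_1 = np.array([40, 120, 60])
color_2 = np.array([60, 50, 90])

# By hand: square each difference, sum the squares, take the square root.
squared_differences = [
    (color_1[i] - color_2[i]) ** 2 for i in range(len(color_1))
]
euclidean_by_hand = np.sqrt(np.sum(np.array(squared_differences)))

# The same distance as the order-2 norm of the difference vector.
euclidean_norm = np.linalg.norm(color_1 - color_2, ord=2)

# For comparison, the Manhattan (order-1) distance from earlier.
manhattan = np.linalg.norm(color_1 - color_2, ord=1)

print(round(float(euclidean_by_hand), 2))  # 78.74
print(bool(euclidean_norm < manhattan))    # True
```

The final comparison reflects the point made above: the straight line between two points is never longer than the path along the axes.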
Here, we ask for the norm of the first vector multiplied by the norm of the second vector, divide the dot product by that, and run it. Strictly speaking, the value this gives us is the cosine similarity; the cosine distance is one minus it. The cosine similarity is at most 1.0, and the value here is high. That tells us these two vectors have pretty much the same direction, which is apparent from the visual as well. When vectors point in the same direction, the cosine similarity is high; if they were pointing in different directions, it would be low. The cosine similarity goes from -1 to 1. If the two vectors are at 90 degrees to each other, it is zero. If they point in exactly opposite directions, it is negative one.
The last thing we're going to do is talk about the dot product distance between these two vectors. To calculate the dot product, we multiply the individual dimensions together: we take the i-th dimension of color one and the i-th dimension of color two, where i loops from the first to the last dimension, for i in range of the length. We loop through this and look at what the products are. Then we pass them into the NumPy array function and add them all up, and the sum is the dot product between the two vectors. Another easy way to do this is to use the dot product functionality built into NumPy: we can say np.dot of a and b, and we get the exact same result.
For the last question, we want to come up with a color that is close in cosine distance to the first color we're given. In this case, the first color is 40, 120, and 60.
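The dot product and cosine calculations described above can be sketched as follows. The variable names are assumptions of mine, and the sketch keeps the similarity and the distance as separate values (distance taken as one minus similarity, which is a common convention) so the two are not confused.

```python
import numpy as np

color_1 = np.array([40, 120, 60])
color_2 = np.array([60, 50, 90])

# Dot product by hand: multiply matching dimensions, then add them up.
products = [color_1[i] * color_2[i] for i in range(len(color_1))]
dot_by_hand = np.sum(np.array(products))

# The same value using NumPy's built-in dot product.
dot_np = np.dot(color_1, color_2)

# Cosine similarity: dot product normalized by both vector lengths.
cosine_similarity = dot_np / (np.linalg.norm(color_1) * np.linalg.norm(color_2))

# Cosine distance under the one-minus-similarity convention.
cosine_distance = 1 - cosine_similarity

print(int(dot_np))                         # 13800
print(round(float(cosine_similarity), 3))  # 0.827
```

Because the lengths are divided out, scaling either vector by a positive factor leaves the cosine value unchanged, while the raw dot product grows with the vectors' lengths.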
We want to come up with a color three that points in the same direction as the first color, or as close to it as possible. To do this, we simply start at color one and slightly nudge it. For the red intensity, how much red should it take to get from color one to color three? We'll add a little bit more red; we don't want to modify it too much, so just a little. Same thing for the green intensity: we increase it by 15 units. Same thing for the blue: we modify it by five. The reason this vector will be pretty close to color one is that we used the color one vector to generate it; we simply nudged the intensities slightly up. So now color three is very similar to color one, and we can visualize that in our graph. If we plot out the graph we had before, we've got color one and color two. Now, if we plot out color three, it should be pretty close to the color one vector shown here.
Let's code that up. Remember, we need to add two things. First, we add the data point itself by adding a scatter point whose x coordinate is taken from the first dimension of color three, whose y coordinate comes from the second dimension, and whose z coordinate comes from the third dimension. To keep it consistent, we also make sure the z direction is upwards, and we label the point so that we can identify it later on. This is our new color, the third color we've added. We can run this and see that the new color is closer to color one than it is to color two.
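The claim that the nudged color sits closer to color one can also be checked numerically. In the sketch below, the green and blue nudges (15 and 5) come from the walkthrough, but the red nudge of 10 is purely an assumption of mine, since the exact amount is not stated. The helper `cosine_similarity` is a hypothetical function written for this example, not a NumPy built-in.

```python
import numpy as np

color_1 = np.array([40, 120, 60])
color_2 = np.array([60, 50, 90])

# Nudge each intensity of color one slightly upwards.
# Red nudge (10) is an assumed value; green (15) and blue (5) follow the text.
nudge = np.array([10, 15, 5])
color_3 = color_1 + nudge


def cosine_similarity(a, b):
    """Hypothetical helper: dot product divided by the product of lengths."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))


similarity_to_1 = cosine_similarity(color_3, color_1)
similarity_to_2 = cosine_similarity(color_3, color_2)

print(color_3.tolist())  # [50, 135, 65]
print(bool(similarity_to_1 > similarity_to_2))  # True
```

With these values, color three is nearly parallel to color one, which matches what the plot shows.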
That's a result of nudging color one just a little bit to arrive at this new color. Now let's draw the vector. To get the vector, we use quiver, and again, remember that we have to specify the start and end points. The start point is the origin, 0, 0, 0, and the end point is color three: its first dimension, its second dimension, and its third dimension. We also keep it consistent: the arrow is black, and the arrow length ratio is the same as before. Now that we've added this, let's have a look. This is the vector for the new color, the third color we've added. Notice how similar it is to the blue data point, and how different it is from the orange data point.
In this chapter, I introduced how data objects can be represented as vectors and, furthermore, how we can measure the distance between these vectors. In the next chapter, we'll introduce the concept of vector search and get practical using a vector database.
