From the course: Computer Architecture Essentials
Understanding OOO execution
- [Instructor] One of the greatest ideas in the history of computer architecture is out-of-order execution. The idea is simple. A program like this one could execute much faster if we had many execution units that process instructions not in the order they appear in the program, but as soon as the data they require becomes available. Coming up with this idea requires an open mind to a few non-obvious facts. First, algorithms aren't always required to be sequential. They just happen to be expressed this way, but sometimes the ordering of instructions is negotiable. Second, in a processor, we may have multiple units to execute many instructions at the same time. I'm referring to execution units like ALUs, floating-point adders, and multipliers. Third, some instructions can be partially completed while others are running. In summary, this mindset doesn't follow the order of the instructions in the program, but the order of their data dependencies.

Let's briefly look at an analogy to understand the logic behind out-of-order execution. Executing instructions in a simple CPU is similar to cooking for your family. You get a new task for every meal, and you complete each task sequentially. Now, executing instructions in an out-of-order processor is similar to a restaurant kitchen. Here are some similarities. In a restaurant kitchen, waiters bring orders for different dishes to prepare in sequence. In a processor, instructions are fetched from memory in sequence. A restaurant kitchen has many cooking stations like stoves, chopping tables, ovens, grills, and so on. An out-of-order processor has many processing units like ALUs, multipliers, dividers, or bit shifters. In a kitchen, some foods are used in many dishes. For example, a sauce that needs to be prepared could be used for pasta, pizza, and sandwiches. You can't prepare any of these if the sauce isn't ready.
In a program, there are lots of data dependencies, which make sequential execution a requirement, but this ordering is only required for the instructions involved in the dependency. And lastly, dishes are completed out of order. That is, customers will not necessarily receive their dishes in the order they requested them. Instructions can also complete out of order, and you'll see that this doesn't necessarily disrupt the program's functionality. The main factor that determines when an instruction can execute is the availability of its operands, in other words, the data dependencies in the program. For now, we will concentrate on the feature of multiple execution units, and you'll see that the out-of-order part will come naturally. Let me show you how much faster this sequence of instructions could execute if we were to run it in an out-of-order processor. I'm about to show you some data dependencies in this code, so be aware that all these instructions perform a calculation based on the two registers specified at the right and save the result in the register specified as the first operand at the left. Also, notice that I have numbered these instructions. Look at instruction number one. It produces the value of r2, so there's a data dependency between instruction one and instruction two, which uses r2. There's also a data dependency between instruction one and instruction six through the same register, r2. Basically, there's a data dependency between an instruction that uses a register as an operand and the latest prior instruction that produces that register's value. So, here we have more dependencies in this program. As an exercise, you may want to pause the video and see if you can confirm that these dependencies are correct. Actually, if you're up for the challenge, try to find the three dependencies that are missing in this diagram. But don't worry, those dependencies will not affect the point I'm about to make in this example.
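The "latest writer" rule just described can be sketched in a few lines of code. The instruction list below is hypothetical, since the transcript's actual program isn't reproduced here, but the rule is the one the instructor states: an instruction has a read-after-write dependency on the latest earlier instruction that writes one of its source registers.

```python
# Sketch: finding read-after-write (RAW) dependencies in a straight-line
# sequence of three-operand instructions (dest, src1, src2).

def find_raw_dependencies(instructions):
    """instructions: list of (dest, src1, src2) register names, numbered
    from 1 in program order. Returns (producer, consumer) pairs."""
    last_writer = {}   # register -> number of latest instruction writing it
    deps = []
    for i, (dest, src1, src2) in enumerate(instructions, start=1):
        for src in (src1, src2):
            if src in last_writer:
                deps.append((last_writer[src], i))
        last_writer[dest] = i   # this instruction is now the latest writer
    return deps

# Hypothetical program, not the one shown in the video.
program = [
    ("r2", "r1", "r3"),   # 1: writes r2
    ("r4", "r2", "r5"),   # 2: reads r2 -> depends on 1
    ("r6", "r4", "r1"),   # 3: reads r4 -> depends on 2
    ("r7", "r1", "r3"),   # 4: no prior writers -> no dependencies
    ("r8", "r6", "r4"),   # 5: depends on 3 (r6) and 2 (r4)
    ("r9", "r2", "r7"),   # 6: depends on 1 (r2) and 4 (r7)
]

print(find_raw_dependencies(program))
# -> [(1, 2), (2, 3), (3, 5), (2, 5), (1, 6), (4, 6)]
```

Each pair (p, c) means instruction c cannot execute until instruction p has produced its result, exactly the arrows a dependency diagram would draw.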
Based on these dependencies, we can produce a mathematical representation that will reveal the potential for parallelism in this program and its actual required order of execution. This representation is known as a dependency graph. Let me tell you how to interpret it. Each node represents an instruction in the program. Execution is supposed to flow from the top down, so each horizontal level represents a period of time when all instructions in that level may execute. This could be a single clock cycle or multiple clock cycles depending on the architecture. Arrows indicate the data dependencies present and the requirements for an instruction to execute. For example, instruction six has two incoming arrows. That means that instruction six cannot start executing until both instructions one and four are done. One interesting detail is the fact that instruction four can be executed in the first cycle along with instruction one. That's because neither of them has any prior dependencies, as no earlier instruction produces the values they reference. Go ahead and pause the video if you'd like to confirm this. So why is the dependency graph important? The dependency graph reveals many non-obvious aspects of the program. First, it reveals the strictly required order of instructions. Here we can see that instructions one, two, three, and five must execute in sequence. In general, every path from top to bottom must execute in sequence. Second, it reveals that this program can be executed in four cycles instead of 10. But don't get carried away, this is just a best case. Achieving this speedup has its challenges, as in any parallel system. The third feature revealed by this graph is the maximum number of parallel execution units this program could use. In this case, that number is three, so we need at least three execution units to achieve the program's full potential. And fourth, the graph also reveals that in some cycles not all execution units are needed.
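These graph properties can be computed mechanically. The edge list below is hypothetical, chosen only to be consistent with the levels the transcript describes (instructions one and four first, ten instructions, depth of four, at most three in a level), but the analysis is general: each instruction's earliest level is one more than the deepest level among its producers.

```python
# Sketch: computing the levels of a dependency graph (ASAP scheduling),
# its depth (minimum cycles), and its width (peak parallelism).
from collections import defaultdict

def asap_levels(n_instructions, deps):
    """deps: list of (producer, consumer) pairs with producer < consumer.
    Returns {instruction: level}, where level 1 is the first cycle."""
    producers = defaultdict(list)
    for p, c in deps:
        producers[c].append(p)
    level = {}
    for i in range(1, n_instructions + 1):  # program order is a valid
        if not producers[i]:                # topological order here
            level[i] = 1                    # no dependencies: cycle 1
        else:
            level[i] = 1 + max(level[p] for p in producers[i])
    return level

# Hypothetical edges, consistent with the transcript's example.
deps = [(1, 2), (1, 6), (4, 6), (2, 3), (3, 5),
        (6, 7), (6, 9), (7, 8), (9, 10)]
levels = asap_levels(10, deps)

depth = max(levels.values())                   # minimum cycles needed
width = max(sum(1 for l in levels.values() if l == c)
            for c in range(1, depth + 1))      # most units busy at once
print(depth, width)   # -> 4 3
```

With these edges, the depth of four confirms the four-cycle best case, and the width of three is the number of execution units needed to reach it.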
This is useful when analyzing the performance of a system. All of this means not only that this program can execute in parallel, but also that its instructions can be executed out of order. In this example, instructions one and four can execute first, then two and six, then three, seven, and nine, and then five, eight, and 10. That's why it makes sense to execute out of order. This graph is important because it is instrumental in the decisions made to improve the speed of execution. Here are some examples of hardware and software elements that use this graph. Remember the reordering of instructions a compiler may perform to deal with a control or data hazard? Well, optimizing compilers produce dependency graphs to learn which instructions they may reorder, if any. Superscalar processors constantly review the dependencies of the instructions they are about to execute, because those instructions will run in parallel, so dependencies between them must be avoided. This is also done by processors with simultaneous multithreading. And lastly, the operation of an out-of-order processor is completely controlled by the dependencies in the program.
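The effect of the number of execution units can be seen with a simple greedy scheduler. The edge list here is hypothetical, chosen to be consistent with the schedule the transcript describes (cycle one: instructions one and four; cycle two: two and six; cycle three: three, seven, and nine; cycle four: five, eight, and 10); the scheduler itself is a standard sketch that issues any ready instruction, up to a fixed number per cycle.

```python
# Sketch: counting cycles when at most `units` instructions can issue
# per cycle. An instruction is ready once all its producers are done.
from collections import defaultdict

def schedule(n, deps, units):
    producers = defaultdict(set)
    for p, c in deps:
        producers[c].add(p)
    done, cycles = set(), 0
    while len(done) < n:
        ready = [i for i in range(1, n + 1)
                 if i not in done and producers[i] <= done]
        done.update(ready[:units])   # issue at most `units` this cycle
        cycles += 1
    return cycles

# Hypothetical edges, consistent with the transcript's example.
deps = [(1, 2), (1, 6), (4, 6), (2, 3), (3, 5),
        (6, 7), (6, 9), (7, 8), (9, 10)]

print(schedule(10, deps, units=3))   # -> 4 cycles: full parallelism
print(schedule(10, deps, units=1))   # -> 10 cycles: fully sequential
```

With three units the ten instructions finish in the graph's depth of four cycles; with a single unit, execution degrades to one instruction per cycle, which is why an out-of-order core needs multiple execution units to pay off.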