How can a processor execute more IPS than its frequency?

Question

This has been something that I can't seem to wrap my head around. Just about every modern processor is able to execute more instructions per second than its frequency.

I can understand why lower-end processors can execute fewer IPS than their frequency. For instance, the ATmega328 executes about 16 MIPS at 16 MHz (or at least that's what I've been told), and the Z80 executes 0.5 MIPS at 4 MHz. But then the Pentium 4 Extreme can execute more than 9 GIPS at only 3.2 GHz. That is about three instructions per clock cycle!
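As a quick sanity check on that arithmetic, here is a minimal sketch that divides instructions per second by clock frequency. The figures are the approximate ones quoted above, not measurements, and the function name `instructions_per_cycle` is just an illustrative choice.

```python
def instructions_per_cycle(ips, hz):
    """Average instructions completed per clock cycle."""
    return ips / hz

# (instructions per second, clock frequency in Hz), as quoted in the question
chips = {
    "ATmega328": (16e6, 16e6),
    "Z80": (0.5e6, 4e6),
    "Pentium 4 Extreme": (9e9, 3.2e9),
}

for name, (ips, hz) in chips.items():
    print(f"{name}: {instructions_per_cycle(ips, hz):.4f} instructions per cycle")
```

With these numbers the ATmega328 comes out at 1 instruction per cycle, the Z80 at 0.125, and the Pentium 4 Extreme at 2.8125.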

How is this done, and why isn't this implemented in smaller processors, such as AVR microcontrollers?

I found all of my information, except for the ATmega328, from here.

Mokubai · Accepted Answer · 2015-07-15T11:43:25.830

This is due to a combination of features of modern processors.

The first thing that contributes to a high IPS is the fact that modern processors have multiple execution units that can operate independently. In the below image (borrowed from Wikipedia: Intel Core Microarchitecture) you can see at the bottom that there are eight execution units (shown in yellow) that can all execute instructions concurrently. Not all of those units can execute the same types of instruction, but at least 5 of them can perform an ALU operation and there are three SSE capable units.

[Image: block diagram of the Intel Core microarchitecture, from Wikipedia]

Combining that with a long instruction pipeline, which can efficiently queue instructions for those units to execute (out of order, if necessary), means that a modern processor can have a large number of instructions in flight at any given time.

Each instruction might take a few clock cycles to execute, but if you can effectively parallelize their execution then you can give yourself a massive boost to IPS at the cost of processor complexity and thermal output.
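To put rough numbers on that, here is a toy model (a sketch under strong assumptions, not how any real CPU is specified): every instruction is independent, each takes `latency` cycles from start to finish, and a pipelined machine can start `width` new instructions per cycle. The function names and parameters are made up for illustration.

```python
import math

def cycles_unpipelined(n, latency):
    """One instruction at a time; the next starts only when the previous finishes."""
    return n * latency

def cycles_pipelined(n, latency, width):
    """Start up to `width` independent instructions per cycle.

    The last group starts in cycle ceil(n / width) and needs
    latency - 1 further cycles to drain out of the pipeline.
    """
    return math.ceil(n / width) + latency - 1

n, latency = 1000, 3
for label, cycles in [
    ("unpipelined, 1 unit", cycles_unpipelined(n, latency)),
    ("pipelined, 1 per cycle", cycles_pipelined(n, latency, 1)),
    ("pipelined, 4 per cycle", cycles_pipelined(n, latency, 4)),
]:
    print(f"{label}: {cycles} cycles, {n / cycles:.2f} instructions per cycle")
```

Even though each instruction still takes three cycles in this model, the 4-wide pipelined case finishes 1000 instructions in 252 cycles, which is close to four instructions per cycle. Real code has dependencies, branches and cache misses, so sustained throughput is lower than this idealised figure.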

Keeping these large pipelines full of instructions also needs a large cache that can be prefilled with instructions and data. This contributes to the size of the die and also the amount of heat the processor produces.

This is not done on smaller processors because it substantially increases the amount of control logic required around the processing cores, as well as the die area required and the heat generated. If you want a small, low-power, highly responsive processor then you want a short pipeline without too much "extra" stuff surrounding the actual functional cores. So typically they minimise cache, include only one of each type of unit required to process instructions, and reduce the complexity of every part.

They could make a small processor as complex as a larger processor and achieve similar performance, but then the power draw and cooling requirements would increase dramatically.

score 4 · Answer 2 · answered Jul 15 '15 at 05:36

It's not hard to imagine. One cycle is all it takes to switch many thousands of transistors. As long as instructions are lined up in parallel, one cycle can be enough to execute them all.

Better than trying to explain it myself, here's a good starting point.

score 3 · Answer 3 · answered Jul 15 '15 at 12:47

To get a bit more fundamental than Mokubai's answer:

Superscalar CPUs analyse the instruction stream for data (and other) dependencies between instructions. Instructions that don't depend on each other can run in parallel.
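A minimal sketch of that idea, with heavy simplifications: dependencies are given explicitly as indices of earlier instructions (real hardware derives them from register names and does renaming), every instruction takes one cycle, and any execution unit can run any instruction. The `schedule` function and the six-instruction program are hypothetical, made up for this illustration.

```python
def schedule(deps, width):
    """Greedy issue: each cycle, run up to `width` instructions whose
    dependencies all finished in an earlier cycle.

    deps[i] lists the indices (all < i) of instructions that i depends on.
    Returns one list of instruction indices per cycle.
    """
    executed_in = [None] * len(deps)   # cycle in which each instruction ran
    remaining = set(range(len(deps)))
    cycles = []
    while remaining:
        cycle = len(cycles)
        issued = []
        for i in sorted(remaining):    # oldest instruction first
            if len(issued) == width:
                break
            if all(executed_in[d] is not None and executed_in[d] < cycle
                   for d in deps[i]):
                issued.append(i)
        for i in issued:
            executed_in[i] = cycle
            remaining.discard(i)
        cycles.append(issued)
    return cycles

# Hypothetical program:
#   0: a = load x      1: b = load y      2: c = a + b
#   3: d = load z      4: e = d * 2       5: f = c + e
deps = [[], [], [0, 1], [], [3], [2, 4]]

for width in (1, 4):
    plan = schedule(deps, width)
    print(f"width {width}: {len(plan)} cycles, plan {plan}")
```

With one instruction per cycle the program takes six cycles. With a width of four it takes three cycles, `[[0, 1, 3], [2, 4], [5]]`, so the dependency chain rather than the issue width becomes the limit.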

Typical x86 desktop CPUs fetch 16 or 32 bytes of instructions every clock cycle. Intel designs since Core2 can issue up to 4 instructions per cycle (or 5, if there's a compare-and-branch that can macro-fuse).

See Mokubai's nice answer for links and details on how CPUs in practice go about the task of extracting as much instruction-level parallelism as they do from the code they run.

Also see http://www.realworldtech.com/sandy-bridge/ and similar articles for other CPU architectures for an in-depth explanation of what's under the hood.

score -1 · Answer 4 · answered Jul 15 '15 at 16:24

Previous answers show how one gets more instructions executed by the processor's definition of "instruction" and one imagines that is actually the questioner's intent.

But another source of it may be that each "instruction" is actually a certain amount of data treated as an instruction input by the processor. If the questioner's source counts only what the processor regards as instructions, the following adds nothing. But if the source counts everything a human would call an "instruction", then consider that not every instruction is as physically long as every other instruction (one might be 12 bytes, another might be 56 bytes, etc.). So if the processor loads 64 bytes of material each cycle as "an instruction" (or as many full instructions as it can fit before hitting 64 bytes) and there are six instructions in those 64 bytes, then six instructions (as you and I might regard them) will be finished in that cycle.

Since many very basic instructions (our "sensible" definition) are leftovers from early days with 8 byte instruction lengths, and very basic instructions are, by definition, perhaps used disproportionately, just this would go a long way to having more "instructions" performed than frequency would seem to allow.
