Why is LOOP faster than DEC,JNZ on 8086?

Question

My professor claimed that LOOP is faster on 8086 because only one instruction is fetched instead of two, like in dec cx, jnz. So I think we are saving time by avoiding the extra fetch and decode per iteration.

But earlier in the lecture, he also mentioned that LOOP does the same stuff as DEC, JNZ under the hood, and I presume that its decoding should also be more complex, so the speed difference should kind of balance out. Then, why is the LOOP instruction faster? I went through this post, and the answers there pertain to processors more modern than 8086, although one of the answers (and the page it links) does point out that on 8088 (closely related to 8086), LOOP is faster.

Later, the professor used the same reasoning to explain why rep string operations might be faster than LOOP + individual movement instructions, but since I was not entirely convinced with the previous approach, I asked this question here.

Do you mean `dec ecx` / `jnz`, in that order? That's [the drop-in replacement for `loop`](https://stackoverflow.com/questions/46881279/how-exactly-does-the-x86-loop-instruction-work), other than modifying FLAGS. JMP is an unconditional jump. — Peter Cordes, Feb 14 '22 at 19:48
You do realize that there would be no single-instruction `LOOP` if it weren't "better" (smaller/faster) on the generation where it was introduced, right? If the Intel engineers hadn't found a way to have a faster single instruction than the `DEC, JNZ` pair, why would they have bothered to make an encoding for `LOOP`, taking up valuable space in the encoding table? — Ben Voigt, Feb 14 '22 at 20:03
@BenVoigt: It's still smaller, whether it was faster or not. Code-size was an even more valuable optimization goal more of the time back then. (Although that's a significant part of why it actually *is* faster, especially on 8088 with its 1-byte bus.) — Peter Cordes, Feb 14 '22 at 20:17
@PeterCordes: Agreed. But the answer to the question is still "LOOP exists because some Intel engineer saw an opportunity to shave 1 byte and 2 cycles off a loop". It may be a surprise that such an opportunity existed, but the fact that someone went to the trouble to make an instruction for it only makes sense when the specialized instruction is smaller/faster/both. — Ben Voigt, Feb 14 '22 at 20:26
@BenVoigt: Oh right, your first comment said "smaller/faster" which could be an either or, not necessarily both. But this question is specifically asking why it's faster, without considering or caring about it being smaller in machine code. I guess that's why you commented instead of answering, since it's highly related but doesn't 100% answer the question :P And BTW, the 8086 ISA was almost entirely architected on paper [by Stephen Morse, apparently before silicon design was done](//retrocomputing.stackexchange.com/posts/comments/11346). He probably expected speed benefits, too, though. — Peter Cordes, Feb 14 '22 at 20:34

Peter Cordes · Answer 1 · 2022-02-14T20:53:13.760

It's not decode that's the problem, it's usually fetch on 8086.

Starting two separate instruction-decode operations probably is more expensive than just fetching more microcode for one loop instruction. I'd guess that's what accounts for the numbers in the table below that don't include code-fetch bottlenecks.

Equally or more importantly, 8086 is often bottlenecked by memory access, including code-fetch. (8088 almost always is, breathing through a straw with it's 8-bit bus, unlike 8086's 16-bit bus).

dec cx is 1 byte, jnz rel8 is 2 bytes.
So 3 bytes total, vs. 2 for loop rel8.

8086 performance can be approximated by counting memory accesses and multiply by four, since its 6-byte instruction prefetch buffer allows it to overlap code-fetch with decode and execution of other instructions. (Except for very slow instructions like mul that would let the buffer fill up after at most three 2-byte fetches.)

See also Increasing Efficiency of binary -> gray code for 8086 for an example of optimizing something for 8086, with links to more resources like tables of instruction timings.

https://www2.math.uni-wuppertal.de/~fpf/Uebungen/GdR-SS02/opcode_i.html has instruction timings for 8086 (taken from Intel manuals I think, as cited in njuffa's answer), but those are only execution, when fetch isn't a bottleneck. (i.e. just decoding from the prefetch buffer.)

Decode / execute timings, not including fetch:

DEC     Decrement

    operand     bytes   8088    186     286     386     486     Pentium
    r8           2       3       3       2       2       1       1   UV
    r16          1       3       3       2       2       1       1   UV
    r32          1       3       3       2       2       1       1   UV
    mem       2+d(0,2)  23+EA   15       7       6       3       3   UV

Jcc     Jump on condition code

    operand     bytes   8088    186     286     386     486     Pentium
    near8        2      4/16    4/13    3/7+m   3/7+m   1/3     1    PV
    near16       3       -       -       -      3/7+m   1/3     1    PV

LOOP    Loop control with CX counter

      operand   bytes   8088    186     286     386     486     Pentium
      short      2      5/17    5/15    4/8+m   11+m    6/7     5/6  NP

So even ignoring code-fetch differences:

dec + taken jnz takes 3 + 16 = 19 cycles to decode / exec on 8086 / 8088.
taken loop takes 17 cycles to decode / exec on 8086 / 8088.

(Taken branches are slow on 8086, and discard the prefetch buffer; there's no branch prediction. IDK if those timings include any of that penalty, since they apparently don't for other instructions and non-taken branches.)

8088/8086 are not pipelined except for the code-prefetch buffer. Finishing execution of one instruction and starting decode / exec of the next take it some time; even the cheapest instructions (like mov reg,reg / shift / rotate / stc/std / etc.) take 2 cycles. Bizarrely more than nop (3 cycles).

I assumed the 3 cycles for `nop` is because it's really `xchg ax, ax`, and `xchg ax, reg` takes 3 cycles in all other cases. — Nate Eldredge, Feb 15 '22 at 07:10
@NateEldredge: Oh right, not special-casing `0x90` makes perfect sense, and isn't necessary for correctness until x86-64. :P — Peter Cordes, Feb 15 '22 at 17:41

score 1 · Answer 2 · answered Feb 14 '22 at 19:26

I presume that its decoding should also be more complex

There's no reason that the decoding is more complex for the loop instruction. This instruction has to do multiple things, but decoding is not at issue — it should decode as easily as JMP, since there's just the opcode and the one operand, the branch target, like JMP.

Saving one instruction's fetch & decode probably accounts for the speed improvement, since in execution they are effectively equivalent.

njuffa · Answer 3 · 2022-02-14T20:38:13.113

1

Looking at the "8086/8088 User's Manual: Programmer's and Hardware Reference" (Intel 1989) confirms that LOOP is marginally faster than the combination DEC CX; JNZ. DEC takes 3 clock cycles, JNZ takes 4 (not taken) or 16 (taken) cycles. So the combination requires 7 or 19 cycles. LOOP on the other hand requires 5 cycles (not taken) or 17 cycles (taken), for a saving of 2 cycles.

I do not see anything in the manual that describes why LOOP is faster. The faster instruction fetch due to the reduced number of opcode bytes seems like a reasonable hypothesis.

According to the "80286 and 80287 Programmer's Reference Manual" (Intel 1987), LOOP still has a slight advantage over the discrete replacement, in that it requires 8 cycles when taken and 4 cycles when not taken, while the combo requires 1 cycle more in both cases (DEC 2 cycles; JNZ 7 or 3 cycles).

The 8086 microcode has been disassembled, so one could theoretically take a look at the internal sequence of operations for both of these cases to establish exactly why LOOP is faster, if one is so inclined.

edited Feb 14 '22 at 20:38

answered Feb 14 '22 at 20:15

njuffa

23,970
4
78
130

without having to take a detour through the CX register which in the combo case is first written to, then read from again" what? JNZ looks at the FLAGS not the contents of CX. And the flags are definitely produced from ALU output, not by a read dependent on write of the register. – Ben Voigt Feb 14 '22 at 20:21
@Ben Voigt. Thanks, will fix. I am still on my first coffee, brain doesn't work correctly yet. I was probably thinking about `REP` mechanism which needs to update `CX`. – njuffa Feb 14 '22 at 20:35
Those timing numbers don't fully (or at all) include possible code-fetch bottleneck; e.g. every memory-fetch takes 4 cycles, so a 2-byte `add reg,reg` that takes 3 cycles according to that manual can actually only execute at one per 4 cycles on 8086, or one per 8 cycles on 8088 *sustained* throughput. – Peter Cordes Feb 14 '22 at 20:43
1

@PeterCordes I last dealt with the 8086 *thirty* years ago. I collected my own instruction timings at the time using proper microbenchmarks. I don't have that data any more, and *my* recollection is way too vague to convincingly confirm or refute such an assertion. I therefore thought it best to quote from official documentation and leave it at that. – njuffa Feb 14 '22 at 20:48
Sure, but you might want to link [When specifying Intel 80x86 instruction execution time, what is included in the cycle count?](https://retrocomputing.stackexchange.com/q/3255) on retrocomputing. – Peter Cordes Feb 14 '22 at 20:54
@PeterCordes I can only link to things I am aware of. Since you are aware, I think I can safely assume you included that link in your own answer. – njuffa Feb 14 '22 at 21:03
Yes, code-fetch is a major point of my answer. But you're aware now, too, so I thought you might want to edit your answer. I'm not saying you shouldn't have posted your answer that way in the first place, just that it turns out to be slightly misleading. – Peter Cordes Feb 14 '22 at 21:06

Why is LOOP faster than DEC,JNZ on 8086?

3 Answers3

It's not decode that's the problem, it's usually fetch on 8086.

Decode / execute timings, not including fetch:

Linked