Use movzx to load narrow data on modern CPUs.  (Or movsx if it's useful to have it sign-extended instead of zero-extended, but movzx is sometimes faster and never slower.)
movzx is only slow on the ancient P5 (original Pentium) microarchitecture, not anything made this century.  Pentium-branded CPUs based on recent microarchitectures, like Pentium G3258 (Haswell, 20th anniversary edition of the original Pentium), are totally different beasts, and perform like the equivalent i3 but without AVX, BMI1/2, or hyperthreading.
Don't tune modern code based on P5 guidelines / numbers.  However, Knight's Corner (Xeon Phi) is based on a modified P54C microarchitecture, so perhaps it has slow movzx as well.  Neither Agner Fog nor InstLatX64 has per-instruction throughput / latency numbers for KNC.
Using a 16-bit operand size instruction doesn't switch the whole pipeline over to 16-bit mode or cause a big perf hit.  See Agner Fog's microarch pdf to learn exactly what is and isn't slow on various x86 CPU microarchitectures (including ones as old as Intel P5 (original Pentium) which you seem to be talking about for some reason).
Writing a 16-bit register and then reading the full 32/64-bit register is slow on some CPUs (a partial-register stall when merging, on Intel P6-family).  On others, writing a 16-bit register merges into the old value, so there's a false dependency on the old value of the full register every time you write it, even if you never read the full register.  See Agner Fog's microarch guide for which CPU does what.  (Note that Haswell/Skylake only rename AH separately; Sandybridge, like Core2/Nehalem, also renames AL / AX separately from RAX, but merges without stalling.)
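For example, a minimal sketch of the problem pattern (using the same src1 label as the code below):
mov   ax, word [src1]    ; writes only AX: a false dependency on the old EAX/RAX
                         ; on CPUs that don't rename AX separately
mov   edx, eax           ; reads the full EAX: partial-register stall (merge) on P6-family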
Unless you specifically care about in-order P5 (or possibly Knight's Corner Xeon Phi, based on the same core, but IDK if movzx is slow there, too), USE THIS:
movzx   eax, word [src1]        ; as efficient as a 32-bit MOV load on most CPUs
cmp      ax, word [src2]
The operand-size prefix for cmp decodes efficiently on all modern CPUs.  Reading a 16-bit register after writing the full register is always fine, and the 16-bit load for the other operand is also fine.
The operand-size prefix isn't length-changing because there's no imm16 / imm32.  e.g. cmp word [src2], 0x7F is fine (it can use a sign-extended imm8), but
cmp word [src2], 0x80 needs an imm16 and will LCP-stall on some Intel CPUs.  (Without the operand-size prefix, the same opcode would have an imm32, i.e. the rest of the instruction would be a different length).  Instead, use mov eax, 0x80 / cmp word [src2], ax.
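For example, as code (each of the first two cmp lines is a stand-alone alternative, not a sequence):
cmp  word [src2], 0x7F   ; fine: fits in a sign-extended imm8, no length change from the 66 prefix
cmp  word [src2], 0x80   ; imm16: LCP stall in the decoders on some Intel CPUs
mov  eax, 0x80           ; workaround: put the immediate in a register first
cmp  word [src2], ax     ; 16-bit compare with no immediate, so no LCP stall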
The address-size prefix can be length-changing in 32-bit mode (disp32 vs. disp16), but we don't want to use 16-bit addressing modes to access 16-bit data.  We're still using [ebx+1234] (or rbx), not [bx+1234].
On modern x86 (Intel P6 / SnB-family / Atom / Silvermont, and AMD since at least K7, i.e. anything made in this century, newer than the actual P5 Pentium), movzx loads are very efficient.
On many CPUs, the load ports directly support movzx (and sometimes also movsx), so it runs as just a load uop, not as a load + ALU.
Data from Agner Fog's instruction-set tables:  Note they may not cover every corner case, e.g. mov-load numbers might only be for 32 / 64-bit loads.  Also note that Agner Fog's load latency numbers are not load-use latency from L1D cache; they only make sense as part of the store/reload (store-forwarding) latency, but relative numbers will tell us how many cycles movzx adds on top of mov (often no extra cycles).
(Update: https://uops.info/ has better test results that actually reflect load-use latency, and they're automated so typos and clerical errors in updating the spreadsheets aren't a problem.  But uops.info only goes back to Conroe (first-gen Core 2) for Intel, and only Zen for AMD.)
- P5 Pentium (in-order execution): movzx-load is a 3-cycle instruction (plus a decode bottleneck from the 0F prefix), vs. mov-loads having single-cycle throughput.  (They still have latency, though.)

- Intel:
  - PPro / Pentium II / III: movzx / movsx run on just a load port, same throughput as plain mov.
  - Core2 / Nehalem: same, including 64-bit movsxd, except on Core 2 where a movsxd r64, m32 load costs a load + ALU uop, and the two don't micro-fuse.
  - Sandybridge-family (SnB through Skylake and later): movzx / movsx loads are single-uop (just a load port), and perform identically to mov loads.
  - Pentium 4 (NetBurst): movzx runs on the load port only, same perf as mov.  movsx is load + ALU, and takes 1 extra cycle.
  - Atom (in-order): Agner's table is unclear about whether memory-source movzx / movsx need an ALU, but they're definitely fast.  The latency number is only for reg,reg.
  - Silvermont: same as Atom: fast, but unclear about which ports are needed.
  - KNL (based on Silvermont): Agner lists movzx / movsx with a memory source as using IP0 (ALU), but latency is the same as mov r,m so there's no penalty.  (Execution-unit pressure is not a problem because KNL's decoders can barely keep its 2 ALUs fed anyway.)

- AMD:
  - Bobcat: movzx / movsx loads are 1 per clock, 5 cycle latency.  mov-load is 4c latency.
  - Jaguar: movzx / movsx loads are 1 per clock, 4 cycle latency.  mov loads are 1 per clock, 3c latency for 32/64-bit, or 4c for mov r8/r16, m (but still only an AGU port, not an ALU merge like Haswell/Skylake do).
  - K7/K8/K10: movzx / movsx loads have 2-per-clock throughput, latency 1 cycle higher than a mov load.  They use an AGU and an ALU.
  - Bulldozer-family: same as K10, but movsx-load has 5 cycle latency.  movzx-load has 4 cycle latency, mov-load has 3 cycle latency.  So in theory it might be lower latency to mov cx, word [mem] and then movsx eax, cx (1 cycle), if the false dependency from a 16-bit mov load doesn't require an extra ALU merge or create a loop-carried dependency for your loop.  (See the sketch after this list.)
  - Ryzen: movzx / movsx loads run in the load port only, same latency as mov loads.

- VIA:
  - VIA Nano 2000/3000: movzx runs on the load port only, same latency as mov loads.  movsx is LD + ALU, with 1c extra latency.
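A minimal (untested) sketch of that Bulldozer-family idea, assuming the false dependency on the old value of RCX is harmless:
mov    cx, word [mem]    ; 3c mov-load, but merges into the old value of RCX
movsx  eax, cx           ; 1c reg,reg sign extension: 4c total vs. 5c for movsx eax, word [mem]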
 
When I say "perform identically", I mean not counting any partial-register penalties or cache-line splits from a wider load.  e.g. a movzx eax, word [rsi] avoids a merging penalty vs mov ax, word [rsi] on Skylake, but I'll still say that mov performs identically to movzx.  (I guess I mean that mov eax, dword [rsi] without any cache-line splits is as fast as movzx eax, word [rsi].)
xor-zeroing the full register before writing a 16-bit register avoids a later partial-register merging stall on Intel P6-family, as well as breaking false dependencies.
If you want to run well on P5 as well, this might be somewhat better there, while not being much worse on any modern CPU.  The exception is PPro through PIII, where xor-zeroing isn't dependency-breaking, even though it is still recognized as a zeroing idiom that makes EAX equivalent to AX (so there's no partial-register stall when reading EAX after writing AL or AX).
;; Probably not a good idea, maybe not faster on anything.
;mov  eax, 0             ; some code tuned for PIII used *both* this and xor-zeroing.
xor   eax, eax           ; *not* dep-breaking on early P6 (up to PIII)
mov    ax, word [src1]
cmp    ax, word [src2]
; safe to read EAX without partial-reg stalls
The operand-size prefix isn't ideal for P5, so you could consider using a 32-bit load if you're sure it doesn't fault, cross a cache-line boundary, or cause a store-forwarding failure from a recent 16-bit store.
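A minimal sketch of that, assuming the 2 bytes past src1 are safe to read (mapped, no cache-line split, and not written by a recent narrow store):
mov   eax, dword [src1]  ; plain 32-bit load: no 0F escape byte, no operand-size prefix
cmp   ax, word [src2]    ; the high garbage bytes in EAX are simply ignored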
Actually, I think a 16-bit mov load might be slower on Pentium than the movzx/cmp 2 instruction sequence.  There really doesn't seem to be a good option for working with 16-bit data as efficiently as 32-bit!  (Other than packed MMX stuff, of course).
See Agner Fog's guide for the Pentium details, but the operand-size prefix takes an extra 2 cycles to decode on P1 (original P5) and PMMX, so this sequence may actually be worse than a movzx load.  On P1 (but not PMMX), the 0F escape byte (used by movzx) also counts as a prefix, taking an extra cycle to decode.
Apparently movzx isn't pairable anyway.  Multi-cycle movzx will hide the decode latency of cmp ax, [src2], so movzx / cmp is probably still the best choice.  Or schedule instructions so the movzx is done earlier and the cmp can maybe pair with something.  Anyway, the scheduling rules are quite complicated for P1/PMMX.
I timed this loop on Core2 (Conroe) to prove that xor-zeroing avoids partial register stalls for 16-bit registers as well as low-8 (like for setcc al):
mov     ebp, 100000000
ALIGN 32
.loop:
%rep 4
    xor   eax, eax
;    mov   eax, 1234    ; just break dep on the old value, not a zeroing idiom
    mov   ax, cx        ; write AX
    mov   edx, eax      ; read EAX
%endrep
    dec   ebp           ; Core2 can't fuse dec / jcc even in 32-bit mode
    jg   .loop          ; but SnB does
perf stat -r4 ./testloop output for this, in a static binary that makes a sys_exit system call after the loop:
 ;; Core2 (Conroe) with   XOR eax, eax
       469,277,071      cycles                    #    2.396 GHz
     1,400,878,601      instructions              #    2.98  insns per cycle
       100,156,594      branches                  #  511.462 M/sec
             9,624      branch-misses             #    0.01% of all branches
       0.196930345 seconds time elapsed                                          ( +-  0.23% )
2.98 instructions per cycle makes sense: 3 ALU ports, all instructions are ALU, and there's no macro-fusion, so each is 1 uop.  So we're running at 3/4 of the front-end capacity.  The loop has 3*4 + 2 instructions / uops.
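(That's 14 instructions per iteration: 469M cycles / 100M iterations ≈ 4.7 cycles per iteration, and 14 / 4.7 ≈ 2.98 IPC.)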
Things are very different on Core2 with the xor-zeroing commented and using the mov eax, imm32 instead:
 ;; Core2 (Conroe) with   MOV eax, 1234
 1,553,478,677      cycles                    #    2.392 GHz
 1,401,444,906      instructions              #    0.90  insns per cycle
   100,263,580      branches                  #  154.364 M/sec
        15,769      branch-misses             #    0.02% of all branches
   0.653634874 seconds time elapsed                                          ( +-  0.19% )
0.9 IPC (down from 3) is consistent with the front-end stalling for 2 to 3 cycles to insert a merging uop on every mov   edx, eax.
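(Roughly (1,553M - 469M) extra cycles / (4 merges per iteration × 100M iterations) ≈ 2.7 extra cycles per merge.)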
Skylake runs both loops identically, because mov eax,imm32 is still dependency-breaking.  (Like most instructions with a write-only destination, but beware of false dependencies from popcnt and lzcnt/tzcnt).
Actually, the uops_executed.thread perf counter does show a difference: on SnB-family, xor-zeroing doesn't take an execution unit because it's handled in the issue/rename stage.  (mov    edx,eax is also eliminated at rename, so the uop count is actually quite low).  The cycle count is the same to within less than 1% either way.
 ;;; Skylake (i7-6700k) with xor-zeroing
 Performance counter stats for './testloop' (4 runs):
         84.257964      task-clock (msec)         #    0.998 CPUs utilized            ( +-  0.21% )
                 0      context-switches          #    0.006 K/sec                    ( +- 57.74% )
                 0      cpu-migrations            #    0.000 K/sec                  
                 3      page-faults               #    0.036 K/sec                  
       328,337,097      cycles                    #    3.897 GHz                      ( +-  0.21% )
       100,034,686      branches                  # 1187.243 M/sec                    ( +-  0.00% )
     1,400,195,109      instructions              #    4.26  insn per cycle           ( +-  0.00% )  ## dec/jg fuses into 1 uop
     1,300,325,848      uops_issued_any           # 15432.676 M/sec                   ( +-  0.00% )    ###   fused-domain
       500,323,306      uops_executed_thread      # 5937.994 M/sec                    ( +-  0.00% )    ### unfused-domain
                 0      lsd_uops                  #    0.000 K/sec                  
       0.084390201 seconds time elapsed                                          ( +-  0.22% )
lsd.uops is zero because the loop buffer is disabled by a microcode update.  This bottlenecks on the front-end: uops (fused-domain) / clock = 3.960 (out of 4).  That last .04 might be partly OS overhead (interrupts and so on), because this is only counting user-space uops.