Loading and storing long doubles in x86-64

Question

I noticed a weird thing today. When copying a long double¹ all of gcc, clang and icc generate fld and fstp instructions, with TBYTE memory operands.

That is, the following function:

void copy_prim(long double *dst, long double *src) {
    *src = *dst;
}

Generates the following assembly:

copy_prim(long double*, long double*):
  fld TBYTE PTR [rdi]
  fstp TBYTE PTR [rsi]
  ret

Now according to Agner's tables this is a poor choice for performance, as fld takes four uops (none fused) and fstp takes a whopping seven uops (none fused) versus say a single fused uop each for movaps to/from an xmm register.

Interestingly, clang starts using movaps as soon as you put the long double in a struct. The following code:

struct long_double {
    long double x;
};

void copy_ld(long_double *dst, long_double *src) {
    *src = *dst;
}

This compiles to the same fld/fstp assembly as shown previously for gcc and icc, but clang now uses:

copy_ld(long_double*, long_double*):
  movaps xmm0, xmmword ptr [rdi]
  movaps xmmword ptr [rsi], xmm0
  ret

Oddly, if you stuff an additional int member into the struct (which doubles its size to 32 bytes due to alignment), all compilers generate SSE-only copy code:

copy_ldi(long_double_int*, long_double_int*):
  movdqa xmm0, XMMWORD PTR [rdi]
  movaps XMMWORD PTR [rsi], xmm0
  movdqa xmm0, XMMWORD PTR [rdi+16]
  movaps XMMWORD PTR [rsi+16], xmm0
  ret
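The question does not show the source for this last case, so here is a minimal sketch of what it presumably looks like. The member name `y` and the parameter names `from` and `to` are my own labels (hypothetical), chosen so that the first argument (rdi) is the one being read and the second (rsi) the one being written, as in the assembly above.

```cpp
// Assumed definition: the question only says "an additional int member".
struct long_double_int {
    long double x;
    int y;  // hypothetical member name
};

// Whole-struct copy; the value flows from the first pointer to the second.
void copy_ldi(long_double_int *from, long_double_int *to) {
    *to = *from;
}
```

On x86-64 System V the trailing int forces the struct to be padded out to the next multiple of the 16-byte alignment, which is where the 32 bytes come from.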

Is there any functional reason to copy floating point values with fld and fstp, or is it just a missed optimization?

¹ Although a long double (i.e., x86 extended precision float) is nominally 10 bytes on x86, it has sizeof == 16 and alignof == 16 since alignments have to be a power of two and the size must usually be at least as large as the alignment.
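A quick way to check the footnote's claim on a given toolchain is to ask the compiler directly. This is a small sketch; the helper name `long_double_is_padded_x87` is hypothetical, and the result is only expected to be true on targets that use the x87 extended format with 16-byte padding (for example x86-64 Linux with gcc or clang), not on every platform.

```cpp
#include <cfloat>

// Hypothetical helper: true when long double has the layout described in the
// footnote (80-bit x87 value padded out to 16 bytes, 16-byte aligned).
constexpr bool long_double_is_padded_x87() {
    return LDBL_MANT_DIG == 64           // 64-bit significand: x87 extended precision
        && sizeof(long double) == 16     // 10 value bytes plus 6 padding bytes
        && alignof(long double) == 16;   // alignment is a power of two
}
```

Where it returns true, 6 of the 16 bytes are padding that a copy does not strictly need to preserve.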

A 10-byte store (8 + 2 I assume) and a 16-byte reload hits a store-forwarding stall. Other than that, seems like pure missed optimization to use the default code-gen for cases where you aren't going to operate on it. — Peter Cordes, Nov 28 '17 at 21:20
This reminds me of the missed-optimizations for `atomic` load/store: often bouncing to integer registers even when it doesn't need to CAS, just `mov`. https://stackoverflow.com/questions/45055402/atomic-double-floating-point-or-sse-avx-vector-load-store-on-x86-64 — Peter Cordes, Nov 28 '17 at 21:22
It's weird how tunneling through a `struct` sometimes avoids it. It seems like what happens is that "scalarization" is kicking in for `gcc` so that the simple struct with one `long double` just ends up looking like a `long double` and then goes back to the bad codegen (but not on clang). When you add enough other stuff, that stops and it goes to the usual struct copy logic which is much better. Oddly `icc` still handles _some_ more complex structs weirdly, like [this one](https://godbolt.org/g/1FtsyW). Try removing or adding an `int` member and the code totally changes. — BeeOnRope, Nov 28 '17 at 21:27
I think you're just seeing that compilers know how to copy whole structs around, regardless of contents. You're just getting the default code-gen for loading "a struct", or "a long double" when the compiler sees through the struct and "optimizes" it to what it would do for a single primitive type. It's only with `long double` that this is particularly bad. (Although really copying around `double` with SSE2 instead of x87 is also better, even with `-mfpmath=387`. There's no actual ALU uop, but the store-reload latency is higher by 1c for `fld`/`fstp` than `movq`/`movq` (SKL, from Agner Fog).) — Peter Cordes, Nov 29 '17 at 04:19
@peter Did you check out the weird ICC behavior for `long double` plus 4 `int`s? — BeeOnRope, Nov 29 '17 at 05:47
No, I hadn't looked at that. Looks like when there's no padding, ICC "sees through" the struct and shoots itself in the foot. ICC is very good at auto-vectorizing (including search loops with data-dependent trip counts), but worse than gcc/clang at a lot of other stuff. (And BTW, ICC18 properly supports `-march=skylake` and so on now. ICC17 only seemed to recognize `-march=native` on Godbolt, or maybe some weird stuff like `corei7-avx` but not `skylake-avx512`. But that only affects code-gen if there's any padding: https://godbolt.org/g/SttDQT) — Peter Cordes, Nov 29 '17 at 16:38

score 1 · Accepted Answer · answered Nov 29 '17 at 17:20

It looks like a big missed optimization for code that needs to copy a long double without processing it. The fstp m80/fld m80 round trip has a latency of 8 cycles on Skylake, vs. 5 for movdqa store-forwarding from store to reload. More importantly, Agner lists fstp m80 at one per 5 clocks throughput, so there's something non-pipelined going on!

The only possible benefit I can think of is store-forwarding from a still-in-flight long double store. Consider a data-dependency chain that involves some x87 math, a long double store, then your function, then a long double load and more x87 math. According to Agner's tables, fld/fstp will add 8 cycles, but movdqa will see a store-forwarding stall and add 5 + 11 cycles or so for slow-path store-forwarding.

Probably the lowest latency strategy to copy an m80 would be 64-bit + 16-bit integer mov/movzx load/store instructions. We know that fstp m80 and fld m80 use 2 separate store-data (port 4) or load (p23) uops, and I think we can assume it's broken up as 64-bit mantissa and 16-bit sign:exponent.
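At the source level, that 64-bit + 16-bit split could be sketched roughly as follows. The helper name `copy_m80_int` is hypothetical, and the code assumes the little-endian x86 layout (bytes 0-7 hold the significand, bytes 8-9 hold sign:exponent). On any other long double format it falls back to a plain assignment. Whether a compiler actually emits a mov/movzx pair for this, and whether it ends up faster, is untested.

```cpp
#include <cfloat>
#include <cstdint>
#include <cstring>

// Hypothetical helper: copy only the 10 value bytes of an x87 extended-precision
// long double using one 64-bit and one 16-bit integer-sized move.
// Padding bytes in *dst are left untouched.
inline void copy_m80_int(long double *dst, const long double *src) {
    if (LDBL_MANT_DIG != 64) {  // not the 80-bit x87 format: plain copy
        *dst = *src;
        return;
    }
    const unsigned char *s = reinterpret_cast<const unsigned char *>(src);
    unsigned char *d = reinterpret_cast<unsigned char *>(dst);
    std::uint64_t significand;
    std::uint16_t sign_exponent;
    std::memcpy(&significand, s, sizeof significand);          // 64-bit load
    std::memcpy(&sign_exponent, s + 8, sizeof sign_exponent);  // 16-bit load
    std::memcpy(d, &significand, sizeof significand);          // 64-bit store
    std::memcpy(d + 8, &sign_exponent, sizeof sign_exponent);  // 16-bit store
}
```

Because each piece matches one of the two store-data uops of an earlier fstp m80 (if the assumed split is right), both loads should be able to store-forward from it.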

Of course for throughput, and latency in cases other than store-forwarding, movdqa seems like by far the best choice because as you point out, the ABI guarantees 16-byte alignment. A 16-byte store can forward to a fld m80.
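One way to ask for a whole-object copy from source code is a fixed-size memcpy, sketched below with a hypothetical helper name `copy_ld_bytes`. On x86-64, gcc and clang typically lower a 16-byte memcpy to a single SSE load/store pair (often the unaligned forms, since memcpy itself carries no alignment information), but that is an expectation to check in the generated assembly, not a guarantee.

```cpp
#include <cstring>

// Hypothetical helper: copy the full object representation of a long double
// (value bytes plus any padding) as opaque bytes, with no x87 involvement
// required by the source.
inline void copy_ld_bytes(long double *dst, const long double *src) {
    std::memcpy(dst, src, sizeof(long double));
}
```

Unlike fld/fstp, this also copies the padding bytes, which is harmless for the value but means the destination is byte-for-byte identical to the source.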

The same argument applies for copying double or float with integer vs. x87 (e.g. 32-bit code): fld m32/fstp m32 has 1 cycle higher round-trip latency than SSE movd, and 2 cycles higher latency than integer mov on Sandybridge-family CPUs. (Unlike PowerPC / Cell load-hit-store, there's no penalty for store-forwarding from FP stores to integer loads. x86's strong memory ordering model wouldn't allow separate store buffers for FP vs. integer, if that's what PPC does.)

Once the compiler realizes that it's not going to use any FP instructions on a float / double / long double, it should usually replace the load/store with non-x87. But copying a double or float with x87 is fine if integer / SSE register pressure is a problem.

Integer register pressure in 32-bit code is almost always high, and -mfpmath=sse is the default for 64-bit code. You could imagine rare cases where using x87 to copy a double in 64-bit code would be worth it, but compilers would be more likely to make things worse than better if they went looking for places to use x87. gcc has -mfpmath=sse+387, but it's not usually very good. (And that's not even considering physical register file pressure from using x87 + SSE. Hopefully an "empty" x87 state doesn't use any physical registers. xsave knows about parts of the architectural state being empty so it can avoid saving them...)
