I noticed a weird thing today. When copying a long double [1], gcc, clang and icc all generate fld and fstp instructions with TBYTE memory operands.
That is, the following function:
void copy_prim(long double *dst, long double *src) {
    *src = *dst;
}
generates the following assembly:
copy_prim(long double*, long double*):
    fld TBYTE PTR [rdi]
    fstp TBYTE PTR [rsi]
    ret
According to Agner Fog's instruction tables, this is a poor choice for performance: fld takes four uops (none fused) and fstp takes a whopping seven uops (none fused), versus, say, a single fused uop each for a movaps load to and a movaps store from an xmm register.
Interestingly, clang starts using movaps as soon as you put the long double in a struct. The following code:
struct long_double {
    long double x;
};
void copy_ld(long_double *dst, long_double *src) {
    *src = *dst;
}
With gcc and icc this compiles to the same fld/fstp sequence shown earlier, but clang now emits:
copy_ld(long_double*, long_double*):
    movaps xmm0, xmmword ptr [rdi]
    movaps xmmword ptr [rsi], xmm0
    ret
Oddly, if you stuff an additional int member into the struct (which doubles its size to 32 bytes due to alignment), all compilers generate SSE-only copy code:
copy_ldi(long_double_int*, long_double_int*):
    movdqa xmm0, XMMWORD PTR [rdi]
    movaps XMMWORD PTR [rsi], xmm0
    movdqa xmm0, XMMWORD PTR [rdi+16]
    movaps XMMWORD PTR [rsi+16], xmm0
    ret
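The source for copy_ldi is not shown above. A minimal sketch of what it presumably looks like, assuming the extra member is a plain int (the member names x and i are hypothetical) and the same parameter naming as the earlier functions, where *dst is copied into *src:

```cpp
#include <cassert>

// Hypothetical reconstruction: the member names x and i are assumptions.
struct long_double_int {
    long double x;
    int i;
};

// Trailing padding rounds the struct up to a multiple of its alignment
// (32 bytes on x86-64 System V, where alignof(long double) == 16).
static_assert(sizeof(long_double_int) % alignof(long_double_int) == 0,
              "struct size must be a multiple of its alignment");

// Same shape as copy_ld: note that it copies *dst into *src.
void copy_ldi(long_double_int *dst, long_double_int *src) {
    *src = *dst;
}

// Small self-check that both members survive the copy.
void check_copy_ldi() {
    long_double_int from{1.5L, 7};
    long_double_int to{0.0L, 0};
    copy_ldi(&from, &to);
    assert(to.x == 1.5L);
    assert(to.i == 7);
}
```

Compiling this at -O2 and inspecting copy_ldi should let you compare against the listing above; the exact instructions may differ by compiler version.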
Is there any functional reason to copy floating point values with fld and fstp, or is it just a missed optimization?
[1] Although a long double (i.e., the x86 extended-precision float) is nominally 10 bytes on x86, it has sizeof == 16 and alignof == 16 on x86-64, since alignments have to be a power of two and the size must be a multiple of the alignment.
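The layout claims in this footnote can be checked at compile time. The sketch below is an illustration only: the helper names is_x87_extended and x87_padding_bytes are made up for this example, and the 10-byte figure is hard-coded as an assumption about the x87 format rather than queried from the compiler.

```cpp
#include <cfloat>
#include <cstddef>
#include <cassert>

// Language rule: sizeof(T) is always a multiple of alignof(T).
static_assert(sizeof(long double) % alignof(long double) == 0,
              "size must be a multiple of alignment");

// Alignments are powers of two.
static_assert((alignof(long double) & (alignof(long double) - 1)) == 0,
              "alignment must be a power of two");

// True when long double has a 64-bit significand, as the x87 80-bit
// extended format does. Hypothetical helper for this example.
constexpr bool is_x87_extended() {
    return LDBL_MANT_DIG == 64;
}

// Bytes of the object beyond the 10 value bytes, assuming the x87 format;
// returns 0 on targets where long double is some other format.
constexpr std::size_t x87_padding_bytes() {
    return is_x87_extended() ? sizeof(long double) - 10 : 0;
}
```

On x86-64 Linux with gcc or clang, I would expect x87_padding_bytes() to come out as 6, i.e. the 16-byte object holds 10 value bytes plus padding, but the result depends on the target ABI.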