With GCC 5.3 the following code compiled with -O3 -mfma
float mul_add(float a, float b, float c) {
  return a*b + c;
}
produces the following assembly
vfmadd132ss     %xmm1, %xmm2, %xmm0
ret
I noticed that GCC was already doing this with -O3 in GCC 4.8.
Clang 3.7 with -O3 -mfma produces
vmulss  %xmm1, %xmm0, %xmm0
vaddss  %xmm2, %xmm0, %xmm0
retq
but Clang 3.7 with -Ofast -mfma produces the same code as GCC with -O3.
I am surprised that GCC does this with -O3, because this answer says
The compiler is not allowed to fuse a separated add and multiply unless you allow for a relaxed floating-point model.
This is because an FMA has only one rounding, while an ADD + MUL has two. So the compiler will violate strict IEEE floating-point behaviour by fusing.
However, this link says
Regardless of the value of FLT_EVAL_METHOD, any floating-point expression may be contracted, that is, calculated as if all intermediate results have infinite range and precision.
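To make the rounding difference behind the two quotes concrete, here is a minimal sketch. The function names are mine, and the single-rounding version is emulated in double precision rather than using a real FMA instruction, so it only stands in for a fused operation on inputs like the ones chosen here.

```c
#include <assert.h>
#include <float.h>

/* Two roundings: the product is rounded to float, then the sum is rounded
   again. The volatile store forces the intermediate rounding and keeps the
   compiler from contracting the expression into an FMA. */
float mul_add_two_roundings(float a, float b, float c) {
    volatile float product = a * b;
    return product + c;
}

/* One rounding, emulated in double (an assumption for illustration, not a
   true FMA): a float*float product is exact in double, and for the inputs
   used below the double addition is exact too, so only the final
   conversion to float rounds. */
float mul_add_one_rounding(float a, float b, float c) {
    double exact_product = (double)a * (double)b;
    return (float)(exact_product + (double)c);
}

/* Self-check: call this from main(). */
void check_rounding_difference(void) {
    const float a = 1.0f + FLT_EPSILON;            /* 1 + 2^-23 */
    const float c = -(1.0f + 2.0f * FLT_EPSILON);  /* -(1 + 2^-22) */

    /* Exactly, a*a = 1 + 2^-22 + 2^-46. Rounding the product to float
       drops the 2^-46 term, so the separate multiply and add give 0. */
    assert(mul_add_two_roundings(a, a, c) == 0.0f);

    /* With a single rounding at the end, the 2^-46 term survives. */
    assert(mul_add_one_rounding(a, a, c) == 0x1p-46f);
}
```

So the same expression can give 0 or 2^-46 depending on whether the product is rounded before the add, which is why it matters whether the compiler is allowed to fuse.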
So now I am confused and concerned.
- Is GCC justified in using FMA with -O3?
- Does fusing violate strict IEEE floating-point behaviour?
- If fusing does violate IEEE floating-point behaviour, then since GCC defines __STDC_IEC_559__, isn't this a contradiction?
Since FMA can be emulated in software, it seems to me that there should be two compiler switches for FMA: one to tell the compiler to use FMA in calculations and one to tell the compiler that the hardware has FMA.
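As far as I can tell, the "hardware has FMA" half of that can at least be observed from inside the code: GCC and Clang predefine the macro __FMA__ when -mfma is in effect. A minimal sketch (the function name is hypothetical, and this says nothing about whether the compiler will actually contract a*b + c):

```c
/* Returns 1 if the translation unit was compiled with FMA instructions
   enabled for the target (e.g. -mfma on GCC/Clang, which predefines
   __FMA__), and 0 otherwise. This reports only what the target allows,
   not whether contraction of a*b + c is permitted. */
int fma_target_enabled(void) {
#if defined(__FMA__)
    return 1;
#else
    return 0;
#endif
}
```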
Apparently this can be controlled with the option -ffp-contract. With GCC the default is -ffp-contract=fast and with Clang it's not. Other options such as -ffp-contract=on and -ffp-contract=off do not produce the FMA instruction.
For example Clang 3.7 with -O3 -mfma -ffp-contract=fast produces vfmadd132ss.
I checked some permutations of #pragma STDC FP_CONTRACT set to ON and OFF with -ffp-contract set to on, off, and fast. In all cases I also used -O3 -mfma.
With GCC the answer is simple. #pragma STDC FP_CONTRACT ON or OFF makes no difference. Only -ffp-contract matters.
GCC uses fma
- with -ffp-contract=fast (default).
Clang uses fma
- with -ffp-contract=fast.
- with -ffp-contract=on (default) and #pragma STDC FP_CONTRACT ON (default is OFF).
In other words, with Clang you can get fma with #pragma STDC FP_CONTRACT ON (since -ffp-contract=on is the default) or with -ffp-contract=fast. -ffast-math (and hence -Ofast) sets -ffp-contract=fast.
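For reference, this is roughly the shape of the test I mean. It is a sketch, and the two function names are just for illustration; the pragma is placed at the start of each function body, which the C standard allows.

```c
/* Contraction explicitly allowed for this function. */
float mul_add_contract_on(float a, float b, float c) {
    #pragma STDC FP_CONTRACT ON
    return a * b + c;
}

/* Contraction explicitly disallowed for this function. */
float mul_add_contract_off(float a, float b, float c) {
    #pragma STDC FP_CONTRACT OFF
    return a * b + c;
}
```

Compiling this with -O3 -mfma -S and comparing the two functions shows whether the pragma is honoured. With the versions I tested, I would expect Clang to emit the fma only in the first function, while GCC generates the same code for both and follows -ffp-contract alone (it may also warn that it is ignoring the pragma).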
I looked into MSVC and ICC.
MSVC uses the fma instruction with /O2 /arch:AVX2 /fp:fast. With MSVC /fp:precise is the default.
ICC uses fma with -O3 -march=core-avx2 (actually -O1 is sufficient). This is because by default ICC uses -fp-model fast. But ICC uses fma even with -fp-model precise. To disable fma with ICC use -fp-model strict or -no-fma.
So by default GCC and ICC use fma when fma is enabled (with -mfma for GCC/Clang or -march=core-avx2 with ICC) but Clang and MSVC do not.
 
     
     
    