I'm seeing a big difference in performance between code compiled in MSVC (on Windows) and GCC (on Linux) for an Ivy Bridge system. The code does dense matrix multiplication. I'm getting 70% of the peak flops with GCC and only 50% with MSVC. I think I may have isolated the difference to how they both convert the following three intrinsics.
__m256 breg0 = _mm256_loadu_ps(&b[8*i])
_mm256_add_ps(_mm256_mul_ps(arge0,breg0), tmp0)
GCC does this
vmovups ymm9, YMMWORD PTR [rax-256]
vmulps  ymm9, ymm0, ymm9
vaddps  ymm8, ymm8, ymm9
MSVC does this
vmulps   ymm1, ymm2, YMMWORD PTR [rax-256]
vaddps   ymm3, ymm1, ymm3
Could somebody please explain to me if and why these two solutions could give such a big difference in performance?
Despite MSVC using one less instruction it ties the load to the mult and maybe that makes it more dependent (maybe the load can't be done out of order)? I mean Ivy Bridge can do one AVX load, one AVX mult, and one AVX add in one clock cycle but this requires each operation to be independent.
Maybe the problem lies elsewhere? You can see the full assembly code for GCC and MSVC for the innermost loop below. You can see the C++ code for the loop here Loop unrolling to achieve maximum throughput with Ivy Bridge and Haswell
g++ -S -masm=intel matrix.cpp -O3 -mavx -fopenmp
.L4:
    vbroadcastss    ymm0, DWORD PTR [rcx+rdx*4]
    add rdx, 1
    add rax, 256
    vmovups ymm9, YMMWORD PTR [rax-256]
    vmulps  ymm9, ymm0, ymm9
    vaddps  ymm8, ymm8, ymm9
    vmovups ymm9, YMMWORD PTR [rax-224]
    vmulps  ymm9, ymm0, ymm9
    vaddps  ymm7, ymm7, ymm9
    vmovups ymm9, YMMWORD PTR [rax-192]
    vmulps  ymm9, ymm0, ymm9
    vaddps  ymm6, ymm6, ymm9
    vmovups ymm9, YMMWORD PTR [rax-160]
    vmulps  ymm9, ymm0, ymm9
    vaddps  ymm5, ymm5, ymm9
    vmovups ymm9, YMMWORD PTR [rax-128]
    vmulps  ymm9, ymm0, ymm9
    vaddps  ymm4, ymm4, ymm9
    vmovups ymm9, YMMWORD PTR [rax-96]
    vmulps  ymm9, ymm0, ymm9
    vaddps  ymm3, ymm3, ymm9
    vmovups ymm9, YMMWORD PTR [rax-64]
    vmulps  ymm9, ymm0, ymm9
    vaddps  ymm2, ymm2, ymm9
    vmovups ymm9, YMMWORD PTR [rax-32]
    cmp esi, edx
    vmulps  ymm0, ymm0, ymm9
    vaddps  ymm1, ymm1, ymm0
    jg  .L4
MSVC /FAc /O2 /openmp /arch:AVX ...
vbroadcastss ymm2, DWORD PTR [r10]    
lea  rax, QWORD PTR [rax+256]
lea  r10, QWORD PTR [r10+4] 
vmulps   ymm1, ymm2, YMMWORD PTR [rax-320]
vaddps   ymm3, ymm1, ymm3    
vmulps   ymm1, ymm2, YMMWORD PTR [rax-288]
vaddps   ymm4, ymm1, ymm4    
vmulps   ymm1, ymm2, YMMWORD PTR [rax-256]
vaddps   ymm5, ymm1, ymm5    
vmulps   ymm1, ymm2, YMMWORD PTR [rax-224]
vaddps   ymm6, ymm1, ymm6    
vmulps   ymm1, ymm2, YMMWORD PTR [rax-192]
vaddps   ymm7, ymm1, ymm7    
vmulps   ymm1, ymm2, YMMWORD PTR [rax-160]
vaddps   ymm8, ymm1, ymm8    
vmulps   ymm1, ymm2, YMMWORD PTR [rax-128]
vaddps   ymm9, ymm1, ymm9    
vmulps   ymm1, ymm2, YMMWORD PTR [rax-96]
vaddps   ymm10, ymm1, ymm10    
dec  rdx
jne  SHORT $LL3@AddDot4x4_
EDIT:
I benchmark the code by claculating the total floating point operations as 2.0*n^3 where n is the width of the square matrix and dividing by the time measured with omp_get_wtime().  I repeat the loop several times.  In the output below I repeated it 100 times.
Output from MSVC2012 on an Intel Xeon E5 1620 (Ivy Bridge) turbo for all cores is 3.7 GHz
maximum GFLOPS = 236.8 = (8-wide SIMD) * (1 AVX mult + 1 AVX add) * (4 cores) * 3.7 GHz
n   64,     0.02 ms, GFLOPs   0.001, GFLOPs/s   23.88, error 0.000e+000, efficiency/core   40.34%, efficiency  10.08%, mem 0.05 MB
n  128,     0.05 ms, GFLOPs   0.004, GFLOPs/s   84.54, error 0.000e+000, efficiency/core  142.81%, efficiency  35.70%, mem 0.19 MB
n  192,     0.17 ms, GFLOPs   0.014, GFLOPs/s   85.45, error 0.000e+000, efficiency/core  144.34%, efficiency  36.09%, mem 0.42 MB
n  256,     0.29 ms, GFLOPs   0.034, GFLOPs/s  114.48, error 0.000e+000, efficiency/core  193.37%, efficiency  48.34%, mem 0.75 MB
n  320,     0.59 ms, GFLOPs   0.066, GFLOPs/s  110.50, error 0.000e+000, efficiency/core  186.66%, efficiency  46.67%, mem 1.17 MB
n  384,     1.39 ms, GFLOPs   0.113, GFLOPs/s   81.39, error 0.000e+000, efficiency/core  137.48%, efficiency  34.37%, mem 1.69 MB
n  448,     3.27 ms, GFLOPs   0.180, GFLOPs/s   55.01, error 0.000e+000, efficiency/core   92.92%, efficiency  23.23%, mem 2.30 MB
n  512,     3.60 ms, GFLOPs   0.268, GFLOPs/s   74.63, error 0.000e+000, efficiency/core  126.07%, efficiency  31.52%, mem 3.00 MB
n  576,     3.93 ms, GFLOPs   0.382, GFLOPs/s   97.24, error 0.000e+000, efficiency/core  164.26%, efficiency  41.07%, mem 3.80 MB
n  640,     5.21 ms, GFLOPs   0.524, GFLOPs/s  100.60, error 0.000e+000, efficiency/core  169.93%, efficiency  42.48%, mem 4.69 MB
n  704,     6.73 ms, GFLOPs   0.698, GFLOPs/s  103.63, error 0.000e+000, efficiency/core  175.04%, efficiency  43.76%, mem 5.67 MB
n  768,     8.55 ms, GFLOPs   0.906, GFLOPs/s  105.95, error 0.000e+000, efficiency/core  178.98%, efficiency  44.74%, mem 6.75 MB
n  832,    10.89 ms, GFLOPs   1.152, GFLOPs/s  105.76, error 0.000e+000, efficiency/core  178.65%, efficiency  44.66%, mem 7.92 MB
n  896,    13.26 ms, GFLOPs   1.439, GFLOPs/s  108.48, error 0.000e+000, efficiency/core  183.25%, efficiency  45.81%, mem 9.19 MB
n  960,    16.36 ms, GFLOPs   1.769, GFLOPs/s  108.16, error 0.000e+000, efficiency/core  182.70%, efficiency  45.67%, mem 10.55 MB
n 1024,    17.74 ms, GFLOPs   2.147, GFLOPs/s  121.05, error 0.000e+000, efficiency/core  204.47%, efficiency  51.12%, mem 12.00 MB
 
     
     
    