I'm confused by an auto-vectorization result. The following code, addtest.c,
#include <stdio.h>
#include <stdlib.h>
#define ELEMS 1024
int
main()
{
  float data1[ELEMS], data2[ELEMS];
  for (int i = 0; i < ELEMS; i++) {
    data1[i] = drand48();
    data2[i] = drand48();
  }
  for (int i = 0; i < ELEMS; i++)
    data1[i] += data2[i];
  printf("%g\n", data1[ELEMS-1]); 
  return 0;
}
is compiled with gcc 11.1.0 using
gcc-11 -O3 -march=haswell -masm=intel -save-temps -o addtest addtest.c
and the add-to loop is auto-vectorized as
.L3:
    vmovaps ymm1, YMMWORD PTR [r12]
    vaddps  ymm0, ymm1, YMMWORD PTR [rax]
    add r12, 32
    add rax, 32
    vmovaps YMMWORD PTR -32[r12], ymm0
    cmp r12, r13
    jne .L3
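This is clear: load 8 floats from data1, add the matching 8 floats of data2 straight from memory, and store the result back to data1. In between, both pointers are advanced by 32 bytes, which is why the store uses the -32 offset.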
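To double-check my reading of that loop, here is a rough scalar C rendering of it. This is only a sketch: add_to_ptr and LANES are names made up for illustration (they do not appear in addtest.c), the inner loop stands in for a single 256-bit operation, and the element count is assumed to be a multiple of 8.

```c
#include <stddef.h>

#define LANES 8  /* one 256-bit ymm register holds 8 floats */

/* Hypothetical helper, not part of addtest.c: scalar rendering of the
   pointer-increment loop at .L3. n is assumed to be a multiple of LANES. */
static void add_to_ptr(float *dst, const float *src, size_t n)
{
    float *end = dst + n;              /* plays the role of r13 */
    while (dst != end) {               /* cmp r12, r13 / jne .L3 */
        for (int k = 0; k < LANES; k++)
            dst[k] += src[k];          /* vmovaps load, vaddps, vmovaps store */
        dst += LANES;                  /* add r12, 32 (bytes) */
        src += LANES;                  /* add rax, 32 (bytes) */
    }
}
```

Calling add_to_ptr(data1, data2, ELEMS) would then correspond to the add-to loop in main. The only ordering difference is that the real code bumps the pointers before the store and compensates with the -32 displacement.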
If I paste the same code into https://godbolt.org, select x86-64 gcc 11.1, and use the options -O3 -march=haswell, I get the following assembly code:
.L3:
        vmovaps ymm1, YMMWORD PTR [rbp-4112+rax]
        vaddps  ymm0, ymm1, YMMWORD PTR [rbp-8208+rax]
        vmovaps YMMWORD PTR [rbp-8240], ymm1
        vmovaps YMMWORD PTR [rbp-8208+rax], ymm0
        add     rax, 32
        cmp     rax, 4096
        jne     .L3
One surprising thing is the different address handling, but the thing that confuses me completely is the additional store to [rbp-8240]. This location is never used again, as far as I can see.
If I select gcc 7.5 on godbolt, the superfluous store disappears; from gcc 8.1 upwards, it is generated.
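For comparison, this is how I read the addressing in the godbolt loop, again as a rough scalar C sketch. One counter (rax in the assembly) serves as the offset into both arrays, which sit at fixed distances from rbp. The names add_to_idx and LANES are made up for illustration, the element count is assumed to be a multiple of 8, and the extra store to [rbp-8240] is deliberately left out because its purpose is exactly what I don't understand.

```c
#include <stddef.h>

#define LANES 8  /* one 256-bit ymm register holds 8 floats */

/* Hypothetical helper, not part of addtest.c: scalar rendering of the
   single-index loop from godbolt. n is assumed to be a multiple of LANES.
   The unexplained store to [rbp-8240] is intentionally not modeled. */
static void add_to_idx(float *data1, const float *data2, size_t n)
{
    /* i counts elements; rax in the assembly counts bytes (i * 4),
       so "add rax, 32 / cmp rax, 4096" becomes i += 8 up to 1024. */
    for (size_t i = 0; i != n; i += LANES)
        for (int k = 0; k < LANES; k++)
            data1[i + k] += data2[i + k];  /* both arrays share the index */
}
```

Both renderings compute the same result; they differ only in whether two pointers or one shared offset is advanced per iteration.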
So my questions are:
- Why is there a difference between my compiler and godbolt (different address handling, superfluous store)?
- What does the superfluous store do?
Thanks a lot for your help!