What's the problem
I am benchmarking the following code: for (T& x : v) x = x + x; where T is int.
When compiling with -mavx2, performance fluctuates by a factor of roughly 1.6 (105 ns vs. 64.5 ns per iteration) depending on some conditions.
This does not reproduce when targeting SSE4.2.
I would like to understand what's happening.
How does the benchmark work
I am using Google Benchmark. It keeps rerunning the loop until it is confident in the measured time.
The main benchmarking code:
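#include <benchmark/benchmark.h>
#include <cstddef>
#include <vector>

// INLINE / NOINLINE: inlining-control macros, definitions not shown.
using T = int;
constexpr std::size_t size = 10'000 / sizeof(T);

NOINLINE std::vector<T> const& data()
{
    static std::vector<T> res(size, T{2});
    return res;
}

INLINE void double_elements_bench(benchmark::State& state)
{
    auto v = data();  // copies the static vector into a fresh heap allocation
    for (auto _ : state) {
        for (T& x : v) x = x + x;
        benchmark::DoNotOptimize(v.data());
    }
}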
Then I call double_elements_bench from multiple instances of a benchmark driver.
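A side note on the code above, as a minimal sketch: `auto v = data();` drops both the reference and the `const`, so every benchmark instance works on its own heap-allocated copy rather than on the static vector. The address of that copy is whatever the allocator returns at that moment. The helper name `copy_has_own_buffer` is made up for illustration, and `data()` here is a plain stand-in without the NOINLINE macro.

```cpp
#include <cstddef>
#include <type_traits>
#include <vector>

using T = int;
constexpr std::size_t size = 10'000 / sizeof(T);

// Stand-in for data() above: one shared static vector, returned by reference.
std::vector<T> const& data() {
    static std::vector<T> res(size, T{2});
    return res;
}

// Hypothetical check: does `auto v = data();` produce an independent buffer?
bool copy_has_own_buffer() {
    auto v = data();  // deduced as std::vector<T>, so this is a copy
    static_assert(std::is_same_v<decltype(v), std::vector<T>>);
    return v.data() != data().data();
}
```

Because the buffer is allocated separately in each instance, its alignment can change when the number of instances (and hence the allocation pattern) changes.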
Machine, Compiler, Options
- processor: Intel Core i7-9700K
- compiler: clang ~14, built from trunk.
- options: -mavx2 --std=c++20 --stdlib=libc++ -DNDEBUG -g -Werror -Wall -Wextra -Wpedantic -Wno-deprecated-copy -O3
I also tried aligning all functions to 128 bytes; it had no effect.
Results
With the benchmark duplicated 2 times I get:
------------------------------------------------------------
Benchmark                  Time             CPU   Iterations
------------------------------------------------------------
double_elements_0        105 ns          105 ns      6617708
double_elements_1        105 ns          105 ns      6664185
Versus duplicated 3 times:
------------------------------------------------------------
Benchmark                  Time             CPU   Iterations
------------------------------------------------------------
double_elements_0       64.6 ns         64.6 ns     10867663
double_elements_1       64.5 ns         64.5 ns     10855206
double_elements_2       64.5 ns         64.5 ns     10868602
This reproduces on bigger data sizes too.
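To put the two timings in perspective, here is a small back-of-the-envelope calculation. It assumes `sizeof(int) == 4` (so one pass over the data touches 10,000 bytes) and 32-byte AVX2 vectors; the function names are made up for illustration.

```cpp
// One pass over the vector: 2500 ints * 4 bytes (assumes sizeof(int) == 4).
constexpr double bytes_per_pass = 10'000.0;
// Width of one AVX2 register in bytes.
constexpr double avx2_vector_bytes = 32.0;

// Average time spent per 32-byte vector for a given time per pass.
constexpr double ns_per_vector(double ns_per_pass) {
    return ns_per_pass / (bytes_per_pass / avx2_vector_bytes);
}

// How many times slower the slow case is than the fast case.
constexpr double slowdown(double slow_ns, double fast_ns) {
    return slow_ns / fast_ns;
}

// Measured values from the tables above.
constexpr double slow_ns_per_vector = ns_per_vector(105.0);  // about 0.336 ns
constexpr double fast_ns_per_vector = ns_per_vector(64.5);   // about 0.206 ns
constexpr double slow_vs_fast = slowdown(105.0, 64.5);       // about 1.63
```

So the slow case spends roughly 0.13 ns more per 32-byte load/add/store than the fast case.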
Perf stats
I looked at the counters that I know can be relevant to code alignment:
LSD cache (which is off on my machine due to some security issue a few years back), DSB cache, and branch predictor:
LSD.UOPS,idq.dsb_uops,UOPS_ISSUED.ANY,branches,branch-misses
Slow case
------------------------------------------------------------
Benchmark                  Time             CPU   Iterations
------------------------------------------------------------
double_elements_0        105 ns          105 ns      6663885
double_elements_1        105 ns          105 ns      6632218
 Performance counter stats for './transform_alignment_issue':
                 0      LSD.UOPS                                                    
    13,830,353,682      idq.dsb_uops                                                
    16,273,127,618      UOPS_ISSUED.ANY                                             
       761,742,872      branches                                                    
            34,107      branch-misses             #    0.00% of all branches        
       1.652348280 seconds time elapsed
       1.633691000 seconds user
       0.000000000 seconds sys 
Fast case
------------------------------------------------------------
Benchmark                  Time             CPU   Iterations
------------------------------------------------------------
double_elements_0       64.5 ns         64.5 ns     10861602
double_elements_1       64.5 ns         64.5 ns     10855668
double_elements_2       64.4 ns         64.4 ns     10867987
 Performance counter stats for './transform_alignment_issue':
                 0      LSD.UOPS                                                    
    32,007,061,910      idq.dsb_uops                                                
    37,653,791,549      UOPS_ISSUED.ANY                                             
     1,761,491,679      branches                                                    
            37,165      branch-misses             #    0.00% of all branches        
       2.335982395 seconds time elapsed
       2.317019000 seconds user
       0.000000000 seconds sys
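Both look about the same to me: in each case about 85% of the issued uops come from the DSB, the LSD delivers none, and branch misses are negligible.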
UPD
I think this might be the alignment of the data returned from malloc:
0x4f2720 in the fast case (32-byte aligned) and 0x8e9310 in the slow case (only 16-byte aligned).
So, since clang's generated loop does not align the pointer first, we get unaligned 32-byte reads/writes in the slow case. I tested a transform that does align, and it does not seem to show this variation.
Is there a way to confirm it?
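One possible way to test the hypothesis, sketched under assumptions: first check the two addresses against a 32-byte boundary, then run the same loop on a buffer whose offset from a 64-byte boundary is chosen explicitly instead of being left to malloc. The names `misalignment`, `padding_ints`, `pointer_with_offset` and `double_elements` are hypothetical helpers, not part of the original benchmark.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bytes past the previous `boundary`-aligned address; 0 means aligned.
// `boundary` must be a power of two.
constexpr std::uintptr_t misalignment(std::uintptr_t addr, std::uintptr_t boundary) {
    return addr & (boundary - 1);
}

// The two addresses reported above.
static_assert(misalignment(0x4f2720, 32) == 0);   // fast case: 32-byte aligned
static_assert(misalignment(0x8e9310, 32) == 16);  // slow case: 16 bytes off

// Extra ints to reserve so that any offset below 64 bytes can be reached.
constexpr std::size_t padding_ints = 32;

// Returns a pointer into `storage` that lies exactly `offset_bytes` past a
// 64-byte boundary. `storage` must hold at least `padding_ints` more
// elements than the caller is going to touch.
inline int* pointer_with_offset(std::vector<int>& storage, std::size_t offset_bytes) {
    assert(offset_bytes < 64 && offset_bytes % sizeof(int) == 0);
    assert(storage.size() >= padding_ints);
    int* p = storage.data();
    // Step forward one int at a time until the address has the wanted offset.
    while (misalignment(reinterpret_cast<std::uintptr_t>(p), 64) != offset_bytes) {
        ++p;
    }
    return p;
}

// The loop under test, applied to `n` ints starting at `p`.
inline void double_elements(int* p, std::size_t n) {
    for (std::size_t i = 0; i != n; ++i) p[i] = p[i] + p[i];
}
```

If timing `double_elements` at offset 16 reproduces the slow number and offset 0 reproduces the fast one, that would support the alignment explanation. With Google Benchmark the offset could be supplied through `->Arg(0)` and `->Arg(16)` and read back with `state.range(0)`. Counting cache-line-split accesses may help as well, for example with the `mem_inst_retired.split_loads` and `mem_inst_retired.split_stores` events, assuming `perf list` exposes them on this machine.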
 
    