Which alignment causes this performance difference?

Question

What's the problem

I am benchmarking the following code: for (T& x : v) x = x + x; where T is int. When compiling with -mavx2, performance fluctuates by roughly 2x depending on some conditions. This does not reproduce with -msse4.2.

I would like to understand what's happening.

How does the benchmark work

I am using Google Benchmark. It keeps running the loop until it is confident in the measured time.

The main benchmarking code:

using T = int;
constexpr std::size_t size = 10'000 / sizeof(T);

NOINLINE std::vector<T> const& data()
{
    static std::vector<T> res(size, T{2});
    return res;
}

INLINE void double_elements_bench(benchmark::State& state)
{
    auto v = data();

    for (auto _ : state) {
        for (T& x : v) x = x + x;
        benchmark::DoNotOptimize(v.data());
    }
}

Then I call double_elements_bench from multiple instances of a benchmark driver.

Machine, Compiler, Options

processor: Intel Core i7-9700K
compiler: clang ~14, built from trunk.
options: -mavx2 --std=c++20 --stdlib=libc++ -DNDEBUG -g -Werror -Wall -Wextra -Wpedantic -Wno-deprecated-copy -O3

I also tried aligning all functions to 128 bytes; it had no effect.

Results

When duplicated 2 times I get:

------------------------------------------------------------
Benchmark                  Time             CPU   Iterations
------------------------------------------------------------
double_elements_0        105 ns          105 ns      6617708
double_elements_1        105 ns          105 ns      6664185

Vs duplicated 3 times:

------------------------------------------------------------
Benchmark                  Time             CPU   Iterations
------------------------------------------------------------
double_elements_0       64.6 ns         64.6 ns     10867663
double_elements_1       64.5 ns         64.5 ns     10855206
double_elements_2       64.5 ns         64.5 ns     10868602

This reproduces on bigger data sizes too.

Perf stats

I looked at counters that I know can be relevant to code alignment: the LSD (which is off on my machine due to some security issue a few years back), the DSB, and the branch predictor:

LSD.UOPS,idq.dsb_uops,UOPS_ISSUED.ANY,branches,branch-misses

Slow case

------------------------------------------------------------
Benchmark                  Time             CPU   Iterations
------------------------------------------------------------
double_elements_0        105 ns          105 ns      6663885
double_elements_1        105 ns          105 ns      6632218

 Performance counter stats for './transform_alignment_issue':

                 0      LSD.UOPS                                                    
    13,830,353,682      idq.dsb_uops                                                
    16,273,127,618      UOPS_ISSUED.ANY                                             
       761,742,872      branches                                                    
            34,107      branch-misses             #    0.00% of all branches        

       1.652348280 seconds time elapsed

       1.633691000 seconds user
       0.000000000 seconds sys

Fast case

------------------------------------------------------------
Benchmark                  Time             CPU   Iterations
------------------------------------------------------------
double_elements_0       64.5 ns         64.5 ns     10861602
double_elements_1       64.5 ns         64.5 ns     10855668
double_elements_2       64.4 ns         64.4 ns     10867987

 Performance counter stats for './transform_alignment_issue':

                 0      LSD.UOPS                                                    
    32,007,061,910      idq.dsb_uops                                                
    37,653,791,549      UOPS_ISSUED.ANY                                             
     1,761,491,679      branches                                                    
            37,165      branch-misses             #    0.00% of all branches        

       2.335982395 seconds time elapsed

       2.317019000 seconds user
       0.000000000 seconds sys

Both look about the same to me.

Code: https://github.com/DenisYaroshevskiy/small_benchmarks/blob/ade1ed42fc2113f5ad0a4313dafff5a81f9a0d20/transform_alignment_issue.cc#L1

UPD

I think this might be alignment of the data returned from malloc

0x4f2720 in the fast case (32-byte aligned) and 0x8e9310 in the slow case (only 16-byte aligned)

So, since clang does not align the accesses, we get unaligned reads/writes. I tested a transform that does align, and it does not seem to have this variation.

Is there a way to confirm it?

Data alignment doesn't normally make *that* much difference, but you're testing small arrays that fit in L1d so that's plausible. You seem to be talking about code alignment earlier; one major thing there on Skylake-family CPUs is the microcode mitigation for the JCC erratum, which disables the DSB when a branch touches the end of a 32B boundary. But your results show most of the uops coming from the DSB so that's probably not it. [How can I mitigate the impact of the Intel jcc erratum on gcc?](https://stackoverflow.com/q/61256646) — Peter Cordes, Feb 12 '22 at 17:24
`ld_blocks.no_sr` counts number of times cache-line split loads are temporarily blocked because all resources for handling the split accesses are in use. The extra latency of split loads, and also the potential replays of uops waiting for those loads results, is another factor. You could look for the front-end stalling due to the ROB or RS being full, if you want to investigate the details. — Peter Cordes, Feb 12 '22 at 17:29
`ld_blocks.no_sr`: bad case 62,082,374, good case 28. Even without knowing percentages, that seems significant. — Denis Yaroshevskiy, Feb 12 '22 at 19:06
On a 100KB I reproduce the issue: 1075ns vs 1412ns. On 1 MB I don't think I see it. — Denis Yaroshevskiy, Feb 12 '22 at 19:10

score 3 · Accepted Answer · answered Feb 12 '22 at 20:11

Yes, data misalignment could explain your 2x slowdown for small arrays that fit in L1d. You'd hope that with every other load/store being a cache-line split, it might only slow down by a factor of 1.5x, not 2, if a split load or store cost 2 accesses to L1d instead of 1.
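To make the "every other load/store" arithmetic concrete, here is a small sketch that counts cache-line splits for the two base addresses you reported. It assumes 64-byte cache lines, 32-byte (YMM) accesses, and a vectorized loop that starts at the array base without peeling to an aligned address. The function names are made up for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed sizes: 64-byte cache lines, 32-byte (YMM) accesses.
constexpr std::size_t kCacheLine = 64;
constexpr std::size_t kVectorBytes = 32;

// Count how many of the consecutive 32-byte accesses that start at `base`
// and cover `bytes` bytes cross a cache-line boundary.
// (Illustrative helper, not part of the question's code.)
constexpr std::size_t count_line_splits(std::uintptr_t base, std::size_t bytes)
{
    std::size_t splits = 0;
    for (std::size_t off = 0; off + kVectorBytes <= bytes; off += kVectorBytes) {
        const std::uintptr_t first = base + off;
        const std::uintptr_t last = first + kVectorBytes - 1;
        if (first / kCacheLine != last / kCacheLine) ++splits;
    }
    return splits;
}

// Naive cost model: a split access costs two L1d accesses, an unsplit
// access costs one. Returns the predicted slowdown factor.
constexpr double naive_slowdown(std::size_t accesses, std::size_t splits)
{
    return static_cast<double>(accesses + splits) / static_cast<double>(accesses);
}
```

With the 10'000-byte array there are 312 full vectors. `count_line_splits(0x4f2720, 10'000)` is 0, `count_line_splits(0x8e9310, 10'000)` is 156, and `naive_slowdown(312, 156)` is 1.5. For reference, the timings in the question give 105 / 64.5, about 1.63.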

But it has extra effects like replays of uops dependent on the load result that apparently account for the rest of the problem, either making out-of-order exec less able to overlap work and hide latency, or directly running into bottlenecks like "split registers".

ld_blocks.no_sr counts number of times cache-line split loads are temporarily blocked because all resources for handling the split accesses are in use.

When a load execution unit detects that the load splits across a cache line, it has to save the first part somewhere (apparently in a "split register") and then access the 2nd cache line. On Intel SnB-family CPUs like yours, this 2nd access doesn't require the RS to dispatch the load uop to the port again; the load execution unit just does it a few cycles later. (But presumably can't accept another load in the same cycle as that 2nd access.)

https://chat.stackoverflow.com/transcript/message/48426108#48426108 - uops waiting for the result of a cache-split load will get replayed.
"Are load ops deallocated from the RS when they dispatch, complete or some other time?" - but the load itself can leave the RS earlier.
"How can I accurately benchmark unaligned access speed on x86_64?" - general stuff on split load penalties.

The extra latency of split loads, and also the potential replays of uops waiting for those loads' results, is another factor, but those are also fairly direct consequences of misaligned loads. A high count for ld_blocks.no_sr tells you that the CPU actually ran out of split registers and could otherwise be doing more work, but had to stall because of the unaligned load itself, not just other effects.

You could also look for the front-end stalling due to the ROB or RS being full, if you want to investigate the details, but not being able to execute split loads will make that happen more. So probably all the back-end stalling is a consequence of the unaligned loads (and maybe stores if commit from store buffer to L1d is also a bottleneck.)

On a 100KB I reproduce the issue: 1075ns vs 1412ns. On 1 MB I don't think I see it.

Data alignment doesn't normally make that much difference for large arrays (except with 512-bit vectors). With a cache line (2x YMM vectors) arriving less frequently, the back-end has time to work through the extra overhead of unaligned loads / stores and still keep up. HW prefetch does a good enough job that it can still max out the per-core L3 bandwidth. Seeing a smaller effect for a size that fits in L2 but not L1d (like 100kiB) is expected.
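As a rough way to see which cache level each of your test sizes lands in, here is a tiny sketch. The capacities are my assumption for an i7-9700K (32 KiB L1d and 256 KiB L2 per core, 12 MiB shared L3); check your own machine, for example with lscpu. It also ignores everything else competing for cache space.

```cpp
#include <cstddef>

enum class Level { L1d, L2, L3, Dram };

// Assumed capacities for an i7-9700K; these are not taken from the question.
constexpr std::size_t kL1dBytes = 32 * 1024;
constexpr std::size_t kL2Bytes = 256 * 1024;
constexpr std::size_t kL3Bytes = 12 * 1024 * 1024;

// Smallest cache level whose capacity can hold the whole working set.
// (Illustrative helper; real behaviour also depends on what else is cached.)
constexpr Level smallest_fitting_level(std::size_t working_set_bytes)
{
    if (working_set_bytes <= kL1dBytes) return Level::L1d;
    if (working_set_bytes <= kL2Bytes) return Level::L2;
    if (working_set_bytes <= kL3Bytes) return Level::L3;
    return Level::Dram;
}
```

Under those assumptions your 10'000-byte array maps to `Level::L1d`, the 100 KB one to `Level::L2`, and the 1 MB one to `Level::L3`, which matches the pattern of the effect shrinking and then disappearing.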

Of course, most kinds of execution bottlenecks would show similar effects, even something as simple as un-optimized code that does some extra store/reloads for each vector of array data. So this alone doesn't prove that it was misalignment causing the slowdowns for small sizes that do fit in L1d, like your 10 KiB. But that's clearly the most sensible conclusion.
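If you want a more direct check, one option is to stop relying on whatever malloc returns and run the same loop on a buffer whose offset from a 64-byte boundary you choose yourself. The sketch below is only one way to do that. `ints_at_offset` and `double_elements` are hypothetical helpers, not part of your code, and the sketch assumes `sizeof(int)` divides 64.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns a pointer to `count` ints inside `storage`
// whose address is exactly `offset` bytes past a 64-byte boundary.
// `storage` owns the memory and must outlive the returned pointer.
inline int* ints_at_offset(std::vector<int>& storage,
                           std::size_t count, std::size_t offset)
{
    constexpr std::size_t line = 64;
    assert(offset < line && offset % sizeof(int) == 0);

    // Over-allocate by one cache line of ints so a suitable start exists.
    storage.assign(count + line / sizeof(int), 2);

    // Step forward one int at a time until the address has the wanted offset.
    int* p = storage.data();
    while (reinterpret_cast<std::uintptr_t>(p) % line != offset) ++p;
    return p;
}

// The loop from the question, applied to a raw range.
inline void double_elements(int* first, std::size_t count)
{
    for (std::size_t i = 0; i < count; ++i) first[i] = first[i] + first[i];
}
```

Benchmarking `double_elements` on `ints_at_offset(storage, size, 0)` against `ints_at_offset(storage, size, 16)` should then show whether the timing follows the data offset rather than the number of benchmark instances. With 32-byte vectors, offset 32 would be expected to behave like offset 0.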

Code alignment or other front-end bottlenecks seem not to be the problem; most of your uops are coming from the DSB, according to idq.dsb_uops. (A significant number aren't, but not a big percentage difference between slow vs. fast.)
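The "not a big percentage difference" part can be checked with simple arithmetic on the counters you posted. This is a rough sketch that just divides idq.dsb_uops by UOPS_ISSUED.ANY; the two events don't count exactly the same thing, so treat the ratio as an approximation.

```cpp
#include <cstdint>

// Approximate fraction of issued uops that were delivered by the DSB.
// (Illustrative helper; the name is made up.)
constexpr double dsb_fraction(std::uint64_t dsb_uops, std::uint64_t uops_issued)
{
    return static_cast<double>(dsb_uops) / static_cast<double>(uops_issued);
}

// Counter values copied from the perf output in the question.
constexpr double slow_case = dsb_fraction(13'830'353'682ULL, 16'273'127'618ULL);
constexpr double fast_case = dsb_fraction(32'007'061'910ULL, 37'653'791'549ULL);
```

Both values come out at about 0.85, so roughly 85% of uops come from the DSB in the slow and in the fast run alike.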

The JCC erratum mitigation (see [How can I mitigate the impact of the Intel jcc erratum on gcc?](https://stackoverflow.com/q/61256646)) can be important on Skylake-derived microarchitectures like yours; it's even possible that's why your idq.dsb_uops isn't closer to your uops_issued.any.