Is there a good reason why GCC would generate jump to jump just over one cheap instruction?

Question

I was benchmarking some code that counts in a loop, compiled with g++ -O2, and noticed that it has performance problems when some condition is true in 50% of the cases. I assumed that may mean the code does unnecessary jumps (since clang produces faster code, so it is not some fundamental limitation).

What I find funny in this asm output is that the code jumps over one simple add.

=> 0x42b46b <benchmark_many_ints()+1659>:       movslq (%rdx),%rax
   0x42b46e <benchmark_many_ints()+1662>:       mov    %rax,%rcx
   0x42b471 <benchmark_many_ints()+1665>:       imul   %r9,%rax
   0x42b475 <benchmark_many_ints()+1669>:       shr    $0xe,%rax
   0x42b479 <benchmark_many_ints()+1673>:       and    $0x1ff,%eax
   0x42b47e <benchmark_many_ints()+1678>:       cmp    (%r10,%rax,4),%ecx
   0x42b482 <benchmark_many_ints()+1682>:       jne    0x42b488 <benchmark_many_ints()+1688>
   0x42b484 <benchmark_many_ints()+1684>:       add    $0x1,%rbx
   0x42b488 <benchmark_many_ints()+1688>:       add    $0x4,%rdx
   0x42b48c <benchmark_many_ints()+1692>:       cmp    %rdx,%r8
   0x42b48f <benchmark_many_ints()+1695>:       jne    0x42b46b <benchmark_many_ints()+1659>

Note that my question is not how to fix my code; I am just asking whether there is a reason why a good compiler at -O2 would generate a jne instruction to jump over one cheap instruction. I ask because, from what I understand, one could "simply" take the comparison result and use it to increment the counter (rbx in my example) by 0 or 1 without any jumps.

edit: source: https://godbolt.org/z/v0Iiv4

Probably a good idea to post the C/C++ code that resulted in this assembly? — visibleman, Aug 31 '18 at 01:43

Peter Cordes · Accepted Answer · 2018-08-31T04:59:22.297

The relevant part of the source (from a Godbolt link in a comment which you should really edit into your question) is:

const auto cnt = std::count_if(lookups.begin(), lookups.end(),[](const auto& val){
    return buckets[hash_val(val)%16] == val;});

I didn't check the libstdc++ headers to see if count_if is implemented with an if() { count++; }, or if it uses a ternary to encourage branchless code. Probably a conditional. (The compiler can choose either, but a ternary is more likely to compile to a branchless cmovcc or setcc.)
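To make the two source-level shapes concrete, here is a minimal sketch. The function names are hypothetical and this is not the libstdc++ implementation of count_if; it only illustrates an if-guarded increment next to a form that adds the 0/1 comparison result unconditionally. The compiler remains free to emit branchy or branchless code for either one.

```cpp
#include <cstddef>
#include <vector>

// Branchy shape: the increment is guarded by an if, which invites a
// conditional jump over the add (like the jne / add pair in the question).
std::size_t count_equal_branchy(const std::vector<int>& v, int key) {
    std::size_t n = 0;
    for (int x : v) {
        if (x == key) {
            ++n;
        }
    }
    return n;
}

// Branchless-friendly shape: the comparison yields 0 or 1 and is always
// added, which maps naturally onto setcc + add (or cmovcc).
std::size_t count_equal_branchless(const std::vector<int>& v, int key) {
    std::size_t n = 0;
    for (int x : v) {
        n += static_cast<std::size_t>(x == key);
    }
    return n;
}
```

Both functions return the same count; comparing their asm on Godbolt with different -mtune settings is one way to see which shape a given gcc version picks.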

It looks like gcc overestimated the cost of branchless for this code with generic tuning. -mtune=skylake (implied by -march=skylake) gives us branchless code for this regardless of -O2 vs. -O3, or -fno-tree-vectorize vs. -ftree-vectorize. (On the Godbolt compiler explorer, I also put the count in a separate function that counts a vector<int>&, so we don't have to wade through the timing and cout code-gen in main.)

branchy code: gcc8.2 -O2 or -O3, and O2/3 -march=haswell or broadwell
branchless code: gcc8.2 -O2/3 -march=skylake.

That's weird. The branchless code it emits has the same cost on Broadwell vs. Skylake. I wondered if Skylake vs. Haswell was favouring branchless because of cheaper cmov. GCC's internal cost model isn't always in terms of x86 instructions when it's optimizing in the middle-end (in GIMPLE, an architecture-neutral representation). It doesn't yet know what x86 instructions would actually be used for a branchless sequence. So maybe a conditional-select operation is involved, and gcc models it as more expensive on Haswell, where cmov is 2 uops? But I tested -march=broadwell and still got branchy code. Hopefully we can rule that out, assuming gcc's cost model knows that Broadwell (not Skylake) was the first Intel P6/SnB-family uarch to have single-uop cmov, adc, and sbb (3-input integer ops).

I don't know what else about gcc's Skylake tuning makes it favour branchless code for this loop. Gather is efficient on Skylake, but gcc is auto-vectorizing (with vpgatherqd xmm) even with -march=haswell, where it doesn't look like a win because gather is expensive, and it requires 32x64 => 64-bit SIMD multiplies using 2x vpmuludq per input vector. Maybe worth it with SKL, but I doubt HSW. Also probably a missed optimization not to pack back down to dword elements to gather twice as many elements with nearly the same throughput for vpgatherdd.

I did rule out the function being less optimized because it was called main (and marked cold). It's generally recommended not to put your microbenchmarks in main: compilers at least used to optimize main differently (e.g. for code-size instead of just speed).

Clang does make it branchless even with just -O2.

When compilers have to decide between branchless and branchy, they have heuristics that guess which will be better. If they think the branch is highly predictable (e.g. probably mostly not-taken), that leans in favour of branchy.

In this case, the heuristic could have decided that out of all 2^32 possible values for an int, finding exactly the value you're looking for is rare. The == may have fooled gcc into thinking it would be predictable.

Branchy can be better sometimes, depending on the loop, because it can break a data dependency. See gcc optimization flag -O3 makes code slower than -O2 for a case where it was very predictable, and the -O3 branchless code-gen was slower.

-O3 at least used to be more aggressive at if-conversion of conditionals into branchless sequences like cmp ; lea 1(%rbx), %rcx; cmove %rcx, %rbx, or in this case more likely xor-zero / cmp/ sete / add. (Actually gcc -march=skylake uses sete / movzx, which is pretty much strictly worse.)

Without any runtime profiling / instrumentation data, these guesses can easily be wrong. Stuff like this is where Profile Guided Optimization shines. Compile with -fprofile-generate, run it, then compile with -fprofile-use, and you'll probably get branchless code.
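As a rough sketch of that workflow, the block below lists the three build steps as comments and gives a hypothetical workload whose hit rate is controllable, so the profile run sees an unpredictable compare. The file name count_pgo.cpp, the function names, and the seed parameter are all assumptions made for illustration, and a driver main that calls these functions is left out. Whether gcc actually switches to branchless code depends on the version and tuning.

```cpp
// Assumed file name: count_pgo.cpp (a driver main is not shown).
// Build steps with GCC:
//   g++ -O2 -fprofile-generate count_pgo.cpp -o count_pgo   (instrumented build)
//   ./count_pgo                                             (run on representative input)
//   g++ -O2 -fprofile-use count_pgo.cpp -o count_pgo        (rebuild using the profile)
#include <cstddef>
#include <random>
#include <vector>

// Build n lookups where roughly hit_percent% equal `key`, in random order,
// so the compare in count_hits is hard to predict near 50%.
// Assumes key < INT_MAX, because misses are stored as key + 1.
std::vector<int> make_lookups(std::size_t n, int hit_percent, int key, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<int> pct(0, 99);
    std::vector<int> out;
    out.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        out.push_back(pct(gen) < hit_percent ? key : key + 1);
    }
    return out;
}

// The loop whose branch behaviour the profile run records.
std::size_t count_hits(const std::vector<int>& lookups, int key) {
    std::size_t n = 0;
    for (int x : lookups) {
        if (x == key) {
            ++n;
        }
    }
    return n;
}
```

Profiling with a hit rate near 50% gives the compiler evidence that the branch is unpredictable; profiling with a regular or very skewed pattern may instead lead it to keep the branch.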

BTW, -O3 is generally recommended these days; see "Is optimisation level -O3 dangerous in g++?". It does not enable -funroll-loops by default, so it only bloats code when it auto-vectorizes (especially with very large fully-unrolled scalar prologue/epilogue around a tiny SIMD loop that bottlenecks on loop overhead. /facepalm.)

edited Aug 31 '18 at 04:59

answered Aug 31 '18 at 01:47

Peter Cordes


I tried -O3 now, same output, just different offsets, will see now with -fno-tree-vectorize – NoSenseEtAl Aug 31 '18 at 01:58
btw forgot to mention I am using -mavx, maybe it assumes a certain arch that is different from my AMD... – NoSenseEtAl Aug 31 '18 at 01:59
I tried with -fno-tree-vectorize also, it did not change the asm at all, perf is not changed noticeably... – NoSenseEtAl Aug 31 '18 at 02:03
@NoSenseEtAl: `-mavx` has zero impact on tuning decisions (choices between two sequences which don't require AVX). That's why you should always use `-march=haswell` or `-march=znver1` instead of just `-mavx2`, for example, especially if you do plan to run on a specific CPU. – Peter Cordes Aug 31 '18 at 02:05
@NoSenseEtAl: Odd that `-O3` didn't help; I thought it would. I'll update this answer once you have a [mcve] that compiles this way. What gcc version are you using? Also, try using profile-guided optimization; maybe gcc incorrectly guesses at compile time that this branch will predict well. – Peter Cordes Aug 31 '18 at 02:07
@NoSenseEtAl: Looks like a missed-optimization, but interestingly `-march=skylake` gives us branchless code. – Peter Cordes Aug 31 '18 at 04:59
Moving to GCC 8.1 and using -mtune=skylake worked (I do not have a Skylake CPU, so -march=skylake = dead exe)... Upvoted, but I will still leave this question open, since it is quite bizarre that GCC would have this big a codegen bug without some good reason... :/ But for now I do not see any good reason. – NoSenseEtAl Aug 31 '18 at 12:52
@NoSenseEtAl: Did you try using profile-guided optimization with another tuning? e.g. `-march=native` That solved the problem in the other case, in the linked question where `-O3` was slower than `-O2`. – Peter Cordes Aug 31 '18 at 13:37
it is buggy on windows: [New Thread 4108.0x5f4] [New Thread 4108.0x5dc] [New Thread 4108.0xfa4] [New Thread 4108.0xb88] warning: Can not parse XML library list; XML support was disabled at compile time Thread 1 received signal SIGSEGV, Segmentation fault. 0x0000000000550fc7 in std::random_device::_M_getval_pretr1() () (gdb) exit – NoSenseEtAl Aug 31 '18 at 14:27
@NoSenseEtAl: yeah, I just tried it, too. It's faster with PGO on my machine (gcc7.3 `-O3` with the default tune=generic), because it lays out the branch better. The branch is *very* predictable, though, at least on Skylake. `perf stat` says the program as a whole has 0.16% branch mispredict rate for a branchy version. It's not surprising that PGO kept the branch, because it predicts very well. `-march=skylake` (branchless) is slightly faster, though, so it turns out to have been the wrong choice I guess. (~722M clock cycles branchless, ~790M PGO, 814M plain -O3.) – Peter Cordes Aug 31 '18 at 14:49
for me it is quite bizarre, large program that I originally used gets diff perf based on mtune skylake/broadwell, but min example i posted does not... :/ – NoSenseEtAl Aug 31 '18 at 14:51
also regarding branches: i made mistake in my godbolt link: found/not found alternate regularly, I should have used rand() when creating lookups... – NoSenseEtAl Aug 31 '18 at 14:53
btw when I try the min example now with randomized lookups skylake is 3x faster than broadwell, so sorry about giving you wrong test data... – NoSenseEtAl Aug 31 '18 at 14:56
@NoSenseEtAl: Can you post a godbolt link for that version? I'll try it with PGO. – Peter Cordes Aug 31 '18 at 14:57
@NoSenseEtAl: That is weird, gcc 7.3 isn't making branchless code even with `-O3 -mtune=skylake`, not even with PGO. I'm seeing a 3.4% branch-mispredict rate. (I realized my earlier testing with `-O3 -march=skylake` was auto-vectorizing; that's probably why it was just slightly faster! Indeed, with the randomized version it's also faster, because SKL has fast gathers.) – Peter Cordes Aug 31 '18 at 15:07
I am using GCC 8.1 since per our earlier discussions I concluded that 7.3 I was using before would not generate branchless code even with skylake mtune... – NoSenseEtAl Aug 31 '18 at 15:10
@NoSenseEtAl: But clang's scalar branchless code is almost exactly as fast as gcc's auto-vectorized code. I'm still getting 1.8% branch mispredicts, so there's some unpredictable branching somewhere other than in the comparison? – Peter Cordes Aug 31 '18 at 15:10
@NoSenseEtAl: g++7.3 on Godbolt does make branchless code with `-mtune=skylake` for the source I linked in my answer. Maybe a different config or something? Or maybe a slightly different revision of g++7.3 than I have on my desktop. Oh, or maybe a different `` header. – Peter Cordes Aug 31 '18 at 15:12
I am on windows, so I did not profile the code... but like I said in my original question I noticed the slowdown when condition is true only 50% of the cases, otherwise(10%, 90%) it works fast. And in this small program IDK where else would be time be spent between timing start and timing end... – NoSenseEtAl Aug 31 '18 at 15:20
just double checked: if I do rand()%12 instead of rand()%2 in lookup generation difference drops from almost 3x to 1.4x... So it is the branching that causes the diff between skylake and broadwell mtune on GCC 8.1 -O2 – NoSenseEtAl Aug 31 '18 at 15:23
also I just did PGO, with broadwell mtune, it did not help... :/ – NoSenseEtAl Aug 31 '18 at 15:29
anyway, I think we can agree it is kind of a codegen bug most likely.... but like I said it is weird that GCC would have bug of this severity... But then again maybe I was just unlucky and it does not affect 99.9999% of the compiles. – NoSenseEtAl Aug 31 '18 at 15:34
