Understanding branch prediction efficiency

Question

I tried to measure branch prediction cost, I created a little program.

It creates a little buffer on stack, fills with random 0/1. I can set the size of the buffer with N. The code repeatedly causes branches for the same 1<<N random numbers.

Now, I've expected, that if 1<<N is sufficiently large (like >100), then the branch predictor will not be effective (as it has to predict >100 random numbers). However, these are the results (on a 5820k machine), as N grows, the program becomes slower:

N   time
=========
8   2.2
9   2.2
10  2.2
11  2.2
12  2.3
13  4.6
14  9.5
15  11.6
16  12.7
20  12.9

For reference, if buffer is initialized with zeros (use the commented init), time is more-or-less constant, it varies between 1.5-1.7 for N 8..16.

My question is: can branch predictor effective for predicting such a large amount of random numbers? If not, then what's going on here?

(Some more explanation: the code executes 2^32 branches, no matter of N. So I expected, that the code runs the same speed, no matter of N, because the branch cannot be predicted at all. But it seems that if buffer size is less than 4096 (N<=12), something makes the code fast. Can branch prediction be effective for 4096 random numbers?)

Here's the code:

#include <cstdint>
#include <iostream>

volatile uint64_t init[2] = { 314159165, 27182818 };
// volatile uint64_t init[2] = { 0, 0 };
volatile uint64_t one = 1;

uint64_t next(uint64_t s[2]) {
    uint64_t s1 = s[0];
    uint64_t s0 = s[1];
    uint64_t result = s0 + s1;
    s[0] = s0;
    s1 ^= s1 << 23;
    s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5);
    return result;
}

int main() {
    uint64_t s[2];
    s[0] = init[0];
    s[1] = init[1];

    uint64_t sum = 0;

#if 1
    const int N = 16;

    unsigned char buffer[1<<N];
    for (int i=0; i<1<<N; i++) buffer[i] = next(s)&1;

    for (uint64_t i=0; i<uint64_t(1)<<(32-N); i++) {
        for (int j=0; j<1<<N; j++) {
            if (buffer[j]) {
                sum += one;
            }
        }
    }
#else
    for (uint64_t i=0; i<uint64_t(1)<<32; i++) {
        if (next(s)&1) {
            sum += one;
        }
    }

#endif
    std::cout<<sum<<"\n";
}

(The code contains a non-buffered version as well, use #if 0. It runs around the same speed as the buffered version with N=16)

Here's the inner loop disassembly (compiled with clang. It generates the same code for all N between 8..16, only the loop count differs. Clang unrolled the loop twice):

  401270:       80 3c 0c 00             cmp    BYTE PTR [rsp+rcx*1],0x0
  401274:       74 07                   je     40127d <main+0xad>
  401276:       48 03 35 e3 2d 00 00    add    rsi,QWORD PTR [rip+0x2de3]        # 404060 <one>
  40127d:       80 7c 0c 01 00          cmp    BYTE PTR [rsp+rcx*1+0x1],0x0
  401282:       74 07                   je     40128b <main+0xbb>
  401284:       48 03 35 d5 2d 00 00    add    rsi,QWORD PTR [rip+0x2dd5]        # 404060 <one>
  40128b:       48 83 c1 02             add    rcx,0x2
  40128f:       48 81 f9 00 00 01 00    cmp    rcx,0x10000
  401296:       75 d8                   jne    401270 <main+0xa0>

Yep, this is not surprising. The TAGE prediction technique is designed to specifically handle branches that may require maintaining thousands of bits of history. — Hadi Brais, Mar 28 '19 at 12:35
I've run your code on Haswell and reproduced your results. Also the TMA method shows that Bad Speculation is less than 5% of all issue slots when N<=10 and increases to 46.1% when N=16. — Hadi Brais, Mar 28 '19 at 12:40
@HadiBrais: thanks for the info. It seems that branch prediction is much more advanced and complicated than I thought. — geza, Mar 28 '19 at 12:42
Modern TAGE predictors are pretty fantastic. A while ago I was curious just how bad a code-size-optimized BubbleSort was for performance, so I benchmarked it sorting the same input repeatedly. (Re-copying with movdqa). My Skylake "learned" all the branches in the BubbleSort and had nearly no mispredicts for array sizes of 15 elements or so, maybe more I forget. Use `perf stat` to get HW counts of branches and branch-misses. — Peter Cordes, Mar 29 '19 at 05:21
In general; the first time code is executed the branch prediction rate is "less good" because there's no history; and there's no point executing code twice if nothing changed (you can store the result/s from last time) so the "excessively happy case" where CPU has complete branch history almost never happens in practice. Benchmarks that measure the "excessively happy case" only provide misinformation. — Brendan, Mar 31 '19 at 13:56
@Brendan: Yes. But this question is about that predicting 4096 random outcomes really is an "excessively happy case"? For me it seemed very unlikely (that's why I didn't bothered to check out `perf stat`. If I had checked out, this question wouldn't exist). But as it turned out, it is really the case. Current CPUs branch predictor is that good that it can memorize 4096 outcomes. That was a surprise for me. 20 years ago branch predictors was "strongly/weakly" * "taken/not taken". Now it can do much-much more. — geza, Mar 31 '19 at 14:06
@geza: For this code specifically, the results go from "pure irrelevant fantasy" (small values of N, outer loop causes "same code with same data" in the inner loop to happen many times) to "more realistic in practice" (larger values of N, outer loop causes "same code with same data" to happen less often). The results are as you'd expect - 50% mispredictions when the results are realistic (with 0% mispredictions when you look at the irrelevant fantasy). — Brendan, Mar 31 '19 at 14:19
@Brendan: it is never "pure irrelevant fantasy". Just to mention a counterexample: interpreters. It is very common that they follow the same path a lot of times. And a response to your first comment: "and there's no point executing code twice if nothing changed (you can store the result/s from last time)". That's wrong. Note, here the branch pattern is the same only. The data can differ (but follow the same path). Just like, when an interpreter runs a byte code. But, anyways, this question was about understanding the results of a benchmark, not about whether it's realistic or not. — geza, Apr 01 '19 at 09:48

geza · Accepted Answer · 2019-03-31T11:51:18.320

2

Branch prediction can be such effective. As Peter Cordes suggests, I've checked branch-misses with perf stat. Here are the results:

N   time          cycles  branch-misses (%)      approx-time
===============================================================
8    2.2   9,084,889,375         34,806 ( 0.00)    2.2
9    2.2   9,212,112,830         39,725 ( 0.00)    2.2
10   2.2   9,264,903,090      2,394,253 ( 0.06)    2.2
11   2.2   9,415,103,000      8,102,360 ( 0.19)    2.2
12   2.3   9,876,827,586     27,169,271 ( 0.63)    2.3
13   4.6  19,572,398,825    486,814,972 (11.33)    4.6
14   9.5  39,813,380,461  1,473,662,853 (34.31)    9.5
15  11.6  49,079,798,916  1,915,930,302 (44.61)   11.7
16  12.7  53,216,900,532  2,113,177,105 (49.20)   12.7
20  12.9  54,317,444,104  2,149,928,923 (50.06)   12.9

Note: branch-misses (%) is calculated for 2^32 branches

As you can see, when N<=12, branch predictor can predict most of the branches (which is surprising: the branch predictor can memorize the outcome of 4096 consecutive random branches!). When N>12, branch-misses starts to grow. At N>=16, it can only predict ~50% correctly, which means it is as effective as random coin flips.

The time taken can be approximated by looking at the time and branch-misses (%) column: I've added the last column, approx-time. I've calculated it by this: 2.2+(12.9-2.2)*branch-misses %/100. As you can see, approx-time equals to time (not considering rounding error). So this effect can be explained perfectly by branch prediction.

The original intent was to calculate how many cycles a branch-miss costs (in this particular case - as for other cases this number can differ):

(54,317,444,104-9,084,889,375)/(2,149,928,923-34,806) = 21.039 = ~21 cycles.

edited Mar 31 '19 at 11:51

answered Mar 31 '19 at 10:43

geza

28,403
6
61
135

1

Branch misprediction penalty cannot be characterized by a single number because it depends on how much time it takes to restreer the frontend and how much pending work there still is in-flight in the RS before the mispredicted jump at the time the misprediction is detected. A penalty of 21 cycles looks a bit too high to me, and probably indicates that there are frontend issues. In addition, your analysis did not consider the cost of the potential misprediction of the last iteration of the inner loop. – Hadi Brais Mar 31 '19 at 11:38
1

@HadiBrais: Thanks for your comment. Yes, the cost of branch-miss depends on a lot of things. I was interested in an approximate value. For example, how it relates to a cost of floating point division. Which is faster: using a hardly-predicted-branch, or a fp-divison. Yep, I didn't consider the mispredictions of the last iteration, because it doesn't influence the result too much (less than 1% for N=8 case). I've edited my answer a little bit to say that the calculated cost is for this particular case only. – geza Mar 31 '19 at 11:49
Well, the latency of division also varies significantly depending on the input operands. The cost of misprediction is defined as the increase in execution time compared to the case where the misprediction did not occur. So if you want to measure the cost of misprediction in this particular case, a better way to do it is , by following definition, to compare the execution time against a loop nest with the same number of inner and outer iterations but the condition `if (buffer[j])` is always true (easily predicted)... – Hadi Brais Mar 31 '19 at 13:05
...This allows to estimate the cost of a single inner iteration where `if (buffer[j])` is correctly predicted. Multiply this by the number of correct predictions of `if (buffer[j])` and subtract the result from the total execution time. What remains is the sum of the cost of all mispredictions. Finally, divide this quantity by the number of times the branch `if (buffer[j])` was mispredicted. The result is the average cost of mispredicting `if (buffer[j])`. – Hadi Brais Mar 31 '19 at 13:05
@HadiBrais: "the latency of division also varies significantly depending on the input operands". Hmm, what do you mean by this? `float` vs `double`, or something else? I've calculated cost the way you say, I got ~22 cycles (22.074). – geza Mar 31 '19 at 13:22
I mean it depends on the input values. See Agner's [instruction tables](https://www.agner.org/optimize/instruction_tables.pdf). – Hadi Brais Mar 31 '19 at 13:29
@HadiBrais: Ah, thanks (but it is only true for older CPUs, right? For example, on Haswell, `divss` is constant 11 cycles - or is there other part of the doc which talks about this?). Btw, measuring/comparing with `if (buffer[j])` is always true may not be the best method, because it influences the `sum += one;` part. It should occur 50% of the time, not always. Perhaps that's the cause that the always false version runs in 1.6 sec, while the random (but still predicted perfectly) version runs in 2.2 sec. – geza Mar 31 '19 at 13:37
Yep. One thing you can do is count the number of taken and not taken `if (buffer[j])` where the values are randomized using perf events. Then you can design the easily predicted case by initializing `buffer` so that the same number of taken branches are placed sequentially and then the same number of not taken branches are placed sequentially. In this way, there will be an extra misprediction in between, but I don't think it would have a significant impact on the estimation. – Hadi Brais Mar 31 '19 at 13:44
@HadiBrais: I've supposed in my measurement that `buffer[j]` is true 50%. Btw., I practically measure the way you say. I just don't bother to "cluster" true/false, because when N is small, prediction is (almost) perfect anyways, it shouldn't cause any difference (when I calculated ~21 cycles, this number is the difference between the `N=8` (almost perfect prediction) and `N=20` case). – geza Mar 31 '19 at 13:49
@HadiBrais and geza: See [Characterizing the Branch Misprediction Penalty](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.81.8373&rep=rep1&type=pdf), a nice paper that models the real effect on a superscalar Out-of-Order pipeline, of branch misses, data cache misses, and I-cache misses. And how long it takes the CPU to return to a steady-state of lots of instructions in flight hiding execution latencies. Ramping back up to that state costs time, too, which is why true mispredict cost is context-sensitive, on top of how many cycles before starting execution on the correct path. – Peter Cordes Apr 01 '19 at 00:24
@HadiBrais and \@geza: I updated https://stackoverflow.com/tags/branch-prediction/info with lots of links. That edit could probably use another pair of eyes to see if it's comprehensible and in a useful order. – Peter Cordes Apr 01 '19 at 01:58
@PeterCordes Nice update. You can also add return address prediction ([Performance of “conditional call” on amd64](https://stackoverflow.com/questions/54868559/performance-of-conditional-call-on-amd64) & [Henry's article](http://blog.stuffedcow.net/2018/04/ras-microbenchmarks/)) and [The inner workings of Spectre (v2)](https://stackoverflow.com/questions/54541157/the-inner-workings-of-spectre-v2). – Hadi Brais Apr 01 '19 at 04:24
@HadiBrais: I forgot you don't have enough rep to make tag-wiki edits yourself without going through review. Anyway, good suggestions, added those, and a wikipedia Spectre link. – Peter Cordes Apr 01 '19 at 04:51
@PeterCordes: thanks, I didn't know about that info page, it is very useful. Now I have some material to read before sleep :) (edit: now I see that you've added a lot of material there recently, thanks!) – geza Apr 01 '19 at 09:16
Yeah, before it was just a copy/paste from Wikipedia. SO tag wikis are good places to collect useful links, like I've done with the x86 tag wiki: https://stackoverflow.com/tags/x86/info. I didn't start this idea, the C++ tag did it first. – Peter Cordes Apr 01 '19 at 09:40

Understanding branch prediction efficiency

1 Answers1