I tried to measure branch prediction cost, I created a little program.
It creates a little buffer on stack, fills with random 0/1. I can set the size of the buffer with N. The code repeatedly causes branches for the same 1<<N random numbers.
Now, I've expected, that if 1<<N is sufficiently large (like >100), then the branch predictor will not be effective (as it has to predict >100 random numbers). However, these are the results (on a 5820k machine), as N grows, the program becomes slower:
N time
=========
8 2.2
9 2.2
10 2.2
11 2.2
12 2.3
13 4.6
14 9.5
15 11.6
16 12.7
20 12.9
For reference, if buffer is initialized with zeros (use the commented init), time is more-or-less constant, it varies between 1.5-1.7 for N 8..16.
My question is: can branch predictor effective for predicting such a large amount of random numbers? If not, then what's going on here?
(Some more explanation: the code executes 2^32 branches, no matter of N. So I expected, that the code runs the same speed, no matter of N, because the branch cannot be predicted at all. But it seems that if buffer size is less than 4096 (N<=12), something makes the code fast. Can branch prediction be effective for 4096 random numbers?)
Here's the code:
#include <cstdint>
#include <iostream>
volatile uint64_t init[2] = { 314159165, 27182818 };
// volatile uint64_t init[2] = { 0, 0 };
volatile uint64_t one = 1;
uint64_t next(uint64_t s[2]) {
uint64_t s1 = s[0];
uint64_t s0 = s[1];
uint64_t result = s0 + s1;
s[0] = s0;
s1 ^= s1 << 23;
s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5);
return result;
}
int main() {
uint64_t s[2];
s[0] = init[0];
s[1] = init[1];
uint64_t sum = 0;
#if 1
const int N = 16;
unsigned char buffer[1<<N];
for (int i=0; i<1<<N; i++) buffer[i] = next(s)&1;
for (uint64_t i=0; i<uint64_t(1)<<(32-N); i++) {
for (int j=0; j<1<<N; j++) {
if (buffer[j]) {
sum += one;
}
}
}
#else
for (uint64_t i=0; i<uint64_t(1)<<32; i++) {
if (next(s)&1) {
sum += one;
}
}
#endif
std::cout<<sum<<"\n";
}
(The code contains a non-buffered version as well, use #if 0. It runs around the same speed as the buffered version with N=16)
Here's the inner loop disassembly (compiled with clang. It generates the same code for all N between 8..16, only the loop count differs. Clang unrolled the loop twice):
401270: 80 3c 0c 00 cmp BYTE PTR [rsp+rcx*1],0x0
401274: 74 07 je 40127d <main+0xad>
401276: 48 03 35 e3 2d 00 00 add rsi,QWORD PTR [rip+0x2de3] # 404060 <one>
40127d: 80 7c 0c 01 00 cmp BYTE PTR [rsp+rcx*1+0x1],0x0
401282: 74 07 je 40128b <main+0xbb>
401284: 48 03 35 d5 2d 00 00 add rsi,QWORD PTR [rip+0x2dd5] # 404060 <one>
40128b: 48 83 c1 02 add rcx,0x2
40128f: 48 81 f9 00 00 01 00 cmp rcx,0x10000
401296: 75 d8 jne 401270 <main+0xa0>