The problem:
I'm trying to figure out how to write a code (C preffered, ASM only if there is no other solution) that would make the branch prediction miss in 50% of the cases.
So it has to be a piece of code that "is imune" to compiler optimizations related to branching and also all the HW branch prediction should not go better than 50% (tossing a coin). Even a greater challenge is being able to run the code on multiple CPU architectures and get the same 50% miss ratio.
I managed to write a code that goes to 47% branch miss ratio on an x86 platform. I'm suspecting the missing could 3% come from:
- Program launch overhead that has branching in it (very small though)
- Profiler overhead - Basically for each counter read an interrupt is raised so this might add additional predictable branches.
- System calls running in the background that contain loops and predictable branching
I written my own random number generator to avoid calls to a rand whose implementation might have hidden predictable branches. It can use also rdrand when available. Latency does not matter for me.
The questions:
- Can I do better than my version of code? Better means getting a higher branch misspredict and same results for all CPU architectures.
- Can this code be predicated? What would that mean?
The code:
#include <stdio.h>
#include <time.h>
#define RDRAND
#define LCG_A   1103515245
#define LCG_C   22345
#define LCG_M   2147483648
#define ULL64   unsigned long long
ULL64 generated;
ULL64 rand_lcg(ULL64 seed)
{
#ifdef RDRAND
    ULL64 result = 0;
    asm volatile ("rdrand %0;" : "=r" (result));
    return result;
#else
    return (LCG_A * seed + LCG_C) % LCG_M;
#endif
}
ULL64 rand_rec1()
{
    generated = rand_lcg(generated) % 1024;
    if (generated < 512)
        return generated;
    else return rand_rec1();
}
ULL64 rand_rec2()
{
    generated = rand_lcg(generated) % 1024;
    if (!(generated >= 512))
        return generated;
    else return rand_rec2();
}
#define BROP(num, sum)                  \
    num = rand_lcg(generated);          \
    asm volatile("": : :"memory");      \
    if (num % 2)                        \
        sum += rand_rec1();             \
    else                                \
        sum -= rand_rec2();
#define BROP5(num, sum)     BROP(num, sum) BROP(num, sum) BROP(num, sum) BROP(num, sum) BROP(num, sum)
#define BROP25(num, sum)    BROP5(num, sum) BROP5(num, sum) BROP5(num, sum) BROP5(num, sum) BROP5(num, sum)
#define BROP100(num, sum)   BROP25(num, sum) BROP25(num, sum) BROP25(num, sum) BROP25(num, sum)
int main()
{
    int i = 0;
    int iterations = 500000;    
    ULL64 num = 0;
    ULL64 sum = 0;
    generated = rand_lcg(0) % 54321;
    for (i = 0; i < iterations; i++)
    {
        BROP100(num, sum);
        // ... repeat the line above 10 times
    }
    printf("Sum = %llu\n", sum);
}
Update v1:
Following the suggestion of usr, I generated various patterns by varying the LCG_C parameter from the command line in a script. I was able to go to 49.67% BP miss. That is enough for my purpose and I have the methodology to produce this on various architectures.
 
     
     
     
     
    