I have been running a few experiments on x86, namely measuring the effect of mfence on store/load latencies.
Here is what I have started with:
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#define ARRAY_SIZE 10
#define DUMMY_LOOP_CNT 1000000
int main()
{
    char array[ARRAY_SIZE];
    for (int i =0; i< ARRAY_SIZE; i++)
        array[i] = 'x'; //This is to force the OS to allocate the array
    asm volatile ("mfence\n");
    for (int i=0;i<DUMMY_LOOP_CNT;i++); //A dummy loop just to warm up the processor
    struct result_tuple{
        uint64_t tsp_start;
        uint64_t tsp_end;
        int offset;
        };
    struct result_tuple* results = calloc(ARRAY_SIZE , sizeof (struct result_tuple));
    if (results == NULL) return 1; //Bail out if the allocation failed
    for (int i = 0; i< ARRAY_SIZE; i++)
    {
        uint64_t *tsp_start,*tsp_end;
        tsp_start = &results[i].tsp_start;
        tsp_end = &results[i].tsp_end;
        results[i].offset = i;
        
        asm volatile (
        "mfence\n"
        "rdtscp\n"
        "mov %%rdx,%[arg]\n"
        "shl $32,%[arg]\n"
        "or %%rax,%[arg]\n"
        :[arg]"=&r"(*tsp_start)
        ::"rax","rdx","rcx","memory"
        );
        array[i] = 'y'; //A simple store
        asm volatile (
        "mfence\n"
        "rdtscp\n"
        "mov %%rdx,%[arg]\n"
        "shl $32,%[arg]\n"
        "or %%rax,%[arg]\n"
        :[arg]"=&r"(*tsp_end)
        ::"rax","rdx","rcx","memory"
        );
    }
    
    printf("Offset\tLatency\n");
    for (int i=0;i<ARRAY_SIZE;i++)
    {
        printf("%d\t%llu\n",results[i].offset,(unsigned long long)(results[i].tsp_end - results[i].tsp_start)); //Cast so the format matches uint64_t portably
    }
    free (results);
}   
I compile quite simply with gcc microbenchmark.c -o microbenchmark
My system configuration is as follows:
CPU : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
Operating system : GNU/Linux (Linux 5.4.80-2)
My issue is this:
- In a single run, all the latencies are similar
- When repeating the experiment over and over, I don't get results similar to the previous run!
For instance:
In run 1 I get:
Offset  Latency
1   275
2   262
3   262
4   262
5   275
...
252 275
253 275
254 262
255 262
In another run I get:
Offset  Latency
1   75
2   75
3   75
4   72
5   72
...
251 72
252 72
253 75
254 75
255 72
This is pretty surprising: the run-to-run variation is large, whereas the variation within a single run is negligible. I am not sure how to explain this. What is wrong with my microbenchmark?
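To compare runs by something more compact than the raw per-offset table, the latencies of one run could be reduced to a minimum, median and maximum. The following is only a minimal sketch: `cmp_u64`, `struct latency_summary` and `summarize_latencies` are hypothetical names that are not part of the benchmark above, and it assumes the differences `tsp_end - tsp_start` have first been copied into a plain `uint64_t` array.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: orders two uint64_t values for qsort. */
static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a;
    uint64_t y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Hypothetical summary type, not part of the original benchmark. */
struct latency_summary {
    uint64_t min;
    uint64_t median;
    uint64_t max;
};

/* Sorts the n latencies in place and returns their min, median and max.
   n must be at least 1. For even n the lower of the two middle
   elements is reported as the median. */
static struct latency_summary summarize_latencies(uint64_t *lat, size_t n)
{
    struct latency_summary s;
    qsort(lat, n, sizeof lat[0], cmp_u64);
    s.min = lat[0];
    s.median = lat[(n - 1) / 2];
    s.max = lat[n - 1];
    return s;
}
```

If the minimum and the median are close within a run but differ a lot between runs, that matches the pattern described above: little within-run variation, large run-to-run variation.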
Note: I do understand that a normal store is a write-allocate store, which technically makes my measurement that of a load rather than a store. Also, mfence should flush the store buffer, thereby ensuring that no stores are 'delayed'.
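For reference, the mfence + rdtscp bracketing around the store can also be written with compiler intrinsics instead of inline asm. This is a sketch under assumptions, not a drop-in replacement: it assumes GCC or Clang on x86-64 with `<x86intrin.h>` available, `fenced_rdtscp` and `time_one_store` are hypothetical helper names, and the compiler is not guaranteed to emit exactly the same instruction sequence as the asm blocks in the listing.

```c
#include <stdint.h>
#include <x86intrin.h>   /* GCC/Clang: _mm_mfence, __rdtscp */

/* Hypothetical helper: issue mfence so earlier loads and stores are
   ordered before the timestamp read, then read the TSC with rdtscp. */
static inline uint64_t fenced_rdtscp(void)
{
    unsigned int aux;    /* receives IA32_TSC_AUX; not used here */
    _mm_mfence();
    return __rdtscp(&aux);
}

/* Hypothetical helper: times a single byte store, mirroring the
   mfence/rdtscp pair placed before and after the store in the listing.
   The volatile pointer keeps the compiler from removing the store. */
static uint64_t time_one_store(volatile char *p, char value)
{
    uint64_t start = fenced_rdtscp();
    *p = value;
    uint64_t end = fenced_rdtscp();
    return end - start;
}
```

With these helpers, the body of the measurement loop would reduce to a call such as `time_one_store(&array[i], 'y')`, with the returned cycle count stored per offset.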
 
    