I want to time a function call with rdtsc. So I measured it in two ways as follows.
- Call it in a loop. Aggregate each rdtsc difference within the loop and divide by number of calls. (Let's say this is N)
- Call it in a loop. Get the rdtsc difference of the loop itself and divide by N.
But I see couple of inconsistent behaviors.
- When I increase N the times get reduced rather monotonically in both method 1 and 2. For method 2 it is understandable in that it would amortize the loop control overhead. But I am not sure how it is so for method 1.
- Actually for method 2 each time when I increase the N the value I get for N=1 seems to be just divided by the new N each time. Inspecting gdb disassembly made me realize it is some compiler optimization at -O2 where loop is skipped at second case. So I retried with -O0, where the gdb disassembly show the actual loop being there for the second case as well.
Code is given below.
    #include <stdio.h>
    #include <inttypes.h>
    #include <stdlib.h>
    typedef unsigned long long ticks;
    static __inline__ ticks getticks(void) {
      unsigned a, d; 
      asm volatile("rdtsc" : "=a" (a), "=d" (d)); 
      return ((ticks)a) | (((ticks)d) << 32); 
    }
    __attribute__ ((noinline))
    void bar() {
    }
    int main(int argc, char** argv) {
       long long N = 1000000; 
       N = atoi(argv[1]);
       int i;
       long long bar_total = 0;
       ticks start = 0, end = 0;
       for (i = 0; i < N; i++) {
         start = getticks();
         bar();
         end = getticks();
         bar_total += (end - start);
       } 
       fprintf(stdout, "Total invocations : %lld\n", N);
       fprintf(stdout, "[regular] bar overhead : %lf\n", ((double)bar_total/  N));
      start = getticks();
      for (i = 0; i < N; i++) {
        bar();
      } 
      end = getticks();
      bar_total = (end - start);
      fprintf(stdout, "[Loop] bar overhead : %lf\n", ((double)bar_total/ N));
      return 0;
     }
Any idea what's going on here? I can put the gdb disassembly if needed as well. I used the rdtsc implementation from http://dasher.wustl.edu/tinker/distribution/fftw/kernel/cycle.h
Edit: I am going to have to retract my second statement that at -O0 the time gets dropped directly proportional to N in the second case. I guess it's some mistake I made during the build causing some older version to persist. Any how it still goes down somewhat along with figure for method 1. Here are some numbers for different N values.
taskset -c 2 ./example.exe 1
Total invocations : 1
[regular] bar overhead : 108.000000
[Loop] bar overhead : 138.000000
taskset -c 2 ./example.exe 10
Total invocations : 10
[regular] bar overhead : 52.900000
[Loop] bar overhead : 40.700000
taskset -c 2 ./example.exe 100
Total invocations : 100
[regular] bar overhead : 46.780000
[Loop] bar overhead : 15.570000
taskset -c 2 ./example.exe 1000
Total invocations : 1000
[regular] bar overhead : 46.069000
[Loop] bar overhead : 13.669000
taskset -c 2 ./example.exe 100000
Total invocations : 10000
[regular] bar overhead : 46.010100
[Loop] bar overhead : 13.444900
taskset -c 2 ./example.exe 100000000
Total invocations : 100000000
[regular] bar overhead : 26.970272
[Loop] bar overhead : 5.201252
taskset -c 2 ./example.exe 1000000000
Total invocations : 1000000000
[regular] bar overhead : 18.853279
[Loop] bar overhead : 5.218234
taskset -c 2 ./example.exe 10000000000
Total invocations : 1410065408
[regular] bar overhead : 18.540719
[Loop] bar overhead : 5.216395
I see two new behaviors now.
- Method 1 converges slower than the method 2. But still I am puzzling over why there is such a drastic difference in values for different N settings. Perhaps I am doing some basic mistake here which I don't see at the moment.
- Method 1 value is actually larger than method 2 by some margin. I expected it be on par or slightly smaller than the method 2 value since it doesn't contain loop control overhead.
Questions
So in summary my questions are
- Why are the values given by both methods change so drastically when increasing the N? Specially for method 1 which doesn't account for loop control overhead. 
- Why is second method result is less than the first method's when first method excludes the loop control overhead in the calculations? 
Edit 2
Regarding the suggested rdtscp solution.
Being uninitiated about the inline assembly I did the following.
static __inline__ ticks getstart(void) {
  unsigned cycles_high = 0, cycles_low = 0; 
  asm volatile ("CPUID\n\t"
             "RDTSC\n\t"
             "mov %%edx, %0\n\t"
             "mov %%eax, %1\n\t": "=r" (cycles_high), "=r" (cycles_low)::
             "%rax", "%rbx", "%rcx", "%rdx");
  return ((ticks)cycles_high) | (((ticks)cycles_low) << 32); 
}
static __inline__ ticks getend(void) {
  unsigned cycles_high = 0, cycles_low = 0; 
  asm volatile("RDTSCP\n\t"
         "mov %%edx, %0\n\t"
          "mov %%eax, %1\n\t"
           "CPUID\n\t": "=r" (cycles_high), "=r" (cycles_low)::
           "%rax", "%rbx", "%rcx", "%rdx");
  return ((ticks)cycles_high) | (((ticks)cycles_low) << 32); 
}
and used above methods before and after the function call. But now I get non sensical results like follows.
Total invocations : 1000000
[regular] bar overhead : 304743228324.708374
[Loop] bar overhead : 33145641307.734016
What's the catch? I wanted to factor out those as inlined methods since I see use of it in multiple places.
A. Solution in the comments.
 
     
    