I'm running the following benchmark:
int main(int argc, char **argv)
{
 char *d = malloc(sizeof(char) * 13);
 TIME_THIS(func_a(999, d), 99999999);
 TIME_THIS(func_b(999, d), 99999999);
 return 0;
}
With a normal compilation (no optimization flags), the results are essentially the same for both functions:
% gcc func_overhead.c func_overhead_plus.c -o func_overhead && ./func_overhead
[func_a(999, d)                     ]      9276227.73
[func_b(999, d)                     ]      9265085.90
But with -O3 they are very different:
% gcc -O3 func_overhead.c func_overhead_plus.c -o func_overhead && ./func_overhead
[func_a(999, d)                     ]    178580674.69
[func_b(999, d)                     ]     48450175.29
func_a and func_b are defined like this:
char *func_a(uint64_t id, char *d)
{
 register size_t i, j;
 register char c;
 for (i = 0, j = 36; i <= 11; i++)
  if (i == 4 || i == 8)
   d[i] = '/';
  else {
   c = ((id >> j) & 0xf) + '0';
   if (c > '9') 
    c = c - '9' - 1 + 'A';
   d[i] = c;
   j -= 4;
  }
 d[12] = '\0';
 return d;
}
The only difference is that func_a is in the same file as main(), while func_b is in the func_overhead_plus.c file.
I'm wondering if anyone could elaborate on what's going on.
Thanks
Edit:
Sorry about the confusion regarding the results. They are actually calls per second (higher is better), so func_a() is faster than func_b() with -O3.
TIME_THIS is defined like so:
double get_time(void)
{
    struct timeval t;
    gettimeofday(&t, NULL);
    return t.tv_sec + t.tv_usec*1e-6;
}
#define TIME_THIS(func, runs) do {                  \
        double t0, td;                              \
        int i;                                      \
        t0 = get_time();                            \
        for (i = 0; i < runs; i++)                  \
            func;                                   \
        td = get_time() - t0;                       \
        printf("[%-35s] %15.2f\n", #func, runs / td);   \
} while(0)
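Since the macro above discards the value of the timed expression, an optimizer that can see the function body is free to drop or hoist some of the work. A minimal sketch of a variant that keeps each call observable, assuming GCC-style extended asm is available (the names get_time_sketch, KEEP_RESULT and TIME_THIS_KEEP are made up for illustration and are not part of the original benchmark):

```c
#include <stdio.h>
#include <sys/time.h>

/* Same wall-clock helper as get_time(), renamed to avoid a clash. */
static double get_time_sketch(void)
{
    struct timeval t;
    gettimeofday(&t, NULL);
    return t.tv_sec + t.tv_usec * 1e-6;
}

/* GCC/Clang extension (an assumption about the toolchain): an empty asm
 * statement that takes p as an input and declares a memory clobber, so the
 * compiler must compute p and complete the stores behind it every time. */
#define KEEP_RESULT(p) __asm__ volatile("" : : "r"(p) : "memory")

/* Hypothetical variant of TIME_THIS: same report format, but every call's
 * result is passed through KEEP_RESULT instead of being thrown away. */
#define TIME_THIS_KEEP(func, runs) do {                         \
        double t0_, td_;                                        \
        long i_;                                                \
        t0_ = get_time_sketch();                                \
        for (i_ = 0; i_ < (runs); i_++)                         \
            KEEP_RESULT(func);                                  \
        td_ = get_time_sketch() - t0_;                          \
        printf("[%-35s] %15.2f\n", #func, (runs) / td_);        \
} while (0)
```

It would be used the same way as the original, e.g. TIME_THIS_KEEP(func_a(999, d), 99999999). This does not stop inlining; it only keeps the compiler from treating the timed call as dead code.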
The platform is 32-bit Linux (i686):
Linux komiko 2.6.30-gentoo-r2 #1 SMP PREEMPT Wed Jul 15 17:27:51 IDT 2009 i686 Intel(R) Core(TM)2 Quad CPU Q8200 @ 2.33GHz GenuineIntel GNU/Linux
The compiler is gcc 4.3.3.
As suggested, here are the results of interleaving the calls:
-O3
[func_b(999, d)                     ]     48926120.09
[func_a(999, d)                     ]    135299870.52
[func_b(999, d)                     ]     49075900.30
[func_a(999, d)                     ]    135748939.12
[func_b(999, d)                     ]     49039535.67
[func_a(999, d)                     ]    134055084.58
-O2
[func_b(999, d)                     ]     27243732.97
[func_a(999, d)                     ]     27341371.38
[func_b(999, d)                     ]     27303284.93
[func_a(999, d)                     ]     27349177.65
[func_b(999, d)                     ]     27325398.25
[func_a(999, d)                     ]     27343935.88
(-O1 and -Os gave the same results as -O2 in this test)
no optimizations
[func_b(999, d)                     ]      8852314.88
[func_a(999, d)                     ]      9646166.81
[func_b(999, d)                     ]      8909973.33
[func_a(999, d)                     ]      9734883.99
[func_b(999, d)                     ]      8726127.49
[func_a(999, d)                     ]      9566052.21
It looks like the unoptimized build behaves like -O3 in that func_a is faster than func_b, although the gap is much smaller.
Just for fun, I also compiled with gcc 4.4.4, and the results are interesting:
no optimizations
[func_b(999, d)                     ]     16982343.03
[func_a(999, d)                     ]     19693688.36
[func_b(999, d)                     ]     17260359.40
[func_a(999, d)                     ]     18137352.08
[func_b(999, d)                     ]     16790465.45
[func_a(999, d)                     ]     19828836.94
-O3
[func_b(999, d)                     ]     52184739.72
[func_a(999, d)                     ] 99999237556468.61
[func_b(999, d)                     ]     52430823.56
[func_a(999, d)                     ]    101030101.92
[func_b(999, d)                     ]     52404446.52
[func_a(999, d)                     ]    100842538.40
This is pretty weird, isn't it?
Edit:
If the performance difference is indeed due to gcc 4.3/4.4 being unable to inline across object files, should it be considered good practice to put performance-critical functions in the same file as their callers?
E.g.
#include "performance_critical.c"
Or is it just messy and most likely not really significant?
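For comparison, here is a minimal sketch of what I understand to be the more conventional alternative to including a .c file: defining the function as static inline in a header. The header name and the function name func_inl are hypothetical; the body is the same formatting logic as func_a, with a signed shift counter instead of size_t.

```c
/* performance_critical.h: hypothetical header name, for illustration only */
#ifndef PERFORMANCE_CRITICAL_H
#define PERFORMANCE_CRITICAL_H

#include <stdint.h>

/* "static inline" gives every translation unit that includes this header
 * its own copy of the body, so the compiler can inline it at each call
 * site without needing to see another object file. */
static inline char *func_inl(uint64_t id, char *d)
{
    int i, shift = 36;

    for (i = 0; i <= 11; i++) {
        if (i == 4 || i == 8) {
            d[i] = '/';                 /* separators at fixed positions */
        } else {
            /* emit the next 4-bit group as an uppercase hex digit */
            char c = (char)(((id >> shift) & 0xf) + '0');
            if (c > '9')
                c = (char)(c - '9' - 1 + 'A');
            d[i] = c;
            shift -= 4;
        }
    }
    d[12] = '\0';
    return d;
}

#endif /* PERFORMANCE_CRITICAL_H */
```

Each .c file that needs the function would then #include "performance_critical.h" instead of including the .c file.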
Thanks