gcc -O3 question, calling same function from different file yields different performance

Question

I'm running the following benchmark:

int main(int argc, char **argv)
{
 char *d = malloc(sizeof(char) * 13);

 TIME_THIS(func_a(999, d), 99999999);
 TIME_THIS(func_b(999, d), 99999999);

 return 0;
}

with normal compilation, the results are the same for both functions

% gcc func_overhead.c func_overhead_plus.c -o func_overhead && ./func_overhead                                                                               
[func_a(999, d)                     ]      9276227.73
[func_b(999, d)                     ]      9265085.90

but with -O3 they are very different

% gcc -O3 func_overhead.c func_overhead_plus.c -o func_overhead && ./func_overhead                                                                
[func_a(999, d)                     ]    178580674.69
[func_b(999, d)                     ]     48450175.29

func_a and func_b are defined like this:

char *func_a(uint64_t id, char *d)
{
 register size_t i, j;
 register char c;

 for (i = 0, j = 36; i <= 11; i++)
  if (i == 4 || i == 8)
   d[i] = '/';
  else {
   c = ((id >> j) & 0xf) + '0';

   if (c > '9') 
    c = c - '9' - 1 + 'A';

   d[i] = c;

   j -= 4;
  }

 d[12] = '\0';

 return d;
}

the only difference is that func_a in the same file as main() and func_b is in the func_overhead_plus.c file

I'm wondering if anyone could elaborate on what's going on

Thanks

Edit:

Sorry about all the confusion regarding the results. they are actually calls per second, so func_a() is faster than func_b() with -O3

TIME_THIS is defined like so:

double get_time(void)
{
    struct timeval t;
    gettimeofday(&t, NULL);
    return t.tv_sec + t.tv_usec*1e-6;
}

#define TIME_THIS(func, runs) do {                  \
        double t0, td;                              \
        int i;                                      \
        t0 = get_time();                            \
        for (i = 0; i < runs; i++)                  \
            func;                                   \
        td = get_time() - t0;                       \
        printf("[%-35s] %15.2f\n", #func, runs / td);   \
} while(0)

The architecture is Linux

Linux komiko 2.6.30-gentoo-r2 #1 SMP PREEMPT Wed Jul 15 17:27:51 IDT 2009 i686 Intel(R) Core(TM)2 Quad CPU Q8200 @ 2.33GHz GenuineIntel GNU/Linux

gcc is 4.3.3

as suggested, here are the results of mixing the calls a little

-O3

[func_b(999, d)                     ]     48926120.09
[func_a(999, d)                     ]    135299870.52
[func_b(999, d)                     ]     49075900.30
[func_a(999, d)                     ]    135748939.12
[func_b(999, d)                     ]     49039535.67
[func_a(999, d)                     ]    134055084.58

-O2

[func_b(999, d)                     ]     27243732.97
[func_a(999, d)                     ]     27341371.38
[func_b(999, d)                     ]     27303284.93
[func_a(999, d)                     ]     27349177.65
[func_b(999, d)                     ]     27325398.25
[func_a(999, d)                     ]     27343935.88

(-O1 and -Os were same as -O2 in this test)

no optimizations

[func_b(999, d)                     ]      8852314.88
[func_a(999, d)                     ]      9646166.81
[func_b(999, d)                     ]      8909973.33
[func_a(999, d)                     ]      9734883.99
[func_b(999, d)                     ]      8726127.49
[func_a(999, d)                     ]      9566052.21

looks like no optimizations behaves like -O3 in the way that func_a seems to be faster than func_b

just for fun, compiling with gcc 4.4.4 seems to be interesting

no optimizations

[func_b(999, d)                     ]     16982343.03
[func_a(999, d)                     ]     19693688.36
[func_b(999, d)                     ]     17260359.40
[func_a(999, d)                     ]     18137352.08
[func_b(999, d)                     ]     16790465.45
[func_a(999, d)                     ]     19828836.94

-O3

[func_b(999, d)                     ]     52184739.72
[func_a(999, d)                     ] 99999237556468.61
[func_b(999, d)                     ]     52430823.56
[func_a(999, d)                     ]    101030101.92
[func_b(999, d)                     ]     52404446.52
[func_a(999, d)                     ]    100842538.40

this is pretty weird, isn't it?

Edit:

If the performance difference is indeed an inability of gcc4.3/4.4 to inline across objects, should it be considered a good practice to include performance critical function in the same file?

e.g

#include "performance_critical.c"

or is it just messy and most likely not really significant?

Thanks

Sorry to disappoint you ;.). Except for the outlier in GCC 4.4 -O3 func_a 99999237556468.61 I see no weirdness. Results like this come up all the time when tuning performance. That's why measuring is so important (and premature optimization so evil ...). The outlier has to be an artifact due to some timing or calculation error. — Peter G., Jul 07 '10 at 10:26
The 99999237556468.61 is likely because the whole test loop is eliminated. — kennytm, Jul 07 '10 at 10:52

score 6 · Accepted Answer · edited May 23 '17 at 10:32

6

Whenever you're curious about what's going on under the optimization hood, check out the -S option. This will let you examine the assembly output to see exactly what's different between the two versions.

When a compiler is working within a single file (read: translation unit), it has access to all the types, objects, etc. that exist within (after preprocessing). When another file is brought into the mix, the compiler doesn't know about the code in the first file. The linker, which puts the two object files together only sees symbol names and machine code.

In your case, the compiler is likely figuring out that how the pointers are used and realizes it can inline the function call in the first file. When you add in the second file, it MUST use pointers to communicate, so you get the added function call overhead.

Edit

torak pointed out that I interpreted this backwards. I don't know why the single-file version would perform more slowly...

edited May 23 '17 at 10:32

Community

1
1

answered Jul 06 '10 at 17:33

Cogwheel

22,781
4
49
67

I was thinking along the same lines, except with the elimination of code that computes unused reslts. However, the OP said that the slower instance (func_a) is the one that is in the same file as main, which contradicts that line of thinking. The advice to look at the assembly is still sound though. – torak Jul 06 '10 at 17:38
2

Where does the OP say func_a is slower? I see "timing" numbers increasing for -O3 and being biggest for func_a. My interpretation is that the numbers represent the amount of work done in a specific time. So I see -O3 func_a running fastest - just as it should be per Cogwheels explanation to which I too subscribe. – Peter G. Jul 06 '10 at 18:49
@Peter G. That would make much more sense... At first (well, at second, really) I thought it was a measurement of time passed. Any clarification from the OP would be welcome. – Cogwheel Jul 06 '10 at 22:44

score 1 · Answer 2 · answered Jul 06 '10 at 22:37

It's all about code caching. The most important function here is TIME_THIS, which is missing in your description. I rewrote your test with a TIME_THIS_A and a TIME_THIS_B which are located in the corresponding c-files. Then with whatever optimization, the effect is gone, both have the same speed then. I can see the effect that func_b is faster if TIME_THIS is located in main.c. But as I said, that's all about instruction caching. Even enlarging the second c-file by some stupid method (I took main and renamed it to main_b) had some influence. If you'd enable L1/L2 cache miss logging, you could see why and where this happens. For the sake of completeness, my (nearly) complete code here:

// main.c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "foo_b.h"

void prefoo_a()
{
    static volatile int i = 1;
}

void foo_a()
{
    char *d = malloc(sizeof(char) * 13);

    for ( int i = 0; i < 10000000; i++ )
        func_a(999, d);
}

char *func_a(uint64_t id, char *d)
{
    // snipped

    return d;
}

int main(int argc, char **argv)
{   
    clock_t start;

    prefoo_a();
    start = clock();
    foo_a();
    printf ( "func_a %f\n", ( (double)clock() - start ) / CLOCKS_PER_SEC );

    prefoo_b();
    start = clock();
    foo_b();
    printf ( "func_b %f\n", ( (double)clock() - start ) / CLOCKS_PER_SEC );

    prefoo_a();
    start = clock();
    foo_a();
    printf ( "func_a %f\n", ( (double)clock() - start ) / CLOCKS_PER_SEC );

    prefoo_b();
    start = clock();
    foo_b();
    printf ( "func_b %f\n", ( (double)clock() - start ) / CLOCKS_PER_SEC );

    return 0;
}

And foo.b

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "foo_b.h"

void prefoo_b()
{
    static volatile int i = 1;
}

void foo_b()
{
    char *d = malloc(sizeof(char) * 13);

    for ( int i = 0; i < 10000000; i++ )
        func_b(999, d);
}

char *func_b(uint64_t id, char *d)
{
            // ... snippped

    return d;
}

And a suitable header:

typedef unsigned long long uint64_t;
char *func_a(uint64_t id, char *d);
char *func_b(uint64_t id, char *d);
void prefoo_a();
void prefoo_b();
void foo_a();
void foo_b();

Results on my computer (x64 Snow Leopard) with -O3:

func_a 0.043674
func_b 0.043825
func_a 0.044268
func_b 0.043997
func_a 0.043879
func_b 0.043950

Same with -Od

func_a 0.853132
func_b 0.852719
func_a 0.872263
func_b 0.851980
func_a 0.852977
func_b 0.853294

score 1 · Answer 3 · answered Jul 07 '10 at 10:19

1

My guess — inlining.

-O3 performs aggressive inlining, but -O2 does not. It is possible to inline within the same file, not inlining across objects are not possible until gcc 4.5.

answered Jul 07 '10 at 10:19

kennytm

510,854
105
1,084
1,005

score 0 · Answer 4 · answered Jul 06 '10 at 17:29

0

-O3 is not recommended. See Optimization in GCC.

answered Jul 06 '10 at 17:29

mcandre

22,868
20
88
147

There are two things going on, one of which is the slowdown, which this link potentially explains. – Nathan S. Jul 06 '10 at 17:34
4

An article about GCC from 2005 is hardly relevant anymore. Cache sizes have grown, the compiler has grown in capability. The only good benchmark is benchmarking. – Yann Ramin Jul 06 '10 at 17:39
not currently relevant does not mean, to me,that the article can't contain some interesting idea about the issue. – ShinTakezou Jul 06 '10 at 18:03
@theatrus: I'm not sure L1 cache sizes have grown that much since 2005, though... – ninjalj Jul 06 '10 at 18:10
2

Of course, it would be best to just test with -O2, -Os and -O3. In certain circumstances, -Os can be the better performer. – ninjalj Jul 06 '10 at 18:13
I agree with theatrus and ninjalj that in any case you should test the specific optimization options and simply choose the best performer. – Peter G. Jul 06 '10 at 19:19
My experience is that -Os is always better than -O2 or -O3... often by a factor of 1.5-2. – R.. GitHub STOP HELPING ICE Jul 07 '10 at 19:27
It is still generally the case that -O3 loses to -O2 and/or -Os for almost all code. With GCC 4.5, though, they've gotten more serious about paying attention to the negative speed effects of code size expansion, so -O2 now beats -Os again for a lot of codes. – zwol Jul 22 '10 at 00:54

score 0 · Answer 5 · answered Jul 06 '10 at 18:36

You didn't tell us much about the platform, where you run this example, but in any case such a dramatic slowdown when you are using optimizing flags looks very weird. You might have a measurement problem, but here also we can only speculate, since you didn't show us your TIME_THIS macro.

In any case such measurements as you present them can not be very conclusive, since you may have all kinds of side-effects if you run each of your functions only once and always in the same order.

To be conclusive, you should run the functions a few times without measuring and then in a loop in random order, accumulating the times and averaging at the end.

And yes, as others stated, first look into the assembler to see if there is some weirdness.

score 0 · Answer 6 · answered Jul 22 '10 at 00:42

0

I think that moving the whole function in the header file with

static inline

will solve the problem. Compilers usually cannot optimize across modules.

answered Jul 22 '10 at 00:42

ruslik

14,714
1
39
40

gcc -O3 question, calling same function from different file yields different performance

6 Answers6