So I have been having a look at some of the magic that is -O3 in GCC (well, actually I'm compiling with Clang, but GCC produces much the same result; the two optimisers, GCC's own and LLVM's, are separate codebases, yet they pull off the same tricks).
Consider this C program:
int foo(int n) {
    if (n == 0) return 1;
    return n * foo(n - 1);
}

int main() {
    return foo(10);
}
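For reference, I'm generating the assembly with an invocation along these lines (the exact target triple is from memory and may need adjusting for your toolchain):

clang -O3 -S -target thumbv7-apple-darwin factorial.c -o factorial.s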
The first thing that pretty much WOW-ed me (and which was also marvelled at in this answer: https://stackoverflow.com/a/414774/1068248) was how int foo(int) (a basic factorial function) compiles down to a tight loop. This is the ARM (Thumb) assembly for it (comments are mine):
        .globl  _foo
        .align  2
        .code   16
        .thumb_func     _foo
_foo:
        mov     r1, r0          @ r1 = n
        movs    r0, #1          @ r0 = result = 1
        cbz     r1, LBB0_2      @ if (n == 0) skip the loop
LBB0_1:
        muls    r0, r1, r0      @ result *= n
        subs    r1, #1          @ n -= 1 (sets the flags)
        bne     LBB0_1          @ keep looping while n != 0
LBB0_2:
        bx      lr              @ return result in r0
Blimey, I thought. That's pretty interesting: a completely tight loop to compute the factorial. It's not tail call optimisation since, well, the recursive call isn't in tail position, but the compiler appears to have performed a very similar optimisation.
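My guess (this is just my own reconstruction in C, not anything the compiler actually outputs) is that the optimiser first introduces an accumulator so that the recursion becomes a genuine tail call, and then turns that tail call into a loop, so that foo effectively becomes:

int foo(int n) {
    int acc = 1;            /* accumulator holding the partial product */
    while (n != 0) {        /* the cbz/bne pair in the assembly above */
        acc *= n;           /* muls r0, r1, r0 */
        n -= 1;             /* subs r1, #1 */
    }
    return acc;             /* result comes back in r0 */
}

which matches the muls/subs/bne loop above instruction for instruction.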
Now look at main (again, comments are mine):
        .globl  _main
        .align  2
        .code   16
        .thumb_func     _main
_main:
        movw    r0, #24320      @ low 16 bits of the result
        movt    r0, #55         @ high 16 bits: 55 * 65536 + 24320 = 3628800
        bx      lr              @ return 3628800
That just blew my mind, to be honest. It bypasses foo entirely and simply returns 3628800, which is 10!.
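I assume this only works because the argument is a compile-time constant, so once foo has been inlined the whole computation can be constant-folded. As a quick sanity check (my own experiment, details may vary between compiler versions), making the argument depend on runtime input:

int main(int argc, char **argv) {
    (void)argv;             /* unused */
    return foo(argc);       /* value not known until run time */
}

brings the multiply-and-decrement loop back, either inlined into main or as an actual call to foo.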
This really makes me realise how often the compiler can do a better job of optimising your code than you can. But it raises the question: how does it manage to do such a good job? So, can anyone explain (possibly by linking to the relevant compiler code) how the following optimisations work?
1. The initial optimisation of foo into a tight loop.
2. The optimisation where main just returns the result directly rather than actually executing foo.
As another interesting side effect of this question, it would be nice to see some more of the impressive optimisations that GCC/Clang can do.
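For example, one that impressed me before (as far as I can tell, both GCC and Clang do this at -O3) is replacing a naive summation loop with its closed form. Something like:

int sum(int n) {
    int s = 0;
    for (int i = 1; i <= n; i++)    /* naive 1 + 2 + ... + n */
        s += i;
    return s;
}

compiles down to straight-line code that computes n * (n + 1) / 2 rather than looping.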