I was just playing around with recursive functions in C++ and Fortran and I realised that a simple recursive function in Fortran is almost twice as fast as its equivalent C++ function. Now, before getting into this, I know there are similar questions here, specifically:
- Why does adding assembly comments cause such radical change in generated code?
- Working of asm volatile (“” : : : “memory”)
- Equivalent to asm volatile in gfortran
However, I am a bit more specific and puzzled as the Fortran compiler seems to be doing what you can achieve with asm volatile in gcc. To give you some context let's consider the following recursive Fibonacci number implementation:
Fortran Code:
module test
implicit none
private
public fib
contains
! Fibonacci function
integer recursive function fib(n) result(r)
    integer, intent(in) :: n
    if (n < 2) then
        r = n
    else
        r = fib(n-1) + fib(n-2)
    end if
end function  ! end of Fibonacci function
end module
program fibonacci
use test, only: fib
implicit none
integer :: r,i 
integer :: n = 1e09
real(8) :: start, finish, cum_time
cum_time=0
do i= 1,n 
    call cpu_time(start)
    r = fib(20)
    call cpu_time(finish) 
    cum_time = cum_time + (finish - start)
    if (cum_time >0.5) exit
enddo  
print*,i,'runs, average elapsed time is', cum_time/i/1e-06, 'us' 
end program
Compiled with:
gfortran -O3 -march=native
C++ Code:
#include <iostream>
#include <chrono>
using namespace std;
// Fib function
int fib(const int n)
{
    int r;
    if (n < 2)
        r = n;
    else
        r = fib(n-1) + fib(n-2);
    return r;
} // end of fib
template<typename T, typename ... Args>
double timeit(T (*func)(Args...), Args...args)
{
    double counter = 1.0;
    double mean_time = 0.0;
    for (auto iter=0; iter<1e09; ++iter){
        std::chrono::time_point<std::chrono::system_clock> start, end;
        start = std::chrono::system_clock::now();
        func(args...);
        end = std::chrono::system_clock::now();
        std::chrono::duration<double> elapsed_seconds = end-start;
        mean_time += elapsed_seconds.count();
        counter++;
        if (mean_time > 0.5){
            mean_time /= counter;
            std::cout << static_cast<long int>(counter)
            << " runs, average elapsed time is "
            << mean_time/1.0e-06 << " \xC2\xB5s" << std::endl; 
            break;
        }
    }
    return mean_time;
}
int main(){
    timeit(fib,20);
    return 0;
}
Compiled with:
g++ -O3 -march=native
Timing:
Fortran: 24991 runs, average elapsed time is 20.087 us
C++    : 12355 runs, average elapsed time is 40.471 µs
So gfortran is twice as fast there compared to gcc. Looking at the assembly code, I get
Assembly (Fortran):
.L28:
    cmpl    $1, %r13d
    jle .L29
    leal    -8(%rbx), %eax
    movl    %ecx, 12(%rsp)
    movl    %eax, 48(%rsp)
    leaq    48(%rsp), %rdi
    leal    -9(%rbx), %eax
    movl    %eax, 16(%rsp)
    call    __bench_MOD_fib
    leaq    16(%rsp), %rdi
    movl    %eax, %r13d
    call    __bench_MOD_fib
    movl    12(%rsp), %ecx
    addl    %eax, %r13d
Assembly (C++):
.L28:
    movl    72(%rsp), %edx
    cmpl    $1, %edx
    movl    %edx, %eax
    jle .L33
    subl    $3, %eax
    movl    $0, 52(%rsp)
    movl    %eax, %esi
    movl    %eax, 96(%rsp)
    movl    92(%rsp), %eax
    shrl    %eax
    movl    %eax, 128(%rsp)
    addl    %eax, %eax
    subl    %eax, %esi
    movl    %edx, %eax
    subl    $1, %eax
    movl    %esi, 124(%rsp)
    movl    %eax, 76(%rsp)
Both assembly codes are made up of almost similar blocks/labels repeated over and over again. As you can see the Fortran assembly makes two calls to fib function whereas in the C++ assembly, gcc has probably unrolled  all the recursive calls which probably requires more stack push/pop and tail jumps.
Now if I just put one inline assembly comment in the C++ code like so
Modified C++ Code:
// Fib function
int fib(const int n)
{
    int r;
    if (n < 2)
        r = n;
    else
        r = fib(n-1) + fib(n-2);
    asm("");
    return r;
} // end of fib
The generated assembly code, changes to
Assembly (C++ Modified):
.L7:
    cmpl    $1, %edx
    jle .L17
    leal    -4(%rbx), %r13d
    leal    -5(%rbx), %edx
    cmpl    $1, %r13d
    jle .L19
    leal    -5(%rbx), %r14d
    cmpl    $1, %r14d
    jle .L55
    leal    -6(%rbx), %r13d
    movl    %r13d, %edi
    call    _Z3fibi
    leal    -7(%rbx), %edi
    movl    %eax, %r15d
    call    _Z3fibi
    movl    %r13d, %edi
    addl    %eax, %r15d
You can now see two calls to fib function. Timing them gives me
Timing:
Fortran: 24991 runs, average elapsed time is 20.087 us
C++    : 25757 runs, average elapsed time is 19.412 µs
I know the effect of asm with no output and asm volatile is to suppress aggressive compiler optimisations but in this case, gcc thought it was too smart but ended up generating a less efficient code in the first place. 
So the question is:
- Why can gccnot see this "optimisation", whengfortanclearly can?
- The inline assembly line has to be before the return statement. Put it elsewhere and it will have no effect. Why?
- Is this behaviour compiler specific? For instance can you mimick the same behaviour with clang/MSVC?
- Are there safer ways to make recursions faster in CorC++(without relying on inline assembly or iteration-style coding)? Variadic templates perhaps?
UPDATE:
- The result shown above are all with gcc 4.8.4. I have also tried compiling it withgcc 4.9.2andgcc 5.2and I get identical results.
- The problem can also be replicated (fixed?) if instead of putting asmI declare the input argument as volatile i.e.(volatile int n)instead of(const int n), although this will result in a tad bit slower run-time on my machine.
- As Michael Karcher has mentioned, we can instead pass the -fno-optimize-sibling-callsflag to fix this problem. Since this flag is activated at-O2level and beyond, even compiling with-O1fixes this problem.
- I have run the same example with clang 3.5.1with-O3 -march=nativeand though the situation is not identical,clangalso appears to generate faster code withasm.
Clang Timing:
clang++ w/o asm    :  8846 runs, average elapsed time is 56.4555 µs
clang++ with asm   : 10427 runs, average elapsed time is 47.8991 µs 
 
     
    