I'm implementing an MPI program in C which does SOR (successive over-relaxation) on a grid. While benchmarking it, I came across something quite unexpected: the address-of operator & appears to be very slow. I can't show the entire code here (it's too long), but the relevant parts are as follows.
double maxdiff, diff;
do {
    maxdiff = 0.0;
    /* inner loops updating maxdiff a lot */
    /* diff is used as a receive buffer here */
    MPI_Allreduce(&maxdiff, &diff, 1, MPI_DOUBLE, MPI_MAX, my_comm);
    maxdiff = diff;
} while(maxdiff > stopdiff);
Here, stopdiff is some magic value. The slow behaviour appears in the MPI_Allreduce() operation. The strange thing is that the operation is slow even when running on just a single node, where no communication is needed at all. When I comment the operation out, the runtime for a particular problem on one node drops from 290 seconds to just 225 seconds. Also, when I replace the operation with an MPI_Allreduce() call on other, bogus variables, I get 225 seconds as well. So it looks like it is specifically taking the addresses of maxdiff and diff that causes the slowdown.
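For reference, the bogus-variable test looked roughly like this (dummy_send and dummy_recv are placeholder names for two otherwise-unused doubles):

double dummy_send = 0.0, dummy_recv;
/* same call, but on two unrelated variables -- this runs in 225 seconds */
MPI_Allreduce(&dummy_send, &dummy_recv, 1, MPI_DOUBLE, MPI_MAX, my_comm);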
I updated the program to use two extra double variables as temporary send/receive buffers, as follows.
double send_buf, recv_buf;  /* temporaries, only used around the reduction */

send_buf = maxdiff;
MPI_Allreduce(&send_buf, &recv_buf, 1, MPI_DOUBLE, MPI_MAX, my_comm);
maxdiff = recv_buf;
This also made the program run in 225 seconds instead of 290. My question is, obviously, how can this be?
I do have a suspicion: the program is compiled with gcc at optimization level -O3, so I suspect the compiler is doing some optimization which makes the address-of operation very slow. For instance, perhaps the variables are kept in CPU registers because they are used so often in the loop, and because of this they have to be flushed back to memory whenever their address is requested. However, I can't find out by searching what kind of optimization might cause this, and I'd like to be sure about the cause. Does anybody have an idea what might be causing this?
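To illustrate the kind of thing I mean, here is a toy example (made up for this question, not taken from the real program; opaque_call stands in for MPI_Allreduce). Because &acc is passed to a function the compiler cannot see into, I would expect acc to be written back to memory around that call instead of living purely in a register:

/* Toy illustration of the suspicion -- not the real code. */
void opaque_call(double *p);          /* body not visible to the compiler */

double accumulate(const double *a, long n, long chunks)
{
    double acc = 0.0;
    for (long c = 0; c < chunks; c++) {
        for (long i = 0; i < n; i++)
            acc += a[c * n + i];      /* hot inner loop */
        opaque_call(&acc);            /* &acc escapes, like &maxdiff above */
    }
    return acc;
}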
Thanks in advance!
I should add some other important information here. The specific problem being run nearly fills the memory: it uses 3 GB, and the nodes have 4 GB of RAM in total. I also observe that the slowdown gets worse for larger problem sizes, as RAM fills up, so memory load seems to be a factor in the problem. Also, strangely enough, when I call MPI_Allreduce() just once after the loop instead of inside it, the slowdown is still there in the non-optimized version of the program, and it is just as bad; the program does not run any faster that way.
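For clarity, that test looked roughly like this (simplified):

do {
    maxdiff = 0.0;
    /* inner loops updating maxdiff a lot */
} while(maxdiff > stopdiff);

/* single reduction after the loop -- still just as slow */
MPI_Allreduce(&maxdiff, &diff, 1, MPI_DOUBLE, MPI_MAX, my_comm);
maxdiff = diff;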
As requested in the comments below, here is part of the gcc assembly output. Unfortunately, I don't have enough experience with assembly to work out the problem from it. This is the version with the added send and receive buffers, i.e. the version that runs in 225 seconds rather than 290.
    incl    %r13d
    cmpl    $1, %r13d
    jle     .L394
    movl    136(%rsp), %r9d
    fldl    88(%rsp)
    leaq    112(%rsp), %rsi
    leaq    104(%rsp), %rdi
    movl    $100, %r8d
    movl    $11, %ecx
    movl    $1, %edx
    fstpl   104(%rsp)
    call    MPI_Allreduce
    fldl    112(%rsp)
    incl    84(%rsp)
    fstpl   40(%rsp)
    movlpd  40(%rsp), %xmm3
    ucomisd 96(%rsp), %xmm3
    jbe     .L415
    movl    140(%rsp), %ebx
    xorl    %ebp, %ebp
    jmp     .L327
Here is what I believe is the corresponding part of the program without the extra send and receive buffers, i.e. the version that runs in 290 seconds.
    incl    %r13d
    cmpl    $1, %r13d
    jle     .L314
    movl    120(%rsp), %r9d
    leaq    96(%rsp), %rsi
    leaq    88(%rsp), %rdi
    movl    $100, %r8d
    movl    $11, %ecx
    movl    $1, %edx
    call    MPI_Allreduce
    movlpd  96(%rsp), %xmm3
    incl    76(%rsp)
    ucomisd 80(%rsp), %xmm3
    movsd   %xmm3, 88(%rsp)
    jbe     .L381
    movl    124(%rsp), %ebx
    jmp     .L204