Why are serializing instructions inherently pipeline-unfriendly?
Another answer [ Deoptimizing a program for the pipeline in Intel Sandybridge-family CPUs ] states this:
Time each iteration independently, with something even heavier than RDTSC. e.g. CPUID / RDTSC or a time function that makes a system call. Serializing instructions are inherently pipeline-unfriendly.
I think it should be the opposite: serialized instructions are very good for the pipeline. For example,
sum = 5 * sum;
sum = 5 * sum;
sum = 5 * sum;
sum = 5 * sum;
sum = 5 * sum;
sum = 5 * sum;
sum = 5 * sum;
Assembly generated by g++ main.cpp -S:
addl    %edx, %eax
movl    %eax, -4(%rbp)
movl    -4(%rbp), %edx
movl    %edx, %eax
sall    $2, %eax
addl    %edx, %eax
movl    %eax, -4(%rbp)
movl    -4(%rbp), %edx
movl    %edx, %eax
sall    $2, %eax
addl    %edx, %eax
movl    %eax, -4(%rbp)
movl    -4(%rbp), %edx
movl    %edx, %eax
sall    $2, %eax
addl    %edx, %eax
movl    %eax, -4(%rbp)
movl    -4(%rbp), %edx
movl    %edx, %eax
sall    $2, %eax
addl    %edx, %eax
movl    %eax, -4(%rbp)
movl    -4(%rbp), %edx
movl    %edx, %eax
sall    $2, %eax
addl    %edx, %eax
movl    %eax, -4(%rbp)
movl    -4(%rbp), %edx
movl    %edx, %eax
sall    $2, %eax
addl    %edx, %eax
This is much better for the pipeline than:
for( int i = 0; i < 7; i++ )
{
    sum = 5 * sum;
}
sum = sum + 5;
Assembly generated by g++ main.cpp -S:
    movl    $0, -4(%rbp)
    movl    $0, -8(%rbp)
.L3:
    cmpl    $6, -8(%rbp)
    jg  .L2
    movl    -4(%rbp), %edx
    movl    %edx, %eax
    sall    $2, %eax
    addl    %edx, %eax
    movl    %eax, -4(%rbp)
    addl    $1, -8(%rbp)
    jmp .L3
.L2:
    addl    $5, -4(%rbp)
    movl    $0, %eax
    addq    $48, %rsp
    popq    %rbp
Because on each iteration of the loop:
- It needs to evaluate if( i < 7 ).
- With branch prediction, we could assume that for the loop above the first prediction will fail.
- The instruction sum = sum + 5 will be discarded.
- The next time, the pipeline will do sum = 5 * sum,
- until the condition if( i < 7 ) fails.
- Then the sum = 5 * sum will be discarded,
- and sum = sum + 5 will finally be processed.
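To make the comparison concrete, the two variants can be put side by side as functions. This is only a minimal sketch: the names unrolled, looped and check_equivalence are made up for illustration, and the trailing + 5 is added to the unrolled version as well (the snippet above does not show it) so that both return the same value.

```cpp
#include <cassert>

// Hypothetical name: the hand-unrolled variant, seven multiplications
// where each one depends on the result of the previous one.
int unrolled(int sum) {
    sum = 5 * sum;
    sum = 5 * sum;
    sum = 5 * sum;
    sum = 5 * sum;
    sum = 5 * sum;
    sum = 5 * sum;
    sum = 5 * sum;
    // Assumption: the trailing add is included here only so that both
    // variants compute the same value.
    return sum + 5;
}

// Hypothetical name: the loop variant, with a compare and a branch
// on every iteration in addition to the multiplication.
int looped(int sum) {
    for (int i = 0; i < 7; i++) {
        sum = 5 * sum;
    }
    return sum + 5;
}

// Sanity check that the two variants agree for a small input.
void check_equivalence() {
    assert(unrolled(1) == looped(1));
}
```

In both functions every multiplication uses the result of the previous one, so the only difference between them is the loop counter, the compare and the branch.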