Recently I had to write code for critical real-time functionality, and I used a few __builtin_... functions. I understand that such code is not portable, because not all compilers support the "__builtin_..." functions or syntax. I was wondering whether there is a way to write the code in plain C so that the compiler can recognize the pattern and use some internal "__builtin_..."-like function.
Below is a description of a small experiment I did, but my question is:
- Are there any tips, best known methods, or guidelines for writing portable C code so that the compiler can detect the pattern (let's put compiler bugs aside) and make maximum use of the target CPU architecture?
For example, take reversing the bytes in a dword (so that the first byte becomes the last one, the last one becomes the first, and so on). The x86_64 architecture has a dedicated assembly instruction for this: bswap. I tried 4 different options:
#include <stdint.h>
#include <stdlib.h>
typedef union _helper_s
{
    uint32_t val;
    uint8_t bytes[4];
} helper_u;
uint32_t reverse(uint32_t d)
{
    helper_u b;
    uint8_t temp;
    b.val = d;
    temp = b.bytes[0];
    b.bytes[0] = b.bytes[3];
    b.bytes[3] = temp;
    temp = b.bytes[1];
    b.bytes[1] = b.bytes[2];
    b.bytes[2] = temp;
    return b.val;
}
uint32_t reverse1(uint32_t d)
{
    helper_u b;
    uint8_t temp;
    b.val = d;
    for (size_t i = 0; i < sizeof(uint32_t) / 2; i++)
    {
        temp = b.bytes[i];
        b.bytes[i] = b.bytes[sizeof(uint32_t) - i - 1];
        b.bytes[sizeof(uint32_t) - i - 1] = temp;
    }
    return b.val;
}
uint32_t reverse2(uint32_t d)
{
    return (d << 24) | (d >> 24 ) | ((d & 0xFF00) << 8) | ((d & 0xFF0000) >> 8);
}
uint32_t reverse3(uint32_t d)
{
    return __builtin_bswap32(d);
}
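To double-check that the variants really are interchangeable before comparing the generated assembly, a small self-check along the following lines could be used. This is only a sketch: the shift-and-mask version is repeated so the block compiles on its own, memcpy stands in for the union, and the names reverse_shift, reverse_bytes, variants_agree and self_check as well as the sample values are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Shift-and-mask variant, repeated here so the block compiles on its own. */
uint32_t reverse_shift(uint32_t d)
{
    return (d << 24) | (d >> 24) | ((d & 0xFF00u) << 8) | ((d & 0xFF0000u) >> 8);
}

/* Byte-array variant. memcpy stands in for the union here (an assumption of
   this sketch); the two swaps are the same as in the union version. */
uint32_t reverse_bytes(uint32_t d)
{
    uint8_t b[4];
    uint8_t temp;
    memcpy(b, &d, sizeof b);
    temp = b[0]; b[0] = b[3]; b[3] = temp;
    temp = b[1]; b[1] = b[2]; b[2] = temp;
    memcpy(&d, b, sizeof d);
    return d;
}

/* Hypothetical helper: returns 1 if both variants agree on x and swapping
   twice gives x back, 0 otherwise. */
int variants_agree(uint32_t x)
{
    return reverse_shift(x) == reverse_bytes(x)
        && reverse_shift(reverse_shift(x)) == x;
}

/* Runs the check over a few arbitrary sample values. */
void self_check(void)
{
    static const uint32_t samples[] = { 0u, 0x11223344u, 0xDEADBEEFu, 0xFFFFFFFFu };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(variants_agree(samples[i]));
}
```

Calling self_check() once at startup (or from a unit test) is enough; it does not say anything about which instructions the compiler emits, only that the results match.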
All the options provide the same functionality. I compiled them with different compilers at different optimization levels, and the results were not so good:
- GCC did great! At both the -O3 and -Os optimization levels it gave the same result for all the functions:
    reverse:
        mov eax, edi
        bswap eax
        ret
    reverse1:
        mov eax, edi
        bswap eax
        ret
    reverse2:
        mov eax, edi
        bswap eax
        ret
    reverse3:
        mov eax, edi
        bswap eax
        ret
- Clang disappointed me a little. With -O3 it gave the same result as GCC, but with -Os it completely lost track in reverse1. It did not recognize the pattern and produced a far less optimal binary:
    reverse1:                               # @reverse1
        lea rax, [rsp - 8]
        mov dword ptr [rax], edi
        mov ecx, 3
    .LBB1_1:                                # =>This Inner Loop Header: Depth=1
        mov sil, byte ptr [rax]
        mov dl, byte ptr [rsp + rcx - 8]
        mov byte ptr [rax], dl
        mov byte ptr [rsp + rcx - 8], sil
        dec rcx
        inc rax
        cmp rcx, 1
        jne .LBB1_1
        mov eax, dword ptr [rsp - 8]
        ret
  Actually, the only difference between reverse and reverse1 is that reverse is the "loop unrolled" version of reverse1, so I assume that with -Os the compiler didn't even try to unroll the for loop or work out its purpose.
- With ICC things went even worse: it was unable to recognize the pattern in the reverse and reverse1 functions at either the -O3 or the -Os optimization level.
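One common way to hedge against results like these is to keep the builtin where it is known to exist and fall back to the shift-and-mask form elsewhere. The following is a minimal sketch, not a definitive answer: bswap32_portable is a hypothetical name, the compiler checks are assumptions about typical toolchains (__has_builtin on Clang and recent GCC, _byteswap_ulong from <stdlib.h> on MSVC), and the GCC version cut-off is deliberately conservative.

```c
#include <stdint.h>

#if defined(_MSC_VER)
#include <stdlib.h> /* _byteswap_ulong */
#endif

/* Not every compiler provides __has_builtin; treat it as "no" when missing. */
#ifndef __has_builtin
#define __has_builtin(x) 0
#endif

/* Hypothetical wrapper name: uses an intrinsic when one is known to be
   available, otherwise the plain C shift-and-mask form (reverse2 above). */
static inline uint32_t bswap32_portable(uint32_t d)
{
#if __has_builtin(__builtin_bswap32) || (defined(__GNUC__) && __GNUC__ >= 5)
    return __builtin_bswap32(d);
#elif defined(_MSC_VER)
    /* unsigned long is 32 bits wide with MSVC */
    return (uint32_t)_byteswap_ulong(d);
#else
    return (d << 24) | (d >> 24) | ((d & 0xFF00u) << 8) | ((d & 0xFF0000u) >> 8);
#endif
}
```

The fallback branch is the form that all three compilers in the experiment above turned into bswap, so it seems the safest plain C spelling of the four; whether a given compiler version actually recognizes it still has to be checked in the generated assembly.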
P.S.
I often hear people say that code has to be written so that even a junior programmer can easily understand it, and that modern compilers are "smart" enough to take care of the optimizations. Now I have evidence that this is not true (or at least not always true).
 
     
    