How to write compiler "understandable" C code?

Question

Recently I had to write code for critical real-time functionality, and I used a few __builtin_... functions. I understand that such code is not portable, because not all compilers support the "__builtin_..." functions or syntax. I was wondering whether there is a way to write code in plain C so that the compiler can recognize it and use some internal "__builtin_..."-like function.

Below is a description of a small experiment I did, but my question is:

Are there any tips, best known methods, or guidelines for writing portable C code so that the compiler can detect the pattern (let's put aside compiler bugs) and use the maximum ability of the target CPU architecture?

For example, consider reversing the bytes in a dword (so that the first byte becomes the last one, the last one becomes the first one, and so on). The x86_64 architecture has a dedicated assembly instruction for this: bswap. I tried 4 different options:

#include <stdint.h>
#include <stdlib.h>

typedef union _helper_s
{
    uint32_t val;
    uint8_t bytes[4];
} helper_u;

uint32_t reverse(uint32_t d)
{
    helper_u b;
    uint8_t temp;

    b.val = d;
    temp = b.bytes[0];
    b.bytes[0] = b.bytes[3];
    b.bytes[3] = temp;
    temp = b.bytes[1];
    b.bytes[1] = b.bytes[2];
    b.bytes[2] = temp;

    return b.val;
}

uint32_t reverse1(uint32_t d)
{
    helper_u b;
    uint8_t temp;

    b.val = d;
    for (size_t i = 0; i < sizeof(uint32_t) / 2; i++)
    {
        temp = b.bytes[i];
        b.bytes[i] = b.bytes[sizeof(uint32_t) - i - 1];
        b.bytes[sizeof(uint32_t) - i - 1] = temp;
    }

    return b.val;
}

uint32_t reverse2(uint32_t d)
{
    return (d << 24) | (d >> 24 ) | ((d & 0xFF00) << 8) | ((d & 0xFF0000) >> 8);
}

uint32_t reverse3(uint32_t d)
{
    return __builtin_bswap32(d);
}

All the options provide the same functionality. I compiled them with different compilers and different optimization levels, and the results were not so good:

GCC did great! For both the -O3 and -Os optimization levels it gave the same result for all the functions:

reverse:
        mov     eax, edi
        bswap   eax
        ret
reverse1:
        mov     eax, edi
        bswap   eax
        ret
reverse2:
        mov     eax, edi
        bswap   eax
        ret
reverse3:
        mov     eax, edi
        bswap   eax
        ret

Clang disappointed me a little. With -O3 it gave the same result as GCC, but with -Os it completely lost its way in reverse1. It didn't recognize the pattern and produced a far less optimal binary:

reverse1:                               # @reverse1
        lea     rax, [rsp - 8]
        mov     dword ptr [rax], edi
        mov     ecx, 3
.LBB1_1:                                # =>This Inner Loop Header: Depth=1
        mov     sil, byte ptr [rax]
        mov     dl, byte ptr [rsp + rcx - 8]
        mov     byte ptr [rax], dl
        mov     byte ptr [rsp + rcx - 8], sil
        dec     rcx
        inc     rax
        cmp     rcx, 1
        jne     .LBB1_1
        mov     eax, dword ptr [rsp - 8]
        ret

Actually, the only difference between reverse and reverse1 is that reverse is the "loop unrolled" version of reverse1, so I assume that with -Os the compiler didn't even try to unroll the for loop or anticipate its purpose.

With ICC, things went even worse: it was unable to recognize the pattern in the reverse and reverse1 functions at either the -O3 or the -Os optimization level.

P.S.

I often hear people say that code has to be written so that even a junior programmer can easily understand it, and that modern compilers are "smart" enough to take care of the optimizations. Now I have evidence that this is not true (or at least not always true).

I think the advice would be: follow common patterns for common problems. Don't use "smart" hacks such as the `xor` method for swapping, weird arithmetic for finding a minimum, and the like. The common patterns are the ones the compiler is likely to recognize. — Eugene Sh., Oct 01 '19 at 20:07
I'll echo @EugeneSh.'s comment here, as a reformed `xor`-swap abuser. — Christian Gibbons, Oct 01 '19 at 20:11
@EugeneSh. Right. `reverse` is a straightforward solution for byte order reversal, however ICC still failed to recognize it... — Alex Lop., Oct 01 '19 at 20:12
@AlexLop. That's where maharvey's answer would come into play. If one compiler doesn't optimize nicely, you can use conditional compilation directives to give it a special case. You might also consider filing a bug (new feature?) on the lesser-optimized compiler's bug tracker so they might consider adding in an optimization for the common pattern. — Christian Gibbons, Oct 01 '19 at 20:17
Didn't your test reveal the `reverse2` works for all the compilers? — jxh, Oct 01 '19 at 20:31
@jxh for the three I checked, yes. But I wouldn't say it is more readable than `reverse` for instance. — Alex Lop., Oct 01 '19 at 20:33
ICC kinda sucks. I use the free license for it and, to be honest, the code generation is just not that good. — S.S. Anne, Oct 01 '19 at 22:27

score 1 · Answer 1 · answered Oct 01 '19 at 19:55

As far as I am aware, the proper way to do this is with conditional compilation.

My suggestion is to write plain normal code in standard C as the default, both for maintainability and as a fall-back path that all compilers can handle. Utilize conditional compilation only as necessary to optimize for specific compilers, with a comment explaining the reason for the exception.
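As a minimal sketch of that approach, assuming a hypothetical wrapper named bswap32 and a hypothetical fallback named bswap32_portable (neither is a standard function): the standard C version is the default path, and the builtin is used only where the compiler is known to provide it.

```c
#include <stdint.h>

/* Default path: plain standard C that any conforming compiler can handle.
   bswap32_portable is a made-up name for this sketch. */
static uint32_t bswap32_portable(uint32_t d)
{
    return ((d & 0x000000FFU) << 24) |
           ((d & 0x0000FF00U) << 8)  |
           ((d & 0x00FF0000U) >> 8)  |
           ((d & 0xFF000000U) >> 24);
}

/* bswap32 is a hypothetical wrapper, not a library function. */
static uint32_t bswap32(uint32_t d)
{
#if defined(__GNUC__) || defined(__clang__)
    /* Exception: GCC and Clang expose __builtin_bswap32, so use it directly
       rather than relying on pattern recognition at every optimization level. */
    return __builtin_bswap32(d);
#else
    /* Every other compiler gets the portable fallback. */
    return bswap32_portable(d);
#endif
}
```

Callers only ever use bswap32, so the compiler-specific branch stays confined to one place, and the comment next to it records why the exception exists.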

jxh · Answer 2 · 2019-10-02T22:32:19.633

The technique used for reverse2 is fairly idiomatic (here, for example), and your own testing showed that it is properly optimized on all the systems you tested on. To make the implementation easier to understand, you can introduce more whitespace, and follow a more regular pattern.

uint32_t reverse2(uint32_t d)
{
    return ((d & 0x000000FFU) << 24) |
           ((d & 0x0000FF00U) << 8)  |
           ((d & 0x00FF0000U) >> 8)  |
           ((d & 0xFF000000U) >> 24) ;
}

Try It Online : gcc

Try It Online : clang

To your specific points:

> Are there any tips, best known methods, or guidelines for writing portable C code so that the compiler can detect the pattern (let's put aside compiler bugs) and use the maximum ability of the target CPU architecture?

The key takeaway should be to try to write idiomatic code. Judging whether code is understandable is somewhat subjective. What may seem clear to me can appear incomprehensible to someone else (and vice versa). However, there are common idioms in C programming that should be followed whenever it is appropriate to do so.

Unfortunately, I do not have a handy list of idioms off the top of my head. But I can say I largely learned C from reading The C Programming Language (by K & R, of course), and I was an avid reader of C Programming FAQs (by Steve Summit).

A very good source of C idioms, however, is reading and comprehending open source C projects and, of course, the code base of the company you work at. Following the latter has the added benefit that any code you add that follows existing conventions will naturally increase the chances of it being understood by someone else in the company.

> I often hear people say that code has to be written so that even a junior programmer can easily understand it, and that modern compilers are "smart" enough to take care of the optimizations. Now I have evidence that this is not true (or at least not always true).

Compilers are just programs, so they cannot read your mind. The compiler will be programmed to look for particular patterns in the AST and apply optimizations to transform the tree into what it considers more optimal. Similarly, the peephole optimizer will look for patterns in the generated machine instructions, and then transform them into fewer equivalent instructions.

But these transformations are only possible if the generated tree or generated instructions follow a recognizable pattern. And these patterns are often determined by analyzing real-world software to see what kind of code gets generated for certain operations. If your source does not produce a tree or instruction sequence that the compiler recognizes, you may be partially losing out on the compiler's help to optimize.

Thus, another reason to try to write idiomatic C code.
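To illustrate, here are two shapes that are often cited as idioms optimizers recognize: a rotate written with masked shifts, and a little-endian load assembled from individual bytes. The function names rotl32 and load_le32 are made up for this sketch, and whether a given compiler collapses either one into a single instruction is an assumption to check against its generated assembly.

```c
#include <stdint.h>

/* Rotate left by n bits. Masking both shift counts keeps every shift in the
   range 0..31, so the expression is well defined for any n. */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << (n & 31U)) | (x >> ((32U - n) & 31U));
}

/* Assemble a 32-bit value from four bytes stored least significant first.
   This works regardless of the host's byte order or the pointer's alignment. */
static uint32_t load_le32(const unsigned char *p)
{
    return  (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}
```

Both functions are plain standard C, so they stay correct even when a compiler does not recognize the pattern; only the generated code quality differs.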

Now, it can be argued that forcing oneself to write idiomatic C is a form of micro-optimization. Should you try to teach the compiler how to optimize the way you write code, or let the compiler teach you how to write code it knows how to optimize? However, the momentum is carried by the existing C programmers that write code idiomatically. New C programmers adopt these idioms for the sake of writing code more easily understood by the people that will be reviewing their code.
