Why is this C++ wrapper class not being inlined away?

Question

EDIT - something's up with my build system. I'm still figuring out exactly what, but gcc was producing weird results (even though it's a .cpp file); once I used g++, it worked as expected.

This is a very reduced test-case for something I've been having trouble with, where using a numerical wrapper class (which I thought would be inlined away) made my program 10x slower.

This is independent of optimisation level (tried with -O0 and -O3).

Am I missing some detail in my wrapper class?

C++

I have the following program, in which I define a class which wraps a double and provides the + operator:

#include <cstdio>
#include <cstdlib>

#define INLINE __attribute__((always_inline)) inline

struct alignas(8) WrappedDouble {
    double value;

    INLINE friend const WrappedDouble operator+(const WrappedDouble& left, const WrappedDouble& right) {
        return {left.value + right.value};
    };
};

#define doubleType WrappedDouble // either "double" or "WrappedDouble"

int main() {
    int N = 100000000;
    doubleType* arr = (doubleType*)malloc(sizeof(doubleType)*N);
    for (int i = 1; i < N; i++) {
        arr[i] = arr[i - 1] + arr[i];
    }

    free(arr);
    printf("done\n");

    return 0;
}

I thought that this would compile to the same thing - it's doing the same calculations, and everything is inlined.

However, it's not - it produces a larger and slower result, regardless of optimisation level.

(This particular result is not significantly slower, but my actual use-case includes more arithmetic.)

EDIT - I am aware that this isn't constructing my array elements. I thought this might produce less ASM so I could understand it better, but I can change it if it's a problem.

EDIT - I am also aware that I should be using new[]/delete[]. Unfortunately gcc refused to compile that, even though it was in a .cpp file. This was a symptom of my build system being screwed up, which is probably my actual problem.

EDIT - If I use g++ instead of gcc, it produces identical output.

EDIT - I posted the wrong version of the ASM (-O0 instead of -O3), so this section isn't helpful.

Assembly

I'm using XCode's gcc on my Mac, on a 64-bit system. The result is the same, aside from the body of the for-loop.

Here's what it produces for the body of the loop if doubleType is double:

movq    -16(%rbp), %rax
movl    -20(%rbp), %ecx
subl    $1, %ecx
movslq  %ecx, %rdx
movsd   (%rax,%rdx,8), %xmm0    ## xmm0 = mem[0],zero
movq    -16(%rbp), %rax
movslq  -20(%rbp), %rdx
addsd   (%rax,%rdx,8), %xmm0
movq    -16(%rbp), %rax
movslq  -20(%rbp), %rdx
movsd   %xmm0, (%rax,%rdx,8)

The WrappedDouble version is much longer:

movq    -40(%rbp), %rax
movl    -44(%rbp), %ecx
subl    $1, %ecx
movslq  %ecx, %rdx
shlq    $3, %rdx
addq    %rdx, %rax
movq    -40(%rbp), %rdx
movslq  -44(%rbp), %rsi
shlq    $3, %rsi
addq    %rsi, %rdx
movq    %rax, -16(%rbp)
movq    %rdx, -24(%rbp)
movq    -16(%rbp), %rax
movsd   (%rax), %xmm0           ## xmm0 = mem[0],zero
movq    -24(%rbp), %rax
addsd   (%rax), %xmm0
movsd   %xmm0, -8(%rbp)
movsd   -8(%rbp), %xmm0         ## xmm0 = mem[0],zero
movsd   %xmm0, -56(%rbp)
movq    -40(%rbp), %rax
movslq  -44(%rbp), %rdx
movq    -56(%rbp), %rsi
movq    %rsi, (%rax,%rdx,8)

You should almost *never* use `malloc` in C++. It only allocates memory, but it doesn't construct objects. And almost never use `new[]` to allocate arrays, use `std::vector` instead. — Some programmer dude, Jan 07 '19 at 11:17
Thanks for the code review - does that affect the way the body of my for-loop is compiling? — cloudfeet, Jan 07 '19 at 11:22
Same timing with `std::vector` [Demo](http://quick-bench.com/UaSU2OONRvfchNh4LHQX-UpRyjI) :) — Jarod42, Jan 07 '19 at 11:30
Using a vector and the `-O2` flag, using a `double` or your class [compiles to the same code](https://godbolt.org/z/FekrSf) (with GCC 8.2). Note that removing the `INLINE` macro or using "proper" type-aliases didn't change anything. — Some programmer dude, Jan 07 '19 at 11:30
Let's assume that I have a good reason for wanting to work with a raw pointer - in this particular case, my actual code works with an SDK that passes me a raw pointer. — cloudfeet, Jan 07 '19 at 11:35
Using `new[]` makes the generated code smaller, but still the same for both types: https://godbolt.org/z/r0FxBQ — Some programmer dude, Jan 07 '19 at 11:40
What I've learned so far is that my problem goes away if I'm using `g++` instead of `gcc`. I'm surprised, because I thought they'd be the same when compiling "demo.cpp". — cloudfeet, Jan 07 '19 at 11:46
@cloudfeet I would not have any expectations regarding code compilation if you have obvious undefined behavior. If the compiler can prove that your code is UB (which is not hard in this case), it is free to emit literally anything, from an empty executable to an `rm -rf /` call. See [UB can result in time travel](https://blogs.msdn.microsoft.com/oldnewthing/20140627-00/?p=633). — Max Langhof, Jan 07 '19 at 11:52
The asm code you've put here is from a non-optimized build, so it is completely useless. Put asm code here, which is from an optimized build. "inlined away": what do you mean by this? `operator+` is inlined, even in non-optimized builds, just as you requested it by using the `always_inline` attribute. — geza, Jan 07 '19 at 12:13
OK, something is screwed up with my build system. I was attempting to compile with `gcc` - however, even though it was compiling `main.cpp`, it was throwing a fit about `new` and `delete`, which is why I'd fallen back to using malloc. — cloudfeet, Jan 07 '19 at 12:32
@geza - I tried it with optimisation on and off - it gave me longer results both times. However, that was for `gcc` (which was also refusing to compile `new`/`delete[]`). When I used `g++`, it worked as everyone else here described - and then `gcc` started working identically as well. — cloudfeet, Jan 07 '19 at 12:33
@cloudfeet: A longer result doesn't necessarily mean longer running time. Put optimized asm build here, as there is no point comparing non-optimized builds performance-wise. — geza, Jan 07 '19 at 12:37
The optimised version also ran slower (faster than the unoptimised version, but slower than the version without the wrapper), by about 10%. I don't have the original outputs anymore - and ever since I tried `g++`, `gcc` started behaving properly and won't give me the same results I saw originally. Sorry. :/ — cloudfeet, Jan 07 '19 at 12:42
IIRC, `gcc` on a `.cpp` file will compile it as C++, but since you used the `gcc` front-end it won't link the C++ standard library. So you'll get a link error if you use `new` instead of `malloc`. There's no good reason to ever use `gcc` on C++ code AFAIK, that's just what happens if you do so by accident. Of course you probably have a `gcc` that's actually Apple `clang`, but probably the behaviour is the same. — Peter Cordes, Jan 07 '19 at 13:26
BTW, if you want to actually speed this up, see the Prefix Sum links in my answer. It is possible to use SIMD for this, especially with AVX, but the compiler won't auto-vectorize it for you. — Peter Cordes, Jan 07 '19 at 13:30

score 7 · Answer 1 · answered Jan 07 '19 at 12:31

It is inlined, but not optimized away because you compiled with -O0 (the default). That generates asm for consistent debugging, allowing you to modify any C++ variable while stopped at a breakpoint on any line.

This means the compiler spills everything from registers after every statement, and reloads what it needs for the next. So more statements to express the same logic = slower code, whether they're in the same function or not. The Q&A "Why does clang produce inefficient asm for this simple floating point sum (with -O0)?" explains in more detail.

Normally -O0 won't inline functions, but it does respect __attribute__((always_inline)).

The Q&A "C loop optimization help for final assignment" explains why benchmarking or tuning with -O0 is totally pointless. Both versions are ridiculous garbage for performance.

If it wasn't inlined, there'd be a call instruction that called it inside the loop.

The asm is actually creating the pointers in registers for const WrappedDouble& left and right, and doing it very inefficiently: it uses multiple instructions instead of one lea. (The addq %rdx, %rax is the final step in one of those.)

Then it spills those pointer args to stack memory, because they're real variables and have to be in memory where a debugger could modify them. That's what movq %rax, -16(%rbp) and %rdx ... is doing.

After reloading and dereferencing those pointers, the addsd (add scalar double) result is itself spilled back to a local in stack memory with movsd %xmm0, -8(%rbp). This isn't a named variable, it's the return value of the function.

It's then reloaded and copied again to another stack location, then finally arr and i are loaded from the stack, along with the double result of operator+, and that's stored into arr[i] with movq %rsi, (%rax,%rdx,8). (Yes, LLVM used a 64-bit integer mov to copy a double that time. The earlier times used SSE2 movsd.)

All of those copies of the return value are on the critical path for the loop-carried dependency chain, because the next iteration reads arr[i-1]. Those ~5 or 6 cycle store-forwarding latencies really add up vs. 3 or 4 cycle FP add latency.

Obviously that's massively inefficient. With optimization enabled, gcc and clang have no trouble inlining and optimizing away your wrapper.

They also optimize by keeping around the arr[i] result in a register for use as the arr[i-1] result in the next iteration. This avoids the ~6 cycle store-forwarding latency that would otherwise be inside the loop, if it made asm like the source.

i.e. the optimized asm looks kind of like this C++:

double tmp = arr[0];   // kept in XMM0

for(...) {
   tmp += arr[i];   // no re-read of memory
   arr[i] = tmp;
}
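A compilable version of that loop, as a minimal sketch. The name `prefix_sum_in_place` and the template are assumptions made for illustration, not something from the question. It carries the running total in a local, which is roughly what the optimized asm does with XMM0, and it accepts either plain double or a cut-down copy of the question's WrappedDouble.

```cpp
#include <cassert>
#include <cstddef>

// Wrapper from the question, reduced to the essentials.
struct WrappedDouble {
    double value;

    friend WrappedDouble operator+(const WrappedDouble& left, const WrappedDouble& right) {
        return {left.value + right.value};
    }
};

// Hypothetical helper (not from the question): in-place prefix sum that
// carries the running total in a local instead of re-reading arr[i - 1].
template <typename T>
void prefix_sum_in_place(T* arr, std::size_t n) {
    assert(arr != nullptr || n == 0);
    if (n == 0) return;
    T tmp = arr[0];                 // running total; typically a register when optimizing
    for (std::size_t i = 1; i < n; i++) {
        tmp = tmp + arr[i];         // same operand order as arr[i - 1] + arr[i]
        arr[i] = tmp;
    }
}
```

Calling `prefix_sum_in_place(arr, N)` on a `double*` and on a `WrappedDouble*` and comparing the two at -O2 or -O3 (for example on Godbolt) is one way to check whether the wrapper really disappears.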

Amusingly, clang doesn't bother to initialize its tmp (xmm0) before the loop, because you don't bother to initialize the array. Strange it doesn't warn about UB. In practice a big malloc with glibc's implementation will give you fresh pages from the OS, and they will all hold zeros, i.e. 0.0. But clang will give you whatever was left around in XMM0! If you add a ((double*)arr)[0] = 1;, clang will load the first element before the loop.
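One way to sidestep the uninitialized read, as a sketch rather than a drop-in fix: let std::vector value-initialize the elements so every value starts at 0.0. The helper name `make_zeroed` is made up for this example, and the struct is a cut-down copy of the one in the question.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct WrappedDouble {
    double value;
};

// Hypothetical helper: returns n elements that are all value-initialized,
// so reading element 0 before writing it is well-defined (value == 0.0).
inline std::vector<WrappedDouble> make_zeroed(std::size_t n) {
    std::vector<WrappedDouble> v(n);   // value-initialization zeroes each element
    assert(v.size() == n);
    return v;
}
```

If an API needs a raw pointer, `v.data()` gives one that stays valid as long as the vector is alive and is not resized.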

Unfortunately the compiler doesn't know how to do any better than that for your Prefix Sum calculation. See "parallel prefix (cumulative) sum with SSE" and "SIMD prefix sum on Intel cpu" for ways to speed this up by another factor of maybe 2, and/or parallelize it.

I prefer Intel syntax, but the Godbolt compiler explorer can give you AT&T syntax like in your question if you like.

# gcc8.2 -O3 -march=haswell -Wall
.LC0:
    .string "done"
main:
    sub     rsp, 8
    mov     edi, 800000000
    call    malloc                  # return value in RAX

    vmovsd  xmm0, QWORD PTR [rax]   # load first element
    lea     rdx, [rax+8]            # p = &arr[1]
    lea     rcx, [rax+800000000]    # endp = arr + len

.L2:                                   # do {
    vaddsd  xmm0, xmm0, QWORD PTR [rdx]   # tmp += *p
    add     rdx, 8                        # p++
    vmovsd  QWORD PTR [rdx-8], xmm0       # p[-1] = tmp
    cmp     rdx, rcx
    jne     .L2                        # }while(p != endp);

    mov     rdi, rax
    call    free
    mov     edi, OFFSET FLAT:.LC0
    call    puts
    xor     eax, eax
    add     rsp, 8
    ret

Clang unrolls a bit, and like I said doesn't bother to init its tmp.

# just the inner loop from clang -O3
# with -march=haswell it unrolls a lot more, so I left that out.
# hence the 2-operand SSE2 addsd instead of 3-operand AVX vaddsd
.LBB0_1:                                # do {
    addsd   xmm0, qword ptr [rax + 8*rcx - 16]
    movsd   qword ptr [rax + 8*rcx - 16], xmm0
    addsd   xmm0, qword ptr [rax + 8*rcx - 8]
    movsd   qword ptr [rax + 8*rcx - 8], xmm0
    addsd   xmm0, qword ptr [rax + 8*rcx]
    movsd   qword ptr [rax + 8*rcx], xmm0
    add     rcx, 3                            # i += 3
    cmp     rcx, 100000002
    jne     .LBB0_1                     # } while(i!=100000002)

Apple XCode's gcc is really clang/LLVM in disguise, on modern OS X systems.

It's always a pleasure to read one of your answers. Amazing what difference a few comments in asm make! — Max Langhof, Jan 07 '19 at 12:45
@MaxLanghof: thanks, I'm glad someone enjoyed it. I spent longer than I initially meant to writing an answer that basically says "don't use -O0". :P — Peter Cordes, Jan 07 '19 at 13:00

score 2 · Accepted Answer · answered Jan 07 '19 at 12:17

Both versions result in identical assembly code with g++ and clang++ when you turn on optimizations with -O3.

Maxim Egorushkin


Thanks - I was using `-O3`, but I was using it with `gcc`. I compiled with `g++` instead, and it worked. – cloudfeet Jan 07 '19 at 12:35
The fact that `gcc` started producing different results after I'd run `g++` probably means something is interestingly, creatively wrong with my build environment. I appreciate your help, even though it turned out to be a foolish question which got downvoted! – cloudfeet Jan 07 '19 at 12:44
@cloudfeet: The asm output in your question is *clearly* from `-O0`. The stores/reloads make it totally obvious. See the commented optimized asm in my answer. You could have also had a `-O0` on the command line, or `-O3` wasn't actually being passed to your compiler at all. The 10x slower behaviour is also consistent with the asm you show for the two versions. – Peter Cordes Jan 07 '19 at 12:55
@PeterCordes Yeah, sorry - the one I posted was `-O0`. However, the `-O3` code was still slower and longer for the wrapped version (but shorter and faster than `-O0`). I should have posted that instead. (Also, I was passing the arguments directly on the command-line.) – cloudfeet Jan 07 '19 at 13:09

score 1 · Answer 3 · answered Jan 09 '19 at 11:54

For future reference (mine and anyone else's): I was seeing a few different things:

1. The XCode project I was using originally (which I adapted but didn't create) is somehow configured so that even the Release build wasn't using -O3.
2. Using gcc for C++ code is a bad idea. Even when compiling a .cpp file, it doesn't link to the C++ standard library by default. Using g++ is much smoother.
3. The most interesting (to me): even when the wrapper was inlining correctly, the wrapper disrupted some optimisations!

The third point was what caused the slowdown in my original code (not listed here) which led me down this path.

When you are adding a bunch of floating-point values, e.g. a + b + c + d, the compiler isn't allowed to re-order c or d because (since floating-point values are approximate) that might produce a subtly different result. However, it is allowed to swap a and b, because that first addition is symmetrical - and in my case, this let it use SIMD instructions on 64-bit builds.

However, when the wrapper was used, it didn't carry over the information that the first + is in fact commutative! It dutifully inlined everything away, but somehow didn't realise it was still allowed to swap the first two arguments. When I re-ordered the sums manually in the appropriate way, my two versions got equal performance.
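A small sketch of why the grouping matters while swapping the first two operands does not. The values and function names here are made up for illustration and do not reproduce the original code. It assumes default floating-point settings, i.e. no -ffast-math.

```cpp
#include <cassert>

// Hypothetical helpers: the same three values summed with two groupings.
inline double sum_left_to_right(double a, double b, double c) {
    return (a + b) + c;
}

inline double sum_right_grouped(double a, double b, double c) {
    return a + (b + c);
}

// Swapping the operands of a single addition gives the same result
// for non-NaN IEEE 754 doubles.
inline bool first_add_commutes(double a, double b) {
    return a + b == b + a;
}

// Hypothetical self-check with values chosen so the rounding is easy to see:
// 1.0 is far below the spacing between doubles near 1e20, so it is lost
// when added to -1e20 first.
inline void check_grouping_example() {
    const double a = 1e20, b = -1e20, c = 1.0;
    assert(sum_left_to_right(a, b, c) == 1.0);   // (a + b) is exactly 0
    assert(sum_right_grouped(a, b, c) == 0.0);   // (b + c) rounds back to -1e20
    assert(first_add_commutes(a, b));
}
```

Because the two groupings can differ like this, a compiler following strict IEEE semantics has to keep the source's grouping, while swapping a and b in the first addition is always safe.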

Was that missed-optimization with `gcc -O2` instead of `-O3` or something? That would have been a more interesting question than the un-optimized asm, if `arr[i] = arr[i - 1] + arr[i];` compiled differently than `arr[i] = arr[i] + arr[i - 1];` with gcc or LLVM. — Peter Cordes, Jan 10 '19 at 06:12
