Convert uint64_t to byte array portably and optimally in Clang

Question

If you want to convert uint64_t to a uint8_t[8] (little endian). On a little endian architecture you can just do an ugly reinterpret_cast<> or memcpy(), e.g:

void from_memcpy(const std::uint64_t &x, uint8_t* bytes) {
    std::memcpy(bytes, &x, sizeof(x));
}

This generates efficient assembly:

mov     rax, qword ptr [rdi]
mov     qword ptr [rsi], rax
ret

However it is not portable. It will have different behaviour on a little endian machine.

For converting uint8_t[8] to uint64_t there is a great solution - just do this:

void to(const std::uint8_t* bytes, std::uint64_t &x) {
    x = (std::uint64_t(bytes[0]) << 8*0) |
        (std::uint64_t(bytes[1]) << 8*1) |
        (std::uint64_t(bytes[2]) << 8*2) |
        (std::uint64_t(bytes[3]) << 8*3) |
        (std::uint64_t(bytes[4]) << 8*4) |
        (std::uint64_t(bytes[5]) << 8*5) |
        (std::uint64_t(bytes[6]) << 8*6) |
        (std::uint64_t(bytes[7]) << 8*7);
}

This looks inefficient but actually with Clang -O2 it generates exactly the same assembly as before, and if you compile on a big endian machine it will be smart enough to use a native byte swap instruction. E.g. this code:

void to(const std::uint8_t* bytes, std::uint64_t &x) {
    x = (std::uint64_t(bytes[7]) << 8*0) |
        (std::uint64_t(bytes[6]) << 8*1) |
        (std::uint64_t(bytes[5]) << 8*2) |
        (std::uint64_t(bytes[4]) << 8*3) |
        (std::uint64_t(bytes[3]) << 8*4) |
        (std::uint64_t(bytes[2]) << 8*5) |
        (std::uint64_t(bytes[1]) << 8*6) |
        (std::uint64_t(bytes[0]) << 8*7);
}

Compiles to:

mov     rax, qword ptr [rdi]
bswap   rax
mov     qword ptr [rsi], rax
ret

My question is: is there an equivalent reliably-optimised construct for converting in the opposite direction? I've tried this, but it gets compiled naively:

void from(const std::uint64_t &x, uint8_t* bytes) {
    bytes[0] = x >> 8*0;
    bytes[1] = x >> 8*1;
    bytes[2] = x >> 8*2;
    bytes[3] = x >> 8*3;
    bytes[4] = x >> 8*4;
    bytes[5] = x >> 8*5;
    bytes[6] = x >> 8*6;
    bytes[7] = x >> 8*7;
}

Edit: After some experimentation, this code does get compiled optimally with GCC 8.1 and later as long as you use uint8_t* __restrict__ bytes. However I still haven't managed to find a form that Clang will optimise.

My guess is that your inverse conversion doesn’t work because you’re not masking the values and instead rely on truncation via lossy implicit conversion. Try explicitly masking the byte out. You can also put the code into a loop, IIRC clang has no issue unrolling that and recognising the pattern. *In principle* this definitely works (though GCC’s optimiser seems to be dogged by QOI issues). — Konrad Rudolph, May 07 '19 at 12:43
@KonradRudolph: Also I experimented and GCC's successful optimisation does *not* work if you put it in a loop. — Timmmm, May 07 '19 at 14:28
Maybe this can help: [How to make GCC generate bswap instruction for big endian store without builtins?](https://stackoverflow.com/q/36497605/260411) — Toribio, May 07 '19 at 15:11
Unfortunately not - the answers there either use `memcpy()` or convert in the opposite direction. — Timmmm, May 07 '19 at 16:42
I tried changing OP's code in the link I provided to just inverting the `bytes[0]` to `bytes[7]`, etc. and it seems to have worked on Godbolt. It seems the pointer assignments before and after the shifts that are the trick for the optimization. — Toribio, May 07 '19 at 16:51
Can you post your code as an answer if it really works? I suspect you have solved a different problem though. — Timmmm, May 07 '19 at 17:03

Toribio · Answer 1 · 2019-05-10T07:14:36.537

3

Here's what I could test based on the discussion in OP's comments:

void from_optimized(const std::uint64_t &x, std::uint8_t* bytes) {
    std::uint64_t big;
    std::uint8_t* temp = (std::uint8_t*)&big;
    temp[0] = x >> 8*0;
    temp[1] = x >> 8*1;
    temp[2] = x >> 8*2;
    temp[3] = x >> 8*3;
    temp[4] = x >> 8*4;
    temp[5] = x >> 8*5;
    temp[6] = x >> 8*6;
    temp[7] = x >> 8*7;
    std::uint64_t* dest = (std::uint64_t*)bytes;
    *dest = big;
}

Looks like this will make things clearer for the compiler and let it assume the necessary parameters to optimize it (both on GCC and Clang with -O2).

Compiling to x86-64 (little endian) on Clang 8.0.0 (test on Godbolt):

mov     rax, qword ptr [rdi]
mov     qword ptr [rsi], rax
ret

Compiling to aarch64_be (big endian) on Clang 8.0.0 (test on Godbolt):

ldr     x8, [x0]
rev     x8, x8
str     x8, [x1]
ret

edited May 10 '19 at 07:14

answered May 07 '19 at 17:13

Toribio

3,963
3
34
48

Looks correct to me but yeah I agree that is pretty complicated. – Timmmm May 07 '19 at 19:14
It's very easy to check because there are a lot of compilers for big-endian and bi-endian on godbolt like ARM, MIPS, PPC, Sparc... For ARM GCC you can use `-mbig-endian`. With Clang you can also specify the [`-target` option](https://stackoverflow.com/a/39114754/995714) to get any architecture you want. And [the result is not as good](https://godbolt.org/z/i0kL8l) – phuclv May 10 '19 at 02:24
2

Doesn't it violate strict aliasing rules? – Kane May 10 '19 at 07:27
I beg to pardon. If you've read my link then you'll notice that the outputs in MIPS, PowerPC, Sparc... **aren't optimized**. Only ARM64 has good output. And you shouldn't even put a main in the godbolt link to remove the unnecessary distraction – phuclv May 12 '19 at 02:24
1

Careful - this violates strict aliasing unless `uint8_t` is `unsigned char`. Though it probably is. – Lightness Races in Orbit May 13 '19 at 12:20

score 3 · Accepted Answer · answered May 13 '19 at 13:43

3

What about returning a value? Easy to reason about and small assembly:

#include <cstdint>
#include <array>

auto to_bytes(std::uint64_t x)
{
    std::array<std::uint8_t, 8> b;
    b[0] = x >> 8*0;
    b[1] = x >> 8*1;
    b[2] = x >> 8*2;
    b[3] = x >> 8*3;
    b[4] = x >> 8*4;
    b[5] = x >> 8*5;
    b[6] = x >> 8*6;
    b[7] = x >> 8*7;
    return b;
}

https://godbolt.org/z/FCroX5

and big endian:

#include <stdint.h>

struct mybytearray
{
    uint8_t bytes[8];
};

auto to_bytes(uint64_t x)
{
    mybytearray b;
    b.bytes[0] = x >> 8*0;
    b.bytes[1] = x >> 8*1;
    b.bytes[2] = x >> 8*2;
    b.bytes[3] = x >> 8*3;
    b.bytes[4] = x >> 8*4;
    b.bytes[5] = x >> 8*5;
    b.bytes[6] = x >> 8*6;
    b.bytes[7] = x >> 8*7;
    return b;
}

https://godbolt.org/z/WARCqN

(std::array not available for -target aarch64_be? )

answered May 13 '19 at 13:43

phön

1,215
8
20

1

please see my comment on Flavio's answer. This still doesn't solve the issue on other big-endian architectures like MIPS, PPC or Sparc https://godbolt.org/z/t9Ck8- – phuclv May 13 '19 at 14:58
@phuclv yeah i see. hm. i played around with different ideas, but couldnt come up with better code gen for your mentioned architectures. still this solution seems nice and easy, in my opinion – phön May 13 '19 at 15:43
yeah. It seems modern compilers almost ignore the big endian architectures – phuclv May 13 '19 at 16:06

sebrockm · Answer 3 · 2019-05-13T12:16:46.087

First of all, the reason why your original from implementation cannot be optimized is because you are passing the arguments by reference and pointer. So, the compiler has to consider the possibility that both of of them point to the very same address (or at least that they overlap). As you have 8 consecutive read and write operations to the (potentially) same address, the as-if rule cannot be applied here.

Note, that just by removing the the & from the function signature, apparently GCC already considers this as proof that bytes does not point into x and thus this can safely be optimized. However, for Clang this is not good enough. Technically, of course bytes can point to from's stack memory (aka. to x), but I think that would be undefined behavior and thus Clang just misses this optimization.

Your implementation of to doesn't suffer from this issue because you have implemented it in such a way that first you read all the values of bytes and then you make one big assignment to x. So even if x and bytes point to the same address, as you do all the reading first and all the writing afterwards (instead of mixing reads and writes as you do in from), this can be optimized.

Flávio Toribio's answer works because it does precisely this: It reads all the values first and only then writes to the destination.

However, there are less complicated ways to achieve this:

void from(uint64_t x, uint8_t* dest) {
    uint8_t bytes[8];
    bytes[7] = uint8_t(x >> 8*7);
    bytes[6] = uint8_t(x >> 8*6);
    bytes[5] = uint8_t(x >> 8*5);
    bytes[4] = uint8_t(x >> 8*4);
    bytes[3] = uint8_t(x >> 8*3);
    bytes[2] = uint8_t(x >> 8*2);
    bytes[1] = uint8_t(x >> 8*1);
    bytes[0] = uint8_t(x >> 8*0);

    *(uint64_t*)dest = *(uint64_t*)bytes;
}

gets compiled to

mov     qword ptr [rsi], rdi
ret

on little endian and to

rev     x8, x0
str     x8, [x1]
ret

on big endian.

Note, that even if you passed x by reference, Clang would be able to optimize this. However, that would result in one more instruction each:

mov     rax, qword ptr [rdi]
mov     qword ptr [rsi], rax
ret

and

ldr     x8, [x0]
rev     x8, x8
str     x8, [x1]
ret

respectively.

Also note, that you can improve your implementation of to with a similar trick: Instead of passing the result by non-const reference, take the "more natural" approach and just return it from the function:

uint64_t to(const uint8_t* bytes) {
    return
        (uint64_t(bytes[7]) << 8*7) |
        (uint64_t(bytes[6]) << 8*6) |
        (uint64_t(bytes[5]) << 8*5) |
        (uint64_t(bytes[4]) << 8*4) |
        (uint64_t(bytes[3]) << 8*3) |
        (uint64_t(bytes[2]) << 8*2) |
        (uint64_t(bytes[1]) << 8*1) |
        (uint64_t(bytes[0]) << 8*0);
}

Summary:

Don't pass arguments by reference.
Do all the reading first, then all the writing.

Here are the best solutions I could get to for both, little endian and big endian. Note, how to and from are truly inverse operations that can be optimized to a no-op if executed one after another.

please stop just looking at ARM64 output and conclude that it works for other major bi/big-endian architectures like MIPS, PPC or Sparc. Yours is inefficient on those just like other answers https://godbolt.org/z/txvvZl — phuclv, May 13 '19 at 15:01

score 0 · Answer 4 · answered May 13 '19 at 12:32

0

The code you've given is way overcomplicated. You can replace it with:

void from(uint64_t x, uint8_t* dest) {
    x = htole64(x);
    std::memcpy(dest, &x, sizeof(x));
}

Yes, this uses the Linux-ism htole64(), but if you're on another platform you can easily reimplement that.

Clang and GCC optimize this perfectly, on both little- and big-endian platforms.

answered May 13 '19 at 12:32

John Zwinck

239,568
38
324
436

2

For some definitions of "easily", isn't this just moving the troublesome code from the question into a macro? – M.M May 13 '19 at 12:37
1

"you can easily reimplement that" - erm... sure. I might even ask a question about reimplementing it on StackOverflow. – Timmmm May 13 '19 at 14:19
@M.M: Not really, most platforms have a 64-bit version of `htonl()` already, and applications that contain code like the OP has probably would already have reimplemented that if needed. And it's more general-purpose/modular than the OP code. – John Zwinck May 14 '19 at 05:24

Convert uint64_t to byte array portably and optimally in Clang

4 Answers4

Summary:

Linked