Suppose you want to convert a uint64_t to a uint8_t[8] in little-endian byte order. On a little-endian architecture you can just do an ugly reinterpret_cast<> or memcpy(), e.g.:
void from_memcpy(const std::uint64_t &x, uint8_t* bytes) {
    std::memcpy(bytes, &x, sizeof(x));
}
This generates efficient assembly:
mov     rax, qword ptr [rdi]
mov     qword ptr [rsi], rax
ret
However, it is not portable: it will behave differently on a big-endian machine.
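To make the portability concern concrete, here is a minimal sketch of how the bytes produced by memcpy() follow the host's byte order. host_is_little_endian() and check_from_memcpy() are hypothetical helpers added only for illustration, and the sketch assumes the host is either plain little-endian or plain big-endian:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Copies the object representation of x, so the byte order is the host's.
void from_memcpy(const std::uint64_t &x, std::uint8_t* bytes) {
    std::memcpy(bytes, &x, sizeof(x));
}

// Hypothetical helper: true when the least significant byte is stored first.
bool host_is_little_endian() {
    const std::uint16_t probe = 1;
    std::uint8_t first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 1;
}

// Hypothetical self-check: bytes[0] holds the low byte only on a
// little-endian host; on a big-endian host it holds the high byte.
void check_from_memcpy() {
    std::uint8_t bytes[8];
    from_memcpy(0x0102030405060708ULL, bytes);
    if (host_is_little_endian()) {
        assert(bytes[0] == 0x08 && bytes[7] == 0x01);
    } else {
        assert(bytes[0] == 0x01 && bytes[7] == 0x08);
    }
}
```

So the same source gives two different byte layouts depending on where it runs, which is exactly what a fixed little-endian format must avoid.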
For converting uint8_t[8] to uint64_t there is a great solution - just do this:
void to(const std::uint8_t* bytes, std::uint64_t &x) {
    x = (std::uint64_t(bytes[0]) << 8*0) |
        (std::uint64_t(bytes[1]) << 8*1) |
        (std::uint64_t(bytes[2]) << 8*2) |
        (std::uint64_t(bytes[3]) << 8*3) |
        (std::uint64_t(bytes[4]) << 8*4) |
        (std::uint64_t(bytes[5]) << 8*5) |
        (std::uint64_t(bytes[6]) << 8*6) |
        (std::uint64_t(bytes[7]) << 8*7);
}
This looks inefficient, but with Clang -O2 it actually generates exactly the same assembly as before, and if you compile on a big-endian machine it will be smart enough to use a native byte swap instruction. The same thing can be seen on a little-endian machine by reading the bytes in big-endian order instead. E.g. this code:
void to(const std::uint8_t* bytes, std::uint64_t &x) {
    x = (std::uint64_t(bytes[7]) << 8*0) |
        (std::uint64_t(bytes[6]) << 8*1) |
        (std::uint64_t(bytes[5]) << 8*2) |
        (std::uint64_t(bytes[4]) << 8*3) |
        (std::uint64_t(bytes[3]) << 8*4) |
        (std::uint64_t(bytes[2]) << 8*5) |
        (std::uint64_t(bytes[1]) << 8*6) |
        (std::uint64_t(bytes[0]) << 8*7);
}
Compiles to:
mov     rax, qword ptr [rdi]
bswap   rax
mov     qword ptr [rsi], rax
ret
My question is: is there an equivalent reliably-optimised construct for converting in the opposite direction? I've tried this, but it gets compiled naively:
void from(const std::uint64_t &x, uint8_t* bytes) {
    bytes[0] = x >> 8*0;
    bytes[1] = x >> 8*1;
    bytes[2] = x >> 8*2;
    bytes[3] = x >> 8*3;
    bytes[4] = x >> 8*4;
    bytes[5] = x >> 8*5;
    bytes[6] = x >> 8*6;
    bytes[7] = x >> 8*7;
}
Edit: After some experimentation, this code does get compiled optimally with GCC 8.1 and later, as long as you declare the parameter as uint8_t* __restrict__ bytes. However, I still haven't managed to find a form that Clang will optimise.
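For reference, here is a sketch of the __restrict__ variant mentioned in the edit. The name from_restrict and the check_round_trip() helper are made up for illustration, and __restrict__ is a GCC/Clang extension rather than standard C++. Presumably it helps because uint8_t is normally a character type, so without it the compiler has to assume that each store through bytes might modify x:

```cpp
#include <cassert>
#include <cstdint>

// __restrict__ promises the compiler that bytes does not alias x.
// It is a compiler extension (GCC/Clang), not standard C++.
void from_restrict(const std::uint64_t &x, std::uint8_t* __restrict__ bytes) {
    bytes[0] = static_cast<std::uint8_t>(x >> 8*0);
    bytes[1] = static_cast<std::uint8_t>(x >> 8*1);
    bytes[2] = static_cast<std::uint8_t>(x >> 8*2);
    bytes[3] = static_cast<std::uint8_t>(x >> 8*3);
    bytes[4] = static_cast<std::uint8_t>(x >> 8*4);
    bytes[5] = static_cast<std::uint8_t>(x >> 8*5);
    bytes[6] = static_cast<std::uint8_t>(x >> 8*6);
    bytes[7] = static_cast<std::uint8_t>(x >> 8*7);
}

// The little-endian to() from the question, used here for a round trip.
void to(const std::uint8_t* bytes, std::uint64_t &x) {
    x = (std::uint64_t(bytes[0]) << 8*0) |
        (std::uint64_t(bytes[1]) << 8*1) |
        (std::uint64_t(bytes[2]) << 8*2) |
        (std::uint64_t(bytes[3]) << 8*3) |
        (std::uint64_t(bytes[4]) << 8*4) |
        (std::uint64_t(bytes[5]) << 8*5) |
        (std::uint64_t(bytes[6]) << 8*6) |
        (std::uint64_t(bytes[7]) << 8*7);
}

// Hypothetical self-check: the low byte comes first on any host,
// and to() recovers the original value.
void check_round_trip() {
    std::uint8_t bytes[8];
    from_restrict(0x0102030405060708ULL, bytes);
    assert(bytes[0] == 0x08 && bytes[7] == 0x01);
    std::uint64_t back = 0;
    to(bytes, back);
    assert(back == 0x0102030405060708ULL);
}
```

This only checks that the result is correct and independent of host byte order; whether the stores get merged into a single mov still depends on the compiler and version, as described above.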