I've seen this r10 weirdness a few times, so let's see if anyone knows what's up. 
Take this simple function:
#include <stdint.h>

#define SZ 4
void sink(uint64_t *p);
void andpop(const uint64_t* a) {
    uint64_t result[SZ];
    for (unsigned i = 0; i < SZ; i++) {
        result[i] = a[i] + 1;
    }
    sink(result);
}
It just adds 1 to each of the 4 64-bit elements of the passed-in array, stores the results in a local array, and calls sink() on that array (to keep the whole function from being optimized away).
Here's the corresponding assembly:
andpop(unsigned long const*):
        lea     r10, [rsp+8]
        and     rsp, -32
        push    QWORD PTR [r10-8]
        push    rbp
        mov     rbp, rsp
        push    r10
        sub     rsp, 40
        vmovdqa ymm0, YMMWORD PTR .LC0[rip]
        vpaddq  ymm0, ymm0, YMMWORD PTR [rdi]
        lea     rdi, [rbp-48]
        vmovdqa YMMWORD PTR [rbp-48], ymm0
        vzeroupper
        call    sink(unsigned long*)
        add     rsp, 40
        pop     r10
        pop     rbp
        lea     rsp, [r10-8]
        ret
Almost everything going on with r10 is hard to understand. First, r10 is set to point to rsp + 8. Then comes push QWORD PTR [r10-8], which as far as I can tell pushes a copy of the return address onto the stack. After that, rbp is set up as normal, and finally r10 itself is pushed.
To unwind all this, r10 is popped off of the stack and used to restore rsp to its original value.
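To make the pointer arithmetic easier to follow, here is a minimal sketch in C that replays the prologue and epilogue above on plain integers. It is only a model: regs_t, model_prologue, model_epilogue, check_roundtrip and run_checks are hypothetical names, the two saved_* fields stand in for the stack slots, and no real memory or return address is touched.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: only rsp, rbp and r10 are tracked, as plain integers.
   saved_rbp / saved_r10 stand in for the two stack slots that get popped. */
typedef struct {
    uint64_t rsp, rbp, r10;
    uint64_t saved_rbp, saved_r10;
} regs_t;

/* Replays the prologue; returns the address of result, i.e. [rbp-48]. */
uint64_t model_prologue(regs_t *r) {
    r->r10 = r->rsp + 8;                 /* lea  r10, [rsp+8]      */
    r->rsp &= ~(uint64_t)31;             /* and  rsp, -32          */
    r->rsp -= 8;                         /* push QWORD PTR [r10-8] */
    r->saved_rbp = r->rbp; r->rsp -= 8;  /* push rbp               */
    r->rbp = r->rsp;                     /* mov  rbp, rsp          */
    r->saved_r10 = r->r10; r->rsp -= 8;  /* push r10               */
    r->rsp -= 40;                        /* sub  rsp, 40           */
    return r->rbp - 48;
}

/* Replays the epilogue up to (but not including) ret. */
void model_epilogue(regs_t *r) {
    r->rsp += 40;                        /* add  rsp, 40           */
    r->r10 = r->saved_r10; r->rsp += 8;  /* pop  r10               */
    r->rbp = r->saved_rbp; r->rsp += 8;  /* pop  rbp               */
    r->rsp = r->r10 - 8;                 /* lea  rsp, [r10-8]      */
}

/* Returns 1 if result is 32-byte aligned, rsp points at it at the call,
   and both rsp and rbp are back to their entry values before ret. */
int check_roundtrip(uint64_t entry_rsp) {
    regs_t r = { .rsp = entry_rsp, .rbp = 0x1234 };
    uint64_t buf = model_prologue(&r);
    int ok_at_call = (buf % 32 == 0) && (r.rsp == buf);
    model_epilogue(&r);
    return ok_at_call && r.rsp == entry_rsp && r.rbp == 0x1234;
}

/* Example use: an entry rsp that is 8 mod 16, as it would be right after a call. */
void run_checks(void) {
    assert(check_roundtrip(0x7ffc12345678ULL));
}
```

In this model the final lea is what undoes the and rsp, -32, since the amount subtracted by the alignment step is not known at compile time. The pushed copy of the return address only shifts rsp by 8 here; nothing in the model reads it back.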
Some observations:
- Looking at the entire function, all of this seems like a totally roundabout way of simply restoring rsp to its original value before ret, but the usual epilog of mov rsp, rbp would do just as well (see clang)!
- That said, the (expensive) push QWORD PTR [r10-8] doesn't even help in that mission: this value (the return address?) is apparently never used.
- Why is r10 pushed and popped at all? The value isn't clobbered in the very small function body and there is no register pressure.
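For comparison with the first observation, here is a small C sketch of the simpler scheme, again on plain integers. It assumes a frame-pointer-first sequence of the shape push rbp / mov rbp, rsp / and rsp, -32 / sub rsp, 32 / ... / mov rsp, rbp / pop rbp; that exact sequence is my assumption for illustration, not a listing copied from any compiler, and fp_first_roundtrip is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a frame-pointer-first prologue/epilogue.
   Writes the address of the 32-byte local area to *buf_out and
   returns the value of rsp just before ret. */
uint64_t fp_first_roundtrip(uint64_t entry_rsp, uint64_t *buf_out) {
    uint64_t rsp = entry_rsp;
    rsp -= 8;                  /* push rbp      */
    uint64_t rbp = rsp;        /* mov  rbp, rsp */
    rsp &= ~(uint64_t)31;      /* and  rsp, -32 */
    rsp -= 32;                 /* sub  rsp, 32  */
    *buf_out = rsp;            /* local area starts at the new rsp */
    rsp = rbp;                 /* mov  rsp, rbp */
    rsp += 8;                  /* pop  rbp      */
    return rsp;
}

/* Example use with an entry rsp that is 8 mod 16. */
void run_fp_first_check(void) {
    uint64_t buf;
    assert(fp_first_roundtrip(0x7ffc12345678ULL, &buf) == 0x7ffc12345678ULL);
    assert(buf % 32 == 0);
}
```

Because rbp is captured before the alignment step in this variant, mov rsp, rbp is enough to undo it, and no extra register has to be saved.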
What's up with that? I've seen it several times before, and it usually wants to use r10, sometimes r13. It seems likely that this has something to do with aligning the stack to 32 bytes, since if you change SZ to be less than 4 it uses xmm ops and the issue disappears.
Here's SZ == 2 for example:
andpop(unsigned long const*):
        sub     rsp, 24
        vmovdqa xmm0, XMMWORD PTR .LC0[rip]
        vpaddq  xmm0, xmm0, XMMWORD PTR [rdi]
        mov     rdi, rsp
        vmovaps XMMWORD PTR [rsp], xmm0
        call    sink(unsigned long*)
        add     rsp, 24
        ret
Much nicer!
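One way to probe the alignment hypothesis, as a sketch only: request 32-byte alignment explicitly on a small local that doesn't need ymm ops, then look at the generated prologue. The function names below are hypothetical and I haven't checked what any particular compiler version emits for it. The code itself only verifies that the local really ends up 32-byte aligned.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical probe: returns 1 if a local declared with alignas(32)
   lands on a 32-byte boundary and holds the expected values. */
int local_is_32_aligned(void) {
    alignas(32) uint64_t buf[2] = {1, 2};
    volatile uint64_t *p = buf;   /* volatile access keeps buf in memory */
    return ((uintptr_t)p % 32 == 0) && (p[0] + p[1] == 3);
}

/* Example use. */
void run_alignment_check(void) {
    assert(local_is_32_aligned());
}
```

If the r10 sequence shows up for this function too, that would point at the 32-byte realignment itself rather than at the ymm instructions.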