I'm trying to efficiently implement the x86 SHLD and SHRD instructions without using inline assembly.
    uint32_t shld_UB_on_0(uint32_t a, uint32_t b, uint32_t c) {
        return a << c | b >> 32 - c;
    }
seems to work, but invokes undefined behaviour when c == 0, because the second shift's count becomes 32, which equals the width of uint32_t. The actual SHLD instruction with its third operand being 0 is well defined to do nothing (https://www.felixcloutier.com/x86/shld).
    uint32_t shld_broken_on_0(uint32_t a, uint32_t b, uint32_t c) {
        return a << c | b >> (-c & 31);
    }
doesn't invoke undefined behaviour, but when c == 0 the result is a | b instead of a, because -c & 31 is then 0 and b is not shifted at all.
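To make the c == 0 problem concrete, here is a small sketch that repeats shld_broken_on_0 so it compiles on its own. The helper name masked_demo and the sample values are only illustrative assumptions, not part of any library.

```c
#include <assert.h>
#include <stdint.h>

/* Same body as shld_broken_on_0 above, repeated so this compiles alone. */
uint32_t shld_broken_on_0(uint32_t a, uint32_t b, uint32_t c) {
    return a << c | b >> (-c & 31);
}

/* Hypothetical helper: returns 1 if the masked version behaves as
   described for c == 0 (wrong) and c == 4 (right) on one sample input. */
int masked_demo(void) {
    uint32_t a = 0x12345678u, b = 0x9ABCDEF0u;
    /* c == 0: (-0 & 31) == 0, so b is not shifted and is ORed in whole. */
    int zero_case = shld_broken_on_0(a, b, 0) == (a | b);
    /* c == 4: (-4 & 31) == 28, the same as 32 - 4, so the top nibble
       of b lands in the low nibble of the result. */
    int four_case = shld_broken_on_0(a, b, 4) == 0x23456789u;
    return zero_case && four_case;
}
```

Calling masked_demo() from a test should return 1, which shows that the masked form is only wrong for a count of 0.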
    uint32_t shld_safe(uint32_t a, uint32_t b, uint32_t c) {
        if (c == 0) return a;
        return a << c | b >> 32 - c;
    }
does what's intended, but gcc now emits a je (a conditional jump). clang, on the other hand, is smart enough to translate it to a single shld instruction.
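For checking candidates like these against each other, a rough testing oracle can be sketched with 64-bit arithmetic, assuming the count stays in 0..31. The names shld_ref and shld_safe_matches_ref are hypothetical, and this makes no claim about what gcc or clang generate for it.

```c
#include <assert.h>
#include <stdint.h>

/* shld_safe as defined above, repeated so this compiles on its own. */
uint32_t shld_safe(uint32_t a, uint32_t b, uint32_t c) {
    if (c == 0) return a;
    return a << c | b >> (32 - c);
}

/* Hypothetical reference model: treat a:b as one 64-bit value, shift it
   left by c, and keep the upper 32 bits. Assumes 0 <= c <= 31. */
uint32_t shld_ref(uint32_t a, uint32_t b, uint32_t c) {
    uint64_t wide = (uint64_t)a << 32 | b;
    return (uint32_t)((wide << c) >> 32);
}

/* Returns 1 if both functions agree for every count in 0..31. */
int shld_safe_matches_ref(uint32_t a, uint32_t b) {
    for (uint32_t c = 0; c < 32; ++c) {
        if (shld_safe(a, b, c) != shld_ref(a, b, c)) return 0;
    }
    return 1;
}
```

A check such as shld_safe_matches_ref(0x12345678u, 0x9ABCDEF0u) only exercises one input pair, so it is a sanity check rather than a proof.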
Is there any way to implement it correctly and efficiently without inline assembly?
And why does gcc try so hard to avoid emitting shld? gcc 11.2 at -O3 compiles the shld_safe attempt to (Godbolt):
    shld_safe:
            mov     eax, edi
            test    edx, edx
            je      .L1
            mov     ecx, 32
            sub     ecx, edx
            shr     esi, cl
            mov     ecx, edx
            sal     eax, cl
            or      eax, esi
    .L1:
            ret
while clang produces:
    shld_safe:
            mov     ecx, edx
            mov     eax, edi
            shld    eax, esi, cl
            ret