AT&T syntax uses the opposite order from Intel syntax. The rotate count has to be first, not last: rol $1, %0.
Also, you don't need and shouldn't use inline asm for this: https://gcc.gnu.org/wiki/DontUseInlineAsm
As described in Best practices for circular shift (rotate) operations in C++, GNU C has intrinsics for narrow rotates, because the rotate-idiom recognition code fails to optimize away an and of the rotate count. x86 shifts/rotates mask the count with count & 31 even for 8-bit and 16-bit, but rotates still wrap around. It does matter for shifts though.
Anyway, gcc has a builtin function for narrow rotates to avoid any overhead. There's a __rolb wrapper for it in x86intrin.h, but MSVC uses its own __rotr8 and so on from its intrin.h. Anyway, clang doesn't support either the __builtin or the x86intrin.h wrappers for rotates, but gcc and ICC do.
#include <stdint.h>
uint8_t rotate_left_byte_by1(uint8_t a) {
return __builtin_ia32_rolqi(a, 1); // qi = quarter-integer
}
I used uint8_t from stdint.h like a normal person instead of defining a byte type.
This doesn't compile at all with clang, but it compiles as you'd hope with gcc7.2:
rotate_left_byte_by1:
movl %edi, %eax
rolb %al
ret
This gives you a function that compiles as efficiently as your inline asm ever could, but which can optimize away completely for compile-time constants, and the compiler knows how it works / what it does and can optimize accordingly.