Basically I have a struct with the definition
#define BATCH_SIZE 8
#define BATCH_SIZE_LOG 3
//#define BATCH_MASK 0x7070707070707070
// for the sake of understanding the ASM turn this into a no-op
#define BATCH_MASK (~(0UL))
struct batcher {
uint8_t indexes[8];
uint64_t vals[8 * BATCH_SIZE];
uint32_t __attribute__((noinline))
push(const uint64_t i, const uint64_t v) {
if(__builtin_expect(indexes[i] < (BATCH_SIZE - 1), 1)) {
vals[8 * i + indexes[i]++] = v;
return 0;
}
return 1;
}
uint32_t __attribute__((noinline))
claim(const uint64_t i) {
if(__builtin_expect(indexes[i] == (BATCH_SIZE - 1), 1)) {
indexes[i] = 8;
return 0;
}
return 1;
}
uint32_t
can_pop() const {
if(*((uint64_t *)(&indexes[0])) & BATCH_MASK) {
return 1;
}
return 0;
}
uint64_t __attribute__((noinline))
pop() {
if(__builtin_expect(can_pop(), 1)) {
const uint32_t idx = _tzcnt_u64(*((uint64_t *)(&indexes[0])) & BATCH_MASK) >> BATCH_SIZE;
return vals[8 * idx + --indexes[idx]];
}
return 0;
}
};
What I am curious about is whether pop can be implemented with only 1 memory load from indexes (so 2 total, 1 from indexes and 1 from vals)
The first memory load is to interpret all of indexes as a uint64_t so that I can check if it is non 0, and if so use one of those indexes.
I have been looking at the assembly output here
which has implements pop with
batcher::pop():
movq (%rdi), %rax // first load from indexes
testq %rax, %rax
jne .L11
ret
.L11:
xorl %edx, %edx
movzbl (%rdi,%rdx), %eax // second load from indexes
decl %eax
movb %al, (%rdi,%rdx)
movzbl %al, %eax
movq 8(%rdi,%rax,8), %rax
ret
The way the compiler implements this is one from from %(rdi) to %rax for the interpretation as a uint64_t (testing if there are any non 0 indexes) and a second load if the condition passes loading the calculated uint8_t index.
I am wondering if there is a way to implement pop in assembly (what I will be doing) without two loads. I know I could accomplish the same logic with shifting / masking on the result from the first load. What I am specifically wondering is if there is a way for me to index into the uint64_t resulting from the first load as if it where is uint8_t[8] array.
My guess is that this is not possibly because register don't have a memory address so it doesn't fully make sense to be able to do that but I might be missing some instruction made specifically for isolating bytes in a uint64_t or some way that the assembly implementation of pop could be refactored to enable this.
Note: I am limited to the instruction sets available on Intel Skylake.
If anyone has any ideas I would appreciate them. Thank you!