Suppose I have two uint16_t[4] arrays, a and b. Each integer in these arrays is in the range [0, 16383], so bits 14 and 15 aren't set. Then I have some code to find the minimum and maximum among a[i] and b[i] for each i:
uint16_t min[4], max[4];
for (int i = 0; i < 4; i++) {
if (a[i] < b[i]) {
min[i] = a[i];
max[i] = b[i];
} else {
min[i] = b[i];
max[i] = a[i];
}
}
Suppose for some reason I can't/won't use SIMD, but I'd still like to compute this as fast as possible, on a 64-bit platform. Thus, a natural solution is to use the SIMD-within-a-register (SWAR) paradigm on 64-bit registers to compute these 4 values in a single iteration, rather than over 4 iterations with 16-bit arithmetic.
What bit-twiddling hacks could be used to implement either (min or max) or ideally both operations using the SWAR paradigm, so that the resulting code is faster than the loop above? My target architecture is ARMv8, so feel free to use any ARMv8 instructions that help reducing instruction counts.
C, assembly, or C + inline assembly solutions are all welcome.