
Suppose I have two `uint16_t[4]` arrays, `a` and `b`. Each integer in these arrays is in the range [0, 16383], so bits 14 and 15 are never set. I then have some code to find the minimum and maximum of `a[i]` and `b[i]` for each `i`:

uint16_t min[4], max[4];
for (int i = 0; i < 4; i++) {
    if (a[i] < b[i]) {
        min[i] = a[i];
        max[i] = b[i];
    } else {
        min[i] = b[i];
        max[i] = a[i];
    }
}

Suppose for some reason I can't/won't use SIMD, but I'd still like to compute this as fast as possible on a 64-bit platform. A natural solution is the SIMD-within-a-register (SWAR) paradigm: pack the arrays into 64-bit registers and process all 4 lanes in a single iteration, rather than over 4 iterations with 16-bit arithmetic.
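
For concreteness, here is how I'd move the data in and out of a 64-bit register (a minimal sketch; the `pack4`/`unpack4` helper names are mine and not part of any API). On little-endian AArch64, each `memcpy` should compile to a single 64-bit load or store:

#include <stdint.h>
#include <string.h>

/* Hypothetical helpers: move four uint16_t lanes into and out of one
   uint64_t. memcpy sidesteps strict-aliasing issues and compiles to a
   single 64-bit load/store on little-endian AArch64. */
static inline uint64_t pack4(const uint16_t v[4]) {
    uint64_t x;
    memcpy(&x, v, sizeof x);
    return x;
}

static inline void unpack4(uint64_t x, uint16_t v[4]) {
    memcpy(v, &x, sizeof x);
}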

What bit-twiddling hacks could be used to implement either operation (min or max), or ideally both, using the SWAR paradigm, so that the resulting code is faster than the loop above? My target architecture is ARMv8, so feel free to use any ARMv8 instructions that help reduce the instruction count.

C, assembly, or C + inline assembly solutions are all welcome.

swineone
  • Just curious, but why would you not use SIMD here? The SIMD instruction set is a required feature in ARMv8, and it has compare and blend instructions that are exactly what you need here. – Nate Eldredge Jan 18 '23 at 02:33
  • I'm trying to see if I can get some extra speed from integer+SIMD rather than SIMD alone. Would be sad to leave the integer units idle. – swineone Jan 18 '23 at 02:34
  • I would strongly suspect that the SIMD by itself is fast enough that you will already be memory bound. – Nate Eldredge Jan 18 '23 at 02:36
  • I guess I forgot to mention, but this is just part of a larger algorithm, whose working set is small enough to fit in the L1 cache (but the algorithm iterates over the same set many, many times). – swineone Jan 18 '23 at 02:38
  • I was thinking along these lines as well, but thought I should set bits 15, 31, 47 and 63 of `a` at the start. This prevents the borrow bits of `ai - bi` for each `i` = 0 to 3 from crossing over 16-bit lanes if `ai < bi`. Then these bits would remain 1 if `ai >= bi` and 0 if `ai < bi`, i.e. a `NOT` of the sign bit of `ai - bi`. – swineone Jan 18 '23 at 02:48
  • @NateEldredge: I know some 32-bit ARM CPUs also have high latency or a full pipeline stall when moving data from a vector reg to integer to branch on it, unlike x86; I think that's less common with AArch64 but maybe some are still like that? Or some ARM64 chips might have a fairly wide front-end, but maybe only 2 or 3 SIMD execution pipes. Probably hard to beat SIMD packed min/max on 8-byte vectors for most use-cases, though. – Peter Cordes Jan 18 '23 at 03:03
  • @PeterCordes: It does still affect at least some ARM64. On Cortex-A72, move from SIMD to integer is 5 cycles latency and uses the load pipeline. If moving 8 or 16 bits, you also need zero or sign extension, which adds 1 cycle latency and uses one of the two integer ALU pipelines. (Integer to SIMD is worse, 8 cycles latency and you need the load pipeline plus one FP/SIMD arithmetic pipeline.) However, I'm not sure how that ties in with this code, which shouldn't need to branch at all. – Nate Eldredge Jan 18 '23 at 03:16
  • @PeterCordes: Oh right, there's SIMD min and max. Even better than cmp/blend. – Nate Eldredge Jan 18 '23 at 03:16
  • @NateEldredge: IDK what the surrounding code might do with the result. It's plausible that you might want to branch on the max being non-zero, or on not having any negative numbers (`tst` with a bit-pattern, which ARM64 can encode as an immediate). – Peter Cordes Jan 18 '23 at 03:50
  • Re: performance for vector to int: high latency sure, but fully stalling integer execution is another matter for an OoO exec core like A72. I was thinking the worst case was out-of-order cores where SIMD->int had to drain the ROB or something since SIMD and integer execution weren't tightly coupled to track dependencies between execution domains. But I never read any details so I might just be misinterpreting what people (maybe mostly Jake Lee) have said about "stalling" on SIMD->int. If anyone is considering using SWAR to avoid such a penalty, worth investigating the details of it. – Peter Cordes Jan 18 '23 at 03:51
  • I have preliminarily removed my answer as it is incorrect and tricky to complete. – fuz Jan 18 '23 at 13:38
  • @fuz I kept a copy of it and am wondering what is wrong with it. I didn't run it, just tried to follow it in my head, and couldn't find anything wrong. – swineone Jan 18 '23 at 13:40
  • @swineone Managed to fix it! – fuz Jan 18 '23 at 13:52

1 Answer


You could use code like this, though it's really a lot longer than just doing it with SIMD:

orr     x2, x0, #0x8000800080008000     // x2 = 0x8000 | x0
sub     x2, x2, x1                      // x2 = (0x8000 | x0) - x1
and     x2, x2, #0x8000800080008000     // x2 = x0 < x1 ? 0x0000 : 0x8000
mov     x3, #0x7fff7fff7fff7fff
add     x2, x3, x2, lsr #15             // x2 = x0 < x1 ? 0x7fff : 0x8000
eor     x4, x0, x1                      // x4 = x0 ^ x1
and     x3, x4, x2                      // x3 = x0 < x1 ? x0 ^ x1 : 0x0000
eor     x4, x1, x3                      // x4 = x0 < x1 ? x0 : x1 (minimum)
eor     x3, x0, x3                      // x3 = x0 < x1 ? x1 : x0 (maximum)

The critical path of this algorithm has 6 instructions. The instructions

mov     x3, #0x7fff7fff7fff7fff
eor     x4, x0, x1                      // x4 = x0 ^ x1

are not on the critical path. If executed in a loop, the constant load can likely be hoisted out. The last two instructions can be evaluated independently, yielding minimum and maximum with the same latency.
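
For reference, here is a C rendering of the same sequence (my untested sketch, following the assembly comments line by line; as far as I can tell, it only requires bit 15 of every input lane to be clear, which the question's [0, 16383] range guarantees):

#include <stdint.h>

#define HI16 0x8000800080008000ULL  /* sign bit of each 16-bit lane */
#define LO15 0x7fff7fff7fff7fffULL  /* low 15 bits of each lane */

/* SWAR min/max over four packed 16-bit lanes. */
static inline void swar_minmax(uint64_t a, uint64_t b,
                               uint64_t *min, uint64_t *max)
{
    uint64_t m = ((a | HI16) - b) & HI16;  /* 0x8000 per lane iff a >= b */
    m = LO15 + (m >> 15);                  /* 0x8000 iff a >= b, else 0x7fff */
    uint64_t d = (a ^ b) & m;              /* a ^ b iff a < b, else 0 */
    *min = b ^ d;                          /* a < b ? a : b */
    *max = a ^ d;                          /* a < b ? b : a */
}

The `d = (a ^ b) & m` step works because bit 15 of `a ^ b` is always clear, so the 0x7fff mask keeps the whole XOR pattern while the 0x8000 mask clears it.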

fuz
  • Nice solution! The BSL NEON instruction would really help here with the last 4 instructions. I'll try to think of something to emulate it. Although we need 2 BSLs (one for each of min and max), so if BSL existed for scalars, you could replace the last 4 instructions with 2. But with emulation, one could at best break even if it could be emulated with 2 instructions, so I don't think this will help. – swineone Jan 18 '23 at 13:01
  • @swineone It's kind of unfortunate that no scalar version of `BSL` exists. This may be related to the general absence of 3-input logic operations on AArch64. – fuz Jan 19 '23 at 17:07