
Is there any way to get the length of an ASCII string that is stored in a 16- or 32-byte buffer by loading it into an XMM or YMM register? Essentially I am looking for the index (in bits or bytes) of the first zero byte.

My goal is to avoid looping and branching. I am hoping that something exists in AVX or SSE along the lines of BSF (bit scan forward) but operating on bytes, not bits.

Maybe something like the following?

_my_constant_time_strlen:
    vpxor     ymm0, ymm0, ymm0       ; VEX-encoded vpxor needs three operands
    vpcmpeqb  ymm0, ymm0, [rdi]
    vpmovmskb eax, ymm0
    bsf       eax, eax
    ; string length is in eax?

    ; and rax, 31              ; editor's note: useless AND
    ret
  • Yup, that's how you'd write it if you know your string is at most 32 bytes long, and is aligned or known not to be at the end of a page followed by an unmapped page. (But you can use EAX instead of RAX, and you can use a memory source operand for `vpcmpeqb`). Also you can use `tzcnt`, which is faster on AMD than `bsf`. If you did need to loop for arbitrary-length strings, you'd use `test eax,eax` and only BSF after ending the loop. – Peter Cordes Jun 05 '19 at 03:09
  • Why are you trying to avoid `bsf`? If you want the integer result *in* an XMM register, I think your best bet is `vmovd xmm0, eax` to get the movmsk -> bsf result back into a vector reg. – Peter Cordes Jun 05 '19 at 03:10
  • @PeterCordes I'm searching for a zero byte, not zero bits. I'm basically trying to write my own C strlen function that will run in O(1) time instead of O(n) time. –  Jun 05 '19 at 03:23
  • I updated the code above to better reflect my goal. –  Jun 05 '19 at 03:34
  • see also [Implementing strcmp, strlen, and strstr using SSE 4.2 instructions](https://www.strchr.com/strcmp_and_strlen_using_sse_4.2) – phuclv Jun 05 '19 at 08:11
  • ... for a demonstration that SSE4.2 string instructions are slower than `pcmpeqb` / `pmovmskb` for strlen, @phuclv. See also [How much faster are SSE4.2 string instructions than SSE2 for memcmp?](//stackoverflow.com/q/46762813) – Peter Cordes Jun 05 '19 at 11:48

1 Answer


This is exactly how you implement strlen or memchr with AVX2.

vpmovmskb (Intrinsic: int _mm256_movemask_epi8(__m256i)) turns your compare vector into a bitmap of byte-compare results, which you search with bsf (or preferably tzcnt).

Your code already does exactly what you want. (For a fixed-size¹ buffer where you know there will be a match somewhere in the buffer; your and eax,31 doesn't help with that².)
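For reference, here's a sketch with the tweaks from the comments applied (tzcnt instead of bsf, and xmm-zeroing); the label name is made up, and it assumes a 32-byte buffer that's safe to read in full and known to contain a zero byte:

strlen32:
    vpxor     xmm0, xmm0, xmm0       ; zeroing the xmm also zeroes all of ymm0
    vpcmpeqb  ymm0, ymm0, [rdi]      ; 0xFF in each byte position that held 0
    vpmovmskb eax, ymm0              ; 32-bit mask: bit n set iff byte n was 0
    tzcnt     eax, eax               ; index of first zero byte = length
    vzeroupper
    ret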

You can hoist the vpxor-zeroing out of a loop if you avoid destroying it. Also, zero a register with the XMM version of vpxor, not YMM: writing an xmm register implicitly zeroes the rest of the ymm, and the narrow version saves a uop on AMD Zen 1 and on Alder Lake E-cores.

Shift/OR can assemble a 32- or 64-bit mask from multiple narrower pmovmskb/ps/pd results, letting you bit-scan up to 64 elements without branching.
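For example, a sketch of that trick over a 64-byte buffer (label made up; assumes all 64 bytes are safe to read):

strlen64:
    vpxor     xmm2, xmm2, xmm2
    vpcmpeqb  ymm0, ymm2, [rdi]      ; compare bytes 0..31 against 0
    vpcmpeqb  ymm1, ymm2, [rdi+32]   ; compare bytes 32..63 against 0
    vpmovmskb eax, ymm0              ; writing eax zero-extends into rax
    vpmovmskb edx, ymm1
    shl       rdx, 32
    or        rax, rdx               ; 64-bit mask: bit n set iff byte n was 0
    tzcnt     rax, rax               ; length, or 64 if no zero byte found
    vzeroupper
    ret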


The real work is only 3 total uops on Intel CPUs (Haswell/Skylake): vpcmpeqb with a memory source is 1 micro-fused uop because [rdi] is a non-indexed addressing mode. vpmovmskb is 1 uop with 2 to 3 cycle latency. tzcnt is 1 uop on Intel (bsf is also 1 uop there); on AMD Zen it's 2 uops, see below. On Intel, tzcnt or bsf has 3-cycle latency.

So on mainstream Intel, the total latency from the vector load data being ready to the length in RAX is 1 (vpcmpeqb) + 2 or 3 (movmsk) + 3 (tzcnt) = 6 or 7 cycles. This is branchless, just a data dependency, so that's pretty reasonable. This doesn't count the load-use latency or store-forwarding latency, whether the address or the data was on the critical path. And throughput is excellent, at 1 strlen per clock (bottlenecked on port 0 and/or port 1) on Intel.

On AMD Zen1, vpcmpeqb ymm is 2 uops with 2c latency. vpmovmskb ymm is 2 uops (for port FP2) with 3c latency. tzcnt is 2 uops with 2c latency. So total latency = 7 cycles, and throughput is 1 per 2 cycles bottlenecked on movemask throughput. (Ryzen lzcnt is 1 uop / 1c latency; presumably tzcnt is a bit-reverse + lzcnt or something like that.)

AMD Zen 2 and later widen the SIMD execution units to 256 bits, making vpcmpeqb ymm and vpmovmskb r32, ymm single-uop, but tzcnt is still 2 uops.

Numbers from https://agner.org/optimize/ and https://uops.info/


Other options: none as efficient as movemask

SSE4.2 pcmpistri can scan a vector for a zero byte, but it's relatively slow and can only look at 16 bytes at a time; there's no AVX2 version. It and pcmpistrm are each multiple uops, with one per 3 cycle throughput on Intel, one per 2 on AMD Zen. It's an interesting and powerful instruction, but overkill and slower for problems you can solve with vpcmpeqb / vpmovmskb. https://www.strchr.com/strcmp_and_strlen_using_sse_4.2 shows how to use it for strlen if you want to see how much slower it is (especially on CPUs more recent than Nehalem). More usefully, that page explains the different kinds of searches it can do (like any == any to search for a set of characters, using pairs of elements in one vector as ranges, or substring search like strstr) and how the immediate operand encodes them.

pcmpistri could be a useful instruction if you only have 16 bytes to look at; it takes care of the position = bitscan(mask) part. But if you have (potentially) more data to look at, it's not so great. And on Intel, all 3 of its uops run on the same port, so it's bad for throughput if it's in a loop without much surrounding code that out-of-order exec can overlap with.
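For completeness, a sketch of the single-chunk pcmpistri idiom (label made up; the 0x08 immediate is _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH, and comparing against an all-zero "needle" makes ECX the index of the first zero byte in the memory operand):

strlen16_sse42:                      ; assumes all 16 bytes at [rdi] are readable
    pxor      xmm0, xmm0             ; needle = empty string
    pcmpistri xmm0, [rdi], 0x08      ; unsigned bytes, equal-each aggregation
    mov       eax, ecx               ; index of first 0 byte, or 16 if none
    ret                              ; (ZF was set iff a 0 byte was found)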


Other than that, the only thing that comes to mind for a horizontal vector search/scan without movemasking to an integer first is phminposuw, which isn't what you want. (It could find a 16-bit zero element, since 0 is the minimum unsigned value.)

Or maybe vpand a vpcmpeqb result with a vector of 1,2,4,8,16,... powers of 2, then vpsadbw to do a horizontal add of the bytes. The lowest set bit in the result gives the position of the first 0 in that 8-byte chunk. But that only works up to 8 elements -> an 8-bit bitmask, because each power of 2 has to fit in a byte. So it's just an inefficient way to emulate vpmovmskb with the result in an XMM register.
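A sketch of that emulation (label and constant names made up; input assumed in xmm1):

section .rodata
align 16
pow2:   db 1,2,4,8,16,32,64,128, 1,2,4,8,16,32,64,128

section .text
emulate_movmskb:                     ; xmm0 result: an 8-bit bitmap per 64-bit half
    vpxor    xmm0, xmm0, xmm0
    vpcmpeqb xmm2, xmm1, xmm0        ; 0xFF where the input byte was 0
    vpand    xmm2, xmm2, [rel pow2]  ; one distinct power-of-2 bit per position
    vpsadbw  xmm0, xmm2, xmm0        ; horizontal add of each 8-byte half vs. zero
    ret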

Or you could do log2(vector_length) steps of shuffling and masking the next element so you end up with a vector where only the first 0 in the input has a 0xff element. Then vpand with a vector of 0,1,2,3,4,... and vpsadbw to hsum, and the only non-zero element will be the byte position. But this is much more expensive than vpcmpeqb / vpmovmskb / bsf / vmovd back to an XMM register, if you really want the result in an XMM register for some reason. (And the hsum would actually need vpsadbw + vextracti128 / vpaddb / vpshufd / vpaddb.)
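One way to spell out those log2 steps, using byte shifts rather than shuffles (a sketch; label and constant names made up, input in xmm1, and note that a result of 0 is ambiguous between "first zero at index 0" and "no zero byte at all"):

section .rodata
align 16
idx:    db 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15

section .text
isolate_first_zero:
    vpxor    xmm0, xmm0, xmm0
    vpcmpeqb xmm2, xmm1, xmm0        ; m: 0xFF where the input byte was 0
    vpslldq  xmm3, xmm2, 1           ; q = m shifted up by 1 byte
    vpslldq  xmm4, xmm3, 1
    vpor     xmm3, xmm3, xmm4        ; q[i] = any of m[i-1..i-2]
    vpslldq  xmm4, xmm3, 2
    vpor     xmm3, xmm3, xmm4        ; q[i] = any of m[i-1..i-4]
    vpslldq  xmm4, xmm3, 4
    vpor     xmm3, xmm3, xmm4        ; q[i] = any of m[i-1..i-8]
    vpslldq  xmm4, xmm3, 8
    vpor     xmm3, xmm3, xmm4        ; q[i] = any earlier match at all
    vpandn   xmm2, xmm3, xmm2        ; keep only the first 0xFF element of m
    vpand    xmm2, xmm2, [rel idx]   ; 0xFF & index = index
    vpsadbw  xmm2, xmm2, xmm0        ; hsum each half; at most one is non-zero
    vpshufd  xmm3, xmm2, 0xEE        ; bring the high qword down
    vpor     xmm0, xmm2, xmm3        ; low qword = index of the first zero byte
    ret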


Footnote 1: Longer strings

For longer strings, you'd test eax,eax / jz .keep_looping instead of actually bit-scanning each movemask result. bsf and tzcnt do set FLAGS according to whether the input was zero (ZF=1 or CF=1 respectively), but test+jcc can macro-fuse into a single test-and-branch uop. Front-end bandwidth (and maybe back-end execution ports) can already be a throughput problem for a small-but-not-tiny strlen (with data hot in L1d cache) if you're not careful.
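A minimal sketch of such a loop (label made up), assuming a 32-byte-aligned string so a whole-vector read can't cross into an unmapped page before the terminator is found:

strlen_loop:
    vpxor    xmm0, xmm0, xmm0
    mov      rax, rdi
.loop:
    vpcmpeqb ymm1, ymm0, [rax]
    vpmovmskb edx, ymm1
    add      rax, 32
    test     edx, edx                ; macro-fuses with jz into one uop
    jz       .loop
    sub      rax, rdi                ; bytes scanned, counting the final chunk
    tzcnt    edx, edx                ; position of the 0 within that chunk
    lea      rax, [rax + rdx - 32]   ; length
    vzeroupper
    ret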

The main loop of memchr or strlen for long strings might vpor multiple compare results from a cache line or two to amortize the movemask/branch, then sort out where it came from once outside the loop.
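For instance, a sketch of an inner loop that checks 128 bytes per branch (label made up; alignment handling and cleanup omitted):

find_zero_128:                       ; rdi = aligned pointer; a 0 byte is assumed
    vpxor    xmm0, xmm0, xmm0        ; to exist somewhere ahead
    mov      rax, rdi
.loop:
    vpcmpeqb ymm1, ymm0, [rax]
    vpcmpeqb ymm2, ymm0, [rax+32]
    vpcmpeqb ymm3, ymm0, [rax+64]
    vpcmpeqb ymm4, ymm0, [rax+96]
    vpor     ymm5, ymm1, ymm2
    vpor     ymm6, ymm3, ymm4
    vpor     ymm5, ymm5, ymm6        ; non-zero iff any 0 byte in these 128 bytes
    vpmovmskb edx, ymm5
    add      rax, 128
    test     edx, edx
    jz       .loop
    ; the 0 byte is in [rax-128, rax): vpmovmskb ymm1..ymm4 individually
    ; outside the loop to sort out which vector and which byte it was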

Or for strlen specifically, vpminub two vectors of raw input data before vpcmpeqb, to get a zero byte iff there's a zero byte in either input vector. (After leaving the loop, that strategy needs to re-check the input data, not just the compare vectors.) glibc's strlen does this; see also Is it safe to read past the end of a buffer within the same page on x86 and x64? for links and another issue to watch out for if using this on variable-length data that might be near the end of a page.
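A sketch of that inner loop, with the post-loop re-check of the raw data spelled out (label made up; assumes a 64-byte-aligned pointer in rdi):

strlen_minub:
    vpxor    xmm0, xmm0, xmm0
    mov      rax, rdi
.loop:
    vmovdqa  ymm1, [rax]
    vpminub  ymm1, ymm1, [rax+32]    ; min byte is 0 iff either input byte was 0
    vpcmpeqb ymm1, ymm1, ymm0
    vpmovmskb edx, ymm1
    add      rax, 64
    test     edx, edx
    jz       .loop
    vpcmpeqb ymm1, ymm0, [rax-64]    ; re-check the raw data, not the min result
    vpmovmskb ecx, ymm1
    vpcmpeqb ymm1, ymm0, [rax-32]
    vpmovmskb edx, ymm1
    shl      rdx, 32
    or       rcx, rdx                ; 64-bit mask over the final 64 bytes
    tzcnt    rcx, rcx
    lea      rax, [rax + rcx - 64]
    sub      rax, rdi                ; length
    vzeroupper
    ret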


Footnote 2: and eax,31 and strlen >= 32

Your bsf eax,eax will leave EAX unmodified if EAX was zero (documented by AMD, and implemented that way by both vendors, although Intel documents the integer result as "undefined"). Otherwise it will write EAX with a value from 0..31, so either way the AND was fully redundant.

I think all AVX2 CPUs also support BMI1 tzcnt, which will give you 32 for an input of 0 (no 0 bytes found). It's also better for performance: bsf is slower on AMD.

CPUs with AVX2:

  • AMD since Excavator (already had tzcnt since Piledriver)
  • Intel since Haswell (BMI1 with tzcnt was also new)
  • VIA / ZHAOXIN since KaiXian ZX-C+ which has BMI1 / tzcnt.

On a CPU that doesn't support BMI1, tzcnt executes as bsf (its encoding is rep bsf), which is fine for non-zero inputs (same integer result). So it's a drop-in performance upgrade in code that has already ruled out an all-zero mask, or that doesn't care what happens in that case.

bsf sets ZF=1 if the input was zero, unlike tzcnt, which sets ZF based on the output like a normal instruction and sets CF=1 for an all-zero input. So if you consume the FLAGS result, it matters which instruction you use (and which one the CPU actually runs it as).

If you're only looking at one fixed-size chunk, the FLAGS results of bsf or tzcnt may be useful to detect a zero input if you want to cmov something else on it. test/jcc before bit-scan is 1 uop (and means you skip it in the (rare?) case of not found), same as a stand-alone jcc after. Recent CPUs macro-fuse test/jz. But jcc after saves some machine-code size. And cmov won't fuse, so if you're consuming the FLAGS result with it, a test before tzcnt would cost extra.
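For example, a sketch using tzcnt's CF output with cmov to substitute a not-found sentinel (the -1 convention and label are made up):

find_zero32:                         ; assumes all 32 bytes at [rdi] are readable
    vpxor     xmm0, xmm0, xmm0
    vpcmpeqb  ymm0, ymm0, [rdi]
    vpmovmskb eax, ymm0
    mov       edx, -1                ; not-found sentinel
    tzcnt     eax, eax               ; sets CF=1 if the mask was all-zero
    cmovc     eax, edx               ; eax = index of first 0 byte, or -1
    vzeroupper
    ret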

Peter Cordes