2

Let's say I want to reverse the byte order of a very large byte array. I can do this the slow way using the main registers but I would like to speed it up using the XMM or YMM registers.

Is there a way to reverse the byte order in an XMM or YMM register?

  • https://dev.to/wunk/fast-array-reversal-with-simd-j3p? – GSerg Jun 01 '19 at 14:59
  • The link in @GSerg 's comment is great, but note that for a very large array, which doesn't fit in the L1/L2/L3 cache, the performance is likely limited by the maximum bandwidth between DRAM and the core. In that case you won't benefit much from using SSE/AVX instructions. – wim Jun 01 '19 at 16:20
  • @wim You might still benefit, because with scalar instructions it is likely that performance is not actually memory bound. – fuz Jun 01 '19 at 16:37
  • @fuz: I haven't looked into the details, but I guess that with some unrolling it might be possible to do one load of 64 bits, `bswap` it, and store it back to memory, per cpu cycle. In that case we need 16 bytes of bandwidth per cpu cycle. With a 4 GHz cpu that would be 64 GB/s bandwidth. The single core DRAM bandwidth of most of today's CPUs is likely lower. So I guess the `bswap` solution might be able to saturate the single core DRAM bandwidth. Therefore I wouldn't expect too much performance improvement from the SSE/AVX solutions, unless the data fits in L1/L2/L3 cache. – wim Jun 01 '19 at 18:37
  • Are you trying to reverse the entire array (as assumed by other commenters), or are you trying to change endianness of a sequence of words (or dwords or qwords)? – Ruud Helderman Jun 01 '19 at 20:47
  • @wim: you'll get the best throughput with AVX2 `vpshufb`; fewer uops for the same amount of data gives better "lookahead" for getting closer to maxing out single-threaded bandwidth. And `bswap r64` is 2 uops on Skylake, so loop overhead will make it hard to even run at 8 bytes per cycle. – Peter Cordes Jun 01 '19 at 22:44
  • @RuudHelderman My actual plan is to reverse it but also do a computation on it. My preliminary question is whether XMM/YMM can be used to reverse bytes at all, and then secondarily I need to find out what byte-level operations can be done. –  Jun 03 '19 at 14:07
  • Certainly, AVX2 `vpshufb` is the best option here. In my previous comment I just wanted to point out that I wouldn't call the `bswap` solution slow, like the OP did, although it isn't great either, I admit. Indeed 2 uops for `bswap` on Skylake is disappointing. Particularly, considering that on AMD Ryzen `bswap` has a throughput of 4 per cycle. – wim Jun 04 '19 at 22:17

2 Answers

7

Yes, use SSSE3 _mm_shuffle_epi8 or AVX2 _mm256_shuffle_epi8 to shuffle bytes within 16-byte AVX2 "lanes". Depending on the shuffle control vector, you can swap pairs of bytes, reverse 4-byte units, or reverse 8-byte units. Or reverse all 16 bytes.

But vpshufb isn't lane-crossing, so you can't reverse 32 bytes with one instruction until AVX512VBMI vpermb. vpshufb ymm does 2x 16-byte shuffles in the two 128-bit lanes of the YMM vector.
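
For example, these are the shuffle-control vectors for each of those element sizes (a minimal sketch, assuming SSSE3; the helper names are made up for illustration):

#include <tmmintrin.h>  // SSSE3

// Each control byte selects which source byte goes to that output position.
// _mm_setr_epi8 lists bytes from lowest to highest.

__m128i swap_u16(__m128i v) {    // swap pairs of bytes (endian-swap 16-bit elements)
    return _mm_shuffle_epi8(v, _mm_setr_epi8(1,0, 3,2, 5,4, 7,6, 9,8, 11,10, 13,12, 15,14));
}

__m128i swap_u32(__m128i v) {    // bswap each 32-bit element
    return _mm_shuffle_epi8(v, _mm_setr_epi8(3,2,1,0, 7,6,5,4, 11,10,9,8, 15,14,13,12));
}

__m128i swap_u64(__m128i v) {    // bswap each 64-bit element
    return _mm_shuffle_epi8(v, _mm_setr_epi8(7,6,5,4,3,2,1,0, 15,14,13,12,11,10,9,8));
}

__m128i reverse_all16(__m128i v) {    // reverse all 16 bytes
    return _mm_shuffle_epi8(v, _mm_setr_epi8(15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0));
}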

So if you're byte-reversing an entire array, rather than the endianness / byte-order of individual elements in an array, you have 3 options:

  • Stick to 128-bit vectors (simple and portable, and probably not slower on current CPUs). And only needs 16-byte alignment for best performance.
  • Load with vmovdqu / vinserti128, then vpshufb, then a 32-byte store (see the sketch after this list). (Or do 32-byte loads and split 16-byte stores, but that's probably not as good). Vectorize random init and print for BigInt with decimal digit array, with AVX2? includes a cache-blocked byte-array reverse into a tmp buffer to feed fwrite in 8kiB chunks.
  • Use vpermq to lane-swap before or after vpshufb (not great on AMD, and bottlenecks on 1 per clock shuffle throughput on current Intel). But potentially very good on Ice Lake (2 shuffle ports)
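
A minimal sketch of that second option (assuming AVX2, a length that's a multiple of 32, and distinct source/destination buffers; `reverse_bytes_avx2` is just an illustrative name, and alignment/remainder handling is omitted):

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// Reverse an entire byte array, 32 output bytes per iteration:
// load the two 16-byte halves of each source chunk into "swapped" lanes,
// then one in-lane vpshufb reverses the bytes, then one 32-byte store.
void reverse_bytes_avx2(uint8_t *dst, const uint8_t *src, size_t len)
{
    const __m256i rev_in_lane = _mm256_setr_epi8(
        15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0,
        15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);

    for (size_t j = 0; j < len; j += 32) {
        const uint8_t *s = src + len - 32 - j;    // 32-byte source chunk, walking backwards
        // Put the chunk's upper half in the low lane and its lower half in the
        // high lane, so no lane-crossing shuffle is needed afterwards.
        __m256i v = _mm256_castsi128_si256(_mm_loadu_si128((const __m128i *)(s + 16)));
        v = _mm256_inserti128_si256(v, _mm_loadu_si128((const __m128i *)s), 1);
        v = _mm256_shuffle_epi8(v, rev_in_lane);  // reverse bytes within each 16-byte lane
        _mm256_storeu_si256((__m256i *)(dst + j), v);
    }
}

Compilers typically fold the cast + insert pair into a `vmovdqu` xmm load plus a `vinserti128` with a memory source operand, which is the load pattern described in the bullet above.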

vpshufb is a single uop instruction on Intel, or 2 on AMD, and processes 32 bytes of data at once.

For very large inputs, it's probably worth it to reach a 32 or 64-byte alignment boundary before your vectorized loop, so none of the loads/stores cross cache-line boundaries. (For small inputs the minor benefit isn't worth the extra prologue/epilogue code and branching.)


But potentially even better is to only swap a 16kiB chunk before you use it, so it's still hot in L1d cache when the next step reads it. This is called cache blocking. Or maybe use 128kiB chunks to block for L2 cache size.
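
As a sketch of that pattern (assuming the hypothetical `reverse_bytes_avx2` from the earlier sketch, a length that's a multiple of the block size, and `fwrite` as the stand-in consumer, like the BigInt answer linked above):

#include <stdio.h>
#include <stdint.h>
#include <stddef.h>

// Sketched earlier in this answer.
void reverse_bytes_avx2(uint8_t *dst, const uint8_t *src, size_t len);

// Cache-blocked: reverse one small chunk into a temporary buffer and hand it
// to the consumer while it's still hot in L1d.
void write_reversed(FILE *out, const uint8_t *src, size_t len)
{
    enum { BLOCK = 16 * 1024 };
    uint8_t tmp[BLOCK];
    // The reversed output starts with the *last* chunk of the input.
    for (size_t off = len; off != 0; off -= BLOCK) {
        reverse_bytes_avx2(tmp, src + off - BLOCK, BLOCK);
        fwrite(tmp, 1, BLOCK, out);    // consumes tmp while it's hot in cache
    }
}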

You might swap in chunks as you read the data from a file. e.g. do read() system calls in chunks of 64k or 128k and swap the result while it's still hot in cache after the kernel copied data from the pagecache into your user-space buffer. Or use mmap to memory-map a file, and run a copy-and-swap loop from that. (Or for a private mapping, in-place swap; but that will trigger copy-on-write anyway so not much benefit. And file-backed mmap on Linux can't use anonymous hugepages).
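
The read() version looks roughly like this (a sketch; it again assumes the hypothetical `reverse_bytes_avx2`, a total size that's a multiple of the chunk size, and no short reads, which real code must handle by looping on the return value):

#include <unistd.h>
#include <stdint.h>
#include <stddef.h>

// Sketched earlier in this answer.
void reverse_bytes_avx2(uint8_t *dst, const uint8_t *src, size_t len);

// Swap each chunk right after the kernel copies it into the user buffer,
// while it's still hot in cache.
void file_reverse_into(int fd, uint8_t *dst, size_t total)
{
    enum { CHUNK = 64 * 1024 };
    static uint8_t buf[CHUNK];
    for (size_t off = 0; off < total; off += CHUNK) {
        ssize_t got = read(fd, buf, CHUNK);
        if (got != CHUNK) return;    // sketch only: bail on short read / EOF / error
        // The chunk at file offset [off, off+CHUNK) lands, reversed, at the
        // mirrored position from the end of dst.
        reverse_bytes_avx2(dst + total - off - CHUNK, buf, CHUNK);
    }
}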

Another option is to simply swap on the fly if you only read the data a couple times; if the later uses are still memory bound, or have room for a shuffle uop without bottlenecking, it probably won't slow them down to shuffle on the fly.
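
For instance, a consumer can fold the byte-swap into its own loop (a sketch assuming AVX2 and an array of big-endian 64-bit integers; the horizontal sum is just a stand-in for whatever real work the pass does):

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// Sum big-endian 64-bit integers, byte-swapping on the fly with one in-lane
// vpshufb per 32 bytes instead of a separate swap pass over the array.
// Assumes count (number of 64-bit elements) is a multiple of 4.
uint64_t sum_be64(const uint8_t *src, size_t count)
{
    const __m256i bswap64 = _mm256_setr_epi8(
        7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8,
        7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8);
    __m256i acc = _mm256_setzero_si256();
    for (size_t i = 0; i < count; i += 4) {
        __m256i v = _mm256_loadu_si256((const __m256i *)(src + i * 8));
        v = _mm256_shuffle_epi8(v, bswap64);  // bswap each qword: no lane-crossing needed
        acc = _mm256_add_epi64(acc, v);       // the "real work" for this example
    }
    // Horizontal sum of the four 64-bit accumulator lanes.
    __m128i lo = _mm256_castsi256_si128(acc);
    __m128i hi = _mm256_extracti128_si256(acc, 1);
    __m128i s  = _mm_add_epi64(lo, hi);
    s = _mm_add_epi64(s, _mm_unpackhi_epi64(s, s));
    return (uint64_t)_mm_cvtsi128_si64(s);
}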

A pass that touches all your data and only byte-swaps it has very poor computational intensity; you want to be doing more things with your data while it's in registers, or at least while it's hot in cache. But if you only byte-swap once and then read the data many times, or in a random access pattern, or from another language like Python or JavaScript that can't efficiently swap on the fly, then sure do a swap pass.

Or a swap pass is helpful if you will make multiple passes over it that aren't memory-bound, and an extra shuffle would slow down each later pass. In that case you do want to cache-block the swapping so the later pass's input is hot in cache.


The scalar option, bswap, is limited to at best 8 bytes per clock cycle, and every 8 bytes needs a separate load and store instruction. (movbe to load from memory with byte-swapping saves an instruction, but on mainstream CPUs doesn't micro-fuse into a single load+swap uop. On Silvermont it's single-uop, though.) Also, Intel bswap r64 is 2 uops, so it's not great.
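
For reference, the scalar version looks something like this (a sketch using the GCC/Clang `__builtin_bswap64`; assumes the length is a multiple of 8 and non-overlapping buffers):

#include <stdint.h>
#include <stddef.h>
#include <string.h>

// The scalar baseline: 8 bytes per bswap, one load and one store each.
void reverse_bytes_scalar(uint8_t *dst, const uint8_t *src, size_t len)
{
    for (size_t i = 0; i < len; i += 8) {
        uint64_t q;
        memcpy(&q, src + len - 8 - i, 8);   // unaligned-safe load
        q = __builtin_bswap64(q);           // reverse the 8 bytes
        memcpy(dst + i, &q, 8);
    }
}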

This might saturate single-threaded memory bandwidth on modern CPUs with some loop unrolling, but SIMD with fewer total uops to process the same data lets out-of-order execution "see" farther ahead and start processing TLB misses for upcoming pages sooner, for example. HW data prefetch and TLB prefetch do help a lot, but it's typically at least slightly better to use wider loads/stores for memcpy.

(vpshufb is cheap enough that this will still basically perform like memcpy. Or better if rewriting in place.)

And of course if you ever have any cache hits, even just L3 cache, SIMD will really shine.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • The documentation on vpshufb seems to suggest it needs to be followed with a permute instruction and possibly an OR instruction. It's pretty confusing stuff. –  Jun 03 '19 at 14:03
  • @derik: are you trying to byte-reverse the entire array? Normally "byte order" means endianness, so I thought you were just reversing 2, 4, or 8-byte chunks of it. But yes if you want to [`std::reverse`](https://en.cppreference.com/w/cpp/algorithm/reverse) a byte array with AVX2 vpshufb, it does two separate byte shuffles within the 16-byte "lanes". You might want to store with `vmovdqu [rdi+16]` + `vextracti128 [rdi+0], ymm0, 1`. Or do your loads with `vmovdqu` + `vinserti128`, which is probably better. – Peter Cordes Jun 03 '19 at 22:18
  • If you don't care about AMD CPUs at all, another option is `vperm2i128` or `vpermq` to lane-swap a YMM register so you're doing 256b loads+stores. But lane-crossing 256-bit shuffles are extra expensive on AMD before Ryzen2. And an extra load uop as part of `vinserti128` probably won't hurt throughput on Intel; you'll still bottleneck on shuffle throughput until IceLake, so 1 shuffle per clock = 1x 256b store per 2 clocks. – Peter Cordes Jun 03 '19 at 22:19
  • A bit off topic, but do you know the fastest way to reverse byte order in a ```XMM```/```YMM```/```ZMM``` just in ```epi16``` blocks? More specifically I am wondering if there is anything faster than (with a ```YMM``` example) ```_mm256_shuffle_epi8(vec, _mm256_set_epi8(30, 31, 28, 29,.... 0, 1))```. (Assuming the constant shuffle values have to be read in from memory, i.e. the 2 uop case). The only other reasonable method I can think of is 2x shuffle + or – Noah Jan 29 '21 at 02:00
  • @Noah: `vpshufb` can do arbitrary byte shuffles within each 16-byte lane, including swapping pairs of bytes. AVX512F introduced SIMD rotates, but only down to dword granularity so there's no `vprolw`. But it sounds like you actually want to reverse all the words across a whole vector register; an in-lane shuffle can't do that. For that you want AVX-512 `vpermw`. (Or on IceLake, `vpermb` because they made that single uop but `vpermw` is still 2 :/) Or with AVX2, probably `vpshufb` + (`vpermq` or `vperm2i128`) – Peter Cordes Jan 29 '21 at 02:08
  • no I'm basically looking for the fastest way to achieve ```vprolw```. I.e just swapping bytes in ```epi16```. Only ways I can think of are ```vpshufb``` or 2 shifts + or – Noah Jan 29 '21 at 02:17
  • @Noah: Ok, then yeah you of course just want `vpshufb`. It needs a constant, but you can reuse it in a loop. And it has excellent throughput overall (1/clock worst case, or 2/clock on Ice Lake and Zen2 for the YMM version.) Keep in mind that it only cares about the low 4 bits of the index, so you can broadcast-load the same 16-byte in-lane pattern. (Unless C compilers constant-propagate and stupidly make a wider constant.) You *can* write stuff like `(30,31, ...` but that can be as misleading as it is helpful. – Peter Cordes Jan 29 '21 at 02:28
  • Ahh sorry for clarity! Thank you! – Noah Jan 29 '21 at 02:33
  • So in my case I am able to avoid all but 1 in-memory ```vpshufb``` mask with ```vprord``` (by manipulating the mask) but the compiler (clang especially) seems to really like loading constants from memory. [Is this not a missed optimization](https://godbolt.org/z/as1vzs)? As far as I can tell 64 bytes of memory is all that's needed (could do 32 maybe if you build the permute mask) but in general I can't see how the loads would beat the ```vprord```s that are off the critical path and not stealing ports from the critical path. (Both compilers fall over if you put it in inline asm but it's funny to see xD) – Noah Jan 29 '21 at 06:02
  • @Noah: Some constant-propagation makes sense if the surrounding code (or other hyper-thread) is high-throughput. Remember that OoO exec to overlap this with earlier or later code is going to happen. But yeah, compilers are annoyingly prone to overdoing it with constant propagation. I haven't looked at your case in a ton of detail, but it's likely that generating at least some of the vectors from other loads would be good. And it would be nice to have more control sometimes without having to do something horrible like `static const volatile uint8_t shuf[]` and loading from that. – Peter Cordes Jan 29 '21 at 06:13
  • I guess my thinking is either A) OoO exec gets to the preparation for the mask early, in which case it doesn't matter if it's a ```vmovdqa``` or ```vprord``` (assuming no port contention) so you would rather use the lower memory footprint option, or B) OoO exec doesn't prepare the mask early, in which case both are +1 uop on the critical path, so again memory footprint should tie-break. Generally outside of microbenchmarks, where there is context switching and memory pressure, some non-trivial % miss L1 cache (or kick something out of L1). Do you think the reasoning is flawed? – Noah Jan 29 '21 at 07:14
  • @Noah: *assuming no port contention* - there's the rub. If the preceding code is also heavy SIMD computation without a lot of latency bottlenecks, it may nearly saturate ports 0 and 5 but leave loads. (Especially for 512-bit uops which shut down the vector ALUs on port 1). Also note that `vpshufb xyz, xyz, [mem]` can micro-fuse the load, sneaking it through the front-end for free. If you're doing it repeatedly in a loop, by hand in asm I'd probably use 1 load + `vprord` for constant setup, but in one pass for a repeatedly-called function that doesn't inline I'd consider memory. – Peter Cordes Jan 29 '21 at 07:39
  • @Noah: I mean, you certainly do have a point, though, and compiler heuristics are probably tuned too much towards microbenchmarks (bloating .rodata by constant-propagating too much, especially as vectors become larger and larger with AVX-512). It's possible to play devil's advocate, though. CPUs do have significant memory-level parallelism and can eat an occasional L2 hit, or maybe even L2 miss / L3 hit if seen early enough. Spending fewer total uops lets more uops be in flight in the ROB, increasing the power of OoO exec. – Peter Cordes Jan 29 '21 at 07:42
  • > assuming no port contention - there's the rub indeed. But I generally think with a bit of care you can pair it with the appropriate operations. In this case there is a ```pshufb``` which runs on p1/p5, is on the critical path, and all following instructions depend on it, so a ```vprord``` following that will get p0/p1 the same cycle (unless frontend bound) w/o any overhead. On Skylake it would be easier because ```vprord``` and ```pshufb``` don't contend for any ports. But if you're not manually ordering the instructions I see what you're saying. – Noah Jan 29 '21 at 16:28
  • Hoping clang will eventually integrate ```llvm-mca``` into its optimizer to do it for us. – Noah Jan 29 '21 at 16:32
1

I can't compete with the legendary Peter Cordes... I just want to show a C implementation.

Here are examples of reversing byte order using C intrinsics (they can be used to byte-reverse an entire array).

There are 3 code samples.

  1. Using SSE2 instruction set.
  2. Using SSSE3 instruction set.
  3. Using AVX2 instruction set.

//Initialize XMM register with uint8 values 0 to 15 (for testing):
__m128i a_F_E_D_C_B_A_9_8_7_6_5_4_3_2_1_0 = _mm_set_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);


//SSE2:
//Advantage: No need to build a shuffle mask (efficient for very short loops).
//////////////////////////////////////////////////////////////////////////
//Reverse order of uint32:
__m128i a_3_2_1_0_7_6_5_4_B_A_9_8_F_E_D_C = _mm_shuffle_epi32(a_F_E_D_C_B_A_9_8_7_6_5_4_3_2_1_0, _MM_SHUFFLE(0, 1, 2, 3));

//Swap pairs of uint16:
__m128i a_1_0_3_2_5_4_7_6_9_8_B_A_D_C_F_E = _mm_shufflehi_epi16(_mm_shufflelo_epi16(a_3_2_1_0_7_6_5_4_B_A_9_8_F_E_D_C, _MM_SHUFFLE(2, 3, 0, 1)), _MM_SHUFFLE(2, 3, 0, 1));

//Swap pairs of uint8:
__m128i a_0_1_2_3_4_5_6_7_8_9_A_B_C_D_E_F = _mm_or_si128(_mm_slli_epi16(a_1_0_3_2_5_4_7_6_9_8_B_A_D_C_F_E, 8), _mm_srli_epi16(a_1_0_3_2_5_4_7_6_9_8_B_A_D_C_F_E, 8));
//////////////////////////////////////////////////////////////////////////


//SSSE3: 
//Advantage: Does not require AVX2 support
//////////////////////////////////////////////////////////////////////////
//Build shuffle mask
const __m128i shuffle_mask = _mm_set_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);

a_0_1_2_3_4_5_6_7_8_9_A_B_C_D_E_F = _mm_shuffle_epi8(a_F_E_D_C_B_A_9_8_7_6_5_4_3_2_1_0, shuffle_mask);
//////////////////////////////////////////////////////////////////////////


//AVX2: 
//Advantage: Potentially faster than SSSE3
//////////////////////////////////////////////////////////////////////////
//Initialize YMM register with uint8 values 0 to 31 (for testing):
__m256i a__31_to_0 = _mm256_set_epi8(31, 30, 29, 28, 27, 26, 25, 24, 23, 22, 21, 20, 19, 18, 17, 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);

//Build shuffle mask
const __m256i shuffle_mask2 = _mm256_set_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);

//Reverse the byte order of the upper lane and lower lane of the YMM register.
__m256i a__16_to_31__0_to_15 = _mm256_shuffle_epi8(a__31_to_0, shuffle_mask2);

//Swap upper and lower lane of YMM register
__m256i a__0_to_31 = _mm256_permute4x64_epi64(a__16_to_31__0_to_15, _MM_SHUFFLE(1, 0, 3, 2));
//////////////////////////////////////////////////////////////////////////
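
If the goal is to reverse a whole array rather than a single register, the SSSE3 shuffle above can be dropped into a simple loop (a sketch; it assumes the length is a multiple of 16 and non-overlapping buffers, and the function name is just illustrative):

#include <tmmintrin.h>
#include <stdint.h>
#include <stddef.h>

//Read 16-byte blocks from the end of src, reverse each block, store at the front of dst.
void reverse_array_ssse3(uint8_t *dst, const uint8_t *src, size_t len)
{
    const __m128i rev = _mm_set_epi8(0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15);
    for (size_t i = 0; i < len; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + len - 16 - i));
        v = _mm_shuffle_epi8(v, rev);              //Reverse the 16 bytes of the block
        _mm_storeu_si128((__m128i *)(dst + i), v); //dst[i] = src[len-1-i], etc.
    }
}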
Rotem
  • 30,366
  • 4
  • 32
  • 65
  • oops, I got mixed up between `__m256i a__31_to_0` and the shuffle-control vector. I thought you were using `_mm_setr_epi8(31, 30, ...)` for the shuffle-control, but that's the data to be shuffled. Also I meant to say `_mm256_shuffle_epi8` is not lane-crossing, not `set`. Anyway, nevermind. – Peter Cordes Jun 04 '19 at 21:35