Counting differences between 2 buffers seems too slow

Question

My problem

I have 2 adjacent buffers of bytes of identical size (around 20 MB each). I just want to count the differences between them.

My question

How much time this loop should take to run on a 4.8GHz Intel I7 9700K with 3600MT RAM ?

How do we compute max theoretical speed ?

What I tried

uint64_t compareFunction(const char *const __restrict buffer, const uint64_t commonSize)
{
    uint64_t diffFound = 0;

    for(uint64_t byte = 0; byte < commonSize; ++byte)
        diffFound += static_cast<uint64_t>(buffer[byte] != buffer[byte + commonSize]);

    return diffFound;
}

It takes 11ms on my PC (9700K 4.8Ghz RAM 3600 Windows 10 Clang 14.0.6 -O3 MinGW ) and I feel it is too slow and that I am missing something.

40MB should take less than 2ms to be read on the CPU (my RAM bandwidth is between 20 and 30GB/s)

I don't know how to count cycles required to execute one iteration (especially because CPUs are superscalar nowadays). If I assume 1 cycle per operation and if I don't mess up my counting, it should be 10 ops per iteration -> 200 million ops -> at 4.8 Ghz with only one execution unit -> 40ms. Obviously I am wrong on how to compute the number of cycles per loop.

Fun fact: I tried on Linux PopOS GCC 11.2 -O3 and it ran at 4.5ms. Why such a difference?

Here are the dissassemblies vectorised and scalar produced by clang:

compareFunction(char const*, unsigned long): # @compareFunction(char const*, unsigned long)
        test    rsi, rsi
        je      .LBB0_1
        lea     r8, [rdi + rsi]
        neg     rsi
        xor     edx, edx
        xor     eax, eax
.LBB0_4:                                # =>This Inner Loop Header: Depth=1
        movzx   r9d, byte ptr [rdi + rdx]
        xor     ecx, ecx
        cmp     r9b, byte ptr [r8 + rdx]
        setne   cl
        add     rax, rcx
        add     rdx, 1
        mov     rcx, rsi
        add     rcx, rdx
        jne     .LBB0_4
        ret
.LBB0_1:
        xor     eax, eax
        ret

Clang14 O3:

.LCPI0_0:
        .quad   1                               # 0x1
        .quad   1                               # 0x1
compareFunction(char const*, unsigned long):                # @compareFunction(char const*, unsigned long)
        test    rsi, rsi
        je      .LBB0_1
        cmp     rsi, 4
        jae     .LBB0_4
        xor     r9d, r9d
        xor     eax, eax
        jmp     .LBB0_11
.LBB0_1:
        xor     eax, eax
        ret
.LBB0_4:
        mov     r9, rsi
        and     r9, -4
        lea     rax, [r9 - 4]
        mov     r8, rax
        shr     r8, 2
        add     r8, 1
        test    rax, rax
        je      .LBB0_5
        mov     rdx, r8
        and     rdx, -2
        lea     r10, [rdi + 6]
        lea     r11, [rdi + rsi]
        add     r11, 6
        pxor    xmm0, xmm0
        xor     eax, eax
        pcmpeqd xmm2, xmm2
        movdqa  xmm3, xmmword ptr [rip + .LCPI0_0] # xmm3 = [1,1]
        pxor    xmm1, xmm1
.LBB0_7:                                # =>This Inner Loop Header: Depth=1
        movzx   ecx, word ptr [r10 + rax - 6]
        movd    xmm4, ecx
        movzx   ecx, word ptr [r10 + rax - 4]
        movd    xmm5, ecx
        movzx   ecx, word ptr [r11 + rax - 6]
        movd    xmm6, ecx
        pcmpeqb xmm6, xmm4
        movzx   ecx, word ptr [r11 + rax - 4]
        movd    xmm7, ecx
        pcmpeqb xmm7, xmm5
        pxor    xmm6, xmm2
        punpcklbw       xmm6, xmm6              # xmm6 = xmm6[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
        pshuflw xmm4, xmm6, 212                 # xmm4 = xmm6[0,1,1,3,4,5,6,7]
        pshufd  xmm4, xmm4, 212                 # xmm4 = xmm4[0,1,1,3]
        pand    xmm4, xmm3
        paddq   xmm4, xmm0
        pxor    xmm7, xmm2
        punpcklbw       xmm7, xmm7              # xmm7 = xmm7[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
        pshuflw xmm0, xmm7, 212                 # xmm0 = xmm7[0,1,1,3,4,5,6,7]
        pshufd  xmm5, xmm0, 212                 # xmm5 = xmm0[0,1,1,3]
        pand    xmm5, xmm3
        paddq   xmm5, xmm1
        movzx   ecx, word ptr [r10 + rax - 2]
        movd    xmm0, ecx
        movzx   ecx, word ptr [r10 + rax]
        movd    xmm1, ecx
        movzx   ecx, word ptr [r11 + rax - 2]
        movd    xmm6, ecx
        pcmpeqb xmm6, xmm0
        movzx   ecx, word ptr [r11 + rax]
        movd    xmm7, ecx
        pcmpeqb xmm7, xmm1
        pxor    xmm6, xmm2
        punpcklbw       xmm6, xmm6              # xmm6 = xmm6[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
        pshuflw xmm0, xmm6, 212                 # xmm0 = xmm6[0,1,1,3,4,5,6,7]
        pshufd  xmm0, xmm0, 212                 # xmm0 = xmm0[0,1,1,3]
        pand    xmm0, xmm3
        paddq   xmm0, xmm4
        pxor    xmm7, xmm2
        punpcklbw       xmm7, xmm7              # xmm7 = xmm7[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
        pshuflw xmm1, xmm7, 212                 # xmm1 = xmm7[0,1,1,3,4,5,6,7]
        pshufd  xmm1, xmm1, 212                 # xmm1 = xmm1[0,1,1,3]
        pand    xmm1, xmm3
        paddq   xmm1, xmm5
        add     rax, 8
        add     rdx, -2
        jne     .LBB0_7
        test    r8b, 1
        je      .LBB0_10
.LBB0_9:
        movzx   ecx, word ptr [rdi + rax]
        movd    xmm2, ecx
        movzx   ecx, word ptr [rdi + rax + 2]
        movd    xmm3, ecx
        add     rax, rsi
        movzx   ecx, word ptr [rdi + rax]
        movd    xmm4, ecx
        pcmpeqb xmm4, xmm2
        movzx   eax, word ptr [rdi + rax + 2]
        movd    xmm2, eax
        pcmpeqb xmm2, xmm3
        pcmpeqd xmm3, xmm3
        pxor    xmm4, xmm3
        punpcklbw       xmm4, xmm4              # xmm4 = xmm4[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
        pshuflw xmm4, xmm4, 212                 # xmm4 = xmm4[0,1,1,3,4,5,6,7]
        pshufd  xmm4, xmm4, 212                 # xmm4 = xmm4[0,1,1,3]
        movdqa  xmm5, xmmword ptr [rip + .LCPI0_0] # xmm5 = [1,1]
        pand    xmm4, xmm5
        paddq   xmm0, xmm4
        pxor    xmm2, xmm3
        punpcklbw       xmm2, xmm2              # xmm2 = xmm2[0,0,1,1,2,2,3,3,4,4,5,5,6,6,7,7]
        pshuflw xmm2, xmm2, 212                 # xmm2 = xmm2[0,1,1,3,4,5,6,7]
        pshufd  xmm2, xmm2, 212                 # xmm2 = xmm2[0,1,1,3]
        pand    xmm2, xmm5
        paddq   xmm1, xmm2
.LBB0_10:
        paddq   xmm0, xmm1
        pshufd  xmm1, xmm0, 238                 # xmm1 = xmm0[2,3,2,3]
        paddq   xmm1, xmm0
        movq    rax, xmm1
        cmp     r9, rsi
        je      .LBB0_13
.LBB0_11:
        lea     r8, [r9 + rsi]
        sub     rsi, r9
        add     r8, rdi
        add     rdi, r9
        xor     edx, edx
.LBB0_12:                               # =>This Inner Loop Header: Depth=1
        movzx   r9d, byte ptr [rdi + rdx]
        xor     ecx, ecx
        cmp     r9b, byte ptr [r8 + rdx]
        setne   cl
        add     rax, rcx
        add     rdx, 1
        cmp     rsi, rdx
        jne     .LBB0_12
.LBB0_13:
        ret
.LBB0_5:
        pxor    xmm0, xmm0
        xor     eax, eax
        pxor    xmm1, xmm1
        test    r8b, 1
        jne     .LBB0_9
        jmp     .LBB0_10

A couple things to try first: `-march=native` to ensure you are using all the SIMD features your current CPU has available. The baseline is only SSE2. — Nate Eldredge, Nov 12 '22 at 18:59
Btw, the posting restrictions are because your account is new. New accounts are what spammers use for posts that just have a bunch of links to whatever they are plugging, and so new accounts are limited as to how many links they can include. The "hello" is also omitted because folks want posts to get straight to the point instead of starting with pleasantries. — Nate Eldredge, Nov 12 '22 at 19:01
Thank you for the input. I thought arch=native was implicit. It is not the case ? Regarding the difference between GCC and Clang. From what i see in the ASM, Clang O1 is close to GCC O2 for the scalar version. I actually wonder if vectorizing this code makes it faster. GCC11 O2 https://godbolt.org/z/nxEPf78jh GCC11 O3 https://godbolt.org/z/Woe59c5W3 Thanks also for the explanation regarding new accounts. — Scr3amer, Nov 12 '22 at 19:02
Whether `-march=native` is default depends on how your compiler install was configured when it was built (by you or your package distributor). I think with the pure vanilla distribution, it isn't default. — Nate Eldredge, Nov 12 '22 at 19:04
Looking at Clang's output, there should be room for improvement using manual vectorization. Clang vectorized it, but in an ugly way, worse than any human would reasonably do. BTW the speed estimated by uiCA on Clangs asm is 20 cycles per iteration on your CPU, and 8 (or maybe 16?) items are processed per iteration. — harold, Nov 12 '22 at 19:05
For stuff like vectorization, it's not particularly unusual for gcc and clang to perform very differently as far as optimization. — Nate Eldredge, Nov 12 '22 at 19:05
As one note, the code as it stands would correctly handle a buffer of more than 4 GB, which means the compiled code has to zero-extend all the compare results to 64 bits, and do all the addition 64 bits at a time. Since you say your buffers are actually only 20 MB, you may get an improvement by using `uint32_t diffFound`, as then you get twice as many adds per instruction. — Nate Eldredge, Nov 12 '22 at 19:10
Another thought is that since this operation is "embarrassingly parallel" and you have 8 cores, you could divide your buffer into chunks and give them to different threads. (I wonder how well C++17 execution policies would do for that?) — Nate Eldredge, Nov 12 '22 at 19:20
@harold i think if AVX is used it would be 32 elements processed at the same, 16 with SSE. Right ? — Scr3amer, Nov 13 '22 at 03:01
@NateEldredge I use U64 because buffers can be arbitrarly big. It just happens that my test sample is 20 MB. march=native did the trick. I thought about improving further with multithreaded but I wanted first to squeeze max performance in single thread before using MT. — Scr3amer, Nov 13 '22 at 03:03
Cool! I tried implementing an AVX2 solution and it appears to be about 3 times faster than GCC's autovectorized solution. I don't have clang to test it with, though. Let me know if you'd like me to post that as an answer. — Ovinus Real, Nov 13 '22 at 04:52
Actually, https://stackoverflow.com/questions/54541129/how-to-count-character-occurrences-using-simd basically does exactly what I did. — Ovinus Real, Nov 13 '22 at 05:16
It might still be worth testing the input size and branching to a 32-bit version if small enough. For that matter, even if the size is somewhat larger than 32 bits, it may be worth breaking into chunks of less than 4 GB and running the 32-bit version several times, then extending and summing the results from each chunk as 64-bit values at the very end. — Nate Eldredge, Nov 13 '22 at 07:11
Large chunks <4GiB are indeed better since they increase the occupancy of SIMD units, but only by a factor of two. Very small chunks can be much more efficient here since SIMD register can be used at their maximum capacity: 16 counters per 128-bit SIMD register. The downside is to do more frequently the accumulation in a larger counter but it clearly worth it (theoretically up to 4x faster). — Jérôme Richard, Nov 13 '22 at 12:43

Jérôme Richard · Accepted Answer · 2022-11-14T00:40:04.603

TLDR: the reason why the Clang code is so slow comes from a poor vectorization method saturating the port 5 (known to be often an issue). GCC does a better job here, but it is still far from being efficient. One can write a much faster chunk-based code using AVX-2 not saturating the port 5.

Analysis of the unvectorized Clang code

To understand what is going on it is better to start with a simple example. Indeed, as you said, modern processor are superscalar so it is not easy to understand the speed of some generated code on such architecture.

The code generated by Clang using the -O1 optimization flag is a good start. Here is the code of the hot loop produced by GodBold provided in your question:

(instructions)                                 (ports)

.LBB0_4:
        movzx   r9d, byte ptr [rdi + rdx]      p23
        xor     ecx, ecx                       p0156
        cmp     r9b, byte ptr [r8 + rdx]       p0156+p23
        setne   cl                             p06
        add     rax, rcx                       p0156
        add     rdx, 1                         p0156
        mov     rcx, rsi                       (optimized)
        add     rcx, rdx                       p0156
        jne     .LBB0_4                        p06

Modern processors like the Coffee Lake 9700K are structured in two big parts: a front-end fetching/decoding the instructions (and splitting them into micro-instructions, aka. uops), and a back-end scheduling/executing them. The back-end schedule the uops on many ports and each of them can execute some specific sets of instructions (eg. only memory load, or only arithmetic instruction). For each instruction, I put the ports that can execute them. p0156+p23 means the instruction is split in two uops: the first can be executed by the ports 0 or 1 or 5 or 6, and the second can be executed by the ports 2 or 3. Note that the front-end can somehow optimize the code so not to produce any uops for basic instructions like the mov in the loop (thanks to a mechanism called register renaming).

For each loop iteration, the processor needs to read 2 value from memory. A Coffee Lake processor like the 9700K can load two values per cycle so the loop will at least take 1 cycle/iteration (assuming the loads in r9d and r9b does not conflict due to the use of different part of the same r9 64-bit register). This processor has a uops cache and the loop has a lot of instructions so the decoding part should not be a problem. That being said, there is 9 uops to execute and the processor can only execute 6 of them per cycle so the loop cannot take less than 1.5 cycle/iteration. More precisely, the ports 0, 1, 5 and 6 are under pressure, so even assuming the processor perfectly load balance the uops, 2 cycle/iterations are needed. This is an optimistic lower-bound execution time since the processor may not perfectly schedule the instruction and there are many things that could possibly go wrong (like a sneaky hidden dependency I did not see). With a frequency of 4.8GHz, the final execution time is at least 8.3 ms. It can reach 12.5 ms with 3 cycle/iteration (note that 2.5 cycle/iteration is possible due to the scheduling of uops to ports).

The loop can be improved using unrolling. Indeed, a significant number of instructions are needed just to do the loop and not the actual computation. Unrolling can help to increase the ratio of useful instructions so to make a better usage of available ports. Still, the 2 loads prevent the loop to be faster than 1 cycle/iteration, that is 4.2 ms.

Analysis of the vectorized Clang code

The vectorized code generated by Clang is complex. One could try to apply the same analysis than in the previous code but it would be a tedious task.

One can note that even though the code is vectorized, the loads are not vectorized. This is an issue since only 2 loads can be done per cycle. That being said, loads are performed by pairs two contiguous char values so loads are not so slow compared to the previously generated code.

Clang does that since only two 64-bit values can fit in a 128-bit SSE register and a 64-bit and it needs to do that because diffFound is a 64-bit integer. The 8-bit to 64-bit conversion is the biggest issue in the code because it requires several SSE instructions to do the conversion. Moreover, only 4 integers can be computed at a time since there is 3 SSE integer units on Coffee Lake and each of them can only compute two 64-bit integers at a time. In the end, Clang only put 2 values in each SSE register (and use 4 of them so to compute 8 items per loop iteration) so one should expect a code running more than twice faster (especially due to SSE and the loop unrolling), but this is not much the case due to fewer SSE ports than ALU ports and a more instructions required for the type conversions. Put it shortly, the vectorization is clearly inefficient, but this is not so easy for Clang to generate an efficient code in this case. Still, with 28 SSE instructions and 3 SSE integer units computing 8 items per loop, one should expect the computing part of the code to take about 28/3/8 ~= 1.2 cycle/item which is far from what you can observe (and this is not due to other instruction since they can mostly be executed in parallel as they can mostly be scheduled on other ports).

In fact, the performance issue certainly comes from the saturation of the port 5. Indeed, this port is the only one that can shuffle items of SIMD registers. Thus, the instructions punpcklbw, pshuflw, pshufd and even the movd can only be executed on the port 5. This is a pretty common issue with SIMD codes. This is a big issue since there is 20 instructions per loop and the processor may not even use it perfectly. This means the code should take at least 10.4 ms which is very close to the observed execution time (11 ms).

Analysis of the vectorized GCC code

The code generated by GCC is actually pretty good compared to the one of Clang. Firstly, GCC loads items using SIMD instruction directly which is much more efficient as 16 items are computed per instruction (and by iteration): it only need 2 load uops per iteration reducing the pressure on the port 2 and 3 (1 cycle/iteration for that, so 0.0625 cycle/item). Secondly, GCC only uses 14 punpckhwd instructions while each iteration compute 16 items, reducing critical pressure on the port 5 (0.875 cycle/item for that). Thirdly, the SIMD registers are nearly fully used, at least for the comparison since the pcmpeqb comparison instruction compare 16 items at a time (as opposed to 2 with Clang). The other instructions like paddq are cheap (for example, paddq can be scheduled on the 3 SSE ports) and they should not impact much the execution time. In the end, this version should still be bounded by the port 5, but it should be much faster than the Clang version. Indeed, one should expect the execution time to reach 1 cycle/item (since the port scheduling is certainly not perfect and memory loads may introduce some stalling cycles). This means an execution time of 4.2 ms. This is close to the observed results.

Faster implementation

The GCC implementation is not perfect.

First of all, it does not use AVX2 supported by your processor since the -mavx2 flag is not provided (or any similar flag like -march=native). Indeed, GCC like other mainstream compilers only use SSE2 by default for sake of compatibility with previous architecture: SSE2 is safe to use on all x86-64 processors, but not other instruction sets like SSE3, SSSE3, SSE4.1, SSE4.2, AVX, AVX2. With such flag, GCC should be able to produce a memory bound code.

Moreover, the compiler could theoretically perform a multi-level sum reduction. The idea is to accumulate the result of the comparison in a 8-bit wide SIMD lane using chunks with a size of 1024 items (ie. 64x16 items). This is safe since the value of each lane cannot exceed 64. To avoid overflow, the accumulated values needs to be stored in wider SIMD lanes (eg. 64-bit ones). With this strategy, the overhead of the punpckhwd instructions is 64 time smaller. This is a big improvement since it removes the saturation of the port 5. This strategy should be sufficient to generate a memory-bound code, even using only SSE2. Here is an example of untested code requiring the flag -fopenmp-simd to be efficient.

uint64_t compareFunction(const char *const __restrict buffer, const uint64_t commonSize)
{
    uint64_t byteChunk = 0;
    uint64_t diffFound = 0;

    if(commonSize >= 127)
    {
        for(; byteChunk < commonSize-127; byteChunk += 128)
        {
            uint8_t tmpDiffFound = 0;
            #pragma omp simd reduction(+:tmpDiffFound)
            for(uint64_t byte = byteChunk; byte < byteChunk + 128; ++byte)
                tmpDiffFound += buffer[byte] != buffer[byte + commonSize];
            diffFound += tmpDiffFound;
        }
    }

    for(uint64_t byte = byteChunk; byte < commonSize; ++byte)
        diffFound += buffer[byte] != buffer[byte + commonSize];

    return diffFound;
}

Both GCC and Clang generates a rather efficient code (while sub-optimal for data fitting in the cache), especially Clang. Here is for example the code generated by Clang using AVX2:

.LBB0_4:
        lea     r10, [rdx + 128]
        vmovdqu ymm2, ymmword ptr [r9 + rdx - 96]
        vmovdqu ymm3, ymmword ptr [r9 + rdx - 64]
        vmovdqu ymm4, ymmword ptr [r9 + rdx - 32]
        vpcmpeqb        ymm2, ymm2, ymmword ptr [rcx + rdx - 96]
        vpcmpeqb        ymm3, ymm3, ymmword ptr [rcx + rdx - 64]
        vpcmpeqb        ymm4, ymm4, ymmword ptr [rcx + rdx - 32]
        vmovdqu ymm5, ymmword ptr [r9 + rdx]
        vpaddb  ymm2, ymm4, ymm2
        vpcmpeqb        ymm4, ymm5, ymmword ptr [rcx + rdx]
        vpaddb  ymm3, ymm4, ymm3
        vpaddb  ymm2, ymm3, ymm2
        vpaddb  ymm2, ymm2, ymm0
        vextracti128    xmm3, ymm2, 1
        vpaddb  xmm2, xmm2, xmm3
        vpshufd xmm3, xmm2, 238
        vpaddb  xmm2, xmm2, xmm3
        vpsadbw xmm2, xmm2, xmm1
        vpextrb edx, xmm2, 0
        add     rax, rdx
        mov     rdx, r10
        cmp     r10, r8
        jb      .LBB0_4

All the loads are 256-bit SIMD ones. The number of vpcmpeqb is optimal. The number of vpaddb is relatively good. There are few other instructions, but they should clearly not be a bottleneck. The loop operate on 128 items per iteration and I expect it to takes less than a dozen of cycles per iteration for data already in the cache (otherwise it should be completely memory-bound). This means <0.1 cycle/item, that is, far less than the previous implementation. In fact, the uiCA tool indicates about 0.055 cycle/item, that is 81 GiB/s! One may manually write a better code using SIMD intrinsics, but at the expense of a significantly worse portability, maintenance and readability.

Note that generating a sequential memory-bound does not always mean the RAM throughput will be saturated. In fact, on one core, there is sometimes not enough concurrency to hide the latency of memory operations though it should be fine on your processor (like it is on my i5-9600KF with 2 interleaved 3200 MHz DDR4 memory channels).

Amazing, thank you Jerome for the detailed analysis of the compilers outputs. I have some questions: 1) what resources did you use to come up with the abilities of the ports for these CPUs and how opcodes are split into which uops ? Intel guide ? Database ? A software ? 2) Did you write the ports by hand opcode by opcode ? Or is it a software output like uica ? 3) What do you mean by "while sub-optimal for data fitting in the cache" ? 4) You say "6 uops to execute, cpu can do 6/cy so limit is 1.5 cy per loop". Where does the 0.5 comes from ? Pressure on ports ? I missed that point. — Scr3amer, Nov 13 '22 at 22:16
Glad you like it :) . For (1), I mainly used [uops.info/table](https://uops.info/table.html) combined with [wikichip](https://en.wikichip.org/wiki/intel/microarchitectures/coffee_lake). For (2), I unfortunately did that by hand opcode by opcode before discovering one the question was nearly fully written that uica does the job pretty well (at least in this case where there is no nested loops to analyse). For (4), you are right, there was a mistake: it is 9 and not 6, I fixed it (9/6=1.5). — Jérôme Richard, Nov 13 '22 at 23:31
For (4), I mean that if you are working in the L1 cache for example, then the code should not be memory-bound anymore and it can be improved so to improve the throughput. Here, this is not much a problem since the input is too big to fit in any cache on the target processor. Note It can fit in the L3 of some recent processor though. If you plan to deal with smaller input, then note that the code can be improved by doing the reduction directly in a SIMD register (something compilers fail to do), the chunk can be a bit bigger and the loop can be unrolled further. — Jérôme Richard, Nov 13 '22 at 23:36
In the OP's Godbolt link, the true failure isn't the port5 bottleneck; it's doing 2x 64-bit elements per vector instead of 16x 8-bit. The garbage vectorization strategy led to all these port5 shuffles. (The scalar movzx + `movd xmm, gpr` doesn't help either, but even with `-march=skylake` it still does 2x `vmovd` loads and `vpcmpeqb` on 4 bytes at once, not 32, then `vpmovzxbq`.) A good strategy doesn't have any shuffles inside the loop at all, as discussed in my answer, instead using `psadbw` to widen. Anything worse than that is a compromise, but your final code is a much less bad one! — Peter Cordes, Nov 13 '22 at 23:37
Huh, just noticed that your OpenMP with `int8_t tmpDiffFound = 0;` ends up using `vpsadbw` after reducing to 8 bytes. Interesting. Using it earlier, before show shuffle/add steps, could allow `vpaddq` for more data without overflow, but otherwise no efficiency change if it's going to insist on reducing to scalar inside the inner loop. :( At least this hand-holding does much better than the original. Still some missed micro-optimizations, like `vpextrb` should be `vmovd` because of how `psadbw` works. And indexed addressing modes for `vpcmpeqb ymm,mem` cost 2 front-end uops on Intel. — Peter Cordes, Nov 13 '22 at 23:43
And it could be doing better with the loop bound, did you try `byteChunk < commonSize-63`? That would require a separate check before the first iteration if the sizes are unsigned, but might help avoid the `lea` for `byteChunk+63` inside the loop. (The compiler doesn't prove it can't wrap for unsigned.) — Peter Cordes, Nov 13 '22 at 23:47
@PeterCordes I am confused, which GodBolt link are you talking about? It is the Clang O3, isn't it ? If so, the load are indeed inefficient but the 20 instructions using the port 5 are the bottleneck since the loop should take about 20 cycle and the port 5 is saturated during 20 cycles for each iteration (also confirmed by uiCA). I agree for the shuffle. Note that the final OpenMP version does that because it tries to reduce the lanes to a unique value which is a missed optimization. That being said, the unique shuffle is not critical here since it can be scheduled on the unsaturated port 5 — Jérôme Richard, Nov 14 '22 at 00:10
Yeah, clang -O3 (without -march) https://godbolt.org/z/KTj7sG8rq. The throughput bottleneck in the final code is port 5, yes, but that's a result of choosing a terrible strategy in the first place, of widening compare results to 64-bit before doing any adding. No matter how you do that, it's going to be unacceptably slow. It's true that clang's actual implementation of that strategy sucks more than it needs to, like without SSE4.1 it might as well do 16-byte load+compare and unpack lo + hi, instead of just low. That would still be *far* from optimal, but not as bad. — Peter Cordes, Nov 14 '22 at 00:16
@PeterCordes Interesting for the OpenMP change. I am unfortunately unable to reproduce it, can you provide a link or specify which parameter+compiler did you used? I agree for the scalar, I spent some time trying to overpass it but it is not easy to avoid. I agree that the OpenMP final code is not great but it should be enough to maximize the saturation of the RAM from one core, and certainly even the L3. Good point for the `commonSize-63` : I wanted to avoid a wraparound but indeed a check is sufficient to do that improve a bit the speed. Thank you. — Jérôme Richard, Nov 14 '22 at 00:34
I was talking about changes that a smarter compiler could have made; I'm not optimistic that clang could be hand-held into actually making better asm. But https://godbolt.org/z/4Woeq1sG9 shows a micro-optimized version of the same strategy, reducing to scalar inside the loop. 21 front-end uops for SKL, down from 28 (including saving the lea / mov and not defeating macro-fusion), uICA predicts 5.31c/iter on SKL, 4.81 on Tiger / Rocket Lake where it predicts a slight port 0/1 bottleneck from imperfect scheduling. vs. your current codegen at 6.97 on SKL, 6.0 on TGL/RKL with 27 front-end uops — Peter Cordes, Nov 14 '22 at 01:18
4.8 cycles per 4 vector compares is getting pretty close to the limit of 4 imposed by 2 loads per clock. 6 or 7 cycles is worse but might still keep up with L2 cache if neither array is hot in L1d (but is hot in L2). Still less hyperthreading friendly, but likely an acceptable tradeoff for fairly compact and maintainable portable source. I've been meaning to try looking at GCC or clang code sometime to see if I can teach them how to minimize indexed addressing modes when unrolling, when tuning for generic on Intel. Their loops are so obviously (to me) sub-optimal so often. — Peter Cordes, Nov 14 '22 at 01:33
Thank you very much for the additional information @PeterCordes. So in summary Port 5 pressure is not a root cause but a symptom. Since you mention HT, hyperthreading friendliness comes from generating the least amount of work for the frontend while waiting for data to reach the L1D ? I am not familiar at all on how to leverage efficiently that feature. IIRC, HT is just allowing the CPU to do some work for another workload while waiting for data or code. — Scr3amer, Nov 14 '22 at 01:37
@Scr3amer: Yeah, if another thread is running on the other logical core, it's competing for front-end cycles and back-end execution resources. If your threads actually bottleneck on different things, like memory bandwidth or latency, or ALU latency (or branch misses and cache misses), and their sum total of uop throughput at that other bottleneck is less than the front-end and back-end limits, ideally they'll come closer to each thread running as fast as they would if they had a core to themselves. (They won't because each having only half the ROB limits OoO exec and stuff like that.) — Peter Cordes, Nov 14 '22 at 01:46
@Scr3amer: Fewer uops per iter also helps without HT: the same [ROB capacity](https://blog.stuffedcow.net/2013/05/measuring-rob-capacity/) lets the CPU see more iterations ahead, so loads from a new page can be seen sooner to start the HW prefetcher going, and TLB. Also let the integer loop condition run ahead of the main work, so the branch miss on loop exit can get resolved well ahead of the vector work, and get started on the next stuff. [Avoid stalling pipeline by calculating conditional early](https://stackoverflow.com/q/49932119) — Peter Cordes, Nov 14 '22 at 01:49
Jérôme and @Scr3amer: https://godbolt.org/z/erd6jvY9G is a version that clang compiles with efficient loops. As is often the case `char *endp = buf+size` and a pointer increment works well. I was even able to get it to use a non-indexed addressing mode for the `vpcmpeqb` memory source, indexed for the `vmovdqu` loads. (Clang doesn't know that's important; reversing the operands to `!=` in the source made the difference!!) That works because you have the two input arrays contiguous so `b[i]` = `a[i+size]`. I also tried a probably-broken idea to save one vpaddb set1_epi8(4) fixup in the loop. — Peter Cordes, Nov 14 '22 at 03:56
@Scr3amer: https://godbolt.org/z/6nThzz67c (including UICA link in a comment) is a version without the probably-broken parts. uICA says SKL will run it at 5.51 cycles / iter, or 5.05 on ICL/RKL, almost as good as my hand-written. The only way it's still worse is `vpextrb` (because it doesn't prove that truncation to int8_t isn't needed?) instead of `vmovd` or `vmovq` which costs an extra front-end and port5 uop; with that change, 5.25c / iter. It also avoids unrolling of the clean-up loop, just vectorizing with its dumb 4 bytes per iter strat, about right for cleanup of size%128 bytes. — Peter Cordes, Nov 14 '22 at 04:19
It is just awesome how far you pushed what was initially just an evening exercise. I learnt a lot from it, thanks again. I wonder what's your take on this. If it was production code would you go directly (inline?) ASM instead of holding the hand of the compiler to get him to generate the best code ? — Scr3amer, Nov 14 '22 at 18:33
@Scr3amer: For production code where performance was important, I'd use intrinsics like `_mm256_cmpeq_epi8` and `_mm256_sub_epi8`, as in the linked Q&A about counting matches. (`mismatches = n - matches`). That's a good tradeoff between maintainability and micro-optimizations, letting you manually implement a vectorization strategy. Usually it's possible to get compilers to make pointer-increment code instead of indexed addressing modes, if you change the source around like I did here. The last drop of performance, from indexing one array relative to the other, is usually not worth it. — Peter Cordes, Nov 15 '22 at 08:52

Peter Cordes · Answer 2 · 2022-11-14T06:42:14.917

Yes, if your data is not hot in cache, even SSE2 should keep up with memory bandwidth. Compare-and-sum of 32 compare results per cycle (from two 32-byte loads) is totally possible if data is hot in L1d cache, or whatever bandwidth outer levels of cache can provide.

If not, the compiler did a bad job. That's unfortunately common for problems like this reducing into a wider variable; compilers don't know good vectorization strategies for summing bytes, especially compare-result bytes that must be 0/-1. They probably widen to 64-bit with pmovsxbq right away (or even worse if SSE4.1 instructions aren't available).

So even -O3 -march=native doesn't help much; this is a big missed-optimization; hopefully GCC and clang will learn how to vectorize this kind of loop at some point, summing compare results probably comes up in enough codebases to be worth recognizing that pattern.

The efficient way is to use psadbw to sum horizontally into qwords. But only after an inner loop does some iterations of vsum -= cmp(p, q), subtracting 0 or -1 to increment a counter or not. 8-bit elements can do 255 iterations of that without risk of overflow. And with unrolling for multiple vector accumulators, that's many vectors of 32 bytes each, so you don't have to break out of that inner loop very often.

See How to count character occurrences using SIMD for manually-vectorized AVX2 code. (And one answer has a Godbolt link to an SSE2 version.) Summing the compare results is the same problem as that, but you're loading two vectors to feed pcmpeqb instead of broadcasting one byte outside the loop to find occurrences of a single char.

An answer there has benchmarks that report 28 GB/s for AVX2, 23 GB/s for SSE2, on an i7-6700 Skylake (at only 3.4GHz, maybe they disabled turbo or are just reporting the rated speed. DRAM speed not mentioned.)

I'd expect 2 input streams of data to achieve about the same sustained bandwidth as one.

This is more interesting to optimize if you benchmark repeated passes over smaller arrays that fit in L2 cache, then efficiency of your ALU instructions matters. (The strategy in the answers on that question are pretty good and well tuned for that case.)

Fast counting the number of equal bytes between two arrays is an older Q&A using a worse strategy, not using psadbw to sum bytes to 64-bit. (But not as bad as GCC/clang, still hsumming as it widens to 32-bit.)

Multiple threads/cores will barely help on a modern desktop, especially at high core clocks like yours. Memory latency is low enough and each core has enough buffers to keep enough requests in flight that it can nearly saturate dual-channel DRAM controllers.

On a big Xeon, that would be very different; you need most of the cores to achieve peak aggregate bandwidth, even for just memcpy or memset so there's zero ALU work, just loads/stores. The higher latency means a single core has much less memory bandwidth available than on a desktop (even in an absolute sense, let alone as a percentage of 6 channels instead of 2). See also Enhanced REP MOVSB for memcpy and Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?

Portable source that compiles to less-bad asm, micro-optimized from Jérôme's: 5.5 cycles per 4x 32-byte vectors, down from 7 or 8, assuming L1d cache hits.

Still not good (as it reduces to scalar every 128 bytes, or 192 if you want to try that), but @Jérôme Richard came up with a clever way to give clang something it could vectorize a short with a good strategy, with a uint8_t sum, using that as an inner loop short enough to not overflow.

But clang still does some dumb things with that loop, as we can see in his answer. I modified the loop control to use a pointer increment, which reduces the loop overhead a bit, just one pointer-add and compare/jcc, not LEA/MOV. I don't know why clang was doing it inefficiently using integer indexing.

And it avoids an indexed addressing mode for the vpcmpeqb memory source operands, letting them stay micro-fused on Intel CPUs. (Clang doesn't seem to know that this matters at all! Reversing operands to != in the source was enough to make it use indexed addressing modes for vpcmpeqb instead of for vmovdqu pure loads.)

// micro-optimized version of Jérôme's function, clang compiles this better
// instead of 2 arrays, it compares first and 2nd half of one array, which lets it index one relative to the other with an offset if we hand-hold clang into doing that.

uint64_t compareFunction_sink_fixup(const char *const __restrict buffer, const size_t commonSize)
{
    uint64_t byteChunk = 0;
    uint64_t diffFound = 0;

    const char *endp = buffer + commonSize;
    const char *__restrict ptr = buffer;

    if(commonSize >= 127) {
        // A signed type for commonSize wouldn't avoid UB in pointer subtraction creating a pointer before the object
        // in practice it would be fine except maybe when inlining into a function where the compiler could see a compile-time-constant array size.
        for(; ptr < endp-127 ; ptr += 128)
        {
            uint8_t tmpDiffFound = 0;
            #pragma omp simd reduction(+:tmpDiffFound)
            for(int off = 0 ; off < 128; ++off)
                tmpDiffFound += ptr[off + commonSize] != ptr[off];
                // without AVX-512, we get -1 for ==, 0 for not-equal.  So clang adds set1_epi(4) to each bucket that holds the sum of four 0 / -1 elements
            diffFound += tmpDiffFound;
        }
    }

    // clang still auto-vectorizes, but knows the max trip count is only 127
    // so doesn't unroll, just 4 bytes per iter.
    for(int byte = 0 ; byte < commonSize % 128 ; ++byte)
        diffFound += ptr[byte] != ptr[byte + commonSize];

    return diffFound;
}

Godbolt with clang15 -O3 -fopenmp-simd -mavx2 -march=skylake -mbranches-within-32B-boundaries

# The main loop, from clang 15 for x86-64 Skylake
.LBB0_4:                                # =>This Inner Loop Header: Depth=1
        vmovdqu ymm2, ymmword ptr [rdi + rsi]
        vmovdqu ymm3, ymmword ptr [rdi + rsi + 32]     # Indexed addressing modes are fine here
        vmovdqu ymm4, ymmword ptr [rdi + rsi + 64]
        vmovdqu ymm5, ymmword ptr [rdi + rsi + 96]
        vpcmpeqb        ymm2, ymm2, ymmword ptr [rdi]      # non-indexed allow micro-fusion without un-lamination
        vpcmpeqb        ymm3, ymm3, ymmword ptr [rdi + 32]
        vpcmpeqb        ymm4, ymm4, ymmword ptr [rdi + 64]
        vpaddb  ymm2, ymm4, ymm2
        vpcmpeqb        ymm4, ymm5, ymmword ptr [rdi + 96]
        vpaddb  ymm3, ymm4, ymm3
        vpaddb  ymm2, ymm2, ymm3

        vpaddb  ymm2, ymm2, ymm0       # add a vector of set1_epi8(4) to turn sums of 0 / -1 into sums of 1 / 0
        vextracti128    xmm3, ymm2, 1
        vpaddb  xmm2, xmm2, xmm3
        vpshufd xmm3, xmm2, 238                 # xmm3 = xmm2[2,3,2,3]
        vpaddb  xmm2, xmm2, xmm3              # reduced to 8 bytes
        vpsadbw xmm2, xmm2, xmm1              # hsum to one qword
        vpextrb edx, xmm2, 0                  # extract and zero-extend
        add     rax, rdx                      # accumulate the chunk sum

        sub     rdi, -128                # pointer increment (with a sign_extended_imm8 instead of +imm32)
        cmp     rdi, rcx
        jb      .LBB0_4                # }while(p < endp)

This could use 192 instead of 128 to further amortize the loop overhead, at the cost of needing to do %192 (not a power of 2), and making the cleanup loop worst case be 191 bytes. We can't go to 256, or anything higher than UINT8_MAX (255), and sticking to multiples of 32 is necessary. Or 64 for good measure.

There's an extra vpaddb of a fixup constant, set1_epi8(4), which turns the sum of four 0 / -1 into a sum of four 1 / 0 results from the C != operator.

I don't think there's any way to get rid of it or sink it out of the loop while still accumulating into a uint8_t, which is necessary for clang to vectorize this way. It doesn't know how to use vpsadbw to do a widening (non-truncating) sum of bytes, which is ironic because that's what it actually does when used against an all-zero register. If you do something like sum += ptr[off + commonSize] == ptr[off] ? -1 : 0 you can get it to use the vpcmpeqb result directly, summing 4 vectors down to one with 3 adds, and eventually feeding that to vpsadbw after some reduction steps. So you get a sum of matches * 0xFF truncated to uint8_t for each block of 128 bytes. Or as an int8_t, that's a sum of -1 * matches, so 0..-128, which doesn't overflow a signed byte. So that's interesting. But adding with zero-extension into a 64-bit counter might destroy information, and sign-extension inside the outer loop would cost another instruction. It would be a scalar movsx instruction instead of vpaddb, but that's not important for Skylake, probably only if using AVX-512 with 512-bit vectors (which clang and GCC both do badly, not using masked adds). Can we do 128*n_chunks - count after the loop to recover the differences from the sum of matches? No, I don't think so.

uiCA static analysis predicts Skylake (such as your CPU) will run the main loop at 5.51 cycles / iter (4 vectors) if data is hot in L1d cache, or 5.05 on Ice Lake / Rocket Lake. (I had to hand-tweak the asm to emulate the padding effect -mbranches-within-32B-boundaries would have, for uiCA's default assumption of where the top of the loop is relative to a 32-byte alignment boundary. I could have just changed that setting in uiCA instead. :/)

The only missed micro-optimization in implementing this sub-optimal strategy is that it's using vpextrb (because it doesn't prove that truncation to uint8_t isn't needed?) instead of vmovd or vmovq. So it costs an extra uop for the front-end, and for port 5 in the back end. With that optimized (comment + uncomment in the link), 5.25c / iter on Skylake, or 4.81 on Ice Lake, pretty close to the 2 load/clock bottleneck.

(Doing 6 vectors per iter, 192 bytes, predicts 7 cycles per iter on SKL, or 1.166 per vector, down from 5.5 / iter = 1.375 per vector. Or about 6.5 on ICL/RKL = 1.08 c/vec, hitting back-end ALU port bottlecks.)

This is not bad for something we were able to coax clang into generating from portable C++ source, vs. 4 cycles per 4 vectors of 32 byte-compares each for efficient manual vectorization. This will very likely keep up with memory or cache bandwidth even from L2 cache, so it's pretty usable, and not much slower with data hot in L1d. Taking a few more uops does hurt out-of-order exec, and uses up more execution resources that another logical core sharing a physical core could use. (Hyperthreading).

Unfortunately gcc/clang do not make good use of AVX-512 for this. If you were using 512-bit vectors (or AVX-512 features on 256-bit vectors), you'd compare into mask registers, then do something like vpaddb zmm0{k1}, zmm0, zmm1 merge-masking to conditionally increment a vector, where zmm1 = set1_epi8( 1 ). (Or a -1 constant with sub.) Instruction and uop count per vector should be about the same as AVX2 if done properly, but gcc/clang use about twice as many, so the only saving is in the reduction to scalar which seems to be the price for getting anything at all usable.

This version also avoids unrolling of the clean-up loop, just vectorizing with its dumb 4 bytes per iter strategy, which is about right for cleanup of size%128 bytes. It's pretty silly that it uses both vpxor to flip and vpand to turn 0xff into 0x01, when it could have used vpandn to do both those things in one instruction. That would get that cleanup loop down to 8 uops, just twice the pipeline width on Haswell / Skylake, so it would issue more efficiently from the loop buffer, except Skylake disabled that in microcode updates. It would help a bit on Haswell

I have not done a full implementation, but this looks like a case where AVX-512(VL/BW) would help. Load 2 512-bit registers, VPCMPEQB, move the resulting mask to a GPR and execute POPCNT. Unroll by four to tolerate the latencies and asymptotic performance should be limited by the two Port 5 instructions to 64 comparisons every two cycles. — John D McCalpin, Nov 13 '22 at 19:12
@JohnDMcCalpin: I think you'd still be better off with a nested loop, with the inner loop doing load+vpcmpb / `vpaddb zmm0{k1}, zmm0, zmm1` (p05 conditional increment. Or vpsubb with an easier-to-generate all-ones constant). 2 vector ALU uop per 2 loads, 64 byte-compares per 1 cycle. You can do up to 255 unrolled iterations before `vpsadbw` / `vpaddq`. If your data won't be hot in L1d and you're not worried about the other hyperthread, then yeah you might save code size w/ `kmov` (p5) + `popcnt` (p1) + `add` (p0156). As you say, `vpcmpb` runs on p5 only, same as `kmov`. — Peter Cordes, Nov 13 '22 at 20:16
[Can counting byte matches between two strings be optimized using SIMD?](https://stackoverflow.com/a/76450022) has another attempt at auto-vectorizable code, getting clang to use `vpmovmskb` / `popcnt`. — Peter Cordes, Jun 12 '23 at 22:27

score 3 · Answer 3 · edited Nov 13 '22 at 19:27

Correct me if I am wrong but the answer seems to be

-march=native for the win.
the scalar version of the code was CPU bottlenecked and not RAM bottlenecked
use uica.uops.info to have an estimate of the cycles per loop

I will try to write my own AVX code to compare.

Details

After an afternoon tinkering around with the suggestions, here is what I found with clang:

-O1 around 10ms, scalar code
-O3 enables SSE2 and is as slow as O1, maybe poor assembly code
-O3 -march=westmere enables also SSE2 but is faster (7ms)
-O3 -march=native enables AVX -> 2.5ms and we are probably RAM bandwidth limited (close to the theoretical speed)

The scalar 10ms makes sense now because according to that awesome tool uica.uops.info it takes

2.35 cycles per loop
47 million cycles for the whole comparison (20 million iterations)
Processor is clocked at 4.8GHz meaning it should take around 9.8ms and it is close to what is measured.

g++ seems to generate better default code when no flags are added

O1 11ms
O2 scalar still but 9ms
O3 SSE 4.5ms
O3 -march=westmere 7ms like clang
O3 -march=native 3.4ms, slightly slower than clang