SIMD prefix sum on Intel cpu

Question

I need to implement a prefix sum algorithm and would need it to be as fast as possible.
Ex:

[3, 1,  7,  0,  4,  1,  6,  3]

should give:

[3, 4, 11, 11, 15, 16, 22, 25]

Is there a way to do this using SSE SIMD CPU instruction?

My first idea is to sum each pair in parallel recursively until all sum have been computed like below!

//in parallel do 
for (int i = 0; i < z.length; i++) {
    z[i] = x[i << 1] + x[(i << 1) + 1];
}

To make the algorithm a little bit more clear, z is not the final output, but instead used to compute the output.

int[] w = computePrefixSum(z);
for (int i = 1; i < ouput.length; i++) {
    ouput[i] = (i % 2 == 0) ? (x[i] + ouput[i - 1]) :  w[(i - 1) >> 1];
}

It doesn't strike me as at all obvious that you're going to gain a lot of parallelism here -- each result value depends on all previous results, which pretty much defines a serial algorithm. — Jerry Coffin, May 14 '12 at 16:56
it doesn't if you look at the loop i copy pasted it will add 3 and 1 in parallel to adding 6 and 3 as well as 4 and 1 this should require log( N ) such pass over the input to complete te prefix sum but it should still be better then on serial pass — skyde, May 14 '12 at 16:59
For the right size of array, it might help a little, but given the degree to which the cache affects things like this, I wouldn't bet a lot on it. As an aside, your loop doesn't look right to me. It's saying `z[0] = x[0] + x[1]` and `z[1] = x[2] + x[3]`. Maybe you intended a right shift (and probably want to start `i` from `1` instead of `0`)? — Jerry Coffin, May 14 '12 at 17:07
Good question! There's a [classic paper](http://bwrc.eecs.berkeley.edu/classes/ee225c/2000%20225c/Papers/KoggeStone.pdf) which shows how to do this sort of thing in O(lg(n)) time and which inspired carry-lookahead adders, but I don't know how to best coax a CPU into effectively doing it. — Robert Cooper, May 14 '12 at 17:40
Related: https://www.intel.com/content/www/us/en/developer/articles/technical/optimize-scan-operations-explicit-vectorization.html - Intel article on manual vectorization with AVX-512, vs. `#pragma omp scan inclusive(varname)` (which ICC supports). — Peter Cordes, Jul 18 '22 at 18:35

Z boson · Accepted Answer · 2017-08-09T08:26:09.920

The fastest parallel prefix sum algorithm I know of is to run over the sum in two passes in parallel and use SSE as well in the second pass.

In the first pass you calculate partial sums in parallel and store the total sum for each partial sum. In the second pass you add the total sum from the preceding partial sum to the next partial sum. You can run both passes in parallel using multiple threads (e.g. with OpenMP). The second pass you can also use SIMD since a constant value is being added to each partial sum.

Assuming n elements of an array, m cores, and a SIMD width of w the time cost should be

n/m + n/(m*w) = (n/m)*(1+1/w)

Since the fist pass does not use SIMD the time cost will always be greater than n/m

For example for four cores with a SIMD_width of 4 (four 32bit floats with SSE) the cost would be 5n/16. Or about 3.2 times faster than sequential code which has a time cost of n. Using hyper threading the speed up will be greater still.

In special cases it's possible to use SIMD on the first pass as well. Then the time cost is simply

2*n/(m*w)

I posted the code for the general case which uses OpenMP for the threading and intrinsics for the SSE code and discuss details about the special case at the following link parallel-prefix-cumulative-sum-with-sse

Edit: I managed to find a SIMD version for the first pass which is about twice as fast as sequential code. Now I get a total boost of about 7 on my four core ivy bridge system.

Edit: For larger arrays one problem is that after the first pass most values have been evicted from the cache. I came up with a solution which runs in parallel inside a chunk but runs each chunk serially. The chunk_size is a value that should be tuned. For example I set it to 1MB = 256K floats. Now the second pass is done while the values are still inside the level-2 cache. Doing this gives a big improvement for large arrays.

Here is the code for SSE. The AVX code is about the same speed so I did not post it here. The function which does the prefix sum is scan_omp_SSEp2_SSEp1_chunk. Pass it an array a of floats and it fills the array s with the cumulative sum.

__m128 scan_SSE(__m128 x) {
    x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4))); 
    x = _mm_add_ps(x, _mm_shuffle_ps(_mm_setzero_ps(), x, 0x40)); 
    return x;
}

float pass1_SSE(float *a, float *s, const int n) {
    __m128 offset = _mm_setzero_ps();
    #pragma omp for schedule(static) nowait
    for (int i = 0; i < n / 4; i++) {
        __m128 x = _mm_load_ps(&a[4 * i]);
        __m128 out = scan_SSE(x);
        out = _mm_add_ps(out, offset);
        _mm_store_ps(&s[4 * i], out);
        offset = _mm_shuffle_ps(out, out, _MM_SHUFFLE(3, 3, 3, 3));
    }
    float tmp[4];
    _mm_store_ps(tmp, offset);
    return tmp[3];
}

void pass2_SSE(float *s, __m128 offset, const int n) {
    #pragma omp for schedule(static)
    for (int i = 0; i<n/4; i++) {
        __m128 tmp1 = _mm_load_ps(&s[4 * i]);
        tmp1 = _mm_add_ps(tmp1, offset);
        _mm_store_ps(&s[4 * i], tmp1);
    }
}

void scan_omp_SSEp2_SSEp1_chunk(float a[], float s[], int n) {
    float *suma;
    const int chunk_size = 1<<18;
    const int nchunks = n%chunk_size == 0 ? n / chunk_size : n / chunk_size + 1;
    //printf("nchunks %d\n", nchunks);
    #pragma omp parallel
    {
        const int ithread = omp_get_thread_num();
        const int nthreads = omp_get_num_threads();

        #pragma omp single
        {
            suma = new float[nthreads + 1];
            suma[0] = 0;
        }

        float offset2 = 0.0f;
        for (int c = 0; c < nchunks; c++) {
            const int start = c*chunk_size;
            const int chunk = (c + 1)*chunk_size < n ? chunk_size : n - c*chunk_size;
            suma[ithread + 1] = pass1_SSE(&a[start], &s[start], chunk);
            #pragma omp barrier
            #pragma omp single
            {
                float tmp = 0;
                for (int i = 0; i < (nthreads + 1); i++) {
                    tmp += suma[i];
                    suma[i] = tmp;
                }
            }
            __m128 offset = _mm_set1_ps(suma[ithread]+offset2);
            pass2_SSE(&s[start], offset, chunk);
            #pragma omp barrier
            offset2 = s[start + chunk-1];
        }
    }
    delete[] suma;
}

Does this hide the extra bypass-delay latency of using an integer shuffle (`_mm_slli_si128`) between FP adds? Gotta love SSE non-orthogonality, not having an FP shuffle that can zero an element like `pshufb` or `pslldq`. Anyway, if this doesn't saturate ports 1 and 5 (add and shuffle), you could unroll like I did in my integer single-thread solution. It took me a while to see it, since you split `scan` out into a separate function, but your pass1 is the same as what I did. Your `offset` matches my `carry`, in carrying the loop dependency between iterations. — Peter Cordes, Sep 10 '15 at 14:00
At the end of your `pass1_SSE`, you have `offset` holding a broadcast copy of the final prefix sum of that chunk. You store all 4 copies, then load the final one as the return value? /boggle. All you have to do is return the low element. `float _mm_cvtss_f32(m128)` exists to express this with intrinsics. It says it compiles to `movss`, but a smart compiler should just use `xmm0` for offset in the first place. — Peter Cordes, Sep 10 '15 at 14:07
I like the your idea of doing prefix sums on sub-arrays in parallel, then doing another pass once the end sums are known. I don't know OpenMP, so maybe you're already doing this, but you can skip the pass2 for `c=0`, because adding `0.0f` to every element is a no-op. This will only matter much for small problem sizes. Speaking of which, I thought cache blocking for ~ 1/2 L2 size was the usual suggestion. Your 1MiB chunks will give each core a buffer that exactly fills their entire L2, meaning some will be evicted for code, page tables, kernel data, etc. Do pass2 in reverse order maybe? — Peter Cordes, Sep 10 '15 at 14:19
@PeterCordes, thanks for your comments, sorry for the delayed response. I have been too busy for the last months but should be able to get back to SO finally next week. When I wrote this question I did not understand what memory bandwidth bound meant. I'm not sure running of chunks helped. For very large N I think it's entirely memory bandwidth bound anyway. I also did not really appreciate ports either when I did this. I have learned a lot since then. — Z boson, Sep 21 '15 at 09:59
Hmm yeah, it's pretty easy to be memory bound. If different CPUs generated parts of the array in the first place, and have them in L2, having them do their first-stage sums would be a win. — Peter Cordes, Sep 21 '15 at 16:22
I wrote an integer version of the scan_SSE building block, using 16-bit elements, as part of [this answer](http://stackoverflow.com/a/39379901/224132). That means it's three steps long, and does all of them with `_mm_bslli_si128()` 2, 4, 8. — Peter Cordes, Sep 07 '16 at 22:52

Gunther Piez · Answer 2 · 2013-10-23T22:28:50.930

You can exploit some minor parallelism for large register lengths and small sums. For instance, adding up 16 values of 1 byte (which happen to fit into one sse register) requires only log₂16 additions and an equal number of shifts.
Not much, but faster than 15 depended additions and the additional memory accesses.

__m128i x = _mm_set_epi8(3,1,7,0,4,1,6,3,3,1,7,0,4,1,6,3);
x = _mm_add_epi8(x, _mm_srli_si128(x, 1));
x = _mm_add_epi8(x, _mm_srli_si128(x, 2));
x = _mm_add_epi8(x, _mm_srli_si128(x, 4));
x = _mm_add_epi8(x, _mm_srli_si128(x, 8));

// x == 3, 4, 11, 11, 15, 16, 22, 25, 28, 29, 36, 36, 40, 41, 47, 50

If you have longer sums, the dependencies could be hidden by exploiting instruction level parallelism and taking advantage of instruction reordering.

Edit: something like

__m128i x0 = _mm_set_epi8(3,1,7,0,4,1,6,3,3,1,7,0,4,1,6,3);
__m128i x1 = _mm_set_epi8(3,1,7,0,4,1,6,3,3,1,7,0,4,1,6,3);
__m128i x2 = _mm_set_epi8(3,1,7,0,4,1,6,3,3,1,7,0,4,1,6,3);
__m128i x3 = _mm_set_epi8(3,1,7,0,4,1,6,3,3,1,7,0,4,1,6,3);

__m128i mask = _mm_set_epi8(0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0);

x0 = _mm_add_epi8(x0, _mm_srli_si128(x0, 1));
x1 = _mm_add_epi8(x1, _mm_srli_si128(x1, 1));
x2 = _mm_add_epi8(x2, _mm_srli_si128(x2, 1));
x3 = _mm_add_epi8(x3, _mm_srli_si128(x3, 1));

x0 = _mm_add_epi8(x0, _mm_srli_si128(x0, 2));
x1 = _mm_add_epi8(x1, _mm_srli_si128(x1, 2));
x2 = _mm_add_epi8(x2, _mm_srli_si128(x2, 2));
x3 = _mm_add_epi8(x3, _mm_srli_si128(x3, 2));

x0 = _mm_add_epi8(x0, _mm_srli_si128(x0, 4));
x1 = _mm_add_epi8(x1, _mm_srli_si128(x1, 4));
x2 = _mm_add_epi8(x2, _mm_srli_si128(x2, 4));
x3 = _mm_add_epi8(x3, _mm_srli_si128(x3, 4));

x0 = _mm_add_epi8(x0, _mm_srli_si128(x0, 8));
x1 = _mm_add_epi8(x1, _mm_srli_si128(x1, 8));
x2 = _mm_add_epi8(x2, _mm_srli_si128(x2, 8));
x3 = _mm_add_epi8(x3, _mm_srli_si128(x3, 8));

x1 = _mm_add_epi8(_mm_shuffle_epi8(x0, mask), x1);
x2 = _mm_add_epi8(_mm_shuffle_epi8(x1, mask), x2);
x3 = _mm_add_epi8(_mm_shuffle_epi8(x2, mask), x3);

I'd love to hear more about the "long sums" scenario. How can you exploit instruction level parallelism? — Igor ostrovsky, May 14 '12 at 19:34
@hirschhornsalz I don't understand the last three adds. I printed out the results. `x0 = [3 4 11 11 15 16 22 25 28 29 36 36 40 41 47 50]`. x1 should = x0 + 50 (the last element of x0). However your code is not doing that. It gives x1 = [6 8 22 22 30 32 44 50 56 58 72 72 80 82 94 100]. I think you want to broadcast the last element and add those. — Z boson, Oct 22 '13 at 07:37
@redrum Yes, of course you are right. I edited the broadcast in (without testing, I hope I got it right^^). — Gunther Piez, Oct 23 '13 at 09:58
The mask is wrong. Change it to `__m128i mask = _mm_set_epi8(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0);` and it works. — Z boson, Oct 23 '13 at 10:12
BTW, I mentioned you in my answer (to my own question) [parallel-prefix-cumulative-sum-with-sse](http://stackoverflow.com/questions/19494114/parallel-prefix-cumulative-sum-with-sse). I used your idea for bytes with 32-bit words with SSE, AVX, and AVX-512. — Z boson, Oct 23 '13 at 10:14
This code is backwards, because `_mm_set_epi8` takes its args in big-endian order. `_mm_setr_epi8` takes them in reverse order, which matches what you'd get from loading from `int8_t data[] = { 3,1,7, ...};` The shifts need to be left-shifts, and the broadcast does need to broadcast element 15. (`_mm_set1_epi8(15)`). — Peter Cordes, Sep 10 '15 at 05:02

score 9 · Answer 3 · edited Nov 24 '19 at 23:45

9

prefix-sum can be computed in parallel, it's actually one of the foundational algorithms in GPU programming. If you're using SIMD extensions on an Intel processor I'm not sure if doing it in parallel will actually benefit you much, but take a look at this paper from nvidia on implementing parallel prefix-sum (just look at the algorithms and ignore the CUDA): Parallel Prefix Sum (Scan) with CUDA.

edited Nov 24 '19 at 23:45

fospathi

537
1
6
7

answered May 14 '12 at 17:29

Dan

476
2
9

2

Nvidia should compare their GPU solution to my CPU solution. I'm confident the 20x advantage they claim for the GPU would be less that 5x for floats and likely even slower than the CPU for doubles with my code. – Z boson Oct 25 '13 at 09:36

Peter Cordes · Answer 4 · 2015-09-10T12:33:46.940

For an array of 1000 32bit integers, I was able to get a small speedup of about 1.4x single-threaded, using @hirschhornsalz's method in a loop on Intel Sandybridge. With a 60kiB buffer of ints, the speedup is about 1.37. With 8MiB of ints, the speedup is still 1.13. (i5-2500k at 3.8GHz turbo, with DDR3-1600.)

Smaller elements (int16_t or uint8_t, or the unsigned versions) would take an extra stage of shift/add for each doubling of the number of elements per vector. Overflow is bad, so don't try to use a data type that can't hold the sum of all elements, even though it gives SSE a bigger advantage.

#include <immintrin.h>

// In-place rewrite an array of values into an array of prefix sums.
// This makes the code simpler, and minimizes cache effects.
int prefix_sum_sse(int data[], int n)
{
//    const int elemsz = sizeof(data[0]);
#define elemsz sizeof(data[0])   // clang-3.5 doesn't allow compile-time-const int as an imm8 arg to intrinsics

    __m128i *datavec = (__m128i*)data;
    const int vec_elems = sizeof(*datavec)/elemsz;
    // to use this for int8/16_t, you still need to change the add_epi32, and the shuffle

    const __m128i *endp = (__m128i*) (data + n - 2*vec_elems);  // don't start an iteration beyond this
    __m128i carry = _mm_setzero_si128();
    for(; datavec <= endp ; datavec += 2) {
        IACA_START
        __m128i x0 = _mm_load_si128(datavec + 0);
        __m128i x1 = _mm_load_si128(datavec + 1); // unroll / pipeline by 1
//      __m128i x2 = _mm_load_si128(datavec + 2);
//      __m128i x3;

        x0 = _mm_add_epi32(x0, _mm_slli_si128(x0, elemsz));  // for floats, use shufps not bytewise-shift
        x1 = _mm_add_epi32(x1, _mm_slli_si128(x1, elemsz));

        x0 = _mm_add_epi32(x0, _mm_slli_si128(x0, 2*elemsz));
        x1 = _mm_add_epi32(x1, _mm_slli_si128(x1, 2*elemsz));

    // more shifting if vec_elems is larger

        x0 = _mm_add_epi32(x0, carry);  // this has to go after the byte-shifts, to avoid double-counting the carry.
        _mm_store_si128(datavec +0, x0); // store first to allow destructive shuffle (non-avx pshufb if needed)

        x1 = _mm_add_epi32(_mm_shuffle_epi32(x0, _MM_SHUFFLE(3,3,3,3)), x1);
        _mm_store_si128(datavec +1, x1);

        carry = _mm_shuffle_epi32(x1, _MM_SHUFFLE(3,3,3,3)); // broadcast the high element for next vector
    }
    // FIXME: scalar loop to handle the last few elements
    IACA_END
    return data[n-1];
    #undef elemsz
}

int prefix_sum_simple(int data[], int n)
{
    int sum=0;
    for (int i=0; i<n ; i++) {
        IACA_START
        sum += data[i];
        data[i] = sum;
    }
    IACA_END
    return sum;
}

// perl -we '$n=1000; sub rnlist($$) { return map { int rand($_[1]) } ( 1..$_[0] );}  @a=rnlist($n,127);   $"=", "; print "$n\n@a\n";'

int data[] = { 51, 83, 126, 11,   20, 63, 113, 102,
       126,67, 83, 113,   86, 123, 30, 109,
       97, 71, 109, 86,   67, 60,  47, 12,
       /* ... */ };


int main(int argc, char**argv)
{
    const int elemsz = sizeof(data[0]);
    const int n = sizeof(data)/elemsz;
    const long reps = 1000000 * 1000 / n;
    if (argc >= 2 && *argv[1] == 'n') {
        for (int i=0; i < reps ; i++)
            prefix_sum_simple(data, n);
    }else {
        for (int i=0; i < reps ; i++)
            prefix_sum_sse(data, n);
    }
    return 0;
}

Testing with n=1000, with the list compiled into the binary. (And yes, I checked that it's actually looping, not taking any compile-time shortcuts that make the vector or non-vector test meaningless.)

Note that compiling with AVX to get 3-operand non-destructive vector instructions saves a lot of movdqa instructions, but only saves a tiny amount of cycles. This is because shuffle and vector-int-add can both only run on ports 1 and 5, on SnB/IvB, so port0 has plenty of spare cycles to run the mov instructions. uop-cache throughput bottlenecks might be the reason why the non-AVX version is slightly slower. (All those extra mov instructions push us up to 3.35 insn/cycle). The frontend is only idle 4.54% of cycles, so it's barely keeping up.

gcc -funroll-loops -DIACA_MARKS_OFF -g -std=c11 -Wall -march=native -O3 prefix-sum.c -mno-avx -o prefix-sum-noavx

  # gcc 4.9.2

################# SSE (non-AVX) vector version ############
$ ocperf.py stat -e task-clock,cycles,instructions,uops_issued.any,uops_dispatched.thread,uops_retired.all,uops_retired.retire_slots,stalled-cycles-frontend,stalled-cycles-backend ./prefix-sum-noavx 
perf stat -e task-clock,cycles,instructions,cpu/event=0xe,umask=0x1,name=uops_issued_any/,cpu/event=0xb1,umask=0x1,name=uops_dispatched_thread/,cpu/event=0xc2,umask=0x1,name=uops_retired_all/,cpu/event=0xc2,umask=0x2,name=uops_retired_retire_slots/,stalled-cycles-frontend,stalled-cycles-backend ./prefix-sum-noavx

 Performance counter stats for './prefix-sum-noavx':

        206.986720      task-clock (msec)         #    0.999 CPUs utilized          
       777,473,726      cycles                    #    3.756 GHz                    
     2,604,757,487      instructions              #    3.35  insns per cycle        
                                                  #    0.01  stalled cycles per insn
     2,579,310,493      uops_issued_any           # 12461.237 M/sec
     2,828,479,147      uops_dispatched_thread    # 13665.027 M/sec
     2,829,198,313      uops_retired_all          # 13668.502 M/sec (unfused domain)
     2,579,016,838      uops_retired_retire_slots # 12459.818 M/sec (fused domain)
        35,298,807      stalled-cycles-frontend   #    4.54% frontend cycles idle   
         1,224,399      stalled-cycles-backend    #    0.16% backend  cycles idle   

       0.207234316 seconds time elapsed
------------------------------------------------------------


######### AVX (same source, but built with -mavx).  not AVX2 #########
$ ocperf.py stat -e task-clock,cycles,instructions,uops_issued.any,uops_dispatched.thread,uops_retired.all,uops_retired.retire_slots,stalled-cycles-frontend,stalled-cycles-backend ./prefix-sum-avx

 Performance counter stats for './prefix-sum-avx':

        203.429021      task-clock (msec)         #    0.999 CPUs utilized          
       764,859,441      cycles                    #    3.760 GHz                    
     2,079,716,097      instructions              #    2.72  insns per cycle        
                                                  #    0.12  stalled cycles per insn
     2,054,334,040      uops_issued_any           # 10098.530 M/sec                  
     2,303,378,797      uops_dispatched_thread    # 11322.764 M/sec                  
     2,304,140,578      uops_retired_all          # 11326.509 M/sec                  
     2,053,968,862      uops_retired_retire_slots # 10096.735 M/sec                  
       240,883,566      stalled-cycles-frontend   #   31.49% frontend cycles idle   
         1,224,637      stalled-cycles-backend    #    0.16% backend  cycles idle   

       0.203732797 seconds time elapsed
------------------------------------------------------------


################## scalar version (cmdline arg) #############    
$ ocperf.py stat -e task-clock,cycles,instructions,uops_issued.any,uops_dispatched.thread,uops_retired.all,uops_retired.retire_slots,stalled-cycles-frontend,stalled-cycles-backend ./prefix-sum-avx n

 Performance counter stats for './prefix-sum-avx n':

        287.567070      task-clock (msec)         #    0.999 CPUs utilized          
     1,082,611,453      cycles                    #    3.765 GHz                    
     2,381,840,355      instructions              #    2.20  insns per cycle        
                                                  #    0.20  stalled cycles per insn
     2,272,652,370      uops_issued_any           # 7903.034 M/sec                  
     4,262,838,836      uops_dispatched_thread    # 14823.807 M/sec                  
     4,256,351,856      uops_retired_all          # 14801.249 M/sec                  
     2,256,150,510      uops_retired_retire_slots # 7845.650 M/sec                  
       465,018,146      stalled-cycles-frontend   #   42.95% frontend cycles idle   
         6,321,098      stalled-cycles-backend    #    0.58% backend  cycles idle   

       0.287901811 seconds time elapsed

------------------------------------------------------------

Haswell should be about the same, but maybe slightly slower per-clock, because shuffle can only run on port 5, not port 1. (vector-int add is still p1/5 on Haswell.)

OTOH, IACA thinks Haswell will be slightly faster than SnB for one iteration, if you compile without -funroll-loops (which does help on SnB). Haswell can do branches on port6, but on SnB branches are on port5, which we already saturate.

 # compile without -DIACA_MARKS_OFF
$ iaca -64 -mark 1 -arch HSW prefix-sum-avx    
Intel(R) Architecture Code Analyzer Version - 2.1
Analyzed File - prefix-sum-avx
Binary Format - 64Bit
Architecture  - HSW
Analysis Type - Throughput

*******************************************************************
Intel(R) Architecture Code Analyzer Mark Number 1
*******************************************************************

Throughput Analysis Report
--------------------------
Block Throughput: 6.20 Cycles       Throughput Bottleneck: Port5

Port Binding In Cycles Per Iteration:
---------------------------------------------------------------------------------------
|  Port  |  0   -  DV  |  1   |  2   -  D   |  3   -  D   |  4   |  5   |  6   |  7   |
---------------------------------------------------------------------------------------
| Cycles | 1.0    0.0  | 5.8  | 1.4    1.0  | 1.4    1.0  | 2.0  | 6.2  | 1.0  | 1.3  |
---------------------------------------------------------------------------------------

N - port number or number of cycles resource conflict caused delay, DV - Divider pipe (on port 0)
D - Data fetch pipe (on ports 2 and 3), CP - on a critical path
F - Macro Fusion with the previous instruction occurred
* - instruction micro-ops not bound to a port
^ - Micro Fusion happened
# - ESP Tracking sync uop was issued
@ - SSE instruction followed an AVX256 instruction, dozens of cycles penalty is expected
! - instruction not supported, was not accounted in Analysis

| Num Of |                    Ports pressure in cycles                     |    |
|  Uops  |  0  - DV  |  1  |  2  -  D  |  3  -  D  |  4  |  5  |  6  |  7  |    |
---------------------------------------------------------------------------------
|   1    |           |     | 1.0   1.0 |           |     |     |     |     |    | vmovdqa xmm2, xmmword ptr [rax]
|   1    | 1.0       |     |           |           |     |     |     |     |    | add rax, 0x20
|   1    |           |     |           | 1.0   1.0 |     |     |     |     |    | vmovdqa xmm3, xmmword ptr [rax-0x10]
|   1    |           |     |           |           |     | 1.0 |     |     | CP | vpslldq xmm1, xmm2, 0x4
|   1    |           | 1.0 |           |           |     |     |     |     |    | vpaddd xmm2, xmm2, xmm1
|   1    |           |     |           |           |     | 1.0 |     |     | CP | vpslldq xmm1, xmm3, 0x4
|   1    |           | 1.0 |           |           |     |     |     |     |    | vpaddd xmm3, xmm3, xmm1
|   1    |           |     |           |           |     | 1.0 |     |     | CP | vpslldq xmm1, xmm2, 0x8
|   1    |           | 1.0 |           |           |     |     |     |     |    | vpaddd xmm2, xmm2, xmm1
|   1    |           |     |           |           |     | 1.0 |     |     | CP | vpslldq xmm1, xmm3, 0x8
|   1    |           | 1.0 |           |           |     |     |     |     |    | vpaddd xmm3, xmm3, xmm1
|   1    |           | 0.9 |           |           |     | 0.2 |     |     | CP | vpaddd xmm1, xmm2, xmm0
|   2^   |           |     |           |           | 1.0 |     |     | 1.0 |    | vmovaps xmmword ptr [rax-0x20], xmm1
|   1    |           |     |           |           |     | 1.0 |     |     | CP | vpshufd xmm1, xmm1, 0xff
|   1    |           | 0.9 |           |           |     | 0.1 |     |     | CP | vpaddd xmm0, xmm1, xmm3
|   2^   |           |     | 0.3       | 0.3       | 1.0 |     |     | 0.3 |    | vmovaps xmmword ptr [rax-0x10], xmm0
|   1    |           |     |           |           |     | 1.0 |     |     | CP | vpshufd xmm0, xmm0, 0xff
|   1    |           |     |           |           |     |     | 1.0 |     |    | cmp rax, 0x602020
|   0F   |           |     |           |           |     |     |     |     |    | jnz 0xffffffffffffffa3
Total Num Of Uops: 20

BTW, gcc compiled the loop to use a one-register addressing mode even when I had a loop counter and was doing load(datavec + i + 1). That's the best code, esp. on SnB-family where 2-register addressing modes can't micro-fuse, so I change the source to that loop condition for the benefit of clang.

@GlennSlayden: See [Micro fusion and addressing modes](//stackoverflow.com/q/26046634) for more details about it. My answer there has newer info that Intel didn't document back when I wrote this answer. As well as more details about what it actually is. — Peter Cordes, Jun 15 '19 at 18:59

score 5 · Answer 5 · answered Oct 05 '21 at 14:36

NOTE: in C++ standard "prefix sum" is called "inclusive scan", so this is what we call it.

We have ported and generalised the SIMD part of @Z bozon 's answer (Thanks for the fantastic work!) to all x86 (sse - avx512) and arm (neon/aarch-64) in eve library. It's open source and MIT licensed.

NOTE: we support only version of AVX-512 that matches skylake-avx512. If your machine does not support all the requirements, we will use avx2.

We also support doing this over parallel arrays, so for example you can do an inclusive scan for complex numbers: example.

Here you can see the asm we generate for different architectures (you can change the T type for different types): godbolt. In case the links get stale, arm-64, avx-2.

Here are some numbers for int, for different x86 architectures, compared to the scalar code on 10'000 bytes of data. Processor intel-9700k.

NOTE: unfortunately, no benchmarks for arm at this point.

On sse2-sse4 we are about 1.6-1.7 times faster, on avx2 - about 2 times.

The algorithm's limit is cardinal / log(cardinal) - so 4 / 2 - 2 times on sse2-sse4.2 and 8 / 3 - 2.7 times on avx2 - we are not too far off.

What about threads?

In eve library we don't directly deal with threads, however we have inclusive_scan and transform which are the building blocks to do the parallel version.

Here is my sketch of parallel / vectorised version on top of eve. You'd need some decent threading library though, I used std::async/std::future in the example, which are bad.

Other relevant capabilities

You can use inclusive_scan_to instead of inclusive_scan_inplace, no problem if you want to keep the original data. inlcusive_scan_to

We also have support different types (same way as the standard inclusive scan) - so you can sum floats to doubles and similar. floats to doubles

We support custom plus operations, so you can use min if you want to. I already mentioned zip and ability to scan multiple arrays in one go. example

Feel free to create an issue on the library if you want to try it out and need help.

Your code (and std::inclusive_scan) got *slower* when enabling SSE4.2? That's not good. What compiler / version / options did you use? — Peter Cordes, Oct 05 '21 at 17:58
I don't believe that's real: there is a very small difference. Probably just side effects of running on a real machine. Though there is some asm diff: https://godbolt.org/z/r9rY1o9Es — Denis Yaroshevskiy, Oct 05 '21 at 19:15

SIMD prefix sum on Intel cpu

5 Answers5

What about threads?

Other relevant capabilities

Linked