How to Shuffle a Vector128 and Add the elements, then Extract a scalar value properly?

Question

I am using Vector128<byte> in C# in order to count matches from a byte array with 16 index.

This is part of implementing a byte version of Micro Optimization of a 4-bucket histogram of a large array or list, using the technique from How to count character occurrences using SIMD of widening 8-bit counters to 64 inside an outer loop (hsum_epu8_epu64 helper function), and then after all the loops of summing that vector of counters down to one scalar (hsum_epu64_scalar).

So that C++ with Intel intrinsics has to be ported to C#. And without AVX2 so we're using 128-bit integer vectors, not 256.

The byte array consists of numbers 0 and 1 where 5 0 occurs.

The task now is to count those 5 0 where we can see that 2 of the 0 occurs in the upperband of the Vector128<byte> and 3 of the 0 occurs in the lowerband of the Vector128<byte>.

I have succeed with code all the way to where I Sse2.SumAbsoluteDifferences and can extract the number of 0 for sumHigh and sumLow showing 3 and 2 respectively.

The problem starts now where I need to shuffle so the upperband and lowerband change places so I later can extract the opposites in: sumHigh and sumLow for sum64b

I have put a lot of comments in the code also so I think it is possible to follow the code and see there also exactly how I try to shuffle and complete the code.

(The code shows also that my AMD K10 processor supports: Sse, Sse2, Sse3)

    using System.Runtime.Intrinsics;
    using System.Runtime.Intrinsics.X86;

    private void button2_Click(object sender, EventArgs e)
    {
        //This shows what is supported on my processor. However it seems that I could use something from "Avx" anyway
        bool avx = Avx.IsSupported; //false
        bool avx2 = Avx2.IsSupported; //false

        bool sse = Sse.IsSupported; //true
        bool sse2 = Sse2.IsSupported; //true
        bool sse3 = Sse3.IsSupported; //true
        bool ssse3 = Ssse3.IsSupported; //false
        bool sse41 = Sse41.IsSupported; //false
        bool sse42 = Sse42.IsSupported; //false

        //Create a bytearray of 16 indexes. As seen: '0' occur 2 times in the upper band and 3 times in the lower band
        //We want to count those "0" in the below code
        byte[] v1 = new byte[16];
        v1[0] = 0; v1[1] = 0; v1[2] = 1; v1[3] = 1; v1[4] = 1; v1[5] = 1; v1[6] = 1; v1[7] = 1;
        v1[8] = 1; v1[9] = 0; v1[10] = 0; v1[11] = 0; v1[12] = 1; v1[13] = 1; v1[14] = 1; v1[15] = 1;

        Vector128<byte> counts = Vector128<byte>.Zero;
        unsafe
        {
            fixed (byte* fixedInput = v1)
            {
                //Load byte Vector with 16 indexes
                var v = Avx.LoadVector128(&fixedInput[0]);

                //Now match how many "0" we can find in "Vector128: v". 'counts' show the result string where: '1' tells where we found: "0".
                //As seen it happened as expected total times: 5 (2 times in the upper band and 3 times in the lower band of the Vector)
                byte val = 0;
                var match = Avx.CompareEqual(v, Vector128.Create(val));
                counts = Avx.Subtract(counts, match); //counts: <1,1,0,0,0,0,0,0,0,1,1,1,0,0,0,0>



                //Extract high/low bands
                //So we use "SumAbsoluteDifferences" to "Separately sum the 8 low differences and 8 high differences to produce two unsigned word integer results."
                //We can see on index 0: 2 and on index 4: 3
                Vector128<ushort> sum64 = Vector128<ushort>.Zero;
                sum64 = Sse2.Add(sum64, Sse2.SumAbsoluteDifferences(counts, Vector128<byte>.Zero)); //sum64: <2,0,0,0,3,0,0,0>

                //I AM NOT SURE OF THE CODE BELOW HOW TO DO IT PROPERLY!

                //Now I need to shuffle the above: "<2,0,0,0,3,0,0,0>" but are not sure of how the complete process is to do this correctly?
                //Below is a start of an "attempt" but are not sure how to do this all the way correctly?

                Vector128<uint> result = Sse2.Shuffle(sum64.AsUInt32(), 0xB1); 

                //Extract high/low bands from ther shuffle above?
                //Vector128<uint> sum64b = Vector128<uint>.Zero;
                //sum64b = Sse2.Add(sum64b, result);
                //sumHigh = Sse2.Extract(sum64b, 1); //0
                //sumLow = Sse2.Extract(sum64b, 0); //                                                                    
            }
        }
    }

Using 16-bit extracts would be possible but not usable for larger counts.

    var sumHigh = Sse2.Extract(sum64, 4);    // pextrw
    var sumLow = Sse2.Extract(sum64, 0); //sumHigh == 3 and sumLow == 2
    var sumScalar = SumLow + sumHigh;

Note from @PeterCordes: the real use-case would loop to add up to 255 vectors into counts, then in an outer loop accumulate into wide elements in sum64 with Sse2.SumAbsoluteDifferences and Sse2.Add, and reset counts. That part looks correct in this C# port, except that sum64 should not use ushort elements.

The part this question is asking about is the horizontal sum of two 64-bit vector elements down to one scalar integer. (The real use-case has three vectors of counts, from 3 histogram buckets; a transpose and sum could work but simply doing separate horizontal sums for each vector is fine.)

`sum64` should be a `Vector128` or `Vector128`. Using `Sse2.Extract(sum64, 4);` (`pextrw`) to extra a 16-bit chunk is not going to work for your real use-case because counts can be larger than 65535. — Peter Cordes, Apr 13 '20 at 15:24
`Vector128 srcAsUInt32 = counts.AsUInt32();` - this should be `sum64.AsUInt32();` You were taking the 8-bit count buckets, not the `psadbw` result. — Peter Cordes, Apr 13 '20 at 15:28
@Peter Cordes Thanks again for the help with the question! Yes you are right that is what I can see also. The problem where I get stuck is that `Sse2.Extract(` only can return an `ushort` and the that function can also only accept `Vector128` as a parameter. I am not sure but it seems that I get stuck there because I like to use `` for example so no overflow can occur. From the code: `Sse2.Shuffle(` down to `Sse2.Extract(` I can't even find a combination of types that will work because they accept different types. — Andreas, Apr 13 '20 at 15:54
`sum64.AsUInt32();` Yes ofcourse that is true. That must be the one to pass there. I did a mistake there also. — Andreas, Apr 13 '20 at 15:56
You might want to fix that bug in your question since it distracts from what you're asking about. But anyway, so apparently don't use `Extract`. Use a vector to vector shuffle and Add. Then surely there's some way to get the low element other than `Extract`? In C++ there are special intrinsics for getting the low element as a scalar, other than extract. Otherwise at worst you could transpose and add two `sum64` vectors (from two buckets) with unpacklo/unpackhi (`punpcklqdq`) for a vector store into an array of `Uint64`. Or even worse just store into a tmp array and reload as scalar. — Peter Cordes, Apr 13 '20 at 16:00
Yes perheps this actually worked. Lets see what you think. I am stuck with that `Sse2.SumAbsoluteDifferences` must return a `ushort` so `sum64` is a `ushort`. But then I used the `Unpack` like this and in the messageBox I could convert to Scalar and get the numbers `3` and `2`: .... `Vector128 upper = Sse2.UnpackHigh(sum64, sum64); Vector128 lower = Sse2.UnpackLow(sum64, sum64); MessageBox.Show(upper.ToScalar() + ":" + lower.ToScalar());` Could this be a valid solution. Then I will pass `3` and `2` to `Int64` counters to not overflow and then reset after each `255` iteration? — Andreas, Apr 13 '20 at 16:27
Yes, UnpackHigh looks useful, but you definitely want to use it on a `Vector128`, not to interleave 16-bit `ushort` elements. And you can do a SIMD Sse2.Add instead of 2 separate `ToScalar()`. But yes, `.ToScalar()` is presumably the intrinsic for `movd / movq`, so yes that's the right building block to use instead of `.Extract` (`pextrw`). — Peter Cordes, Apr 13 '20 at 16:52
Also, `Sse2.UnpackLow(sum64, sum64)` is pointless. The low element is already at the bottom, where you want it. And a `ushort` unpack is just going to produce `(2<<16) | 2` in the low 32 bits when it interleaves the `ushort` elements from the low halves of the two vectors. To get the right shuffle, you have to use the right method on the right element-type vector with C#, unlike C++ where everything is `__m128i` and the shuffle element size is part of the intrinsic name. — Peter Cordes, Apr 13 '20 at 16:55
I did cast the `SumAbsoluteDifferences` function to `.AsUInt64()` so I can go all the way down with that instead and change all types to `UInt64` — Andreas, Apr 13 '20 at 17:16
I am not sure why `Sse2.UnpackLow(sum64, sum64)` is pointless. The only thing I can understand there is that I use that method to get the value from the lower band? (I am not sure if the scenario changed when using only `UInt64` now) — Andreas, Apr 13 '20 at 17:16
Look at the example diagram in the asm manual for [`punpcklbw`](https://www.felixcloutier.com/x86/punpcklbw:punpcklwd:punpckldq:punpcklqdq) - note that the low element stays the low element. For 64-bit elements, you're just duplicating the low element to the upper element when you unpack against itself, but you never read that high SIMD element. https://www.officedaytime.com/simd512e/ also has some diagrams and categorization of instructions by what they do. — Peter Cordes, Apr 13 '20 at 17:52
*so I can go all the way down with that instead and change all types to UInt64* - yes, after SAD, you should treat that vector as having Uint64 elements from that point onwards. Messing around with 16-bit chunks of those 64-bit numbers is the mistake you need to avoid. — Peter Cordes, Apr 13 '20 at 17:56
Yes I looked at the `unpack` in the link there and the "red"/low element color stays at the low from the beginning. But I am not sure how to return the lower value without using: `Sse2.UnpackLow`. Should I use any other syntax on `sum64` to return this value? I think I miss something here how to do that. Perheps I should only use: `sum64.ToScalar()` as it return `2` actually which is the lower element? — Andreas, Apr 13 '20 at 18:36
That is great. So I will always think of using `Uint64` so I don't mix around with different types. — Andreas, Apr 13 '20 at 18:37
Yes, `vec.ToScalar()` returns the low element of the vector. https://learn.microsoft.com/en-us/dotnet/api/system.runtime.intrinsics.vector128.toscalar?view=netcore-3.1 For a Uint64 vector that's the low 64 bits. IDK why you thought UnpackLow would be necessary for ToScalar to be usable or to get the desired value, but it's not. — Peter Cordes, Apr 13 '20 at 19:13
Yes that did return the lower element. I was not sure it actually did that but could see that in the comment on the syntax itself also. Now I know that also. That is great. — Andreas, Apr 13 '20 at 19:29

Andreas · Answer 1 · 2020-04-13T20:18:03.057

1

This should be an answer of how to count how many 0 in the upper and lower elements of the v1 byte array.

Answer will be:
lower elements: 2
higher elements: 3

So first Sse2.SumAbsoluteDifferences is used to:
Sum the 8 low differences and 8 high differences to produce two unsigned word integer results
Then we can Sse2.UnpackHigh the upper elements

Use sum64.ToScalar() to get the lower elements because scalar cointains the value of the first element.

private void button2_Click(object sender, EventArgs e)
{
    //Create a bytearray of 16 indexes. As seen: '0' occur 2 times in the upper band and 3 times in the lower band
    //We want to count those "0" in the below code
    byte[] v1 = new byte[16];
    v1[0] = 0; v1[1] = 0; v1[2] = 1; v1[3] = 1; v1[4] = 1; v1[5] = 1; v1[6] = 1; v1[7] = 1;
    v1[8] = 1; v1[9] = 0; v1[10] = 0; v1[11] = 0; v1[12] = 1; v1[13] = 1; v1[14] = 1; v1[15] = 1;

    Vector128<byte> counts = Vector128<byte>.Zero;
    unsafe
    {
        fixed (byte* fixedInput = v1)
        {
            //Load byte Vector with 16 indexes
            var v = Avx.LoadVector128(&fixedInput[0]);

            //Now match how many "0" we can find in "Vector128: v". 'counts' show the result string where: '1' tells where we found: "0".
            //As seen it happened as expected total times: 5 (2 times in the upper band and 3 times in the lower band of the Vector)
            byte val = 0;
            var match = Avx.CompareEqual(v, Vector128.Create(val));
            counts = Avx.Subtract(counts, match); //counts: <1,1,0,0,0,0,0,0,0,1,1,1,0,0,0,0>

            //SumAbsoluteDifferences
            Vector128<UInt64> sum64 = Vector128<UInt64>.Zero;
            sum64 = Sse2.Add(sum64, Sse2.SumAbsoluteDifferences(counts, Vector128<byte>.Zero).AsUInt64()); //sum64: <2,0,0,0,3,0,0,0>

            //UnpackHigh and add the lower,upper element from the Vector128<UInt64>
            //var lower = sum64; // low element already where we want it
            UInt64 upper = Sse2.UnpackHigh(sum64, sum64).ToScalar(); //3
            Uint64 total_matches_of_0 = Sse2.Add(sum64, upper).ToScalar(); //2 + 3
        }
    }
}

edited Apr 13 '20 at 20:18

answered Apr 13 '20 at 18:58

Andreas

1,121
4
17
34

1

You don't actually want separate upper and lower results, you just want their sum. So you could do that before ToScalar, like `Uint64 total_matches_of_0 = Sse2.Add(sum64, upper).ToScalar();` to get `5`. Like `hsum_epu64_scalar` does with a SIMD add before one final `_mm_cvtsi128_si64` (aka ToScalar). – Peter Cordes Apr 13 '20 at 19:20
That was even better. I changed to that line as you mentioned there. Perheps it can still show in the code where the lower comes from just for demonstrating purposes. Now I think I can start experimenting with this code and see how I can create the 2 outer loops and go back to the other thread with the incomplete question later. It was great sorting this problem out. Thank you! – Andreas Apr 13 '20 at 19:33
1

Sure, `var lower = sum64; // low element already where we want it` – Peter Cordes Apr 13 '20 at 19:34
Yes it is possible to write it like that. Nice. I changed to that line in the code above also. Thanks again! I will start experimenting a little to see what I can do :) – Andreas Apr 13 '20 at 19:37
/facepalm. That only makes sense as an input to `Sse2.Add(lower, upper).ToScalar();` But you left out `Sse2.Add` so now this answer produces a scalar Uint64 from the high half, and a `Vector lower`. – Peter Cordes Apr 13 '20 at 19:48
That was true, I first thought is was a simplifier but yes that was `=` same ofcourse. – Andreas Apr 13 '20 at 20:04
Ok, now your answer is back to not bothering to produce the total answer, just two partial results. Using `Sse2.Add()` is more efficient than two `.ToScalar()`. Of course that's mostly irrelevant for this cleanup after the outer loop, but if you're thinking in SIMD terms then it's the natural way to do the final step. – Peter Cordes Apr 13 '20 at 20:05
Yes lets add that line after all. I just left out the other one to clearly see where the low value comes from but this is better as a complete approach then. – Andreas Apr 13 '20 at 20:10
Like I said, `var lower = sum64; // low element already where we want it` could clearly document that before `Sse2.Add(lower, upper).ToScalar();` – Peter Cordes Apr 13 '20 at 20:12
Yes I added that as a comment before `Sse2.Add(` so this will not actually be declared since it will be an unnecessary extra line then. – Andreas Apr 13 '20 at 20:19

How to Shuffle a Vector128 and Add the elements, then Extract a scalar value properly?

1 Answers1