I am using Vector128<byte> in C# in order to count matches from a byte array with 16 index.
This is part of implementing a byte version of Micro Optimization of a 4-bucket histogram of a large array or list, using the technique from How to count character occurrences using SIMD of widening 8-bit counters to 64 inside an outer loop (hsum_epu8_epu64 helper function), and then after all the loops of summing that vector of counters down to one scalar (hsum_epu64_scalar).
So that C++ with Intel intrinsics has to be ported to C#. And without AVX2 so we're using 128-bit integer vectors, not 256.
The byte array consists of numbers 0 and 1 where 5 0 occurs.
The task now is to count those 5 0 where we can see that 2 of the 0 occurs in the upperband of the Vector128<byte> and 3 of the 0 occurs in the lowerband of the Vector128<byte>.
I have succeed with code all the way to where I Sse2.SumAbsoluteDifferences and can extract the number of 0 for sumHigh and sumLow showing 3 and 2 respectively.
The problem starts now where I need to shuffle so the upperband and lowerband change places so I later can extract the opposites in: sumHigh and sumLow for sum64b
I have put a lot of comments in the code also so I think it is possible to follow the code and see there also exactly how I try to shuffle and complete the code.
(The code shows also that my AMD K10 processor supports: Sse, Sse2, Sse3)
    using System.Runtime.Intrinsics;
    using System.Runtime.Intrinsics.X86;
    private void button2_Click(object sender, EventArgs e)
    {
        //This shows what is supported on my processor. However it seems that I could use something from "Avx" anyway
        bool avx = Avx.IsSupported; //false
        bool avx2 = Avx2.IsSupported; //false
        bool sse = Sse.IsSupported; //true
        bool sse2 = Sse2.IsSupported; //true
        bool sse3 = Sse3.IsSupported; //true
        bool ssse3 = Ssse3.IsSupported; //false
        bool sse41 = Sse41.IsSupported; //false
        bool sse42 = Sse42.IsSupported; //false
        //Create a bytearray of 16 indexes. As seen: '0' occur 2 times in the upper band and 3 times in the lower band
        //We want to count those "0" in the below code
        byte[] v1 = new byte[16];
        v1[0] = 0; v1[1] = 0; v1[2] = 1; v1[3] = 1; v1[4] = 1; v1[5] = 1; v1[6] = 1; v1[7] = 1;
        v1[8] = 1; v1[9] = 0; v1[10] = 0; v1[11] = 0; v1[12] = 1; v1[13] = 1; v1[14] = 1; v1[15] = 1;
        Vector128<byte> counts = Vector128<byte>.Zero;
        unsafe
        {
            fixed (byte* fixedInput = v1)
            {
                //Load byte Vector with 16 indexes
                var v = Avx.LoadVector128(&fixedInput[0]);
                //Now match how many "0" we can find in "Vector128: v". 'counts' show the result string where: '1' tells where we found: "0".
                //As seen it happened as expected total times: 5 (2 times in the upper band and 3 times in the lower band of the Vector)
                byte val = 0;
                var match = Avx.CompareEqual(v, Vector128.Create(val));
                counts = Avx.Subtract(counts, match); //counts: <1,1,0,0,0,0,0,0,0,1,1,1,0,0,0,0>
                //Extract high/low bands
                //So we use "SumAbsoluteDifferences" to "Separately sum the 8 low differences and 8 high differences to produce two unsigned word integer results."
                //We can see on index 0: 2 and on index 4: 3
                Vector128<ushort> sum64 = Vector128<ushort>.Zero;
                sum64 = Sse2.Add(sum64, Sse2.SumAbsoluteDifferences(counts, Vector128<byte>.Zero)); //sum64: <2,0,0,0,3,0,0,0>
                //I AM NOT SURE OF THE CODE BELOW HOW TO DO IT PROPERLY!
                //Now I need to shuffle the above: "<2,0,0,0,3,0,0,0>" but are not sure of how the complete process is to do this correctly?
                //Below is a start of an "attempt" but are not sure how to do this all the way correctly?
                Vector128<uint> result = Sse2.Shuffle(sum64.AsUInt32(), 0xB1); 
                //Extract high/low bands from ther shuffle above?
                //Vector128<uint> sum64b = Vector128<uint>.Zero;
                //sum64b = Sse2.Add(sum64b, result);
                //sumHigh = Sse2.Extract(sum64b, 1); //0
                //sumLow = Sse2.Extract(sum64b, 0); //                                                                    
            }
        }
    }
Using 16-bit extracts would be possible but not usable for larger counts.
    var sumHigh = Sse2.Extract(sum64, 4);    // pextrw
    var sumLow = Sse2.Extract(sum64, 0); //sumHigh == 3 and sumLow == 2
    var sumScalar = SumLow + sumHigh;
Note from @PeterCordes: the real use-case would loop to add up to 255 vectors into counts, then in an outer loop accumulate into wide elements in sum64 with Sse2.SumAbsoluteDifferences and Sse2.Add, and reset counts.  That part looks correct in this C# port, except that sum64 should not use ushort elements.
The part this question is asking about is the horizontal sum of two 64-bit vector elements down to one scalar integer. (The real use-case has three vectors of counts, from 3 histogram buckets; a transpose and sum could work but simply doing separate horizontal sums for each vector is fine.)
