I am trying to boost the performance of a .NET Core library by using System.Numerics to perform SIMD operations on float[] arrays. System.Numerics is a bit funky right now, and I'm having a hard time seeing how it can be beneficial. I understand that to see a performance boost with SIMD, the overhead has to be amortized over a large amount of computation, but given how it is currently implemented, I can't figure out how to accomplish this.
Vector<float> requires exactly Vector<float>.Count elements - 8 floats on my machine - no more, no less. If I want to perform SIMD operations on a group of values smaller than 8, I am forced to copy the values to a new array and pad the remainder with zeroes. If the group is larger than 8, I need to copy the values, pad with zeroes so the length is a multiple of 8, and then loop over the data in chunks of 8. The length requirement makes sense, but accommodating it seems like a good way to nullify any performance gain.
I have written a test wrapper class that takes care of the padding and alignment:
using System;
using System.Numerics;

public readonly struct VectorWrapper<T>
  where T : unmanaged
{
  #region Data Members
  public readonly int Length;
  private readonly T[] data_;
  #endregion
  #region Constructor
  public VectorWrapper( T[] data )
  {
    Length = data.Length;
    var stepSize = Vector<T>.Count;
    // Round the buffer up to the next full multiple of Vector<T>.Count
    // (always adds at least one extra vector's worth of zero padding).
    var bufferedLength = data.Length - ( data.Length % stepSize ) + stepSize;
    data_ = new T[ bufferedLength ];
    data.CopyTo( data_, 0 );
  }
  #endregion
  #region Public Methods
  public T[] ToArray()
  {
    var returnData = new T[ Length ];
    data_.AsSpan( 0, Length ).CopyTo( returnData );
    return returnData;
  }
  #endregion
  #region Operators
  public static VectorWrapper<T> operator +( VectorWrapper<T> l, VectorWrapper<T> r )
  {
    var resultLength = l.Length;
    var result = new VectorWrapper<T>( new T[ l.Length ] );
    var lSpan = l.data_.AsSpan();
    var rSpan = r.data_.AsSpan();
    var stepSize = Vector<T>.Count;
    // The zero padding added in the constructor guarantees every slice below
    // has at least Vector<T>.Count elements, so the last chunk never overruns.
    for( var i = 0; i < resultLength; i += stepSize )
    {
      var lVec = new Vector<T>( lSpan.Slice( i ) );
      var rVec = new Vector<T>( rSpan.Slice( i ) );
      Vector.Add( lVec, rVec ).CopyTo( result.data_, i );
    }
    return result;
  }
  #endregion
}
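Usage is straightforward - wrap the arrays, use the operator, unwrap the result. A trivial made-up example:

// Wrap two float arrays, add them with the SIMD-backed operator,
// then copy the result back out to a plain float[].
var a = new VectorWrapper<float>( new float[] { 1f, 2f, 3f, 4f, 5f } );
var b = new VectorWrapper<float>( new float[] { 10f, 20f, 30f, 40f, 50f } );
var sum = a + b;               // vectorized add over the padded buffers
var result = sum.ToArray();    // { 11, 22, 33, 44, 55 }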
This wrapper does the trick. The calculations appear to be correct, and Vector<T> no longer complains about the element count. However, it is twice as slow as a plain indexed for loop.
Here's the benchmark:
  using BenchmarkDotNet.Attributes;

  public class VectorWrapperBenchmarks
  {
    #region Data Members
    private static float[] arrayA;
    private static float[] arrayB;
    private static VectorWrapper<float> vecA;
    private static VectorWrapper<float> vecB;
    #endregion
    #region Constructor
    public VectorWrapperBenchmarks()
    {
      arrayA = new float[ 1024 ];
      arrayB = new float[ 1024 ];
      for( var i = 0; i < 1024; i++ )
        arrayA[ i ] = arrayB[ i ] = i;
      vecA = new VectorWrapper<float>( arrayA );
      vecB = new VectorWrapper<float>( arrayB );
    }
    #endregion
    [Benchmark]
    public void ForLoopSum()
    {
      var aA = arrayA;
      var aB = arrayB;
      var result = new float[ 1024 ];
      for( var i = 0; i < 1024; i++ )
        result[ i ] = aA[ i ] + aB[ i ];
    }
    [Benchmark]
    public void VectorSum()
    {
      var vA = vecA;
      var vB = vecB;
      var result = vA + vB;
    }
  }
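For completeness, I run this through BenchmarkDotNet's standard runner, roughly like so:

  using BenchmarkDotNet.Running;

  public class Program
  {
    public static void Main( string[] args )
    {
      // Entry point that hands the benchmark class above to BenchmarkDotNet.
      BenchmarkRunner.Run<VectorWrapperBenchmarks>();
    }
  }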
And the results:
|     Method |       Mean |    Error |   StdDev |
|----------- |-----------:|---------:|---------:|
| ForLoopSum |   757.6 ns | 15.67 ns | 17.41 ns |
|  VectorSum | 1,335.7 ns | 17.25 ns | 16.13 ns |
My processor (i7-6700k) does support SIMD hardware acceleration, and this is running in Release mode, 64-bit with optimizations enabled on .NET Core 2.2 (Windows 10).
I realize that the copying (Array.CopyTo()) is likely a large part of what is killing performance, but there seems to be no easy way to get both the padding/alignment and data sets whose length doesn't naturally match Vector<T>'s size requirements.
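The only copy-free alternative I can see is to drop the wrapper and operate on the raw arrays directly, vectorizing the bulk and finishing the last few elements with a scalar loop. A rough sketch (the class and method names are just for illustration):

using System.Numerics;

public static class VectorSumSketch
{
  // Rough sketch: add two equal-length float arrays by vectorizing the bulk
  // with Vector<float> and handling the leftover elements
  // (length % Vector<float>.Count) with a plain scalar loop,
  // so no padding or copying of the inputs is required.
  public static float[] SumNoWrapper( float[] a, float[] b )
  {
    var result = new float[ a.Length ];
    var step = Vector<float>.Count;
    var i = 0;

    // Vectorized portion.
    for( ; i <= a.Length - step; i += step )
    {
      var va = new Vector<float>( a, i );
      var vb = new Vector<float>( b, i );
      ( va + vb ).CopyTo( result, i );
    }

    // Scalar tail.
    for( ; i < a.Length; i++ )
      result[ i ] = a[ i ] + b[ i ];

    return result;
  }
}

But at that point I'm writing the loop by hand for every operation, which is exactly what I was hoping the wrapper would avoid.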
I'm rather new to SIMD, and I understand that the C# implementation is still in its early phase. However, I don't see a clear way to actually benefit from it, especially considering it is most beneficial when scaled to larger data sets.
Is there a better way to go about this?