I am trying to boost the performance of a .NET Core library by using System.Numerics to perform SIMD operations on float[] arrays. System.Numerics is a bit funky right now, and I'm having a hard time seeing how it can be beneficial. I understand that to see a performance boost with SIMD, the overhead has to be amortized over a large amount of computation, but given how it is currently implemented, I can't figure out how to accomplish this.
Vector<float> requires exactly Vector<float>.Count elements - 8 floats on my machine - no more, no less. If I want to perform SIMD operations on a group of values smaller than 8, I am forced to copy the values to a new array and pad the remainder with zeroes. If the group is larger than 8, I need to copy the values, pad with zeroes so the length is a multiple of 8, and then loop over the data in chunks of 8. The length requirement makes sense, but accommodating it seems like a good way to nullify any performance gain.
I have written a test wrapper class that takes care of the padding and alignment:
using System;
using System.Numerics;

public readonly struct VectorWrapper<T>
  where T : unmanaged
{
  #region Data Members
  public readonly int Length;
  private readonly T[] data_;
  #endregion
  #region Constructor
  public VectorWrapper( T[] data )
  {
    Length = data.Length;
    var stepSize = Vector<T>.Count;
    // Round the buffer up to the next full multiple of Vector<T>.Count
    // (always adds at least one extra vector's worth of zero padding).
    var bufferedLength = data.Length - ( data.Length % stepSize ) + stepSize;
    data_ = new T[ bufferedLength ];
    data.CopyTo( data_, 0 );
  }
  #endregion
  #region Public Methods
  public T[] ToArray()
  {
    var returnData = new T[ Length ];
    data_.AsSpan( 0, Length ).CopyTo( returnData );
    return returnData;
  }
  #endregion
  #region Operators
  public static VectorWrapper<T> operator +( VectorWrapper<T> l, VectorWrapper<T> r )
  {
    var resultLength = l.Length;
    var result = new VectorWrapper<T>( new T[ l.Length ] );
    var lSpan = l.data_.AsSpan();
    var rSpan = r.data_.AsSpan();
    var stepSize = Vector<T>.Count;
    // The zero padding added in the constructor guarantees every slice below
    // has at least Vector<T>.Count elements, so the last chunk never overruns.
    for( var i = 0; i < resultLength; i += stepSize )
    {
      var lVec = new Vector<T>( lSpan.Slice( i ) );
      var rVec = new Vector<T>( rSpan.Slice( i ) );
      Vector.Add( lVec, rVec ).CopyTo( result.data_, i );
    }
    return result;
  }
  #endregion
}
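Usage is straightforward - wrap the arrays, use the operator, unwrap the result. A trivial made-up example:

// Wrap two float arrays, add them with the SIMD-backed operator,
// then copy the result back out to a plain float[].
var a = new VectorWrapper<float>( new float[] { 1f, 2f, 3f, 4f, 5f } );
var b = new VectorWrapper<float>( new float[] { 10f, 20f, 30f, 40f, 50f } );
var sum = a + b;               // vectorized add over the padded buffers
var result = sum.ToArray();    // { 11, 22, 33, 44, 55 }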
This wrapper does the trick. The calculations appear to be correct, and Vector<T> no longer complains about the element count. However, it is twice as slow as a plain indexed for loop.
Here's the benchmark:
  using BenchmarkDotNet.Attributes;

  public class VectorWrapperBenchmarks
  {
    #region Data Members
    private static float[] arrayA;
    private static float[] arrayB;
    private static VectorWrapper<float> vecA;
    private static VectorWrapper<float> vecB;
    #endregion
    #region Constructor
    public VectorWrapperBenchmarks()
    {
      arrayA = new float[ 1024 ];
      arrayB = new float[ 1024 ];
      for( var i = 0; i < 1024; i++ )
        arrayA[ i ] = arrayB[ i ] = i;
      vecA = new VectorWrapper<float>( arrayA );
      vecB = new VectorWrapper<float>( arrayB );
    }
    #endregion
    [Benchmark]
    public void ForLoopSum()
    {
      var aA = arrayA;
      var aB = arrayB;
      var result = new float[ 1024 ];
      for( var i = 0; i < 1024; i++ )
        result[ i ] = aA[ i ] + aB[ i ];
    }
    [Benchmark]
    public void VectorSum()
    {
      var vA = vecA;
      var vB = vecB;
      var result = vA + vB;
    }
  }
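For completeness, I run this through BenchmarkDotNet's standard runner, roughly like so:

  using BenchmarkDotNet.Running;

  public class Program
  {
    public static void Main( string[] args )
    {
      // Entry point that hands the benchmark class above to BenchmarkDotNet.
      BenchmarkRunner.Run<VectorWrapperBenchmarks>();
    }
  }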
And the results:
|     Method |       Mean |    Error |   StdDev |
|----------- |-----------:|---------:|---------:|
| ForLoopSum |   757.6 ns | 15.67 ns | 17.41 ns |
|  VectorSum | 1,335.7 ns | 17.25 ns | 16.13 ns |
My processor (i7-6700k) does support SIMD hardware acceleration, and this is running in Release mode, 64-bit with optimizations enabled on .NET Core 2.2 (Windows 10).
I realize that the copying (Array.CopyTo()) is likely a large part of what is killing performance, but there seems to be no easy way to get both the padding/alignment and data sets whose length doesn't naturally match Vector<T>'s size requirements.
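The only copy-free alternative I can see is to drop the wrapper and operate on the raw arrays directly, vectorizing the bulk and finishing the last few elements with a scalar loop. A rough sketch (the class and method names are just for illustration):

using System.Numerics;

public static class VectorSumSketch
{
  // Rough sketch: add two equal-length float arrays by vectorizing the bulk
  // with Vector<float> and handling the leftover elements
  // (length % Vector<float>.Count) with a plain scalar loop,
  // so no padding or copying of the inputs is required.
  public static float[] SumNoWrapper( float[] a, float[] b )
  {
    var result = new float[ a.Length ];
    var step = Vector<float>.Count;
    var i = 0;

    // Vectorized portion.
    for( ; i <= a.Length - step; i += step )
    {
      var va = new Vector<float>( a, i );
      var vb = new Vector<float>( b, i );
      ( va + vb ).CopyTo( result, i );
    }

    // Scalar tail.
    for( ; i < a.Length; i++ )
      result[ i ] = a[ i ] + b[ i ];

    return result;
  }
}

But at that point I'm writing the loop by hand for every operation, which is exactly what I was hoping the wrapper would avoid.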
I'm rather new to SIMD, and I understand that the C# implementation is still in its early phase. However, I don't see a clear way to actually benefit from it, especially considering it is most beneficial when scaled to larger data sets.
Is there a better way to go about this?