Simd matmul program gives different numerical results

Question

I am trying to implement matrix multiplication in C using SIMD intrinsics. I was fairly sure of my implementation, but when I run it, I get numerical errors starting from the 5th digit of the resulting matrix's coefficients.

REAL_T is just a typedef for float.

/* This is my matmul version with SIMD, using single-precision floating point */
void matmul(int n, REAL_T *A, REAL_T *B, REAL_T *C){
  int i,j,k;
  __m256 vA, vB, vC, vRes;
  for (i=0; i<n; i++){
    for (j=0; j<n; j++){  
      for (k=0; k<n; k= k+8){
        vA = _mm256_load_ps(&A[i*n+k]);
        vB = _mm256_loadu_ps(&B[k*n+j]);
        vC = _mm256_mul_ps(vA, vB);
        vC = _mm256_hadd_ps(vC, vC);
        vC = _mm256_hadd_ps(vC, vC);
        /*To get the resulting coefficient, after doing 2 hadds,
        I have to get the first and the last element of the resulting
        Vector vC*/
        C[i*n+j] += ((float )(vC[0])) + ((float )(vC[7]));
      } /* for k */
    } /* for j */
  } /* for i */
}
/*End of program*/

/*And this is the sequential Version*/
void matmul(int n, REAL_T *A, REAL_T *B, REAL_T *C){
  int i,j,k;
  for (i=0; i<n; i++){ 
    for (j=0; j<n; j++){
      for (k=0; k<n; k++){
        C[i*n+j] +=  A[i*n+k] *  B[k*n+j];  
      } /* for k */
    } /* for j */
  } /* for i */  
}
/*End of program*/

/*The matrices are initialized as follows*/
  for (i = 0; i < n; i++)
    for (j = 0; j < n; j++){
      *(A+i*n+j) = 1 / ((REAL_T) (i+j+1));
      *(B+i*n+j) = 1.0;
      *(C+i*n+j) = 1.0;
    }
/*End of initialization*/

The tested matrices are of size 512*512. For the sequential version, the top-left square of the resulting matrix gives:

+6.916512e+01  +6.916512e+01  
+5.918460e+01  +5.918460e+01  

+7.946186e+00  +7.946186e+00  
+7.936391e+00  +7.936391e+00

However, for the simd version, the square is:

+6.916510e+01  +6.916510e+01  
+5.918463e+01  +5.918463e+01  

+7.946147e+00  +7.946147e+00  
+7.936355e+00  +7.936355e+00

As shown, there is a numerical difference between the two versions. Any help would be really appreciated!

Peter Cordes · Accepted Answer · 2019-04-04T20:30:20.967

This looks normal; adding numbers in a different order produces different rounding in the temporaries.

FP math is not associative; optimizing as if it were will change the results.¹ See "Is Floating point addition and multiplication associative?" and "Are floating point operations in C associative?".
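As a small illustration of non-associativity (a sketch, not part of the original answer; the function names are made up for this example), the following adds the same three float values with two different groupings. The inputs are chosen so that the effect is extreme: near 1e8 a float can only represent multiples of 8, so adding 1.0f to -1e8f is lost to rounding.

```c
#include <assert.h>

/* Hypothetical helper names, used only for this illustration. */

/* (a + b) + c: the two large values cancel first, so c survives. */
float sum_left_grouping(float a, float b, float c) {
    volatile float t = a + b;   /* volatile: make sure t is rounded to float */
    return t + c;
}

/* a + (b + c): c is absorbed by rounding when it is added to the large b. */
float sum_right_grouping(float a, float b, float c) {
    volatile float t = b + c;
    return a + t;
}

/* Self-check: same inputs, different grouping, different result. */
void check_grouping_matters(void) {
    assert(sum_left_grouping(1e8f, -1e8f, 1.0f) == 1.0f);
    assert(sum_right_grouping(1e8f, -1e8f, 1.0f) == 0.0f);
}
```

In a real matmul the effect is far smaller, because the terms have similar magnitudes, but it is the same mechanism: each grouping rounds different intermediate values.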

The amount of change depends on the data. Differences only in the 5th decimal place seem reasonable for float, which has about 7 significant decimal digits of precision.

Unless you were taking special numerical precautions, like adding up the small numbers first, the sequential-order result isn't "more correct"; the two results just have different errors.

In fact, using multiple accumulators generally increases precision for large lists, assuming your numbers all have similar magnitude. (Ideally, use multiple SIMD vectors, each composed of multiple elements, to hide FP-add or FMA latency.) Pairwise summation (https://en.wikipedia.org/wiki/Pairwise_summation) is a numerical technique that takes this to the next level: summing subsets of the list in a tree, to avoid adding single array elements to a much larger value. See, for example, "How to avoid less precise sum for numpy-arrays with multiple columns".

Using a fixed number of accumulators (e.g. 8x __m256 = 64 float accumulators) might reduce the expected error by a factor of 64, whereas full pairwise summation reduces its growth from N to log N.
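To make the three summation orders concrete, here is a minimal scalar sketch (not from the original answer; the function names and the block size of 8 are assumptions for illustration). The 8-accumulator version is a scalar stand-in for a single __m256 accumulator: element i always goes to running total i % 8.

```c
#include <assert.h>
#include <stddef.h>

/* One running total: late elements are added to an already-large value. */
float sum_sequential(const float *x, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Eight independent running totals, combined once at the end.
   Assumes n is a multiple of 8 to keep the sketch short. */
float sum_8_accumulators(const float *x, size_t n) {
    assert(n % 8 == 0);
    float acc[8] = {0.0f};
    for (size_t i = 0; i < n; i += 8)
        for (size_t lane = 0; lane < 8; lane++)
            acc[lane] += x[i + lane];
    return ((acc[0] + acc[1]) + (acc[2] + acc[3])) +
           ((acc[4] + acc[5]) + (acc[6] + acc[7]));
}

/* Pairwise summation: split the range in halves recursively and
   sum small blocks (8 elements or fewer) directly. */
float sum_pairwise(const float *x, size_t n) {
    if (n <= 8)
        return sum_sequential(x, n);
    size_t half = n / 2;
    return sum_pairwise(x, half) + sum_pairwise(x + half, n - half);
}
```

For an input such as 2^20 copies of 0.1f, comparing each result against a double-precision reference sum should show the sequential version with the largest error, the 8-accumulator version with a smaller one, and the pairwise version closest to the reference. Other data can behave differently; none of the three orders is guaranteed to win on every input.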

Footnote 1: Associativity is necessary for parallelization, SIMD, and multiple accumulators. See "Associativity gives us parallelizability. But what does commutativity give?".

On a machine with, for example, 4-cycle-latency, 2-per-clock-throughput FMA and a SIMD width of 8 floats (i.e. a Skylake system with AVX2), the potential speedup over a pure sequential version is 4*2 = 8 from multiple accumulators, times 8 from SIMD width, times the number of cores, even for problems where the result might be less accurate instead of just different.

Most people consider a factor of 8*8 = 64 worth it! (And you can in theory also parallelize for another factor of maybe 4 on a quad-core, assuming perfect scaling for large matrices).

You're already using float instead of double for performance.

See also "Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables?" for more about using multiple accumulators to hide FMA latency in a reduction, exposing that other factor of 8 speedup.

Also, do not use hadd inside an inner-most loop. Sum vertically, and use an efficient reduction at the end of the loop (see "Fastest way to do horizontal float vector sum on x86"). You really want to avoid having the compiler extract your vectors to scalar at every step; that defeats most of the benefit of SIMD. Besides, hadd is not worth using for horizontal sums of a single vector: it costs 2 shuffles + a regular add on all existing CPUs.
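A minimal sketch of "sum vertically, reduce once" for a single dot product might look like the following. It is an illustration under stated assumptions, not the original poster's code: the names hsum256_ps, dot_avx and USE_AVX are made up here, both inputs are assumed to be contiguous in memory, n is assumed to be a multiple of 8, and the target attribute assumes GCC or Clang on an x86 CPU with AVX.

```c
#include <assert.h>
#include <stddef.h>
#include <immintrin.h>

/* Assumption: GCC or Clang. The attribute enables AVX for these functions
   even when the file is not compiled with -mavx. */
#if defined(__GNUC__)
#define USE_AVX __attribute__((target("avx")))
#else
#define USE_AVX
#endif

/* Horizontal sum of the 8 floats in v, using extract/shuffle + add
   instead of hadd. Called once per dot product, not once per iteration. */
USE_AVX static inline float hsum256_ps(__m256 v) {
    __m128 lo = _mm256_castps256_ps128(v);          /* lanes 0..3 */
    __m128 hi = _mm256_extractf128_ps(v, 1);        /* lanes 4..7 */
    __m128 s  = _mm_add_ps(lo, hi);                 /* 4 partial sums */
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));         /* 2 partial sums */
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 0x55));  /* total in lane 0 */
    return _mm_cvtss_f32(s);
}

/* Dot product of two contiguous float arrays of length n (multiple of 8). */
USE_AVX float dot_avx(const float *a, const float *b, size_t n) {
    assert(n % 8 == 0);
    __m256 acc = _mm256_setzero_ps();               /* 8 vertical partial sums */
    for (size_t k = 0; k < n; k += 8) {
        __m256 va = _mm256_loadu_ps(a + k);
        __m256 vb = _mm256_loadu_ps(b + k);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));
    }
    return hsum256_ps(acc);                         /* single reduction at the end */
}
```

In a matmul, a would be row i of A and b would need to be column j of B stored contiguously, for example taken from a transposed copy of B (a hypothetical Bt, so that the call is something like C[i*n+j] += dot_avx(&A[i*n], &Bt[j*n], n)). This version uses one vector accumulator for clarity; using several, as discussed above, would additionally hide the add latency.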

Thank you so much for the clear answer ! I will change my hadd implementation as I did not know it had that kind of effect ... — abdul rahman taleb, Apr 02 '19 at 19:49
