Tiled Matrix Multiplication using AVX

Question

I have coded the following C function for multiplying two NxN matrices using tiling/blocking and AVX vectors to speed up the calculation. Right now though I'm getting a segmentation fault when I try to combine AVX intrinsics with tiling. Any idea why that happens?

Also, is there a better memory access pattern for matrix B? Maybe transposing it first or even changing the k and j loop? Because right now, I'm traversing it column-wise which is probably not very efficient in regards to spatial locality and cache lines.

  1 void mmult(double A[SIZE_M][SIZE_N], double B[SIZE_N][SIZE_K], double C[SIZE_M][SIZE_K])
  2 {
  3   int i, j, k, i0, j0, k0;
  4   // double sum;
  5   __m256d sum;
  6   for(i0 = 0; i0 < SIZE_M; i0 += BLOCKSIZE) {
  7   for(k0 = 0; k0 < SIZE_N; k0 += BLOCKSIZE) {
  8   for(j0 = 0; j0 < SIZE_K; j0 += BLOCKSIZE) {
  9       for (i = i0; i < MIN(i0+BLOCKSIZE, SIZE_M); i++) {
 10         for (j = j0; j < MIN(j0+BLOCKSIZE, SIZE_K); j++) {
 11           // sum = C[i][j];
 12           sum = _mm256_load_pd(&C[i][j]);
 13           for (k = k0; k < MIN(k0+BLOCKSIZE, SIZE_N); k++) {
 14             // sum += A[i][k] * B[k][j];
 15             sum = _mm256_add_pd(sum, _mm256_mul_pd(_mm256_load_pd(&A[i][k]), _mm256_broadcast_sd(&B[k][j])));
 16           }
 17           // C[i][j] = sum;
 18           _mm256_store_pd(&C[i][j], sum);
 19         }
 20       }
 21   }
 22   }
 23   }
 24 }

@zx485: huh? notice the loop bounds: the inner loops are over BLOCKSIZE tiles of the full matrix, allowing reuse of data while it's still hot in L1d cache. And the outer loops use `+=BLOCKSIZE`. However, IIRC it should only be 5 nested loops, though, not 6, and the details might be wrong. — Peter Cordes, Nov 23 '19 at 19:00
@zx485 Well, chances are that `BLOCKSIZE` has a fixed (small) value, like 8 or 16, while N is probably much bigger. So, it isn't really O(N^6), but it should be the "usual" O(N^3). — Bob__, Nov 23 '19 at 19:01

score 3 · Accepted Answer · answered Nov 23 '19 at 19:44

_mm256_load_pd is an alignment-required load but you're only stepping by k++, not k+=4 in the inner-most loop that loads a 32-byte vector of 4 doubles. So it faults because 3 of every 4 loads are misaligned.

You don't want to be doing overlapping loads, your real bug is the indexing; if your input pointers are 32-byte aligned you should be able to keep using _mm256_load_pd instead of _mm256_loadu_pd. So using _mm256_load_pd successfully caught your bug instead of working but giving numerically wrong results.

Your strategy for vectorizing four row*column dot products (to produce a C[i][j+0..3] vector) should load 4 contiguous doubles from 4 different columns (B[k][j+0..3] via a vector load from B[k][j]), and broadcast 1 double from A[i][k]. Remember you want 4 dot products in parallel.

Another strategy might involve a horizontal sum at the end down to a scalar C[i][j] += horizontal_add(__m256d), but I think that would require transposing one input first so both row and column vectors are in contiguous memory for one dot product. But then you need shuffles for a horizontal sum at the end of each inner loop.

You probably also want to use at least 2 sum variables so you can read a whole cache line at once, and hide FMA latency in the inner loop and hopefully bottleneck on throughput. Or better do 4 or 8 vectors in parallel. So you produce C[i][j+0..15] as sum0, sum1, sum2, sum3. (Or use an array of __m256d; compilers will typically fully unroll a loop of 8 and optimize the array into registers.)

I think you only need 5 nested loops, to block over rows and columns. Although apparently 6 nested loops are a valid option: see loop tiling/blocking for large dense matrix multiplication which has a 5-nested loop in the question but a 6-nested loop in an answer. (Just scalar, though, not vectorized).

There might be other bugs besides the row*column dot product strategy here, I'm not sure.

If you're using AVX, you might want to use FMA as well, unless you need to run on Sandbybridge/Ivybridge, and AMD Bulldozer. (Piledriver and later have FMA3).

Other matmul strategies include adding into the destination inside the inner loop so you're loading C and A inside the inner loop, with a load from B hoisted. (Or B and A swapped, I forget.) What Every Programmer Should Know About Memory? has a vectorized cache-blocked example that works this way in an appendix, for SSE2 __m128d vectors. https://www.akkadia.org/drepper/cpumemory.pdf

Tiled Matrix Multiplication using AVX

1 Answers1