Matrix-Vector and Matrix-Matrix multiplication using SSE

Question

I need to write matrix-vector and matrix-matrix multiplication functions but I cannot wrap my head around SSE commands.

The dimensions of matrices and vectors are always multiples of 4.

I managed to write the vector-vector multiplication function that looks like this:

void vector_multiplication_SSE(float* m, float* n, float* result, unsigned const int size)
{
    int i;

    __declspec(align(16))__m128 *p_m = (__m128*)m;
    __declspec(align(16))__m128 *p_n = (__m128*)n;
    __declspec(align(16))__m128 *p_result = (__m128*)result;

    for (i = 0; i < size / 4; ++i)
        p_result[i] = _mm_mul_ps(p_m[i], p_n[i]);

    // print the result
    for (int i = 0; i < size; ++i)
    {
        if (i % 4 == 0) cout << endl;
        cout << result[i] << '\t';
    }
}

and now I'm trying to implement matrix-vector multiplication.

Here's what I have so far:

void multiply_matrix_by_vector_SSE(float* m, float* v, float* result, unsigned const int vector_dims)
{
    int i, j;

    __declspec(align(16))__m128 *p_m = (__m128*)m;
    __declspec(align(16))__m128 *p_v = (__m128*)v;
    __declspec(align(16))__m128 *p_result = (__m128*)result;

    for (i = 0; i < vector_dims; i += 4)
    {
        __m128 tmp = _mm_load_ps(&result[i]);
        __m128 p_m_tmp = _mm_load_ps(&m[i]);

        tmp = _mm_add_ps(tmp, _mm_mul_ps(tmp, p_m_tmp));
        _mm_store_ps(&result[i], tmp);

        // another for loop here? 
    }

    // print the result
    for (int i = 0; i < vector_dims; ++i)
    {
        if (i % 4 == 0) cout << endl;
        cout << result[i] << '\t';
    }
}

This function looks completely wrong. Not only does it not work correctly, it also feels like I'm moving in the wrong direction.

Could anyone help me with implementing matrix-vector and matrix-matrix multiplication? I'd really appreciate some example code and a very detailed explanation.

Update

Here's my attempt number 2:

It fails with an access violation (reading) exception, but it still feels closer:

void multiply_matrix_by_vector_SSE(float* m, float* v, float* result, unsigned const int vector_dims)
{
    int i, j;

    __declspec(align(16))__m128 *p_m = (__m128*)m;
    __declspec(align(16))__m128 *p_v = (__m128*)v;
    __declspec(align(16))__m128 *p_result = (__m128*)result;

    for (i = 0; i < vector_dims; ++i)
    {
        p_result[i] = _mm_mul_ps(_mm_load_ps(&m[i]), _mm_load_ps1(&v[i]));
    }

    // print the result
    for (int i = 0; i < vector_dims; ++i)
    {
        if (i % 4 == 0) cout << endl;
        cout << result[i] << '\t';
    }
}

Update 2

void multiply_matrix_by_vector_SSE(float* m, float* v, float* result, unsigned const int vector_dims)
{
    int i, j;
    __declspec(align(16))__m128 *p_m = (__m128*)m;
    __declspec(align(16))__m128 *p_v = (__m128*)v;
    __declspec(align(16))__m128 *p_result = (__m128*)result;

    for (i = 0; i < vector_dims; ++i)
    {
        for (j = 0; j < vector_dims * vector_dims / 4; ++j)
        {
            p_result[i] = _mm_mul_ps(p_v[i], p_m[j]);
        }
    }

    for (int i = 0; i < vector_dims; ++i)
    {
        if (i % 4 == 0) cout << endl;
        cout << result[i] << '\t';
    }
    cout << endl;
}

Do you know how to write it in scalar code? Because not even the general structure matches that of a matrix-vector multiplication — harold, Nov 19 '15 at 19:30
well, in scalar code it is really easy, but it's been just two hours since I found out about SSE, and I might do a lot of stupid stuff here — Denis Yakovenko, Nov 19 '15 at 19:32
http://stackoverflow.com/questions/14967969/efficient-4x4-matrix-vector-multiplication-with-sse-horizontal-add-and-dot-prod — Z boson, Nov 20 '15 at 20:00

harold · Accepted Answer · 2020-03-25T03:19:07.057

Without any tricks or anything, a matrix-vector multiplication is just a bunch of dot products between the vector and a row of the matrix. Your code doesn't really have that structure. Writing it actually as dot products (not tested):

for (int row = 0; row < nrows; ++row) {
    __m128 acc = _mm_setzero_ps();
    // I'm just going to assume the number of columns is a multiple of 4
    for (int col = 0; col < ncols; col += 4) {
        __m128 vec = _mm_load_ps(&v[col]);
        // don't forget it's a matrix, do 2d addressing
        __m128 mat = _mm_load_ps(&m[col + ncols * row]);
        acc = _mm_add_ps(acc, _mm_mul_ps(mat, vec));
    }
    // now we have 4 floats in acc and they have to be summed
    // can use two horizontal adds for this, they kind of suck but this
    // isn't the inner loop anyway.
    acc = _mm_hadd_ps(acc, acc);
    acc = _mm_hadd_ps(acc, acc);
    // store result, which is a single float
    _mm_store_ss(&result[row], acc);
}
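For comparison, here is a sketch of the same loop as a self-contained function, next to the scalar version it should agree with. The function names, the row-major layout and the use of unaligned loads (`_mm_loadu_ps`) are my assumptions, not something from the question. The horizontal sum is done with SSE1 shuffles rather than `_mm_hadd_ps`, so it builds without SSE3.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE1 intrinsics

// Scalar reference: result[row] = dot(row of m, v); m is row-major, nrows x ncols.
void matvec_scalar(const float* m, const float* v, float* result,
                   std::size_t nrows, std::size_t ncols) {
    for (std::size_t row = 0; row < nrows; ++row) {
        float sum = 0.0f;
        for (std::size_t col = 0; col < ncols; ++col)
            sum += m[row * ncols + col] * v[col];
        result[row] = sum;
    }
}

// Sum the 4 lanes of x using only SSE1 instructions.
static inline float hsum_ps(__m128 x) {
    __m128 hi = _mm_movehl_ps(x, x);   // lanes: x2, x3, x2, x3
    __m128 sum2 = _mm_add_ps(x, hi);   // lane 0: x0+x2, lane 1: x1+x3
    __m128 lane1 = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    return _mm_cvtss_f32(_mm_add_ss(sum2, lane1));
}

// Same structure as the loop above: one dot product per row.
void matvec_sse(const float* m, const float* v, float* result,
                std::size_t nrows, std::size_t ncols) {
    assert(ncols % 4 == 0);  // no tail handling in this sketch
    for (std::size_t row = 0; row < nrows; ++row) {
        __m128 acc = _mm_setzero_ps();
        for (std::size_t col = 0; col < ncols; col += 4) {
            __m128 vec = _mm_loadu_ps(&v[col]);
            __m128 mat = _mm_loadu_ps(&m[row * ncols + col]);  // 2d addressing
            acc = _mm_add_ps(acc, _mm_mul_ps(mat, vec));
        }
        result[row] = hsum_ps(acc);
    }
}
```

If the buffers are known to be 16-byte aligned, the `_mm_loadu_ps` calls can be swapped back to `_mm_load_ps`. Note that the SIMD version adds the products in a different order than the scalar one, so for general inputs the two results can differ in the last bits.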

There are some obvious tricks, such as processing several rows at once, reusing the load from the vector, and creating several independent dependency chains so you can make better use of the throughput (see below). Another really simple trick is using FMA for the mul/add combo. Support for it was not widespread in 2015, but it is fairly widespread now in 2020.
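A minimal sketch of the FMA trick, assuming a GCC/Clang-style build where `__FMA__` is defined when the target supports it (for example with `-mfma`); other compilers may need a different check. The helper names `mul_add_ps` and `dot_ps` are made up for illustration. Without FMA the helper falls back to the separate mul and add.

```cpp
#include <cassert>
#include <cstddef>
#include <immintrin.h>

// Per lane: a * b + acc. Fused into one instruction when FMA is enabled.
static inline __m128 mul_add_ps(__m128 a, __m128 b, __m128 acc) {
#if defined(__FMA__)
    return _mm_fmadd_ps(a, b, acc);
#else
    return _mm_add_ps(_mm_mul_ps(a, b), acc);
#endif
}

// Dot product of two float arrays whose length is a multiple of 4.
float dot_ps(const float* a, const float* b, std::size_t n) {
    assert(n % 4 == 0);  // no tail handling in this sketch
    __m128 acc = _mm_setzero_ps();
    for (std::size_t i = 0; i < n; i += 4)
        acc = mul_add_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i), acc);
    // Collapse the 4 partial sums in scalar code to keep the example short.
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return (lanes[0] + lanes[1]) + (lanes[2] + lanes[3]);
}
```

Because a fused multiply-add rounds once instead of twice, the FMA and non-FMA builds can differ in the last bits for general inputs.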

You can build matrix-matrix multiplication from this (if you change the place the result goes), but that is not optimal (see further below).

Taking four rows at once (not tested):

for (int row = 0; row < nrows; row += 4) {
    __m128 acc0 = _mm_setzero_ps();
    __m128 acc1 = _mm_setzero_ps();
    __m128 acc2 = _mm_setzero_ps();
    __m128 acc3 = _mm_setzero_ps();
    for (int col = 0; col < ncols; col += 4) {
        __m128 vec = _mm_load_ps(&v[col]);
        __m128 mat0 = _mm_load_ps(&m[col + ncols * row]);
        __m128 mat1 = _mm_load_ps(&m[col + ncols * (row + 1)]);
        __m128 mat2 = _mm_load_ps(&m[col + ncols * (row + 2)]);
        __m128 mat3 = _mm_load_ps(&m[col + ncols * (row + 3)]);
        acc0 = _mm_add_ps(acc0, _mm_mul_ps(mat0, vec));
        acc1 = _mm_add_ps(acc1, _mm_mul_ps(mat1, vec));
        acc2 = _mm_add_ps(acc2, _mm_mul_ps(mat2, vec));
        acc3 = _mm_add_ps(acc3, _mm_mul_ps(mat3, vec));
    }
    acc0 = _mm_hadd_ps(acc0, acc1);
    acc2 = _mm_hadd_ps(acc2, acc3);
    acc0 = _mm_hadd_ps(acc0, acc2);
    _mm_store_ps(&result[row], acc0);
}

There are only 5 loads per 4 FMAs now, versus 2 loads per 1 FMA in the version that wasn't row-unrolled. There are also 4 independent FMAs (or add/mul pairs without FMA contraction), which increases the potential for pipelined/simultaneous execution. You might want to unroll even more: for example, Skylake can start 2 independent FMAs per cycle and they take 4 cycles to complete, so to completely occupy both FMA units you need 8 independent FMAs. As a bonus, the 3 horizontal adds at the end work out relatively nicely for horizontal summation.
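If SSE3 is not available, one alternative to the three `_mm_hadd_ps` is to transpose the four accumulators with the `_MM_TRANSPOSE4_PS` macro and then add them vertically. This is a swapped-in technique, not what the code above does. The sketch below is a self-contained four-rows-at-once function built that way; the function name, the row-major layout and the unaligned loads are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE1 intrinsics, including _MM_TRANSPOSE4_PS

// result = m * v, four rows per outer iteration; m is row-major, nrows x ncols.
void matvec_sse_4rows(const float* m, const float* v, float* result,
                      std::size_t nrows, std::size_t ncols) {
    assert(nrows % 4 == 0 && ncols % 4 == 0);  // no tail handling here
    for (std::size_t row = 0; row < nrows; row += 4) {
        const float* r0 = m + ncols * row;
        const float* r1 = r0 + ncols;
        const float* r2 = r1 + ncols;
        const float* r3 = r2 + ncols;
        __m128 acc0 = _mm_setzero_ps();
        __m128 acc1 = _mm_setzero_ps();
        __m128 acc2 = _mm_setzero_ps();
        __m128 acc3 = _mm_setzero_ps();
        for (std::size_t col = 0; col < ncols; col += 4) {
            __m128 vec = _mm_loadu_ps(v + col);  // loaded once, used 4 times
            acc0 = _mm_add_ps(acc0, _mm_mul_ps(_mm_loadu_ps(r0 + col), vec));
            acc1 = _mm_add_ps(acc1, _mm_mul_ps(_mm_loadu_ps(r1 + col), vec));
            acc2 = _mm_add_ps(acc2, _mm_mul_ps(_mm_loadu_ps(r2 + col), vec));
            acc3 = _mm_add_ps(acc3, _mm_mul_ps(_mm_loadu_ps(r3 + col), vec));
        }
        // After the transpose, acc0 holds lane 0 of every row's accumulator,
        // acc1 holds lane 1, and so on, so a vertical add finishes all 4 sums.
        _MM_TRANSPOSE4_PS(acc0, acc1, acc2, acc3);
        __m128 sums = _mm_add_ps(_mm_add_ps(acc0, acc1), _mm_add_ps(acc2, acc3));
        _mm_storeu_ps(result + row, sums);
    }
}
```

Either way the reduction sits outside the inner loop, so which of the two is faster is something to measure rather than assume.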

For matrix-matrix multiplication, the different data layout initially seems like a disadvantage: it's no longer possible to simply do vector loads from both operands and multiply them together (that would multiply a tiny row vector of the first matrix by a tiny row vector of the second matrix, which is wrong). But full matrix-matrix multiplication can make use of the fact that it's essentially multiplying a matrix by lots of independent vectors, so it's full of independent work to be done. The horizontal sums can be avoided easily too. So it's actually even more convenient than matrix-vector multiplication.

The key is taking a little column vector from matrix A and a little row vector from matrix B and multiplying them out into a small matrix. That may sound reversed compared to what you're used to, but doing it this way works out better with SIMD because the computations stay independent and horizontal-operation-free the whole time.
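To make the "column of A times row of B" idea concrete, here is a scalar sketch of it. The function name and the use of `std::vector` are just for illustration. For each `k`, element `i` of column `k` of A scales row `k` of B, and the result is accumulated into row `i` of C. The innermost loop over `j` is the part that maps directly onto SIMD lanes, with no horizontal operation anywhere.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// C (n x p) = A (n x m) * B (m x p), all row-major.
void matmul_outer_product(const std::vector<float>& A,
                          const std::vector<float>& B,
                          std::vector<float>& C,
                          std::size_t n, std::size_t m, std::size_t p) {
    assert(A.size() == n * m && B.size() == m * p);
    C.assign(n * p, 0.0f);
    for (std::size_t k = 0; k < m; ++k) {
        for (std::size_t i = 0; i < n; ++i) {
            const float a = A[i * m + k];      // element i of column k of A
            for (std::size_t j = 0; j < p; ++j)
                C[i * p + j] += a * B[k * p + j];  // scaled row k of B
        }
    }
}
```

In the SIMD version, `a` becomes a broadcast (`_mm_set1_ps`) and the `j` loop becomes a vector load from B, a multiply and an add.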

For example (not tested; assumes the matrix dimensions are divisible by the unroll factors; requires x64, otherwise it runs out of registers):

for (size_t i = 0; i < mat1rows; i += 4) {
    for (size_t j = 0; j < mat2cols; j += 8) {
        float* mat1ptr = &mat1[i * mat1cols];
        // Index into mat2 for row k, column j. This assumes mat2 is stored
        // plain row-major; it starts at row 0 and advances one row per k.
        size_t m2idx = j;
        // 128-bit SSE accumulators: one A/B pair per row of the 4x8 result tile
        __m128 sumA_1, sumB_1, sumA_2, sumB_2, sumA_3, sumB_3, sumA_4, sumB_4;
        sumA_1 = _mm_setzero_ps();
        sumB_1 = _mm_setzero_ps();
        sumA_2 = _mm_setzero_ps();
        sumB_2 = _mm_setzero_ps();
        sumA_3 = _mm_setzero_ps();
        sumB_3 = _mm_setzero_ps();
        sumA_4 = _mm_setzero_ps();
        sumB_4 = _mm_setzero_ps();

        for (size_t k = 0; k < mat2rows; ++k) {
            // broadcast element k of each of the 4 rows of mat1 and multiply
            // it by the same 8 floats from row k of mat2
            auto bc_mat1_1 = _mm_set1_ps(mat1ptr[0]);
            auto vecA_mat2 = _mm_load_ps(mat2 + m2idx);
            auto vecB_mat2 = _mm_load_ps(mat2 + m2idx + 4);
            sumA_1 = _mm_add_ps(_mm_mul_ps(bc_mat1_1, vecA_mat2), sumA_1);
            sumB_1 = _mm_add_ps(_mm_mul_ps(bc_mat1_1, vecB_mat2), sumB_1);
            auto bc_mat1_2 = _mm_set1_ps(mat1ptr[mat1cols]);
            sumA_2 = _mm_add_ps(_mm_mul_ps(bc_mat1_2, vecA_mat2), sumA_2);
            sumB_2 = _mm_add_ps(_mm_mul_ps(bc_mat1_2, vecB_mat2), sumB_2);
            auto bc_mat1_3 = _mm_set1_ps(mat1ptr[mat1cols * 2]);
            sumA_3 = _mm_add_ps(_mm_mul_ps(bc_mat1_3, vecA_mat2), sumA_3);
            sumB_3 = _mm_add_ps(_mm_mul_ps(bc_mat1_3, vecB_mat2), sumB_3);
            auto bc_mat1_4 = _mm_set1_ps(mat1ptr[mat1cols * 3]);
            sumA_4 = _mm_add_ps(_mm_mul_ps(bc_mat1_4, vecA_mat2), sumA_4);
            sumB_4 = _mm_add_ps(_mm_mul_ps(bc_mat1_4, vecB_mat2), sumB_4);
            m2idx += mat2cols;  // next row of mat2
            mat1ptr++;          // next column of mat1
        }
        _mm_store_ps(&result[i * mat2cols + j], sumA_1);
        _mm_store_ps(&result[i * mat2cols + j + 4], sumB_1);
        _mm_store_ps(&result[(i + 1) * mat2cols + j], sumA_2);
        _mm_store_ps(&result[(i + 1) * mat2cols + j + 4], sumB_2);
        _mm_store_ps(&result[(i + 2) * mat2cols + j], sumA_3);
        _mm_store_ps(&result[(i + 2) * mat2cols + j + 4], sumB_3);
        _mm_store_ps(&result[(i + 3) * mat2cols + j], sumA_4);
        _mm_store_ps(&result[(i + 3) * mat2cols + j + 4], sumB_4);
    }
}

The point of that code is that it's easy to arrange the computation to be very SIMD-friendly, with lots of independent arithmetic to saturate the floating point units with, while using relatively few loads (which could otherwise become a bottleneck, even putting aside that they might miss L1 cache, just by there being too many of them).
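As a smaller, self-contained variant of the same idea, here is a sketch that computes a 4x4 tile of the result per iteration instead of 4x8. The function name, the row-major layout for all three matrices and the unaligned loads are assumptions. Each step broadcasts one element of the first matrix and multiplies it by four floats from a row of the second, so there are no horizontal operations.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE1 intrinsics

// c (n x p) = a (n x m) * b (m x p), all row-major.
void matmul_sse_4x4(const float* a, const float* b, float* c,
                    std::size_t n, std::size_t m, std::size_t p) {
    assert(n % 4 == 0 && p % 4 == 0);  // no edge handling in this sketch
    for (std::size_t i = 0; i < n; i += 4) {
        for (std::size_t j = 0; j < p; j += 4) {
            // One accumulator per row of the 4x4 output tile.
            __m128 sum[4] = {_mm_setzero_ps(), _mm_setzero_ps(),
                             _mm_setzero_ps(), _mm_setzero_ps()};
            for (std::size_t k = 0; k < m; ++k) {
                // 4 floats from row k of b, shared by all 4 output rows
                __m128 brow = _mm_loadu_ps(b + k * p + j);
                for (std::size_t r = 0; r < 4; ++r) {
                    __m128 bc = _mm_set1_ps(a[(i + r) * m + k]);  // broadcast
                    sum[r] = _mm_add_ps(sum[r], _mm_mul_ps(bc, brow));
                }
            }
            for (std::size_t r = 0; r < 4; ++r)
                _mm_storeu_ps(c + (i + r) * p + j, sum[r]);
        }
    }
}
```

That is 1 vector load and 4 broadcasts per 4 multiply-adds. The 4x8 tile above does better, because it reuses each broadcast for two vectors.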

You can even use this code, but it's not competitive with Intel MKL, especially for medium or big matrices, where tiling is extremely important. It's easy to upgrade this to AVX. It's not suitable for tiny matrices at all; for example, to multiply two 4x4 matrices see Efficient 4x4 matrix multiplication.
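To give a rough idea of what tiling means here, below is a scalar sketch of a blocked loop nest. It is only an illustration of the loop structure: the function name and the `tile` parameter are made up, and this is not how MKL or any particular library does it. The point is that each tile of the operands gets reused while it is still in cache, instead of streaming whole rows and columns through for every output element.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// c (n x p) = a (n x m) * b (m x p), all row-major, processed tile by tile.
void matmul_tiled(const float* a, const float* b, float* c,
                  std::size_t n, std::size_t m, std::size_t p,
                  std::size_t tile) {
    assert(tile > 0);
    std::fill(c, c + n * p, 0.0f);
    for (std::size_t i0 = 0; i0 < n; i0 += tile) {
        for (std::size_t k0 = 0; k0 < m; k0 += tile) {
            for (std::size_t j0 = 0; j0 < p; j0 += tile) {
                // Clamp the tile to the matrix edges.
                const std::size_t i1 = std::min(i0 + tile, n);
                const std::size_t k1 = std::min(k0 + tile, m);
                const std::size_t j1 = std::min(j0 + tile, p);
                for (std::size_t i = i0; i < i1; ++i)
                    for (std::size_t k = k0; k < k1; ++k) {
                        const float aik = a[i * m + k];
                        for (std::size_t j = j0; j < j1; ++j)
                            c[i * p + j] += aik * b[k * p + j];
                    }
            }
        }
    }
}
```

In a real implementation the innermost three loops would be a SIMD kernel like the one above, and the tile sizes would be picked per cache level by measuring.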

could you elaborate more on those obvious tricks? wouldn't processing several rows at once require to load the matrix by columns? — truvaking, Mar 24 '20 at 16:22
@truvaking sure, I added a bunch of stuff. Processing several rows at once is not the same thing as loading the matrix by column, the next column in the same row is used soon enough that it's still in L1, there is no TLB thrashing, and linear prefetching still works. — harold, Mar 25 '20 at 03:28
