Multiply rows of the matrix by a vector (low-level optimization)?

Question

I'm optimizing a function and I want to get rid of slow for loops. I'm looking for a faster way to multiply each row of a matrix by a vector.

I'm not looking for a 'classical' multiplication.

Eg. I have a matrix that has 1024 columns and 20 rows and a vector that has the length of 1024. In a result, I want to have matrix 1024 x 20 that has each row multiplied by the vector.

What I am doing now I am iterating in the for loop over the matrix rows and using mkl v?Mul performing an element by element multiplication of current matrix row and the vector. Any ideas how to improve this?

The question is the copy Multiply rows of matrix by vector? but for C++ with possible low-level optimizations and MKL, not for R

I assume you mean 1024 rows and 20 columns? Is the 20 fixed (or known at compile time and guaranteed to be a multiple of 4)? Is your matrix stored rowmajor or columnmajor? — chtz, Sep 23 '17 at 18:41
@chtz 1024 columns - that are features. In another case 52 columns. Both are fixed and are multiple of 4. And 20 is the batch size. I chose it, but it can't be large. And there are many such multiplicaton iterations. — Brans Ds, Sep 23 '17 at 19:03
I'm probably misunderstanding what you want to do. If you want element-by-element multiplication for each row, then your matrix should have as many columns as the vector has elements, shouldn't it? (If that's not what you want, please write some pseudo code, or the code you are currently using) — chtz, Sep 23 '17 at 19:08
@chtz sory my copypaste mistake. I have corrected the question. vector size should be 1024 if we have 1024 colums matrix. — Brans Ds, Sep 23 '17 at 19:10
It's memory bandwidth bound so I don't see that you have much to gain other than doing it with an obvious solution and compiling with vectorization enabled i.e. more effort is premature optimization. — Z boson, Sep 28 '17 at 08:41

chtz · Answer 1 · 2017-09-23T19:16:04.050

2

Using the Eigen matrix library, what you are doing is essentially multiplying by a diagonal matrix. If you have a matrix of arbitrary many rows and 20 columns, you can write the following (not really worth making a function for that):

void multRows(Eigen::Matrix<double, Eigen::Dynamic, 20>& mat,
              const Eigen::Matrix<double,20,1>& vect)
{
    mat = mat * vect.asDiagonal();
}

Eigen does generate AVX2 code if it is enabled by the compiler. You may want to experiment if it is more efficient to store mat row major or column major in your use case.

Addendum (due to edited question): If you have (much) more than 20 columns, you should just use dynamic sized matrices all-together:

void multRows(Eigen::MatrixXd& mat, const Eigen::VectorXd& vect)
{
    mat = mat * vect.asDiagonal();
}

edited Sep 23 '17 at 19:16

answered Sep 23 '17 at 18:53

chtz

17,329
4
26
56

mathematically you are correct but: 1) Extra memory allocation for the diagonal matrix is expensive(all these zeroes) 2) Matrix multiplication is much more expensive operation then elementwise multiplication(because of cache misses). So I almost sure this will be slower than my current approach – Brans Ds Sep 23 '17 at 19:17
Eigen won't allocate a full matrix to store the diagonal (in fact no allocations at all will happen for that multiplication). This gets essentially optimized to two loops of elements-wise scalar products, exploiting SIMD if possible. – chtz Sep 23 '17 at 19:19
You are right this approach it fast. At the same time mat.array().rowwise() *= vect.array(); is the same or slightly faster. And is 15 % faster than for loop + mkl v?Mul – Brans Ds Sep 24 '17 at 06:52
Theoretically, the `mat.array().rowwise() *= vect.array();` line should generate exactly the same code on a (reasonably well-) optimizing compiler (that means you should use the variant which is more readable for you). But for some expressions you will undoubtly experience different results with different compilers (or different versions of Eigen). – chtz Sep 25 '17 at 21:48

score 2 · Answer 2 · answered Sep 30 '17 at 13:41

Most of the recent processors support AVX technology. It provides a vector containing 4 doubles (256-bit registers). Thus, a solution for this optimization might be using AVX. For this, I've implemented it using x86intrin.h library which is a part of GCC compiler. I also used OpenMP to make the solution multi-threaded.

//gcc -Wall  -fopenmp -O2 -march=native -o "MatrixVectorMultiplication" "MatrixVectorMultiplication.c" 
//gcc 7.2, Skylake Corei7-6700 HQ
//The performance improvement is significant (5232 Cycle in my machine) but MKL is not available to test
#include <stdio.h>
#include <x86intrin.h>
double A[20][1024] __attribute__(( aligned(32))) = {{1.0, 2.0, 3.0, 3.5, 1.0, 2.0, 3.0, 3.5}, {4.0, 5.0, 6.0, 6.5,4.0, 5.0, 6.0, 6.5},{7.0, 8.0, 9.0, 9.5, 4.0, 5.0, 6.0, 6.5 }};//The 32 is for 256-bit registers of AVX
double B[1024]  __attribute__(( aligned(32))) = {2.0, 2.0, 2.0, 2.0, 3.0, 3.0, 3.0, 3.0 }; //the vector
double C[20][1024] __attribute__(( aligned(32)));//the results are stored here

int main()
{
    int i,j;
    __m256d vec_C1, vec_C2, vec_C3, vec_C4;

    //begin_rdtsc
    //get the start time here
    #pragma omp parallel for
    for(i=0; i<20;i++){
        for(j=0; j<1024; j+=16){

            vec_C1 = _mm256_mul_pd(_mm256_load_pd(&A[i][j]), _mm256_load_pd(&B[j]));
            _mm256_store_pd(&C[i][j], vec_C1);

            vec_C2 = _mm256_mul_pd(_mm256_load_pd(&A[i][j+4]), _mm256_load_pd(&B[j+4]));
            _mm256_store_pd(&C[i][j+4], vec_C2);

            vec_C3 = _mm256_mul_pd(_mm256_load_pd(&A[i][j+8]), _mm256_load_pd(&B[j+8]));
            _mm256_store_pd(&C[i][j+8], vec_C3);

            vec_C4 = _mm256_mul_pd(_mm256_load_pd(&A[i][j+12]), _mm256_load_pd(&B[j+12]));
            _mm256_store_pd(&C[i][j+12], vec_C4);

        }
    }
    //end_rdtsc
    //calculate the elapsead time

    //print the results
    for(i=0; i<20;i++){
        for(j=0; j<1024; j++){
            //printf(" %lf", C[i][j]);
        }
        //printf("\n");
    }

    return 0;
}

Multiply rows of the matrix by a vector (low-level optimization)?

2 Answers2