How to optimize my C++ OpenMp Matrix Multiplication code

Question

I have written a C++ OpenMp Matrix Multiplication code that multiplies two 1000x1000 matrices. So far I have gotten a 0.700 sec execution time using OpenMp but I want to see if there is other ways I can make it faster using OpenMp?

I appreciate any advice or tips you can give me.

Here is my code:

#include <iostream>
#include <time.h>
#include <omp.h>

using namespace std;

void Multiply()
 {
    //initialize matrices with random numbers
    int aMatrix[1000][1000], i, j;
  for( i = 0; i < 1000; ++i)
  {for( j = 0;  j < 1000; ++j)
     {aMatrix[i][j] = rand();}
  }
    int bMatrix[1000][1000], i1, j2;
  for( i1 = 0; i1 < 1000; ++i1)
  {for( j2 = 0;  j2 < 1000; ++j2)
     {bMatrix[i1][j2] = rand();}
  }
  //Result Matrix
    int product[1000][1000]  = {0};


    for (int row = 0; row < 1000; row++) {
        for (int col = 0; col < 1000; col++) {
            // Multiply the row of A by the column of B to get the row, column of product.
            for (int inner = 0; inner < 1000; inner++) {
                product[row][col] += aMatrix[row][inner] * bMatrix[inner][col];
            }

        }
    
    }
}

int main() {

   time_t begin, end;
    time(&begin);

    Multiply();

  time(&end);
  time_t elapsed = end - begin;

  cout << ("Time measured: %ld seconds.\n", elapsed);

return 0;
}

Take a look at my answer [here](https://stackoverflow.com/a/71991373/5282154). It shows various ways of improving matrix multiplies and associated times for a 401x401 matrix of doubles. Should be faster with ints. — doug, May 01 '22 at 21:57
Your code can probably be a lot faster. (Do you have MKL, OpenBlas, Atlas, Blis installed? Just see what performance you get there.) However, not easily. All optimized implementations break all 3 of the loop into blocks. So then you have 6 factorial different choices for how you nest the loops, and 3 parameters for the block size. If you choose everything correctly, you get cache reuse, TLB, register..... But this is not easy. — Victor Eijkhout, May 01 '22 at 23:23
Please do not use vectors of vectors. This is very inefficient. Use flatten array. Please us a BLAS if possible (as mentioned in the earlier question...). If you still want to do it yourself, then please read [this](https://gist.github.com/nadavrot/5b35d44e8ba3dd718e595e40184d03f0). Be prepared to spend several days/weeks to be able to compete with a fast BLAS implementation. In the end, if you succeed to get the same performance, you should be aware that the final code will be big, barely readable and likely not portable, nor maintainable. This is why you should *really use a BLAS*! — Jérôme Richard, May 02 '22 at 01:00
Your code doesn't even compile. Not only is some of it missing (start of `Multiply` function), but what is there also contains bugs. For example you are initializing the first matrix two times instead of the second matrix. Also you are including the initialization (and probably allocation) in your measurement which seems unhelpful. The position of the `#pragma omp for` also makes no sense to me. There are more issues than that. Please make sure to post a **working** example. — paleonix, May 02 '22 at 08:56
@paleonix Sorry I am pretty new to C++. I tried to fix my code on my own but besides the other errors what other bugs do you see? I appreciate the advice and help. — Qi'ra, May 05 '22 at 07:36
1. Take the initialization of the matrices out of the `Multiply` function. The matrices should be arguments of it. 2. Now you can measure the time of the pure multiplication. As you are using OpenMP, you may want to use their [own timing capabilities](https://www.openmp.org/spec-html/5.0/openmpsu160.html) `omp_get_wtime()`. If you want something independent of OpenMP, use the `chrono` header from the C++ standard library. Specifically `std::chrono::steady_clock::now()` (see [here](https://en.cppreference.com/w/cpp/chrono/steady_clock/now)). — paleonix, May 05 '22 at 10:38
3. Allocating matrices on the stack is very convenient, but you might quickly run into a stack overflow, as the stack is limited in size. Therefore rather use a `std::vector(n_row * n_col)`. For indexing into that C++ will get a feature in the future called `mdspan`, but for now you will have to implement it yourself `product[row][col]` -> `product[row * n_col + col]`. 4. Now you can get rid of the double for loops for filling the matrices with values and use a single loop or an STL algorithm (`std::generate`) instead. — paleonix, May 05 '22 at 10:47
5. To get rid of the last C, use the `random` header instead of `rand()` like [here](https://en.cppreference.com/w/cpp/numeric/random/uniform_int_distribution). I have concentrated on the stuff around the actual multiplication here, as there is enough information about optimization in the answers and below them. My points are sorted by priority from high to low. E.g. one might argue to `rand()` is okay for a small demonstration like this, but as you seem to want to learn the language, I thought it might be of interest to you. — paleonix, May 05 '22 at 10:55
If you want to call C stuff from C++ (which is ok in this instance), at least include the C++ style header (i.e. `cstdlib`) and then use the standard namespace (i.e. `std::rand()`). — paleonix, May 05 '22 at 11:01
If you want to call C stuff from C++ (which is ok in this instance), at least include the C++ style header (i.e. `cstdlib`) as can be seen [here](https://en.cppreference.com/w/cpp/numeric/random/rand). — paleonix, May 05 '22 at 11:01

doug · Answer 1 · 2022-05-02T21:59:48.513

Here's straight c++ code that runs in .08s with ints and .14s with floats or doubles. My system is 10 years old with relatively slow memory. Good at the time but now is now.

I agree with @VictorEijkhout that the best results would be with tuned code. There has been huge amounts of work optimizing those.

#include <vector>
#include <array>
#include <random>
#include <cassert>
#include <iostream>
#include <iomanip>
#include <thread>
#include <future>
#include <chrono>

struct Timer {
    std::chrono::system_clock::time_point snapTime;
    Timer() { snapTime = std::chrono::system_clock::now(); }
    operator double() { return std::chrono::duration<double>(std::chrono::system_clock::now() - snapTime).count(); }
};

using DataType = int;
using std::array, std::vector;
constexpr int N = 1000, THREADS = 12;
static auto launchType = std::launch::async;
using Matrix = vector<array<DataType, N>>;
Matrix create_matrix() { return Matrix(N); };

Matrix product(Matrix const& v0, Matrix const& v1, double& time)
{
    Matrix ret = create_matrix();
    Matrix v2 = create_matrix();
    Timer timer;
    for (size_t r = 0; r < N; r++)      // transpose first
        for (size_t c = 0; c < N; c++)
            v2[c][r] = v1[r][c];
    // lambda to process sets of rows in separate threads
    auto do_row_set = [&v0, &v2, &ret](size_t start, size_t last) {
        for (size_t row = start; row < last; row++)
            for (size_t col = 0; col < N; col++)
               {
                DataType tmp{}; // separate tmp variable significantly improves optimization
                for (size_t col_t = 0; col_t < N; col_t++)
                    tmp += v0[row][col_t] * v2[col][col_t];
                ret[row][col] = tmp;
                }

    };

    vector<size_t> seq;
    const size_t NN = N / THREADS;
    // make a sequence of NN+1 rows from start to end
    for (size_t thread_n = 0; thread_n < N; thread_n += NN)
        seq.push_back(thread_n);
    seq.push_back(N);
    vector<std::future<void>> results; results.reserve(THREADS);
    for (size_t i = 0; i < THREADS; i++)
        results.emplace_back(std::async(launchType, do_row_set, seq[i], seq[i + 1]));
    for (auto& x : results)
        x.get();
    time = timer;
    return ret;
}

bool operator==(Matrix const& v0, Matrix const& v1)
{
    for (size_t r = 0; r < N; r++)
        for (size_t c = 0; c < N; c++)
            if (v0[r][c] != v1[r][c])
                return false;
    return true;
}

int main()
{
    auto fill = [](Matrix& v) {
        std::mt19937_64 r(1);
        std::normal_distribution dist(1.);
        for (size_t row = 0; row < N; row++)
            for (size_t col = 0; col < N; col++)
                v[row][col] = DataType(dist(r));
    };
    Matrix m1 = create_matrix(), m2 = create_matrix(), m3 = create_matrix();
    fill(m1); fill(m2);

    auto process_test = [&m1, &m2](Matrix& out) {
        const int rpt_count = 4;
        double sum = 0;
        for (int i = 0; i < rpt_count; i++)
        {
            double time;
            out = product(m1, m2, time);
            sum += time / rpt_count;
        }
        return sum;
    };

    std::cout << std::fixed << std::setprecision(4);
    double time{};
    time = process_test(m3);
    std::cout << "Threads w transpose   " << time << " secs.\n";
}

`vector` is clever and gives nice contiguous memory, but it may confuse people wanting to adapt your example to flexible inputs. I always write a class that stores the data as a single `vector` and then use an indexing function or macro or so. — Victor Eijkhout, May 02 '22 at 14:26
@VictorEijkhout It was just a technique to easily compare matrix operations with contiguous ram v non-contiguous `vector` The make array function allowed various dynamic allocations between vector rows as well. I also normally use a single vector for 2D arrays. See [this](https://stackoverflow.com/a/36123944/5282154) It's row/col memory order is the same as `vector` — doug, May 02 '22 at 14:44
@doug Your program might have an infinite-loop logic somewhere. On my 2-core 2Ghz laptop, on Windows 10 64-bit + CLang, if I set THREADS to 2, then this program runs for several Minutes and doesn't show any sign of stopping, while consuming almost 100% of CPU. — Arty, May 02 '22 at 21:11
@doug Right, I found your infinite loop place, a bug, code `for (size_t col_t = 0; col_t < N; col_t++)` is repeated two times, so having 4 nested loops in total, for N=1000 this means `1000 * 1000 * 1000 * 1000` time, which will take around half hour. — Arty, May 02 '22 at 21:16
@Arty. Oops, that's what happens when I paste in partial code. w/o checking the whole thing. Fixed. — doug, May 02 '22 at 21:58
@doug Hi doug! Thank you for your posting your solution! I ran your code and I got ```Threads w transpose 0.2534 secs.``` as the speed which is pretty fast. I would just like to ask you what improvements would you make to my code? I am new to C++ but I'm trying to optimize while keeping it simple. — Qi'ra, May 05 '22 at 07:44
@Qi'ra Hi. You can change the THREADS to a different numbers. The impact it has affects cache size for the localized multiplies with larger THREADS reducing cache memory. OTOH, to the degree it exceeds the number of CPU cores it can slow down due to thread overhead. Try different numbers. This is the simplest code I have without going to the tuned code which is complex and fast but harder to reason about when learning. — doug, May 05 '22 at 14:46

Arty · Accepted Answer · 2022-05-05T10:36:16.997

Following things can be done for speedup:

Using OpenMP for parallelizing external loop, like you did (and like I also did in my following code). Or alternatively using std::async for multi-threading like it was used in another answer.
Transpose B matrix, this will help to increase L1 cache hits, because you will read from sequential memory each B column (or row in transposed variant).
Use vectorized SIMD instructions, this will allow to do several multiplications (and additions) within one CPU cycle. Often compilers do auto-vectorization of your loops well enough through SIMD instructions without your help, but I did it explicitly in my code.
Run several independent SIMD instructions within loop. This will help to occupy whole CPU pipeline of SIMD. I did so in my code by using four SIMD registers r0, r1, r2, r3. In compilers this is usually called loop unrolling.
Align your matrix starting address on 64-bytes boundary. This will help SIMD instructions to do fast aligned read/write.
Align starting address of each matrix row on 64-bytes boundary. I did this in my code by padding each row with zeros till multiple of 64-bytes. This also helps SIMD instructions to do fast aligned read/write.

In my following code I did all 1. - 6. steps above. Memory 64-bytes alignment I did through implementing AlignmentAllocator that was used in std::vector. Also I did time measurements for float/double/int.

On my old 4-core laptop I got following time measurements for the case of 1000x1000 matrix multiplying by 1000x1000:

 float: time 0.1569 sec
double: time 0.3168 sec
   int: time 0.1565 sec

To compare my hardware capabilities I did measurements of another answer of @doug for the case of int:

Threads w transpose   0.2164 secs.

As one can see my solution is 1.4x times faster that the other answer, I guess due to memory 64-bytes alignment and maybe due to using explicit SIMD (instead of relying on compiler auto-vectorization of a loop).

To compile my program, don't forget to add -fopenmp -lgomp options (for OpenMP support) and -march=native -O3 -std=c++20 options (for SIMD support, optimizations and standard) if you're compiling under GCC/CLang, while MSVC I guess adds OpenMP automatically and doesn't need any special options (use /O2 /GL /std:c++latest for optimizations and standard in MSVC).

In my code I only implemented SSE2/SSE4/AVX/AVX2 instructions for SIMD, if you have more powerful machine you may tell me and I implement also FMA/AVX-512, they will give even twice more speed boost.

My Mul() function is quite generic, it is templated, and you just pass pointers to matrices and row/col count, so your matrices may be stored on calling side in any way (through std::vector or std::array or plain 2D array).

At start of Run() function you may change number of rows and columns if you need a bigger test. Notice that all my functions support any rows and columns, you may even multiply matrix of size 1234x2345 by 2345x3456.

Try it online!

#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <iostream>
#include <iomanip>
#include <vector>
#include <memory>
#include <string>

#include <immintrin.h>

#define USE_OPENMP 1
#define ASSERT_MSG(cond, msg) { if (!(cond)) throw std::runtime_error("Assertion (" #cond ") failed at line " + std::to_string(__LINE__) + "! Msg '" + std::string(msg) + "'."); }
#define ASSERT(cond) ASSERT_MSG(cond, "")
#if defined(_MSC_VER)
    #define IS_MSVC 1
#else
    #define IS_MSVC 0
#endif

#if USE_OPENMP
    #include <omp.h>
#endif

template <typename T, std::size_t N>
class AlignmentAllocator {
  public:
    typedef T value_type;
    typedef std::size_t size_type;
    typedef std::ptrdiff_t difference_type;
    typedef T * pointer;
    typedef const T * const_pointer;
    typedef T & reference;
    typedef const T & const_reference;

  public:
    inline AlignmentAllocator() throw() {}
    template <typename T2> inline AlignmentAllocator(const AlignmentAllocator<T2, N> &) throw() {}
    inline ~AlignmentAllocator() throw() {}
    inline pointer adress(reference r) { return &r; }
    inline const_pointer adress(const_reference r) const { return &r; }
    inline pointer allocate(size_type n);
    inline void deallocate(pointer p, size_type);
    inline void construct(pointer p, const value_type & wert);
    inline void destroy(pointer p) { p->~value_type(); }
    inline size_type max_size() const throw() { return size_type(-1) / sizeof(value_type); }
    template <typename T2> struct rebind { typedef AlignmentAllocator<T2, N> other; };
    bool operator!=(const AlignmentAllocator<T, N> & other) const { return !(*this == other); }
    bool operator==(const AlignmentAllocator<T, N> & other) const { return true; }
};

template <typename T, std::size_t N>
inline typename AlignmentAllocator<T, N>::pointer AlignmentAllocator<T, N>::allocate(size_type n) {
    #if IS_MSVC
        auto p = (pointer)_aligned_malloc(n * sizeof(value_type), N);
    #else
        auto p = (pointer)std::aligned_alloc(N, n * sizeof(value_type));
    #endif
    ASSERT(p);
    return p;
}
template <typename T, std::size_t N>
inline void AlignmentAllocator<T, N>::deallocate(pointer p, size_type) {
    #if IS_MSVC
        _aligned_free(p);
    #else
        std::free(p);
    #endif
}
template <typename T, std::size_t N>
inline void AlignmentAllocator<T, N>::construct(pointer p, const value_type & wert) {
    new (p) value_type(wert);
}

template <typename T>
using AlignedVector = std::vector<T, AlignmentAllocator<T, 64>>;

template <typename T>
struct RegT;

#ifdef __AVX__
    template <> struct RegT<float> { static size_t constexpr bisize = 256; using type = __m256; static type zero() { return _mm256_setzero_ps(); } };
    template <> struct RegT<double> { static size_t constexpr bisize = 256; using type = __m256d; static type zero() { return _mm256_setzero_pd(); } };
    
    inline void MulAddReg(float const * a, float const * b, __m256 & c) {
        c = _mm256_add_ps(c, _mm256_mul_ps(_mm256_load_ps(a), _mm256_load_ps(b)));
    }
    inline void MulAddReg(double const * a, double const * b, __m256d & c) {
        c = _mm256_add_pd(c, _mm256_mul_pd(_mm256_load_pd(a), _mm256_load_pd(b)));
    }
    
    inline void StoreReg(float * dst, __m256 const & src) { _mm256_store_ps(dst, src); }
    inline void StoreReg(double * dst, __m256d const & src) { _mm256_store_pd(dst, src); }
#else // SSE2
    template <> struct RegT<float> { static size_t constexpr bisize = 128; using type = __m128; static type zero() { return _mm_setzero_ps(); } };
    template <> struct RegT<double> { static size_t constexpr bisize = 128; using type = __m128d; static type zero() { return _mm_setzero_pd(); } };

    inline void MulAddReg(float const * a, float const * b, __m128 & c) {
        c = _mm_add_ps(c, _mm_mul_ps(_mm_load_ps(a), _mm_load_ps(b)));
    }
    inline void MulAddReg(double const * a, double const * b, __m128d & c) {
        c = _mm_add_pd(c, _mm_mul_pd(_mm_load_pd(a), _mm_load_pd(b)));
    }
    
    inline void StoreReg(float * dst, __m128 const & src) { _mm_store_ps(dst, src); }
    inline void StoreReg(double * dst, __m128d const & src) { _mm_store_pd(dst, src); }
#endif

#ifdef __AVX2__
    template <> struct RegT<int32_t> { static size_t constexpr bisize = 256; using type = __m256i; static type zero() { return _mm256_setzero_si256(); } };
    //template <> struct RegT<int64_t> { static size_t constexpr bisize = 256; using type = __m256i; static type zero() { return _mm256_setzero_si256(); } };

    inline void MulAddReg(int32_t const * a, int32_t const * b, __m256i & c) {
        c = _mm256_add_epi32(c, _mm256_mullo_epi32(_mm256_load_si256((__m256i*)a), _mm256_load_si256((__m256i*)b)));
    }
    //inline void MulAddReg(int64_t const * a, int64_t const * b, __m256i & c) {
    //    c = _mm256_add_epi64(c, _mm256_mullo_epi64(_mm256_load_si256((__m256i*)a), _mm256_load_si256((__m256i*)b)));
    //}
    
    inline void StoreReg(int32_t * dst, __m256i const & src) { _mm256_store_si256((__m256i*)dst, src); }
    //inline void StoreReg(int64_t * dst, __m256i const & src) { _mm256_store_si256((__m256i*)dst, src); }
#else // SSE2
    template <> struct RegT<int32_t> { static size_t constexpr bisize = 128; using type = __m128i; static type zero() { return _mm_setzero_si128(); } };
    //template <> struct RegT<int64_t> { static size_t constexpr bisize = 128; using type = __m128i; static type zero() { return _mm_setzero_si128(); } };

    inline void MulAddReg(int32_t const * a, int32_t const * b, __m128i & c) {
        c = _mm_add_epi32(c, _mm_mullo_epi32(_mm_load_si128((__m128i*)a), _mm_load_si128((__m128i*)b)));
    }
    //inline void MulAddReg(int64_t const * a, int64_t const * b, __m128i & c) {
    //    c = _mm_add_epi64(c, _mm_mullo_epi64(_mm_load_si128((__m128i*)a), _mm_load_si128((__m128i*)b)));
    //}
    
    inline void StoreReg(int32_t * dst, __m128i const & src) { _mm_store_si128((__m128i*)dst, src); }
    //inline void StoreReg(int64_t * dst, __m128i const & src) { _mm_store_si128((__m128i*)dst, src); }
#endif    

template <typename T>
void Mul(T const * A0, size_t A_rows, size_t A_cols, T const * B0, size_t B_rows, size_t B_cols, T * C) {
    size_t constexpr reg_cnt = RegT<T>::bisize / 8 / sizeof(T), block = 4 * reg_cnt;
    ASSERT(A_cols == B_rows);
    size_t const A_cols_aligned = (A_cols + block - 1) / block * block, B_rows_aligned = (B_rows + block - 1) / block * block;
    
    // Copy aligned A
    AlignedVector<T> Av(A_rows * A_cols_aligned);
    for (size_t i = 0; i < A_rows; ++i)
        std::memcpy(&Av[i * A_cols_aligned], &A0[i * A_cols], sizeof(Av[0]) * A_cols);
    T const * A = Av.data();
    // Transpose B
    AlignedVector<T> Bv(B_cols * B_rows_aligned);
    for (size_t j = 0; j < B_cols; ++j)
        for (size_t i = 0; i < B_rows; ++i)
            Bv[j * B_rows_aligned + i] = B0[i * B_cols + j];
    T const * Bt = Bv.data();
    ASSERT(uintptr_t(A) % 64 == 0 && uintptr_t(Bt) % 64 == 0);
    ASSERT(uintptr_t(&A[A_cols_aligned]) % 64 == 0 && uintptr_t(&Bt[B_rows_aligned]) % 64 == 0);
    
    // Multiply
    #pragma omp parallel for
    for (size_t i = 0; i < A_rows; ++i) {
        // Aligned Reg storage
        AlignedVector<T> Regs(block);
        
        for (size_t j = 0; j < B_cols; ++j) {
            T const * Arow = &A[i * A_cols_aligned + 0], * Btrow = &Bt[j * B_rows_aligned + 0];
            
            using Reg = typename RegT<T>::type;
            Reg r0 = RegT<T>::zero(), r1 = RegT<T>::zero(), r2 = RegT<T>::zero(), r3 = RegT<T>::zero();
            
            size_t const k_hi = A_cols - A_cols % block;
            
            for (size_t k = 0; k < k_hi; k += block) {
                MulAddReg(&Arow[k + reg_cnt * 0], &Btrow[k + reg_cnt * 0], r0);
                MulAddReg(&Arow[k + reg_cnt * 1], &Btrow[k + reg_cnt * 1], r1);
                MulAddReg(&Arow[k + reg_cnt * 2], &Btrow[k + reg_cnt * 2], r2);
                MulAddReg(&Arow[k + reg_cnt * 3], &Btrow[k + reg_cnt * 3], r3);
            }
            
            StoreReg(&Regs[reg_cnt * 0], r0);
            StoreReg(&Regs[reg_cnt * 1], r1);
            StoreReg(&Regs[reg_cnt * 2], r2);
            StoreReg(&Regs[reg_cnt * 3], r3);
            
            T sum1 = 0, sum2 = 0, sum3 = 0;
            for (size_t k = 0; k < Regs.size(); ++k)
                sum1 += Regs[k];
            
            //for (size_t k = 0; k < A_cols - A_cols % block; ++k) sum3 += Arow[k] * Btrow[k];
            
            for (size_t k = k_hi; k < A_cols; ++k)
                sum2 += Arow[k] * Btrow[k];
            
            C[i * A_rows + j] = sum2 + sum1;
        }
    }
}

#include <random>
#include <thread>
#include <chrono>
#include <type_traits>

template <typename T>
void Test(T const * A, size_t A_rows, size_t A_cols, T const * B, size_t B_rows, size_t B_cols, T const * C, T eps) {
    for (size_t i = 0; i < A_rows / 16; ++i)
        for (size_t j = 0; j < B_cols / 16; ++j) {
            T sum = 0;
            for (size_t k = 0; k < A_cols; ++k)
                sum += A[i * A_cols + k] * B[k * B_cols + j];
            ASSERT_MSG(std::abs(C[i * A_rows + j] - sum) <= eps * A_cols, "i " + std::to_string(i) + " j " + std::to_string(j) +
                " C " + std::to_string(C[i * A_rows + j]) + " ref " + std::to_string(sum));
        }
}

double Time() {
    static auto const gtb = std::chrono::high_resolution_clock::now();
    return std::chrono::duration_cast<std::chrono::duration<double>>(
        std::chrono::high_resolution_clock::now() - gtb).count();
}

template <typename T>
void Run() {
    size_t constexpr A_rows = 1000, A_cols = 1000, B_rows = 1000, B_cols = 1000;
    
    std::string const tname = std::is_same_v<T, float> ? "float" : std::is_same_v<T, double> ?
        "double" : std::is_same_v<T, int32_t> ? "int" : "<unknown>";
    bool const is_int = tname == "int";
    std::mt19937_64 rng{123};
    std::vector<T> A(A_rows * A_cols), B(B_rows * B_cols), C(A_rows * B_cols);
    for (size_t i = 0; i < A.size(); ++i)
        A[i] = is_int ? (int64_t(rng() % (1 << 11)) - (1 << 10)) : (T(int64_t(rng() % (1 << 28)) - (1 << 27)) / T(1 << 27));
    for (size_t i = 0; i < B.size(); ++i)
        B[i] = is_int ? (int64_t(rng() % (1 << 11)) - (1 << 10)) : (T(int64_t(rng() % (1 << 28)) - (1 << 27)) / T(1 << 27));
    auto tim = Time();
    Mul(&A[0], A_rows, A_cols, &B[0], B_rows, B_cols, &C[0]);
    tim = Time() - tim;
    std::cout << std::setw(6) << tname << ": time " << std::fixed << std::setprecision(4) << tim << " sec" << std::endl;
    Test(&A[0], A_rows, A_cols, &B[0], B_rows, B_cols, &C[0], tname == "float" ? T(1e-7) : tname == "double" ? T(1e-15) : T(0));
}

int main() {
    try {
        #if USE_OPENMP
            omp_set_num_threads(std::thread::hardware_concurrency());
        #endif
        Run<float>();
        Run<double>();
        Run<int32_t>();
        return 0;
    } catch (std::exception const & ex) {
        std::cout << "Exception: " << ex.what() << std::endl;
        return -1;
    }
}

Output:

 float: time 0.1569 sec
double: time 0.3168 sec
   int: time 0.1565 sec

Thank you for sharing your code! I am new to C++ and despite the errors in my code what are the ways you would try to optimize the code that I have if it was your code? I tried to implement your solution into mine but it's a little too advanced for me right now. — Qi'ra, May 03 '22 at 18:45
@Qi'ra If you want non-advanced solution that is quite fast too, then just transpose B matrix, for that just make a new matrix BTransposed so that for all elements `BTransposed[j][i] = B[i][j];`. And then multiply by transposed matrix - instead of your current `product[row][col] += aMatrix[row][inner] * bMatrix[inner][col];` do `product[row][col] += aMatrix[row][inner] * bTransposedMatrix[col][inner];`, i.e. use transposed matrix. Transposing a matrix should boost speed of computation greatly first by good hit-rate of L1 cache, second by doing auto-vectorization in compiler. — Arty, May 03 '22 at 19:41
Thank you so much and I will accept your reply as my answer! I just have one last question. Are there any updates you would make to the ```#pramga omp``` sections of my code? I'm am trying to parallelize my code but everywhere I am implementing it, it doesn't seem to improve the speed. I think I'm doing it wrong. — Qi'ra, May 05 '22 at 05:04
@Qi'ra It is very important to set at start of program (e.g. in `main()`) set number of OpenMP threads through `omp_set_num_threads(std::thread::hardware_concurrency());` (for this to work do `#include `), this line of code sets number of all workers threads in OpenMP to total number of cores/threads in CPU.Without this line I noticed that sometimes (or always) OpenMP doesn't set automatically number of threads to maximal possible.Secondly it is enough to write line `#pragma omp parallel for` before outer loop like I did, that's it, nothing else, it always worked in many of my projects — Arty, May 05 '22 at 09:51
@Qi'ra Extendeding on previous comment, it is also very important to specify on compilation command for GCC/CLang compilers `-fopenmp -lgomp` - without these options GCC/CLang simply ignores OpenMP code and runs all on one core/thread. On MSVC compiler probably nothing is needed, as I remember, no options. Also don't forget to look into Task Manager on Windows (Ctrl+Shift+ESC keys) or run `htop` on Linux, these two programs show CPU usage, change matrix size to something like 5000x5000 to take more time and check in these programs if ALL CPU cores are used, CPU usage should be nearly 100%. — Arty, May 05 '22 at 09:55
Don't forget to allow the compiler to optimize with e.g. `-O3`. In the best case this ill take care of loop unrolling. If the compiler is allowed to compile for a certain processor architecture, it will ideally also use SIMD instructions. If you only want to run it one your development machine, you can use `-march=native` in gcc at least. — paleonix, May 05 '22 at 10:27
@paleonix Thanks, added your note to my answer. Thought that it's obvious about need of `-O3` usage, but still added that to answer as it appeared to be non-obvious. — Arty, May 05 '22 at 10:38
To me it is, to @Qi'ra I don't know, he hasn't told us his compile command, I think. — paleonix, May 05 '22 at 10:57

How to optimize my C++ OpenMp Matrix Multiplication code

2 Answers2