What is the issue for my performance difference with openmp between an array of pointers and a pointer to an array?

Question

I wrote two programs in C which are doing a tall-skinny-matrix-matrix multiplication with openmp. The algorithm is memory bounded for my machine. For one of the codes I used and array of pointers (aop) for storing the matrices. For the other code I used just on array where the rows of the matrix are stored one after another, called pta from now on. Now I observed that pta always outperforms the aop version. Especially when using 12 instead of 6 cores, the performance for aop goes slightly down where the performance for pta doubles. I can't really explain this behavior, I just assume that the cores are somehow interfering during computation. Does somebody can explain the behavior?

Pointer to array version:

int main(int argc, char *argv[])
{
// parallel region to verify that pinning works correctly
#pragma omp parallel
  {
    printf("OpenMP thread %d / %d runs on core %d\n", omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
  }

  //define dimensions
  int dim_n=atoi(*(argv+1));
  int dim_nb=2;
  printf("n = %d, nb = %d\n",dim_n,dim_nb);

  //allocate space for matrix M, V and W
  //each element of **M is a pointer for the first element of an array
  //size of double and double* is depending on compiler and machine

  double *M = malloc((dim_nb*dim_nb) * sizeof(double));

  //Initialize Matrix M
  for(int i=0; i<dim_nb; i++)
  {
    for(int j=0; j<dim_nb; j++)
    {
      M[i*dim_nb+j]=((i+1)-1.0)*dim_nb+(j+1)-1.0;
    }
  }

  double *V = malloc((dim_n*dim_nb) * sizeof(double));
  double *W = malloc((dim_n*dim_nb) * sizeof(double));


// using parallel region to Initialize the matrix V
#pragma omp parallel for schedule(static)
  for (int i=0; i<dim_n; i++)
  {
    for (int j=0; j<dim_nb; j++)
    {
      V[i*dim_nb+j]=j+1;
    }
  }

  int max_iter=100;
  double time = omp_get_wtime();

  // calculate the matrix-matrix product VM product max_iter times
  for(int iter=0; iter<max_iter; iter++)
  {
  // calculate matrix-matrix product in parallel
#pragma omp parallel for schedule(static)
    // i < #rows of V
    for(int i=0; i<dim_n; i++)
    {
      // j < #columns of M
      for(int j=0; j<dim_nb; j++)
      {
        // Initialize W_ij with zero, everytime W_ij is calculated
        W[i*dim_nb+j]=0;
        // k < #colums of V = rows of M
        for(int k=0; k<dim_nb; k++)
        {
          W[i*dim_nb+j] += V[i*dim_nb+k]*M[k*dim_nb+j];
        }
      }
    }
  }
  time=omp_get_wtime()-time;
'''

Array of pointers version:

int main(int argc, char *argv[])
{
// parallel region to verify that pinning works correctly
#pragma omp parallel
  {
    printf("OpenMP thread %d / %d runs on core %d\n", omp_get_thread_num(), omp_get_num_threads(), sched_getcpu());
  }

  //define dimensions
  int dim_n=atoi(*(argv+1));
  int dim_nb=2;
  printf("n = %d, nb = %d\n",dim_n,dim_nb);

  //allocate space for matrix M, V and W
  // each element of **M is a pointer for the first element of an array
  //size of double and double* is depending on compiler and machine
  double **M = malloc(dim_nb * sizeof(double *));
  for(int i = 0; i < dim_nb; i++)
  {
    M[i] = malloc(dim_nb * sizeof(double));
  }


  //Initialize Matrix 
  for(int i=0; i<dim_nb; i++)
  {
    for(int j=0; j<dim_nb; j++)
    {
      M[i][j]=((i+1)-1.0)*dim_nb+(j+1)-1.0;
    }
  }

    double **V = malloc(dim_n * sizeof(double *));
    for(int i=0; i<dim_n; i++)
  {
    V[i] = malloc(dim_nb * sizeof(double));
  }

  double **W = malloc(dim_n * sizeof(double *));
    for(int i=0; i<dim_n; i++)
  {
    W[i] = malloc(dim_nb * sizeof(double));
  }


// using parallel region to Initialize the matrix V
#pragma omp parallel for schedule(static)
  for (int i=0; i<dim_n; i++)
  {
    for (int j=0; j<dim_nb; j++)
    {
      V[i][j]=j+1;
    }
  }

  int max_iter=100;
  double time = omp_get_wtime();

  // calculate the matrix-matrix product VM product max_iter times
  for(int iter=0; iter<max_iter; iter++)
  {
  // calculate matrix-matrix product in parallel
#pragma omp parallel for schedule(static)
    // i < #rows of V
    for(int i=0; i<dim_n; i++)
    {
      // j < #columns of M
      for(int j=0; j<dim_nb; j++)
      {
        // Initialize W_ij with zero, everytime W_ij is calculated
        W[i][j]=0;
        // k < #colums of V = rows of M
        for(int k=0; k<dim_nb; k++)
        {
          W[i][j] += V[i][k]*M[k][j];
        }
      }
    }
  }
  time=omp_get_wtime()-time;

Duplicate: [Correctly allocating multi-dimensional arrays](https://stackoverflow.com/questions/42094465/correctly-allocating-multi-dimensional-arrays). — Lundin, Nov 21 '19 at 12:12

score 3 · Answer 1 · answered Nov 21 '19 at 11:19

3

It is quite easy to explain as pointer version has to access pointer first then dereference this pointer. Those memory location can be very far fro each other and it is much more likely cache to be flushed as well. Data in the array is stored in one memory chunk so less memory accesses are needed and it is more likely CPU to do not miss the cache.

https://godbolt.org/z/c_8c7c

answered Nov 21 '19 at 11:19

0___________

60,014
4
34
74

Why is the cashe more likely to be flushed? – Robin Nov 21 '19 at 11:41
@Robin Due to cache eviction I would assume. – Max Langhof Nov 21 '19 at 11:43

What is the issue for my performance difference with openmp between an array of pointers and a pointer to an array?

1 Answers1