CUB reduction using 2D grid of blocks

Question

I'm trying to make a sum using the CUB reduction method.

The big problem is: I'm not sure how to return the values of each block to the Host when using 2-dimensional grids.

#include <iostream>
#include <math.h>
#include <cub/block/block_reduce.cuh>
#include <cub/block/block_load.cuh>
#include <cub/block/block_store.cuh>
#include <iomanip>

#define nat 1024
#define BLOCK_SIZE 32
#define GRID_SIZE 32

struct frame
{
   int  natm;
   char  title[100];
   float conf[nat][3];
};

using namespace std;
using namespace cub;

__global__
void add(frame* s, float L, float rc, float* blocksum)
{
int i = blockDim.x*blockIdx.x + threadIdx.x;
int j = blockDim.y*blockIdx.y + threadIdx.y;

float E=0.0, rij, dx, dy, dz;

// Your calculations first so that each thread holds its result
  dx = fabs(s->conf[j][0] - s->conf[i][0]);
  dy = fabs(s->conf[j][1] - s->conf[i][1]);
  dz = fabs(s->conf[j][2] - s->conf[i][2]);
  dx = dx - round(dx/L)*L;
  dy = dy - round(dy/L)*L;
  dz = dz - round(dz/L)*L;

   rij = sqrt(dx*dx + dy*dy + dz*dz);

  if ((rij <= rc) && (rij > 0.0))
    {E =  (4*((1/pow(rij,12))-(1/pow(rij,6))));}

//  E = 1.0;
__syncthreads();
// Block wise reduction so that one thread in each block holds sum of thread results

typedef cub::BlockReduce<float, BLOCK_SIZE, BLOCK_REDUCE_RAKING, BLOCK_SIZE> BlockReduce;

__shared__ typename BlockReduce::TempStorage temp_storage;

float aggregate = BlockReduce(temp_storage).Sum(E);

if (threadIdx.x == 0 && threadIdx.y == 0)
    blocksum[blockIdx.x*blockDim.y + blockIdx.y] = aggregate;

}

int main(void)
{
  frame  * state = (frame*)malloc(sizeof(frame));

  float *blocksum = (float*)malloc(GRID_SIZE*GRID_SIZE*sizeof(float));

  state->natm = nat; //inicializando o numero de atomos;

  char name[] = "estado1";
  strcpy(state->title,name);

  for (int i = 0; i < nat; i++) {
    state->conf[i][0] = i;
    state->conf[i][1] = i;
    state->conf[i][2] = i;
  }

  frame * d_state;
  float *d_blocksum;

  cudaMalloc((void**)&d_state, sizeof(frame));

  cudaMalloc((void**)&d_blocksum, ((GRID_SIZE*GRID_SIZE)*sizeof(float)));

  cudaMemcpy(d_state, state, sizeof(frame),cudaMemcpyHostToDevice);


  dim3 dimBlock(BLOCK_SIZE,BLOCK_SIZE);
  dim3 gridBlock(GRID_SIZE,GRID_SIZE);

  add<<<gridBlock,dimBlock>>>(d_state, 3000, 15, d_blocksum);

  cudaError_t status =  cudaMemcpy(blocksum, d_blocksum, ((GRID_SIZE*GRID_SIZE)*sizeof(float)),cudaMemcpyDeviceToHost);

  float Etotal = 0.0;
  for (int k = 0; k < GRID_SIZE*GRID_SIZE; k++){
       Etotal += blocksum[k];
  }
 cout << endl << "energy: " << Etotal << endl;

  if (cudaSuccess != status)
  {
    cout << cudaGetErrorString(status) << endl;
  }

 // Free memory
  cudaFree(d_state);
  cudaFree(d_blocksum);

  return cudaThreadExit();
}

What is happening is that if the value of GRID_SIZE is the same asBLOCK_SIZE, as written above. The calculation is correct. But if I change the value of GRID_SIZE, the result goes wrong. Which leads me to think that the error is in this code:

blocksum[blockIdx.x*blockDim.y + blockIdx.y] = aggregate;

The idea here is to return a 1D array, which contains the sum of each block.

I do not intend to change the BLOCK_SIZE value, but the value of GRID_SIZE depends on the system I'm looking at, I intend to use values greater than 32 (always multiples of that).

I looked for some example that use 2D grid with CUB, but did not find.

I really new in CUDA program, maybe I'm making a mistake.

edit: I put the complete code. For comparison, when I calculate these exact values for a serial program, it gives me energy: -297,121

please provide a [mcve]. When you are asking for help on SO for a code that is not working, you are supposed to provide one. See item 1 [here](https://stackoverflow.com/help/on-topic). Also, any time you are having trouble with a CUDA code, it is good practice to use [proper CUDA error checking](https://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) and run your code with `cuda-memcheck`. Even if you don't understand the error output, it may be useful for those trying to help you. — Robert Crovella, Jun 01 '18 at 23:38

Robert Crovella · Accepted Answer · 2018-06-02T00:37:31.880

Probably the main issue is that your output indexing is not correct. Here's a reduced version of your code demonstrating correct results for arbitrary GRID_SIZE:

$ cat t1360.cu
#include <stdio.h>
#include <cub/cub.cuh>
#define BLOCK_SIZE 32
#define GRID_SIZE 25
__global__
void add(float* blocksum)
{
   float E = 1.0;
  // Block wise reduction so that one thread in each block holds sum of thread results
    typedef cub::BlockReduce<float, BLOCK_SIZE, cub::BLOCK_REDUCE_RAKING, BLOCK_SIZE> BlockReduce;

    __shared__ typename BlockReduce::TempStorage temp_storage;
    float aggregate = BlockReduce(temp_storage).Sum(E);
    __syncthreads();
    if (threadIdx.x == 0 && threadIdx.y == 0)
        blocksum[blockIdx.y*gridDim.x + blockIdx.x] = aggregate;
}

int main(){

  float *d_result, *h_result;
  h_result = (float *)malloc(GRID_SIZE*GRID_SIZE*sizeof(float));
  cudaMalloc(&d_result, GRID_SIZE*GRID_SIZE*sizeof(float));
  dim3 grid  = dim3(GRID_SIZE,GRID_SIZE);
  dim3 block = dim3(BLOCK_SIZE, BLOCK_SIZE);
  add<<<grid, block>>>(d_result);
  cudaMemcpy(h_result, d_result, GRID_SIZE*GRID_SIZE*sizeof(float), cudaMemcpyDeviceToHost);
  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess) {printf("cuda error: %s\n", cudaGetErrorString(err)); return -1;}
  float result = 0;
  for (int i = 0; i < GRID_SIZE*GRID_SIZE; i++) result += h_result[i];
  if (result != (float)(GRID_SIZE*GRID_SIZE*BLOCK_SIZE*BLOCK_SIZE)) printf("mismatch, should be: %f, was: %f\n", (float)(GRID_SIZE*GRID_SIZE*BLOCK_SIZE*BLOCK_SIZE), result);
  else printf("Success\n");
  return 0;
}

$ nvcc -o t1360 t1360.cu
$ ./t1360
Success
$

The important change I made to your kernel code was in the output indexing:

blocksum[blockIdx.y*gridDim.x + blockIdx.x] = aggregate;

We want a simulated 2D index into an array that has width and height of GRID_SIZE consisting of one float quantity per point. Therefore the width of this array is given by gridDim.x (not blockDim). The gridDim variable gives the dimensions of the grid in terms of blocks - and this lines up exactly with how our results array is set up.

Your posted code will fail if GRID_SIZE and BLOCK_SIZE are different (for example, if GRID_SIZE were smaller than BLOCK_SIZE, cuda-memcheck will show illegal accesses, and if GRID_SIZE is larger than BLOCK_SIZE then this indexing error will result in blocks overwriting each other's values in the output array) because of this mixup between blockDim and gridDim.

Also note that float operations typically only have around 5 decimal digits of precision. So small differences in the 5th or 6th decimal place may be attributable to order of operations differences when doing floating-point arithmetic. You can prove this to yourself by switching to double arithmetic.

I'm sorry for not asking very clearly. I'll pay more attention to the next posts. Thank you very much for your help, your comments helped me clarify some of the details that were confusing. — Antonio B. Oliveira Junior, Jun 02 '18 at 00:54
You need not apologize. If you are asking for help, and you are responsive to the requests that are made of you, that is pretty much all anyone can ask. If you continue to use SO you will learn the rhythm of things. — Robert Crovella, Jun 02 '18 at 01:06

CUB reduction using 2D grid of blocks

1 Answers1