I am new to C++/CUDA. I tried implementing the parallel algorithm "reduce" so that it can handle any input size and any number of threads without increasing the asymptotic parallel runtime, by recursing over the output of the kernel (in the kernel wrapper).
For example, in the top answer to Implementing Max Reduce in Cuda, the implementation essentially becomes sequential when the number of threads is small enough relative to the input size.
However, I keep getting a segmentation fault when I compile and run it:
>> nvcc -o mycode mycode.cu
>> ./mycode
Segmentation fault.
This was compiled and run on a K40 with CUDA 6.5.
Here is the kernel. It is basically the same as in the SO post I linked, except that the check for "out of bounds" is different:
#include <stdio.h>
/* -------- KERNEL -------- */
__global__ void reduce_kernel(float * d_out, float * d_in, const int size)
{
  // position and threadId
  int pos = blockIdx.x * blockDim.x + threadIdx.x;
  int tid = threadIdx.x;
  // do reduction in global memory
  for (unsigned int s = blockDim.x / 2; s>0; s>>=1)
  {
    if (tid < s)
    {
      if (pos+s < size) // Handling out of bounds
      {
        d_in[pos] = d_in[pos] + d_in[pos+s];
      }
    }
  }
  // only thread 0 writes this block's result
  if (tid==0)
  {
    d_out[blockIdx.x] = d_in[pos];
  }
}
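To trace the intended tree reduction: with blockDim.x = 8, s takes the values 4, 2, 1. First, threads 0-3 each add the element 4 positions ahead of them, then threads 0-1 add the element 2 positions ahead, and finally thread 0 adds its immediate neighbour, so thread 0 should end up holding the sum of its block's 8 elements (when everything is in bounds).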
Here is the kernel wrapper I mentioned, which handles the case where one block cannot contain all of the data:
/* -------- KERNEL WRAPPER -------- */
void reduce(float * d_out, float * d_in, const int size, int num_threads)
{
  // setting up blocks and intermediate result holder
  int num_blocks = ((size) / num_threads) + 1;
  float * d_intermediate;
  cudaMalloc(&d_intermediate, sizeof(float)*num_blocks);
  // reducing recursively; this runs approximately log base num_threads of size times
  do
  {
    reduce_kernel<<<num_blocks, num_threads>>>(d_intermediate, d_in, size);
    // updating input to intermediate
    cudaMemcpy(d_in, d_intermediate, sizeof(float)*num_blocks, cudaMemcpyDeviceToDevice);
    // Updating num_blocks to reflect how many blocks we now want to compute on
    num_blocks = num_blocks / num_threads + 1;
    // updating intermediate
    cudaMalloc(&d_intermediate, sizeof(float)*num_blocks);
  }
  while(num_blocks > num_threads); // if it is too small, compute rest.
  // computing rest
  reduce_kernel<<<1, num_blocks>>>(d_out, d_in, size);
}
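Side note: this version has no error checking at all. For reference, the usual pattern I have seen for checking CUDA API calls and kernel launches looks roughly like the sketch below; the macro name gpuErrchk is just a placeholder and is not part of my code.
/* -------- ERROR CHECKING (sketch, not in my code) -------- */
#include <stdio.h>
#include <stdlib.h>
// Wraps any call returning cudaError_t and aborts with file/line info on failure.
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char * file, int line)
{
  if (code != cudaSuccess)
  {
    fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
    exit(code);
  }
}
// usage: gpuErrchk(cudaMalloc(&d_intermediate, sizeof(float)*num_blocks));
// after a kernel launch: gpuErrchk(cudaGetLastError());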
Finally, the main program initializes the in/out buffers and creates some bogus data for testing:
/* -------- MAIN -------- */
int main(int argc, char **argv)
{
  // Setting num_threads
  int num_threads = 512;
  // Making bogus data and setting it on the GPU
  const int size = 1024;
  const int size_out = 1;
  float * d_in;
  float * d_out;
  cudaMalloc(&d_in, sizeof(float)*size);
  cudaMalloc(&d_out, sizeof(float)*size_out);
  const int value = 5;
  cudaMemset(d_in, value, sizeof(float)*size);
  // Running kernel wrapper
  reduce(d_out, d_in, size, num_threads);
  printf("sum of elements is: %.f\n", d_out[0]);
}
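To trace the launch configuration for these values: size = 1024 and num_threads = 512 give num_blocks = 1024/512 + 1 = 3, so the do-loop launches reduce_kernel<<<3, 512>>> once, copies 3 floats back into d_in, and recomputes num_blocks = 3/512 + 1 = 1; since 1 > 512 is false, the loop exits and the final call is reduce_kernel<<<1, 1>>>(d_out, d_in, size). As far as I can tell, that should write the final sum into d_out[0].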