CUDA reduction, approach for big arrays

Question

I have the following "Frankenstein" sum reduction code, taken partly from the common CUDA reduction slides and partly from the CUDA samples.

__global__ void reduce6(float *g_idata, float *g_odata, unsigned int n)
{
    extern __shared__ float sdata[];

    // perform first level of reduction,
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockSize*2 + threadIdx.x;
    unsigned int gridSize = blockSize*2*gridDim.x;
    sdata[tid] = 0;
    float mySum = 0;   

    while (i < n) { 
        sdata[tid] += g_idata[i] + g_idata[i+MAXTREADS]; 
        i += gridSize; 
    }
   __syncthreads();


    // do reduction in shared mem
    if (tid < 256)
        sdata[tid] += sdata[tid + 256];
    __syncthreads();

    if (tid < 128)
        sdata[tid] +=  sdata[tid + 128];
     __syncthreads();

    if (tid <  64)
       sdata[tid] += sdata[tid +  64];
    __syncthreads();


#if (__CUDA_ARCH__ >= 300 )
    if ( tid < 32 )
    {
        // Fetch final intermediate sum from 2nd warp
        mySum = sdata[tid]+ sdata[tid + 32];
        // Reduce final warp using shuffle
        for (int offset = warpSize/2; offset > 0; offset /= 2) 
            mySum += __shfl_down(mySum, offset);
    }
    // Only thread 0 holds the block total after the shuffle loop
    if (tid == 0) sdata[0] = mySum;
#else

    // fully unroll reduction within a single warp
    if (tid < 32) {
       sdata[tid] += sdata[tid + 32];
       sdata[tid] += sdata[tid + 16];
       sdata[tid] += sdata[tid + 8];
       sdata[tid] += sdata[tid + 4];
       sdata[tid] += sdata[tid + 2];
       sdata[tid] += sdata[tid + 1];
    }
#endif
    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

I will be using this to reduce a large flattened array (e.g. 512^3 = 134217728 = n) on a Tesla K40 GPU.

I have some questions regarding the blockSize variable, and its value.

From here on, I will try to explain my understanding (either right or wrong) on how it works:

The bigger I choose blockSize, the faster this code will execute, since it spends less time in the while loop. However, it will not finish reducing the whole array; it will return a smaller array of size dimBlock.x, right? If I used blockSize=1, this code would return the reduced value in a single call, but it would be really slow because it would barely exploit the power of CUDA at all. Therefore I need to call the reduction kernel several times, each time with a smaller blockSize, reducing the result of the previous call, until I get down to the smallest size.

Something like this (pseudocode):

blocks=number; //where do we start? why?
while(not the min){

    dim3 dimBlock( blocks );
    dim3 dimGrid(n/dimBlock.x);
    int smemSize = dimBlock.x * sizeof(float);
    reduce6<<<dimGrid, dimBlock, smemSize>>>(in, out, n);

    in=out;

    n=dimGrid.x; 
    dimGrid.x=n/dimBlock.x; // is this right? Should I also change dimBlock?
}

What value should I start with? I guess this is GPU dependent. What should the values be for a Tesla K40 (just so I can understand how these values are chosen)?

Is my logic flawed somehow? If so, how?

score 1 · Accepted Answer · edited May 23 '17 at 12:15

There is a CUDA tool that finds good grid and block sizes for you: the CUDA Occupancy API.

In response to "The bigger I choose blockSize, the faster this code will execute" -- Not necessarily, as you want the sizes which give max occupancy (the ratio of active warps to the total number of possible active warps).
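As a rough sketch of how the Occupancy API could be used here, the code below asks `cudaOccupancyMaxPotentialBlockSize` for a suggested block size and a minimum grid size, then reduces the array in two launches. The kernel `blockSum` and the helper `sumOnDevice` are hypothetical names made up for illustration, not the `reduce6` kernel from the question. The kernel is assumed to use a grid-stride loop so that the grid size does not depend on `n`. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel (not reduce6 from the question). Each block writes one
// partial sum; blockDim.x must be a power of two no larger than 1024.
__global__ void blockSum(const float *in, float *out, unsigned int n)
{
    __shared__ float sdata[1024];
    unsigned int tid = threadIdx.x;

    // Grid-stride loop: each thread accumulates as many elements as needed,
    // so the number of blocks does not have to match n.
    float sum = 0.0f;
    for (unsigned int i = blockIdx.x * blockDim.x + tid; i < n;
         i += blockDim.x * gridDim.x)
        sum += in[i];
    sdata[tid] = sum;
    __syncthreads();

    // Tree reduction in shared memory.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = sdata[0];
}

// Hypothetical host helper: sums a host vector on the GPU in two launches.
float sumOnDevice(const std::vector<float> &host)
{
    unsigned int n = static_cast<unsigned int>(host.size());

    // Ask the runtime for a block size that maximizes occupancy for this
    // kernel, and the smallest grid that can reach that occupancy.
    int minGridSize = 0, blockSize = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, blockSum, 0, 0);

    // Assumption of this sketch: the tree reduction needs a power of two,
    // so round the suggestion down.
    int threads = 1;
    while (threads * 2 <= blockSize) threads *= 2;
    int blocks = minGridSize;

    float *dIn = nullptr, *dPartial = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dPartial, blocks * sizeof(float));
    cudaMalloc(&dOut, sizeof(float));
    cudaMemcpy(dIn, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // Pass 1: n elements -> one partial sum per block.
    blockSum<<<blocks, threads>>>(dIn, dPartial, n);
    // Pass 2: a single block reduces the partial sums to one value.
    blockSum<<<1, threads>>>(dPartial, dOut, blocks);

    float result = 0.0f;
    cudaMemcpy(&result, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dPartial);
    cudaFree(dOut);
    return result;
}
```

With a grid-stride kernel like this, two launches are enough regardless of `n`, so there is no need to loop over progressively smaller launches. Rounding the block size down may leave it below the suggested value, so the profiler is still the final judge.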

For additional information, see this answer: How do I choose grid and block dimensions for CUDA kernels?

Lastly, for NVIDIA GPUs of the Kepler generation or later, there are shuffle intrinsics that make reductions easier and faster. Here is an article on how to use them: Faster Parallel Reductions on Kepler.
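A minimal sketch of a shuffle-based block reduction, along the lines of what that article describes, might look like the following. It uses `__shfl_down_sync`, the form required by newer toolkits (CUDA 9 and later), whereas the code in the question uses the older `__shfl_down`. The names `warpReduceSum`, `shuffleSum` and `shuffleSumOnDevice`, and the launch configuration of 64 blocks of 256 threads, are illustrative assumptions only. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Sums val across the 32 lanes of a warp; lane 0 ends up with the total.
__inline__ __device__ float warpReduceSum(float val)
{
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffffu, val, offset);
    return val;
}

// Hypothetical kernel: one partial sum per block.
// Assumes blockDim.x is a multiple of 32 and at most 1024.
__global__ void shuffleSum(const float *in, float *out, unsigned int n)
{
    __shared__ float warpSums[32];          // one slot per warp in the block
    unsigned int lane = threadIdx.x % warpSize;
    unsigned int warp = threadIdx.x / warpSize;

    // Grid-stride accumulation into a register.
    float sum = 0.0f;
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        sum += in[i];

    // Step 1: reduce within each warp using shuffles only.
    sum = warpReduceSum(sum);
    if (lane == 0) warpSums[warp] = sum;
    __syncthreads();

    // Step 2: the first warp reduces the per-warp totals.
    unsigned int numWarps = blockDim.x / warpSize;
    if (warp == 0) {
        sum = (lane < numWarps) ? warpSums[lane] : 0.0f;
        sum = warpReduceSum(sum);
        if (lane == 0) out[blockIdx.x] = sum;
    }
}

// Hypothetical host helper with an arbitrary launch configuration.
float shuffleSumOnDevice(const std::vector<float> &host)
{
    const int threads = 256, blocks = 64;   // illustrative values only
    unsigned int n = static_cast<unsigned int>(host.size());

    float *dIn = nullptr, *dPartial = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dPartial, blocks * sizeof(float));
    cudaMalloc(&dOut, sizeof(float));
    cudaMemcpy(dIn, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    shuffleSum<<<blocks, threads>>>(dIn, dPartial, n);      // n -> 64 partials
    shuffleSum<<<1, threads>>>(dPartial, dOut, blocks);     // 64 -> 1

    float result = 0.0f;
    cudaMemcpy(&result, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dPartial);
    cudaFree(dOut);
    return result;
}
```

Only lane 0 of the first warp writes the block result, because after the shuffle loop the other lanes hold partial values rather than the total.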

Update for choosing number of threads:

You might not want to use the maximum number of threads if it results in less efficient use of the registers. From the link on occupancy:

For purposes of calculating occupancy, the number of registers used by each thread is one of the key factors. For example, devices with compute capability 1.1 have 8,192 32-bit registers per multiprocessor and can have a maximum of 768 simultaneous threads resident (24 warps x 32 threads per warp). This means that in one of these devices, for a multiprocessor to have 100% occupancy, each thread can use at most 10 registers. However, this approach of determining how register count affects occupancy does not take into account the register allocation granularity. For example, on a device of compute capability 1.1, a kernel with 128-thread blocks using 12 registers per thread results in an occupancy of 83% with 5 active 128-thread blocks per multi-processor, whereas a kernel with 256-thread blocks using the same 12 registers per thread results in an occupancy of 66% because only two 256-thread blocks can reside on a multiprocessor.

So, the way I understand it, an increased number of threads has the potential to limit performance because of the way registers are allocated. However, this is not always the case, and you need to do the calculation (as in the quote above) yourself to determine the optimal number of threads per block.
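The arithmetic in the quoted passage can be reproduced with a few lines of host-only code. This is a deliberately simplified model that assumes registers and the resident-thread limit are the only constraints, and that registers are handed out a whole block at a time. Real devices have further limits (shared memory, maximum blocks per multiprocessor, allocation units). The function names are made up for this example, and the constants are the compute capability 1.1 figures from the quote.

```cuda
// Figures for compute capability 1.1, taken from the quote above.
constexpr int kRegsPerSM = 8192;        // 32-bit registers per multiprocessor
constexpr int kMaxThreadsPerSM = 768;   // 24 warps x 32 threads per warp

// Number of whole blocks that fit on one multiprocessor in this model.
int residentBlocks(int threadsPerBlock, int regsPerThread)
{
    int byRegisters = kRegsPerSM / (threadsPerBlock * regsPerThread);
    int byThreads = kMaxThreadsPerSM / threadsPerBlock;
    return byRegisters < byThreads ? byRegisters : byThreads;
}

// Occupancy as a whole-number percentage (rounded down).
int occupancyPercent(int threadsPerBlock, int regsPerThread)
{
    int activeThreads = residentBlocks(threadsPerBlock, regsPerThread)
                        * threadsPerBlock;
    return 100 * activeThreads / kMaxThreadsPerSM;
}
```

In this model, `occupancyPercent(128, 12)` gives 83 (5 resident blocks) and `occupancyPercent(256, 12)` gives 66 (2 resident blocks), matching the quoted figures. With 10 registers per thread both block sizes reach 100. For current GPUs such as the K40, the Occupancy API or the occupancy calculator does this bookkeeping with the real hardware limits.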

Thank you, I'll have a read. When you mention the Kepler architecture, isn't that what I do with the `#if (__CUDA_ARCH__ >= 300 )` part? — Ander Biguri, Jan 27 '16 at 15:28
Sorry, I didn't see that! Yes, what you do there is correct! The articles should clarify! — RobClucas, Jan 27 '16 at 15:32
Still confused. Why would I choose a blockSize smaller than the maximum in this case? That's definitely the one that will give maximum occupancy. — Ander Biguri, Jan 27 '16 at 16:34
@AnderBiguri, I tried to clarify a bit, please let me know if there is any part which I should try to explain better, as I did not have enough time to go into more detail. I can also try to post some examples with the profiler if that would help. — RobClucas, Jan 29 '16 at 09:42
