I have the following "Frankenstein" sum reduction code, taken partly from the common CUDA reduction slices, partly from the CUDA samples.
__global__ void reduce6(float *g_idata, float *g_odata, unsigned int n)
{
extern __shared__ float sdata[];
// perform first level of reduction,
// reading from global memory, writing to shared memory
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockSize*2 + threadIdx.x;
unsigned int gridSize = blockSize*2*gridDim.x;
sdata[tid] = 0;
float mySum = 0;
while (i < n) {
sdata[tid] += g_idata[i] + g_idata[i+MAXTREADS];
i += gridSize;
}
__syncthreads();
// do reduction in shared mem
if (tid < 256)
sdata[tid] += sdata[tid + 256];
__syncthreads();
if (tid < 128)
sdata[tid] += sdata[tid + 128];
__syncthreads();
if (tid < 64)
sdata[tid] += sdata[tid + 64];
__syncthreads();
#if (__CUDA_ARCH__ >= 300 )
if ( tid < 32 )
{
// Fetch final intermediate sum from 2nd warp
mySum = sdata[tid]+ sdata[tid + 32];
// Reduce final warp using shuffle
for (int offset = warpSize/2; offset > 0; offset /= 2)
mySum += __shfl_down(mySum, offset);
}
sdata[0]=mySum;
#else
// fully unroll reduction within a single warp
if (tid < 32) {
sdata[tid] += sdata[tid + 32];
sdata[tid] += sdata[tid + 16];
sdata[tid] += sdata[tid + 8];
sdata[tid] += sdata[tid + 4];
sdata[tid] += sdata[tid + 2];
sdata[tid] += sdata[tid + 1];
}
#endif
// write result for this block to global mem
if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
I will be using this to reduce an unrolled array of big size (e.g. 512^3 = 134217728 = n) on a Tesla k40 GPU.
I have some questions regarding the blockSize variable, and its value.
From here on, I will try to explain my understanding (either right or wrong) on how it works:
The bigger I choose blockSize, the faster this code will execute, as it will spend less time in the whole loop, but it will not finish reducing the whole array, but it will return a smaller array of size dimBlock.x, right? If I use blockSize=1 this code would return in 1 call the reduction value, but it will be really slow because its not exploiting the power of CUDA almost anything. Therefore I need to call the reduction kernel several times, each of the time with a smaller blokSize, and reducing the result of the previous call to reduce, until I get to the smallest point.
something like (pesudocode)
blocks=number; //where do we start? why?
while(not the min){
dim3 dimBlock( blocks );
dim3 dimGrid(n/dimBlock.x);
int smemSize = dimBlock.x * sizeof(float);
reduce6<<<dimGrid, dimBlock, smemSize>>>(in, out, n);
in=out;
n=dimGrid.x;
dimGrid.x=n/dimBlock.x; // is this right? Should I also change dimBlock?
}
In which value should I start? I guess this is GPU dependent. Which values shoudl it be for a Tesla k40 (just for me to understand how this values are chosen)?
Is my logic somehow flawed? how?