Relying on NVIDIA's samples and on some good advice found here at SO, I have been managing to implement a few array-reduction kernels that I need for my project. However, one particular issue remains causing me trouble. It is, how to properly do sum-reduction for arrays of unsigned chars (uchar).
Because uchar can hold values from 0 to 255, of course the thread blocks can't accumulate a value greater than 255 per thread block. My intuition was that it would be merely a case of collecting the sums inside the sum-reduction function in an int despite the input being uchar. However, it does not work.
Let me show in detail what I have. Below is my kernel to sum-reduce an array of uchar - it is a slighly modified version of the famous reduce6 function in NVIDIA's samples:
template <class T, unsigned int blockSize>
__global__ void reduce6(int n, T *g_idata, int *g_odata)
{
    extern __shared__ T sdata[];
    // perform first level of reduction,
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockSize * 2 + threadIdx.x;
    unsigned int gridSize = blockSize * 2 * gridDim.x;
    int mySum = 0;
    // we reduce multiple elements per thread.  The number is determined by the
    // number of active thread blocks (via gridDim).  More blocks will result
    // in a larger gridSize and therefore fewer elements per thread
    while (i < n)
    {
        mySum += g_idata[i];
        // ensure we don't read out of bounds
        if (i + blockSize < n) mySum += g_idata[i + blockSize];
        i += gridSize;
    }
    // each thread puts its local sum into shared memory
    sdata[tid] = mySum;
    __syncthreads();
    // do reduction in shared mem
    if ((blockSize >= 512) && (tid < 256))
        sdata[tid] = mySum = mySum + sdata[tid + 256];
    __syncthreads();
    if ((blockSize >= 256) && (tid < 128))
        sdata[tid] = mySum = mySum + sdata[tid + 128];
     __syncthreads();
    if ((blockSize >= 128) && (tid <  64))
        sdata[tid] = mySum = mySum + sdata[tid + 64];
    __syncthreads();
    // fully unroll reduction within a single warp
    if ((blockSize >= 64) && (tid < 32))
        sdata[tid] = mySum = mySum + sdata[tid + 32];
    __syncthreads();
    if ((blockSize >= 32) && (tid < 16))
        sdata[tid] = mySum = mySum + sdata[tid + 16];
    __syncthreads();
    if ((blockSize >= 16) && (tid <  8))
        sdata[tid] = mySum = mySum + sdata[tid + 8];
    __syncthreads();
    if ((blockSize >= 8) && (tid <  4))
        sdata[tid] = mySum = mySum + sdata[tid + 4];
    __syncthreads();
    if ((blockSize >= 4) && (tid <  2))
        sdata[tid] = mySum = mySum + sdata[tid + 2];
    __syncthreads();
    if ((blockSize >= 2) && (tid <  1))
        mySum += sdata[tid + 1];
    __syncthreads();
    // write result for this block to global mem
    if (tid == 0)  atomicAdd(g_odata, mySum);
}
When such kernel is called by using reduce6<uchar, Blocksize> such that Blocksize*num.threads = 256, everything works properly and the sum-reduction gets the right result. Whenever such ratio is not 256, the result of the sum-reduction becomes wrong - which is merely due to what I said in the bebinning, i.e. uchar can't acumulate values greater than 255.
To me, the intuitive solution would be to simply change the line:
extern __shared__ T sdata[];
To:
extern __shared__ int sdata[];
Since sdata is a shared array created within the sum-reduction kernel, I thought that it could be of any type and thus properly accumulate whatever values result from the thread-block summation. Maybe, to make it sure, I even wrote the while loop with an explicit conversion of the income data into int:
    while (i < n)
    {
        mySum += (int)g_idata[i];
        // ensure we don't read out of bounds
        if (i + blockSize < n) mySum += (int)g_idata[i + blockSize];
        i += gridSize;
    }
However, to my surprise, all tha only makes the sum-reduction result to be always zero.
What am I missing? How could I alter such kernel to make it so that the uchar array being passed can be properly sum-reduced with arbitrary number of thread-blocks and threads?
If needed, a full example code can be found at: http://pastebin.com/nq1VRJCs
 
     
     
    