I'm using CUDA 9 on a Pascal architecture, trying to implement a reasonable block reduction using warp shuffle intrinsics plus a shared memory intermediate step.
Examples I've seen on the web:
The first of those links illustrates the shuffle intrinsics with _sync and how to use __ballot_sync(), but only goes as far as a single warp reduction.
The second of those links is a Kepler-era article that doesn't use the newer _sync intrinsics, but does illustrate a full block-level reduction by staging individual warp reductions into shared memory, then reading those values back into warp 0 and doing one more warp reduction to arrive at a block reduction.
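Putting those two together, my understanding of the pattern is roughly the following (my own sketch of a sum reduction, updated to the _sync intrinsics, and assuming a full block of 1024 threads so every warp is complete and the 0xFFFFFFFF mask is valid):

__inline__ __device__ uint32_t warpReduceSum(uint32_t val)
{
    // Every lane in the warp participates, so the full mask is safe here.
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xFFFFFFFF, val, offset);
    return val;
}

__inline__ __device__ uint32_t blockReduceSum(uint32_t val)
{
    __shared__ uint32_t shared[32];        // one slot per warp
    uint32_t lane = threadIdx.x % warpSize;
    uint32_t wid  = threadIdx.x / warpSize;

    val = warpReduceSum(val);              // reduce within each warp
    if (lane == 0)
        shared[wid] = val;                 // stage the per-warp results
    __syncthreads();

    // Read the staged values back into warp 0 and reduce once more.
    val = (threadIdx.x < blockDim.x / warpSize) ? shared[lane] : 0;
    if (wid == 0)
        val = warpReduceSum(val);
    return val;                            // thread 0 of the block holds the sum
}

For a sum that padding with 0 is harmless, which is exactly where my case differs.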
Where my problem differs from these and other examples I've seen on the web is that my reduction operator isn't a simple sum, and my "N" usually won't be a nice power of 2.  From my debugging efforts, it seems that when an active thread (one included in the mask provided by __ballot_sync()) tries to obtain a value from an inactive thread (one not included in the mask), it retrieves a "0".  A "0" is harmless for a sum reduction, but it clobbers a min reduction whose true result is nonzero.
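A stripped-down version of what I believe is happening looks something like this (a minimal sketch only; the kernel name and the values are mine, chosen to mimic my partial last warp):

__global__ void repro_min(uint32_t *d_out)
{
    // Only lanes 0-7 stay active, like the tail warp in my real kernel.
    if (threadIdx.x >= 8) return;

    uint32_t mask = 0x000000FF;            // the 8 active lanes
    uint32_t val  = threadIdx.x + 100;     // true min of the live data is 100

    // The source lane (threadIdx.x + 16) is not in the mask; per the docs the
    // value read is undefined, and what I appear to get back is "0".
    uint32_t tmp = __shfl_down_sync(mask, val, 16);

    // If tmp comes back as 0 (which is what I observe), the min is clobbered to 0.
    d_out[threadIdx.x] = (tmp < val) ? tmp : val;
}

Launched as something like repro_min<<<1, 32>>>(d_out), with d_out holding at least 8 values.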
Take the following code excerpt:
__device__ void warpReduceMin(uint32_t &val, uint32_t mask)
{
    for (int offset = 16; offset > 0; offset /= 2)
    {
        uint32_t tmp;
        tmp = __shfl_down_sync(mask, val, offset);
        val = (tmp < val) ? tmp : val;
    }
}
__global__ void my_kernel(uint32_t *d_data, uint32_t N)
{
    __shared__ uint32_t shmem[32];
    if (threadIdx.x >= N) return;
    uint32_t mask = __ballot_sync(0xFFFFFFFF, threadIdx.x < blockDim.x);
    uint32_t val = d_data[threadIdx.x];
    uint32_t warp_id = threadIdx.x / warpSize;
    uint32_t lane_id = threadIdx.x % warpSize;
    warpReduceMin(val, mask);
    // val is erroneously set to "0" for the active threads in last warp
    if (lane_id == 0)
        shmem[warp_id] = val;
    __syncthreads();
    val = shmem[lane_id];
    mask = __ballot_sync(0xFFFFFFFF, threadIdx.x < (blockDim.x+warpSize-1)/warpSize );
    if (warp_id == 0)
        warpReduceMin( val, mask );
    // do something with result...
}
If I call the kernel with a block size of 1024, and I have 1024 elements in my data (N=1024), I get the expected answer. But if I call the kernel with a block size of 1024 and N=1000, then printf debugging shows that in my last warp of incomplete data (warp_id == 31; elements 992:999), the initial offset of 16 is pulling a "0" from a thread which isn't even involved in the warp.
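For reference, the launch is essentially just the following (host-side allocation and error checking omitted; the single-block grid is my simplification, since the kernel only indexes by threadIdx.x):

my_kernel<<<1, 1024>>>(d_data, N);   // block size 1024; N=1024 gives the expected answer, N=1000 does not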
So I'm not quite sure where my error is.