CUDA shared memory more than 1 block stops working

Question

I am writing a program for school using shared memory. I got the function to work correctly, yet it only works with a single block per grid. If there is more than one block per grid then the function no longer works.

For example, if I send in an array of size 10, with 1 block and 10 threads, this function works correctly. If I send in the same array with 2 blocks and 5 threads per block, then it no longer works. When I say no longer works, I am getting the same array I sent in, back out, like nothing is happening.

__global__ void rankSortShared(int* a, int n)
{
    int threadId = threadIdx.x + blockDim.x * blockIdx.x;
    int x = 0;

    // Make a dynamically sized shared array.
    // Its size is determined by the 3rd parameter of the
    // kernel launch 'rankSortShared<<<blocksPerGrid, threadsPerBlock, *(size of shared mem)*>>>()'
    extern __shared__ int b[];

    // copy data from the incoming array to the shared array,
    // then sync the threads so all threads have put their
    // values into the shared array
    b[threadId] = a[threadId];
    __syncthreads();

    // now use shared array for faster lookups
    for (int j = 0; j < n; j++)
    {
        // handle duplicate values
        if (b[j] < b[threadId] || (b[threadId] == b[j] && j < threadId))
        {
            x++;
        }   
    }
    // put data back into array to be transferred back to CPU
    a[x] = b[threadId];
}

For the life of me I cannot figure out why. Is there a special way to use shared memory across multiple blocks?

Just to try and debug this, I replaced all the code with a[threadId] = threadId and the array was seemingly correct (printing the numbers 0 through 9). So then I do not understand why this wouldn't work if the threadIds seem to be correct, and it works with one block.

score 3 · Accepted Answer · edited May 23 '17 at 12:23

There is a problem here:

int threadId = threadIdx.x + blockDim.x * blockIdx.x;
b[threadId] = ...;

Your threadId variable is a globally unique thread index. That means the more threads you include in the grid (e.g. via more blocks), the higher this index will go.

But shared memory is allocated per block, and the index into it starts at zero in each block. So as you add more blocks, your threadId variable will eventually index past the end of the shared memory allocated to a block.

The usual solution is to do something like this:

b[threadIdx.x] = ...;

since the threadIdx.x variable starts at zero in each block. (You would replace every occurrence of b[threadId] with b[threadIdx.x] using this strategy, not just one occurrence.)
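A minimal sketch of what that could look like for this kernel, under some assumptions of mine that are not in the question. The loop reads b[j] for every j < n, so each block needs the whole array in its own shared memory, not just its own slice. Writing results into the same array that other blocks may still be reading would be a race, so the sketch writes to a separate output buffer. The name rankSortSharedMultiBlock, the in/out parameters and the test data are hypothetical, and error checking is left out for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical multi-block variant of the kernel in the question.
// Assumption: `in` and `out` are separate device buffers of n ints.
__global__ void rankSortSharedMultiBlock(const int* in, int* out, int n)
{
    // One copy per block, sized by the 3rd launch parameter (n ints here).
    extern __shared__ int b[];

    // Every block stages the whole input, using block-local indices
    // that start at zero (threadIdx.x), striding by the block size.
    for (int i = threadIdx.x; i < n; i += blockDim.x)
    {
        b[i] = in[i];
    }
    __syncthreads();

    // The global index only selects which element this thread ranks.
    int gid = threadIdx.x + blockDim.x * blockIdx.x;
    if (gid < n)
    {
        int mine = b[gid];
        int rank = 0;
        for (int j = 0; j < n; j++)
        {
            // ties are broken by position so duplicates get distinct ranks
            if (b[j] < mine || (b[j] == mine && j < gid))
            {
                rank++;
            }
        }
        out[rank] = mine;
    }
}

int main()
{
    const int n = 10;
    int h[n] = {7, 3, 9, 3, 1, 8, 2, 6, 5, 4};  // made-up test data

    int* d_in = nullptr;
    int* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, n * sizeof(int));
    cudaMemcpy(d_in, h, n * sizeof(int), cudaMemcpyHostToDevice);

    // 2 blocks of 5 threads; each block gets n ints of shared memory
    rankSortSharedMultiBlock<<<2, 5, n * sizeof(int)>>>(d_in, d_out, n);

    cudaMemcpy(h, d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; i++)
    {
        printf("%d ", h[i]);
    }
    printf("\n");

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

This only works while n * sizeof(int) fits in the shared memory available to one block. For larger inputs the array would have to be processed in tiles, which is beyond this sketch.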

There may well be other problems in your code. Since you haven't shown complete code, it's not possible to diagnose them.

And based on statements like this:

"When I say no longer works, I am getting the same array I sent in, back out, like nothing is happening."

my guess is that you are not doing proper CUDA error checking. You should really be doing that before asking for help from others. Even if you don't understand the error output, it will help others trying to help you.
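One common way to do this is to wrap every runtime API call in a checking macro, and to check twice after a kernel launch: once for launch errors and once after synchronizing for errors during execution. The sketch below is only an illustration. The macro name CUDA_CHECK and the fill kernel are hypothetical; the kernel just repeats the a[threadId] = threadId test from the question with 2 blocks of 5 threads.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with a readable message if a CUDA call fails.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Stand-in kernel: writes each thread's global index into the array.
__global__ void fill(int* a, int n)
{
    int threadId = threadIdx.x + blockDim.x * blockIdx.x;
    if (threadId < n)
    {
        a[threadId] = threadId;
    }
}

int main()
{
    const int n = 10;
    int h[n];
    int* d_a = nullptr;

    CUDA_CHECK(cudaMalloc(&d_a, n * sizeof(int)));

    fill<<<2, 5>>>(d_a, n);
    CUDA_CHECK(cudaGetLastError());       // errors from the launch itself
    CUDA_CHECK(cudaDeviceSynchronize());  // errors raised while the kernel ran

    CUDA_CHECK(cudaMemcpy(h, d_a, n * sizeof(int), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_a));

    for (int i = 0; i < n; i++)
    {
        printf("%d ", h[i]);
    }
    printf("\n");
    return 0;
}
```

An out-of-bounds shared memory access is not guaranteed to be reported this way. Running the program under compute-sanitizer is a more reliable way to catch it.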
