I am writing a program for school that uses shared memory. The function works correctly, but only with a single block per grid. With more than one block per grid, it no longer works.
For example, if I pass in an array of size 10 with 1 block and 10 threads, the function works correctly. If I pass in the same array with 2 blocks and 5 threads per block, it no longer works: I get back the same array I sent in, as if nothing happened.
__global__ void rankSortShared(int* a, int n)
{
    int threadId = threadIdx.x + blockDim.x * blockIdx.x;
    int x = 0;
    // Make a dynamically sized shared array.
    // Its size is determined by the 3rd parameter in the
    // kernel call 'rankSortShared<<<blocksPerGrid, threadsPerBlock, *(size of shared mem)*>>>()'
    extern __shared__ int b[];
    // copy data from the incoming array to the shared array,
    // then sync the threads so all threads have put their
    // values into the shared array
    b[threadId] = a[threadId];
    __syncthreads();
    // now use shared array for faster lookups
    for (int j = 0; j < n; j++)
    {
        // handle duplicate values
        if (b[j] < b[threadId] || (b[threadId] == b[j] && j < threadId))
        {
            x++;
        }   
    }
    // put data back into array to be transferred back to CPU
    a[x] = b[threadId];
}
For the life of me I cannot figure out why. Is there a special way to use shared memory across multiple blocks?
To debug this, I replaced all the code with a[threadId] = threadId, and the array looked correct (it printed the numbers 0 through 9). So I do not understand why the kernel fails when the threadIds seem to be correct and it works with one block.
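A likely cause is that shared memory is private to each block: a block cannot see values that another block stored in its own shared array, so indexing b with the global threadId only fills part of each block's copy. The sketch below is one possible way around that, not a definitive fix. It assumes a tiled approach in which every block loads the input one tile at a time into shared memory indexed by threadIdx.x, and it writes to a separate output array so that no block overwrites input another block still needs to read. The names rankSortTiled, rankSortOnGpu, in, out and tile are hypothetical, error checking is omitted for brevity, and blocks * threadsPerBlock is assumed to be at least n.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Rank sort, one element per thread. `in` and `out` must be different buffers.
__global__ void rankSortTiled(const int* in, int* out, int n)
{
    // Holds blockDim.x ints; every block has its own private copy.
    extern __shared__ int tile[];

    int gid = threadIdx.x + blockDim.x * blockIdx.x;  // global element index
    bool active = gid < n;                            // guard against extra threads
    int mine = active ? in[gid] : 0;                  // this thread's value, kept in a register
    int rank = 0;

    // Walk over the whole input one tile at a time.
    for (int base = 0; base < n; base += blockDim.x)
    {
        int src = base + threadIdx.x;
        if (src < n)
        {
            tile[threadIdx.x] = in[src];              // block-local index into shared memory
        }
        __syncthreads();                              // tile is fully loaded

        int count = min((int)blockDim.x, n - base);   // valid entries in this tile
        for (int k = 0; active && k < count; k++)
        {
            int j = base + k;                         // global index of tile[k]
            // Same duplicate handling as the original kernel.
            if (tile[k] < mine || (tile[k] == mine && j < gid))
            {
                rank++;
            }
        }
        __syncthreads();                              // finish reading before the next load
    }

    if (active)
    {
        out[rank] = mine;
    }
}

// Hypothetical host helper: copies the data over, launches the kernel, copies the result back.
std::vector<int> rankSortOnGpu(const std::vector<int>& host, int blocks, int threadsPerBlock)
{
    int n = static_cast<int>(host.size());
    std::vector<int> result(host.size());
    if (n == 0) return result;

    size_t bytes = n * sizeof(int);
    int* dIn = nullptr;
    int* dOut = nullptr;
    cudaMalloc(&dIn, bytes);
    cudaMalloc(&dOut, bytes);
    cudaMemcpy(dIn, host.data(), bytes, cudaMemcpyHostToDevice);

    // Third launch parameter: shared memory for one tile per block.
    rankSortTiled<<<blocks, threadsPerBlock, threadsPerBlock * sizeof(int)>>>(dIn, dOut, n);

    cudaMemcpy(result.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

With this sketch, a call such as rankSortOnGpu(values, 2, 5) on a 10-element vector uses the 2 blocks by 5 threads configuration from the question. The second __syncthreads() matters because a fast thread could otherwise overwrite the tile while slower threads in the same block are still comparing against it.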