I have a problem with a reduction-like operation in CUDA.
distance is a matrix with gridSize*numberOfAngles elements, fftData is a matrix with numberOfAngles*NFFT elements, and grid_magnitude is the result matrix with gridSize elements, where I want to store my result.
For each value in distance I want to compute the corresponding index into fftData. The value at that index in fftData is then added to grid_magnitude at the corresponding grid point.
This is my code so far:
__global__ void calcBackProjection(cuFloatComplex* fftData, 
                                   cuFloatComplex* grid_magnitude,
                                   float* distance,
                                   int gridSize,
                                   int numberOfAngles,
                                   float rangeBin,
                                   int NFFT)
{
 int gridPointIdx = threadIdx.x + blockIdx.x*blockDim.x;
 while(gridPointIdx < gridSize)
 {
    for(int angleIdx = 0; angleIdx < numberOfAngles; angleIdx++)
    {       
        //find rangeBin in fftData corresponding to distance
        float curDistance = distance[gridPointIdx + angleIdx*gridSize];
        int wantedIdx = floor(curDistance / rangeBin);
        if(wantedIdx < NFFT)
        {
            grid_magnitude[gridPointIdx + angleIdx*gridSize] =
                addCmplx(fftData[wantedIdx + angleIdx*NFFT],
                         grid_magnitude[gridPointIdx + angleIdx*gridSize]);
        }
    }
    gridPointIdx += blockDim.x * gridDim.x;     
 }   
}
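For reference, this is the sequential CPU version of what I intend to compute (a sketch; I'm assuming here that addCmplx is plain complex addition, and that grid_magnitude has one accumulator slot per grid point):

```cpp
#include <cmath>
#include <complex>
#include <vector>

// Sequential sketch of the intended computation: for every grid point,
// accumulate the fftData samples selected by the distance-to-range-bin
// mapping, summed over all angles.
void backProjectionCPU(const std::vector<std::complex<float>>& fftData,
                       std::vector<std::complex<float>>& grid_magnitude,
                       const std::vector<float>& distance,
                       int gridSize, int numberOfAngles,
                       float rangeBin, int NFFT)
{
    for (int g = 0; g < gridSize; ++g)
    {
        for (int a = 0; a < numberOfAngles; ++a)
        {
            // find the range bin in fftData corresponding to this distance
            float curDistance = distance[g + a*gridSize];
            int wantedIdx = static_cast<int>(std::floor(curDistance / rangeBin));
            if (wantedIdx >= 0 && wantedIdx < NFFT)
                grid_magnitude[g] += fftData[wantedIdx + a*NFFT];
        }
    }
}
```

This version produces the result I expect, so the data and the index computation themselves should be fine.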
gridPointIdx should be unique for every thread, so each thread should write to a different location in grid_magnitude. But this doesn't seem to work: grid_magnitude is left unchanged.
What am I missing?
I didn't manage to write this with fully parallel, two-dimensional indexing. Maybe I should use shared memory, but how do I partition grid_magnitude among the threads?