Cuda reduction in nested for loops

Question

I have a problem concerning some kind of reduction in CUDA.

distance is a matrix with gridSize*numberOfAngles elements, fftData is a matrix with numberOfAngles*NFFT elements. grid_magnitude is the result-matrix where I want to store my calculation result and it has gridSize elements.

I want to calculate the index in fftData that corresponds to one specific value in distance. After that, the value at this index in fftData is added to grid_magnitude at the corresponding gridPoint.

This is my code so far:

__global__ void calcBackProjection(cuFloatComplex* fftData, 
                                   cuFloatComplex* grid_magnitude,
                                   float* distance,
                                   int gridSize,
                                   int numberOfAngles,
                                   float rangeBin,
                                   int NFFT)
{
 int gridPointIdx = threadIdx.x + blockIdx.x*blockDim.x;

 while(gridPointIdx < gridSize)
 {
    for(int angleIdx = 0; angleIdx < numberOfAngles; angleIdx++)
    {       
        //find rangeBin in fftData corresponding to distance
        float curDistance = distance[gridPointIdx + angleIdx*gridSize];
        int wantedIdx = floor(curDistance / rangeBin);
        if(wantedIdx < NFFT)
            {                                   
                grid_magnitude[gridPointIdx + angleIdx*gridSize] =  
              addCmplx(fftData[wantedIdx + angleIdx*NFFT], grid_magnitude[gridPointIdx +     
                angleIdx*gridSize]);

            }                   
    }
    gridPointIdx += blockDim.x * gridDim.x;     
 }   
}

gridPointIdx should be unique for every thread and so each thread should be writing at a different location in grid_magnitude. But it seems that this is not working, because no change is made on grid_magnitude.

What am I missing?

I didn't manage to do this in full parallel 2-dimensional indexing, maybe I should use shared memory, but how do I part grid_magnitude to be used partly by the threads?

Are you sure the kernel is executed at all? Do you do a proper error checking on all of your CUDA calls? — RoBiK, Jul 12 '13 at 14:15
Yes, you are right, but this seems weird. After the execution of the kernel I did if ( cudaSuccess != cudaGetLastError() ){ cout << "ERROR " << cudaGetLastError() << endl;} But the result of cudaGetLastError() is 0, meaning cudaSuccess, meaning no Error ?! — Johannes, Jul 12 '13 at 17:49
kernel launches are asynchronous, an error during kernel execution will be reported later as a result of some other CUDA function call. — RoBiK, Jul 12 '13 at 21:28
so that's the reason why the last error is not cudaSuccess but is 0. But why is the kernel crashing? — Johannes, Jul 13 '13 at 08:36
0 means cudaSuccess. I am not sure that the kernel is crashing, i am just saying that it is a possibility. Can you provide a complete code example that reproduces the problem? — RoBiK, Jul 13 '13 at 18:07
Put a `cudaDeviceSynchronize();` after your kernel call, before the line that is using `cudaGetLastError()` to [check for errors](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api). That should conclusively tell you if the kernel completed successfully. Also, you might want to run your code with `cuda-memcheck` which will also report kernel errors and API errors. — Robert Crovella, Jul 13 '13 at 19:54

score 1 · Answer 1 · edited May 23 '17 at 10:25

1

I changed the code a litte.

__global__ void calcBackProjection(cuFloatComplex* fftData, cuFloatComplex* grid_magnitude,
float* distance, int gridSize, int numberOfAngles, float rangeBin, int NFFT){
int gridPointIdx = threadIdx.x + blockIdx.x*blockDim.x;
while(gridPointIdx < gridSize){
    for(int angleIdx = 0; angleIdx < numberOfAngles; angleIdx++){       
        float curDistance = distance[gridPointIdx + angleIdx*gridSize];
        int wantedIdx = ceil(curDistance / rangeBin) - 1;
        if(wantedIdx){
            int fftIdx = wantedIdx + angleIdx*NFFT;
            int gridIdx=  gridPointIdx + angleIdx*gridSize;
            if((fftIdx < NFFT*numberOfAngles) && (gridIdx < gridSize*numberOfAngles)){                  
                grid_magnitude[gridIdx] = cuCaddf(fftData[fftIdx], grid_magnitude[gridIdx]);
            }
        }
    }
    gridPointIdx += blockDim.x * gridDim.x;     }}

The problem seems to be, that the variables curDistance and wantedIdx are not evaluated by the compiler. When I want to know the values, it says "has no value at the target location". This seems to be the reason why there is an access violation detected at grid_magnitude[gridIdx] = cuCaddf(fftData[fftIdx], grid_magnitude[gridIdx]); I looked at some other answers about this problem, like here and here, but these have not helped me a lot.

edited May 23 '17 at 10:25

Community

1
1

answered Jul 16 '13 at 09:02

Johannes

21
2
6

I don't think your logic is sound. Perhaps what you mean is the variables `curDistance` and `wantedIdx` are *optimized out* by the compiler, which is essentially what the message "has no value at the target location" means. The scope of a variable in your source code may not match it's scope in compiled code, due to optimizations. This can lead to the message you are observing. But inferring that something is wrong based on that probably doesn't make sense. It would not lead to an "access violation". You seem to be skipping a lot of description about your analysis so far. – Robert Crovella Jul 18 '13 at 00:39
But why should the compiler optimize `curDistance` and `wantedIdx` away? Both are used to calculate an index of another array, so they are read to do something. I understand that when I set a variable that is not used, it gets optimized, but that should not be the case here. – Johannes Jul 18 '13 at 14:43
after the kernel-execution, the result of `cudaPeekError()` and `cudaGetLastError()` is both times "unknown error". – Johannes Jul 18 '13 at 14:45
It's not a question of whether they are used or not, it's a question of *scope*. The compiler may only bring the value into being (i.e. into a definite register) when it's actually needed in your code, and at other times use that register for something else. If you attempt to query that value when the compiler has already released the register to do something else, the variable in question is no longer in scope and can't be queried. So the variable availability will depend greatly on when and where you stop the code execution to access it. – Robert Crovella Jul 18 '13 at 21:43

Cuda reduction in nested for loops

1 Answers1