CUDA problems with atomicadd in nested loop on medium sized grids (>760 by 760)

Question

I am having an unknown error within my CUDA program and it seems to be related to the atomicadd function. I am coding on windows on Visual Studio 2015. My calling function is specified as the following

int regionWidth=32;
int regionHeight=32;
dim3 gridSize(765,765);
dim3 blockSize(regionWidth, regionHeight);

cudaMalloc((void **)&dev_count, sizeof(int));
count = 0;
cudaMemcpy(dev_count, &count, sizeof(int), cudaMemcpyHostToDevice);

crashFN << < gridSize, blockSize >> > (regionWidth, regionHeight,  dev_count);

cudaMemcpy(&count, dev_count, sizeof(int), cudaMemcpyDeviceToHost);

printf("total number of threads that executed was: %d vs. %d called -> %s\n", count, gridSize.x*gridSize.y*blockSize.x*blockSize.y, (count==gridSize.x*gridSize.y*blockSize.x*blockSize.y)?"ok":"error");

then my global kernel function is

 __global__ 
 void crashFN(int regionWidth, int regionHeight, int* ct)
 {
     __shared__ int shared_sum;

     shared_sum = 0;

     sumGlobal(regionWidth, regionHeight, &shared_sum);

     atomicAdd(ct, 1);
}

with sumGlobal defined as

 __device__
 void sumGlobal(int regionWidth, int regionHeight, int* global_sum)
 {
     // sum in nested loop
     for (int y = 0; y < regionHeight; y++)
         for (int x = 0; x < regionWidth; x++)
                atomicAdd(global_sum, 1);
 }

The build output from the program is the following

1>  H:\GPU\GPU_PROJECT_HZDR\targeterConsole>"C:\Program Files\NVIDIA GPU 
Computing Toolkit\CUDA\v8.0\bin\nvcc.exe" -
 gencode=arch=compute_50,code=\"sm_50,compute_50\" --use-local-env --cl-
 version 2015 -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 
 14.0\VC\bin\x86_amd64"  -I"C:\Program Files\NVIDIA GPU Computing 
 Toolkit\CUDA\v8.0\include" -I"C:\Program Files\NVIDIA GPU Computing 
 Toolkit\CUDA\v8.0\include"     --keep-dir x64\Release -maxrregcount=0  --
 machine 64 --compile -cudart static     -DWIN32 -DWIN64 -DNDEBUG -D_CONSOLE 
 -D_MBCS -Xcompiler "/EHsc /W3 /nologo /O2 /FS /Zi  /MD " -o 
 x64\Release\targetDetectionGPU.cu.obj 
 "H:\GPU\GPU_PROJECT_HZDR\targetDetectionGPU.cu"

it's a standard Nvidia CUDA console project, only changed the arch to sm_50,compute_50

my program's output is the following (with debug information)

sharedMemBytes=36864
regionWidth=32 regionHeight=32 coDIMX=16 coDIMY=16 coDIMZ=32
gridSize.x=765 gridSize.y=765 blockSize.x=32 blockSize.y=32
There is 1 device supporting CUDA

Device 0: "GeForce GTX 1050 Ti"
  CUDA Driver Version:                           9.0
  CUDA Runtime Version:                          8.0
  CUDA Capability Major revision number:         6
  CUDA Capability Minor revision number:         1
  Total amount of global memory:                 0 bytes
  Number of multiprocessors:                     6
  Number of cores:                               288
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     2147483647 x 65535 x 65535
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Clock rate:                                    1.39 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     Yes
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host             
  threads can use this device simultaneously)
  Concurrent kernel execution:                   Yes
  Device has ECC support enabled:                No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime 
Version = 8.0, NumDevs = 1, Device = GeForce GTX 1050 Ti
Requested resources: gridSize.x=765 gridSize.y=765 blockSize.x=32 
blockSize.y=32 sharedMemory=36 MB
total number of threads that executed was: 0 vs. 599270400 called -> error
file=H:/GPU/GPU_PROJECT_HZDR/targetDetectionGPU.cu line 558 CUDA Runtime API 
error (30): unknown error
file=H:/GPU/GPU_PROJECT_HZDR/targetDetectionGPU.cu line 573 CUDA Runtime API 
error (30): unknown error
finshed cuda algorithm

with smaller grid sizes, it seems to work better

so when I instead choose 764, 764 grid size I get

Requested resources: gridSize.x=764 gridSize.y=764 blockSize.x=32 
blockSize.y=32 sharedMemory=36 MB
total number of threads that executed was: 597704704 vs. 597704704 called -> 
ok
file=H:/GPU/GPU_PROJECT_HZDR/targetDetectionGPU.cu line 574 CUDA Runtime API 
error (30): unknown error

with 750 x 750 the error was gone, with 760x760 the error was back.

The device specifications allows much larger grid sizes than 765, or am I missing something here? Not sure why a simple atomicAdd in a nested loop should cause these errors, is it a bug?

Ok, simplified the kernel call now, removed the function call and combined the loops into 1 but still the error on larger grid sizes, if I comment out the loop it runs ok.

__global__ 
void crashFN(int regionWidth, int regionHeight, int* ct)
{
     __shared__ int shared_sum;

     shared_sum = 0;
     __syncthreads();

    for (int y = 0; y < regionHeight*regionWidth; y++)
           atomicAdd(&shared_sum, 1);

    __syncthreads();

    atomicAdd(ct, 1);
}

if I shorten the loop to

  for (int y = 0; y < regionHeight; y++)
          atomicAdd(&shared_sum, 1);

then it works ok, seems like a timeout issue, strange because I set the WDDM TDR timeout to 10 seconds with the NSight monitor.

1. use [proper cuda error checking](https://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) 2. run your code with `cuda-memcheck` 3. provide a [mcve]. According to SO expecations [here](https://stackoverflow.com/help/on-topic) (<-click here and read item 1), it is required for questions like this. This certainly does seem like a timeout issue, maybe 10 seconds is not enough. You do need to reboot after making the timeout change. — Robert Crovella, Sep 20 '17 at 14:46
I have debugged with cuda-memcheck the debugger stopped in the atomicadd function. I thought that the example was pretty minimal and verifiable (what's in your opinion is not minimal about it?) Yep, I rebooted, the program doesn't hang for anywhere near 10 seconds before crashing, more like 1 sec. — ejectamenta, Sep 20 '17 at 17:09
After assembling the bits and pieces you provided, and adding a few missing lines of code, I was able to build something out of what you have shown. If I build a debug project, the kernel takes well over 10 seconds to execute on a Pascal Titan X (i.e. a fast GPU). If you are getting an error after 1 second of execution of a debug project, you are hitting a timeout. However, when I run your code, I get no errors. So I'm reasonably confident that you are hitting a timeout, and your TDR is not modified correctly. — Robert Crovella, Sep 20 '17 at 18:56
Thanks Robert, I will remove the TDR setting tomorrow and see if it makes a difference, by the way, are you on Linux or Windows, it could be that it is a Windows specific thing, btw I tried Debug and Release, same issue. The Titan X is much higher spec, could it also be a memory/resource problem? I am sure that if I use global memory it may well run faster and avoid these shared memory issues https://devtalk.nvidia.com/default/topic/514085/atomicadd-in-shared-memory-is-measured-slower-than-in-global-memory-timing-shared-memory-atomic-o/ — ejectamenta, Sep 20 '17 at 20:05
This seems to be a duplicate of the behaviour reported on this webpage, maybe cuda 9 is the way to go for Windows? https://devtalk.nvidia.com/default/topic/1016409/computer-crash-due-atomicadd-why-/?offset=1 — ejectamenta, Sep 20 '17 at 20:07
Maybe it is a timeout issue then as the program does take a while to execute. I am not sure why this should be so, in my understanding all the threads work in parallel and the only work each thread must do is sum a value with itself 1024 times. There are 1024 threads in each block accessing the shared memory which need to be synchronised for the atomic add function each 1024 times the loop is called, maybe this is the hold up!? — ejectamenta, Sep 21 '17 at 11:36

score 1 · Answer 1 · answered Sep 21 '17 at 14:40

If you get a "error (30): unknown error" suspect a TDR timeout, especially on Windows. Basically my test program was taking to long in the loops and causing a timeout. This is particularly the case when you are debugging using printf statements!

The solution is to increase the timeout value by changing the TDR setting to more like 30 seconds or so, increasing this value is not a problem when you are not using the GPU card for the main display. When the TDR value is increased you can better see that it is your program is taking too long and not something else. Try to improve the code by removing loops, especially those containing atomic operations, or restructure it to use techniques like reduction.

http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf

CUDA problems with atomicadd in nested loop on medium sized grids (>760 by 760)

1 Answers1