I try to do a simple uni-directional communication between a CPU and a K80 GPU using CUDA. I want to have a bool cancel flag that resides in global memory and is polled by all running GPU/kernel threads. The flag should default to false and can be set by a CPU/host thread to true during ongoing computation. The GPU/kernel threads then should exit.
This is what I tried. I have simplified code. I removed error checking and application logic (including the application logic that prevents concurrent access to cancelRequested).
On the host side, global definition (.cpp):
// Host side thread safety of this pointer is covered by application logic
volatile bool* cancelRequested = nullptr;
On the host side in the compute thread (.cpp):
initialize(&cancelRequested);
compute(cancelRequested);
finalize(&cancelRequested);
On the host side in a main thread (.cpp):
cancel(cancelRequested); // Called after init is finished
Host routines (.cu file):
void initialize(volatile bool** pCancelRequested)
{
cudaMalloc(const_cast<bool**>(pCancelRequested), sizeof(bool));
const bool aFalse = false;
cudaMemcpy(*const_cast<bool**>(pCancelRequested), &aFalse, sizeof(bool), cudaMemcpyHostToDevice);
}
void compute(volatile bool* pCancelRequested)
{
....
computeKernel<<<pBlocksPerGPU, aThreadsPerBlock>>>(pCancelRequested);
cudaDeviceSynchronize(); // Non-busy wait
....
}
void finalize(volatile bool** pCancelRequested)
{
cudaFree(*const_cast<bool**>(pCancelRequested));
*pCancelRequested = nullptr;
}
void cancel(volatile bool* pCancelRequested)
{
const bool aTrue = true;
cudaMemcpy(const_cast<bool*>(pCancelRequested), &aTrue, sizeof(bool), cudaMemcpyHostToDevice);
}
Device routines (.cu file):
__global__ void computeKernel(volatile bool* pCancelRequested)
{
while (someCondition)
{
// Computation step here
if (*pCancelRequested)
{
printf("-> Cancel requested!\n");
return;
}
}
}
The code runs fine. But it does never enter the cancel case. I read back the false and true values in initialize() and cancel() successfully and checked them using gdb. I.e. writing to the global flag works fine, at least from host side view point. However the kernels never see the cancel flag set to true and exit normally from the outer while loop.
Any idea why this doesn't work?