I had an issue with a much larger kernel, but it seems to distil down to the following code, from which the kernel never returns. Can someone please explain why there is an infinite loop?
__global__ void infinite_while_kernel(void)
{
    int index = 0;
    while (index >= threadIdx.x) {
        index--;
    }
    return;
}
int main(void) {
    infinite_while_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}
The kernel below also gets stuck:
__global__ void not_infinite_while_kernel(void)
{
    int index = 0;
    while (index >= (unsigned int) 0u*threadIdx.x) {
        index--;
    }
    return;
}
If I replace threadIdx.x with 0 in the original kernel, it returns as expected. I'm using the CUDA v5.5 toolkit and compiling with the -arch=sm_20 -O0 flags, running on a Tesla M2090. I do not currently have access to any other hardware or toolkit versions (it's not my system).