I use the thread indices x and y to compute the cells of a matrix on the device.
When I set lenA and lenB to more than 32, the breakpoint (on int x = threadIdx.x; in the device code) is never hit and the output is incorrect.
Host code:
int lenA = 52;
int lenB = 52;
dim3 threadsPerBlock(lenA, lenB);
dim3 numBlocks(lenA / threadsPerBlock.x, lenB / threadsPerBlock.y);
kernel_matrix<<<numBlocks, threadsPerBlock>>>(dev_A, dev_B);
Device code:
int x = threadIdx.x;
int y = threadIdx.y;
...
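
For context, a 52 x 52 block asks for 2704 threads, while devices of compute capability 2.0 and later allow at most 1024 threads per block (32 x 32 is exactly that limit). A launch above the limit fails with an invalid-configuration error, so the kernel never runs and the breakpoint is never reached. Below is a minimal sketch of one possible way to cover the same 52 x 52 range with smaller blocks and to surface the launch error. It is not the original kernel: TILE, ceil_div, kernel_matrix_sketch and dev_out are hypothetical names chosen for illustration, and the kernel body is a placeholder.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical tile width: 16 x 16 = 256 threads per block, well under 1024.
constexpr int TILE = 16;

// Round-up integer division so partial tiles at the matrix edges are covered.
inline int ceil_div(int n, int d) { return (n + d - 1) / d; }

// Placeholder kernel (not the original kernel_matrix): writes x + y per cell.
__global__ void kernel_matrix_sketch(int *out, int lenA, int lenB)
{
    // Global coordinates combine the block index with the thread index.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // Threads in the last partial tiles fall outside the matrix; skip them.
    if (x < lenA && y < lenB)
        out[y * lenA + x] = x + y;
}

int main()
{
    const int lenA = 52;
    const int lenB = 52;

    int *dev_out = nullptr;
    if (cudaMalloc((void **)&dev_out, lenA * lenB * sizeof(int)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }

    dim3 threadsPerBlock(TILE, TILE);
    dim3 numBlocks(ceil_div(lenA, TILE), ceil_div(lenB, TILE));
    kernel_matrix_sketch<<<numBlocks, threadsPerBlock>>>(dev_out, lenA, lenB);

    // An oversized block would be reported here as an invalid configuration.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaDeviceSynchronize();  // also catches errors raised during execution
    if (err != cudaSuccess)
        printf("kernel failed: %s\n", cudaGetErrorString(err));

    cudaFree(dev_out);
    return err == cudaSuccess ? 0 : 1;
}
```

With this layout, x and y are computed from blockIdx, blockDim and threadIdx together, so lenA and lenB are no longer tied to the block size. Checking cudaGetLastError() right after the original launch should show whether the block size is what stops the kernel.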
 
    