Starting out with CUDA, about device code

Question

I am getting started with CUDA programming and have a question about writing kernels. Below is the code I was trying out. I want it to print the numbers 1-64 using 8 blocks of 8 threads each, to confirm that the program really runs 8 blocks of 8 threads.

The problem is that my output is a single, impossibly large value that is different every time.

#include <stdio.h>

__global__
void start(int *a){
        *a = blockIdx.x*threadIdx.x*blockDim.x;;
}

int main(){
        int a;
        int *d_a;
        int size = 64*sizeof(int);
        cudaMalloc((void**)&d_a,size);
        cudaMemcpy(d_a,&a,size, cudaMemcpyHostToDevice);
        start<<<8,8>>>(d_a);

        cudaMemcpy(&a,d_a,size,cudaMemcpyDeviceToHost);

        cudaFree(d_a);
        printf("%d\n",a);
        return 0;
}

EDIT: Alright, this is going to sound very dumb, but how do I check whether the code was actually sent to the GPU? I suspect the kernel code isn't being executed at all, maybe because the GPU is off or something. I am using PuTTY, so I don't have physical access to the machine.

"how do I check if the code was actually sent to the GPU"? A good starting point is to use [proper cuda error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) and run your code with `cuda-memcheck`. — Robert Crovella, Feb 29 '16 at 13:40
If I use `lspci -vnn` and I'm seeing "Capabilities: ", I presume I'll need to contact the administrator? — watisit, Feb 29 '16 at 14:28
I would go with the suggestions I already made before worrying about lspci. If the results of error checking and/or `cuda-memcheck` indicate a misconfigured machine, then that may be the point to see what lspci looks like and/or get the admins involved. Even if your machine is running correctly, `cuda-memcheck` may report API level errors due to the mismatched `cudaMemcpy` sizes indicated below in a comment to the answer. — Robert Crovella, Feb 29 '16 at 14:40
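As a rough illustration of the error checking suggested in the comments above, the sketch below wraps each runtime call in a macro and checks the kernel launch separately. `CUDA_CHECK`, `touch`, and `kernel_ran` are hypothetical names made up for this example, not part of the CUDA toolkit; the runtime calls themselves (`cudaGetDeviceCount`, `cudaGetLastError`, `cudaDeviceSynchronize`, `cudaGetErrorString`) are standard.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// CUDA_CHECK is a hypothetical helper name; it aborts with a readable
// message as soon as any runtime call reports an error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",           \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Trivial kernel: if the flag comes back as 1, the device ran it.
__global__ void touch(int *flag) {
    *flag = 1;
}

// Returns 1 if the kernel executed on the device, 0 otherwise.
int kernel_ran(void) {
    int count = 0;
    CUDA_CHECK(cudaGetDeviceCount(&count));   // fails if no usable GPU/driver
    printf("visible CUDA devices: %d\n", count);

    int h_flag = 0;
    int *d_flag = NULL;
    CUDA_CHECK(cudaMalloc((void **)&d_flag, sizeof(int)));
    CUDA_CHECK(cudaMemcpy(d_flag, &h_flag, sizeof(int),
                          cudaMemcpyHostToDevice));

    touch<<<1, 1>>>(d_flag);
    CUDA_CHECK(cudaGetLastError());           // errors from the launch itself
    CUDA_CHECK(cudaDeviceSynchronize());      // errors raised while running

    CUDA_CHECK(cudaMemcpy(&h_flag, d_flag, sizeof(int),
                          cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_flag));
    return h_flag;
}

int main(void) {
    printf("kernel ran: %s\n", kernel_ran() ? "yes" : "no");
    return 0;
}
```

If the machine is misconfigured, the first failing call should print its error string and location instead of silently leaving garbage in the output. The same binary can then be run under `cuda-memcheck`, as suggested above, to catch API-level errors such as mismatched copy sizes.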

score 1 · Accepted Answer · answered Feb 29 '16 at 07:53

Two problems, both in the same line of code:

*a = blockIdx.x*threadIdx.x*blockDim.x;;

1. All your threads are writing to the same location. Assuming you want an array containing 1-64, this is not what you want to do. You want each thread to write its own element, something like this (with id unique per thread):

a[id] = id;

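2. Your arithmetic is wrong. The product blockIdx.x*threadIdx.x*blockDim.x is 0 for every thread in block 0 and for thread 0 of every block, so it cannot number the threads uniquely. To map your blocks and threads onto consecutive values, use this instead (it yields 0-63; add 1 if you want 1-64):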

blockIdx.x*blockDim.x+threadIdx.x;

Putting everything together you can do this:

int id= blockIdx.x*blockDim.x+threadIdx.x;
a[id] = id;

Also, the host and device arrays must have the same size; otherwise the cudaMemcpy will be wrong. — Hopobcn, Feb 29 '16 at 09:51
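Putting the indexing fix and the size remark together, a complete program might look like the sketch below. It assumes the goal is the values 1-64, so each thread stores `id + 1` (the index expression itself runs from 0 to 63). The `N_BLOCKS`, `N_THREADS`, and `N` macros and the `fill` helper are hypothetical names introduced for this example; the key point is that the host array and the device allocation both hold 64 ints, so both sides of the `cudaMemcpy` match.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

#define N_BLOCKS  8
#define N_THREADS 8
#define N (N_BLOCKS * N_THREADS)   // 64 elements in total

__global__ void start(int *a) {
    // Unique index per thread across the whole grid: 0..63
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    a[id] = id + 1;                // assumption: store 1..64 rather than 0..63
}

// Hypothetical helper: runs the kernel and copies N ints back into h_a.
// h_a must point to at least N ints on the host.
cudaError_t fill(int *h_a) {
    int *d_a = NULL;
    size_t size = N * sizeof(int); // same byte count on host and device

    cudaError_t err = cudaMalloc((void **)&d_a, size);
    if (err != cudaSuccess) return err;

    start<<<N_BLOCKS, N_THREADS>>>(d_a);
    err = cudaGetLastError();      // catches launch failures
    if (err == cudaSuccess) {
        // A blocking copy on the default stream waits for the kernel to finish
        err = cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);
    }
    cudaFree(d_a);
    return err;
}

int main(void) {
    int a[N];                      // host array sized to match the device buffer
    cudaError_t err = fill(a);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < N; ++i) {
        printf("%d\n", a[i]);
    }
    return 0;
}
```

The initial host-to-device copy from the question is dropped here because the kernel overwrites every element anyway. In the original code that copy read 64 ints from a single `int a`, and the copy back wrote 64 ints into it, which is the size mismatch mentioned in the comment.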
