CUDA kernel to compute squares of integers in an array

Question

I am learning some basic CUDA programming. I am trying to initialize an array on the Host with host_a[i] = i. This array consists of N = 128 integers. I am launching a kernel with 1 block and 128 threads per block, in which I want to square the integer at index i.

My questions are:

How do I come to know whether the kernel gets launched or not? Can I use printf within the kernel?
The expected output of my program is a space-separated list of the squares of the integers:

1 4 9 16 ...

What's wrong with my code? It outputs 1 2 3 4 5 ... instead.

Code:

#include <stdio.h>         // printf
#include <stdlib.h>
#include <cuda_runtime.h>  // CUDA runtime API (cudaMalloc, cudaMemcpy)

const int N = 128;

__global__ void f(int *dev_a) {
    unsigned int tid = threadIdx.x;

    if(tid < N) {
        dev_a[tid] = tid * tid;
    }
}

int main(void) {

    int host_a[N];
    int *dev_a;
    cudaMalloc((void**)&dev_a, N * sizeof(int));
    for(int i = 0 ; i < N ; i++) {
        host_a[i] = i;
    }
    cudaMemcpy(dev_a, host_a, N * sizeof(int), cudaMemcpyHostToDevice);
    f<<<1, N>>>(dev_a);

    cudaMemcpy(host_a, dev_a, N * sizeof(int), cudaMemcpyDeviceToHost);

    for(int i = 0 ; i < N ; i++) {
        printf("%d ", host_a[i]);
    }
    printf("\n");
    cudaFree(dev_a);  // release the device buffer
    return 0;
}

There's nothing wrong with your code except that it lacks [proper cuda error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api). When I run your code it prints out 0 1 4 9 16 25 ... So I believe you have a problem with your CUDA set up, and proper cuda error checking would help identify the problem. Also you can run your code with `cuda-memcheck` which may give you an idea of what the problem is. You can use `printf` in a kernel if you are using a recent GPU supported by CUDA 7 or CUDA 7.5. — Robert Crovella, Dec 06 '15 at 05:07
Ok, thanks! My Nsight now says no CUDA compatible device found, although I can see CUDA in System Preferences on Mac OS X 10.10.2. Anyway, that's another issue. Thanks for pointing out the error checking part. — xennygrimmato, Dec 06 '15 at 06:49

score 1 · Accepted Answer · edited May 23 '17 at 11:44

How do I come to know whether the kernel gets launched or not? Can I use printf within the kernel?

You can use `printf` in device code (as long as you `#include <stdio.h>`) on any GPU of compute capability 2.0 or higher. Since CUDA 7 and CUDA 7.5 support only GPUs of that kind, if you are (successfully) using CUDA 7 or CUDA 7.5 then you can use `printf` in device code.
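As a minimal sketch of device-side `printf` (the kernel name `hello_kernel` and the launch size of 4 threads are assumptions chosen for illustration), each thread below prints its own index. Device-side output is buffered, so the host synchronizes before exiting to make sure it is flushed:

```cuda
#include <stdio.h>

// Hypothetical kernel: every thread reports its index from device code.
__global__ void hello_kernel(void) {
    printf("hello from thread %u\n", threadIdx.x);
}

int main(void) {
    hello_kernel<<<1, 4>>>();

    // A failed launch (for example, no usable device) is reported here.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Wait for the kernel to finish; this also flushes the device printf buffer.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

If no lines appear and no error is printed, the usual suspect is a missing synchronization before the program exits.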

What's wrong with my code?

As identified in the comments, there is nothing "wrong" with your code when it is run on a properly set up machine. As for your earlier question, "How do I come to know whether the kernel gets launched or not?", the best approach in my opinion is to use [proper CUDA error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api), which has numerous benefits beyond telling you whether your kernel launched. In this case it would also have given a clue that the failure was an improper CUDA setup on your machine. You can also run CUDA codes under `cuda-memcheck` as a quick test of whether any runtime errors are occurring.
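One way to add such checking to the program from the question is sketched below. The macro name `CUDA_CHECK` is an assumption made for this example (it is not part of the CUDA API); it wraps each runtime call and aborts with the error string on failure. The kernel launch itself returns nothing, so it is followed by `cudaGetLastError()` for launch errors and `cudaDeviceSynchronize()` for errors that occur while the kernel runs:

```cuda
#include <stdio.h>
#include <stdlib.h>

// Hypothetical helper macro (the name is an assumption, not a CUDA API):
// evaluates a runtime call and exits with a message if it did not succeed.
#define CUDA_CHECK(call)                                             \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",             \
                    __FILE__, __LINE__, cudaGetErrorString(err_));   \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

const int N = 128;

// Same kernel as in the question: thread i writes i * i.
__global__ void f(int *dev_a) {
    unsigned int tid = threadIdx.x;
    if (tid < N) {
        dev_a[tid] = tid * tid;
    }
}

int main(void) {
    int host_a[N];
    int *dev_a = NULL;
    for (int i = 0; i < N; i++) {
        host_a[i] = i;
    }

    CUDA_CHECK(cudaMalloc((void **)&dev_a, N * sizeof(int)));
    CUDA_CHECK(cudaMemcpy(dev_a, host_a, N * sizeof(int), cudaMemcpyHostToDevice));

    f<<<1, N>>>(dev_a);
    CUDA_CHECK(cudaGetLastError());       // did the launch itself fail?
    CUDA_CHECK(cudaDeviceSynchronize());  // did the kernel fail while running?

    CUDA_CHECK(cudaMemcpy(host_a, dev_a, N * sizeof(int), cudaMemcpyDeviceToHost));
    for (int i = 0; i < N; i++) {
        printf("%d ", host_a[i]);
    }
    printf("\n");

    CUDA_CHECK(cudaFree(dev_a));
    return 0;
}
```

On a machine with a broken setup, the first checked call would be expected to fail with a descriptive message instead of the program silently printing the unmodified host array.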

Adding error checking in my code helped me identify the error. My CUDA SDK version was not supported by the installed CUDA runtime version. Downgrading to CUDA 6.5 made the program run for me. — xennygrimmato, Dec 07 '15 at 10:41
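A version mismatch of this kind can also be inspected directly. The sketch below assumes the mismatch is between the CUDA runtime the program was built against and the newest CUDA version the installed driver supports; it queries both with `cudaDriverGetVersion` and `cudaRuntimeGetVersion`, which report versions encoded as 1000 * major + 10 * minor:

```cuda
#include <stdio.h>

int main(void) {
    int driver_version = 0;   // newest CUDA version the installed driver supports
    int runtime_version = 0;  // CUDA runtime version this program was built with

    cudaError_t err = cudaDriverGetVersion(&driver_version);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaDriverGetVersion: %s\n", cudaGetErrorString(err));
        return 1;
    }
    err = cudaRuntimeGetVersion(&runtime_version);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaRuntimeGetVersion: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Versions are encoded as 1000 * major + 10 * minor.
    printf("driver supports CUDA %d.%d, runtime is CUDA %d.%d\n",
           driver_version / 1000, (driver_version % 1000) / 10,
           runtime_version / 1000, (runtime_version % 1000) / 10);

    if (driver_version < runtime_version) {
        fprintf(stderr, "the installed driver is older than the CUDA runtime\n");
        return 1;
    }
    return 0;
}
```

A driver version of 0 means no CUDA driver is installed at all.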
