Why is my CUDA C++ VectorAdd code not working?

Question

I am trying to add two arrays using CUDA, but it doesn't work.

I did everything that should be done:

1) I parallelized the VectorAdd function

2) I allocated memory on the GPU and moved the data to the GPU

3) Finally, I modified the VectorAdd function to run on the GPU

This is the code:

#include <stdio.h>   // printf
#include <stdlib.h>  // malloc, free

#define SIZE 1024

__global__ void VectorAdd(int *a, int *b, int *c, int n)
{
    int i = threadIdx.x ;

    if(i < n)
        c[i] = a[i] + b[i];
}

int main()
{
    int *a , *b , *c;
    int *d_a , *d_b , *d_c;

    a = (int *)malloc(SIZE * sizeof(int));
    b = (int *)malloc(SIZE * sizeof(int));
    c = (int *)malloc(SIZE * sizeof(int));

    cudaMalloc( &d_a , SIZE * sizeof(int) );
    cudaMalloc( &d_b , SIZE * sizeof(int) );
    cudaMalloc( &d_c , SIZE * sizeof(int) );

    for ( int i = 0 ; i < SIZE ; ++i)
    {
        a[i] = i ;
        b[i] = i ;
        c[i] = 0 ;
    }

    cudaMemcpy(d_a, a, SIZE *sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, SIZE *sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_c, c, SIZE *sizeof(int), cudaMemcpyHostToDevice);

    VectorAdd<<< 1, SIZE >>>(d_a, d_b, d_c, SIZE);

    cudaMemcpy(c, d_c, SIZE * sizeof(int), cudaMemcpyDeviceToHost);

    for(int i = 0 ; i < 10 ; ++i)
    {
        printf("C[%d] =  %d\n", i, c[i]);
    }

    free(a);
    free(b);
    free(c);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    return 0;
}

The output on the console is: C[0] = 0, C[1] = 0, C[2] = 0, C[3] = 0, C[4] = 0, ...

Why is that? It should be: C[0] = 0, C[1] = 2, C[2] = 4, ...

If you add a suitable `cudaGetLastError()` after each CUDA runtime call, what do you get? — Angew is no longer proud of SO, Jan 20 '14 at 12:57
What GPU do you use? As expected, your code works fine on my system. On cards of compute capability 1.x the maximum number of threads per block is 512. — hubs, Jan 20 '14 at 13:11
Works fine on my machine too. Check the compute capability just like @hubs says. — Tyler Jandreau, Jan 20 '14 at 13:16
Yep guys, you are right, I forgot that my GPU card is old. So the problem for me was SIZE, which was 1024; I just set it to 512 and it works. THANKS!!! — Andrei Tranca, Jan 20 '14 at 13:40
Somebody please post an answer so we can get this off the unanswered list. In the future, it's a good idea to do [proper cuda error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) any time you are having trouble with CUDA code. I would suggest doing that *before* posting here to ask for help. — Robert Crovella, Jan 20 '14 at 14:11
@AndreiTranca Of course you should check all possible errors (at least when hunting down the problem). Asking on SO should really only come after the "I've tried all I can think of" phase, and the eventual SO question should include all the information you've discovered in your earlier attempts at solving it. — Angew is no longer proud of SO, Jan 20 '14 at 14:27

score 2 · Accepted Answer · edited May 23 '17 at 12:21

In your case the problem depends on the GPU you are using. Your kernel is launched with 1024 threads per block, but a GPU of compute capability 1.x supports at most 512 threads per block. The launch therefore fails, the kernel never runs, and c keeps the zeros it was initialized with. A detailed list of the limits can be found in the official programming guide. Because you didn't use proper CUDA error checking, you could not see the error returned by the CUDA runtime API. A good guide to CUDA error checking is given by @talonmies in this SO answer/question.
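To make both points concrete, here is a minimal sketch of the same program with error checking and a launch configuration that stays within the 512-thread limit. The names `CUDA_CHECK` and `blocksFor` are hypothetical helpers introduced only for this illustration (they are not part of the CUDA API or the original post), and the choice of 256 threads per block is an assumption; any value up to the device's `maxThreadsPerBlock` should work.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (not part of the CUDA API): abort with a
// readable message whenever a runtime call returns an error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error at %s:%d: %s\n", __FILE__,    \
                    __LINE__, cudaGetErrorString(err_));              \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical helper: number of blocks needed to cover n elements.
static int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

__global__ void VectorAdd(const int *a, const int *b, int *c, int n)
{
    // Global index, so the kernel still works with more than one block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main()
{
    const int n = 1024;
    const size_t bytes = n * sizeof(int);

    int *a = (int *)malloc(bytes);
    int *b = (int *)malloc(bytes);
    int *c = (int *)malloc(bytes);
    if (!a || !b || !c) {
        fprintf(stderr, "host allocation failed\n");
        return EXIT_FAILURE;
    }
    for (int i = 0; i < n; ++i) { a[i] = i; b[i] = i; c[i] = 0; }

    // Report the per-block limit of device 0 (512 on compute capability 1.x).
    cudaDeviceProp prop;
    CUDA_CHECK(cudaGetDeviceProperties(&prop, 0));
    printf("maxThreadsPerBlock = %d\n", prop.maxThreadsPerBlock);

    int *d_a, *d_b, *d_c;
    CUDA_CHECK(cudaMalloc(&d_a, bytes));
    CUDA_CHECK(cudaMalloc(&d_b, bytes));
    CUDA_CHECK(cudaMalloc(&d_c, bytes));
    CUDA_CHECK(cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice));

    const int threads = 256;  // assumed block size, below the 1.x limit
    VectorAdd<<<blocksFor(n, threads), threads>>>(d_a, d_b, d_c, n);
    CUDA_CHECK(cudaGetLastError());       // catches a bad launch configuration
    CUDA_CHECK(cudaDeviceSynchronize());  // catches errors during execution

    CUDA_CHECK(cudaMemcpy(c, d_c, bytes, cudaMemcpyDeviceToHost));
    for (int i = 0; i < 10; ++i)
        printf("C[%d] = %d\n", i, c[i]);

    CUDA_CHECK(cudaFree(d_a));
    CUDA_CHECK(cudaFree(d_b));
    CUDA_CHECK(cudaFree(d_c));
    free(a);
    free(b);
    free(c);
    return 0;
}
```

With a check like this in place, the original `<<< 1, SIZE >>>` launch on a compute capability 1.x card should stop at the `cudaGetLastError()` line with a launch-configuration error instead of silently printing zeros.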
