I'm getting started with CUDA and have run into a problem. The code below is basically the simplest example from the NVIDIA website, with some memory copies and a print statement added to check that it runs correctly.
The code compiles and runs without complaint, but when I print the vector c it comes out as all zeros, as if the GPU kernel function isn't being called at all.
This is almost exactly the same situation as in the post "Basic CUDA - getting kernels to run on the device using C++".
The symptoms are the same, although I don't seem to be making the error described there. Any ideas?
#include <stdio.h>
static const unsigned short N = 3;
// Kernel definition
__global__ void VecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
} 
int main()
{
  float *A, *B, *C;
  float a[N] = {1,2,3}, b[N] = {4,5,6}, c[N] = {0,0,0};
  cudaMalloc( (void **)&A, sizeof(float)*N );
  cudaMalloc( (void **)&B, sizeof(float)*N );
  cudaMalloc( (void **)&C, sizeof(float)*N );
  cudaMemcpy( A, a, sizeof(float)*N, cudaMemcpyHostToDevice );
  cudaMemcpy( B, b, sizeof(float)*N, cudaMemcpyHostToDevice );
  VecAdd<<<1, N>>>(A, B, C);
  cudaMemcpy( c, C, sizeof(float)*N, cudaMemcpyHostToDevice );
  printf("%f %f %f\n", c[0],c[1],c[2]);
  cudaFree(A);
  cudaFree(B);
  cudaFree(C);
  return 0;
}
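One way to narrow this down is to check the return value of every CUDA call, since the program above ignores all of them. The sketch below is the same program with a hypothetical CUDA_CHECK macro (my own helper, not part of the CUDA API) wrapped around each runtime call, plus cudaGetLastError and cudaDeviceSynchronize after the launch to surface launch and execution errors. It also assumes the final copy is meant to go from the device buffer C to the host array c, so it uses cudaMemcpyDeviceToHost there; the listing above passes cudaMemcpyHostToDevice for that copy, which would likely fail silently and leave c at zeros.

```cuda
#include <stdio.h>
#include <stdlib.h>

static const unsigned short N = 3;

// Hypothetical helper (not part of the CUDA API): print the error and exit
// if a CUDA runtime call does not return cudaSuccess.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,      \
                    cudaGetErrorString(err_));                      \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Kernel definition: one thread per element.
__global__ void VecAdd(const float* A, const float* B, float* C)
{
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    float *A, *B, *C;                                   // device buffers
    float a[N] = {1, 2, 3}, b[N] = {4, 5, 6}, c[N] = {0, 0, 0};  // host arrays
    const size_t bytes = sizeof(float) * N;

    CUDA_CHECK(cudaMalloc((void **)&A, bytes));
    CUDA_CHECK(cudaMalloc((void **)&B, bytes));
    CUDA_CHECK(cudaMalloc((void **)&C, bytes));

    // Inputs go from host to device.
    CUDA_CHECK(cudaMemcpy(A, a, bytes, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(B, b, bytes, cudaMemcpyHostToDevice));

    VecAdd<<<1, N>>>(A, B, C);
    CUDA_CHECK(cudaGetLastError());        // reports launch failures
    CUDA_CHECK(cudaDeviceSynchronize());   // reports failures during execution

    // Assumption: the result is meant to come back from device to host.
    CUDA_CHECK(cudaMemcpy(c, C, bytes, cudaMemcpyDeviceToHost));

    printf("%f %f %f\n", c[0], c[1], c[2]);

    CUDA_CHECK(cudaFree(A));
    CUDA_CHECK(cudaFree(B));
    CUDA_CHECK(cudaFree(C));
    return 0;
}
```

Built with nvcc on a machine with a working GPU and driver, this should either print the three sums or stop at the first failing call with the file, line, and CUDA error string, which at least shows whether the kernel launch or one of the copies is the step that goes wrong.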