I'd like to pass a 3D array src, with size elements along each dimension and flattened into a 1D array of length = size * size * size elements, to a kernel, compute a result, and store it in dst. However, at the end, dst incorrectly contains all 0s. Here is my code:
int size = 256;
int length = size * size * size;
int bytes = length * sizeof(float);
// Allocate source and destination arrays on the host and initialize source array
float *src, *dst;
cudaMallocHost(&src, bytes);
cudaMallocHost(&dst, bytes);
for (int i = 0; i < length; i++) {
    src[i] = i;
}
// Allocate source and destination arrays on the device
struct cudaPitchedPtr srcGPU, dstGPU;
struct cudaExtent extent = make_cudaExtent(size*sizeof(float), size, size);
cudaMalloc3D(&srcGPU, extent);
cudaMalloc3D(&dstGPU, extent);
// Copy to the device, execute kernel, and copy back to the host
cudaMemcpy(srcGPU.ptr, src, bytes, cudaMemcpyHostToDevice);
myKernel<<<numBlocks, blockSize>>>((float *)srcGPU.ptr, (float *)dstGPU.ptr);
cudaMemcpy(dst, dstGPU.ptr, bytes, cudaMemcpyDeviceToHost);
I've left out my error checking of cudaMallocHost(), cudaMalloc3D() and cudaMemcpy() for clarity. None of these calls reports an error.
What is the correct use of cudaMalloc3D() with cudaMemcpy()?
Please let me know if I should post a minimal test case for the kernel as well, or if the problem can be found in the code above.
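For reference, here is a minimal host-only sketch of the memory layout this question turns on. It assumes that cudaMalloc3D() may pad each row to a pitch larger than size * sizeof(float), in which case a single flat cudaMemcpy() of bytes would not line up with the rows on the device. No CUDA call is made here. The helper names (pitchedOffset, packToPitched, readPitched) and any pitch value passed in are hypothetical and chosen only for illustration. Whether the pitch actually differs from the row width for the allocation above can only be seen by inspecting srcGPU.pitch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Byte offset of element (x, y, z) in a pitched 3D allocation.
// pitch is the (possibly padded) row size in bytes; height is the
// number of rows per slice. Hypothetical helper, not a CUDA API.
std::size_t pitchedOffset(std::size_t x, std::size_t y, std::size_t z,
                          std::size_t pitch, std::size_t height) {
    return (z * height + y) * pitch + x * sizeof(float);
}

// Copy a tightly packed width x height x depth volume into a pitched
// buffer one row at a time, leaving the padding bytes untouched.
std::vector<unsigned char> packToPitched(const std::vector<float>& packed,
                                         std::size_t width, std::size_t height,
                                         std::size_t depth, std::size_t pitch) {
    const std::size_t rowBytes = width * sizeof(float);
    assert(pitch >= rowBytes);
    assert(packed.size() == width * height * depth);

    std::vector<unsigned char> pitched(pitch * height * depth, 0);
    for (std::size_t z = 0; z < depth; ++z) {
        for (std::size_t y = 0; y < height; ++y) {
            const float* srcRow = packed.data() + (z * height + y) * width;
            unsigned char* dstRow =
                pitched.data() + pitchedOffset(0, y, z, pitch, height);
            std::memcpy(dstRow, srcRow, rowBytes);
        }
    }
    return pitched;
}

// Read element (x, y, z) back out of the pitched buffer.
float readPitched(const std::vector<unsigned char>& pitched,
                  std::size_t x, std::size_t y, std::size_t z,
                  std::size_t pitch, std::size_t height) {
    float value;
    std::memcpy(&value,
                pitched.data() + pitchedOffset(x, y, z, pitch, height),
                sizeof(float));
    return value;
}
```

If the pitch equals the row width in bytes, the pitched and packed layouts coincide and a flat copy addresses the same elements. If it is larger, the copy and the kernel indexing both have to account for the pitch, for example through the same offset arithmetic as pitchedOffset().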