Proper use of cudaMalloc3D with cudaMemcpy

Question

I'd like to send a 3D array src of size size in each dimension, flattened into a 1D array of size length = size * size * size, into a kernel, compute a result and store it in dst. However, at the end, dst improperly contains all 0s. Here is my code:

int size = 256;
int length = size * size * size;
int bytes = length * sizeof(float);

// Allocate source and destination arrays on the host and initialize source array

float *src, *dst;
cudaMallocHost(&src, bytes);
cudaMallocHost(&dst, bytes);
for (int i = 0; i < length; i++) {
    src[i] = i;
}

// Allocate source and destination arrays on the device

struct cudaPitchedPtr srcGPU, dstGPU;
struct cudaExtent extent = make_cudaExtent(size*sizeof(float), size, size);
cudaMalloc3D(&srcGPU, extent);
cudaMalloc3D(&dstGPU, extent);

// Copy to the device, execute kernel, and copy back to the host

cudaMemcpy(srcGPU.ptr, src, bytes, cudaMemcpyHostToDevice);
myKernel<<<numBlocks, blockSize>>>((float *)srcGPU.ptr, (float *)dstGPU.ptr);
cudaMemcpy(dst, dstGPU.ptr, bytes, cudaMemcpyDeviceToHost);

I've left out my error checking of cudaMallocHost(), cudaMalloc() and cudaMemcpy() for clarity. No error is triggered by this code in any case.

What is the correct use of cudaMalloc3D() with cudaMemcpy()?

Please let me know if I should post a minimal test case for the kernel as well, or if the problem can be found in the code above.

You might be interested in [this question/answer](http://stackoverflow.com/questions/16119943/how-and-when-should-i-use-pitched-pointer-with-the-cuda-api) — Robert Crovella, May 15 '13 at 22:12
Thanks, I already stumbled upon that and it's incredibly helpful. — 1'', May 15 '13 at 23:36
A full working example can be now found in the answer to [Copying from cuda 3D memory to linear memory: copied data is not where I expected](http://stackoverflow.com/questions/16107480/copying-from-cuda-3d-memory-to-linear-memory-copied-data-is-not-where-i-expecte/23052768#23052768). — Vitality, Apr 14 '14 at 05:46

score 3 · Accepted Answer · edited May 23 '17 at 11:49

3

EDIT: the extent takes the number of elements if using a CUDA array, but effectively takes the number of bytes if not using a CUDA array (e.g. memory allocated with some non-array variant of cudaMalloc)

From the Runtime API CUDA documentation:

The extent field defines the dimensions of the transferred area in elements. If a CUDA array is participating in the copy, the extent is defined in terms of that array's elements. If no CUDA array is participating in the copy then the extents are defined in elements of unsigned char

Also, cudaMalloc3D returns a pitched pointer, meaning that it'll have at least the dimensions of your supplied extent, but possibly more for alignment reasons. You have to take this pitch into account when accessing and copying to and from your device memory. See here for the documentation on the cudaPitchedPtr struct

As for using cudaMalloc3D with cudaMemcpy, you might want to take a look at using cudaMemcpy3D (documentation here), it might make your life a bit easier in taking the pitch of your host and device memory into account. To use cudaMemcpy3D you have to create a cudaMemcpy3DParms struct with the appropriate information. It's members are:

cudaArray_t dstArray
struct cudaPos dstPos
struct cudaPitchedPtr dstPtr
struct cudaExtent extent
enumcudaMemcpyKind kind
cudaArray_t srcArray
struct cudaPos srcPos
struct cudaPitchedPtr srcPtr

and you must specify one of srcArray or srcPtr and one of dstArray or dstPtr. Also the docs recommend to initialize the struct to 0 before using it, e.g. cudaMemcpy3DParms myParms = {0};

Also, you might be interested in taking a look at this other SO question

edited May 23 '17 at 11:49

Community

1
1

answered May 15 '13 at 18:14

alrikai

4,123
3
24
23

I can use srcGPU as the dstPtr, but what should I use as the srcArray or srcPtr? I'm copying from float *src, which is neither a cuda array nor a cuda pitched pointer. – 1'' May 15 '13 at 19:29
1

@1'' I would try making a `cudaPitchedPtr` to your `src` pointer, with the stride being the same size as your width – alrikai May 15 '13 at 19:42
Good idea, I'll try that. However, I didn't error check the kernel itself, and it gives the error "Invalid argument" with the current code. Why can't I pass srcPtr.ptr and dstPtr.ptr to a kernel expecting a float *? – 1'' May 15 '13 at 20:35
It is certainly valid to pass the raw pointers themselves (although you'll also want to pass their stride in as well). Is it possible that your launch parameters `numBlocks, blockSize` aren't valid? – alrikai May 15 '13 at 20:47
Yes, that was the problem. Good catch! – 1'' May 15 '13 at 23:37
I accepted your answer, but I have a correction regarding `make_cudaExtent()`: in the SO question you linked to, they say: "If a CUDA array is participating in the copy, the extent is defined in terms of that array's elements. If no CUDA array is participating in the copy then the extents are defined in elements of unsigned char." Here, I'm using a primitive array rather than a CUDA array, so the sizeof(float) is necessary. – 1'' May 15 '13 at 23:39
1

Ah good catch on that one, I hadn't read the whole paragraph. I'll edit my answer to include the whole quote for anyone who comes on this question in the future – alrikai May 15 '13 at 23:41

Proper use of cudaMalloc3D with cudaMemcpy

1 Answers1