I've just started trying to learn CUDA again and came across some code I don't fully understand.
// declare GPU memory pointers
float * d_in;
float * d_out;
// allocate GPU memory
cudaMalloc((void**) &d_in, ARRAY_BYTES);
cudaMalloc((void**) &d_out, ARRAY_BYTES);
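For context, the snippet comes from the usual allocate / copy / launch / copy-back pattern, which (as far as I understand it) looks roughly like this; the square kernel, the array size, and the host buffers are placeholders I've filled in myself:
#include <cuda_runtime.h>

// hypothetical kernel, just to have something that uses the buffers
__global__ void square(float * d_out, float * d_in) {
    int idx = threadIdx.x;
    float f = d_in[idx];
    d_out[idx] = f * f;
}

int main() {
    const int ARRAY_SIZE = 64;
    const int ARRAY_BYTES = ARRAY_SIZE * sizeof(float);

    // host buffers
    float h_in[ARRAY_SIZE];
    float h_out[ARRAY_SIZE];
    for (int i = 0; i < ARRAY_SIZE; i++) { h_in[i] = float(i); }

    // declare GPU memory pointers
    float * d_in;
    float * d_out;

    // allocate GPU memory
    cudaMalloc((void**) &d_in, ARRAY_BYTES);
    cudaMalloc((void**) &d_out, ARRAY_BYTES);

    // the host-held pointer values are what get passed to later calls
    cudaMemcpy(d_in, h_in, ARRAY_BYTES, cudaMemcpyHostToDevice);
    square<<<1, ARRAY_SIZE>>>(d_out, d_in);
    cudaMemcpy(h_out, d_out, ARRAY_BYTES, cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}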
When the GPU memory pointers are declared, storage for them is allocated on the host. The (void**) casts in the cudaMalloc calls then throw away the information that d_in and d_out are pointers to float.
I can't think why cudaMalloc would need to know where in host memory d_in & d_out are stored. It's also not clear why I need to spend host bytes at all just to hold whatever address d_in & d_out end up pointing to.
So, what is the purpose of the original variable declarations on the host?
======================================================================
I would've thought something like this would make more sense:
// declare GPU memory pointers
cudaFloat * d_in;
cudaFloat * d_out;
// allocate GPU memory
cudaMalloc((void**) &d_in, ARRAY_BYTES);
cudaMalloc((void**) &d_out, ARRAY_BYTES);
This way, everything GPU-related takes place on the GPU. If d_in or d_out were accidentally used in host code, an error could be raised at compile time, since those variables wouldn't be defined on the host.
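For example, as far as I can tell, host code like the following compiles without any complaint, because d_in is just an ordinary float * as far as the host compiler is concerned, even though dereferencing a device address on the host presumably fails at runtime:
// compiles fine on the host, but d_in holds a device address
d_in[0] = 1.0f;           // host-side write through a device pointer
float x = d_in[0];        // same problem reading it back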
I guess what I also find confusing is that by storing device memory addresses on the host, it feels like the device isn't fully in charge of managing its own memory. There seems to be a risk of host code accidentally overwriting the value of d_in or d_out, whether by assigning to them directly or through some more subtle bug, which could cause the GPU to lose access to its own memory. It also seems strange that the addresses assigned to d_in & d_out are chosen by the host rather than the device. Why should the host know anything about which addresses are or aren't available on the device?
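The kind of slip-up I'm imagining is something like this (entirely hypothetical):
// accidentally reusing d_in as an ordinary host pointer
d_in = (float *) malloc(ARRAY_BYTES);
// the original device allocation is now unreachable, so it can never be cudaFree'd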
What am I failing to understand here?