I have a kernel that uses about 2 GB of local memory. A cudaMalloc that tries to allocate 2.5 GB fails if I run kernel_func first.
I found out that the 2 GB of local memory is still occupied after kernel_func has finished, which leaves only 1.5 GB for my cudaMalloc. Does anyone have a solution or an explanation?
I know that switching kernel_func to global memory would solve the problem, but for other reasons I need to keep that huge static array in local memory.
__global__ void kernel_func() {
    // The huge static array goes here (per-thread local memory)
    short my_array[50000];
}

int main() {
    kernel_func<<<64, 128>>>();
    // my_array is still occupying device memory at this point
    // This cudaMalloc will fail with insufficient memory
    cudaMalloc(/* 2.5GB data */);
}