why the first cuda kernel cannot overlap with previous memcpy?

Question

Here is a demo. The kernel cannot overlap with previous cudaMemcpyAsync, although they are in different streams.

#include <iostream>
#include <cuda_runtime.h>

__global__ void warmUp(){
    int Id = blockIdx.x*blockDim.x+threadIdx.x;
    if(Id == 0){
        printf("warm up!");
    }
}
__global__ void kernel(){
    int Id = blockIdx.x*blockDim.x+threadIdx.x;
    if(Id == 0){
        long long x = 0;
        for(int i=0; i<1000000; i++){
            x += i>>1;
        }
        printf("kernel!%d\n", x);
    }
}

int main(){
    //warmUp<<<1,32>>>();
    int *data, *data_dev;
    int dataSize = pow(10, 7);
    cudaMallocHost(&data, dataSize*sizeof(int));
    cudaMalloc(&data_dev, dataSize*sizeof(int));
    
    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);
    cudaMemcpyAsync(data_dev, data, dataSize*sizeof(int), cudaMemcpyHostToDevice, stream1);
    kernel<<<1, 32, 0, stream2>>>();
}

Visual Profiler show

After some attempts, I found out that this is due to it being the first kernel call.

Uncomment warmUp<<<1,32>>>();, Visual Profiler show, overlap!

Why?

score 2 · Accepted Answer · answered Apr 26 '22 at 14:15

CUDA uses lazy initialization. Because of this, the first time you do a particular operation or a particular operation type, it's possible that the behavior will not be as you expect.

The operation will/should work "correctly", but performance measurements may not be as you expect.

Contrary to the linked article, there really is no specified formula to force the lazy initialization to complete, without performing the actual work you intend to do.

If the only thing you ever intend to do with your application is launch a single kernel, then having that kernel overlap with a previous copy operation doesn't seem to make a lot of sense to me. In any event, you should expect that device initialization is necessary before all operations will proceed at expected speeds or in expected ways.

Lazy initialization behavior may vary based on CUDA version, platform (e.g. OS) and GPU type.

Additionally, kernel launches are asynchronous. So this particular coding pattern:

int main(){
    ...
    kernel<<<1, 32, 0, stream2>>>();
}

is generally not recommended in CUDA, and specifically is not recommended when using a profiler. Your code should provide the opportunity for all issued work to complete properly, in order for the profiler to provide useful results. You should provide a cudaDeviceSynchronize() or similar operation at the end of your code, if you want to profile it, for this type of pattern.

I also don't recommend doing performance analysis on kernels that are issuing printf calls. The printf call imposes additional host/device synchronization behavior/needs, and this can be confusing; its not easy to predict the performance impact of that.

...and [check CUDA errors](https://stackoverflow.com/q/14038589/10107454) while you are at it. ;) — paleonix, Apr 26 '22 at 14:46
It can overlap without `printf` in `kernel(){}`, according to my test. The problem is indeed because of initialization about `printf`. Thank you. — 东百月, Apr 27 '22 at 11:37

why the first cuda kernel cannot overlap with previous memcpy?

1 Answers1