When is calling cudaDeviceSynchronize really needed?
As far as I understand from the CUDA documentation, CUDA kernel launches are asynchronous, so it seems that we should call cudaDeviceSynchronize after each kernel launch. However, I have tried the same code (training neural networks) with and without any cudaDeviceSynchronize calls, except for one before the time measurement. I found that I get the same result, but with a speed-up of 7-12x (depending on the matrix sizes).
So the question is: are there any reasons to use cudaDeviceSynchronize apart from time measurement?
For example:
- Is it needed before copying data from the GPU back to the host with cudaMemcpy?
- If I do chained matrix multiplications like C = A * B followed by D = C * F, should I put cudaDeviceSynchronize between the two launches? From my experiments it seems that I don't need to.
Why does cudaDeviceSynchronize slow the program down so much?
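For context, here is a minimal sketch of the pattern I'm asking about. The kernel is a naive placeholder (my real code uses different kernels), and everything is issued to the default stream:

```cuda
#include <cuda_runtime.h>

// Placeholder kernel, not my actual implementation: naive n x n matrix multiply.
__global__ void matMulKernel(const float* A, const float* B, float* C, int n) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n) {
        float acc = 0.0f;
        for (int k = 0; k < n; ++k)
            acc += A[row * n + k] * B[k * n + col];
        C[row * n + col] = acc;
    }
}

// dA, dB, dF, dC, dD are device buffers; hD is a host buffer.
void chain(const float* dA, const float* dB, const float* dF,
           float* dC, float* dD, float* hD, int n) {
    dim3 block(16, 16);
    dim3 grid((n + 15) / 16, (n + 15) / 16);

    // Both launches go to the default stream, so the second kernel
    // starts only after the first finishes -- this is why omitting
    // cudaDeviceSynchronize between C = A * B and D = C * F still
    // seems to give the correct result.
    matMulKernel<<<grid, block>>>(dA, dB, dC, n);  // C = A * B
    matMulKernel<<<grid, block>>>(dC, dF, dD, n);  // D = C * F

    // cudaMemcpy on the default stream waits for prior device work
    // before copying, so this also appears to need no explicit sync.
    cudaMemcpy(hD, dD, (size_t)n * n * sizeof(float),
               cudaMemcpyDeviceToHost);
}
```

Is my reading of the default-stream ordering above the reason this works, or am I just getting lucky?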