I recently made a CuPy version of my NumPy code, but I only get a speed-up of about 5-15x. When I check my GPU usage, it seems low (<1%), so I want to optimize the way my code operates to get faster results.
Generally, I want to apply multiple successive CuPy operations to a `cupy.ndarray`.
For example, generating a random vector:
```python
import cupy as cp

def randomUniformUnitary(N):
    theta = cp.random.rand(N) * 2 * cp.pi
    phi = cp.random.rand(N) * cp.pi
    x = cp.sin(phi) * cp.cos(theta)
    y = cp.sin(phi) * cp.sin(theta)
    z = cp.cos(phi)
    return cp.stack((x, y, z), axis=-1)
```
I have several questions that the docs didn't seem to answer (they do mention on-the-fly kernel creation, but without much explanation).
- Kernel merging?
Does CuPy create a kernel for `rand()`, send the data back, create another kernel for the multiplication, and so on? Or do all these calculations get combined into one faster kernel?
- Kernel combination criteria?
If kernels are merged, what criteria trigger that behavior? One-line operations? Operations on the same array? Whole functions? Is it okay performance-wise to define a separate function containing only one CuPy operation on an array, or is it better to duplicate some code so that all the CuPy calls end up in a single Python function?
- Own kernels?
If each calculation runs as a separate kernel and there is no "kernel merging", then I feel I should probably write my own kernels to optimize. Is that the only way to achieve good GPU performance?