1

I am trying to understand the significant register usage incurred when using a few of the built-in CUDA math ops like atan2() or division and how the register usage might be reduced/eliminated.

I'm using the following program:

#include <stdint.h>
#include <cuda_runtime.h>

extern "C" {
    __global__ void kernel(float* out) {
        uint32_t n = threadIdx.x + blockIdx.x*blockDim.x;
        out[n] = atan2f(static_cast<float>(n), 2.0f);
    }
}

int main(int argc, char const* argv[]) {
    float* d_ary;
    cudaMalloc(&d_ary, 32);
    kernel<<<1,32>>>(d_ary);
    float ary[32];
    cudaMemcpy(ary, d_ary, 32, cudaMemcpyDeviceToHost);
}

and building it with:

nvcc -arch=sm_80 -Xptxas="-v" kernel.cu

Profiling the kernel produces results in the image attached below.

The massive spike in register usage occurs when atan2() is called (or some function within atan2), increasing the register count by more than 100. As far as I can tell this seems to be due to the fact that atan2() is not inlined. Is there any way to get these more expensive floating point operations to get inlined other than resorting to compiler flags like use_fast_math?

enter image description here

EDIT:

@njuffa pointed out that the function call causing the register spike is associated with a slow path taken within atan2 which calls into an internal CUDA function that is not inlined. After some testing the register spike seems to be associated with any non-inlined function call (CALL.ABS.NOINC). Any device function decorated with __noinline__ results in the same phenomenon. Further, nested __noinline__ calls result in the live register count reported by Nsight increasing even further, up to the cap of 255.

einpoklum
  • 118,144
  • 57
  • 340
  • 684
Chris Uchytil
  • 140
  • 1
  • 11
  • I can't replicate this. The kernel in your question compiles to 17 registers for sm80, and a version which just casts the thread number to a float and stores it uses 8 registers -- https://godbolt.org/z/vxxKjxMd3 . I think you are misinterpreting something or your observations are made on a case other than what you have shown in the question – talonmies Jul 14 '23 at 00:50
  • running the following command to compile the code `nvcc -arch=sm_80 -Xptxas="-v" kernel.cu` it reports that the kernel requires 17 registers but nsight compute still reports the 129 live registers when profiled. Are these numbers unrelated to one another? – Chris Uchytil Jul 14 '23 at 01:40
  • Your kernel performs out-of-bounds accesses. You do not allocate enough memory. – Abator Abetor Jul 14 '23 at 06:41
  • I'm not really sure how to show the live register usage I'm seeing without including an image from Nsight. I understand that the register usage is associated with a non-inlined function. This behavior of large register usage reported by Nsight can also be seen when calling a device function decorated with `noinline` as it also results in a CALL.ABS.NOINC instruction. If atan2 is not directly causing the large register spike my next question would be is there a way to inline the underlying slow path to eliminate the function call and/or is there a reason Nsight is reporting high register usage? – Chris Uchytil Jul 14 '23 at 15:22
  • Will do. Thanks for the help @njuffa ! – Chris Uchytil Jul 14 '23 at 17:12
  • @njuffa: Haven't you answered OP's question? – einpoklum Jul 15 '23 at 08:36
  • @njuffa: I guess that's fair, but IMHO, starting an answer with "Here is my unverified hypothesis" is better than having two very long comments which seem like an answer after which the OP thanks you, strengthening that impression. – einpoklum Jul 15 '23 at 10:06

1 Answers1

0

I posted about this on the Nsight Computer forums and was informed that it is a bug and will be fixed in a future release.

https://forums.developer.nvidia.com/t/contraditory-register-count-report-when-calling-a-non-inlined-function/259908

Chris Uchytil
  • 140
  • 1
  • 11