
How is the number of registers per thread decided inside the GPU? I want to understand: if the GPU has 65536 registers per SM that it can allocate among the threads, do these registers all get allocated to the active thread block running on the SM? Right now, I have a CUDA program with 1024 threads per thread block and 65536 available registers per SM. My confusion is that the profiler says each thread only gets 40 registers. Another observation is that each thread actually makes use of exactly 64 registers in its assembly code, which means the performance could have been better if it had been assigned that number of registers. Why doesn't it get 64? Who makes this decision? Is it decided at compile time per compute capability, or at runtime, etc.?

Edit: Here is the sample code and its assembly. I'm looking at %f64 at the end of the code to conclude the point above. https://godbolt.org/z/eMzW8dY19

Ferrar
  • “actually makes use of exactly 64 registers in its assembly code” — how did you reach that conclusion? – talonmies Feb 18 '23 at 00:46
  • In the assembly code, it unrolls a loop and writes the values one by one to consecutive registers, going up to 64. I'm not sure though, I might be wrong. I'm gonna add my example code and its assembly. @talonmies – Ferrar Feb 18 '23 at 00:49
  • 3
    You are wrong. PTX is naturally emitted in [SSA](https://en.m.wikipedia.org/wiki/Static_single-assignment_form) form. It is only an intermediate representation of the code and not run by the GPU. Also, please don’t post relevant information at the end of links. Post in the question itself – talonmies Feb 18 '23 at 00:56
  • WDYM? You say that it uses 64 registers per thread. Then you say there are 1024 threads. Those multiply to 65536. So all 65536 registers are used. What's the problem? – user253751 Feb 18 '23 at 01:05
  • Compiler Explorer shows the PTX intermediate representation by default. Click the `PTX` option and choose `SASS` to see the actual assembly. – paleonix Feb 18 '23 at 14:25
  • 1
    You can use [`--resource-usage`](https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#resource-usage-res-usage) to get nvcc to show you register usage and other things like (static) shared memory usage etc. It says "Used 32 registers" for your Compiler Explorer snippet. – paleonix Feb 18 '23 at 14:29
  • 1
    But that is specific to the `sm_52` SASS. If you compile for a different architecture (or even with a different CUDA version), results may vary (produce 40 instead). – paleonix Feb 18 '23 at 14:36
  • 1
    You can somewhat influence how stingy the compiler is with registers by using [`__launch_bounds__()`](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#launch-bounds). – paleonix Feb 18 '23 at 14:39

1 Answer


How is the number of registers per thread decided inside the GPU?

Actual (non-PTX-virtual) register assignments are determined at the point of running the ptxas tool on your code (part of the nvcc compiler driver toolchain), or the equivalent tool as part of the driver API loader or the NVRTC mechanism.

ptxas is the tool that converts PTX to SASS (machine code). SASS is the thing that actually runs on a GPU, PTX is not. PTX must first be converted to SASS.

PTX and its virtual register system are not useful for understanding these concepts. There is essentially no limit to the number of virtual registers that can be defined in PTX, and the number of virtual registers defined in PTX tells you nothing at all about how actual registers will be used by the GPU hardware. PTX is not useful for this sort of study.

The register assignments are entirely statically determined at this point. You can get some evidence of this by passing the -Xptxas=-v compile switch to nvcc when your nvcc compile command specifies a valid SASS target. There is no runtime variability (ignoring the "variability" that would come about via the CUDA JIT PTX-to-SASS conversion mechanism; the item in focus here is SASS, not PTX, and once the SASS is defined, there is no runtime variability).
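For example (sm_70 and kernel.cu are placeholders here; substitute your GPU's compute capability and your source file, and expect the reported numbers to differ by architecture and CUDA version):

```shell
# Compile for a concrete SASS target and ask ptxas to report resource usage.
nvcc -arch=sm_70 -Xptxas=-v -c kernel.cu
# ptxas then prints a report along these lines (numbers illustrative):
#   ptxas info : Used 40 registers, 360 bytes cmem[0]
```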

do these registers all get allocated to the active thread block running on the SM?

The number of registers allocated will be determined by the registers per thread, some granularity/rounding effects, and the number of threads per threadblock (i.e. the product of these). This quantity of registers will be "carved out" of the total available in the SM, at the point at which a threadblock is "deposited" on that SM, by the CUDA Work Distributor (CWD or CUDA block scheduler). The CWD will not deposit a block until a sufficient number of registers are available to be allocated.

The entire complement of registers (e.g. 65536, or whatever the SM capacity is) is not automatically or always allocated for a single threadblock. It will depend on the actual needs of that threadblock. Remaining/unallocated registers can be used in the future if the CWD decides to deposit another threadblock on that SM. CUDA SMs have the ability to support multiple threadblocks simultaneously, with registers allocated for each. Unless unallocated registers are available in sufficient quantity to meet the needs of a prospective threadblock, the CWD will not deposit a new threadblock on that SM.

My confusion is, the profiler says each thread only gets 40 registers. Another observation is that each thread actually makes use of exactly 64 registers in its assembly code,

The profiler-reported number is correct (and it includes the granularity/rounding effects, which may or may not be included in the -Xptxas=-v output). Your confusion is that you are attempting to understand what is happening via the PTX. Do not do that. It is irrelevant for this discussion.
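As a side note, if you want ptxas to be more (or less) aggressive about limiting registers per thread, the `__launch_bounds__()` qualifier mentioned in the comments lets you declare the intended launch configuration. A minimal sketch (the kernel body is a placeholder):

```cuda
// Declares that this kernel is launched with at most 1024 threads per
// block and that at least 1 block should be resident per SM; ptxas uses
// these hints when deciding how many registers to spend per thread.
__global__ void __launch_bounds__(1024, 1) scale(float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = 2.0f * i;  // placeholder work
}
```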

Robert Crovella