NVIDIA GPUs specify that a warp has a fixed number of threads (32), so how are the threads in a thread block split into different warps?
For a one-dimensional thread block such as (128, 1), it looks like the threads along the x dimension are split into warps sequentially, 32 at a time. But how does it work for other block shapes, such as (16, 2)? Will those 32 threads map to a single warp in that case?