Shared memory loads not registered when using Tensor Cores

Question

I'm trying to multiply blocks of size 8x8 using Tensor Cores on a GPU designed with the Turing architecture. For that I'm using the WMMA API and fragments of size 16x16. My assumption was that shared memory bandwidth would be wasted since most data loaded into the fragments don't represent useful information. While trying to quantify that I came across the following problem: Shared memory loads using wmma::load_matrix_sync are not even reported on Nsight Compute. To test that, I'm using this kernel:

__global__
void test() {
    extern __shared__  half shmem[];
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::load_matrix_sync(a_frag, shmem, 16);
    wmma::load_matrix_sync(b_frag, shmem, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    wmma::store_matrix_sync((float*)shmem, c_frag, 16, wmma::mem_row_major);
}

Nsight Compute reports shared memory stores, but not loads. What is happening here? I tried several variations but it still shows 0 loads.

The revised question can probably only be answered by Greg Smith. I would guess that the shared load and store statistics come from counting instructions, and the load instructions generated by wmma don't get counted. — talonmies, Feb 18 '21 at 09:22
I migrated to the new Nsight Compute version (previous version was 2019.4.0) and now I can see those loads as "Shared Load Matrix" in the shared memory section. Thanks. — rm95, Feb 18 '21 at 22:19
The new LDSM instruction was not counted in the SM hardware counter used for shared memory accesses. A fix was made in Nsight Compute 2020.3.1. See https://docs.nvidia.com/nsight-compute/ReleaseNotes/index.html#updates-2020-3-1. — Greg Smith, Mar 10 '21 at 17:01

score 0 · Accepted Answer · answered Dec 31 '22 at 03:52

Answer added from information in comments:

The new LDSM instruction was not counted in the SM hardware counter used for shared memory accesses. A fix was made in Nsight Compute 2020.3.1. See the release notes here

As was hypothesised, at the time the question was posted, Nsight compute didn’t include instruction counting for the load instructions generated by wmma. This has since been rectified.

Shared memory loads not registered when using Tensor Cores

1 Answers1