I am trying to understand the usage and benefit of the "cudaOccupancyMaxActiveBlocksPerMultiprocessor" function.
I am using a slightly modified version of the sample program posted on the NVIDIA developer forum. Basically, I ask the user to provide the size of the array.
My GPU: NVIDIA GeForce GTX 1070
QUESTIONS:
- The occupancy values returned by the program seem random: many times the program returns different occupancy values for the same input array size. Is there anything wrong in the program?
- As shown in the screenshot, if the user passes array size = 512, the occupancy value is "13", whereas if I set N = 512 directly in the program, the occupancy value is "47". Why?
- Why does a user-provided array size = 1024 give an occupancy value of 0? (See the hand-worked example of the formula below for what I expected.)
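For reference, here is how I read the theoretical-occupancy formula used at the end of kernel.cu. The numbers below are only assumptions for illustration: a GTX 1070 should report warpSize = 32 and maxThreadsPerMultiProcessor = 2048, and the blockSize / maxActiveBlocks values are hypothetical, not taken from my runs. As I understand it, this formula should always produce a value between 0.0 and 1.0, which is why outputs like 13, 47, and 0 confuse me.

// Hand-worked sketch of the occupancy formula from the sample.
// Assumed GTX 1070 values: warpSize = 32, maxThreadsPerMultiProcessor = 2048.
// blockSize and maxActiveBlocks below are hypothetical, for illustration only.
#include <cstdio>

int main()
{
    int warpSize = 32;
    int maxThreadsPerMultiProcessor = 2048;
    int blockSize = 1024;       // hypothetical launch-configurator result
    int maxActiveBlocks = 2;    // hypothetical occupancy-query result

    // Same expression as in launchMyKernel():
    // (active warps per SM) / (maximum warps per SM)
    float occupancy = (maxActiveBlocks * blockSize / warpSize) /
                      (float)(maxThreadsPerMultiProcessor / warpSize);

    // (2 * 1024 / 32) / (2048 / 32) = 64 / 64 = 1.0
    printf("Expected theoretical occupancy: %f\n", occupancy);
    return 0;
}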
SAMPLE CODE:
Source.cpp
#include "kernel_header.cuh"
#include <algorithm>
#include <cstdlib>   // calloc / free
#include <iostream>
using namespace std;
int main(int argc, char* argv[])
{
    int N;
    int userSize = 0;
    //ask size to user
    cout << "\n\nType the size of 1D Array: " << endl;
    cin >> userSize;
    N = userSize>0? userSize : 1024; //<<<<<<<<<<<<<<<-------PROBLEM
    int* array = (int*)calloc(N, sizeof(int));
    for (int i = 0; i < N; i++)
    {
        array[i] = i + 1;
        //cout << "i = " << i << " is " << array[i]<<endl;
    }
    launchMyKernel(array, N);
    free(array);
    return 0;
}
kernel_header.cuh
#ifndef KERNELHEADER
#define KERNELHEADER
void launchMyKernel(int* array, int arrayCount);
#endif
kernel.cu
#include <stdio.h>
#include "cuda_runtime.h"
#include "kernel_header.cuh"   // keep launchMyKernel's declaration and definition in sync
__global__ void MyKernel(int* array, int arrayCount)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < arrayCount)
    {
        array[idx] *= array[idx];
    }
}
void launchMyKernel(int* array, int arrayCount)
{
    int blockSize;   // The launch configurator returned block size 
    int minGridSize; // The minimum grid size needed to achieve the 
                     // maximum occupancy for a full device launch 
    int gridSize;    // The actual grid size needed, based on input size 
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, MyKernel, 0, 0);
    // Round up according to array size
    gridSize = (arrayCount + blockSize - 1) / blockSize;
    MyKernel<<<gridSize, blockSize>>>(array, arrayCount);
    cudaDeviceSynchronize();
    // calculate theoretical occupancy
    int maxActiveBlocks;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxActiveBlocks,
        MyKernel, blockSize,
        0);
    int device;
    cudaDeviceProp props;
    cudaGetDevice(&device);
    cudaGetDeviceProperties(&props, device);
    float occupancy = (maxActiveBlocks * blockSize / props.warpSize) /
        (float)(props.maxThreadsPerMultiProcessor /
            props.warpSize);
    printf("\n\nMax. Active blocks found: %d\n"
           "Our Kernel block size decided: %d\n"
           "Warp Size: %d\n"
           "Number of threads per SM: %d\n\n\n\n",
           maxActiveBlocks, blockSize,
           props.warpSize, props.maxThreadsPerMultiProcessor);
    printf("Launched blocks of size %d. Theoretical occupancy: %f\n",
        blockSize, occupancy);
}
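One more thing I noticed while preparing this post: launchMyKernel never checks any CUDA return codes, so if one of the occupancy calls failed, blockSize or maxActiveBlocks could be left holding garbage. Below is a small standalone diagnostic I could compile separately (as its own .cu file with nvcc) to confirm whether the two occupancy calls succeed for MyKernel on their own. checkCuda is just a helper name I made up for this sketch; it is not part of the original sample.

// Standalone diagnostic sketch (not part of the sample program).
// checkCuda is a hypothetical helper; it simply aborts on any CUDA error.
#include <cstdio>
#include <cstdlib>
#include "cuda_runtime.h"

static void checkCuda(cudaError_t err, const char* what)
{
    if (err != cudaSuccess)
    {
        fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
}

__global__ void MyKernel(int* array, int arrayCount)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < arrayCount)
        array[idx] *= array[idx];
}

int main()
{
    int minGridSize = 0, blockSize = 0, maxActiveBlocks = 0;

    // Check both occupancy queries explicitly instead of ignoring their status.
    checkCuda(cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                                 MyKernel, 0, 0),
              "cudaOccupancyMaxPotentialBlockSize");
    checkCuda(cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxActiveBlocks,
                                                            MyKernel,
                                                            blockSize, 0),
              "cudaOccupancyMaxActiveBlocksPerMultiprocessor");

    printf("blockSize = %d, maxActiveBlocks = %d\n", blockSize, maxActiveBlocks);
    return 0;
}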