I am using CUDA MPS (Multi-Process Service) and trying to understand how the number of active GPU threads affects GPU utilization and execution time. I have written a Python script that repeatedly multiplies a tensor of 100,000,000 elements by itself, element-wise, on the GPU. However, I am encountering some unexpected observations.
Here is the code snippet I am using for GPU multiplication:
import torch
import time
def use_gpu():
    # Check if CUDA is available
    if torch.cuda.is_available():
        device = torch.device("cuda")
        print("Using GPU:", torch.cuda.get_device_name(device))
    else:
        print("CUDA is not available. Using CPU instead.")
        device = torch.device("cpu")
    tensor_size = 100_000_000  # Adjust the size to utilize the GPU fully
    # Create a random tensor on GPU
    tensor = torch.randn(tensor_size, device=device)
    num_mults = int(input("num of multiplications before each measurement: "))

    # Perform some computation on GPU
    execution_time_list = []
    for trial in range(100):  # repeat the experiment 100 times
        start_time = time.time()
        for _ in range(num_mults):
            result = tensor * tensor  # element-wise multiplication on the GPU
        end_time = time.time()
        tm = end_time - start_time
        execution_time_list.append(tm)
        print(f"Time taken for {num_mults} multiplications to complete: {tm}")
    print(f"Average time per interval: {sum(execution_time_list) / len(execution_time_list)}")
use_gpu()
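For completeness, here is a sketch of how the same loop could be timed with explicit synchronization, since kernel launches are asynchronous and time.time() alone may mostly measure launch/queueing overhead; the measurements I report below were taken with the script above, without synchronization:
import time
import torch

def timed_multiplications(tensor, num_mults):
    # Sketch only: same element-wise multiply, but synchronize around the
    # timed region so the timer covers completed GPU work.
    torch.cuda.synchronize()      # wait for any previously queued work
    start_time = time.time()
    for _ in range(num_mults):
        result = tensor * tensor  # same operation as in the script above
    torch.cuda.synchronize()      # wait until all queued kernels have finished
    return time.time() - start_time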
To activate the CUDA MPS server and run the script, I use the following commands:
- Set the GPU's compute mode to exclusive process mode:
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
- Start the CUDA MPS server:
nvidia-cuda-mps-control -d
- Run the script with different active thread percentages:
CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=5 python multiply.py
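To sweep several percentages in one go, a small wrapper like the following could be used (a sketch only: it assumes the script above is saved as multiply.py, that the MPS server is already running, and it feeds an arbitrary example value of 1000 multiplications to the input prompt):
import os
import subprocess

for pct in (5, 25, 50, 100):
    # Pass the percentage to the MPS client via its environment.
    env = dict(os.environ, CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=str(pct))
    print(f"--- CUDA_MPS_ACTIVE_THREAD_PERCENTAGE={pct} ---")
    subprocess.run(["python", "multiply.py"], env=env,
                   input="1000\n", text=True, check=True)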
I have observed that GPU utilization is consistently 100% regardless of the active thread percentage allocated to the script. Moreover, the execution time is the same when I allocate 25% of the GPU threads (18 threads) as when I allocate 100% (72 threads). However, when I allocate only 5% of the GPU threads, the execution time becomes 5 times longer. I would like to understand the reason behind these observations.
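In case it helps, here is a sketch of how the thread counts I quote above can be cross-checked against what PyTorch reports for the device (I am assuming the active thread percentage is applied to the multiprocessor count):
import torch

# Sketch only: print the device's multiprocessor (SM) count and the number of
# SMs each percentage would correspond to if the scaling is linear.
props = torch.cuda.get_device_properties(0)
print("Device:", props.name)
print("Multiprocessor count:", props.multi_processor_count)
for pct in (5, 25, 100):
    print(f"{pct}% -> ~{props.multi_processor_count * pct / 100:.0f} SMs")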
Thank you for your assistance.