I am using CUDA MPS (Multi-Process Service) and trying to understand how the number of active GPU threads affects GPU utilization and execution time. I have written a Python script that repeatedly multiplies a tensor of 100,000,000 elements by itself, element-wise, on the GPU. However, I am encountering some unexpected observations.
Here is the code snippet I am using for GPU multiplication:
import torch
import time
def use_gpu():
    # Check if CUDA is available
    if torch.cuda.is_available():
        device = torch.device("cuda")
        print("Using GPU:", torch.cuda.get_device_name(device))
    else:
        print("CUDA is not available. Using CPU instead.")
        device = torch.device("cpu")
    tensor_size = 100_000_000  # Adjust the size to utilize the GPU fully
    # Create a random tensor on GPU
    tensor = torch.randn(tensor_size, device=device)
    num_mults = int(input("num of multiplications before each measurement: "))

    # Perform some computation on GPU
    execution_time_list = []
    for trial in range(100):  # repeat the experiment 100 times
        start_time = time.time()
        for _ in range(num_mults):
            result = tensor * tensor  # element-wise multiplication on the GPU
        end_time = time.time()
        tm = end_time - start_time
        execution_time_list.append(tm)
        print(f"Time taken for {num_mults} multiplications to complete: {tm}")
    print(f"Average time per interval: {sum(execution_time_list) / len(execution_time_list)}")
use_gpu()
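For completeness, here is a sketch of how the same loop could be timed with explicit synchronization, since kernel launches are asynchronous and time.time() alone may mostly measure launch/queueing overhead; the measurements I report below were taken with the script above, without synchronization:
import time
import torch

def timed_multiplications(tensor, num_mults):
    # Sketch only: same element-wise multiply, but synchronize around the
    # timed region so the timer covers completed GPU work.
    torch.cuda.synchronize()      # wait for any previously queued work
    start_time = time.time()
    for _ in range(num_mults):
        result = tensor * tensor  # same operation as in the script above
    torch.cuda.synchronize()      # wait until all queued kernels have finished
    return time.time() - start_time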
To activate the CUDA MPS server and run the script, I use the following commands:
- Set the GPU's compute mode to exclusive process mode:
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS
- Start the CUDA MPS server:
nvidia-cuda-mps-control -d
- Run the script with different active thread percentages:
CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=5 python multiply.py
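To sweep several percentages in one go, a small wrapper like the following could be used (a sketch only: it assumes the script above is saved as multiply.py, that the MPS server is already running, and it feeds an arbitrary example value of 1000 multiplications to the input prompt):
import os
import subprocess

for pct in (5, 25, 50, 100):
    # Pass the percentage to the MPS client via its environment.
    env = dict(os.environ, CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=str(pct))
    print(f"--- CUDA_MPS_ACTIVE_THREAD_PERCENTAGE={pct} ---")
    subprocess.run(["python", "multiply.py"], env=env,
                   input="1000\n", text=True, check=True)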
I have observed that GPU utilization is consistently 100% regardless of the active thread percentage allocated to the script. Moreover, the execution time is the same when I allocate 25% of the GPU threads (18 threads) as when I allocate 100% (72 threads). However, when I allocate only 5% of the GPU threads, the execution time becomes 5 times longer. I would like to understand the reason behind these observations.
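In case it helps, here is a sketch of how the thread counts I quote above can be cross-checked against what PyTorch reports for the device (I am assuming the active thread percentage is applied to the multiprocessor count):
import torch

# Sketch only: print the device's multiprocessor (SM) count and the number of
# SMs each percentage would correspond to if the scaling is linear.
props = torch.cuda.get_device_properties(0)
print("Device:", props.name)
print("Multiprocessor count:", props.multi_processor_count)
for pct in (5, 25, 100):
    print(f"{pct}% -> ~{props.multi_processor_count * pct / 100:.0f} SMs")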
Thank you for your assistance.