I am working on a dual-processor Windows machine and am trying to run several independent Python processes using the multiprocessing library. My goal is to keep both CPUs fully busy in order to reduce computation time. The details of my machine are below:
- OS: Windows 10 Pro for Workstations
- RAM: 524 GB
- Hard Drive: Samsung SSD PRO 960 (NVMe)
- CPU: Xeon Gold 6154 (times 2)
I run a master script with Python 3.6, which spawns 72 memory-independent workers using the multiprocessing library. Initially, all 72 logical cores of my machine are used at 100%. After about 5-10 minutes, however, all 36 logical cores on my second CPU drop to 0% usage, while the 36 on the first CPU remain at 100%. I can't figure out why this is happening.
Is there something I am missing regarding the utilization of both CPUs in a dual-processor Windows machine? How can I ensure that the full potential of my machine is used? As a side note, would this behave differently on a Linux OS? Thank you in advance to anyone who is willing to help with this.
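One quick check that may help narrow this down is to print how many logical CPUs Python itself reports on the machine. The following is only a minimal sketch; the helper name `visible_cpu_counts` is made up for illustration and is not part of my master script.

```python
import os
import multiprocessing


def visible_cpu_counts():
    """Return the logical CPU counts reported by os and multiprocessing.

    Hypothetical helper, used here only for illustration.
    """
    return os.cpu_count(), multiprocessing.cpu_count()


if __name__ == "__main__":
    os_count, mp_count = visible_cpu_counts()
    print("os.cpu_count():", os_count)
    print("multiprocessing.cpu_count():", mp_count)
```

A value lower than 72 would suggest that the interpreter does not see every logical processor, which would be worth mentioning alongside the Task Manager observations.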
A representation of my Python master script is below:
import pandas as pd
import netCDF4 as nc
from multiprocessing import Pool
WEATHERDATAPATH = "C:/Users/..../weatherdata/weatherfile_%s.nc4"
OUTPUTPATH = "C:/Users/....outputs/result_%s.nc4"
def calculationFunction(year):
    dataset = nc.Dataset(WEATHERDATAPATH%year)
    # Read the data
    data1 = dataset["windspeed"][:]
    data2 = dataset["pressure"][:]
    data3 = dataset["temperature"][:]
    timeindex = nc.num2date(dataset["time"][:], dataset["time"].units)
    # Do computations with the data, primarily relying on NumPy
    data1Mean = data1.mean(axis=1)
    data2Mean = data2.mean(axis=1)
    data3Mean = data3.mean(axis=1)
    # Write result to a file
    result = pd.DataFrame( {"windspeed":data1Mean,
                            "pressure":data2Mean,
                            "temperature":data3Mean,}, 
                          index=timeindex )
    result.to_csv(OUTPUTPATH%year)
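if __name__ == '__main__':
    # Start 72 workers and submit one task per year (1900-2015)
    with Pool(72) as pool:
        results = [pool.apply_async(calculationFunction, (year,))
                   for year in range(1900, 2016)]
        # Block until every task has finished; get() re-raises worker errors
        for r in results:
            r.get()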
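To see whether the idle cores correspond to workers that have stopped receiving tasks, one option is to record which process handles each year. The sketch below is a diagnostic under stated assumptions, not my actual script: `traced_task` is a hypothetical stand-in for `calculationFunction` with a placeholder computation instead of the netCDF reads, and the worker count of 4 is arbitrary.

```python
import os
import time
from collections import Counter
from multiprocessing import Pool


def traced_task(year):
    """Hypothetical stand-in for calculationFunction.

    Does a small placeholder computation and returns the year, the PID of
    the worker that handled it, and the elapsed time in seconds.
    """
    start = time.time()
    sum(i * i for i in range(200000))  # placeholder CPU work (assumption)
    return year, os.getpid(), time.time() - start


def run_traced(years, workers):
    """Run traced_task over all years and count tasks per worker PID."""
    with Pool(workers) as pool:
        # chunksize=1 hands out one year at a time, like apply_async per year
        records = pool.map(traced_task, years, chunksize=1)
    tasks_per_pid = Counter(pid for _, pid, _ in records)
    return records, tasks_per_pid


if __name__ == "__main__":
    records, tasks_per_pid = run_traced(range(1900, 2016), workers=4)
    print("tasks completed:", len(records))
    print("distinct worker processes used:", len(tasks_per_pid))
    for pid, count in sorted(tasks_per_pid.items()):
        print("pid", pid, "handled", count, "tasks")
```

Note that in this representation `range(1900, 2016)` yields 116 tasks, so with 72 workers only 44 tasks remain after the first round; some workers may simply run out of work, although that alone would not obviously explain why exactly the 36 cores of one CPU go idle.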
 
    