I am running on a machine with two 16-core AMD 7302 processors (32 cores in total). I'm on a Red Hat 8.4 system and using Python 3.10.6.
I've recently started learning the multiprocessing library. Inspired by the first example on the documentation page, I wrote my own small test script:
from multiprocessing import Pool
import numpy as np
import sys
import datetime
def f(x):
    return x**2
def main(DataType="List", NThr=2, Vectorize=False):
    N = 5*10**7           # number of elements
    n = NThr              # number of worker processes (Pool uses processes, not threads)
    y = np.zeros(N)
    # Use list
    if(DataType == "List"):
        x = []
        for i in range(N):
            x.append(i)
    # Use Numpy
    elif(DataType=="Numpy"):
        x = np.zeros(N)
        for i in range(len(x)):
            x[i] = i
    # Run parallel code
    t0 = datetime.datetime.now()
    if(n==1):
        if(DataType == "Numpy" and Vectorize == True):
            y = np.vectorize(f)(x)
        else:
            for i in range(len(x)):
                y[i] = f(x[i])
    else:
        with Pool(n) as p:
            y = p.map(f, x)
    t1 = datetime.datetime.now()
    dt = (t1 - t0).total_seconds()
    print("{} : Vect = {}, n = {}, time : {}s".format(DataType,Vectorize,n,dt))
    sys.exit(0)
if __name__ == "__main__":
    main()
I noticed that when I run p.map() over a numpy array, it performs substantially worse than over a list. Here is the output from several runs (python mycode.py), each with different arguments to main():
Numpy : Vect = True, n = 1, time : 9.566441s
Numpy : Vect = False, n = 1, time : 16.00333s
Numpy : Vect = False, n = 2, time : 143.331352s
List : Vect = False, n = 1, time : 21.11657s
List : Vect = False, n = 2, time : 11.868897s
List : Vect = False, n = 5, time : 6.162561s
Look at the (Numpy, n=2) run at 143s. Its run time is more than ten times that of the (List, n=2) run at 11.9s. It is also much worse than either of the (Numpy, n=1) runs.
Question :
What makes numpy arrays take so long to run with the multiprocessing library, specifically when NThr==2?
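One check that might narrow this down is to compare how expensive it is to serialize the elements themselves. The sketch below assumes (not confirmed by the runs above) that the cost of pickling each chunk sent to the workers is a factor. As far as I understand, Pool.map iterates over its input and sends tuples of elements to the workers, and iterating a numpy array yields numpy scalar objects rather than plain Python numbers. The chunk size CHUNK is a hypothetical value picked only for illustration.

```python
import pickle
import timeit

import numpy as np

# Hypothetical chunk size, chosen only for illustration.
CHUNK = 100_000

# Elements as they appear when each container is iterated:
# Python ints from the list, numpy scalars from the array.
list_chunk = tuple(list(range(CHUNK)))
numpy_chunk = tuple(np.arange(CHUNK, dtype=float))

print("element types:", type(list_chunk[0]), type(numpy_chunk[0]))

# Size of the pickled chunk in bytes.
list_bytes = len(pickle.dumps(list_chunk))
numpy_bytes = len(pickle.dumps(numpy_chunk))
print("pickled size: list = {} B, numpy scalars = {} B".format(list_bytes, numpy_bytes))

# Time to pickle each chunk five times.
list_time = timeit.timeit(lambda: pickle.dumps(list_chunk), number=5)
numpy_time = timeit.timeit(lambda: pickle.dumps(numpy_chunk), number=5)
print("pickle time: list = {:.3f}s, numpy scalars = {:.3f}s".format(list_time, numpy_time))
```

If the numpy chunk turns out to be much larger and slower to pickle, that would point at inter-process serialization rather than at f itself.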
EDIT :
Per a comment's suggestion, I ran both versions (Numpy, n=2) and (List, n=2) through the profiler :
>>> import cProfile
>>> from mycode import main
>>> cProfile.run('main()')
and compared them side by side. The most time-consuming function calls, and the calls whose call counts differ between the two versions, are listed below.
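For reference, a sorted and truncated report like the ones below can also be produced programmatically with the standard cProfile and pstats modules. This is only a minimal sketch: work() is a hypothetical stand-in for main(), used so the example finishes quickly.

```python
import cProfile
import io
import pstats


def work():
    # Hypothetical stand-in for main(); any callable can be profiled this way.
    return sum(i * i for i in range(100_000))


profiler = cProfile.Profile()
profiler.enable()
result = work()
profiler.disable()

# Sort by cumulative time and print only the ten most expensive entries.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(10)
report = buffer.getvalue()
print(report)
```

Because main() in mycode.py ends with sys.exit(0), profiling it this way would need the call wrapped in try/except SystemExit so that profiler.disable() still runs.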
For Numpy version :
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
# Time consuming
1    0.000    0.000  138.997  138.997 pool.py:362(map)
1    0.000    0.000  138.956  138.956 pool.py:764(wait)
1    0.000    0.000  138.956  138.956 pool.py:767(get)
4    0.000    0.000  138.957   34.739 threading.py:288(wait)
4    0.000    0.000  138.957   34.739 threading.py:589(wait)
14/1    0.000    0.000  145.150  145.150 {built-in method builtins.exec}
19  138.957    7.314  138.957    7.314 {method 'acquire' of '_thread.lock' objects}
# Different number of calls
6    0.000    0.000    0.088    0.015 popen_fork.py:24(poll)
1    0.000    0.000    0.088    0.088 popen_fork.py:36(wait)
1    0.000    0.000    0.088    0.088 process.py:142(join)
10    0.000    0.000    0.000    0.000 process.py:99(_check_closed)
18    0.000    0.000    0.000    0.000 util.py:48(debug)
76    0.000    0.000    0.000    0.000 {built-in method builtins.len}
2    0.000    0.000    0.000    0.000 {built-in method numpy.zeros}
17    0.000    0.000    0.000    0.000 {built-in method posix.getpid}
6    0.088    0.015    0.088    0.015 {built-in method posix.waitpid}
3    0.000    0.000    0.000    0.000 {method 'append' of 'list' objects}
For List version :
ncalls  tottime  percall  cumtime  percall filename:lineno(function)
# Time consuming
1    0.000    0.000   13.961   13.961 pool.py:362(map)
1    0.000    0.000   13.920   13.920 pool.py:764(wait)
1    0.000    0.000   13.920   13.920 pool.py:767(get)
4    0.000    0.000   13.921    3.480 threading.py:288(wait)
4    0.000    0.000   13.921    3.480 threading.py:589(wait)
14/1    0.000    0.000   24.475   24.475 {built-in method builtins.exec}
19   13.921    0.733   13.921    0.733 {method 'acquire' of '_thread.lock' objects}
# Different number of calls
7    0.000    0.000    0.132    0.019 popen_fork.py:24(poll)
2    0.000    0.000    0.132    0.066 popen_fork.py:36(wait)
2    0.000    0.000    0.132    0.066 process.py:142(join)
12    0.000    0.000    0.000    0.000 process.py:99(_check_closed)
19    0.000    0.000    0.000    0.000 util.py:48(debug)
75    0.000    0.000    0.000    0.000 {built-in method builtins.len}
1    0.000    0.000    0.000    0.000 {built-in method numpy.zeros}
18    0.000    0.000    0.000    0.000 {built-in method posix.getpid}
7    0.132    0.019    0.132    0.019 {built-in method posix.waitpid}
50000003    2.780    0.000    2.780    0.000 {method 'append' of 'list' objects}
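Note that the List version makes 50000003 calls to append(), compared to 3 in the Numpy version; this is due to the initialization of x. In both profiles, nearly all of the time is spent in 'acquire' of '_thread.lock', i.e. the main process waiting for the workers to finish, so the profile of the main process does not show where the workers themselves spend their time.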
 
    