I'm trying to parallelize some calculations that use numpy with the help of Python's multiprocessing module. Consider this simplified example:
import time
import numpy
from multiprocessing import Pool
def test_func(i):
    a = numpy.random.normal(size=1000000)
    b = numpy.random.normal(size=1000000)
    # repeatedly swap a and b via elementwise arithmetic
    for _ in range(2000):
        a = a + b
        b = a - b
        a = a - b
    return 1
# time one run in the parent process
t1 = time.time()
test_func(0)
single_time = time.time() - t1
print("Single time:", single_time)

# time n_par concurrent runs in worker processes
n_par = 4
pool = Pool()
t1 = time.time()
results_async = [
    pool.apply_async(test_func, [i])
    for i in range(n_par)]
results = [r.get() for r in results_async]
multicore_time = time.time() - t1
print("Multicore time:", multicore_time)
print("Efficiency:", single_time / multicore_time)
When I execute it, multicore_time is roughly equal to single_time * n_par, while I would expect it to be close to single_time. Indeed, if I replace the numpy calculations with just time.sleep(10), I get exactly that: perfect efficiency. But for some reason it does not work with numpy. Can this be solved, or is it some internal limitation of numpy?
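For reference, the sleep-based variant I mean is simply this (test_func_sleep is just an illustrative name):

def test_func_sleep(i):
    # no computation and no memory traffic: each worker just waits,
    # so n_par of these finish in about the same time as one
    time.sleep(10)
    return 1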
Some additional info which may be useful:
- I'm using OS X 10.9.5, Python 3.4.2, and the CPU is a Core i7 with 4 cores as reported by the system info (although the above program only takes 50% of the total CPU time, so the system info may not be taking hyperthreading into account).
- when I run this, I see n_par processes in top, each working at 100% CPU
- if I replace the numpy array operations with a loop and per-index operations, the efficiency rises significantly (to about 75% for n_par = 4); a sketch of what I mean follows below.
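To make that last point concrete, the per-index variant looks roughly like this (my sketch; the outer iteration count is cut down, since pure-Python indexing is orders of magnitude slower than the vectorized version):

def test_func_indexed(i):
    a = numpy.random.normal(size=1000000)
    b = numpy.random.normal(size=1000000)
    # same swap logic as above, but element by element in pure Python
    for _ in range(2):
        for j in range(len(a)):
            a[j] = a[j] + b[j]
            b[j] = a[j] - b[j]
            a[j] = a[j] - b[j]
    return 1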