I have a pool of workers that all perform the same task, and I send each one a distinct clone of the same data object. I then measure the run time separately for each process, inside the worker function.
With one process, the run time is 4 seconds. With 3 processes, the run time for each process goes up to 6 seconds.
With more complex tasks, this slowdown is even more pronounced.
There are no other CPU-hogging processes running on my system, and the workers don't use shared memory (as far as I can tell). The run times are measured inside the worker function, so I assume the forking overhead shouldn't matter.
Why does this happen?
from time import time

def worker_fn(data):
    # Time only the processing step; fork and pickle overhead
    # happen before this function runs
    t1 = time()
    data.process()
    print time() - t1
    return data.results

def main(n, num_procs=3):
    from multiprocessing import Pool
    from cPickle import dumps, loads

    pool = Pool(processes=num_procs)
    data = MyClass()
    # Give each worker its own independent copy of the data object
    data_pickle = dumps(data)
    list_data = [loads(data_pickle) for i in range(n)]
    results = pool.map(worker_fn, list_data)
    return results
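MyClass is my own class that I can't post in full (see the edit below). A made-up stand-in with a similar numpy-heavy workload, purely so the snippet above is self-contained and runnable, might look like this (the class body and matrix size are invented for illustration):

import numpy as np

class MyClass(object):
    def __init__(self, size=1500):
        # Invented matrix size; the real class holds more state
        self.matrix = np.random.rand(size, size)
        self.results = None

    def process(self):
        # Stand-in numpy matrix work (np.dot goes through BLAS)
        self.results = np.dot(self.matrix, self.matrix)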
Edit: Although I can't post the entire code for MyClass, I can say that it involves a lot of numpy matrix operations. It seems that numpy's use of OpenBLAS may somehow be to blame.
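One way to test this theory would be to cap the BLAS thread pools at a single thread before numpy is first imported; the environment variables below are the standard thread caps for OpenBLAS, OpenMP, and MKL builds of numpy:

import os

# Cap BLAS/OpenMP thread pools before numpy is imported, so each
# worker's matrix operations stay on a single core
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

import numpy  # must happen after the variables are set

If the per-process times stay near 4 seconds with 3 workers after this change, the slowdown was OpenBLAS's own threads oversubscribing the cores, not the multiprocessing itself.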
