I want to speed up the computation of u ** 2, where u is a NumPy array, using the multiprocessing module.
Here is my attempt (file name multi.py):
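# To run on Windows/IPython: import multi, then run -m multi
from multiprocessing import Pool
import numpy as np

if __name__ == '__main__':
    u = np.arange(6e7)

    def test(N):
        pool = Pool(N)
        v = len(u) // N
        # Split u into N equal chunks (any remainder is dropped if N does not divide len(u)).
        tasks = [u[k*v:(k+1)*v] for k in range(N)]
        # Each chunk is sent to a worker process, squared there, and sent back.
        res = pool.map_async(np.square, tasks).get()
        return res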
Here are the benchmarks:
In [25]: %time  r1=test(1)
Wall time: 13.2 s
In [26]: %time  r2=test(2)
Wall time: 7.75 s
In [27]: %time  r4=test(4)
Wall time: 8.29 s
In [31]: %time r=u**2
Wall time: 512 ms
I have 2 physical cores on my PC, so test(2) running faster than test(1) is encouraging.
But for the moment, plain NumPy is much faster: multiprocessing adds a large overhead.
So my question is: is it possible to speed up u ** 2 with multiprocessing, and if so, how?
EDIT
I realize that each process works in its own memory space, so a lot of copying necessarily occurs (see here for example). So there is no hope of speeding up a simple computation this way.
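To get a rough feel for that copying cost, here is a minimal sketch that times a pickle round trip of an array against squaring it. It only approximates what Pool does when it ships a chunk to a worker (the result has to travel back the same way, so the real cost is roughly doubled). The function name copy_vs_compute and the array size are my own choices for illustration, not part of the code above.

```python
import pickle
import time

import numpy as np


def copy_vs_compute(n=1_000_000):
    """Time a pickle round trip of an array against squaring it."""
    u = np.arange(n, dtype=np.float64)

    # Serialize and deserialize, similar to sending a chunk to a worker.
    t0 = time.perf_counter()
    payload = pickle.dumps(u, protocol=pickle.HIGHEST_PROTOCOL)
    restored = pickle.loads(payload)
    t_copy = time.perf_counter() - t0

    # The actual work the worker would do.
    t0 = time.perf_counter()
    squared = np.square(restored)
    t_compute = time.perf_counter() - t0

    return len(payload), t_copy, t_compute, squared


if __name__ == "__main__":
    nbytes, t_copy, t_compute, _ = copy_vs_compute()
    print(f"pickled size:      {nbytes / 1e6:.1f} MB")
    print(f"pickle round trip: {t_copy * 1e3:.2f} ms")
    print(f"np.square:         {t_compute * 1e3:.2f} ms")
```

If the round trip takes as long as or longer than np.square on your machine, splitting the work across processes cannot pay off for an operation this cheap, whatever the number of cores.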