Have you timed how long it takes just to allocate a? A 10000 x 100000 float64 array takes up 8GB of memory.
a = np.ones((10000, 100000), np.double)
takes over six seconds on my laptop with 16GB of RAM. If you don't have 8GB free then you'll hit swap and it will take a lot longer. Since func spends almost all of its time just allocating a, parallelizing your outer for loop can only gain you a small fractional improvement in the total runtime.
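If not, a quick check along these lines (just a sketch; the exact figure will depend on your hardware and how much free memory you have) shows that the allocation alone is expensive:

import time
import numpy as np

start = time.time()
a = np.ones((10000, 100000), np.double)  # ~8GB, allocated and filled with 1.0
print('allocating a took %.2f s' % (time.time() - start))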
To demonstrate this, I have modified your function to accept a as an input. In tmp.pyx:
#cython: boundscheck=False, wraparound=False, initializedcheck=False
from cython.parallel cimport prange

def serial(double[:, :] a):
    # sum every element of a with a plain single-threaded double loop
    cdef:
        double total = 0
        int i, j
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            total += a[i, j]
    return total

def parallel(double[:, :] a):
    # same sum, but the outer loop is split across OpenMP threads;
    # Cython treats total as a reduction variable because of the +=
    cdef:
        double total = 0
        int i, j
    for i in prange(a.shape[0], nogil=True, schedule='static', chunksize=1):
        for j in range(a.shape[1]):
            total += a[i, j]
    return total
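To compile tmp.pyx with OpenMP enabled (without it, prange falls back to running serially), I used a setup.py along these lines; the -fopenmp flags assume a GCC-style compiler, and MSVC uses /openmp instead:

from distutils.core import setup
from distutils.extension import Extension
from Cython.Build import cythonize

ext = Extension(
    'tmp',
    sources=['tmp.pyx'],
    extra_compile_args=['-fopenmp'],  # let prange use multiple threads
    extra_link_args=['-fopenmp'],
)

setup(ext_modules=cythonize([ext]))

Build it in place with python setup.py build_ext --inplace before running the session below.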
For example:
In [1]: import numpy as np

In [2]: import tmp

In [3]: r, c = 10000, 100000

In [4]: a = np.random.randn(r, c)  # this takes ~6.75 sec

In [5]: %timeit tmp.serial(a)
1 loops, best of 3: 1.25 s per loop

In [6]: %timeit tmp.parallel(a)
1 loops, best of 3: 450 ms per loop
Parallelizing the function gave about a 2.8x speed-up* on my laptop with 4 cores, but the time saved is only a small fraction of the ~6.75 seconds it takes to allocate a.
The lesson here is to always profile your code to understand where it spends most of its time before you dive into optimizations.
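As a rough illustration (the func below is only a hypothetical stand-in for your original, which allocates a and then reduces it), the standard-library profiler makes the split obvious:

import cProfile
import numpy as np

def func():
    # stand-in: allocate the big array, then do the reduction
    a = np.ones((10000, 100000), np.double)
    return a.sum()

# sorting by cumulative time puts np.ones at the top of the report
cProfile.run('func()', sort='cumulative')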
* You could do a little better by handing larger chunks of a to each worker thread, e.g. by increasing chunksize or using schedule='guided'.
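For instance, adding something like this to tmp.pyx (just a sketch; whether it actually beats the static schedule depends on your core count and array shape) hands out progressively smaller blocks of rows as threads finish their work:

def parallel_guided(double[:, :] a):
    # same reduction as parallel(), but with OpenMP's guided schedule:
    # each thread grabs a large block of rows first, then smaller ones
    cdef:
        double total = 0
        int i, j
    for i in prange(a.shape[0], nogil=True, schedule='guided'):
        for j in range(a.shape[1]):
            total += a[i, j]
    return total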