I started working on a SciPy interface to Fortran libraries (BLAS/LAPACK) as one can see here: Calling BLAS / LAPACK directly using the SciPy interface and Cython and came up with a solution but had to resort to using numpy.zeros which in effect killed any speed gains from calling the Fortran code directly.  The issue was the Fortran code needs an 0 valued output matrix (it operates on the matrix in-place in memory) to match the Numpy version (np.outer).  
So I was a bit perplexed since a 1000x1000 zeros matrix in Python only take 8us (using %timeit, or 0.008ms) so why did adding Cython code kill the runtime, noting I'm also creating it on a memoryview? (basically it goes from 3ms to 8ms or so on a 1000 x 1000 matrix multiplication).  Then after searching around on SO I found a solution elsewhere using memset as the fastest array changing mechanism.  So I rewrote the whole code from the referenced post to the newer memoryview format and got to something like this (Cython cyblas.pyx file):
import cython
import numpy as np
cimport numpy as np
from libc.string cimport memset #faster than np.zeros
REAL = np.float64
ctypedef np.float64_t REAL_t
ctypedef np.uint64_t  INT_t
cdef int ONE = 1
cdef REAL_t ONEF = <REAL_t>1.0
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.cdivision(True)
cpdef outer_prod(double[::1] _x, double[::1] _y, double[:, ::1] _output):
    cdef int M = _y.shape[0]
    cdef int N = _x.shape[0]
    memset(&_output[0,0], 0, M*N)
    with nogil:
        dger(&M, &N, &ONEF, &_y[0], &ONE, &_x[0], &ONE, &_output[0,0], &M)
TEST SCRIPT
import numpy as np;
from cyblas import outer_prod;
a=np.random.randint(0,100, 1000);
b=np.random.randint(0,100, 1000);
a=a.astype(np.float64)
b=b.astype(np.float64)
cy_outer=np.zeros((a.shape[0],b.shape[0]));
np_outer=np.zeros((a.shape[0],b.shape[0]));
%timeit outer_prod(a,b, cy_outer)
%timeit np.outer(a,b, np_outer)
So this fixed my issue of resetting the output matrix values BUT ONLY TO ROW 125, BEYOND THAT THIS ISSUE STILL EXISTS (resolved see below).  I thought setting the memset length parameter to M*N would clear 1000*1000 in memory but apparently not.
Does anyone have an idea on how to reset the whole output matrix to 0 using memset?  Much appreciated.
