Matrix multiplication using hdf5

I use HDF5 (PyTables) for big matrix multiplication, but I was surprised to find that the HDF5 version runs even faster than plain numpy.dot with the matrices stored in RAM. What is the reason for this behavior?
Also, is there a faster function for matrix multiplication in Python? I still use numpy.dot for the small block multiplications.
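As an aside on that second question: if numpy is linked against a fast BLAS, numpy.dot already dispatches to the BLAS gemm routine for 2-D float arrays, so there is usually little to gain. One possible alternative for many small blocks is calling the BLAS routine directly through scipy, which skips a little Python-side overhead (a minimal sketch; it assumes scipy is installed, which the benchmark below does not otherwise need):

import numpy as np
from scipy.linalg import blas

a = np.random.rand(128, 128)
b = np.random.rand(128, 128)
c = blas.dgemm(alpha=1.0, a=a, b=b)  # same result as np.dot(a, b)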
Here is some code. Assume the matrices can fit in RAM; the test uses matrices of size 10*1000 x 1000.
Using the default numpy build (I think it has no BLAS library):
Plain numpy, everything in RAM: 9.48 s.
A, B in RAM, C on disk: 1.48 s.
A, B, C on disk: 372.25 s.
If I use numpy with MKL, the corresponding times are 0.15 s, 0.45 s and 43.5 s.
The results look reasonable, but I still don't understand why the blocked multiplication is faster than plain numpy.dot in the first comparison, when A and B are kept in RAM.
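As a side note, whether the default build really lacks a BLAS can be checked with numpy itself:

import numpy as np
np.show_config()  # lists the BLAS/LAPACK libraries this numpy build links against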
import numpy as np
import tables
import time

n_row=1000
n_col=1000
n_batch=10
def test_plain_numpy():
    A=np.random.rand(n_row,n_col) # float64 by default
    B=np.random.rand(n_col,n_row)
    t0= time.time()
    res= np.dot(A,B)
    print (time.time()-t0)
#A,B in RAM, C on disk
def test_hdf5_ram():
    rows = n_row
    cols = n_col
    batches = n_batch
    #using numpy array
    A=np.random.rand(n_row,n_col)
    B=np.random.rand(n_col,n_row)
    #settings for all hdf5 files
    atom = tables.Float32Atom() # single precision on disk (note A and B above are float64)
    filters = tables.Filters(complevel=9, complib='blosc') # compression settings, worth tuning
    Nchunk = 128  # chunk edge length, worth tuning
    chunkshape = (Nchunk, Nchunk)
    chunk_multiple = 1
    block_size = chunk_multiple * Nchunk
    #using hdf5
    fileName_C = 'CArray_C.h5'
    shape = (A.shape[0], B.shape[1])
    h5f_C = tables.open_file(fileName_C, 'w')
    C = h5f_C.create_carray(h5f_C.root, 'CArray', atom, shape, chunkshape=chunkshape, filters=filters)
    sz= block_size
    t0= time.time()
    for i in range(0, A.shape[0], sz):
        for j in range(0, B.shape[1], sz):
            for k in range(0, A.shape[1], sz):
                C[i:i+sz,j:j+sz] += np.dot(A[i:i+sz,k:k+sz],B[k:k+sz,j:j+sz])
    print (time.time()-t0)
    h5f_C.close()
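# To separate the effect of the 128x128 blocking (which may simply have better
# cache behavior) from the effect of HDF5 itself, the same triple loop can be
# run with C as a plain in-RAM numpy array. A minimal sketch reusing the
# constants above; test_block_numpy is a made-up name, not one of the timings:
def test_block_numpy(sz=128):
    A = np.random.rand(n_row, n_col)
    B = np.random.rand(n_col, n_row)
    C = np.zeros((A.shape[0], B.shape[1]))
    t0 = time.time()
    for i in range(0, A.shape[0], sz):
        for j in range(0, B.shape[1], sz):
            for k in range(0, A.shape[1], sz):
                C[i:i+sz,j:j+sz] += np.dot(A[i:i+sz,k:k+sz],B[k:k+sz,j:j+sz])
    print (time.time()-t0)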
def test_hdf5_disk():
    rows = n_row
    cols = n_col
    batches = n_batch
    #settings for all hdf5 files
    atom = tables.Float32Atom() # single precision on disk; the float64 random data written below is cast
    filters = tables.Filters(complevel=9, complib='blosc') # compression settings, worth tuning
    Nchunk = 128  # chunk edge length, worth tuning
    chunkshape = (Nchunk, Nchunk)
    chunk_multiple = 1
    block_size = chunk_multiple * Nchunk
    fileName_A = 'carray_A.h5'
    shape_A = (n_row*n_batch, n_col)  # predefined size
    h5f_A = tables.open_file(fileName_A, 'w')
    A = h5f_A.create_carray(h5f_A.root, 'CArray', atom, shape_A, chunkshape=chunkshape, filters=filters)
    for i in range(batches):
        data = np.random.rand(n_row, n_col)
        A[i*n_row:(i+1)*n_row]= data[:]
    rows = n_col
    cols = n_row
    batches = n_batch
    fileName_B = 'carray_B.h5'
    shape_B = (rows, cols*batches)  # predefined size
    h5f_B = tables.open_file(fileName_B, 'w')
    B = h5f_B.create_carray(h5f_B.root, 'CArray', atom, shape_B, chunkshape=chunkshape, filters=filters)
    sz= rows//batches # integer division: the slice bounds below must be ints
    for i in range(batches):
        data = np.random.rand(sz, cols*batches)
        B[i*sz:(i+1)*sz]= data[:]
    fileName_C = 'CArray_C.h5'
    shape = (A.shape[0], B.shape[1])
    h5f_C = tables.open_file(fileName_C, 'w')
    C = h5f_C.create_carray(h5f_C.root, 'CArray', atom, shape, chunkshape=chunkshape, filters=filters)
    sz= block_size
    t0= time.time()
    for i in range(0, A.shape[0], sz):
        for j in range(0, B.shape[1], sz):
            for k in range(0, A.shape[1], sz):
                C[i:i+sz,j:j+sz] += np.dot(A[i:i+sz,k:k+sz],B[k:k+sz,j:j+sz])
    print (time.time()-t0)
    h5f_A.close()
    h5f_B.close()
    h5f_C.close()
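# For completeness, a minimal driver for the three timings above:
if __name__ == '__main__':
    test_plain_numpy()
    test_hdf5_ram()
    test_hdf5_disk()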