How can I optimize a dot product along a small dimension in numpy?

Question

I have two np.ndarrays

a is an array of shape (13000, 8, 315000) and type uint8
b is an array of shape (8,) and type float32

I want to multiply each slice along the second dimension (8) by the corresponding element in b and sum along that dimension (i.e. a dot product along the second axis). The result will be of shape (13000, 315000)

I have devised two ways of doing this:

np.einsum('ijk,j->ik', a, b): using %timeit it gives 49 s ± 12.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
np.dot(a.transpose(0, 2, 1), b): using %timeit it gives 1min 8s ± 3.54 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
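As a sanity check, the two formulations above can be compared on a tiny array. This is only a sketch: the names `a_small` and `b_small` and the reduced shape `(4, 8, 5)` are made up for illustration, and it says nothing about timings at the full size.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny stand-ins for the real arrays (same dtypes, much smaller shape).
a_small = rng.integers(0, 256, size=(4, 8, 5), dtype=np.uint8)
b_small = rng.random(8, dtype=np.float32)

# Formulation 1: contract the size-8 axis directly.
via_einsum = np.einsum('ijk,j->ik', a_small, b_small)

# Formulation 2: move the size-8 axis last, then take the dot product.
via_dot = np.dot(a_small.transpose(0, 2, 1), b_small)

print(via_einsum.shape)
print(np.allclose(via_einsum, via_dot))
```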

Are there faster alternatives?

Complementary information

np.show_config() returns:

blas_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    language = c
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['openblas', 'openblas']
    language = c
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
blis_info:
  NOT AVAILABLE
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    language = c
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    libraries = ['openblas', 'openblas']
    language = c
    library_dirs = ['/usr/local/lib']
    define_macros = [('HAVE_CBLAS', None)]

a.flags:

C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : True
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False

b.flags:

C_CONTIGUOUS : True
F_CONTIGUOUS : True
OWNDATA : True
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False

Have you tried `optimize=True` in [`np.einsum`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.einsum.html)? — Daniel F, Aug 21 '18 at 13:07
Could you rearrange your data to have the dimension of size 8 last? I don't mean just rolling the axis. I mean actually rearranging the data. E.g., `np.rollaxis` followed by `np.array(..., copy=True)` or so. — Mad Physicist, Aug 21 '18 at 13:30
A duplicate question yesterday showed that `dot` is notoriously slow for this. Have you tried `b@a`? Or `tensordot`? — hpaulj, Aug 21 '18 at 14:27
`uint8` saves memory, but slows down computations that use compiled code (that works with floats). — hpaulj, Aug 21 '18 at 14:55
It would require ~122GB of RAM in `float32`. I'll stick with `uint8` for the moment :-) but thanks for the remark — iacolippo, Aug 21 '18 at 15:46
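For reference, the alternatives suggested in the comments above can be spelled out as follows. This is a minimal sketch on a small made-up array (`a_small`, `b_small` are illustrative names); it only checks that each variant computes the same result as `einsum` and makes no claim about which is faster.

```python
import numpy as np

rng = np.random.default_rng(0)
a_small = rng.integers(0, 256, size=(4, 8, 5), dtype=np.uint8)
b_small = rng.random(8, dtype=np.float32)

reference = np.einsum('ijk,j->ik', a_small, b_small)

# hpaulj's suggestions: the matmul operator and tensordot.
via_matmul = b_small @ a_small
via_tensordot = np.tensordot(a_small, b_small, axes=(1, 0))

# Mad Physicist's suggestion: physically rearrange the data so the
# size-8 axis is last (this makes a full copy of the array).
a_last = np.ascontiguousarray(np.moveaxis(a_small, 1, -1))
via_rearranged = a_last @ b_small

for result in (via_matmul, via_tensordot, via_rearranged):
    print(result.shape, np.allclose(result, reference))
```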

Divakar · Accepted Answer · 2018-08-21T13:39:10.033

For large data, we can use the numexpr module to leverage multiple cores and gain memory efficiency, and hence performance -

import numexpr as ne

d = {'a0': a[:, 0], 'b0': b[0], 'a1': a[:, 1], 'b1': b[1],
     'a2': a[:, 2], 'b2': b[2], 'a3': a[:, 3], 'b3': b[3],
     'a4': a[:, 4], 'b4': b[4], 'a5': a[:, 5], 'b5': b[5],
     'a6': a[:, 6], 'b6': b[6], 'a7': a[:, 7], 'b7': b[7]}
eval_str = 'a0*b0 + a1*b1 + a2*b2 + a3*b3 + a4*b4 + a5*b5 + a6*b6 + a7*b7'
out = ne.evaluate(eval_str, d)

Sample run for timings -

In [474]: # Setup roughly 10x smaller along each large axis than the posted one, as my system can't handle the full size
     ...: np.random.seed(0)
     ...: a = np.random.randint(0,9,(1000,8,30000)).astype(np.uint8)
     ...: b = np.random.rand(8).astype(np.float32)

In [478]: %timeit np.einsum('ijk,j->ik', a, b)
1 loop, best of 3: 247 ms per loop

# einsum with optimize flag set as True
In [479]: %timeit np.einsum('ijk,j->ik', a, b, optimize=True)
1 loop, best of 3: 248 ms per loop

In [480]: d = {'a0':a[:,0],'b0':b[0],'a1':a[:,1],'b1':b[1],\
     ...:      'a2':a[:,2],'b2':b[2],'a3':a[:,3],'b3':b[3],\
     ...:      'a4':a[:,4],'b4':b[4],'a5':a[:,5],'b5':b[5],\
     ...:      'a6':a[:,6],'b6':b[6],'a7':a[:,7],'b7':b[7]}

In [481]: eval_str = 'a0*b0 + a1*b1 + a2*b2 + a3*b3 + a4*b4 + a5*b5 + a6*b6 + a7*b7'

In [482]: %timeit ne.evaluate(eval_str,d)
10 loops, best of 3: 94.3 ms per loop

~2.6x improvement there.

A less error-prone and more generic way to build the operand dictionary and the expression string could be like so -

d = {'a'+str(i):a[:,i] for i in range(8)}
d.update({'b'+str(i):b[i] for i in range(8)})
eval_str = ' + '.join(['a'+str(i)+'*'+'b'+str(i) for i in range(8)])
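The generic construction above can be wrapped into a small function and checked against `einsum`. This is a sketch under a few assumptions: the function name `numexpr_weighted_sum` and the small test arrays are made up for illustration, and the check is for equality of results only, not speed.

```python
import numexpr as ne
import numpy as np


def numexpr_weighted_sum(a, b):
    """Sum a[:, i] * b[i] over axis 1 of a, evaluated by numexpr.

    Hypothetical helper: builds the operand dict and expression string
    for however many slices a has along axis 1.
    """
    n_terms = a.shape[1]
    operands = {f'a{i}': a[:, i] for i in range(n_terms)}
    operands.update({f'b{i}': b[i] for i in range(n_terms)})
    expression = ' + '.join(f'a{i}*b{i}' for i in range(n_terms))
    return ne.evaluate(expression, local_dict=operands)


# Small made-up inputs with the same dtypes as in the question.
rng = np.random.default_rng(0)
a_small = rng.integers(0, 256, size=(4, 8, 5), dtype=np.uint8)
b_small = rng.random(8, dtype=np.float32)

out_small = numexpr_weighted_sum(a_small, b_small)
print(out_small.shape)
print(np.allclose(out_small, np.einsum('ijk,j->ik', a_small, b_small)))
```

Passing the operands explicitly through `local_dict` keeps the function independent of whatever variable names happen to exist in the calling scope.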

Wow! This is really impressive. Gained a factor 7x. — iacolippo, Aug 21 '18 at 14:34
