Vectorized way of calculating row-wise dot product two matrices with Scipy

Question

I want to calculate the row-wise dot product of two matrices of the same dimension as fast as possible. This is the way I am doing it:

import numpy as np
a = np.array([[1,2,3], [3,4,5]])
b = np.array([[1,2,3], [1,2,3]])
result = np.array([])
for row1, row2 in a, b:
    result = np.append(result, np.dot(row1, row2))
print result

and of course the output is:

[ 26.  14.]

Interestingly, your example computes dot products *within* matrices, and not between corresponding rows from different matrices. — paperskilltrees, Aug 31 '22 at 05:43

score 65 · Answer 1 · answered Jun 24 '15 at 08:44

Straightforward way to do that is:

import numpy as np
a=np.array([[1,2,3],[3,4,5]])
b=np.array([[1,2,3],[1,2,3]])
np.sum(a*b, axis=1)

which avoids the python loop and is faster in cases like:

def npsumdot(x, y):
    return np.sum(x*y, axis=1)

def loopdot(x, y):
    result = np.empty((x.shape[0]))
    for i in range(x.shape[0]):
        result[i] = np.dot(x[i], y[i])
    return result

timeit npsumdot(np.random.rand(500000,50),np.random.rand(500000,50))
# 1 loops, best of 3: 861 ms per loop
timeit loopdot(np.random.rand(500000,50),np.random.rand(500000,50))
# 1 loops, best of 3: 1.58 s per loop

Warren Weckesser · Accepted Answer · 2021-09-03T15:42:00.880

48

Check out numpy.einsum for another method:

In [52]: a
Out[52]: 
array([[1, 2, 3],
       [3, 4, 5]])

In [53]: b
Out[53]: 
array([[1, 2, 3],
       [1, 2, 3]])

In [54]: einsum('ij,ij->i', a, b)
Out[54]: array([14, 26])

Looks like einsum is a bit faster than inner1d:

In [94]: %timeit inner1d(a,b)
1000000 loops, best of 3: 1.8 us per loop

In [95]: %timeit einsum('ij,ij->i', a, b)
1000000 loops, best of 3: 1.6 us per loop

In [96]: a = random.randn(10, 100)

In [97]: b = random.randn(10, 100)

In [98]: %timeit inner1d(a,b)
100000 loops, best of 3: 2.89 us per loop

In [99]: %timeit einsum('ij,ij->i', a, b)
100000 loops, best of 3: 2.03 us per loop

Note: NumPy is constantly evolving and improving; the relative performance of the functions shown above has probably changed over the years. If performance is important to you, run your own tests with the version of NumPy that you will be using.

edited Sep 03 '21 at 15:42

answered Mar 25 '13 at 19:17

Warren Weckesser

110,654
19
194
214

2

I like einsum very much and it is true that you can avoid loops with it. However, if performance and not code style is your primary concern, you might still be better off with dot and the loop (depending on your specific data and system environment). In contrast to einsum, dot can take advantage of BLAS and will often multithread automatically. http://mail.scipy.org/pipermail/numpy-discussion/2012-October/064277.html – PiQuer Dec 14 '14 at 23:46
@Warren, would like a detailed explanation of the mysterious subscript notation. – CKM Aug 26 '19 at 11:31
1

@PiQuer, The Python-native for-loop is recommended? Even with transparent threading (which I don't get here) the looping is still in Python. I'd honestly expect the inner dot to be finished before the upper loop in Python is ready to execute the next dot. Can you elaborate? – Noein Jul 21 '20 at 17:30
1

I've seen native-for loops perform orders of magnitude slower than numpy operations, I'll need to see some evidence and an explanation why before I'll believe that using a native-for is the performant solution. `einsum` seems like a great tool though, I'm glad I was finally forced to learn about it. – Nathan Chappell Aug 04 '20 at 08:38

Nico Schlömer · Answer 3 · 2021-02-22T09:23:48.483

Played around with this and found inner1d the fastest. That function however is internal, so a more robust approach is to use

numpy.einsum("ij,ij->i", a, b)

Even better is to align your memory such that the summation happens in the first dimension, e.g.,

a = numpy.random.rand(3, n)
b = numpy.random.rand(3, n)

numpy.einsum("ij,ij->j", a, b)

For 10 ** 3 <= n <= 10 ** 6, this is the fastest method, and up to twice as fast as its untransposed equivalent. The maximum occurs when the level-2 cache is maxed out, at about 2 * 10 ** 4.

Note also that the transposed summation is much faster than its untransposed equivalent.

The plot was created with perfplot (a small project of mine)

import numpy
from numpy.core.umath_tests import inner1d
import perfplot


def setup(n):
    a = numpy.random.rand(n, 3)
    b = numpy.random.rand(n, 3)
    aT = numpy.ascontiguousarray(a.T)
    bT = numpy.ascontiguousarray(b.T)
    return (a, b), (aT, bT)


b = perfplot.bench(
    setup=setup,
    n_range=[2 ** k for k in range(1, 25)],
    kernels=[
        lambda data: numpy.sum(data[0][0] * data[0][1], axis=1),
        lambda data: numpy.einsum("ij, ij->i", data[0][0], data[0][1]),
        lambda data: numpy.sum(data[1][0] * data[1][1], axis=0),
        lambda data: numpy.einsum("ij, ij->j", data[1][0], data[1][1]),
        lambda data: inner1d(data[0][0], data[0][1]),
    ],
    labels=["sum", "einsum", "sum.T", "einsum.T", "inner1d"],
    xlabel="len(a), len(b)",
)

b.save("out1.png")
b.save("out2.png", relative_to=3)

Unfortunately `inner1d` is deprecated (or rather removed, not part of the API) - see https://github.com/numpy/numpy/issues/10815 — Mr_and_Mrs_D, Nov 02 '18 at 13:21

score 5 · Answer 4 · edited May 23 '17 at 12:10

You'll do better avoiding the append, but I can't think of a way to avoid the python loop. A custom Ufunc perhaps? I don't think numpy.vectorize will help you here.

import numpy as np
a=np.array([[1,2,3],[3,4,5]])
b=np.array([[1,2,3],[1,2,3]])
result=np.empty((2,))
for i in range(2):
    result[i] = np.dot(a[i],b[i]))
print result

EDIT

Based on this answer, it looks like inner1d might work if the vectors in your real-world problem are 1D.

from numpy.core.umath_tests import inner1d
inner1d(a,b)  # array([14, 26])

Engineero · Answer 5 · 2023-03-18T01:59:48.123

I came across this answer and re-verified the results with Numpy 1.14.3 running in Python 3.5. For the most part the answers above hold true on my system, although I found that for very large matrices (see example below), all but one of the methods are so close to one another that the performance difference is meaningless.

For smaller matrices, I found that einsum was the fastest by a considerable margin, up to a factor of two in some cases.

My large matrix example:

import numpy as np
from numpy.core.umath_tests import inner1d
a = np.random.randn(100, 1000000)  # 800 MB each
b = np.random.randn(100, 1000000)  # pretty big.

def loop_dot(a, b):
    result = np.empty((a.shape[1],))
    for i, (row1, row2) in enumerate(zip(a, b)):
        result[i] = np.dot(row1, row2)

%timeit inner1d(a, b)
# 128 ms ± 523 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.einsum('ij,ij->i', a, b)
# 121 ms ± 402 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.sum(a*b, axis=1)
# 411 ms ± 1.99 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit loop_dot(a, b)  # note the function call took negligible time
# 123 ms ± 342 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

So einsum is still the fastest on very large matrices, but by a tiny amount. It appears to be a statistically significant (tiny) amount though!

Vectorized way of calculating row-wise dot product two matrices with Scipy

5 Answers5

Linked

Related