How to avoid Kronecker product in NumPy computations

Question

Background

After generating a list of random weights:

sizes = [784,30,10]
weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1],sizes[1:])]

utilize Numpy's Kronecker product to create foo (with shape (900, 23520)):

foo = np.kron(np.identity(30),weights[0])

Then, matrix multiply foo with a slice from data, namely,

bar = np.dot(foo,data[0])

whereby the data[0].shape is (23520,) and data[0].dtype is float32.

Question

foo is rather wasteful. How can weights[0], which has shape (30,784), be utilized for the multiplication with data[0] in a more resourceful manner?

More generally, data[0] is a slice from an array with shape (1666,23520), so the multiplication procedure will need to be carried out 1666 times. Also, the data array is close to sparse with less than 20% of entries being non-zero.

Here's the loop I had tried:

for i in range(len(data)):
    foo = np.kron(np.identity(30),weights[0])
    bar = np.dot(foo,data[i])

And you need to do this for all rows - `data[1]`, `data[2]` .. `data[1665]`? By sparse, do you mean the type is sparse matrix? — Divakar, Jan 02 '18 at 08:40
@Divakar yeah. I tried it with a for loop and the computation took far too long. Yes, the data matrix is sparse. Mostly entries 0. — sunspots, Jan 02 '18 at 08:41
@PaulPanzer good question. Less than 20% of entries are non-zero. — sunspots, Jan 02 '18 at 11:37
That's borderline. You'd have to test whether using a sparse matrix for `data` would make things faster. The good news is it's easy. And in any case it will save you some memory. — Paul Panzer, Jan 02 '18 at 12:46

Divakar · Accepted Answer · 2018-01-02T09:36:29.083

The trick is to reshape data into 3D tensor and then use np.tensordot against weights[0] and thus by-pass foo creation, like so -

k = 30 # kernel size
data3D = data.reshape(data.shape[0],k,-1)
out = np.tensordot(data3D, weights[0], axes=(2,1)).reshape(-1,k**2)

Under the hoods, tensordot uses transposing axes, reshaping and then np.dot. So, using all that manual-labor to avoid the function call to tensordot, we would have one, like so -

out = data.reshape(-1,data.shape[1]//k).dot(weights[0].T).reshape(-1,k**2)

Related post to understand tensordot.

Sample run

Let's use a toy example to explain on what's going on to people who might not have understand the problem :

In [68]: # Toy setup and code run with original codes
    ...: k = 3 # kernel size, which is 30 in the original case
    ...: 
    ...: data = np.random.rand(4,6)
    ...: w0 = np.random.rand(3,2) # this is weights[0]
    ...: foo = np.kron(np.identity(k), w0)
    ...: output_first_row = foo.dot(data[0])

So, the question is to get rid of the foo creation step and get to output_first_row and do this for all rows of data.

The proposed solution is :

...: data3D = data.reshape(data.shape[0],k,-1)
...: vectorized_out = np.tensordot(data3D, w0, axes=(2,1)).reshape(-1,k**2)

Let's verify the results :

In [69]: output_first_row
Out[69]: array([ 0.11,  0.13,  0.34,  0.67,  0.53,  1.51,  0.17,  0.16,  0.44])

In [70]: vectorized_out
Out[70]: 
array([[ 0.11,  0.13,  0.34,  0.67,  0.53,  1.51,  0.17,  0.16,  0.44],
       [ 0.43,  0.23,  0.73,  0.43,  0.38,  1.05,  0.64,  0.49,  1.41],
       [ 0.57,  0.45,  1.3 ,  0.68,  0.51,  1.48,  0.45,  0.28,  0.85],
       [ 0.41,  0.35,  0.98,  0.4 ,  0.24,  0.75,  0.22,  0.28,  0.71]])

Runtime test for all proposed approaches -

In [30]: import numpy as np

In [31]: sizes = [784,30,10]

In [32]: weights = [np.random.rand(y, x) for x, y in zip(sizes[:-1],sizes[1:])]

In [33]: data = np.random.rand(1666,23520)

In [37]: k = 30 # kernel size

# @Paul Panzer's soln
In [38]: %timeit (weights[0] @ data.reshape(-1, 30, 784).swapaxes(1, 2)).swapaxes(1, 2)
1 loops, best of 3: 707 ms per loop

In [39]: %timeit np.tensordot(data.reshape(data.shape[0],k,-1), weights[0], axes=(2,1)).reshape(-1,k**2)
10 loops, best of 3: 114 ms per loop

In [40]: %timeit data.reshape(-1,data.shape[1]//k).dot(weights[0].T).reshape(-1,k**2)
10 loops, best of 3: 118 ms per loop

This Q&A and the comments under, might help understand how tensordot works better with tensors.

Paul Panzer · Answer 2 · 2018-01-02T10:39:28.307

You are essentially doing matrix-matrix multiplication where the first factor is weights[0] and the second is data[i] chopped up into 30 equal slices that form the columns.

import numpy as np

sizes = [784,30,10]
weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1],sizes[1:])]

k = 2
# create sparse data
data = np.maximum(np.random.uniform(-100, 1, (k, 23520)), 0)

foo = np.kron(np.identity(30),weights[0])

# This is the original loop as a list comprehension
bar = [np.dot(foo,d) for d in data]

# This is the equivalent using matrix multiplication.
# We can take advantage of the fact that the '@' operator
# can do batch matrix multiplication (it uses the last two
# dimensions as the matrix and all others as batch index).
# The reshape does the chopping up but gives us rows where columns
# are required, hence the first swapaxes.
# The second swapaxes is to make the result directly comparable to
# the `np.kron` based result.
bar2 = (weights[0] @ data.reshape(k, 30, 784).swapaxes(1, 2)).swapaxes(1, 2)

# Instead of letting numpy do the batching we can glue all the
# columns of all the second factors together into one matrix   
bar3 = (weights[0] @ data.reshape(-1, 784).T).T.reshape(k, -1)

# This last formulation works more or less unchanged on sparse data
from scipy import sparse
dsp = sparse.csr_matrix(data.reshape(-1, 784))
bar4 = (weights[0] @ dsp.T).T.reshape(k, -1)


print(np.allclose(bar, bar2.reshape(k, -1)))
print(np.allclose(bar, bar3))
print(np.allclose(bar, bar4))

Prints:

True
True
True

How to avoid Kronecker product in NumPy computations

Background

Question

2 Answers2