Following @DSM's comment, I'm assuming your C[k] == i is supposed to be B[k] == i. If that's the case, does your loop version look something like this?
Nested Loop Version
import numpy as np
N = 5
M = 2
A = np.zeros((M,N))
B = np.random.randint(M, size=N) # contains indices for A
C = np.random.rand(N,N)
for i in range(M):
    for j in range(N):
        for k in range(N):
            if B[k] == i:
                A[i,j] += C[j,k]
There's more than one way to vectorize this problem. I'm going to show my thought process below, but there are more efficient ways to do it (e.g. @DSM's version that recognizes the matrix multiplication inherent in the problem).
For the sake of explanation, here's a walk-through of one approach.
Vectorizing the Inner Loop
Let's start by re-writing the inner k loop:
for i in range(M):
    for j in range(N):
        A[i,j] = C[j, B == i].sum()
It might be easier to think of this as C[j][B == i].sum(). We're just selecting the jth row of C, keeping only the elements in that row where B is equal to i, and summing them.
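As a quick illustration of that step (a minimal sketch with made-up row values rather than the random C above; B_example uses the same values shown in the broadcasting demo below):

row = np.array([10., 20., 30., 40., 50.])   # stand-in for C[j]
B_example = np.array([1, 1, 1, 1, 0])       # example index array

# Keep only the elements of this row where B equals i, then sum them:
print(row[B_example == 1].sum())   # 10 + 20 + 30 + 40 = 100.0
print(row[B_example == 0].sum())   # only the last element: 50.0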
Vectorizing the Outer-most Loop
Next let's break down the outer i loop. Now we're going to get to the point where readability will start to suffer, unfortunately...
i = np.arange(M)[:,np.newaxis]
mask = (B == i).astype(int)
for j in range(N):
    A[:,j] = (C[j] * mask).sum(axis=-1)
There are a couple different tricks here. In this case, we're iterating over the columns of A. Each column of A is the sum of a subset of the corresponding row of C. The subset of the row of C is determined by where B is equal to the row index i.
To get around iterating through i, we're making a 2D array where B == i by adding a new axis to i. (Have a look at the documentation for numpy broadcasting if you're confused by this.) In other words:
B:
array([1, 1, 1, 1, 0])
i:
array([[0],
       [1]])
B == i:
array([[False, False, False, False,  True],
       [ True,  True,  True,  True, False]], dtype=bool)
What we want is to take M (here, two) filtered sums of C[j], one for each row of B == i. This will give us a two-element vector corresponding to the jth column of A.
We can't do this by indexing C directly, because the result wouldn't maintain its shape: each row may have a different number of selected elements. We'll get around this by multiplying the B == i mask by the current row of C, which gives zeros where B == i is False and the values from the current row of C where it's True.
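To see the shape problem concretely (a quick sketch using the example B shown above; the row values are just random stand-ins):

B_example = np.array([1, 1, 1, 1, 0])
row = np.random.rand(5)            # stand-in for one row of C

print(row[B_example == 0].shape)   # (1,) -- only one element of B equals 0
print(row[B_example == 1].shape)   # (4,) -- four elements of B equal 1
# The two results can't be stacked into one rectangular array, but
# multiplying by a 0/1 mask keeps every row at its full length of 5.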
To do this, we need to turn the boolean array B == i into integers:
mask = (B == i).astype(int):
array([[0, 0, 0, 0, 1],
       [1, 1, 1, 1, 0]])
So when we multiply it by the current row of C:
C[j]:
array([ 0.19844887, 0.44858679, 0.35370919, 0.84074259, 0.74513377])
C[j] * mask:
array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.74513377],
       [ 0.19844887,  0.44858679,  0.35370919,  0.84074259,  0.        ]])
Then we can sum over each row to get the current column of A (this will be broadcast to a column when it's assigned to A[:,j]):
(C[j] * mask).sum(axis=-1):
array([ 0.74513377, 1.84148744])
Fully Vectorized Version
Finally, to break down the last loop, we can apply exactly the same principle and add a third dimension to handle the loop over j:
i = np.arange(M)[:,np.newaxis,np.newaxis]
mask = (B == i).astype(int)
A = (C * mask).sum(axis=-1)
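As a quick sanity check (a sketch reusing the B, C, M, and N defined at the top of the answer), this one-liner matches the original triple loop up to floating-point roundoff:

A_loop = np.zeros((M, N))
for i in range(M):
    for j in range(N):
        for k in range(N):
            if B[k] == i:
                A_loop[i, j] += C[j, k]

mask = (B == np.arange(M)[:, np.newaxis, np.newaxis]).astype(int)
print(np.allclose(A_loop, (C * mask).sum(axis=-1)))   # True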
@DSM's vectorized version
As @DSM suggested, you could also do:
A = (B == np.arange(M)[:,np.newaxis]).dot(C.T)
This is by far the fastest solution for most sizes of M and N, and arguably the most elegant (much more elegant than my solutions, anyway).
Let's break it down a bit.
The B == np.arange(M)[:,np.newaxis] is exactly equivalent to B == i in the "Vectorizing the Outer-most Loop" section above.
The key is in recognizing that all of the j and k loops are equivalent to matrix multiplication. dot will cast the boolean B == i array to the same dtype as C behind-the-scenes, so we don't need to worry about explicitly casting it to a different type.
After that, we're just matrix-multiplying the 2x5 "mask" array of zeros and ones from above by the transpose of C (a 5x5 array), yielding a 2x5 array.
dot will take advantage of any optimized BLAS libraries you have installed (e.g. ATLAS, MKL), so it's very fast.
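To make the "it's just matrix multiplication" point explicit (an illustrative sketch reusing B, C, and M from above; the np.einsum call is only there to spell out the index pattern and isn't needed in practice):

mask = (B == np.arange(M)[:, np.newaxis])        # shape (M, N), boolean

# The loops compute A[i, j] = sum_k mask[i, k] * C[j, k],
# which is exactly the matrix product mask @ C.T:
A_dot = mask.dot(C.T)
A_sum = np.einsum('ik,jk->ij', mask.astype(float), C)

print(np.allclose(A_dot, A_sum))   # True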
Timings
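For reference, the wrappers being timed below look roughly like this (a reconstruction of the versions shown above, with the names taken from the %timeit output; askewchan's and einsum's functions come from the other answers and aren't reproduced here):

def loops(B, C, M):
    N = C.shape[0]
    A = np.zeros((M, N))
    for i in range(M):
        for j in range(N):
            for k in range(N):
                if B[k] == i:
                    A[i, j] += C[j, k]
    return A

def k_vectorized(B, C, M):
    N = C.shape[0]
    A = np.zeros((M, N))
    for i in range(M):
        for j in range(N):
            A[i, j] = C[j, B == i].sum()
    return A

def ik_vectorized(B, C, M):
    N = C.shape[0]
    A = np.zeros((M, N))
    mask = (B == np.arange(M)[:, np.newaxis]).astype(int)
    for j in range(N):
        A[:, j] = (C[j] * mask).sum(axis=-1)
    return A

def vectorized(B, C, M):
    mask = (B == np.arange(M)[:, np.newaxis, np.newaxis]).astype(int)
    return (C * mask).sum(axis=-1)

def dsm(B, C, M):
    return (B == np.arange(M)[:, np.newaxis]).dot(C.T)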
For small M's and N's, the differences are less apparent (~6x between looping and DSM's version):
M, N = 2, 5
%timeit loops(B,C,M)
10000 loops, best of 3: 83 us per loop
%timeit k_vectorized(B,C,M)
10000 loops, best of 3: 106 us per loop
%timeit vectorized(B,C,M)
10000 loops, best of 3: 23.7 us per loop
%timeit askewchan(B,C,M)
10000 loops, best of 3: 42.7 us per loop
%timeit einsum(B,C,M)
100000 loops, best of 3: 15.2 us per loop
%timeit dsm(B,C,M)
100000 loops, best of 3: 13.9 us per loop
However, once M and N start to grow, the difference becomes very significant (~600x) (note the units!):
M, N = 50, 20
%timeit loops(B,C,M)
10 loops, best of 3: 50.3 ms per loop
%timeit k_vectorized(B,C,M)
100 loops, best of 3: 10.5 ms per loop
%timeit ik_vectorized(B,C,M)
1000 loops, best of 3: 963 us per loop
%timeit vectorized(B,C,M)
1000 loops, best of 3: 247 us per loop
%timeit askewchan(B,C,M)
1000 loops, best of 3: 493 us per loop
%timeit einsum(B,C,M)
10000 loops, best of 3: 134 us per loop
%timeit dsm(B,C,M)
10000 loops, best of 3: 80.2 us per loop