Iterate over all pairwise combinations of numpy array columns

Question

I have a numpy array of shape

arr.shape = (200, 600, 20).

I want to compute scipy.stats.kendalltau on every pairwise combination of the last two dimensions. For example:

kendalltau(arr[:, 0, 0], arr[:, 1, 0])
kendalltau(arr[:, 0, 0], arr[:, 1, 1])
kendalltau(arr[:, 0, 0], arr[:, 1, 2])
...
kendalltau(arr[:, 0, 0], arr[:, 2, 0])
kendalltau(arr[:, 0, 0], arr[:, 2, 1])
kendalltau(arr[:, 0, 0], arr[:, 2, 2])
...
...
kendalltau(arr[:, 598, 19], arr[:, 599, 19])

such that I cover all combinations of arr[:, i, xi] with arr[:, j, xj] with i < j and xi in [0,20), xj in [0, 20). This is (600 choose 2) * 400 individual calculations, but since each takes about 0.002 s on my machine, it shouldn't take much longer than a day with the multiprocessing module.
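As a quick sanity check on that estimate, the count and single-process runtime can be worked out directly. This is only arithmetic on the numbers quoted above; the variable names are illustrative and the 0.002 s figure is the timing stated in the question, not a measurement made here.

```python
import math

n_groups = 600            # size of the second axis
n_per_group = 20          # size of the last axis
seconds_per_call = 0.002  # per-call timing quoted in the question

n_pairs = math.comb(n_groups, 2)      # pairs (i, j) with i < j
n_calls = n_pairs * n_per_group ** 2  # every (xi, xj) for each pair
hours = n_calls * seconds_per_call / 3600

print(n_pairs, n_calls, round(hours, 1))  # 179700 71880000 39.9
```

So a single process would need roughly 40 hours, and splitting the work over a few processes is what brings it down to about a day or less.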

What's the best way to go about iterating over these columns (with i<j)? I figure I should avoid something like

for i in range(600):
    for j in range(i+1, 600):
        for xi in range(20):
            for xj in range(20):
                kendalltau(arr[:, i, xi], arr[:, j, xj])

What is the most numpythonic way of doing this?

Edit: I changed the title since Kendall Tau isn't really important to the question. I realize I could also do something like

import itertools as it
for i, j in it.combinations(range(600), 2):
    for xi, xj in it.product(range(20), range(20)):
        kendalltau(arr[:, i, xi], arr[:, j, xj])

but there's got to be a better, more vectorized way with numpy.

You want to look into recursion. This has been answered for Java: http://stackoverflow.com/questions/426878/is-there-any-way-to-do-n-level-nested-loops-in-java — Mike Vella, Aug 09 '13 at 20:25
I don't think recursion will use `numpy` the way it's supposed to be used, though. — wflynny, Aug 09 '13 at 20:28
Iteration seems to be [recommended over recursion](http://neopythonic.blogspot.com/2009/04/tail-recursion-elimination.html) in python. Numpy takes that another step with vectorization. Sure, recursion will work with numpy, but I figure there has got to be a more 'pythonic', iterative approach. — wflynny, Aug 09 '13 at 20:37
You could have a look at this discussion: http://stackoverflow.com/questions/16003217/n-d-version-of-itertools-combinations-in-numpy but I think you shouldn't get too hung up on this, itertools.combinations is fine! Worry about it when it causes you problems - Early optimization is the root of all evil. — Mike Vella, Aug 09 '13 at 20:43

score 17 · Accepted Answer · edited Oct 27 '15 at 18:58

The general way of vectorizing something like this is to use broadcasting to create the cartesian product of the set with itself. In your case you have an array arr of shape (200, 600, 20), so you would take two views of it:

arr_x = arr[:, :, np.newaxis, np.newaxis, :] # shape (200, 600, 1, 1, 20)
arr_y = arr[np.newaxis, np.newaxis, :, :, :] # shape (1, 1, 200, 600, 20)

The above two lines have been expanded for clarity, but I would normally write the equivalent:

arr_x = arr[:, :, None, None]
arr_y = arr
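To see what these two views broadcast to, here is a minimal sketch on a small stand-in array (the shape (3, 4, 5) is an arbitrary choice so the result fits comfortably in memory). It only inspects shapes and does not compute anything.

```python
import numpy as np

# Small stand-in for the (200, 600, 20) array.
arr = np.arange(3 * 4 * 5, dtype=float).reshape(3, 4, 5)

arr_x = arr[:, :, None, None]  # shape (3, 4, 1, 1, 5)
arr_y = arr                    # shape (3, 4, 5), broadcast as (1, 1, 3, 4, 5)

# np.broadcast reports the combined shape without evaluating any function.
pair_shape = np.broadcast(arr_x, arr_y).shape
print(arr_x.shape, pair_shape)

# Indexing with None produces a view, so no data is copied at this point.
shares_memory = np.shares_memory(arr_x, arr)
```

The broadcast shape is (3, 4, 3, 4, 5): every (i, j) slot is paired with every (k, l) slot, with the last axis left for the function to consume.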

If you have a vectorized function f that broadcasts over all but the last dimension, you can then do:

out = f(arr[:, :, None, None], arr)

And then out would be an array of shape (200, 600, 200, 600), with out[i, j, k, l] holding the value of f(arr[i, j], arr[k, l]). For instance, if you wanted to compute all the pairwise inner products, you could do:

from numpy.core.umath_tests import inner1d

out = inner1d(arr[:, :, None, None], arr)
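Since numpy.core.umath_tests is an internal NumPy module and may not be importable in recent releases, the same pairwise inner products can be sketched with np.einsum instead. This is a swapped-in alternative, not the call used above, and it runs on a small stand-in array of arbitrary shape (3, 4, 5).

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.standard_normal((3, 4, 5))  # small stand-in for the real array

# out[i, j, k, l] = sum over m of arr[i, j, m] * arr[k, l, m]
out = np.einsum('ijm,klm->ijkl', arr, arr)

# The same result via explicit broadcasting followed by a sum over the last axis.
out_bcast = (arr[:, :, None, None] * arr).sum(axis=-1)

print(out.shape)
```

The broadcasting version materializes the full five-dimensional product before summing, so it needs far more temporary memory than the einsum call.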

Unfortunately, scipy.stats.kendalltau is not vectorized like this. According to the docs:

"If arrays are not 1-D, they will be flattened to 1-D."

So you cannot go about it like this, and you are going to wind up doing Python nested loops, whether you write them out explicitly, use itertools, or disguise them under np.vectorize. That's going to be slow, both because of the iteration over Python variables and because you make a Python function call per iteration step, and both are expensive.

Do note that, when you can go the vectorized way, there is an obvious drawback: if your function is symmetric, i.e. if f(a, b) == f(b, a), then you are doing twice the computations needed. Depending on how expensive your actual computation is, this is very often offset by the increase in speed from not having any Python loops or function calls.
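If the doubled work does matter, one possible way around it is to index only the unordered pairs with np.triu_indices. The sketch below assumes the symmetric function is an inner product and uses a small stand-in array. The names vecs, a and b are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.standard_normal((3, 4, 5))
vecs = arr.reshape(-1, arr.shape[-1])  # (12, 5): one row per (i, j) slot

# Index pairs (a, b) with a < b, i.e. the strict upper triangle.
a, b = np.triu_indices(len(vecs), k=1)

# Evaluate the symmetric function once per unordered pair.
unique_vals = np.einsum('pm,pm->p', vecs[a], vecs[b])

# Cross-check against the full, redundant table.
full = vecs @ vecs.T
```

The fancy indexing vecs[a] and vecs[b] copies data, so this trades the redundant arithmetic for extra memory, and which side wins depends on the function and the array size.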

score 0 · Answer 2 · answered Aug 09 '13 at 20:39

If you don't want to use recursion you should generally be using itertools.combinations. There is no specific reason (afaik) why this should cause your code to run slower. The computationally-intensive parts are still being handled by numpy. Itertools also has the advantage of readability.
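A minimal sketch of that itertools approach, using the array layout from the question (samples along the first axis). The helper pairwise_stat and the stand-in statistic pearson are hypothetical names introduced here so the example needs only NumPy. For the actual problem, stat would be a small wrapper that returns the first element of the scipy.stats.kendalltau result.

```python
import itertools as it
import numpy as np

def pairwise_stat(arr, stat):
    """Apply stat(x, y) to arr[:, i, xi] and arr[:, j, xj] for every i < j.

    Returns an array of shape (n, n, m, m); entries with i >= j stay NaN.
    """
    _, n, m = arr.shape
    out = np.full((n, n, m, m), np.nan)
    for i, j in it.combinations(range(n), 2):
        for xi, xj in it.product(range(m), repeat=2):
            out[i, j, xi, xj] = stat(arr[:, i, xi], arr[:, j, xj])
    return out

def pearson(x, y):
    # Stand-in statistic (an assumption for this sketch), not Kendall's tau.
    return np.corrcoef(x, y)[0, 1]

rng = np.random.default_rng(0)
small = rng.standard_normal((50, 4, 3))  # tiny version of the (200, 600, 20) array
result = pairwise_stat(small, pearson)
```

At full size a dense (600, 600, 20, 20) float64 output takes about 1.15 GB with half of it unused, so a more compact layout may be preferable there.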

Mike Vella

10,187
14
59
86

Yes, I had edited my post a few minutes ago to include that option. – wflynny Aug 09 '13 at 20:41
