I'd like to know how I might be able to transform this problem to reduce the overhead of the np.sum() function calls in my code.
I have an input matrix, say of shape=(1000, 36). Each row represents a node in a graph. The operation I am doing iterates over each row and does an element-wise addition of a variable number of other rows. Those "other" rows are defined in a dictionary nodes_nbrs that records, for each row, the list of rows that must be summed together. An example is as such:
nodes_nbrs = {0: [0, 1],
              1: [1, 0, 2],
              2: [2, 1],
              ...}
Here, node 0 would be transformed into the sum of nodes 0 and 1. Node 1 would be transformed into the sum of nodes 1, 0, and 2. And so on for the rest of the nodes.
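Concretely, with the example dictionary above:

output[0] = input[0] + input[1]
output[1] = input[1] + input[0] + input[2]
output[2] = input[2] + input[1]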
The current (and naive) way I have implemented this is as follows. I first instantiate a zero array of the final shape that I want, and then iterate over each key-value pair in the nodes_nbrs dictionary:
output = np.zeros(shape=input.shape)
for k, v in nodes_nbrs.items():
    output[k] = np.sum(input[v], axis=0)
This code is all cool and fine in small tests (shape=(1000, 36)), but on larger tests (shape of roughly (1E5 to 1E6, 36)), it takes ~2-3 seconds to complete. I end up having to do this operation thousands of times, so I'm trying to see if there's a more optimized way of doing this.
After doing line profiling, I noticed that the key killer here is calling the np.sum function over and over, which takes about 50% of the total time. Is there a way I can eliminate this overhead? Or is there another way I can optimize this?
Apart from that, here is a list of things I have done, and (very briefly) their results:
- A cython version: eliminates the for loop type-checking overhead. 30% reduction in time taken. With the cython version, np.sum takes about 80% of the total wall clock time, rather than 50%.
- Pre-declare np.sum as a variable npsum, and then call npsum inside the for loop. No difference from the original.
- Replace np.sum with np.add.reduce, assign that to the variable npsum, and then call npsum inside the for loop. ~10% reduction in wall clock time, but then incompatible with autograd (explanation below, in the sparse matrices bullet point).
- numba JIT-ing: did not attempt more than adding the decorator. No improvement, but I didn't try harder.
- Convert the nodes_nbrs dictionary into a dense numpy binary array (1s and 0s), and then do a single np.dot operation (see the sketch after this list). Good in theory, bad in practice, because it would require a square matrix of shape=(10^n, 10^n), which is quadratic in memory usage.
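For concreteness, here is a minimal sketch of that dense np.dot idea (variable names like adj and n_nodes are illustrative, not from my actual code):

import numpy as np

n_nodes = input.shape[0]
# Dense binary "adjacency" matrix: adj[k, j] == 1 if row j
# contributes to the sum for row k.
adj = np.zeros(shape=(n_nodes, n_nodes))
for k, v in nodes_nbrs.items():
    adj[k, v] = 1

# One matrix multiply replaces the whole Python-level loop, but adj
# has shape (n_nodes, n_nodes), hence the quadratic memory blow-up.
output = np.dot(adj, input)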
Things I have not tried, but am hesitant to try:
- scipy sparse matrices: I am using autograd, which does not support automatic differentiation of the dot operation for scipy sparse matrices.
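For reference, a rough sketch of what the sparse version would look like if autograd supported it (untested on my end; the (data, (rows, cols)) construction is just one of several ways to build the matrix):

import numpy as np
from scipy.sparse import csr_matrix

n_nodes = input.shape[0]
# Same adjacency structure as the dense sketch above, stored sparsely:
# one (row, col) entry with value 1 per neighbor relationship.
rows = [k for k, v in nodes_nbrs.items() for _ in v]
cols = [j for v in nodes_nbrs.values() for j in v]
data = np.ones(len(cols))
adj_sparse = csr_matrix((data, (rows, cols)), shape=(n_nodes, n_nodes))

# Memory and work scale with the number of edges, not n_nodes**2.
output = adj_sparse.dot(input)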
For those who are curious, this is essentially a convolution operation on graph-structured data. Kinda fun developing this for grad school, but also somewhat frustrating being at the cutting edge of knowledge.