I'd like to know how I might be able to transform this problem to reduce the overhead of the np.sum() function calls in my code.
I have an input matrix, say of shape=(1000, 36). Each row represents a node in a graph. The operation I am doing iterates over each row and does an element-wise addition of a variable number of other rows. Those "other" rows are defined in a dictionary nodes_nbrs that records, for each row, the list of rows that must be summed together. An example is as such:
nodes_nbrs = {0: [0, 1],
              1: [1, 0, 2],
              2: [2, 1],
              ...}
Here, node 0 would be transformed into the sum of nodes 0 and 1. Node 1 would be transformed into the sum of nodes 1, 0, and 2. And so on for the rest of the nodes.
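Concretely, with the example dictionary above:

output[0] = input[0] + input[1]
output[1] = input[1] + input[0] + input[2]
output[2] = input[2] + input[1]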
The current (and naive) way I have implemented this is as follows. I first instantiate a zero array of the final shape that I want, and then iterate over each key-value pair in the nodes_nbrs dictionary:
output = np.zeros(shape=input.shape)
for k, v in nodes_nbrs.items():
    output[k] = np.sum(input[v], axis=0)
This code is all cool and fine in small tests (shape=(1000, 36)), but on larger tests (shape of roughly (1E5 to 1E6, 36)), it takes ~2-3 seconds to complete. I end up having to do this operation thousands of times, so I'm trying to see if there's a more optimized way of doing this.
After doing line profiling, I noticed that the key killer here is calling the np.sum function over and over, which takes about 50% of the total time. Is there a way I can eliminate this overhead? Or is there another way I can optimize this?
Apart from that, here is a list of things I have done, and (very briefly) their results:
- A cython version: eliminates the for loop type-checking overhead. 30% reduction in time taken. With the cython version, np.sum takes about 80% of the total wall clock time, rather than 50%.
- Pre-declare np.sum as a variable npsum, and then call npsum inside the for loop. No difference from the original.
- Replace np.sum with np.add.reduce, assign that to the variable npsum, and then call npsum inside the for loop. ~10% reduction in wall clock time, but then incompatible with autograd (explanation below, in the sparse matrices bullet point).
- numba JIT-ing: did not attempt more than adding the decorator. No improvement, but I didn't try harder.
- Convert the nodes_nbrs dictionary into a dense numpy binary array (1s and 0s), and then do a single np.dot operation (see the sketch after this list). Good in theory, bad in practice, because it would require a square matrix of shape=(10^n, 10^n), which is quadratic in memory usage.
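For concreteness, here is a minimal sketch of that dense np.dot idea (variable names like adj and n_nodes are illustrative, not from my actual code):

import numpy as np

n_nodes = input.shape[0]
# Dense binary "adjacency" matrix: adj[k, j] == 1 if row j
# contributes to the sum for row k.
adj = np.zeros(shape=(n_nodes, n_nodes))
for k, v in nodes_nbrs.items():
    adj[k, v] = 1

# One matrix multiply replaces the whole Python-level loop, but adj
# has shape (n_nodes, n_nodes), hence the quadratic memory blow-up.
output = np.dot(adj, input)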
Things I have not tried, but am hesitant to try:
- scipy sparse matrices: I am using autograd, which does not support automatic differentiation of the dot operation for scipy sparse matrices.
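For reference, a rough sketch of what the sparse version would look like if autograd supported it (untested on my end; the (data, (rows, cols)) construction is just one of several ways to build the matrix):

import numpy as np
from scipy.sparse import csr_matrix

n_nodes = input.shape[0]
# Same adjacency structure as the dense sketch above, stored sparsely:
# one (row, col) entry with value 1 per neighbor relationship.
rows = [k for k, v in nodes_nbrs.items() for _ in v]
cols = [j for v in nodes_nbrs.values() for j in v]
data = np.ones(len(cols))
adj_sparse = csr_matrix((data, (rows, cols)), shape=(n_nodes, n_nodes))

# Memory and work scale with the number of edges, not n_nodes**2.
output = adj_sparse.dot(input)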
For those who are curious, this is essentially a convolution operation on graph-structured data. Kinda fun developing this for grad school, but also somewhat frustrating being at the cutting edge of knowledge.