I'm using pandas and noticed a HUGE difference in performance between these two statements:
df.sum(level=['My', 'Index', 'Levels']) # uses numpy sum which is vectorized
and
df.groupby(level=['My', 'Index', 'Levels']).sum() # Slow...
The first example uses numpy.sum, which is vectorized, as stated in the documentation.
Unfortunately, sum(level=...) is deprecated in the API and produces an ugly warning:
FutureWarning: Using the level keyword in DataFrame and Series aggregations is deprecated and will be removed in a future version. Use groupby instead. df.sum(level=1) should use df.groupby(level=1).sum()
I don't want to fall back to a non-vectorized version and suffer poor processing performance. How can I use numpy.sum together with groupby?
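For what it's worth, the only candidates I can think of are passing numpy.sum as the aggregation callable; these are just guesses (df as above, with the hypothetical ['My', 'Index', 'Levels'] index levels), and I have not verified that either call keeps the vectorized path:
import numpy as np
# Guesses only: hand numpy.sum to the groupby machinery.
# Unclear to me whether pandas keeps the vectorized numpy path here.
out_agg = df.groupby(level=['My', 'Index', 'Levels']).agg(np.sum)
out_apply = df.groupby(level=['My', 'Index', 'Levels']).apply(np.sum)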
Edit: following the comments, here is a basic test I ran: pandas 1.4.4, 10k random rows, 10 index levels.
import pandas as pd
import numpy as np
print('pandas:', pd.__version__)
nb_lines = int(1e4)
nb_levels = 10
# 10k rows of random integers in [0, 9], one column per level
ix = np.random.randint(0, nb_levels, size=(nb_lines, nb_levels))
cols = [chr(65+i) for i in range(nb_levels)] # A, B, C, ...
df = pd.DataFrame(ix, columns=cols)
df = df.set_index(cols)
df['VALUE'] = np.random.rand(nb_lines) # random values to aggregate
print('with groupby:')
%timeit -n 500 df.groupby(level=cols).sum()
print('without groupby:')
%timeit -n 500 df.sum(level=cols)
And the result is:
pandas: 1.4.4
with groupby:
5.51 ms ± 1.06 µs per loop (mean ± std. dev. of 7 runs, 500 loops each)
without groupby:
<magic-timeit>:1: FutureWarning: Using the level keyword in DataFrame and Series aggregations is deprecated and will be removed in a future version. Use groupby instead. df.sum(level=1) should use df.groupby(level=1).sum().
4.93 ms ± 40.1 µs per loop (mean ± std. dev. of 7 runs, 500 loops each)
This is just one example, but the result is always faster without groupby. Changing the parameters (number of levels, step size for the columns to group on, etc.) does not change the outcome. On a big data set, the difference between the two methods (numpy.sum vs. the groupby path) becomes clearly visible.
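For anyone reproducing this outside IPython, here is a minimal sketch of the same comparison using the standard timeit module instead of the %timeit magic (reusing df and cols from the snippet above):
import timeit
n = 300
t_groupby = timeit.timeit(lambda: df.groupby(level=cols).sum(), number=n)
t_level = timeit.timeit(lambda: df.sum(level=cols), number=n)  # emits the FutureWarning
print(f'with groupby:    {t_groupby / n * 1e3:.2f} ms per loop')
print(f'without groupby: {t_level / n * 1e3:.2f} ms per loop')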
@mozway, your results indicate similar performance, but if you increase the number of levels you should see the groupby version get worse, at least those are the results on my computer. See the edited code below so you can change the number of levels (example with 10 levels and 100k lines):
import pandas as pd
import numpy as np
from string import ascii_uppercase as UP
np.random.seed(0)
N = 100_000
nb_levels = 10
cols = [chr(65+i) for i in range(nb_levels)]
d = {cols[i]: np.random.choice(list(UP), size=N) for i in range(nb_levels)}
d.update({'num': np.random.random(size=N)})
df = pd.DataFrame(d).set_index(cols)
print(pd.__version__)
print('with groupby:')
%timeit -n 300 df.groupby(level=cols).sum()
print('without groupby:')
%timeit -n 300 df.sum(level=cols)
... and the result:
1.4.4
with groupby:
50.8 ms ± 536 µs per loop (mean ± std. dev. of 7 runs, 300 loops each)
without groupby:
<magic-timeit>:1: FutureWarning: Using the level keyword in DataFrame and Series aggregations is deprecated and will be removed in a future version. Use groupby instead. df.sum(level=1) should use df.groupby(level=1).sum().
42 ms ± 506 µs per loop (mean ± std. dev. of 7 runs, 300 loops each)
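One knob I have not ruled out (just a guess on my part): groupby sorts the group keys by default (sort=True), and the documented sort=False option might shave some of that overhead, though I have not checked whether it explains the gap here:
%timeit -n 300 df.groupby(level=cols, sort=False).sum()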
Thanks