Pandas count values greater than current row in the last n rows

Question

How to get count of values greater than current row in the last n rows?

Imagine we have a dataframe as following:

I am trying to get a table such as following where n=3.

    col_a   col_b
0     8.4       0
1    11.3       0
2     7.2       2
3     6.5       3
4     4.5       3
5     8.9       0

Thanks in advance.

jezrael · Accepted Answer · 2018-06-26T13:07:33.987

In pandas is best dont loop because slow, here is better use rolling with custom function:

n = 3
df['new'] = (df['col_a'].rolling(n+1, min_periods=1)
                        .apply(lambda x: (x[-1] < x[:-1]).sum())
                        .astype(int))
print (df)
   col_a  new
0    8.4    0
1   11.3    0
2    7.2    2
3    6.5    3
4    4.5    3
5    8.9    0

If performance is important, use strides:

n = 3
x = np.concatenate([[np.nan] * (n), df['col_a'].values])

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)
arr = rolling_window(x, n + 1)

df['new'] = (arr[:, :-1] > arr[:, [-1]]).sum(axis=1)
print (df)
   col_a  new
0    8.4    0
1   11.3    0
2    7.2    2
3    6.5    3
4    4.5    3
5    8.9    0

Performance: Here is used perfplot in small window n = 3:

np.random.seed(1256)
n = 3

def rolling_window(a, window):
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)

def roll(df):
    df['new'] = (df['col_a'].rolling(n+1, min_periods=1).apply(lambda x: (x[-1] < x[:-1]).sum(), raw=True).astype(int))
    return df

def list_comp(df):
    df['count'] = [(j < df['col_a'].iloc[max(0, i-3):i]).sum() for i, j in df['col_a'].items()]
    return df

def strides(df):
    x = np.concatenate([[np.nan] * (n), df['col_a'].values])
    arr = rolling_window(x, n + 1)
    df['new1'] = (arr[:, :-1] > arr[:, [-1]]).sum(axis=1)
    return df


def make_df(n):
    df = pd.DataFrame(np.random.randint(20, size=n), columns=['col_a'])
    return df

perfplot.show(
    setup=make_df,
    kernels=[list_comp, roll, strides],
    n_range=[2**k for k in range(2, 15)],
    logx=True,
    logy=True,
    xlabel='len(df)')

Also I was curious about performance in large window, n = 100:

I can confirm this outperforms other answers by far. I am new to pandas, I just couldn't figure out how to filter out rolling series. — koray1396, Jun 26 '18 at 13:11
@koray1396 - Question was really interesting and answers a bit slow, so try added better one. So recommended use this - `rolling` is simplier, `strides` complicated, but faster. — jezrael, Jun 26 '18 at 13:13
@koray1396 - In pandas is simplier rule, avoid `loops` like list comprehensions if exist better faster vectorized solutions. — jezrael, Jun 26 '18 at 13:16
Just to be clear the `roll` method is faster but it's still not vectorised because of the `apply` + `lambda`. The `strides` method is indeed vectorised and an excellent solution if you can understand the logic. — jpp, Jun 26 '18 at 14:45

score 1 · Answer 2 · answered Jun 26 '18 at 10:02

1

n = 3
df['col_b'] = df.apply(lambda row: sum(row.col_a <= df.col_a.loc[row.name - n: row.name-1]), axis=1)

Out[]: 
   col_a  col_b
0    8.4      0
1   11.3      0
2    7.2      2
3    6.5      3
4    4.5      3
5    8.9      0

answered Jun 26 '18 at 10:02

Dillon

997
4
13

Unfortnately your code give wrong output, so cannot add to timings. sorry. – jezrael Jun 26 '18 at 12:46
@jezrael What do you mean by this? – Dillon Jun 26 '18 at 12:49

jpp · Answer 3 · 2018-06-26T10:15:20.823

Using a list comprehension with pd.Series.items:

n = 3
df['count'] = [(j < df['col_a'].iloc[max(0, i-3):i]).sum() \
               for i, j in df['col_a'].items()]

Equivalently, using enumerate:

n = 3
df['count'] = [(j < df['col_a'].iloc[max(0, i-n):i]).sum() \
               for i, j in enumerate(df['col_a'])]

Result:

print(df)

   col_a  count
0    8.4      0
1   11.3      0
2    7.2      2
3    6.5      3
4    4.5      3
5    8.9      0

Pandas count values greater than current row in the last n rows

3 Answers3

Linked