How do I use the Pandas groupby function to calculate the mean for the previous year?

Question

I am trying to look for a method to find a Player's mean score for the "Last Season" (Previous Year) and add it in a new column in the original dataframe df.

I have coded a formula to get a Player's mean score for the current year, excluding the current row, which is as follows:

df['Season Avg'] = df.groupby([df['Player'], df['DateTime'].dt.year])['Score']
                   .apply(lambda x: x.shift(1).expanding().mean())

However, despite my best attempt at using the shift function, I can not quite work out how to calculate the previous years mean ("Last Season Avg") directly into a new column.

The dataframe is set out as follows:

Player	DateTime	Score	Season Avg
PlayerB	2020-MM-DD HH:MM:SS	40	NaN
PlayerA	2020-MM-DD HH:MM:SS	50	NaN
PlayerA	2021-MM-DD HH:MM:SS	100	NaN
PlayerB	2021-MM-DD HH:MM:SS	200	NaN
PlayerA	2021-MM-DD HH:MM:SS	160	100
PlayerB	2021-MM-DD HH:MM:SS	140	200
PlayerB	2021-MM-DD HH:MM:SS	160	170
PlayerA	2021-MM-DD HH:MM:SS	200	130

The new ideal dataframe that I would like:

Player	DateTime	Score	Season Avg	Last Season Avg
PlayerB	2020-MM-DD HH:MM:SS	40	NaN	NaN
PlayerA	2020-MM-DD HH:MM:SS	50	NaN	NaN
PlayerA	2021-MM-DD HH:MM:SS	100	NaN	50
PlayerB	2021-MM-DD HH:MM:SS	200	NaN	40
PlayerA	2021-MM-DD HH:MM:SS	160	100	50
PlayerB	2021-MM-DD HH:MM:SS	140	200	40
PlayerB	2021-MM-DD HH:MM:SS	160	170	40
PlayerA	2021-MM-DD HH:MM:SS	200	130	50

Provide a [reproducible example](https://stackoverflow.com/q/20109391/2641825) so it's easier for others to help you with working code. — Paul Rougieux, Mar 04 '22 at 13:29

score 3 · Answer 1 · 2022-03-05T00:09:01.533

You can groupby once by the "Player" and the year to find the yearly average for each player; then groupby "Player" + shift to get the previous year's previous year's averages.

out = df.groupby(['Player', df['DateTime'].dt.year])['Score'].mean().reset_index(name='Season Avg')
out['Last Season Avg'] = out['Season Avg'].groupby('Player').shift()

If you're looking for career averages until a particular season, you could use expanding().mean():

out = df.groupby(['Player', df['DateTime'].dt.year])['Score'].expanding().mean().reset_index(name='Season Avg')
df['Career Avg by Last Season'] = df['Career Avg by Season'].groupby('Player').shift()

Edit:

With sample data provided, let's test it. The main problem here is that "Year"s have duplicate values. @PaulRougieux handles it very elegantly. Here's another option. The idea is to find last season's averages and map it back to the original df (instead of transforming it).

df['Last Season Avg'] = (df.set_index(['Player', df['DateTime'].str[:4]]).index
                             .map(df.groupby(['Player', df['DateTime'].str[:4]])['Score'].mean()
                                  .groupby(level=0).shift()))

Output:

    Player             DateTime  Score  Season Avg  Last Season Avg
0  PlayerB  2020-MM-DD HH:MM:SS     40         NaN              NaN
1  PlayerA  2020-MM-DD HH:MM:SS     50         NaN              NaN
2  PlayerA  2021-MM-DD HH:MM:SS    100         NaN             50.0
3  PlayerB  2021-MM-DD HH:MM:SS    200         NaN             40.0
4  PlayerA  2021-MM-DD HH:MM:SS    160       100.0             50.0
5  PlayerB  2021-MM-DD HH:MM:SS    140       200.0             40.0
6  PlayerB  2021-MM-DD HH:MM:SS    160       170.0             40.0
7  PlayerA  2021-MM-DD HH:MM:SS    200       130.0             50.0

Use [transform](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.transform.html) to keep results in the same data frame. — Paul Rougieux, Mar 04 '22 at 13:26

score 1 · Accepted Answer · answered Mar 04 '22 at 14:18

Create a sample data set

import pandas
import numpy as np
df = pandas.DataFrame(
    {'player': ['B', 'A', 'A', 'B', 'A', 'B', 'B', 'A'],
     'datetime': ['2020-01-01', '2020-01-01', '2021-01-01', '2021-01-01',
                  '2021-01-01', '2021-01-01', '2021-01-01', '2021-01-01'],
     'score': [40, 50, 100, 200, 160, 140, 160, 200],
    }
)
df["datetime"] = pandas.to_datetime(df["datetime"])
df["year"] = df["datetime"].dt.year

Use transform to add the current season average to the data frame

df["season_avg"] = df.groupby(["datetime", "player"])["score"].transform("mean")
df

  player   datetime  score  year  season_avg
0      B 2020-01-01     40  2020   40.000000
1      A 2020-01-01     50  2020   50.000000
2      A 2021-01-01    100  2021  153.333333
3      B 2021-01-01    200  2021  166.666667
4      A 2021-01-01    160  2021  153.333333
5      B 2021-01-01    140  2021  166.666667
6      B 2021-01-01    160  2021  166.666667
7      A 2021-01-01    200  2021  153.333333

Shift cannot be applied here because years are repeated

df.sort_values(["year"], ascending=True).groupby(["player"])["season_avg"].transform("shift")

0           NaN
1           NaN
2     50.000000
3     40.000000
4    153.333333
5    166.666667
6    166.666667
7    153.333333
Name: season_avg, dtype: float64

Compute the average from the previous year and join them to the original dataframe

savg = (df.groupby(["year", "player"])
        .agg(last_season_avg = ("score", "mean"))
        .reset_index())
savg["year"] = savg["year"] + 1
savg

   year player  last_season_avg
0  2021      A        50.000000
1  2021      B        40.000000
2  2022      A       153.333333
3  2022      B       166.666667

df.merge(savg, on=["player", "year"], how="left" )

  player   datetime  score  year  season_avg  last_season_avg
0      B 2020-01-01     40  2020   40.000000              NaN
1      A 2020-01-01     50  2020   50.000000              NaN
2      A 2021-01-01    100  2021  153.333333             50.0
3      B 2021-01-01    200  2021  166.666667             40.0
4      A 2021-01-01    160  2021  153.333333             50.0
5      B 2021-01-01    140  2021  166.666667             40.0
6      B 2021-01-01    160  2021  166.666667             40.0
7      A 2021-01-01    200  2021  153.333333             50.0

Another way to compute the average from the previous year, using shift is maybe more elegant than doing year + 1.

savg = (df.groupby(["year", "player"])
        .agg(season_avg = ("score", "mean"))
        .reset_index()
        .sort_values(["year"])
       )
savg["last_season_avg"] = savg.groupby(["player"])["season_avg"].transform("shift")

Thank you, I was able to follow step-by-step and it worked perfectly! — Fonti, Mar 04 '22 at 15:09
Please add [reproducible examples](https://stackoverflow.com/q/20109391/2641825) to your next questions, it really helps those who reply. Have fun leaning pandas! — Paul Rougieux, Mar 04 '22 at 16:15

How do I use the Pandas groupby function to calculate the mean for the previous year?

2 Answers2