Create feature based on consecutive rows in various independent sections of dataframe

Question

I could not think of a better way to word the question, kindly edit if you have something better.

date	country	total_cases
1	denmark	2
2	denmark	5
3	denmark	10
4	denmark	17
1	usa	5
2	usa	13
3	usa	23
4	usa	37
1	india	0
2	india	2
3	india	9
4	india	40

From the above dataframe, if I want to create a new feature new_cases, I can do this-

df['new_cases'] = df['total_cases'] - df['total_cases'].shift(1)

But that would also find values from where one country ends and another begins. (usa date 1 - denmark date 4). The workaround I have been using is running a loop like this-

new_cases = []
for country in df['country'].unique():
    temp_df = df[df['country'] == country]
    temp_array = temp_df['total_cases'] - temp_df['total_cases'].shift(1)
    new_cases.append(temp_array)

df['new_cases'] = pd.concat(temp_array)

As you can imagine, this takes time to run when the dataframe is huge, especially when creating multiple such features. Is there a better way to do this? Maybe a pandas function especially for this

score 1 · Accepted Answer · answered May 07 '21 at 13:03

1

Use DataFrameGroupBy.shift:

df['new_cases'] = df['total_cases'] - df.groupby('country')['total_cases'].shift(1)

Or DataFrameGroupBy.diff:

df['new_cases'] = df.groupby('country')['total_cases'].diff(1)

answered May 07 '21 at 13:03

jezrael

822,522
95
1,334
1,252

Do you think it can work with pivot table as well? Just curious. I tried, but can't seem to get it. – callmeanythingyouwant May 07 '21 at 13:58
1

@callmeanythingyouwant - I think not, 'diff' is not aggregate function – jezrael May 07 '21 at 13:59

Create feature based on consecutive rows in various independent sections of dataframe

1 Answers1