I could not think of a better way to word the question, kindly edit if you have something better.
| date | country | total_cases |
|---|---|---|
| 1 | denmark | 2 |
| 2 | denmark | 5 |
| 3 | denmark | 10 |
| 4 | denmark | 17 |
| 1 | usa | 5 |
| 2 | usa | 13 |
| 3 | usa | 23 |
| 4 | usa | 37 |
| 1 | india | 0 |
| 2 | india | 2 |
| 3 | india | 9 |
| 4 | india | 40 |
From the above dataframe, if I want to create a new feature new_cases, I can do this-
df['new_cases'] = df['total_cases'] - df['total_cases'].shift(1)
But that would also find values from where one country ends and another begins. (usa date 1 - denmark date 4). The workaround I have been using is running a loop like this-
new_cases = []
for country in df['country'].unique():
temp_df = df[df['country'] == country]
temp_array = temp_df['total_cases'] - temp_df['total_cases'].shift(1)
new_cases.append(temp_array)
df['new_cases'] = pd.concat(temp_array)
As you can imagine, this takes time to run when the dataframe is huge, especially when creating multiple such features. Is there a better way to do this? Maybe a pandas function especially for this