Let's assume I have a pandas dataframe like this:
import pandas as pd
d = {'entrance': ['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b', 'c', 'c', 'c', 'c', 'd', 'd', 'd', 'd', 'e', 'e', 'e', 'e'],
'hour': [3, 4, 4, 4, 2, 2, 2, 2, 0, 0, 0, 1, 12, 12, 12, 12, 7, 7, 8, 8],
'in': [3, 2, 0, 4, 3, 2, 1, 0, 2, 1, 5, 0, 1, 1, 2, 3, 10, 10, 1, 5],
'out': [0, 1, 0, 0, 1, 2, 1, 3, 1, 1, 0, 2, 0, 2, 3, 0, 0, 0, 25, 0]}
df = pd.DataFrame(data=d)
df
entrance hour in out
a 3 3 0
a 4 2 1
a 4 0 0
a 4 4 0
b 2 3 1
b 2 2 2
b 2 1 1
b 2 0 3
c 0 2 1
c 0 1 1
c 0 5 0
c 1 0 2
d 12 1 0
d 12 1 2
d 12 2 3
d 12 3 0
e 7 10 0
e 7 10 0
e 8 1 25
e 8 5 0
I would like to sum the values in the columns in and out grouped by hour. I know I can achieve this with:
df_grouped = df.groupby(['hour']).agg({'in':'sum', 'out': 'sum'}).reset_index()
However, the values in the columns in and out should only be grouped by the first hour record of each entrance group. So for entrance group 'a', the values in and out should be summed for the first record of the corresponding hours, which is in this case hour '3'. The desired output should be:
hour in out
3 9 1
2 6 7
0 8 4
12 7 5
7 27 25
What is the best way to do this?