My data is a bit complicated, I separate into 2 sections: (A) Explain data, (B) Desire output
(A) - Explain data:
My data as follow:
comp date adj_date val
0 a 1999-12-31 NaT 50
1 a 2000-01-31 NaT 51
2 a 2000-02-29 NaT 52
3 a 2000-03-31 NaT 53
4 a 2000-04-30 NaT 54
5 a 2000-05-31 NaT 55
6 a 2000-06-30 NaT 56
----------------------------------
7 a 2000-07-31 2000-01-31 57
8 a 2000-08-31 2000-02-29 58
9 a 2000-09-30 2000-03-31 59
10 a 2000-10-31 2000-04-30 60
11 a 2000-11-30 2000-05-31 61
12 a 2000-12-31 2000-06-30 62
13 a 2001-01-31 2000-07-31 63
14 a 2001-02-28 2000-08-31 64
15 a 2001-03-31 2000-09-30 65
16 a 2001-04-30 2000-10-31 66
17 a 2001-05-31 2000-11-30 67
18 a 2001-06-30 2000-12-31 68
----------------------------------
19 a 2001-07-31 2001-01-31 69
20 a 2001-08-31 2001-02-28 70
21 a 2001-09-30 2001-03-31 71
22 a 2001-10-31 2001-04-30 72
23 a 2001-11-30 2001-05-31 73
24 a 2001-12-31 2001-06-30 74
25 a 2002-01-31 2001-07-31 75
26 a 2002-02-28 2001-08-31 76
27 a 2002-03-31 2001-09-30 77
28 a 2002-04-30 2001-10-31 78
29 a 2002-05-31 2001-11-30 79
30 a 2002-06-30 2001-12-31 80
----------------------------------
31 a 2002-07-31 2002-01-31 81
32 a 2002-08-31 2002-02-28 82
33 a 2002-09-30 2002-03-31 83
34 a 2002-10-31 2002-04-30 84
35 a 2002-11-30 2002-05-31 85
36 a 2002-12-31 2002-06-30 86
37 a 2003-01-31 2002-07-31 87
38 a 2003-02-28 2002-08-31 88
39 a 2003-03-31 2002-09-30 89
40 a 2003-04-30 2002-10-31 90
41 a 2003-05-31 2002-11-30 91
42 a 2003-06-30 2002-12-31 92
----------------------------------
date: is the actual date, as end of month.adj_date=date+MonthEnd(-6)val: is given value
I want to create new column val_new where:
- it is referencing to
valof previous year December val_newis then applied todateas fromdate.Julytodate.(year+1).June, Or equivalently inadj_dateit is fromadj_date.Jantoadj_date.Dec
(B) - Desire Output:
comp date adj_date val val_new
0 a 1999-12-31 NaT 50 NaN
1 a 2000-01-31 NaT 51 NaN
2 a 2000-02-29 NaT 52 NaN
3 a 2000-03-31 NaT 53 NaN
4 a 2000-04-30 NaT 54 NaN
5 a 2000-05-31 NaT 55 NaN
6 a 2000-06-30 NaT 56 NaN
-------------------------------------------
7 a 2000-07-31 2000-01-31 57 50.0
8 a 2000-08-31 2000-02-29 58 50.0
9 a 2000-09-30 2000-03-31 59 50.0
10 a 2000-10-31 2000-04-30 60 50.0
11 a 2000-11-30 2000-05-31 61 50.0
12 a 2000-12-31 2000-06-30 62 50.0
13 a 2001-01-31 2000-07-31 63 50.0
14 a 2001-02-28 2000-08-31 64 50.0
15 a 2001-03-31 2000-09-30 65 50.0
16 a 2001-04-30 2000-10-31 66 50.0
17 a 2001-05-31 2000-11-30 67 50.0
18 a 2001-06-30 2000-12-31 68 50.0
-------------------------------------------
19 a 2001-07-31 2001-01-31 69 62.0
20 a 2001-08-31 2001-02-28 70 62.0
21 a 2001-09-30 2001-03-31 71 62.0
22 a 2001-10-31 2001-04-30 72 62.0
23 a 2001-11-30 2001-05-31 73 62.0
24 a 2001-12-31 2001-06-30 74 62.0
25 a 2002-01-31 2001-07-31 75 62.0
26 a 2002-02-28 2001-08-31 76 62.0
27 a 2002-03-31 2001-09-30 77 62.0
28 a 2002-04-30 2001-10-31 78 62.0
29 a 2002-05-31 2001-11-30 79 62.0
30 a 2002-06-30 2001-12-31 80 62.0
-------------------------------------------
31 a 2002-07-31 2002-01-31 81 74.0
32 a 2002-08-31 2002-02-28 82 74.0
33 a 2002-09-30 2002-03-31 83 74.0
34 a 2002-10-31 2002-04-30 84 74.0
35 a 2002-11-30 2002-05-31 85 74.0
36 a 2002-12-31 2002-06-30 86 74.0
37 a 2003-01-31 2002-07-31 87 74.0
38 a 2003-02-28 2002-08-31 88 74.0
39 a 2003-03-31 2002-09-30 89 74.0
40 a 2003-04-30 2002-10-31 90 74.0
41 a 2003-05-31 2002-11-30 91 74.0
42 a 2003-06-30 2002-12-31 92 74.0
-------------------------------------------
I have two solutions, but both comes at a cost:
- Solution 1: to create
sub_decdataframe where we takevalofDeceach year. Then merge back to maindata. This one works fine, but I don't like this solution because our actual data will involve a lot of merge, and it is not easy and convenient to keep track of all those merges. - Solution 2: (1) I create a lag by shift(7), (2) set other
adj_datebut Jan toNone, (3) then use groupby with ffill. This solution works nicely, but if there is any missing rows, or the date is not continuous, then the entireoutputis wrong
create adj_year:
data['adj_year'] = data['adj_date'].dt.year
cross referencing by shift(7):
data['val_new'] = data.groupby('comp')['val'].shift(7)
setting other adj_date except Jan to be None:
data.loc[data['adj_date'].dt.month != 1, 'val_new'] = None
using ffill to fill in None by each group of ['comp', 'adj_year']:
data['val_new'] = data.groupby(['comp', 'adj_year'])['val_new'].ffill()
If you have any suggestion to overcome the drawback of Solution 02, or any other new solution is appreciated.
Thank you