I have a dataset in which groups receive treatment at different times. For each treated row, I need to record the year in which its group first became treated, and 0 for every row that is not treated, including all rows of a never-treated group.

    import numpy as np
    import pandas as pd

    df = pd.DataFrame([['CA',2014,0],['CA',2015,0],['CA',2016,1],['CA',2017,1],
                       ['WA',2011,0],['WA',2012,1],['WA',2013,1],['TX',2010,0]],
                      columns=['Group_ID','Year','Treated'])

The dataframe should look like this once complete:
| Group_ID | Year | Treated | First_Treated | 
|---|---|---|---|
| CA | 2014 | 0 | 0 | 
| CA | 2015 | 0 | 0 | 
| CA | 2016 | 1 | 2016 | 
| CA | 2017 | 1 | 2016 | 
| WA | 2011 | 0 | 0 | 
| WA | 2012 | 1 | 2012 | 
| WA | 2013 | 1 | 2012 | 
| TX | 2010 | 0 | 0 | 

The Python code below returns each treated row's own year rather than the group's first year of treatment:

    df['first_treated'] = np.where(df['Treated']==1, df['Year'], 0)

I have tried the agg() and min() functions, but neither works properly:

    df['first_treated'] = np.where(df['Treated']==1,df['Year'].min,0)

I have also tried the R code from the question "Create a group variable first.treat indicating the first year when each unit becomes treated". Starting from an empty first_treated column, the mutate() call inserts no data into the column, and the R script raises no errors when run on a data frame similar to the pandas one above.
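
For reference, here is a minimal sketch of one way the desired column might be built in pandas. It rests on two observations about the attempts above: `df['Year'].min` without parentheses passes the method itself rather than a value, and even `df['Year'].min()` would give the minimum over the whole frame (2010) instead of one value per group. The intermediate names `treated_year` and `first_year` are illustrative only, and the output column is assumed to be named `First_Treated` as in the table.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [['CA', 2014, 0], ['CA', 2015, 0], ['CA', 2016, 1], ['CA', 2017, 1],
     ['WA', 2011, 0], ['WA', 2012, 1], ['WA', 2013, 1], ['TX', 2010, 0]],
    columns=['Group_ID', 'Year', 'Treated'])

# Keep Year only on treated rows; untreated rows become NaN.
treated_year = df['Year'].where(df['Treated'] == 1)

# Per-group minimum of the treated years, broadcast back to every row.
# A group with no treated rows (TX here) gets NaN at this step.
first_year = treated_year.groupby(df['Group_ID']).transform('min')

# Treated rows take their group's first treated year; all other rows take 0.
df['First_Treated'] = np.where(df['Treated'] == 1, first_year, 0).astype(int)

print(df)
```

If the first treatment year were wanted on every row of a treated group, including the rows before treatment, `first_year.fillna(0).astype(int)` could presumably be assigned directly instead of going through `np.where`.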
 
     
    