I am trying to get the daily PM averages from my dataframe, which has a column of hourly values 'PM'. Here are the steps I've taken so far.
import pandas as pd
import numpy as np
df_2018 = pd.read_csv('kath2018.csv')
My 'kath2018.csv' looks like this:
df_2018.head()
    Date    Year    Month   Day Hour    PM
0   1/1/18 1:00 2018    1   1   1   131
1   1/1/18 2:00 2018    1   1   2   85
2   1/1/18 3:00 2018    1   1   3   74
3   1/1/18 4:00 2018    1   1   4   79
4   1/1/18 5:00 2018    1   1   5   85
I clean up the data by replacing the sentinel values with np.nan, and then using interpolate() to fill the NaNs.
#data has random -999 and 985 values, replace with NaN
#data has random -999 and 985 sentinel values, replace both with NaN
df_2018['PM'] = df_2018['PM'].replace([-999, 985], np.nan)
df_2018['PM'] = df_2018['PM'].interpolate()
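To illustrate what this cleanup step does, here is a minimal, self-contained sketch on synthetic values (not the real kath2018.csv data), replacing both sentinel codes in one call and filling the gaps linearly:

```python
import numpy as np
import pandas as pd

# Synthetic hourly PM values with the two sentinel codes mixed in
pm = pd.Series([131.0, -999.0, 74.0, 985.0, 85.0])

# Replace both sentinels with NaN in one call, then fill gaps linearly
cleaned = pm.replace([-999.0, 985.0], np.nan).interpolate()
print(cleaned.tolist())  # [131.0, 102.5, 74.0, 79.5, 85.0]
```

Passing a list to replace() avoids the two separate calls, and interpolate() defaults to linear interpolation between the surrounding valid values.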
Then, to get the daily average (my data is given in hourly intervals), I run the following code, which does exactly what it should: it groups the hourly values by day and takes the mean.
df_2018['Date'] = pd.to_datetime(df_2018['Date'])
df_2018 = df_2018.groupby(pd.Grouper(freq='D', key='Date')).mean()
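As a sketch of how this groupby behaves when whole days are absent (synthetic data, not my actual file): pd.Grouper with a frequency fills in the intermediate date bins, so a completely missing day still gets a row, with NaN as its mean.

```python
import numpy as np
import pandas as pd

# Two hourly readings on Jan 1 and one on Jan 3 -- Jan 2 is entirely missing
df = pd.DataFrame({
    "Date": pd.to_datetime(["2018-01-01 01:00", "2018-01-01 02:00", "2018-01-03 01:00"]),
    "PM": [100.0, 120.0, 80.0],
})

# Group the hourly values into daily bins and take the mean per day
daily = df.groupby(pd.Grouper(freq="D", key="Date")).mean()
print(daily)  # Jan 2 appears as a row with NaN in the PM column
```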
However, there are entire days' worth of missing data: when I look at df_2018 now, the completely missing days show up with empty cells in the PM column (see the dataframe after the groupby).
I cannot figure out how to go back into the dataframe and replace the empty cells in the PM column with np.nan so that I can run the interpolation again.
Should I even be 'going back'? Is there a way for me to scope out the missing days first, before running the interpolation and groupby?
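For context, one way I imagine scoping out the missing days might look like the sketch below (a hypothetical daily frame standing in for my groupby result, with Jan 2 fully missing): locate the NaN rows with isna(), then interpolate across them.

```python
import numpy as np
import pandas as pd

# Hypothetical daily means where Jan 2 came out entirely empty, as after the groupby
daily = pd.DataFrame(
    {"PM": [110.0, np.nan, 80.0]},
    index=pd.date_range("2018-01-01", periods=3, freq="D"),
)

# Scope out which days are fully missing before touching them
missing_days = daily.index[daily["PM"].isna()]
print(list(missing_days))  # [Timestamp('2018-01-02 00:00:00')]

# Then fill those days by interpolating between the neighbouring daily means
daily["PM"] = daily["PM"].interpolate()
print(daily["PM"].tolist())  # [110.0, 95.0, 80.0]
```

But I am not sure whether inspecting the NaNs like this is the right approach, or whether the missing days should be handled before the groupby.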
