I am learning how to handle missing values in a dataset. I have a table with ~1million entries. I'm trying to deal with a small number of missing values.
My data concerns a bicycle-share system and my missing values are start & end locations.
Data: missing starting stations, only 7 values
Data: missing ending station, 24 values altogether
I want to fill the NaN in both cases with the mode of the "opposite" station. Example, for start_station==21, I want to see what is the most common end_station, and use that to fill in my missing value.
E.g. df.loc[df['start_station'] == 21].end_station.mode()
I tried to achieve this with a function:
def inpute_end_station(df):
    for index, row in df.iterrows():    
        if pd.isnull(df.loc[index, 'end_station']):
            start_st = df.loc[index, 'start_station']
            mode = df.loc[df['start_station'] == start_st].end_station.mode()
            df.loc[index, 'end_station'].fillna(mode, inplace=True)
The last line throws a AttributeError: 'numpy.float64' object has no attribute 'fillna'. If instead I just use df.loc[index, 'end_station'] = mode I get ValueError: Incompatible indexer with Series.
Am I approaching this properly? I understand it's bad practice to modify something you're iterating over in pandas so what's the correct way of changing start_station and end_station columns and replacing the NaNs with the corresponding mode of the complimentary station?


 
     
    