I have an Age category column in my pandas dataframe, df. 32% of the values in this column are missing and need to be imputed. I'm thinking of using the distribution of the available data (the remaining 68%) to impute the missing values.
The screenshot below shows the distribution of the available data (the 68%) for the age category:
As you can see from the table:
- 36 - 45 has 29.5%
- 46 - 55 has 24.9%
- etc.
Hence, I expect that when I impute the 32% missing values, age 36 - 45 will make up approximately 29.5% of the imputed values, age 46 - 55 approximately 24.9%, and so on.
Once I have imputed all the NaN values in the Age category column, the overall distribution should not differ much from the one in the screenshot. How can I achieve that?
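For concreteness, here is a minimal sketch of the idea on made-up data. It assumes the column is literally named `Age category`; the toy dataframe, the "Other" label, and the probabilities used to build it are purely illustrative and are not my real data. It estimates the category proportions from the non-missing rows and then draws a random category for each NaN using those proportions as weights. I am not sure this is the right or best approach.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
col = "Age category"  # assumed column name

# Hypothetical toy data standing in for the real df (labels/proportions are made up).
df = pd.DataFrame({
    col: rng.choice(["36 - 45", "46 - 55", "Other"], size=1000, p=[0.3, 0.25, 0.45])
})
# Blank out 32% of the rows to mimic the missing values.
df.loc[df.sample(frac=0.32, random_state=0).index, col] = np.nan

# Empirical distribution of the observed (non-missing) values.
before = df[col].value_counts(normalize=True)

# Draw one category per missing row, weighted by the observed proportions.
missing = df[col].isna()
df.loc[missing, col] = rng.choice(
    before.index.to_numpy(), size=int(missing.sum()), p=before.to_numpy()
)

# Distribution after imputation, for comparison with `before`.
after = df[col].value_counts(normalize=True)
print(pd.DataFrame({"before": before, "after": after}))
```

Because the draws are random, the proportions after imputation should be close to, but not exactly equal to, the original ones.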
Any help or advice will be greatly appreciated!
