I have the following dummy dataframe:
col1 = c("aa", NA, NA, NA, NA, NA, NA
, "cc", "cc", "cc", "cc", "cc", "cc", "cc", "cc", "cc"
, "aa", "aa", "aa", "aa", "aa", "aa", "aa", "aa", "aa", "aa", "aa")
col2 = c("aa", "aa", "aa", "aa", "aa", "aa", "aa", "aa", "aa"
, NA, NA, NA, NA, NA, NA, NA, NA, NA
, "bb", "bb", "bb", "bb", "bb", "bb", "bb", "bb", "bb")
col3 = c("aa", "bb", "bb"
, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA)
col4 = c(NA, NA, NA, 4:27)
col5 = c(28:51, NA, NA, NA)
# Construct the data frame with NAs in categorical and numeric columns
df = data.frame("col1" = col1, "col2" = col2, "col3" = col3
, "col4" = col4, "col5" = col5, stringsAsFactors = FALSE)
I would like to understand how to write a function that imputes only the categorical columns (col1, col2, col3) using these simple rules:
- impute categorical NA column values with the most frequent value in that column
- in case of ties, choose the alphabetically first value, i.e. "aa" has preference over "bb" (as is the case for col2)
Could anyone please assist in writing a function which takes df as input and returns the data frame with only the categorical values imputed? col4 and col5 should remain unchanged (they have NAs but are numeric, so they should be ignored).
Clarification: for this example,
- col1 NAs should be imputed to be "aa"
- col2 NAs should be imputed to be "aa" (by alphabetic preference in ties)
- col3 NAs should be imputed to be "bb"
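To make the rules concrete, here is a rough sketch of what such a function might look like. It is only one possible approach under stated assumptions: the names most_frequent and impute_categorical are made up for illustration, and "categorical" is taken to mean character columns (as produced by stringsAsFactors = FALSE above). The tie-breaking relies on table() ordering the values of a character vector alphabetically and on which.max() returning the first maximum.

```r
# Hypothetical helper: most frequent non-NA value of a character vector.
# table() drops NAs by default and orders the values alphabetically;
# which.max() returns the first maximum, so ties go to the alphabetically
# first value.
most_frequent <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

# Hypothetical function: impute NAs in character columns only.
# Numeric columns (and any other non-character columns) are left untouched.
impute_categorical <- function(df) {
  is_cat <- vapply(df, is.character, logical(1))
  for (nm in names(df)[is_cat]) {
    col <- df[[nm]]
    # Skip columns with nothing to impute, or nothing to impute from.
    if (!anyNA(col) || all(is.na(col))) next
    col[is.na(col)] <- most_frequent(col)
    df[[nm]] <- col
  }
  df
}
```

Calling impute_categorical(df) on the data frame above should fill col1 and col2 with "aa" and col3 with "bb", while the NAs in col4 and col5 stay as they are. Factor columns would need separate handling, since their level order is not necessarily alphabetical.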
Thanks