This is my first Stack Overflow post. I researched extensively but have not found a similar post.
I am trying to impute the median for NA values based on two conditions.
Here is my code:
#Create sample of original data for reproducibility
Date<-c("2009-05-01","2009-05-02","2009-05-03","2009-06-01","2009-06-02",
        "2009-06-03", "2010-05-01","2010-05-02","2010-05-03","2010-06-01",
        "2010-06-02","2010-06-03","2011-05-01","2011-05-02","2011-05-03",
        "2011-06-01","2011-06-02","2011-06-03")
Month<- c("May","May","May","June","June","June",
          "May","May","May","June","June","June",
          "May","May","May","June","June","June")
DayType<- c("Monday","Tuesday","Wednesday","Monday","Tuesday","Wednesday",
            "Monday","Tuesday","Wednesday","Monday","Tuesday","Wednesday",
            "Monday","Tuesday","Wednesday","Monday","Tuesday","Wednesday")
Qty<- c(NA,NA,NA,NA,NA,NA,
        1,2,1,10,15,13,
        3,2,5,20,14,16)
#Combine into dataframe
Example<-data.frame(Date,Month,DayType,Qty)
#Test output
Example
# Make a separate dataframe to calculate the median value based on day of the month
test1 <- ddply(Example,. (DayType,Month),summarize,median=median(Qty,na.rm=TRUE))
This works as expected. Test1 output looks like this:
DayType   Month  Median
Monday    June   15.0
Monday    May    2.0
Tuesday   June   14.5
Tuesday   May    2.0
Wednesday June   14.5
Wednesday May    3.0
My second step replaces "NA" values in the original dataset with the medians calculated in test1. This is where my issue comes in.
Example$Qty[is.na(Example$Qty)] <- test1$median[match(Example$DayType,test1$DayType,Example$Month,test1$Month)][is.na(Example$Qty)]
Example
Match[] only matches on the median value for each day, rather than the median value for each day by month. The output is the same seven repeating values for the entire set. I have not figured out how to match on both columns simultaneously.
Output:
Date         DayType   Month   GSEvtQty
2009-05-01   Monday    May     15.0    *should be 2.0, matching to June
2009-05-02   Tuesday   May     14.5    *should be 2.0, matching to June
2009-05-03   Wednesday May     14.5    *should be 3.0, matching to June
2009-06-01   Monday    June    15.0    *imputes correctly
2009-06-02   Tuesday   June    14.5    *imputes correctly
2009-06-03   Wednesday June    14.5    *imputes correctly
2010-05-01   Monday    May     1.0     
2010-05-02   Tuesday   May     2.0  
2010-05-03   Wednesday May     1.0 
2010-06-01   Monday    June    10.0
2010-06-02   Tuesday   June    15.0  
2010-06-03   Wednesday June    13.0   
I have also tried using %in%:
Example$Qty[is.na(Example$Qty)] <- test1$median[Example$DayType %in% test1$DayType & Example$Month %in% test1$Month][is.na(Example$Qty)]
But that does not match correctly and only outputs a limited number of values rather than over the entire series of NAs.
Using na.aggregate via the Zoo package as cleverly suggested by @Jaap:
setDT(Example)[, Value := na.aggregate("Qty", FUN = median), by = c("DayType","Month")]
For some reason does not transform the NAs:
Output:
 Date         Month   DayType   Qty
 2009-05-01   May     Monday    NA
 2009-05-02   May     Tuesday   NA
 2009-05-03   May     Wednesday NA
 2009-06-01   June    Monday    NA
Any suggestions would be greatly appreciated! Thanks for sticking with this post for so long and look forward to paying the assistance forward in the future.
 
    