A data.frame allows operations on column subsets using [ , dropping single-column outputs to vectors by default. dplyr's tibbles do not do this, deliberately (and seemingly because coding it was an absolute nightmare).
library(dplyr)
df <- data.frame(a = c(1:5, NA), b = c(1, 1, 1, 2, 2, 2))
mean(df[, "a"], na.rm = TRUE) # 3
dftbl <- as_tibble(df) # as.tbl() is deprecated in current dplyr
mean(dftbl[, "a"], na.rm = TRUE) # NA, with a warning
The advice is therefore to subset with [[, as this delivers uniform outputs for both data frames and tbl_dfs.
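To make that advice concrete, here is a minimal sketch of the column-only case. It assumes the tibble package is installed and uses as_tibble() rather than as.tbl(); the names df_mean and tbl_mean are just for illustration.

```r
library(tibble)  # assumed to be installed; provides as_tibble()

df <- data.frame(a = c(1:5, NA), b = c(1, 1, 1, 2, 2, 2))
dftbl <- as_tibble(df)

# [[ extracts the column itself, so mean() receives a plain numeric
# vector whether the input is a data.frame or a tibble
df_mean  <- mean(df[["a"]], na.rm = TRUE)
tbl_mean <- mean(dftbl[["a"]], na.rm = TRUE)

identical(df_mean, tbl_mean)
```

Both calls should agree here, because neither goes through the single-bracket drop behaviour.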
But that's fine for columns only or rows only, not for rows and columns together. Concerningly, the difference can be missed if you don't check the warnings (which is my own fault, admittedly), e.g.:
dfresult <- mean(df[df$b == 2, "a"], na.rm = TRUE) # 4.5
tblresult <- mean(dftbl[dftbl$b == 2, "a"], na.rm = TRUE) # NA_real_
Does anyone have any 'best practice' suggestions for performing column operations on row subsets? Is this where I should improve my dplyr game using filter & select? My attempts thus far keep hitting walls. Grateful for any golden rules. Thanks in advance.
dftbl %>% filter(b == 2) %>% select(a) %>% mean(na.rm = TRUE) # NA
This fails in the same way: the filtered & selected data is STILL an N x 1 tibble, which refuses to play with mean.
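A quick way to confirm what mean is actually being handed is to inspect the intermediate result. This is only a diagnostic sketch; sel is a name made up for the example.

```r
library(dplyr)

dftbl <- as_tibble(data.frame(a = c(1:5, NA), b = c(1, 1, 1, 2, 2, 2)))

# Keep the intermediate result instead of piping it straight into mean()
sel <- dftbl %>% filter(b == 2) %>% select(a)

is.data.frame(sel)  # select() returns a tibble, not a vector
dim(sel)            # rows x columns of the result
is.numeric(sel)     # a data frame is a list of columns, so not numeric
```

Since mean only handles numeric or logical input by default, a one-column tibble hits the same "returning NA" path as the bracket version.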
dftbl %>% filter(b == 2) %>% select(a) %>% as.data.frame() %>% .$a
# [1]  4  5 NA
But passing the same thing to mean fails:
dftbl %>% filter(b == 2) %>% select(a) %>% as.data.frame() %>% mean(.$a, na.rm = TRUE)
# [1] NA
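As far as I understand magrittr, when the dot only appears inside a nested call such as .$a, the pipe still inserts the left-hand side as the first argument, so mean ends up being called on the whole data frame. A hedged sketch of three possible ways around that, assuming a dplyr version that provides pull() (the res_* names are made up for the example):

```r
library(dplyr)

dftbl <- as_tibble(data.frame(a = c(1:5, NA), b = c(1, 1, 1, 2, 2, 2)))

# Option 1: pull() extracts one column as a vector before mean() sees it
res_pull <- dftbl %>% filter(b == 2) %>% pull(a) %>% mean(na.rm = TRUE)

# Option 2: braces stop %>% from also passing the data as the first argument
res_brace <- dftbl %>% filter(b == 2) %>% { mean(.$a, na.rm = TRUE) }

# Option 3: summarise() computes the mean inside the data frame and
# returns a one-row tibble with a column m
res_summ <- dftbl %>% filter(b == 2) %>% summarise(m = mean(a, na.rm = TRUE))
```

Which one fits best probably depends on whether a bare number or a data frame is wanted at the end of the pipeline.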