I have a data.table in which I want to detect the presence of outliers (according to skewness and kurtosis) and, if found, correct them.
To this purpose, when an outlier is detected and var is the highest value, I want to set the highest value in var to be equal to the second highest. Below is a minimal (almost) working example of my code:
`%>%` <- fastpipe::`%>>%`
country <- rep(c("AA", "BB", "CC", "ZZ"), times = 4)
year <- rep(c("2014", "2015", "2016", "2017"), each = 4)
var <- c(NA, rnorm(8, 2, 4), NA, NA, 1, 25, 19, 2, 3)
melted_data <- data.table(country, year, var)
melted_data %>%
.[, skew := e1071::skewness(var, na.rm = TRUE), by=year] %>%
.[, kurt:= moments::kurtosis(var, na.rm = TRUE), by=year] %>%
.[, outliers := kurt>1 || kurt>3.5 & abs(skew)>2, by=year] %>%
# Ranks
.[, rank_high_first := as.integer(frank(-var, na.last="keep", ties.method="min")), by=year] %>%
.[, rank_low_first := as.integer(frank(var, na.last="keep", ties.method="min")), by=year] %>%
# Identify and correct outliers
.[rank_high_first==1, highest1 := var, by=year] %>%
.[rank_high_first==2, highest2 := var, by=year] %>%
.[rank_low_first==1, lowest1 := var, by=year] %>%
.[rank_low_first==2, lowest2 := var, by=year] %>%
.[outliers==TRUE & skew>0 & var==highest1, var<-highest2, by=year]
What I am trying to achieve is all in the last row. However, this does not work because the values highest1 and highest2 do not span the whole year group (edit: see also screenshot below). I think the solution would be to modify the following lines
.[rank_high_first==1, highest1 := var, by=year] %>%
.[rank_high_first==2, highest2 := var, by=year] %>%
so that highest1 and highest2 are copied to all rows in that year. How can I achieve that? I also tried the following, which did not work:
.[, highest1 := var[rank_high_first==1], by=year]
