I have some data which looks like:
head:
dfID date group groupValues
<chr> <date> <chr> <dbl>
1 df1 2020-03-01 grp1 0.175
2 df1 2020-03-01 grp2 0.150
3 df1 2020-03-01 grp3 0.0509
tail:
dfID date group groupValues
<chr> <date> <chr> <dbl>
1 df3 2020-06-29 grp7 0.705
2 df3 2020-06-29 grp8 0.473
3 df3 2020-06-29 grp9 0.900
This is time series data with 3 unique dfIDs and 9 unique groups. Filtering the date column to a single day gives (3 dfIDs and 9 groups):
df %>%
filter(date == "2020-03-01")
dfID date group groupValues
<chr> <date> <chr> <dbl>
1 df1 2020-03-01 grp1 0.175
2 df1 2020-03-01 grp2 0.150
3 df1 2020-03-01 grp3 0.0509
4 df2 2020-03-01 grp4 0.133
5 df2 2020-03-01 grp5 0.779
6 df2 2020-03-01 grp6 0.506
7 df3 2020-03-01 grp7 0.868
8 df3 2020-03-01 grp8 0.552
9 df3 2020-03-01 grp9 0.274
Next I want to split the data into pairwise combinations of the dfIDs.
combinedSplit <- combn(levels(as.factor(df$dfID)), m = 2, FUN = function(x)
df %>%
filter(dfID %in% x), simplify = FALSE)
names(combinedSplit) <- combn(levels(as.factor(df$dfID)), m = 2, str_c, collapse="_")
So now I have a list of 3 data frames, one for each pairwise combination of the dfIDs:
df1_df2
df1_df3
df2_df3
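As a quick sanity check on this split (just a check, not part of the workflow; it uses dplyr's n_distinct on the combinedSplit list above), each element should contain two dfIDs and six groups:
library(dplyr)
library(purrr)

# each element of the result should show dfIDs = 2 and groups = 6
map(combinedSplit, ~ c(dfIDs = n_distinct(.x$dfID), groups = n_distinct(.x$group)))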
Now I want to apply the rolling_origin function from the rsample package to split the time series into training and testing data. I can naively apply the function by mapping over the list.
rolledData <- combinedSplit %>%
map(., ~ rolling_origin(
data = .,
initial = 60,
assess = 1,
cumulative = FALSE,
skip = 0
)
)
I can access the tail of the first split for the combination df1_df2.
map(rolledData$df1_df2$splits, ~analysis(.x))[[1]] %>% tail()
dfID date group groupValues
<chr> <date> <chr> <dbl>
1 df1 2020-03-19 grp1 0.528
2 df1 2020-03-19 grp2 0.394
3 df1 2020-03-19 grp3 0.532
4 df1 2020-03-20 grp1 0.586
5 df1 2020-03-20 grp2 0.369
6 df1 2020-03-20 grp3 0.153
This is incorrect. In the rolling_origin call I set the training period to 60 periods (days), but this data ends on 20 March. That is because the function takes the first 60 observations of the data, not 60 days of the time series (3 groups * 20 days = 60 rows).
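To see the problem concretely, here is a quick check on the naive rolledData object (firstSplit is just a throwaway name):
# the first analysis set has 60 rows, but each date appears once per group,
# so it only covers 20 calendar days instead of 60
firstSplit <- analysis(rolledData$df1_df2$splits[[1]])
nrow(firstSplit)             # 60 rows
n_distinct(firstSplit$date)  # 20 distinct dates, not 60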
So I want to apply the rolling_origin function to each of the grps, with each grp having 60-day rolling windows.
Here, I thought it would be best to first nest() the groups and then apply the rolling_origin function, since each group would then be separate and 60 observations would correspond to 60 days for grp1, grp2 and grp3. Then, unnest() the groups to put grp1, grp2 and grp3 back into the same data frame.
nestedRolledData <- map(combinedSplit, ~ group_by(.x, group) %>%
  nest() %>%
  mutate(
    rolledData = map(data, ~ rolling_origin(
      data = .x,
      initial = 60,
      assess = 0,
      cumulative = FALSE,
      skip = 0
    ))
  )
)
I am having difficulty with this unnesting.
Inspecting the structure of the lists:
- layer 1 = contains the df1_df2, df1_df3 and df2_df3 combinations.
- layer 2 = contains the nested data with the rolledData column of interest.
- layer 3 = (expanding rolledData, which contains 6 lists) contains a list for each of the grps (each dfID has 3 grps).
- layer 4 = (expanding one of the grps) contains a splits list (generated from the rolling_origin function).
- layer 5 = (expanding the splits list) contains [[1]] ... [[63]]. The lists here correspond to each of the rolling_origin training splits. In the rolling_origin function I set initial = 60 and it creates 63 splits of the data between 2020-03-01 and 2020-06-30. (If I change initial = 90 I get 33 splits, so the number of lists here depends on the time span of the data and the initial argument of rolling_origin.)
I can access the first split using map(nestedRolledData$df1_df2$rolledData[[1]]$splits, ~analysis(.x))[[1]], which gives me a data frame of 60 observations starting on 2020-03-01 and ending on 2020-04-29. For the second split in this list, map(nestedRolledData$df1_df2$rolledData[[1]]$splits, ~analysis(.x))[[2]], I get another data frame of 60 observations, this time starting on 2020-03-02 and ending on 2020-04-30 (so the data has been shifted by a single day). I can do this up until [[63]], which starts on 2020-05-02 and ends on 2020-06-30 (the last day in my data).
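As a quick confirmation of the date ranges described above (firstGrpSplits is just a throwaway name):
firstGrpSplits <- nestedRolledData$df1_df2$rolledData[[1]]$splits
range(analysis(firstGrpSplits[[1]])$date)   # 2020-03-01 to 2020-04-29
range(analysis(firstGrpSplits[[63]])$date)  # 2020-05-02 to 2020-06-30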
This is what I want, i.e. the correct time series splits for each grp. Now I want to unnest these and put them back into the correct data frames. Going back to layer 3 of the list, which contains the 6 lists [[1]], ..., [[6]]: these correspond to each of the 3 groups for the 2 dfIDs, so I essentially want to merge these lists together.
If I were to construct these manually it might look like:
# grp 1:6 for rolling_origin split 1
map(nestedRolledData$df1_df2$rolledData[[1]]$splits, ~analysis(.x))[[1]]
map(nestedRolledData$df1_df2$rolledData[[2]]$splits, ~analysis(.x))[[1]]
map(nestedRolledData$df1_df2$rolledData[[3]]$splits, ~analysis(.x))[[1]]
map(nestedRolledData$df1_df2$rolledData[[4]]$splits, ~analysis(.x))[[1]]
map(nestedRolledData$df1_df2$rolledData[[5]]$splits, ~analysis(.x))[[1]]
map(nestedRolledData$df1_df2$rolledData[[6]]$splits, ~analysis(.x))[[1]]
# grp 1:6 for rolling_origin split 2
map(nestedRolledData$df1_df2$rolledData[[1]]$splits, ~analysis(.x))[[2]]
map(nestedRolledData$df1_df2$rolledData[[2]]$splits, ~analysis(.x))[[2]]
map(nestedRolledData$df1_df2$rolledData[[3]]$splits, ~analysis(.x))[[2]]
map(nestedRolledData$df1_df2$rolledData[[4]]$splits, ~analysis(.x))[[2]]
map(nestedRolledData$df1_df2$rolledData[[5]]$splits, ~analysis(.x))[[2]]
map(nestedRolledData$df1_df2$rolledData[[6]]$splits, ~analysis(.x))[[2]]
# ...
# grp 1:6 for rolling_origin split 63
map(nestedRolledData$df1_df2$rolledData[[1]]$splits, ~analysis(.x))[[63]]
map(nestedRolledData$df1_df2$rolledData[[2]]$splits, ~analysis(.x))[[63]]
map(nestedRolledData$df1_df2$rolledData[[3]]$splits, ~analysis(.x))[[63]]
map(nestedRolledData$df1_df2$rolledData[[4]]$splits, ~analysis(.x))[[63]]
map(nestedRolledData$df1_df2$rolledData[[5]]$splits, ~analysis(.x))[[63]]
map(nestedRolledData$df1_df2$rolledData[[6]]$splits, ~analysis(.x))[[63]]
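I imagine the generalisation of this manual pattern looks something like the sketch below (mergedSplits and nSplits are just placeholder names), where for each rolling_origin split index the analysis sets of all 6 grps are row-bound back together:
nSplits <- nrow(nestedRolledData$df1_df2$rolledData[[1]])  # 63 splits here

mergedSplits <- map(seq_len(nSplits), function(s) {
  # stack the analysis data of grps 1:6 for split s
  map_dfr(nestedRolledData$df1_df2$rolledData, ~ analysis(.x$splits[[s]]))
})
# mergedSplits[[1]] would then hold grps 1:6 for rolling_origin split 1, and so on
This feels clunky, hence my question about how best to unnest/merge these lists.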
Data:
library(tidyverse)  # dplyr, tidyr, tibble, purrr, stringr
library(rsample)
df1 = data.frame(
date = seq.Date(from = as.Date("2020-03-01"), to = as.Date("2020-06-30"), by = "days"),
grp1 = runif(122),
grp2 = runif(122),
grp3 = runif(122)
) %>%
pivot_longer(cols = where(is.numeric), names_to = "group", values_to = "groupValues") %>%
add_column(dfID = "df1")
df2 = data.frame(
date = seq.Date(from = as.Date("2020-03-01"), to = as.Date("2020-06-30"), by = "days"),
grp4 = runif(122),
grp5 = runif(122),
grp6 = runif(122)
) %>%
pivot_longer(cols = where(is.numeric), names_to = "group", values_to = "groupValues") %>%
add_column(dfID = "df2")
df3 = data.frame(
date = seq.Date(from = as.Date("2020-03-01"), to = as.Date("2020-06-30"), by = "days"),
grp7 = runif(122),
grp8 = runif(122),
grp9 = runif(122)
) %>%
pivot_longer(cols = where(is.numeric), names_to = "group", values_to = "groupValues") %>%
add_column(dfID = "df3")
df <- bind_rows(df1, df2, df3) %>%
relocate(dfID, .before = date)
combinedSplit <- combn(levels(as.factor(df$dfID)), m = 2, FUN = function(x)
df %>%
filter(dfID %in% x), simplify = FALSE)
names(combinedSplit) <- combn(levels(as.factor(df$dfID)), m = 2, str_c, collapse="_")
rolledData <- combinedSplit %>%
map(., ~ rolling_origin(
data = .,
initial = 60,
assess = 1,
cumulative = FALSE,
skip = 0
)
)
map(rolledData$df1_df2$splits, ~analysis(.x))[[1]] %>% tail()
nestedRolledData <- map(combinedSplit, ~ group_by(.x, group) %>%
  nest() %>%
  mutate(
    rolledData = map(data, ~ rolling_origin(
      data = .x,
      initial = 60,
      assess = 0,
      cumulative = FALSE,
      skip = 0
    ))
  )
)