I have a dataset looks like this:
(Visualising the datasets below may help you to understand the question)
original <- data.frame(
ID = c(rep("John", 4), "Steve"),
A = as.integer(c(rep(3, 4), 1)),
b = c(2, 3, 4, 2, 2),
B = c(rep(4, 4), 2),
detail1 = c("Yes", "Sure", "No", "Yes", "Yes"),
detail2 = c(rep("Unique1", 4), "Unique2")
)
Values in A, B, and b are all integers. Variable b is incomplete in this dataset and it actually has values from 1 to the value of B.
I need to complete this dataset with a new variable a added, the completed dataset will look like this:
completed1 <- data.frame(
ID = c(rep("John", 12), rep("Steve", 2)),
a = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(1, 2)),
A = c(rep(3, 12), rep(1, 2)),
b = c(rep(1:4, 3), 1, 2),
B = c(rep(4, 12), rep(2, 2)),
detail1 = c("Absence", "Yes", "Sure", "No", "Absence", "Yes", rep("Absence", 7), "Yes"),
detail2 = c(rep("Unique1", 12), rep("Unique2", 2))
)
Values in a are integers too and a has values from 1 to the value of A. Note that b is nested in a.
I think the most relevant functions to complete a dataset in this way are tidyr::complete() and tidyr::expand(), but they can only complete combinations of values in existing variables, they cannot add a new column(variable).
I know the challenge is that there are multiple locations to allocate values in detail1 correspondingly to values in the newly added a through the nested relationship, for example, the completed dataset can also be this:
completed2 <- data.frame(
ID = c(rep("John", 12), rep("Steve", 2)),
a = c(rep(1, 4), rep(2, 4), rep(3, 4), rep(1, 2)),
A = c(rep(3, 12), rep(1, 2)),
b = c(rep(1:4, 3), 1, 2),
B = c(rep(4, 12), rep(2, 2)),
detail1 = c("Absence", "Yes", rep("Absence", 4), "Sure", "Absence", "Absence", "Yes", "Absence", "No", "Absence", "Yes"),
detail2 = c(rep("Unique1", 12), rep("Unique2", 2))
)
I would like to complete the dataset following the logic of completed1 above: values in detail1 go to the smallest value in a first, and if repeated values in b present (for example, the Yes in b under John in original dataset), the repeated value goes to the next value in a.
Is it possible to do this?
My actual dataset has more variables than this example and the completed dataset will have more than 700,000 rows, so I prefer fast methods to automate it.
Thanks very much!!!