I have a dataset where each id has multiple samples, and each sample belongs to a group. I would like to do random sampling, stratified by group, without repeating any id (i.e. each id appears only once in the output).
I have tried to modify some existing solutions; however, all of them seem to return multiple samples from a single id across the groups:
- random sampling - matrix
- Stratified random sampling from data frame
- Stratified random sampling in R
- Stratified random sampling from data frame
I have tried the following, thinking replace = FALSE may help to ensure that only 1 sample from each id is used, but this still does not do what I want.
library(dplyr)

set.seed(1)

# Data
data <- data.frame(
  id = c("A", "C", "B", "D", "E", "F", "A", "A", "B", "B", "B", "D", "D", "E", "E", "F"),
  group = c("1", "1", "2", "2", "3", "3", "2", "1", "1", "2", "3", "2", "3", "2", "1", "3"),
  length = c("54", "52", "43", "42", "60", "46", "59", "60", "51", "45", "47", "58", "48", "46", "56", "57")
)

# Stratified random sampling by group
sample <- data %>%
  distinct() %>%
  group_by(group) %>%
  sample_n(2, replace = FALSE) %>%
  left_join(data)
sample output:
id  group  length
A   1      60
C   1      52
D   2      42
A   2      59
B   3      47
E   3      60
However, as seen above, id A appears in both group 1 and group 2. The ideal output should look something like this, where each id appears only once and samples are stratified by group:
id  group  length
A   1      54
C   1      52
B   2      43
D   2      42
E   3      60
F   3      46
Is there a way to customise the existing solutions so that, when sampling for each group, an id that has already been used for another group is excluded and cannot be sampled again? I know I can add %>% distinct(id) to my code, but I believe the result would no longer be random, as distinct() just keeps the first row for each id. Thank you for any help!
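To make the exclusion idea concrete, here is a minimal base R sketch of what I have in mind. It is only an illustration under some assumptions, not a tested solution: draw_once and sample_unique_ids are hypothetical helper names (not from any package), groups are processed in sorted order, and because a single pass can run out of unused ids for a later group, the sketch simply retries up to max_tries times. That retry step is my own assumption and may not give every valid combination the same probability.

```r
data <- data.frame(
  id = c("A", "C", "B", "D", "E", "F", "A", "A", "B", "B", "B", "D", "D", "E", "E", "F"),
  group = c("1", "1", "2", "2", "3", "3", "2", "1", "1", "2", "3", "2", "3", "2", "1", "3"),
  length = c("54", "52", "43", "42", "60", "46", "59", "60", "51", "45", "47", "58", "48", "46", "56", "57")
)

# Hypothetical helper: one pass over the groups.
# Returns NULL if some group has fewer than n unused ids left.
draw_once <- function(df, n) {
  used <- character(0)
  picked <- list()
  for (g in sort(unique(df$group))) {
    # Rows of this group whose id has not been drawn for an earlier group
    pool <- df[df$group == g & !(df$id %in% used), , drop = FALSE]
    # Shuffle, then keep the first row per id: one random row for each id
    pool <- pool[sample.int(nrow(pool)), , drop = FALSE]
    pool <- pool[!duplicated(pool$id), , drop = FALSE]
    if (nrow(pool) < n) return(NULL)
    # Draw n distinct ids for this group and mark them as used
    chosen <- pool[sample.int(nrow(pool), n), , drop = FALSE]
    used <- c(used, as.character(chosen$id))
    picked[[length(picked) + 1]] <- chosen
  }
  out <- do.call(rbind, picked)
  rownames(out) <- NULL
  out
}

# Hypothetical wrapper: retry until a pass succeeds (assumption, see above)
sample_unique_ids <- function(df, n, max_tries = 100) {
  for (attempt in seq_len(max_tries)) {
    out <- draw_once(df, n)
    if (!is.null(out)) return(out)
  }
  stop("No valid sample found in ", max_tries, " attempts")
}

set.seed(1)
result <- sample_unique_ids(data, n = 2)
result
```

With the data above, a pass can fail: for example, if group 1 takes A and B, group 2 is left with only D and E, and group 3 then has only F available. That is why the sketch retries rather than stopping at the first dead end.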