r: randomly assigning "1" or "2" in a vector based on double-occurrences in another vector

Question

I wrote the code below. It should assign the value 1 or 2 to v2 whenever an element of v1 occurs twice. For example, "A" appears twice in v1, so v2 should read 1 in one of those rows and 2 in the other.

The code mostly works, but in some cases both occurrences of an element in v1 are assigned the same number in v2, which should not happen.

Can anybody help me with the issue? Thanks!

v1 <- c(rep(c("A","B","C","D","E","F","G"),rep(2,7)),c("H","I","J","K"))
v2 <- rep(3,length(v1))
df1 <- data.frame(v1,v2)

for (i in 1:length(df1$v1)) {

  if (sum(df1$v1[i]==df1$v1)==2 & df1$v2[i]==3) {

    df1$v2[i] <- sample(c(1,2),1,replace=TRUE)

  } else if (sum(df1$v1[i]==df1$v1)==2 & df1$v2[i]==1) {

    df1$v2[i] <- 2

  } else if (sum(df1$v1[i]==df1$v1)==2 & df1$v2[i]==2) {

    df1$v2[i] <- 1 

  } else { 

    df1$v2[i] <- 2
  }
}

Couple of questions: Why "randomly"? Seems like you have specific conditions. Is 2 the maximum number of re-occurrences? And what's the default for the values which only have one occurrence, 3? – Val Apr 03 '18 at 10:04
Not very clear what you're looking for. You can show your initial dataset and then your expected output. I posted an answer below based on what I think you want to do... – AntoniosK Apr 03 '18 at 10:16
Hi Val, v2 will be used for sampling purposes. v1 is the ID of my study participants, and I want to retain only one observation per participant. In a later step, I will therefore select only those observations where v2 = 1. – Jens Stach Apr 03 '18 at 10:17
The value 3 is arbitrary. The final version of v2 should only contain 1 and 2: 1 for one randomly selected observation among those sharing the same value in v1, and 2 for all other cases. – Jens Stach Apr 03 '18 at 10:24
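
The clarification in the comments above (1 for one randomly chosen observation per ID, 2 for every other observation) can be sketched in base R roughly as follows. This is only an illustration under that reading of the requirement, not code from the thread; the names `pick` and `one_per_id` are made up here.

```r
# Sketch (assumption: exactly one row per v1 value gets 1, all other rows get 2)
set.seed(1)  # only to make the illustration reproducible

v1 <- c(rep(c("A","B","C","D","E","F","G"), rep(2, 7)), c("H","I","J","K"))
df1 <- data.frame(v1)

# Within each v1 group, draw a random permutation of 1..n (n = group size)
pick <- ave(seq_along(df1$v1), df1$v1, FUN = function(i) sample(length(i)))

# The row that drew position 1 is the randomly selected one
df1$v2 <- ifelse(pick == 1, 1, 2)

# Later step: keep one observation per participant
one_per_id <- df1[df1$v2 == 1, ]
```

Because the permutation is drawn per group, the two rows of a pair can never end up with the same value, and the row order of `df1` is left as it was.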

Relasta · Accepted Answer · 2018-04-03T10:32:23.327

I think I have understood what you require, and hopefully the code below does what you want, using dplyr. It randomly assigns integer values from 1 to n, where n is the number of occurrences of a given letter (note that this generalizes your requirement of 2 occurrences).

library(dplyr)
df1 <- data.frame(v1 = c(rep(c("A","B","C","D","E","F","G"),rep(2,7)),c("H","I","J","K")))

df1 <- df1 %>% 
         group_by(v1) %>% 
         mutate(v2 = case_when(n() > 1 ~ sample(c(1:n()), n(), replace = FALSE), 
                                  TRUE ~ 1L))

edited Apr 03 '18 at 10:32

answered Apr 03 '18 at 10:26


Thanks Relasta, the code looks very nice. Yet when I execute it, R gives me the following: Error: This function should not be called directly. Any idea what's wrong? – Jens Stach Apr 03 '18 at 12:36
Are you using the `plyr` package as well? You may have conflicts between `plyr` and `dplyr` which both have `mutate` functions. I loaded `plyr` and got the same error as you so I imagine that is the problem. A quick fix is to change `mutate` to `dplyr::mutate`. See [this](https://stackoverflow.com/questions/22801153/dplyr-error-in-n-function-should-not-be-called-directly) question for more detail – Relasta Apr 03 '18 at 12:45
tried your suggestion, unfortunately I get this now: Error in mutate_impl(.data, dots) : Evaluation error: RHS of case 1 (sample(c(1:2L), 2L, replace = FALSE)) must be length 1 (the first output), not 2. – Jens Stach Apr 03 '18 at 12:56
Make sure you are using the most recent version of `dplyr` – Relasta Apr 03 '18 at 13:12
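
One possible way around both errors reported in the comments above, offered as a sketch rather than as the answerer's code: `sample(n())` already returns 1 for a group of size one and a random permutation of 1..n otherwise, so `case_when()` can be dropped entirely. Calling `dplyr::mutate` explicitly sidesteps the `plyr` masking mentioned above.

```r
library(dplyr)

set.seed(1)  # only for a reproducible illustration

df1 <- data.frame(v1 = c(rep(c("A","B","C","D","E","F","G"), rep(2, 7)),
                         c("H","I","J","K")))

df1 <- df1 %>%
  group_by(v1) %>%
  dplyr::mutate(v2 = sample(n())) %>%  # random permutation of 1..n per group
  ungroup()                            # drop the grouping again
```

Each expression inside `mutate()` now returns exactly one value per row of the group, so there is no mixed-length right-hand side for `case_when()` to complain about.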

Anders Ellern Bilgrau · Answer 2 · 2018-04-03T15:53:15.750

Using base R, I think you can arrive at what you want fairly easily by combining table and sequence and manipulating the output.

Edit: After your comments, I now think I understand what you want.

res <- data.frame(v1, v2 = sequence(table(v1)), row.names = NULL)
res <- res[sample(1:nrow(res)), ] # Scramble data order
res <- res[order(res$v1), ] # Reorder by v1 column 
#     v1 v2
#1    A  1
#2    A  2
#3    B  1
#4    B  2
#5    C  1
#6    C  2
#7    D  2  # note 2 comes first here
#8    D  1
#9    E  1
#10   E  2
#11   F  1
#12   F  2
#13   G  1
#14   G  2
#15   H  1
#16   I  1
#17   J  1
#18   K  1

Edit 2: "randomly" sorting before assigning, which leaves the row order of the data frame unchanged:

df1 <- data.frame(v1)
df1[order(rank(v1, ties.method = "random")), "v2"] <- sequence(table(v1))
df1
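
One way to convince yourself that the second edit does what was asked is to check that every group of v1 receives each of 1..n exactly once while the row order stays untouched. The check below is an illustrative sketch, not part of the original answer; `idx`, `order_kept` and `groups_ok` are names introduced here.

```r
set.seed(42)  # only for a reproducible illustration

v1 <- c(rep(c("A","B","C","D","E","F","G"), rep(2, 7)), c("H","I","J","K"))
df1 <- data.frame(v1)

# rank() with random tie-breaking sorts rows by v1 but shuffles rows
# that share the same v1 value among themselves
idx <- order(rank(v1, ties.method = "random"))

# sequence(table(v1)) yields 1..n for each group, in sorted group order
df1[idx, "v2"] <- sequence(table(v1))

# The rows of df1 are still in their original order
order_kept <- identical(as.character(df1$v1), v1)

# Every group contains each of 1..n exactly once
groups_ok <- all(tapply(df1$v2, df1$v1,
                        function(z) all(sort(z) == seq_along(z))))
```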

edited Apr 03 '18 at 15:53

answered Apr 03 '18 at 10:06


Hi Anders, what you suggest is elegant, but not exactly what I am after. The output should keep the data.frame as it is. In a later step I want to filter out those cases where v2 equals 2, keeping only one randomly selected observation per double-occurring case in v1. Hope this clarifies what I am after. – Jens Stach Apr 03 '18 at 10:22
Hi Anders, the way I understand your answer, the second observation of a letter in v1 would always receive the number 2 and the first observation always the number 1. Yet I want the assignment of 1 and 2 to the first and second observation of any letter to be random. Any idea how to adjust your code accordingly? – Jens Stach Apr 03 '18 at 12:52
Both yes and no. It is random, but it is the rows that are scrambled afterwards to give the same result, so the numbers are not *assigned* randomly. Whether that is applicable obviously depends on your data. I have now added another option that assigns the numbers randomly. – Anders Ellern Bilgrau Apr 03 '18 at 15:47
Great! Thank you Anders :) – Jens Stach Apr 03 '18 at 19:38

AntoniosK · Answer 3 · 2018-04-03T10:39:29.530

v1 <- c(rep(c("A","B","C","D","E","F","G"),rep(2,7)),c("H","I","J","K"))
value = 1:length(v1)
v2 <- rep(3,length(v1))
df1 <- data.frame(v1,value,v2)

library(dplyr)

set.seed(9)

df1 %>%
  sample_frac(1) %>%             # shuffle rows
  group_by(v1) %>%               # for each v1 value
  mutate(v2 = row_number()) %>%  # count and flag occurrences
  ungroup() %>%                  # forget the grouping
  arrange(v1)                    # order by v1 (only for visualisation purposes)

# # A tibble: 18 x 3
#   v1    value    v2
#   <fct> <int> <int>
# 1 A         1     1
# 2 A         2     2
# 3 B         4     1
# 4 B         3     2
# 5 C         5     1
# 6 C         6     2
# 7 D         7     1
# 8 D         8     2
# 9 E         9     1
#10 E        10     2
#11 F        12     1
#12 F        11     2
#13 G        14     1
#14 G        13     2
#15 H        15     1
#16 I        16     1
#17 J        17     1
#18 K        18     1

edited Apr 03 '18 at 10:39

answered Apr 03 '18 at 10:15


Ah, there is a catch. In your case the second observation of any pair in v1 always has number 2 and the first number 1. How can you randomize this? – Jens Stach Apr 03 '18 at 10:28
I think that the easiest way to do it is to add a randomisation step before you apply that process. The randomisation step will shuffle the observations (i.e. rows) of the dataset. I'll update my answer. – AntoniosK Apr 03 '18 at 10:29
I've added column `value` so you can compare `df1` before and after the process. You'll see that shuffling the rows leads to a random assignment of 1s and 2s. – AntoniosK Apr 03 '18 at 10:36
AntoniosK, when I execute your code R gives me the following: Error in rank(x, ties.method = "first", na.last = "keep") : argument "x" is missing, with no default. Any idea what's wrong? – Jens Stach Apr 03 '18 at 12:50
I think it has to do with loading `dplyr` and then `plyr` packages. Try to use `dplyr::mutate` instead of `mutate` and see if it solves the issue. This will tell the code to use `mutate` from package `dplyr`. If this doesn't work try to work in a new R session (i.e. no variables and packages attached) and run my code and see if you get the same error. – AntoniosK Apr 03 '18 at 12:56
works now. Thank you :) Just to double check: the assignment of 1 and 2 is random, i.e. the first observation of any letter-pair in v1 can randomly be 1 or 2, correct? – Jens Stach Apr 03 '18 at 13:01
Yes it is. But the way it's done is by shuffling the rows (i.e. the first observation might become the second) and then assigning 1s and 2s (in that order). In other words, applying a specific order to a randomised dataset means that the assignments are random. Try running the piped commands step by step to see how it works. – AntoniosK Apr 03 '18 at 13:04
sweet. Thank you AntoniosK. Very nice! :) – Jens Stach Apr 03 '18 at 13:09
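
The shuffle-then-number idea described in the comments above can also be sketched in base R, with one extra step that restores the original row order afterwards. This is an illustration under the assumption that the data frame should come back in its original order; `shuffled` and `result` are names introduced here, and the `value` column is reused as the original row position.

```r
set.seed(9)  # only for a reproducible illustration

v1 <- c(rep(c("A","B","C","D","E","F","G"), rep(2, 7)), c("H","I","J","K"))
df1 <- data.frame(v1, value = seq_along(v1))  # value = original row position

# Step 1: shuffle the rows
shuffled <- df1[sample(nrow(df1)), ]

# Step 2: number the rows within each v1 group in their shuffled order
shuffled$v2 <- ave(seq_along(shuffled$v1), shuffled$v1, FUN = seq_along)

# Step 3: put the rows back into their original order
result <- shuffled[order(shuffled$value), ]
```

Comparing `result$v2` with `result$value` shows which observation of each pair was numbered first by the shuffle.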
