I think you can use a translation dictionary to solve this relatively easily.
genRandomIDs <- function(ID, min_length = 2) {
  ID <- unique(ID)
  len <- max(min_length, length(ID) %/% 26 + 1)
  ltrs <- do.call(paste0,
    do.call(expand.grid, replicate(len, LETTERS, simplify=FALSE))
  )
  sample(ltrs, length(ID))
}
library(dplyr)
set.seed(42)
IDdict <- df %>%
distinct(ID) %>%
mutate(newID = genRandomIDs(ID))
IDdict
# ID newID
# 1 1 OV
# 2 2 IM
# 3 3 WF
df %>%
left_join(IDdict, by = "ID")
# ID Name newID
# 1 1 Joseph OV
# 2 1 Joseph OV
# 3 2 Leo IM
# 4 2 Leo IM
# 5 1 Joseph OV
# 6 3 David WF
Walk-through:
genRandomIDs is just a helper function that internally produces a vector of all n-long letter permutations (combined with paste0) and samples from them;
- the do.call(expand.grid, ...) gives us a frame that expands on each len grouping of letters; that is, expand.grid(LETTERS[1:3],LETTERS[1:3],LETTERS[1:3]) gives us 3^3 permutations of three letters
- the do.call(paste0, ...) takes that frame from expand.grid (which is really just a glorified list) and produces one string per "row".
distinct(ID) reduces your df to just one row per ID;
- since we produce one newID for each unique ID, we now have a 1-to-1 mapping from old-to-new;
- the left_join assigns the newID for each row (if you aren't familiar with merges/joins, see "How to join (merge) data frames (inner, outer, left, right)" and "What's the difference between INNER JOIN, LEFT JOIN, RIGHT JOIN and FULL JOIN?")
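To make the two do.call steps concrete, here is a small sketch on a three-letter alphabet with len = 2. The names alphabet and grid are only for illustration; they are not part of genRandomIDs.

```r
alphabet <- LETTERS[1:3]   # illustrative: a 3-letter alphabet instead of all 26
len <- 2

# expand.grid: one column per letter position, one row per combination
grid <- do.call(expand.grid, replicate(len, alphabet, simplify = FALSE))
nrow(grid)                 # 3^2 rows

# paste0 over the columns: one string per row of the grid
ltrs <- do.call(paste0, grid)
ltrs                       # every two-letter string from "AA" to "CC"
```

With the full LETTERS vector the same two calls give 26^len strings, which is where the cost comes from.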
Note: this does not really scale well. Since we explode the possible combinations with expand.grid, a min-length of 2 letters produces 676 (26^2) strings, which is not a problem. Three letters produce 17576 (26^3) possible combinations, whether or not we have that many IDs to uniquify. Four letters produce 456976 (26^4), and the delay is "palpable". Five letters is over 11 million (26^5 = 11881376), at which point it becomes "stupid" to try to scale to that length (assuming you have that many unique IDs or choose a string that long). Keep in mind that len is length(ID) %/% 26 + 1, so it grows quickly: 52 unique IDs already mean 3 letters, and 78 mean 4.
However, while inefficient, this method is guaranteed to give you unique newIDs. There are other ways that can also be guaranteed, at the expense of a (however small) increase in complexity.
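As one illustration of keeping random, fixed-width IDs without pasting together every possible string, the sketch below draws distinct integers with sample.int and decodes only those into letters. This is not the function above: genRandomIDs2 is a hypothetical name, and it takes the number of IDs and the width directly as arguments, which is an assumption on my part.

```r
# Hypothetical helper, for illustration only: n unique random strings of
# exactly `len` upper-case letters, without building all 26^len strings.
genRandomIDs2 <- function(n, len = 2) {
  stopifnot(n <= 26^len)               # cannot draw more unique IDs than exist
  idx <- sample.int(26^len, n) - 1     # distinct values in 0 .. 26^len - 1
  out <- character(n)
  for (i in seq_len(len)) {
    # peel off one base-26 digit per pass, least-significant first
    out <- paste0(LETTERS[idx %% 26 + 1], out)
    idx <- idx %/% 26
  }
  out
}

set.seed(42)
ids <- genRandomIDs2(500, len = 2)
anyDuplicated(ids)   # 0, since distinct indices decode to distinct strings
```

Uniqueness holds because sample.int draws without replacement by default and a fixed-width base-26 encoding maps different integers to different strings.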
Okay, here is the "increased complexity" for a more efficient process:
num2alpha <- function(num, chr = letters, zero = "", sep = "") {
  len <- length(chr)
  stopifnot(len > 1)
  # remember the sign, then work on the non-negative integer part
  signs <- ifelse(!is.na(num) & sign(num) < 0, "-", "")
  num <- as.integer(abs(num))
  is0 <- !is.na(num) & num < 1e-9
  # num[num < 1] <- NA
  out <- character(length(num))
  # peel off one "digit" per pass, least-significant first
  while (any(!is.na(num) & num > 0)) {
    not0 <- !is.na(num) & num > 0
    out[not0] <- paste0(chr[(num[not0] - 1) %% len + 1], sep, out[not0])
    num[not0] <- (num[not0] - 1) %/% len
  }
  # drop the trailing separator left over from the first pass
  if (nzchar(sep)) out <- sub(paste0(sep, "$"), "", out)
  out[is0] <- zero
  out[is.na(num)] <- NA
  out[!is.na(out)] <- paste0(signs[!is.na(out)], out[!is.na(out)])
  out
}
IDdict <- df %>%
distinct(ID) %>%
mutate(newID = num2alpha(row_number()))
IDdict
# ID newID
# 1 1 a
# 2 2 b
# 3 3 c
df %>%
left_join(IDdict, by = "ID")
# ID Name newID
# 1 1 Joseph a
# 2 1 Joseph a
# 3 2 Leo b
# 4 2 Leo b
# 5 1 Joseph a
# 6 3 David c
num2alpha works more efficiently (it uses lower-case letters here, easily changed with num2alpha(.., chr=LETTERS)), though it is deterministic: IDs are labeled in order of first appearance. If you are at all concerned about that, then
IDdict <- df %>%
distinct(ID) %>%
mutate(newID = sample(num2alpha(row_number())))
will randomize them for you.
Note that this produces single-letter strings up through 26, then cycles through 2-letter and 3-letter strings. It also recognizes negatives, and while the default output for 0 is an empty string, that can be changed with the zero= argument:
num2alpha(c(-5, 0, NA, 1, 25:27, 51:53, 999999), zero="0")
# [1] "-e" "0" NA "a" "y" "z" "aa" "ay" "az" "ba" "bdwgm"
(Note that this is not a simple base-converter, since we're ignoring "0"-values. Try num2alpha(14:16, c(1:9, LETTERS[1:6]), zero="0"). Perhaps it can be made to be more general.)
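The reason it is not a plain base conversion is the "- 1" applied before each %% and %/%: there is no zero digit, so the digits run 1..k rather than 0..k-1 (this is sometimes called bijective numeration). A minimal scalar sketch of the same loop, with to_bijective as a made-up name used only for illustration:

```r
# Hypothetical scalar version of the loop inside num2alpha (positive n only).
to_bijective <- function(n, chr = letters) {
  k <- length(chr)
  out <- ""
  while (n > 0) {
    out <- paste0(chr[(n - 1) %% k + 1], out)  # digits are 1..k, never 0
    n <- (n - 1) %/% k
  }
  out
}

to_bijective(26)                          # "z"
to_bijective(27)                          # "aa"
to_bijective(16, c(1:9, LETTERS[1:6]))    # "11", where a base-16 converter gives "10"
```

With the 15 symbols 1-9 and A-F, 15 is written "F" and 16 rolls over to "11", because "10" cannot be written without a zero symbol.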