Create an indicator variable in one data frame based on values in another data frame

Question

Say I have the dataset iris and want to create an indicator variable called sepal_length_group in it. The values of this indicator will be p25, p50, p75, and p100. For example, sepal_length_group should equal "p25" for an observation if its Species is "setosa" and its Sepal.Length is less than or equal to the 25th percentile of Sepal.Length among all "setosa" observations. I wrote the following code, but it generates all NAs:

library(dplyr)
library(skimr)

sepal_length_distribution <- iris %>% group_by(Species) %>% skim(Sepal.Length) %>% select(3, 9:12)

iris_2 <- iris %>% mutate(sepal_length_group = ifelse(Sepal.Length <= sepal_length_distribution[which(sepal_length_distribution$Species == "setosa"),2], "p25", NA))

iris_2 <- iris %>% mutate(sepal_length_group = ifelse(Sepal.Length > sepal_length_distribution[which(sepal_length_distribution$Species == "setosa"),2] &
                                                Sepal.Length <= sepal_length_distribution[which(sepal_length_distribution$Species == "setosa"),3], "p50", NA))

iris_2 <- iris %>% mutate(sepal_length_group = ifelse(Sepal.Length > sepal_length_distribution[which(sepal_length_distribution$Species == "setosa"),3] &
                                                        Sepal.Length <= sepal_length_distribution[which(sepal_length_distribution$Species == "setosa"),4], "p75", NA))

iris_2 <- iris %>% mutate(sepal_length_group = ifelse(Sepal.Length > sepal_length_distribution[which(sepal_length_distribution$Species == "setosa"),4] &
                                                        Sepal.Length < sepal_length_distribution[which(sepal_length_distribution$Species == "setosa"),5], "p100", NA))

Any help will be highly appreciated!

A few posts that should help: https://stackoverflow.com/q/60291876/5325862, https://stackoverflow.com/q/42948306/5325862 — camille, May 19 '21 at 23:44
So you specifically want to use the skimr output? When you say an indicator variable do you mean that you basically want an ordered factor? — Elin, Jun 15 '21 at 12:04

score 2 · Accepted Answer · answered May 19 '21 at 23:46

This can be done with the function cut, as @camille suggested in the comments:

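library(tidyverse)

iris %>%
  group_by(Species) %>%
  # breaks: per-species min, quartiles, and max; labels name the four intervals
  mutate(cat = cut(Sepal.Length,
                   breaks = quantile(Sepal.Length, c(0, .25, .5, .75, 1)),
                   labels = paste0('p', c(25, 50, 75, 100)),
                   include.lowest = TRUE)) %>%  # keep the group minimum in p25
  ungroup()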
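
To see what cut does inside each group, here is a small sketch that applies the same call to the "setosa" rows only, using base R. The variable names x, breaks, and grp are illustrative and not part of the answer above.

```r
# Sepal.Length values for a single species
x <- iris$Sepal.Length[iris$Species == "setosa"]

# Five break points (min, quartiles, max) define four right-closed intervals
breaks <- quantile(x, c(0, .25, .5, .75, 1))

# include.lowest = TRUE puts the minimum value in the first interval
# instead of turning it into NA
grp <- cut(x, breaks,
           labels = paste0("p", c(25, 50, 75, 100)),
           include.lowest = TRUE)

# Count of observations per interval
table(grp)
```

The result is a factor with the levels p25, p50, p75, p100 in that order. Grouping by Species in the pipeline above repeats this calculation once per species.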

Onyambu

Thanks. This solves my particular problem. But I was thinking of a more general case, where the indicator variable is based on a particular cell in a different data frame. That's why I tried to use `which`. – Anup May 20 '21 at 01:24
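
For the more general case raised in this comment, one possible approach is to join the lookup data frame onto the main one and then compare columns row by row. Two things likely contribute to the NAs in the question's code: every mutate call starts again from iris, so only the last assignment survives, and indexing a tibble with [row, col] returns a 1x1 tibble rather than a single number. The sketch below assumes a lookup table with one row per Species and columns named p25, p50, and p75. Those names, and the table name cutoffs, are illustrative and are not skimr's actual column names.

```r
library(dplyr)

# Lookup table: one row per Species holding its cut-offs.
# Built with summarise() here; any data frame with the same
# layout (for example a renamed skimr summary) could be used instead.
cutoffs <- iris %>%
  group_by(Species) %>%
  summarise(p25 = quantile(Sepal.Length, .25, names = FALSE),
            p50 = quantile(Sepal.Length, .50, names = FALSE),
            p75 = quantile(Sepal.Length, .75, names = FALSE),
            .groups = "drop")

# Pulling one cell out as a plain number (instead of a 1x1 tibble)
setosa_p25 <- cutoffs$p25[cutoffs$Species == "setosa"]

iris_2 <- iris %>%
  # attach each row's own species cut-offs
  left_join(cutoffs, by = "Species") %>%
  # case_when() checks conditions in order, so each row gets the first match
  mutate(sepal_length_group = case_when(
    Sepal.Length <= p25 ~ "p25",
    Sepal.Length <= p50 ~ "p50",
    Sepal.Length <= p75 ~ "p75",
    TRUE                ~ "p100"
  )) %>%
  # drop the helper columns again
  select(-p25, -p50, -p75)
```

Because the join matches on Species, the same code covers all three species without a separate which call for each one.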
