I want to categorize a numeric feature of my data (genomic data, but that doesn't matter here). To do this, I took the min and max of the feature, built a sequence between them in steps of 0.01, and then split that sequence into 10 equal-sized groups. From that I made a list with the category labels (1-10) as names and the corresponding slice of the sequence as each value.
snv_filt$repl_timing <- round(snv_filt$repl_timing, 2)
# use names that do not shadow base::min / base::max
min_timing <- min(snv_filt$repl_timing, na.rm = TRUE)
max_timing <- max(snv_filt$repl_timing, na.rm = TRUE)
sequence <- seq(min_timing, max_timing, by = 0.01)
tibble(value = sequence, key = ntile(sequence, 10)) %>%
  group_by_at(vars(-value)) %>%  # group by everything other than the value column. 
  mutate(row_id=1:n()) %>% ungroup() %>%  # build group index
  spread(key, value) %>%    # spread
  dplyr::select(-row_id) -> categories
timing_categories <- list(
  "1" = categories$`1`[!is.na(categories$`1`)],
  "2" = categories$`2`[!is.na(categories$`2`)],
  "3" = categories$`3`[!is.na(categories$`3`)],
  "4" = categories$`4`[!is.na(categories$`4`)],
  "5" = categories$`5`[!is.na(categories$`5`)],
  "6" = categories$`6`[!is.na(categories$`6`)],
  "7" = categories$`7`[!is.na(categories$`7`)],
  "8" = categories$`8`[!is.na(categories$`8`)],
  "9" = categories$`9`[!is.na(categories$`9`)],
  "10" = categories$`10`[!is.na(categories$`10`)] 
  )
Then I tried to assign each row to a category:
snv_filt$strand_group <- NA
for (i in 1:length(timing_categories)) {
  snv_filt[which(snv_filt$repl_timing %in% timing_categories[[i]]), "strand_group"] <- names(timing_categories)[i]  
  print(names(timing_categories)[i])
}
Surprisingly, there were a lot of NAs in the new column. When I checked some of the affected values, for example -0.42, I got this:
> timing_categories$"3"
[1] -0.58 -0.57 -0.56 -0.55 -0.54 -0.53 -0.52 -0.51 -0.50 -0.49 -0.48 -0.47 -0.46 -0.45 -0.44 -0.43 -0.42 -0.41 -0.40 -0.39
> -0.42 %in% timing_categories$"3"
[1] FALSE
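A smaller example that does not depend on my data seems to show the same behaviour. This is only a sketch: I am assuming (not certain) that the cause is floating-point representation, where `seq()` builds its values by arithmetic and they can differ from the typed literal in the last bits even though they print the same. The variable names here are made up for the illustration.

```r
# Build a sequence by arithmetic, the same way my categories were built
s <- seq(0, 1, by = 0.1)

# Exact matching: the 4th element is computed as 3 * 0.1, which is not
# bit-for-bit identical to the literal 0.3
direct <- 0.3 %in% s            # FALSE

# Comparing with a small tolerance instead of exact equality
tolerant <- any(abs(s - 0.3) < 1e-8)

# Rounding the sequence the same way the data column was rounded
rounded <- 0.3 %in% round(s, 2)

print(c(direct = direct, tolerant = tolerant, rounded = rounded))
```

If this is the same effect, then my data column was rounded with `round(..., 2)` but the sequence it is compared against was not, which would explain why `%in%` fails for values that look identical when printed.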
What the heck? Is this some quirk of numeric comparison that I don't know about? I would appreciate it if you could help me.
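For reference, here is a minimal sketch of the kind of binning I am after, written with `cut()` so that values are assigned by interval rather than by exact equality. The `repl_timing` vector below is made-up data standing in for `snv_filt$repl_timing`, and I am assuming that ten equal-width intervals between the min and max are an acceptable substitute for my `ntile()` groups; the bin edges may not match them exactly.

```r
# Made-up values standing in for snv_filt$repl_timing (hypothetical data)
repl_timing <- c(-1.00, -0.42, 0.00, 0.37, 1.00, NA)

# Eleven break points give ten equal-width intervals between min and max
breaks <- seq(min(repl_timing, na.rm = TRUE),
              max(repl_timing, na.rm = TRUE),
              length.out = 11)

# cut() places each value in an interval, so no exact equality test is needed;
# include.lowest = TRUE keeps the minimum value inside the first interval
strand_group <- cut(repl_timing, breaks = breaks,
                    labels = as.character(1:10), include.lowest = TRUE)

print(data.frame(repl_timing, strand_group))
```

With this approach only genuinely missing input values should end up as NA in the new column.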
