I'm trying to automatically calculate the mean score per row for multiple groups of columns. For example, each group of columns could hold the items of one scale. The columns are named systematically (scale_itemnumber).
The dummy data frame below has items from three different scales. Not every item of a scale is necessarily included; here, for example, VAR_3 is missing.
library(tidyverse)
set.seed(123)
df <- tibble(  G_1 =  sample(1:5, size = 10000, replace = TRUE),
               G_2 =  sample(1:5, size = 10000, replace = TRUE),
               G_3 =  sample(1:5, size = 10000, replace = TRUE),
             MOT_1 =  sample(1:5, size = 10000, replace = TRUE),
             MOT_2 =  sample(1:5, size = 10000, replace = TRUE),
             MOT_3 =  sample(1:5, size = 10000, replace = TRUE),
             VAR_1 =  sample(1:5, size = 10000, replace = TRUE),
             VAR_2 =  sample(1:5, size = 10000, replace = TRUE),
             VAR_4 =  sample(1:5, size = 10000, replace = TRUE))
What I'm trying to do is to create an extra column for each construct (with dynamic names such as mean_G, mean_MOT, mean_VAR) that represents the row mean for their respective set of columns.
# A tibble: 6 x 12
    G_1   G_2   G_3 MOT_1 MOT_2 MOT_3 VAR_1 VAR_2 VAR_4 mean_G mean_MOT mean_VAR
  <int> <int> <int> <int> <int> <int> <int> <int> <int>  <dbl>    <dbl>    <dbl>
1     3     3     1     1     1     1     1     5     4   2.33     1        3.33
2     3     5     3     3     2     1     4     3     5   3.67     2        4   
3     2     5     4     5     3     2     4     1     1   3.67     3.33     2   
4     2     5     4     4     4     1     2     5     4   3.67     3        3.67
5     3     4     2     1     4     5     2     2     3   3        3.33     2.33
6     5     3     4     4     3     4     1     1     4   4        3.67     2   
I have a working approach that combines rowwise() and c_across() with purrr, but it runs far slower than doing it manually (a mutate() + rowMeans() combination). However, the real data frame has many more scales with many more items, so I would rather not hard-code every mean column and list each item by hand, especially as the exact selection of items may vary between data frames.
# Functional but slow approach
# Get the unique variable prefixes (everything before the last underscore)
var_names <- str_extract(names(df), "^.*(?=(_))") %>% 
  unique()
# Use map_dfc() and c_across() to calculate the row means per variable group;
# `.y = .` forwards df to the lambda as `.y`
df_functional <-
  df %>% 
  bind_cols(
    map_dfc(.x = var_names, 
            .f = ~ .y %>% 
              rowwise() %>% 
              transmute(!!str_c("mean_", .x) := mean(c_across(starts_with(.x)))),
            .y = .))
# Manual approach: one hard-coded rowMeans() call per scale
df_manual <- df %>% mutate(mean_G   = rowMeans(select(., G_1,   G_2,   G_3)),
                           mean_MOT = rowMeans(select(., MOT_1, MOT_2, MOT_3)),
                           mean_VAR = rowMeans(select(., VAR_1, VAR_2, VAR_4)))
The results are identical, but the dynamic/functional approach is dramatically slower (a median of roughly 53 s versus 20 ms in the benchmark below). I'm not sure how this scales to data frames with many more columns and groups. How can I speed this up while keeping the flexibility of the dynamic approach?
> identical(df_manual, df_functional)
[1] TRUE
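The call that produced the `benchmark` object printed below is not shown. As a rough sketch only, a comparison of this kind could be set up with microbenchmark as follows. The data frame `toy`, the wrappers `run_functional()` and `run_manual()`, and `times = 5` are all assumptions made for illustration (the post reports neval = 100 on 10,000 rows), so timings from this sketch will not match the numbers in the post.

```r
library(dplyr)
library(purrr)
library(stringr)
library(microbenchmark)

# Small stand-in data (assumption: 200 rows, two scales) so the sketch runs quickly
set.seed(123)
toy <- tibble(G_1   = sample(1:5, size = 200, replace = TRUE),
              G_2   = sample(1:5, size = 200, replace = TRUE),
              MOT_1 = sample(1:5, size = 200, replace = TRUE),
              MOT_2 = sample(1:5, size = 200, replace = TRUE))

# Prefixes before the last underscore, as in the post
toy_prefixes <- unique(str_extract(names(toy), "^.*(?=_)"))

# Hypothetical wrapper around the rowwise()/c_across() approach
run_functional <- function(d, prefixes) {
  bind_cols(
    d,
    map_dfc(prefixes, function(p) {
      d %>%
        rowwise() %>%
        transmute(!!str_c("mean_", p) := mean(c_across(starts_with(p)))) %>%
        ungroup()
    })
  )
}

# Hypothetical wrapper around the hard-coded mutate() + rowMeans() approach
run_manual <- function(d) {
  d %>% mutate(mean_G   = rowMeans(select(., G_1, G_2)),
               mean_MOT = rowMeans(select(., MOT_1, MOT_2)))
}

# Time both approaches; `times` is kept small here on purpose
toy_benchmark <- microbenchmark(
  functional = run_functional(toy, toy_prefixes),
  manual     = run_manual(toy),
  times      = 5
)
print(toy_benchmark)
```

Wrapping each approach in a function keeps the timed expressions short and makes it easy to rerun the comparison on a larger data frame.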
# Benchmark (using the microbenchmark package)
> benchmark
Unit: milliseconds
       expr        min         lq        mean     median         uq        max neval
 functional 37198.3569 38592.6855 48313.00156 52936.3254 55349.0561 59831.0141   100
     manual    16.0662    18.0139    27.53403    19.9085    22.9384   138.5401   100
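One direction that may be worth benchmarking is to keep the dynamic prefix list but replace the per-row rowwise()/c_across() evaluation with one vectorised rowMeans() call per scale. The following is a minimal sketch under stated assumptions, not a verified solution for the real data: `toy`, `scale_prefixes`, `scale_means` and `toy_with_means` are hypothetical names, and matching on the prefix plus a trailing underscore is an added assumption to avoid collisions between scales whose names share a beginning.

```r
library(dplyr)
library(purrr)
library(stringr)

# Small stand-in data (assumption: 30 rows, two scales, VAR_3 missing)
set.seed(123)
toy <- tibble(G_1   = sample(1:5, size = 30, replace = TRUE),
              G_2   = sample(1:5, size = 30, replace = TRUE),
              G_3   = sample(1:5, size = 30, replace = TRUE),
              VAR_1 = sample(1:5, size = 30, replace = TRUE),
              VAR_2 = sample(1:5, size = 30, replace = TRUE),
              VAR_4 = sample(1:5, size = 30, replace = TRUE))

# Unique prefixes: everything before the last underscore
scale_prefixes <- unique(str_extract(names(toy), "^.*(?=_)"))

# One vectorised rowMeans() call per scale instead of one mean() per row
scale_means <- map(scale_prefixes,
                   ~ rowMeans(select(toy, starts_with(str_c(.x, "_")))))
names(scale_means) <- str_c("mean_", scale_prefixes)

# Append the new mean_* columns to the original data
toy_with_means <- bind_cols(toy, as_tibble(scale_means))
head(toy_with_means)
```

The number of rowMeans() calls then grows with the number of scales rather than with the number of rows, which is the property the manual version relies on as well.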