Compute grouped averages across varying numbers of columns

Question

I have a (very large) dataframe with words in utterances of different sizes and corpus frequencies of the words:

df <- structure(list(size = c(2, 2, 3, 3, 4, 4, 3, 3), 
                     w1 = c("come", "why", "er", "well", "i", "no", "that", "cos"), 
                     w2 = c("on","that", "i", "not", "'m", "thanks", "'s", "she"), 
                     w3 = c(NA, NA, "can", "today", "going", "a", "cool", "does"), 
                     w4 = c(NA,NA, NA, NA, "home", "lot", NA, NA), 
                     f1 = c(9699L, 6519L, 21345L, 35793L, 169024L, 39491L, 84682L, 11375L), 
                     f2 = c(33821L, 84682L,169024L, 21362L, 14016L, 738L, 107729L, 33737L), 
                     f3 = c(NA, NA,  15428L, 2419L, 10385L, 77328L, 132L, 7801L), 
                     f4 = c(NA, NA, NA, NA, 2714L, 3996L, NA, NA)), 
                row.names = c(NA, -8L), class = "data.frame")

I need to compute the averages for the different size groups across varying numbers of columns. I can do it size by size, like so, e.g. for size == 2:

# calculate numbers of rows per size group:
RowsPerSize <- table(df$size) 

# make size subset:                   
df_size2 <- df[df$size == 2,] 

# calculate average `f`requencies per `size`:                    
AvFreqSize_2 <- apply(df_size2[,6:7], 2, function(x) sum(x, na.rm = T)/RowsPerSize[1])

# result:
AvFreqSize_2
     f1      f2 
 8109.0 59251.5

But that's cumbersome already for a single size and all the more so for multiple sizes. I'm pretty certain there's a more economical way, probably in dplyr, where you can group_by. A humble beginning is this:

df %>%
  group_by(size) %>%
  summarise(freq = n())
# A tibble: 3 x 2
   size  freq
* <dbl> <int>
1     2     2
2     3     4
3     4     2

You created a `df_size2` but you never used it. This example does not really show what is happening — Onyambu, May 06 '21 at 06:30
It seems you are not doing any grouping. for example, your results is just from adding all the values in the column and divided by the length of size 2. ie `colSums(df[, 6:7])/2` Is this what you want? — Onyambu, May 06 '21 at 06:38
That's correct, thanks for noticing the mishap. Have corrected it. What I want is for every `size` the average `f`requencies in the `f1`, `f2`, etc. cols. — Chris Ruehlemann, May 06 '21 at 06:42
This is essentially group by mean for multiple columns - https://stackoverflow.com/questions/21982987/mean-per-group-in-a-data-frame — Ronak Shah, May 06 '21 at 06:45

Dan Chaltiel · Accepted Answer · 2021-05-06T06:49:57.483

1

I had to guess a lot but I think you are looking for this:

library(tidyverse)

df %>%
  group_by(size) %>%
  summarise(across(matches("f\\d"), ~sum(.x, na.rm = T)/n()))
#> # A tibble: 3 x 5
#>    size      f1     f2     f3    f4
#>   <dbl>   <dbl>  <dbl>  <dbl> <dbl>
#> 1     2   8109  59252.     0      0
#> 2     3  38299. 82963   6445      0
#> 3     4 104258.  7377  43856.  3355


#as @Onyambu suggested, it could make more sense to use `mean()`
df %>%
  group_by(size) %>%
  summarise(across(matches("f\\d"), ~mean(.x, na.rm = T)))
#> # A tibble: 3 x 5
#>    size      f1     f2     f3    f4
#>   <dbl>   <dbl>  <dbl>  <dbl> <dbl>
#> 1     2   8109  59252.   NaN    NaN
#> 2     3  38299. 82963   6445    NaN
#> 3     4 104258.  7377  43856.  3355

^{Created on 2021-05-06 by the reprex package (v2.0.0)}

edited May 06 '21 at 06:49

answered May 06 '21 at 06:43

Dan Chaltiel

7,811
5
47
92

1

sum/n is simply `mean` – Onyambu May 06 '21 at 06:44
@Onyambu You are right but the output would be different: f3 and f4 would have `NaN` instead of `0`. – Dan Chaltiel May 06 '21 at 06:46
but why would you want them to be zero? ie -3 + 3 is 0, but NA +NA is just NA – Onyambu May 06 '21 at 06:47
@ChrisRuehlemann Onyambu is raising an interesting point you might want to consider – Dan Chaltiel May 06 '21 at 06:48
Thanks lot, especially @Onyambu. I noticed in the actual dataset that the previous soution was lesss perfect than I had thought. The current solution with `mean` is perfect! – Chris Ruehlemann May 06 '21 at 06:52

Compute grouped averages across varying numbers of columns

1 Answers1