Sum values in a column for groups in another column in R

Question

I do have a little twist in my head. I think this should be somewhat easy, but I just can't figure it out. I have this data:

                                     tipologia date_info    n
1  Aree soggette a crolli/ribaltamenti diffusi       day  113
2  Aree soggette a crolli/ribaltamenti diffusi     month   59
3  Aree soggette a crolli/ribaltamenti diffusi   no date  506
4  Aree soggette a crolli/ribaltamenti diffusi      year 1880
5   Aree soggette a frane superficiali diffuse       day   24
6   Aree soggette a frane superficiali diffuse     month    7
7   Aree soggette a frane superficiali diffuse   no date  148
8   Aree soggette a frane superficiali diffuse      year  142
9       Aree soggette a sprofondamenti diffusi       day    1
10      Aree soggette a sprofondamenti diffusi   no date    1
11      Aree soggette a sprofondamenti diffusi      year    2
12                             Colamento lento       day   25
13                             Colamento lento     month   12
14                             Colamento lento   no date   27
15                             Colamento lento      year  177
16                            Colamento rapido       day   64
17                            Colamento rapido     month    3
18                            Colamento rapido   no date   12
19                            Colamento rapido      year   92
20                                   Complesso       day  107
21                                   Complesso     month   23
22                                   Complesso   no date  150
23                                   Complesso      year  138

What I want to do now is to sum up all values in the column "n" for each group in tipologia. But I dont want to lose the information in "date_info". So I basically just want to append a column that for the first group "Aree soggette a crolli/ribaltamenti diffusi" would have the value (113+59+506+1880 =2556) in the first four rows.

So I tried something like

df  %>% count(tipologia, date_info) %>% 
  group_by(tipologia) %>% 
  summarise(total = sum(n))

but then I obviously "loose" my "date_info" column.

   tipologia                                   total
   <chr>                                       <int>
 1 Aree soggette a crolli/ribaltamenti diffusi  2558
 2 Aree soggette a frane superficiali diffuse    321
 3 Aree soggette a sprofondamenti diffusi          4
 4 Colamento lento                               241
 5 Colamento rapido                              171
 6 Complesso                                     418
 7 Crollo/Ribaltamento                          2932
 8 DGPV                                           50

When I group by tipologia and date_info and then sum up n, it does not build the sum for some reason

df %>% count(tipologia, date_info) %>% 
  group_by(tipologia, date_info) %>% 
  summarise(total = sum(n))

And the result looks like

   tipologia                                   date_info total
   <chr>                                       <chr>     <int>
 1 Aree soggette a crolli/ribaltamenti diffusi day         113
 2 Aree soggette a crolli/ribaltamenti diffusi month        59
 3 Aree soggette a crolli/ribaltamenti diffusi no date     506
 4 Aree soggette a crolli/ribaltamenti diffusi year       1880
 5 Aree soggette a frane superficiali diffuse  day          24

I think the answer might be somewhere in here too How to sum a variable by group, but I just can't figure it out...:/

Use `mutate` not `summarise` to add a column to the original data. (In your first attempt). — Gregor Thomas, Dec 22 '20 at 17:56
I guess you can still use `summarise` with `date_info` as another column beacuse in the newer version, `summarise` is not limited to return only single row per grou — akrun, Dec 22 '20 at 17:57
Your second does sum up, but it sums by your groups, which include the `date_info`. If you had multiple rows with the same tipologia and date_info, they would be added together and collapsed in your second attempt. — Gregor Thomas, Dec 22 '20 at 17:57
Or another option would be to use `mutate` as GregorThomas mentioned and wrap with `distinct` at the end — akrun, Dec 22 '20 at 17:59
Thank you very much!! That did the trick:) Really appreciate it — Robin Kohrs, Dec 22 '20 at 21:06

score 0 · Answer 1 · answered Jul 25 '23 at 14:54

Here is a base R answer. The key is to forget about elegance and use a "helper" table which stores the aggregated sums. Then merge the two on tipolagia which, by default, will populate all the rows with the sum value. I am sure there are ways in both dplyr and data.table to acheive the same result.

First to recreate your data

DF <- data.frame(tipolagia = c(
  rep("Aree soggette a crolli/ribaltamenti diffusi", 4L),
  rep("Aree soggette a frane superficiali diffuse", 4L),
  rep("Aree soggette a sprofondamenti diffusi", 3L),
  rep("Colamento lento", 4L),
  rep("Colamento rapido", 4L),
  rep("Complesso", 4L)),
  date_info = c(rep(c("day", "month", "no date", "year"), 2L),
                "day", "no date", "year",
                rep(c("day", "month", "no date", "year"), 3L)),
  n = c(113, 59, 506, 1880, 24, 7, 148, 142, 1, 1, 2, 25, 12, 27, 177, 64, 3,
        12, 92, 107, 23, 150, 138))

DF
tipolagia date_info    n
1  Aree soggette a crolli/ribaltamenti diffusi       day  113
2  Aree soggette a crolli/ribaltamenti diffusi     month   59
3  Aree soggette a crolli/ribaltamenti diffusi   no date  506
4  Aree soggette a crolli/ribaltamenti diffusi      year 1880
5   Aree soggette a frane superficiali diffuse       day   24
6   Aree soggette a frane superficiali diffuse     month    7
7   Aree soggette a frane superficiali diffuse   no date  148
8   Aree soggette a frane superficiali diffuse      year  142
9       Aree soggette a sprofondamenti diffusi       day    1
10      Aree soggette a sprofondamenti diffusi   no date    1
11      Aree soggette a sprofondamenti diffusi      year    2
12                             Colamento lento       day   25
13                             Colamento lento     month   12
14                             Colamento lento   no date   27
15                             Colamento lento      year  177
16                            Colamento rapido       day   64
17                            Colamento rapido     month    3
18                            Colamento rapido   no date   12
19                            Colamento rapido      year   92
20                                   Complesso       day  107
21                                   Complesso     month   23
22                                   Complesso   no date  150
23                                   Complesso      year  138

Now to aggregate and merge:

# Create sum of n by tipolagia
aggDF <- aggregate(DF$n, list(DF$tipolagia), sum)
# Name the columns for merge purposes
names(aggDF) <- c("tipolagia", "sum")
# Merge the two
DF <- merge(DF, aggDF)
# Voila
                                     tipolagia date_info    n  sum
1  Aree soggette a crolli/ribaltamenti diffusi       day  113 2558
2  Aree soggette a crolli/ribaltamenti diffusi     month   59 2558
3  Aree soggette a crolli/ribaltamenti diffusi   no date  506 2558
4  Aree soggette a crolli/ribaltamenti diffusi      year 1880 2558
5   Aree soggette a frane superficiali diffuse       day   24  321
6   Aree soggette a frane superficiali diffuse     month    7  321
7   Aree soggette a frane superficiali diffuse   no date  148  321
8   Aree soggette a frane superficiali diffuse      year  142  321
9       Aree soggette a sprofondamenti diffusi       day    1    4
10      Aree soggette a sprofondamenti diffusi   no date    1    4
11      Aree soggette a sprofondamenti diffusi      year    2    4
12                             Colamento lento       day   25  241
13                             Colamento lento     month   12  241
14                             Colamento lento   no date   27  241
15                             Colamento lento      year  177  241
16                            Colamento rapido       day   64  171
17                            Colamento rapido     month    3  171
18                            Colamento rapido   no date   12  171
19                            Colamento rapido      year   92  171
20                                   Complesso       day  107  418
21                                   Complesso     month   23  418
22                                   Complesso   no date  150  418
23                                   Complesso      year  138  418

Sum values in a column for groups in another column in R

1 Answers1