I am trying to aggregate a dataset of 12.000 obs. with 37 variables in which I want to group by 2 variables and sum by 1.
All other variables must remain, as these contain important information for later analysis.
Most remaining variables contain the same value within the group, from others I would want to select the first value.
To get a better feeling of what is happening, I created a random small test dataset (10 obs. 5 variables).
row <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
y <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4)
set1 <- c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3)
set2 <- c(1, 1, 1, 2, 2, 2, 1, 1, 2, 1)
set3 <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5)
df <- data.frame(row, y, set1, set2, set3)
df
   row y set1 set2 set3
1    1 1    1    1    1
2    2 1    1    1    1
3    3 1    1    1    2
4    4 2    1    2    2
5    5 2    1    2    3
6    6 2    1    2    3
7    7 3    2    1    4
8    8 3    2    1    4
9    9 3    2    2    5
10  10 4    3    1    5
I want to aggregate the data based on set1 and set2, getting sum(y)-values, whilst keeping the other columns (here row and set3) by selecting the first value within the remaining columns, resulting in the following aggregated dataframe (or tibble):
# row y set1  set2  set3
# 1   3 1     1     1
# 4   6 1     2     2
# 7   6 2     1     4
# 9   3 2     2     5
# 10  4 3     1     5
I have checked other questions for a possible solution, but have not been able to solve mine.
The most important questions and websites I have looked into and tried are:
Combine rows and sum their values
https://community.rstudio.com/t/combine-rows-and-sum-values/41963
https://datascienceplus.com/aggregate-data-frame-r/
R: How to aggregate some columns while keeping other columns
Aggregate by multiple columns, sum one column and keep other columns? Create new column based on aggregated values?
I have figured out that using summarise in dplyr always results in removal of remaining variables.
I thought to have found a solution with R: How to aggregate some columns while keeping other columns as reproducing the example gave satisfying results.
As using
library(dplyr)
df_aggr1 <-
  df %>%
  group_by(set1, set2) %>%
  slice(which.max(y))
Resulted in
# A tibble: 5 x 5
# Groups:   set1, set2 [5]
    row     y  set1  set2  set3
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     1     1     1     1     1
2     4     2     1     2     2
3     7     3     2     1     4
4     9     3     2     2     5
5    10     4     3     1     5
However, using
library(dplyr)
df_aggr2 <-
  df %>%
  group_by(set1, set2) %>%
  slice(sum(y))
resulted in:
# A tibble: 1 x 5
# Groups:   set1, set2 [1]
    row     y  set1  set2  set3
  <dbl> <dbl> <dbl> <dbl> <dbl>
1     3     1     1     1     2
In which y apparently is not even summed, so I do not get what is happening.
What am I missing?
Thanks in advance!
 
    