I didn't see another question quite like this on SO. Let's say I have a dataframe of consisting of 1...m = M variables I want to aggregate over, plus two other grouping variables I plan to aggregate by (m+2 columns total). One of the variables in M needs to be averaged, and the other m-1 should be summed.
I'm familiar with various ways to aggregate data in R, including performing different functions on separate columns. For example, I like this data.table method:
df <- df[, list(w = sum(w), x = mean(x)), by = c("y", "z")]
which sums w and averages y by two groups, y,z.
After some research, it also looks like there's a dplyr function across(-x, sum) that can subject all variables meeting a condition to an aggregation function -- in this case, sum all variables other than x.
I'm having trouble figuring out how to use these together. In my case, I have > 1,000,000 rows and 200+ columns, so it's not feasible to cbind() everything on the LHS of an aggregate()-type function.
- How might I apply multiple functions (
mean,sum) with anacross()technique? (My instinct is it might involve some merging.) - Is there a
data.tablesolution? (Not necessary, but helpful for speed.)
MWE: Consider the iris dataset.
library(datasets)
data(iris)
summary(iris)
which has four numeric columns (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) and one factor (Species).
I want to group by species, and within that group, aggregate each of the other four columns. However, Sepal.Length should be averaged (mean), while Sepal.Width, Petal.Length, Petal.Width should be summed (summed).
I can do this by naming each variable as follows:
library(data.table)
iris <- data.table(iris)
iris[, list(Sepal.Length = mean(Sepal.Length),
Sepal.Width = sum(Sepal.Width),
Petal.Length = sum(Petal.Length),
Petal.Width = sum(Petal.Width)), by = c("Species")]
How can I generalize the procedure so that it's only necessary to name the exceptions (mean(Sepal.Length))?
As an alternative to naming only the exceptions (in the manner of -Sepal.Length), I could also exhaustively name all the indices; eg:
iris[, list(iris[,1] = mean(iris[,1]),
iris[,2:4] = sum(iris[,2:4])), by = c(Species)]
# doesn't work of course