Since I realized that (the very excellent) answers of this post lack of by and aggregate explanations. Here is my contribution.
BY
The by function, as stated in the documentation can be though, as a "wrapper" for tapply. The power of by arises when we want to compute a task that tapply can't handle. One example is this code:
ct <- tapply(iris$Sepal.Width , iris$Species , summary )
cb <- by(iris$Sepal.Width , iris$Species , summary )
cb
iris$Species: setosa
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.300 3.200 3.400 3.428 3.675 4.400
--------------------------------------------------------------
iris$Species: versicolor
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.525 2.800 2.770 3.000 3.400
--------------------------------------------------------------
iris$Species: virginica
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.200 2.800 3.000 2.974 3.175 3.800
ct
$setosa
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.300 3.200 3.400 3.428 3.675 4.400
$versicolor
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.525 2.800 2.770 3.000 3.400
$virginica
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.200 2.800 3.000 2.974 3.175 3.800
If we print these two objects, ct and cb, we "essentially" have the same results and the only differences are in how they are shown and the different class attributes, respectively by for cb and array for ct.
As I've said, the power of by arises when we can't use tapply; the following code is one example:
tapply(iris, iris$Species, summary )
Error in tapply(iris, iris$Species, summary) :
arguments must have same length
R says that arguments must have the same lengths, say "we want to calculate the summary of all variable in iris along the factor Species": but R just can't do that because it does not know how to handle.
With the by function R dispatch a specific method for data frame class and then let the summary function works even if the length of the first argument (and the type too) are different.
bywork <- by(iris, iris$Species, summary )
bywork
iris$Species: setosa
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.300 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:4.800 1st Qu.:3.200 1st Qu.:1.400 1st Qu.:0.200 versicolor: 0
Median :5.000 Median :3.400 Median :1.500 Median :0.200 virginica : 0
Mean :5.006 Mean :3.428 Mean :1.462 Mean :0.246
3rd Qu.:5.200 3rd Qu.:3.675 3rd Qu.:1.575 3rd Qu.:0.300
Max. :5.800 Max. :4.400 Max. :1.900 Max. :0.600
--------------------------------------------------------------
iris$Species: versicolor
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.900 Min. :2.000 Min. :3.00 Min. :1.000 setosa : 0
1st Qu.:5.600 1st Qu.:2.525 1st Qu.:4.00 1st Qu.:1.200 versicolor:50
Median :5.900 Median :2.800 Median :4.35 Median :1.300 virginica : 0
Mean :5.936 Mean :2.770 Mean :4.26 Mean :1.326
3rd Qu.:6.300 3rd Qu.:3.000 3rd Qu.:4.60 3rd Qu.:1.500
Max. :7.000 Max. :3.400 Max. :5.10 Max. :1.800
--------------------------------------------------------------
iris$Species: virginica
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.900 Min. :2.200 Min. :4.500 Min. :1.400 setosa : 0
1st Qu.:6.225 1st Qu.:2.800 1st Qu.:5.100 1st Qu.:1.800 versicolor: 0
Median :6.500 Median :3.000 Median :5.550 Median :2.000 virginica :50
Mean :6.588 Mean :2.974 Mean :5.552 Mean :2.026
3rd Qu.:6.900 3rd Qu.:3.175 3rd Qu.:5.875 3rd Qu.:2.300
Max. :7.900 Max. :3.800 Max. :6.900 Max. :2.500
it works indeed and the result is very surprising. It is an object of class by that along Species (say, for each of them) computes the summary of each variable.
Note that if the first argument is a data frame, the dispatched function must have a method for that class of objects. For example is we use this code with the mean function we will have this code that has no sense at all:
by(iris, iris$Species, mean)
iris$Species: setosa
[1] NA
-------------------------------------------
iris$Species: versicolor
[1] NA
-------------------------------------------
iris$Species: virginica
[1] NA
Warning messages:
1: In mean.default(data[x, , drop = FALSE], ...) :
argument is not numeric or logical: returning NA
2: In mean.default(data[x, , drop = FALSE], ...) :
argument is not numeric or logical: returning NA
3: In mean.default(data[x, , drop = FALSE], ...) :
argument is not numeric or logical: returning NA
AGGREGATE
aggregate can be seen as another a different way of use tapply if we use it in such a way.
at <- tapply(iris$Sepal.Length , iris$Species , mean)
ag <- aggregate(iris$Sepal.Length , list(iris$Species), mean)
at
setosa versicolor virginica
5.006 5.936 6.588
ag
Group.1 x
1 setosa 5.006
2 versicolor 5.936
3 virginica 6.588
The two immediate differences are that the second argument of aggregate must be a list while tapply can (not mandatory) be a list and that the output of aggregate is a data frame while the one of tapply is an array.
The power of aggregate is that it can handle easily subsets of the data with subset argument and that it has methods for ts objects and formula as well.
These elements make aggregate easier to work with that tapply in some situations.
Here are some examples (available in documentation):
ag <- aggregate(len ~ ., data = ToothGrowth, mean)
ag
supp dose len
1 OJ 0.5 13.23
2 VC 0.5 7.98
3 OJ 1.0 22.70
4 VC 1.0 16.77
5 OJ 2.0 26.06
6 VC 2.0 26.14
We can achieve the same with tapply but the syntax is slightly harder and the output (in some circumstances) less readable:
att <- tapply(ToothGrowth$len, list(ToothGrowth$dose, ToothGrowth$supp), mean)
att
OJ VC
0.5 13.23 7.98
1 22.70 16.77
2 26.06 26.14
There are other times when we can't use by or tapply and we have to use aggregate.
ag1 <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)
ag1
Month Ozone Temp
1 5 23.61538 66.73077
2 6 29.44444 78.22222
3 7 59.11538 83.88462
4 8 59.96154 83.96154
5 9 31.44828 76.89655
We cannot obtain the previous result with tapply in one call but we have to calculate the mean along Month for each elements and then combine them (also note that we have to call the na.rm = TRUE, because the formula methods of the aggregate function has by default the na.action = na.omit):
ta1 <- tapply(airquality$Ozone, airquality$Month, mean, na.rm = TRUE)
ta2 <- tapply(airquality$Temp, airquality$Month, mean, na.rm = TRUE)
cbind(ta1, ta2)
ta1 ta2
5 23.61538 65.54839
6 29.44444 79.10000
7 59.11538 83.90323
8 59.96154 83.96774
9 31.44828 76.90000
while with by we just can't achieve that in fact the following function call returns an error (but most likely it is related to the supplied function, mean):
by(airquality[c("Ozone", "Temp")], airquality$Month, mean, na.rm = TRUE)
Other times the results are the same and the differences are just in the class (and then how it is shown/printed and not only -- example, how to subset it) object:
byagg <- by(airquality[c("Ozone", "Temp")], airquality$Month, summary)
aggagg <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, summary)
The previous code achieve the same goal and results, at some points what tool to use is just a matter of personal tastes and needs; the previous two objects have very different needs in terms of subsetting.