I have a big dataset and need to summarise most of its columns by a single factor (CODE_PLOT). These are the columns I need to aggregate:
> names(soil)[4:30]
 [1] "PH"            "CONDUCTIVITY"  "K"             "CA"            "MG"            "N_NO3"        
 [7] "S_SO4"         "ALKALINITY"    "AL"            "DOC"           "WATER_CONTENT" "Na"           
[13] "AL_LABILE"     "FE"            "MN"            "P"             "N_NH4"         "CL"           
[19] "CR"            "NI"            "ZN"            "CU"            "PB"            "CD"           
[25] "SI"            "SAMPLE_VOL"    "N_TOTAL"      
For those columns I need the mean, sd and length values. Since the dataset is big, performance is also important. I have tried aggregate, but it didn’t work. I am open to other packages that can do it faster. My attempt:
soil_variables <- names(soil)[4:30]
soil_by <- "CODE_PLOT"
soilM <- aggregate(soil[soil_variables], by=soil[soil_by],data=soil,
                   FUN=function(x) c(mn =mean(x),n=length(x)),na.rm=T)
The required output is a data frame with 3 columns per variable: mean, sd and N (27 x 3 columns + 1 "by" column).
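
To make the target shape concrete, here is a minimal base-R sketch on a toy data frame. The toy values and the helper name `stats` are made up for illustration, and only two of the 27 variables are included. It assumes that NA handling belongs inside the summary function rather than being passed as an extra argument to aggregate, and it makes no claim about speed on the full dataset.

```r
# Toy stand-in for the real `soil` data (hypothetical values, two variables only)
soil <- data.frame(
  CODE_PLOT = c(1, 1, 2, 2, 2),
  PH        = c(4.0, 5.0, 6.0, NA, 7.0),
  K         = c(1.0, 3.0, 2.0, 2.0, 5.0)
)
soil_variables <- c("PH", "K")   # would be names(soil)[4:30] on the real data
soil_by <- "CODE_PLOT"

# Hypothetical helper returning all three statistics for one column of one group.
# na.rm is handled here; length(x) counts NA values too
# (use sum(!is.na(x)) instead if N should exclude them).
stats <- function(x) c(mean = mean(x, na.rm = TRUE),
                       sd   = sd(x, na.rm = TRUE),
                       n    = length(x))

# Data-frame method of aggregate(): no `data` argument is needed with `by`
soilM <- aggregate(soil[soil_variables], by = soil[soil_by], FUN = stats)

# Each summarised variable comes back as a 3-column matrix;
# flatten so every statistic becomes an ordinary column
soilM <- do.call(data.frame, soilM)
names(soilM)
```

With this layout the columns are named as variable.statistic (for example PH.mean, PH.sd, PH.n), so the real data would give 27 x 3 + 1 = 82 columns.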
 
     
    