data.table aggregations that return vectors, such as scale()

Question

I have recently been work with much larger datasets and have started learning and migrating to data.table to improve performance of aggregation/grouping. I have been unable to get certain expressions or functions to group as expected. Here is an example of a basic group by operation that I am having trouble with.

library(data.table)
category <- rep(1:10, 10)
value <- rnorm(100)
df <- data.frame(category, value)
dt <- data.table(df)

If I want to simply calculate the mean for each group by category. This works easily enough.

dt[,mean(value),by="category"]

    category          V1
 1:        1 -0.67555478
 2:        2 -0.50438413
 3:        3  0.29093723
 4:        4 -0.41684790
 5:        5  0.33921764
 6:        6  0.01970997
 7:        7 -0.23684245
 8:        8 -0.04280998
 9:        9  0.01838804
10:       10  0.44295978

I run into problems if I try and use the scale function or even a simple expression subtracting the value from itself. The grouping is ignored and I get the function/expression applied to each row instead. The following returns all 100 rows instead of 10 group by categories.

dt[,scale(value),by="category"]


dt[,value-mean(value),by="category"]

I thought recreating scale as function that returns a numeric vector instead of a matrix might help.

zScore <- function(x) {
  z=(x-mean(x,na.rm=TRUE))/sd(x,na.rm = TRUE)
  return(z) 
}

dt[,zScore(value),by="category"]

  category          V1
  1:        1 -1.45114132
  2:        1 -0.35304528
  3:        1 -0.94075418
  4:        1  1.44454416
  5:        1  1.39448268
  6:        1  0.55366652
  ....
 97:       10 -0.43190602
 98:       10 -0.25409244
 99:       10  0.35496694
100:       10  0.57323480
     category          V1

This also returns the zScore function applied to all rows (N=100) and ignoring the grouping. What am I missing in order to get scale() or a custom function to use the grouping like it did above when using mean()?

`mean` returns 1 value. `scale` function returns a scaled value for each input. That is, `scale(1:5)` gives 5 values. `mean(1:5)` gives 1 value. Does this help understand your problem? — Arun, Apr 05 '13 at 14:30
the function you apply on `grouped` variables should return 1 value and not vector. — CHP, Apr 05 '13 at 14:45
This is exactly the problem. I was trying to use data.table the way I had used ddply. How would I use data.table to equivalently achieve the following. in ddply? `ddply(df,"category",transform, zscorebycategory=zScore(value))` — Aaron, Apr 05 '13 at 15:38
@Aaron `dt[,zscorebycategory:=zScore(value),by=category]` I don't understand where you see a problem? — Roland, Apr 05 '13 at 15:57
@Roland You are right. I didn't quite realize that := behaved a bit differently and updates the table directly. When I had run it before I was expecting to assign it manually to dt. It is working now. Thanks very much for helping me clear this up. — Aaron, Apr 05 '13 at 16:10
@Roland, when one is first starting to use data.table, the different behavior between `:=` and simply outputting the results is far from obvious. — Ricardo Saporta, Apr 08 '13 at 07:45

score 2 · Answer 1 · answered Apr 05 '13 at 16:03

You've clarified in the comments that you'd like the same behaviour as:

ddply(df,"category",transform, zscorebycategory=zScore(value))

which gives:

    category       value zscorebycategory
1          1  0.28860691       0.31565682
2          1  1.17473759       1.33282374
3          1  0.06395503       0.05778463
4          1  1.37825487       1.56643607
etc

The data table option you gave gives:

     category          V1
  1:        1  0.31565682
  2:        1  1.33282374
  3:        1  0.05778463
  4:        1  1.56643607
  etc

Which is exactly the same data. However you'd like to also repeat the value column in your result, and rename the V1 variable with something more descriptive. data.table gives you the grouping variable in the result, along with the result of the expression you provide. So lets modify that to give the rows you'd like:

Your

dt[,zScore(value),by="category"]

becomes:

dt[,list(value=value, zscorebycategory=zScore(value)),by="category"]

Where the named items in the list become columns in the result.

plyr = data.table(ddply(df,"category",transform, zscorebycategory=zScore(value)))
dt   = dt[,list(value=value, zscorebycategory=zScore(value)),by="category"]
identical(plyr, dt)
> TRUE

(note I converted your ddply data.frame result into a data.table, to allow the identical command to work).

This works great. Thanks Peter. I had other unrelated data issues that moderated my already limited understanding of data.table. Working with my basic generated dataset paired everyone's help has clarified things. — Aaron, Apr 05 '13 at 16:23
Cheers, @Aaron. Although I think @Roland's answer in the question comments: `dt[,zscorebycategory:=zScore(value),by=category]` will probably be more efficient, since `:=` updates the existing table, while my answer will create a new one. — Peter Fine, Apr 05 '13 at 17:09

score 1 · Answer 2 · answered Apr 05 '13 at 15:18

Your claim that data.table does not group is wrong:

library(data.table)
category <- rep(1:2, each=4)
value <- c(rep(c(1:2),each=2),rep(c(4,10),each=2))
dt <- data.table(category, value)

   category value
1:        1     1
2:        1     1
3:        1     2
4:        1     2
5:        2     4
6:        2     4
7:        2    10
8:        2    10

dt[,value-mean(value),by=category]
   category   V1
1:        1 -0.5
2:        1 -0.5
3:        1  0.5
4:        1  0.5
5:        2 -3.0
6:        2 -3.0
7:        2  3.0
8:        2  3.0

If you want to scale/transform this is exactly the behavior you want, because these operations by definition return an object of the same size as the input.

data.table aggregations that return vectors, such as scale()

2 Answers2

Linked