I have a table where the key is repeated a number of times, and one to select just one row for each key, using the largest value of another column.
This example demonstrates the solution I have at the moment:
N = 10
k = 2
DT = data.table(X = rep(1:N, each = k), Y = rnorm(k*N))
     X           Y
 1:  1 -1.37925206
 2:  1 -0.53837461
 3:  2  0.26516340
 4:  2 -0.04643483
 5:  3  0.40331424
 6:  3  0.28667275
 7:  4 -0.30342327
 8:  4 -2.13143267
 9:  5  2.11178673
10:  5 -0.98047230
11:  6 -0.27230783
12:  6 -0.79540934
13:  7  1.54264549
14:  7  0.40079650
15:  8 -0.98474297
16:  8  0.73179201
17:  9 -0.34590491
18:  9 -0.55897393
19: 10  0.97523187
20: 10  1.16924293
> DT[, .SD[Y == max(Y)], by = X]
     X          Y
 1:  1 -0.5383746
 2:  2  0.2651634
 3:  3  0.4033142
 4:  4 -0.3034233
 5:  5  2.1117867
 6:  6 -0.2723078
 7:  7  1.5426455
 8:  8  0.7317920
 9:  9 -0.3459049
10: 10  1.1692429
The problem is that for larger data.tables this take a very long time:
N = 10000
k = 25
DT = data.table(X = rep(1:N, each = k), Y = rnorm(k*N))
system.time(DT[, .SD[Y == max(Y)], by = X])
   user  system elapsed 
   9.69    0.00    9.69 
My actual table about 100 million rows...
Can anyone suggest a more efficient solution?
Edit - importance of set key
The solution proposed works well, but you must use setkey, or have the DT ordered for it to work:
See Example without "each" in rep:
N = 10
k = 2
DT = data.table(X = rep(1:N, k), Y = rnorm(k*N))
DT[DT[, Y == max(Y), by = X]$V1,]
     X           Y
 1:  1  1.26925708
 2:  4 -0.66625732
 3:  5  0.41498548
 4:  8  0.03531185
 5:  9  0.30608380
 6:  1  0.50308578
 7:  4  0.19848227
 8:  6  0.86458423
 9:  8  0.69825500
10: 10 -0.38160503