Let's say I've got a data.table with 6 columns
library(data.table)
set.seed(123)
dt <- data.table( id = 1:100,
                  p1 = sample(1:10, 100, replace = TRUE ),
                  p2 = sample(1:10, 100, replace = TRUE ),
                  p3 = sample(1:10, 100, replace = TRUE ),
                  p4 = sample(1:10, 100, replace = TRUE ),
                  p5 = sample(1:10, 100, replace = TRUE ) )
Now, I want to subset this data.table, on the p1 - pn columns (here: p1-p5). I want to keep all rows where any of the p-columns contains the value of 10.
For this small sample data.table, this can be done manually with
test1 <- dt[ p1 == 10 | p2 == 10 | p3 == 10 | p4 == 10 | p5 == 10, ]
But my production data contains dozens of p-columns, so typing them all out manually would be a pain...
My current solution is to first create a vector with the column-names i need:
cols <- grep( "^p", names( dt ), value = TRUE )
...and then do the subsetting using apply:
test2 <- dt[ apply( dt[, ..cols ], 1, function(r) any( r == 10 ) ), ]
Check:
identical(test1, test2)
# TRUE
my actual question
The above solution (using apply) is fast enough for me.. But I'm not sure it is the optimal solution. I'm pretty new to data.table (compared to some others here on SO), and this is (probably?) not the most efficient/effective/elegant way to achieve the subset I want.
I'm here to learn, so has anyone got a more elegant/better/faster approach to my subsetting question?
update
The question had been marked duplicate... But I'll still post my answers here:
I found the answer from @Marcus to be the best (=readable) code, and the answer from @akrun to be the fastest.
benchmarking
data.table with 1,000,000 rows and 50 columns of interest (i.e. p-columns)
#create sample data
set.seed( 123 )
n   <- 1000000
k   <- 100
dat <- sample( 1:100, n * k, replace = TRUE )
DT  <- as.data.table( matrix( data = dat, nrow = n, ncol = k ) )
setnames( DT, names( DT ), c( paste0( "p", 1:50 ), paste( "r", 1:50 ) ) )
#vector with columns starting with "p"
cols <- grep( "^p", names( DT ), value = TRUE )
apply_method   <- DT[ apply( DT[, ..cols ], 1, function(x) any( x == 10 ) ), ]
reduce_method  <- DT[ DT[, Reduce(`|`, lapply(.SD, `==`, 10)), .SDcols = cols]]
rowsums_method <- DT[ rowSums( DT[ , ..cols ] == 10, na.rm = TRUE ) >= 1 ]
identical(  apply_method, rowsums_method )
microbenchmark::microbenchmark(
  apply   = DT[ apply( DT[ , ..cols ], 1, function(x) any( x == 10 ) ), ],
  reduce  = DT[ DT[, Reduce( `|`, lapply( .SD, `==`, 10 ) ), .SDcols = cols ] ],
  rowSums = DT[ rowSums( DT[ , ..cols ] == 10, na.rm = TRUE ) >= 1, ],
  times = 10
)
#    expr       min        lq      mean    median        uq       max neval
#   apply 3352.0640 3441.7760 3665.5004 3662.7666 3760.7553 4325.9125    10
#  reduce  408.6349  437.6806  552.8850  572.2012  657.6072  710.7699    10
# rowSums  619.2594  663.7325  784.2389  850.0963  868.2096  892.7469    10
 
     
    