Efficient way to subset data.table based on value in any of selected columns

Question

Let's say I've got a data.table with 6 columns

library(data.table)
set.seed(123)
dt <- data.table( id = 1:100,
                  p1 = sample(1:10, 100, replace = TRUE ),
                  p2 = sample(1:10, 100, replace = TRUE ),
                  p3 = sample(1:10, 100, replace = TRUE ),
                  p4 = sample(1:10, 100, replace = TRUE ),
                  p5 = sample(1:10, 100, replace = TRUE ) )

Now, I want to subset this data.table, on the p1 - pn columns (here: p1-p5). I want to keep all rows where any of the p-columns contains the value of 10.

For this small sample data.table, this can be done manually with

test1 <- dt[ p1 == 10 | p2 == 10 | p3 == 10 | p4 == 10 | p5 == 10, ]

But my production data contains dozens of p-columns, so typing them all out manually would be a pain...

My current solution is to first create a vector with the column-names i need:

cols <- grep( "^p", names( dt ), value = TRUE )

...and then do the subsetting using apply:

test2 <- dt[ apply( dt[, ..cols ], 1, function(r) any( r == 10 ) ), ]

Check:

identical(test1, test2)
# TRUE

my actual question

The above solution (using apply) is fast enough for me.. But I'm not sure it is the optimal solution. I'm pretty new to data.table (compared to some others here on SO), and this is (probably?) not the most efficient/effective/elegant way to achieve the subset I want.

I'm here to learn, so has anyone got a more elegant/better/faster approach to my subsetting question?

update

The question had been marked duplicate... But I'll still post my answers here:

I found the answer from @Marcus to be the best (=readable) code, and the answer from @akrun to be the fastest.

benchmarking

data.table with 1,000,000 rows and 50 columns of interest (i.e. p-columns)

#create sample data
set.seed( 123 )
n   <- 1000000
k   <- 100
dat <- sample( 1:100, n * k, replace = TRUE )
DT  <- as.data.table( matrix( data = dat, nrow = n, ncol = k ) )
setnames( DT, names( DT ), c( paste0( "p", 1:50 ), paste( "r", 1:50 ) ) )

#vector with columns starting with "p"
cols <- grep( "^p", names( DT ), value = TRUE )

apply_method   <- DT[ apply( DT[, ..cols ], 1, function(x) any( x == 10 ) ), ]
reduce_method  <- DT[ DT[, Reduce(`|`, lapply(.SD, `==`, 10)), .SDcols = cols]]
rowsums_method <- DT[ rowSums( DT[ , ..cols ] == 10, na.rm = TRUE ) >= 1 ]

identical(  apply_method, rowsums_method )

microbenchmark::microbenchmark(
  apply   = DT[ apply( DT[ , ..cols ], 1, function(x) any( x == 10 ) ), ],
  reduce  = DT[ DT[, Reduce( `|`, lapply( .SD, `==`, 10 ) ), .SDcols = cols ] ],
  rowSums = DT[ rowSums( DT[ , ..cols ] == 10, na.rm = TRUE ) >= 1, ],
  times = 10
)

#    expr       min        lq      mean    median        uq       max neval
#   apply 3352.0640 3441.7760 3665.5004 3662.7666 3760.7553 4325.9125    10
#  reduce  408.6349  437.6806  552.8850  572.2012  657.6072  710.7699    10
# rowSums  619.2594  663.7325  784.2389  850.0963  868.2096  892.7469    10

How about this `test3 <- dt[rowSums(dt[, ..cols ] == 10) >= 1]; identical(test1, test3)` ? — markus, Feb 28 '19 at 19:12
@markus, that's the fastes so far... also: nice short code.. too bad I can't accept.. — Wimpel, Mar 01 '19 at 07:51
@markus is it also possible to adapt this method when you are not looking for the value `10`, but for a character-string, like `test` in the cols-columns? — Wimpel, Mar 01 '19 at 08:26
As in `dt <- data.table(col1 = c("test", "no_test"), col2 = 1:4); cols = c("col1", "col2"); dt[rowSums(dt[, ..cols ] == "test") >= 1]` ? — markus, Mar 01 '19 at 08:33
Ignore my (deleted) previous comment. adding `na.rm = TRUE` to the solution did the trick! Since my production-data also contains NA's in teh columns of interest! — Wimpel, Mar 01 '19 at 08:42

score 5 · Accepted Answer · answered Feb 28 '19 at 18:04

5

One option is to specify the 'cols' of interest in .SDcols, loop through the Subset of Data.table (.SD), generate a list of logical vectors, Reduce it to single logical vector with (|) and use that to subset the rows

i1 <- dt[, Reduce(`|`, lapply(.SD, `==`, 10)), .SDcols = cols]
test2 <- dt[i1]
identical(test1, test2)
#[1] TRUE

answered Feb 28 '19 at 18:04

akrun

874,273
37
540
662

1

interesting approach... I have never used `Reduce` before... speed is about the same as the original `apply`-solution – Wimpel Feb 28 '19 at 18:12
1

update: reduce + lapply is the clear speed-winner when the data.table becomes larger. see update in question for bechmarks – Wimpel Mar 01 '19 at 08:09
Would love to understand this answer, but what are `|` and `==` ? – W Barker Dec 23 '20 at 01:37

Efficient way to subset data.table based on value in any of selected columns

update

1 Answers1

Linked