R: cross join column on itself excluding symmetric results

Question

I have a data table init that looks like this:

> init
+---+
| id|       
+---+
|  a|
+---+
|  b|
+---+
|  c|
+---+

I want to obtain all pairs for id column, so I need to cross join the init data table on itself. Additionally, I want to exclude equal and symmetric results (in my case a,b == b,a, etc.).

Desired output is:

+---+---+
|id1|id2|
+---+---+
|  a|  b|
|  a|  c|
|  b|  c|
+---+---+

How can this be done with the data.table approach?

Full cross join can be implemented as full_cj <- CJ(init$id, init$id):

+---+---+
| V1| V2|
+---+---+
|  a|  a|
|  a|  b|
|  a|  c|
|  b|  a|
|  b|  b|
|  b|  c|
|  c|  a|
|  c|  b|
|  c|  c|
+---+---+

But how can I remove symmetrical and identical results from the output?
My real data is huge, so I'm looking for an efficient solution.

Some data.table options here: [faster version of combn](https://stackoverflow.com/questions/26828301/faster-version-of-combn) — Henrik, Jun 14 '22 at 23:19

score 2 · Accepted Answer · answered Jun 14 '22 at 23:25

2

You can use a non equi join after converting to factor

dt[, id2:=as.factor(id)][dt, on=.(id2>id2),nomatch=0,.(id, id2)]

answered Jun 14 '22 at 23:25

langtang

22,248
1
12
27

score 1 · Answer 2 · answered Jun 16 '22 at 10:21

For the sake of completeness, this can be achieved as well using base R by

init[, .(id1 = rep(id, rev(seq_along(id) - 1L)),
         id2 = unlist(lapply(-seq_along(id), tail, x = id)))]

Benchmark

As the OP has mentioned

My real data is huge, so I'm looking for an efficient solution.

here is a benchmark comparing

cj: CJ() with subsequent filtering
sj: langtang's non-equi self-join
rp: above base R approach using rep() and tail()

for problem sizes (length of input vector) of 10, 100, 1000, and 10k rows. Note that for the 10k case the result contains nearly 50 million combinations. So available memory is a limiting factor.

Note that bench::mark() verifies that all approaches return the same result.

library(bench)
library(data.table)
bm <- press(
  n_id = 10^(4:1),
  {
    init0 <- data.table(id = sprintf("%09i", seq(n_id)))
    cat(sprintf("Number of combinations in result: %g\n", (n_id-1)*n_id/2))
    mark(
      rp = {
        init <- copy(init0)
        init[, .(id1 = rep(id, rev(seq_along(id) - 1L)),
                 id2 = unlist(lapply(-seq_along(id), tail, x = id)))]
      },
      sj = {
        init <- copy(init0)
        init[, id2 := as.factor(id)][
          init, on = .(id2 > id2), nomatch = 0, .(id1 = as.character(id2), id2 = id)]
      },
      cj = {
        init <- copy(init0)
        init[, CJ(id1 = id, id2 = id)][id1 < id2]
      },
      check = function(x, y) {
        tmp <- all.equal(x, y, check.attributes = FALSE)
          if (!isTRUE(tmp)) {
            cat(tmp, "\n")
            str(x)
            str(y)
          }
        tmp
      },
      min_iterations = 1L
    )
  }
)

bm

   expression  n_id      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result   memory  time 
   <bch:expr> <dbl> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list>   <list>  <lis>
 1 rp         10000     1.8s     1.8s    0.556     2.05GB   0.556      1     1       1.8s <data.t… <Rprof… <ben…
 2 sj         10000    2.87s    2.87s    0.348     2.24GB   0.696      1     2      2.87s <data.t… <Rprof… <ben…
 3 cj         10000   49.29s   49.29s    0.0203    3.17GB   0.0203     1     1     49.29s <data.t… <Rprof… <ben…
 4 rp          1000  22.11ms  22.53ms   44.2      21.15MB   2.10      21     1   475.58ms <data.t… <Rprof… <ben…
 5 sj          1000  18.28ms  18.88ms   52.4      23.19MB   2.02      26     1   496.27ms <data.t… <Rprof… <ben…
 6 cj          1000 542.97ms 542.97ms    1.84     32.53MB   0          1     0   542.97ms <data.t… <Rprof… <ben…
 7 rp           100   1.46ms   1.52ms  651.      285.94KB   6.56     298     3   457.53ms <data.t… <Rprof… <ben…
 8 sj           100   2.14ms   2.21ms  450.      406.48KB   4.24     212     2   471.37ms <data.t… <Rprof… <ben…
 9 cj           100   6.92ms   6.97ms  143.      431.23KB   0         72     0   504.44ms <data.t… <Rprof… <ben…
10 rp            10  679.8µs 738.38µs 1343.       66.34KB   6.60     610     3   454.25ms <data.t… <Rprof… <ben…
11 sj            10   1.71ms   1.81ms  552.      156.17KB   4.22     262     2   474.47ms <data.t… <Rprof… <ben…
12 cj            10 915.48µs 967.99µs 1029.      100.83KB   4.21     489     2   475.13ms <data.t… <Rprof… <ben…

library(ggplot2)
autoplot(bm)

Conclusion

cj is magnitudes slower than the other 2 approaches for larger use cases while rp is on a par with sj but allocates slightly less memory.

R: cross join column on itself excluding symmetric results

2 Answers2

Benchmark

Conclusion