How to do a basic left outer join with data.table in R?

Question

I have a data.table of a and b that I've partitioned into below with b < .5 and above with b > .5:

DT = data.table(a=as.integer(c(1,1,2,2,3,3)), b=c(0,0,0,1,1,1))
above = DT[DT$b > .5]
below = DT[DT$b < .5, list(a=a)]

I'd like to do a left outer join between above and below: for each a in above, count the number of rows in below. This is equivalent to the following in SQL:

with dt as (select 1 as a, 0 as b union select 1, 0 union select 2, 0 union select 2, 1 union select 3, 1 union select 3, 1),
  above as (select a, b from dt where b > .5),
  below as (select a, b from dt where b < .5)
select above.a, count(below.a) from above left outer join below on (above.a = below.a) group by above.a;
 a | count 
---+-------
 3 |     0
 2 |     1
(2 rows)

How do I accomplish the same thing with data.tables? This is what I tried so far:

> key(below) = 'a'
> below[above, list(count=length(b))]
     a count
[1,] 2     1
[2,] 3     1
[3,] 3     1
> below[above, list(count=length(b)), by=a]
Error in eval(expr, envir, enclos) : object 'b' not found
> below[, list(count=length(a)), by=a][above]
     a count b
[1,] 2     1 1
[2,] 3    NA 1
[3,] 3    NA 1

I should also be more specific in that I already tried merge but that blows through the memory on my system (and the dataset takes only about 20% of my memory).

possible duplicate of [How to join data frames in R (inner, outer, left, right)?](http://stackoverflow.com/questions/1299871/how-to-join-data-frames-in-r-inner-outer-left-right) — Joris Meys, Aug 17 '11 at 11:41
You could have tried looking for the answer first. a data.table is an extension of data.frame, and has a merge function. — Joris Meys, Aug 17 '11 at 11:42
Agree you should explain what you want in natural language, but in addition I get an error in both version 1.6 and (after updating) 1.6.4: Error in abs(j) : Non-numeric argument to mathematical function In addition: Warning message: In is.na(j) : is.na() applied to non-(list or vector) of type 'NULL' — IRTFM, Aug 17 '11 at 16:25
Sorry, it was late - hopefully the question is a gazillion times better now. — Yang, Aug 17 '11 at 19:34

score 4 · Answer 1 · answered Aug 17 '11 at 09:53

4

Since you appear to be using package data.table: check ?merge.data.table. I haven't used it, but it appears this might do what you want:

merge(above, below, by="a", all.x=TRUE, all.y=FALSE)

answered Aug 17 '11 at 09:53

Nick Sabbe

11,684
1
43
57

In fact merge is base function and works also for data.frames. – mbq Aug 17 '11 at 14:26
3

@mbq: `merge.data.table` is really quite different than `merge.data.frame`. – IRTFM Aug 17 '11 at 16:20
@DWin Ok, but nevertheless one does not need data.tables to get merge functionality. – mbq Aug 17 '11 at 17:15
4

@mbq. That is correct... BUT if one is using data.table objects then one does need to know the proper syntax to achieve one's goals, and that can only be gathered from ?merge.data.table and not ?merge.data.frame. – IRTFM Aug 18 '11 at 01:03
works if the by column is uniquely valid, but this is not always be the case in a left join – Hack-R Feb 12 '16 at 19:39

score 4 · Accepted Answer · answered Aug 17 '11 at 20:01

4

See if this is giving you something useful. Your example is too sparse to let me know what you want, but it appears it might be a tabulation of values of above$a that are also in below$a

table(above$a[above$a %in% below$a])

If you also want the converse ... values not in below, then this would do it:

table(above$a[!above$a %in% below$a])

And you can concatenate them:

> c(table(above$a[above$a %in% below$a]),table(above$a[!above$a %in% below$a]) )
2 3 
1 2

Generally table and %in% run in reasonably small footprints and are quick.

answered Aug 17 '11 at 20:01

IRTFM

258,963
21
364
487

This doesn't give me the `a`s with 0 counts, though, which is why I wanted a left outer join. – Yang Aug 17 '11 at 20:21
Yes it does. The "zero counts" are 2 "3"s from the !x %in% y indexing. If you wanted to label them as such you could insert an additional named value between them: `zeros->`=0, – IRTFM Aug 17 '11 at 20:33

score 2 · Answer 3 · answered Apr 27 '14 at 14:22

2

I think this is easier:

setkey(above,a)
setkey(below,a)

Left outer join:

above[below, .N]

regular join:

above[below, .N, nomatch=0]

full outer join with counts:

merge(above,below, all=T)[,.N, by=a]

answered Apr 27 '14 at 14:22

Polgy

36
2

Wouldn't `above[below, .N]` just give you back the total number of rows, due to the `.N`? – gented Oct 08 '15 at 13:01

Yang · Answer 4 · 2011-08-18T01:11:38.993

I eventually found a way to do this with data.table, which I felt is more natural for me to understand than DWin's table, though YMMV:

result = below[, list(count=length(b)), by=a]
key(result) = 'a'
result = result[J(unique(above$a))]
result$count[is.na(result$count)] = 0

I don't know if this could be more compact, though. I especially wanted to be able to do something like result = below[J(unique(above$a)), list(count=length(b))], but that doesn't work.

How to do a basic left outer join with data.table in R?

4 Answers4

Linked