3

I have a largish master data.table and different processes each modifying different subsets of it. Then i need to update the master table with the changes made by each process.

This is an example:

# Master table
dtA <- data.table(month=month.abb[1:5],act=letters[1:5],col1=(1:5)^2, col2=(5:9)^3)

# "subsidiary" table    
dtB <- data.table(month=month.abb[3:4],act=letters[3:4],col1=(6:7)^3)

setkey(dtA, month,act)
setkey(dtB, month,act)

dtA
   month act col1 col2
1:   Apr   d   16  512
2:   Feb   b    4  216
3:   Jan   a    1  125
4:   Mar   c    9  343
5:   May   e   25  729

dtB
   month act col1
1:   Apr   d  343
2:   Mar   c  216

# The result I want is:
   month act col1 col2
1:   Apr   d  343  512
2:   Feb   b    4  216
3:   Jan   a    1  125
4:   Mar   c  216  343
5:   May   e   25  729

Almost by chance and with my great surprise, as I haven't seen this documented elsewhere, I found this incredibly simple way to get there:

dtA[dtB, col1:=i.col1]

My question: is this a legitimate way to solve my problem and what would be the syntax to do this dynamically, e.g. if I have many columns to set at the same time into my master table?

Enzo
  • 2,543
  • 1
  • 25
  • 38
  • 1
    Yes, this is exactly the way to do this. If you want to do add more columns you would do something like ````dtA[dtB, `:=`(col1 = i.col1, col2 = i.col2,..., coln = i.coln)]````. I'm wondering how did you discover this by chance? – David Arenburg Feb 10 '15 at 12:39
  • @david I may have missed it, but I couldn't find any reference to i.columns in data.table docs as a result of a join (but of course it is used elsewhere in R e.g. for merge). It is just one of the many things I tried and (the only one that) worked! – Enzo Feb 10 '15 at 13:01
  • @eddi I didn't see the question you are referring to. Even after hours of searching so and any data.table document I have (FAQ, cheat sheet, intro, data.table main doc in CRAN) I couldn't find an answer. Why the i.col notation is so little documented and used in examples?? – Enzo Feb 10 '15 at 16:35
  • @Enzo that's a good question, and I originally came upon that notation by accident. If you have specific suggestions on how it can be incorporated into any of those docs you should post them on github. – eddi Feb 10 '15 at 17:32

1 Answers1

0

One approach to dynamically update multiple columns is to use assign by reference and select your columns from either data.table that you want updating using indices. This will save you from having to type out assignments for all column names.

The example here includes an additional column in dtA to demonstrate unequal tables:

dtA <- data.table(month=month.abb[1:5],act=letters[1:5],col1=(1:5)^2, col2=(8:12)^2, col3=(13:17)^2)
dtB <- data.table(month=month.abb[3:4],act=letters[3:4],col1=(6:7)^3, col2=(13:14)^3)

setkey(dtA, month,act)
setkey(dtB, month,act)

dtA
   month act col1 col2 col3
1:   Apr   d  343 2744  256
2:   Feb   b    4   81  196
3:   Jan   a    1   64  169
4:   Mar   c  216 2197  225
5:   May   e   25  144  289

dtB
   month act col1 col2
1:   Apr   d  343 2744
2:   Mar   c  216 2197

# columns 3:4 are selected for updating
dtA[dtB[,list(month,act)], 3:4 := (dtB[, 3:4])]


dtA
   month act col1 col2 col3
1:   Apr   d  343 2744  256
2:   Feb   b    4   81  196
3:   Jan   a    1   64  169
4:   Mar   c  216 2197  225
5:   May   e   25  144  289
MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
MattLBeck
  • 5,701
  • 7
  • 40
  • 56
  • that looks nice, but I though the mantra was to always assign by reference? – Enzo Feb 10 '15 at 13:04
  • Actually, your method is "risky" if dtA has more data columns than dtB: they will get NA. I will edit the original question to cater for this use case. – Enzo Feb 10 '15 at 13:18
  • @Enzo see the updated answer for an alternative assignment by reference solution. Yes, I assumed that `dtA` would have the same columns as `dtB`, since `dtB` is derived from `dtA`. – MattLBeck Feb 10 '15 at 13:21
  • dtA as the master table has actually more columns that dtB: my apologies, I simplified my example too much! – Enzo Feb 10 '15 at 13:22
  • @Enzo I think the second solution might solve that second problem anyway, seeing as you are specifying a particular range. It's not that different to what was suggested in the comments, but you at least don't have to assign each column separately. – MattLBeck Feb 10 '15 at 13:23
  • I would love to know why my answer was worth a downvote. It's a valid working solution as far as I can tell. – MattLBeck Feb 10 '15 at 16:55
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/70651/discussion-between-enzo-and-mattrition). – Enzo Feb 10 '15 at 17:25