Julia DataFrames: Replace entries in a dataframe based on a comparison with another dataframe

Question

I have the following dataframes:

df1 = DataFrame(
    col_A = [1, 2, 3, 4, 5, 6, 7],
    col_B = ["A", "B", "C", "D", "E", "F", "G"],
    col_C = missing,
)

7×3 DataFrame
 Row │ col_A  col_B   col_C   
     │ Int64  String  Missing 
─────┼────────────────────────
   1 │     1  "A"     missing 
   2 │     2  "B"     missing 
   3 │     3  "C"     missing 
   4 │     4  "D"     missing 
   5 │     5  "E"     missing 
   6 │     6  "F"     missing 
   7 │     7  "G"     missing

df2 = DataFrame(
    col_X = [1, 2, 3, 4, 5, 5],
    col_Y = ["A", "nope", "C", "nope", "E", "E"],
    col_Z = ["First", "Second", "Third", "Fourth", "Fifth", "Duplicated"]
)

6×3 DataFrame
 Row │ col_X  col_Y   col_Z      
     │ Int64  String  String     
─────┼───────────────────────────
   1 │     1  "A"     "First"
   2 │     2  "nope"  "Second"
   3 │     3  "C"     "Third"
   4 │     4  "nope"  "Fourth"
   5 │     5  "E"     "Fifth"
   6 │     5  "E"     "Duplicated"

I need to efficiently replace the values of df1.col_C with those of df2.col_Z if there's a match between -let's say- composite keys made from the first 2 columns in both dataframes (e.g. (1, "A") occurs in both, but (2, "B") doesn't), and otherwise keep things unchanged. If there were duplicated composite keys, get the last occurrence in df2.

So df1 would become:

7×3 DataFrame
 Row │ col_A  col_B   col_C      
     │ Int64  String  String?    
─────┼───────────────────────────
   1 │     1  "A"     "First"
   2 │     2  "B"     missing    
   3 │     3  "C"     "Third"
   4 │     4  "D"     missing    
   5 │     5  "E"     "Duplicated"
   6 │     6  "F"     missing    
   7 │     7  "G"     missing

score 6 · Answer 1 · answered Dec 18 '22 at 21:21

using InMemoryDatasets package:

df1 = Dataset(
    col_A = [1, 2, 3, 4, 5, 6, 7],
    col_B = ["A", "B", "C", "D", "E", "F", "G"],
    col_C = missings(String, 7),
)
df2 = Dataset(
    col_A = [1, 2, 3, 4, 5, 5],
    col_B = ["A", "nope", "C", "nope", "E", "E"],
    col_C = ["First", "Second", "Third", "Fourth", "Fifth", "Duplicated"]
)
update!(df1, df2, on = [:col_A, :col_B])

score 2 · Answer 2 · answered Dec 18 '22 at 07:18

2

I assume, if you say that row-count does not need to match that you want to lookup in df2 the matching key. This feature is planned to be added in https://github.com/JuliaData/DataFrames.jl/issues/2243 release. Currently we are discussing what the best API would be.

Next, note that it is not clear in your question what to do if there were duplicate keys in the df2 table. I assume there are none. In this case currently you can do:

julia> mapping = Dict(df2.col_X .=> df2.col_Z)
Dict{Int64, Union{Missing, String}} with 5 entries:
  5 => "Fifth"
  4 => "foo"
  2 => "Second"
  3 => missing
  1 => "First"

julia> df1.col_C = [get(mapping, k, v) for (k, v) in zip(df1.col_A, df1.col_C)]
6-element Vector{Union{Missing, String}}:
 "First"
 "Second"
 missing
 "foo"
 "Fifth"
 missing

(note that the result is different than what you expect as in both tables in row 2 there is key value 2 that matches)

answered Dec 18 '22 at 07:18

Bogumił Kamiński

66,844
3
80
107

I think what OP meant was using A, B <-> X,Y as join keys. See edit in my answer. You can edit this answer, as it is basically this solution. – Dan Getz Dec 18 '22 at 10:04
Agreed - I was not sure what OP wanted exactly as the example given did not match what I thought would be a solution. Let us wait for OP to comment what is needed exactly and I will update my answer then. – Bogumił Kamiński Dec 18 '22 at 10:46
Would: `sort!(select(leftjoin(df1,df2, on = [:col_A => :col_X, :col_B => :col_Y]), names(df1), [:col_C,:col_Z] => ByRow((c,z)->coalesce(z,c)) => :col_C), :col_A)` be the DataFrames equivalent of an SQL command to do this? – Dan Getz Dec 18 '22 at 13:49
yes - something like this. – Bogumił Kamiński Dec 18 '22 at 13:56
Yes, it'd be like a composite key made from A, B that needs to match X, Y. I didn't think about duplicates, but let's consider that case. I'm editing the question, so these details are captured. So far, Dan's answer works for the non-duplicate case. – jn_br Dec 18 '22 at 14:35
1

Dan's answer (the update using dict) will give you last value in case of duplicates. – Bogumił Kamiński Dec 18 '22 at 18:12
You're right. I just realized! – jn_br Dec 20 '22 at 01:12

Dan Getz · Accepted Answer · 2022-12-18T10:02:34.967

0

Is this satisfactory?

julia> df1.col_C .= ifelse.(df1.col_A .== df2.col_X .&& 
                            df1.col_B .== df2.col_Y, 
                            df2.col_Z, missing)
5-element Vector{Union{Missing, String}}:
 "First"
 missing
 missing
 missing
 "Fifth"

Using Bogumil's answer, I think:

mapping = Dict(zip(df2.col_X, df2.col_Y) .=> df2.col_Z)
df1.col_C = [get(mapping, k, missing) 
  for k in zip(df1.col_A, df1.col_B)]

will fix the un-aligned dfs problem you commented about.

edited Dec 18 '22 at 10:02

answered Dec 18 '22 at 02:59

Dan Getz

17,002
2
23
41

That's a cool solution, though it doesn't work if dataframes don't have the same length. I'm getting a DimensionMismatch error. – jn_br Dec 18 '22 at 03:14
1

@jn_br See my comments/edits in this and Bogumil's answer. – Dan Getz Dec 18 '22 at 10:05

Julia DataFrames: Replace entries in a dataframe based on a comparison with another dataframe

3 Answers3