I have a reference data frame (df1) with three columns of "characteristics" (gender, year, code), and two columns of "values" (amount, status). It looks like this, but with many rows:
gender    year    code    amount   status
     M    2011       A        15      EMX
     M    2011       A       123      NOX
     F    2015       B         0      MIX
     F    2018       A        12      NOX
     F    2015       B        11      NOX
I have another data frame (df2) that just has the three "characteristics" columns. For example:
gender    year   code
     M    2011      A
     M    2011      A
     F    2018      A
     F    2015      B
For each row in df2, I want to assign "values" based on matches in "characteristics" to df1. Where there are multiple matches, I want to select pairs of "values" at random. So when there are duplicate "characteristics" in df2, they might end up with different pairs of "values", but all of them will have an exact match in df1. Essentially, for each combination of characteristics, I want the distribution of values to match between the two tables.
For example, the last row in 'df2' (gender = F, year = 2015, code = B) matches two rows in 'df1': the third row (amont = 0, status = MIX) and the fifth row (amount = 11, status = NOX). Then one of these matching rows should be selected randomly. For all such cases of multiple matches between 'df2' and 'df1' based on gender, year and code, a random row should be selected.
So far, my approach has been to start by using dplyr to do a left_join between the two data frames. However, this provides all possible "values" for each row in df2, rather than selecting one at random. So I then have to group by characteristics and select one.  This produces a very large intermediate table and doesn't seem very efficient.
I wondered if anyone had suggestions for a more efficient method? I've previously found that joining with the data.table package is quicker, but don't really have a good understanding of the package. I also wonder if I should be doing joins at all, or should just use the sample function?
Any help much appreciated.
 
     
     
     
    