Let's assume I have two datasets, dfA and dfB. One has individual observations and the other has country-level data (which applies to multiple observations from the same year and country). For each of these datasets I have created a key called matchcode, which is a combination of a country code and a year.
dfA <- read.table(
text = "A B C D E F G iso year matchcode
1 0 1 1 1 0 1 0 NLD 2010 NLD2010
2 1 0 0 0 1 0 1 NLD 2014 NLD2014
3 0 0 0 1 1 0 0 AUS 2010 AUS2010
4 1 0 1 0 0 1 0 AUS 2006 AUS2006
5 0 1 0 1 0 1 1 USA 2008 USA2008
6 0 0 1 0 0 0 1 USA 2010 USA2010
7 0 1 0 1 0 0 0 USA 2012 USA2012
8 1 0 1 0 0 1 0 BLG 2008 BLG2008
9 0 1 0 1 1 0 1 BEL 2008 BEL2008
10 1 0 1 0 0 1 0 BEL 2010 BEL2010
11 0 1 1 1 0 1 0 NLD 2010 NLD2010
12 1 0 0 0 1 0 1 NLD 2014 NLD2014
13 0 0 0 1 1 0 0 AUS 2010 AUS2010
14 1 0 1 0 0 1 0 AUS 2006 AUS2006
15 0 1 0 1 0 1 1 USA 2008 USA2008
16 0 0 1 0 0 0 1 USA 2010 USA2010
17 0 1 0 1 0 0 0 USA 2012 USA2012
18 1 0 1 0 0 1 0 BLG 2008 BLG2008
19 0 1 0 1 1 0 1 BEL 2008 BEL2008
20 1 0 1 0 0 1 0 BEL 2010 BEL2010",
header = TRUE
)
dfB <- read.table(
text = "A B C D H I J iso year matchcode
1 0 1 1 1 0 1 0 NLD 2009 NLD2009
2 1 0 0 0 1 0 1 NLD 2014 NLD2014
3 0 0 0 1 1 0 0 AUS 2011 AUS2011
4 1 0 1 0 0 1 0 AUS 2007 AUS2007
5 0 1 0 1 0 1 1 USA 2007 USA2007
6 0 0 1 0 0 0 1 USA 2011 USA2010
7 0 1 0 1 0 0 0 USA 2013 USA2013
8 1 0 1 0 0 1 0 BLG 2007 BLG2007
9 0 1 0 1 1 0 1 BEL 2009 BEL2009
10 1 0 1 0 0 1 0 BEL 2012 BEL2012",
header = TRUE
)
library(data.table)
setDT(dfA)
setDT(dfB)
Usually, when I merge these datasets, I simply do:
dfA <- merge(dfA, dfB, by = "matchcode", all.x = TRUE, allow.cartesian = FALSE)
The problem is that sometimes the years do not match exactly, so I tried a rolling join on the nearest year:
dfA <- dfA[dfB, on = .(iso, year), roll = "nearest", nomatch = 0]
But this reduces the number of observations from 20 to 11:
# A tibble: 11 x 18
A B C D E F G iso year matchcode K L M N O P Q i.matchcode
<int> <int> <int> <int> <int> <int> <int> <fct> <int> <fct> <int> <int> <int> <int> <int> <int> <int> <fct>
1 0 1 1 1 0 1 0 NLD 2009 NLD2010 0 1 1 1 0 1 0 NLD2009
2 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 0 0 1 0 1 NLD2014
3 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 0 0 1 0 1 NLD2014
4 0 0 0 1 1 0 0 AUS 2011 AUS2010 0 0 0 1 1 0 0 AUS2011
5 1 0 1 0 0 1 0 AUS 2007 AUS2006 1 0 1 0 0 1 0 AUS2007
6 0 1 0 1 0 1 1 USA 2007 USA2008 0 1 0 1 0 1 1 USA2007
7 0 0 1 0 0 0 1 USA 2011 USA2010 0 0 1 0 0 0 1 USA2010
8 0 1 0 1 0 0 0 USA 2013 USA2012 0 1 0 1 0 0 0 USA2013
9 1 0 1 0 0 1 0 BLG 2007 BLG2008 1 0 1 0 0 1 0 BLG2007
10 0 1 0 1 1 0 1 BEL 2009 BEL2008 0 1 0 1 1 0 1 BEL2009
11 1 0 1 0 0 1 0 BEL 2012 BEL2010 1 0 1 0 0 1 0 BEL2012
The preferred output would be as follows:
# A B C D E F G iso year matchcodeA H I J matchcodeB
# 1: 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 1 NLD2014
# 2: 0 0 0 1 1 0 0 AUS 2011 AUS2010 1 0 0 AUS2011
# 3: 1 0 1 0 0 1 0 AUS 2007 AUS2006 0 1 0 AUS2007
# 4: 0 0 1 0 0 0 1 USA 2011 USA2010 0 0 1 USA2010
# 5: 0 1 0 1 0 0 0 USA 2013 USA2012 0 0 0 USA2013
# 6: 0 1 0 1 1 0 1 BEL 2009 BEL2008 1 0 1 BEL2009
# 7: 0 1 1 1 0 1 0 NLD 2009 NLD2010 0 1 0 NLD2009
# 8: 0 1 0 1 0 1 1 USA 2007 USA2008 0 1 1 USA2007
# 9: 0 1 0 1 0 0 0 USA 2011 USA2012 0 0 1 USA2010
#10: 1 0 1 0 0 1 0 BEL 2009 BEL2010 1 0 1 BEL2009
#11: 1 0 0 0 1 0 1 NLD 2014 NLD2014 1 0 1 NLD2014
#12: 0 0 0 1 1 0 0 AUS 2011 AUS2010 1 0 0 AUS2011
#13: 1 0 1 0 0 1 0 AUS 2007 AUS2006 0 1 0 AUS2007
#14: 0 0 1 0 0 0 1 USA 2011 USA2010 0 0 1 USA2010
#15: 0 1 0 1 0 0 0 USA 2013 USA2012 0 0 0 USA2013
#16: 0 1 0 1 1 0 1 BEL 2009 BEL2008 1 0 1 BEL2009
#17: 0 1 1 1 0 1 0 NLD 2009 NLD2010 0 1 0 NLD2009
#18: 0 1 0 1 0 1 1 USA 2007 USA2008 0 1 1 USA2007
#19: 0 1 0 1 0 0 0 USA 2011 USA2012 0 0 1 USA2010
#20: 1 0 1 0 0 1 0 BEL 2009 BEL2010 1 0 1 BEL2009
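A note on the row count, as a minimal sketch rather than a confirmed solution: in data.table's `x[i]` syntax the result has one row per row of `i` (plus extra rows where a row of `i` exactly matches several rows of `x`), so `dfA[dfB, ...]` is driven by the 10 rows of dfB; the 11th row appears to come from NLD 2014 matching two identical rows in dfA. Swapping the roles, so that the observations sit inside the brackets, should keep every observation. The example below uses hypothetical toy tables `obs` and `country` (standing in for dfA and dfB) and a helper column `year_B`; these names are assumptions for illustration and not part of the original data.

```r
library(data.table)

# Hypothetical toy data: observation-level rows (like dfA) ...
obs <- data.table(
  id   = 1:6,
  iso  = c("NLD", "NLD", "NLD", "AUS", "AUS", "USA"),
  year = c(2010L, 2010L, 2014L, 2010L, 2006L, 2008L)
)
# ... and country-level rows (like dfB), unique per iso/year.
country <- data.table(
  iso  = c("NLD", "NLD", "AUS", "AUS", "USA"),
  year = c(2009L, 2014L, 2011L, 2007L, 2007L),
  H    = c(1L, 0L, 1L, 0L, 0L)
)

# Keep a copy of the country-level year: after the join, the `year` column
# holds the values from the table inside the brackets (obs).
country[, year_B := year]

# x[i]: one result row per row of i, so the observations go in i.
# iso must match exactly; year (the last join column) rolls to the nearest value.
res <- country[obs, on = .(iso, year), roll = "nearest"]

# Every observation is kept, each with the nearest country-level year.
nrow(res) == nrow(obs)
res[, .(id, iso, year, year_B, H)]
```

Applied to the data above, the analogous call would be `dfB[dfA, on = .(iso, year), roll = "nearest"]`. Columns that exist in both tables (A to D and matchcode) would then appear twice, with the copies from dfA prefixed by `i.`. Note also that USA 2012 lies equally far from 2011 and 2013; which of the two a nearest roll picks in a tie is worth checking against the data.table documentation rather than assuming.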