I have two data frames: one ("grny") that is mainly a reference but also has some data in the "yield" column I'm after, and another ("txie") that will have "yield" data with a few NA's for missing data. I want to merge them so that all cells from a row with a common value in "site" are complete.
Where most of the year-by-year data is:
txie<-data.frame (site=c(rep("smithfield",2),rep("belleville",3)),
yield=c((rnorm(4, mean=8)),NA),
year=c(1999:2000,1992:1994),
prim=c(rep("nt",2),rep(NA,3)))
Mostly reference with some year-by-year yield data:
grny<-data.frame (site=c("smithfield","belleville",rep("nashua",3)),
yield=c(rep(NA,2),rnorm(3,mean=9)),
year=c(rep(NA,2),1990:1992),
prim=c(NA,"nt",sample(c("nt","ct"),3,rep=TRUE)),
lat=(c(rnorm(2,mean=45,sd=10),rep(49.1,3))))
What I want:
         site    yield year prim  lib      lat
1  smithfield 7.009178 1999   nt 1109     43.61828
2  smithfield 8.472677 2000   nt 1109     43.61828
3  belleville 8.857462 1992   nt 122      74.08792
4  belleville 7.368488 1993   nt 122      74.08792
5  belleville       NA 1994   nt 122      74.08792
6  nashua     7.494519 1990   nt 554      49.10000
8  nashua     8.696066 1991   ct 554      49.10000
9  nashua     8.051670 1992   nt 554      49.10000
What I've tried:
rbind.fill(txie,grny) #this appends rows to the correct columns but leaves NA's everywhere because it doesn't know I want data missing in grny filled in when it is available in txie
Reduce(function(x,y) merge(txie,grny, by="site", all.y=TRUE), list(txie,grny)) #this merges by rows but creates new variables from x and y.
merge(x = txie, y = grny, by = "site", all = TRUE) #this does the same as  the above (new variables from each x and y ending in .x or .y)
merge(x = txie, y = grny, by = "site", all.x = TRUE)#this does similar to above but merges based on the x df  (new variables from each x and y ending in .x or .y)
setkey(setDT(grny),site)[txie]# this gives a similar result to the all.x line
For example, with the outer join merge I end up with:
     site  yield.x year.x prim.x  yield.y year.y prim.y      lat
1 belleville 6.766628   1992   <NA>       NA     NA     nt 34.92136
2 belleville 6.845789   1993   <NA>       NA     NA     nt 34.92136
3 belleville       NA   1994   <NA>       NA     NA     nt 34.92136
4 smithfield 8.841339   1999     nt       NA     NA   <NA> 49.81872
5 smithfield 7.313310   2000     nt       NA     NA   <NA> 49.81872
6     nashua       NA     NA   <NA> 9.173229   1990     ct 49.10000
7     nashua       NA     NA   <NA> 9.196018   1991     nt 49.10000
8     nashua       NA     NA   <NA> 7.336645   1992     ct 49.10000
Stipulations: I would like to keep NA's that were already in the "yield" column (eg. nashua in 1994). Any answers or can someone show me where an example of this kind of merge (with data already in one or more shared columns you're not merging by, each df bringing in new columns except the "by" variable) is?
Thanks!!!
 
    