I would like to join two data sets, A and B. I would like to join A and B exactly on their id variables, but keep only the most recent observation in B that is between three months and three years old.
The data sets are big enough that I need to use the sqldf package (about 500,000 rows in A and 250,000 rows in B). It seems that the logic should be to LEFT OUTER JOIN A AND B with A.id = B.id and (A.date - B.date) BETWEEN 3*30 AND 3*365, then GROUP BY A.row, ORDER BY B.date DESC, then keep first observation. But my code below keeps the first observation OVERALL, not the first observation per A.row group.
I can do this join in two steps (one sqldf, one tidyverse), but can  sqldf do both steps?
library(tidyverse)
library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#> 
#>     date
library(sqldf)
#> Loading required package: gsubfn
#> Loading required package: proto
#> Loading required package: RSQLite
# Some toy data:
A <- tibble(id = rep(1:10, each = 2),
               subid = rep(1:2, 10),
               date = rep(ymd('2019-01-01'), 20))
A$row <- seq(nrow(A))
set.seed(42)
B <- tibble(id = rep(1:10, each = 10),
            date = ymd('2015-01-01') + months(10*rep(1:10, 10)),
            x = runif(100))
# This code properly matches A and B, but only returns the first observation OVERALL, not per A.row:
C <- sqldf('SELECT *
           FROM A
           LEFT OUTER JOIN B
           ON A.id = B.id
           AND (A.date - B.date) BETWEEN 3*30 and 3*365
           GROUP BY row
           ORDER BY B.date DESC
           LIMIT 1') %>% 
    as_tibble()
C
#> # A tibble: 1 x 7
#>      id subid date         row id..5 date..6     x
#>   <int> <int> <date>     <int> <int>   <dbl> <dbl>
#> 1     1     1 2019-01-01     1     1   17652 0.830
# I could do this in two steps, with the first step in sqldf and the second step in the tidyverse. This two step approach would work my data, because B has annual data, so there should not be more than three matches per row in A. However, it seems like I should be able to do the entire join in sqldf (and maybe one data I will not be able to do the second step in the tidyverse).
D <- sqldf('SELECT *
           FROM A
           LEFT OUTER JOIN B
           ON A.id = B.id
           AND (A.date - B.date) BETWEEN 3*30 and 3*365') %>% 
    as_tibble()
E <- D %>%
    arrange(row, desc(date..6)) %>%
    group_by(row) %>%
    filter(row_number() == 1) %>%
    ungroup()
# Below is the desired output. Can sqldf do both steps?
E
#> # A tibble: 20 x 7
#>       id subid date         row id..5 date..6     x
#>    <int> <int> <date>     <int> <int>   <dbl> <dbl>
#>  1     1     1 2019-01-01     1     1   17652 0.830
#>  2     1     2 2019-01-01     2     1   17652 0.830
#>  3     2     1 2019-01-01     3     2   17652 0.255
#>  4     2     2 2019-01-01     4     2   17652 0.255
#>  5     3     1 2019-01-01     5     3   17652 0.947
#>  6     3     2 2019-01-01     6     3   17652 0.947
#>  7     4     1 2019-01-01     7     4   17652 0.685
#>  8     4     2 2019-01-01     8     4   17652 0.685
#>  9     5     1 2019-01-01     9     5   17652 0.974
#> 10     5     2 2019-01-01    10     5   17652 0.974
#> 11     6     1 2019-01-01    11     6   17652 0.785
#> 12     6     2 2019-01-01    12     6   17652 0.785
#> 13     7     1 2019-01-01    13     7   17652 0.566
#> 14     7     2 2019-01-01    14     7   17652 0.566
#> 15     8     1 2019-01-01    15     8   17652 0.479
#> 16     8     2 2019-01-01    16     8   17652 0.479
#> 17     9     1 2019-01-01    17     9   17652 0.646
#> 18     9     2 2019-01-01    18     9   17652 0.646
#> 19    10     1 2019-01-01    19    10   17652 0.933
#> 20    10     2 2019-01-01    20    10   17652 0.933
Created on 2019-07-12 by the reprex package (v0.3.0)
 
     
    