dplyr: access full data frame (including added column) inside filter()

Question

I'm sure there is a better way to ask this question, and perhaps if I knew it I'd have found an existing answer. The gist is this:

I'm reading in SQL data with R with NEW, UPDATE, and DELETE rows
table uniqueness is determined by values in two columns (call them a and b)
I want (a) find all combinations of a and b in which column c has a DELETE value, and (b) then remove all rows with one of these a and b combinations

Here's a reproducible example:

library(dplyr)

df <- data.frame(
  a = c(1, 1, 1, 2, 2),
  b = c(3, 3, 4, 5, 5),
  c = c("new", "delete", "new", "new", "update")
)

What I tried, not thinking this through:

df %>%
  mutate(filter_key = paste0(a, "_", b)) %>%
  filter(!(filter_key %in% (df %>%
                              filter(c == "delete") %>%
                              pull(filter_key) %>%
                              unique())))

The problem is that filter_key is not in df, so it's not there to pull unique values for. I'm curious if there is a way to refer to it "on the fly"? Some answers led me to think I could perhaps use . or .data, but this didn't work for me (replacing df with . or .data).

The trivial solution is to add it beforehand:

df <- df %>% mutate(filter_key = paste0(a, "_", b))
df %>%
  filter(!(filter_key %in% (df %>%
                              filter(c == "delete") %>%
                              pull(filter_key) %>%
                              unique())))

Alternatively (though this changes the question, so I'll leave it in these "notes"), there's probably a way to just filter directly, but I wasn't sure how to say:

give me all the combos of a and b where any row has c == "delete"
remove all rows where a and b match one of these combos

score 2 · Answer 1 · answered Jul 23 '22 at 02:42

2

Use an anti-join:

df %>% anti_join(filter(., c == "delete"), by = c("a", "b"))

giving

  a b      c
1 1 4    new
2 2 5    new
3 2 5 update

answered Jul 23 '22 at 02:42

G. Grothendieck

254,981
17
203
341

Never heard of this. Awesome, and appreciate learning something new! – Hendy Jul 24 '22 at 19:26

score 1 · Accepted Answer · answered Jul 22 '22 at 23:30

1

You can do:

library(tidyverse)

df %>%
  mutate(both = str_c(a, b)) %>%
  filter(!both %in% (cur_data() %>%
                      filter(c == 'delete') %>%
                      pull(both)))

  a b      c both
1 1 4    new   14
2 2 5    new   25
3 2 5 update   25

answered Jul 22 '22 at 23:30

deschen

10,012
3
27
50

Interesting, I hadn't run across that function. Thanks for teaching me something new! – Hendy Jul 24 '22 at 19:27
My question was a form of xy problem. This answers what I started asking, which was how to access the current df, including on-the-fly columns, from within the pipeline. Thank you for that answer! For others looking for solutions, I think the `anti_join()` question was a cleaner approach to my actual need. – Hendy Aug 02 '22 at 17:26

dplyr: access full data frame (including added column) inside filter()

2 Answers2