I was looking at this answer (Summarize all group values and a conditional subset in the same call), which gives me the right idea (setting a dupe column), but I need some more help.
My data is structured something like this (it's medical data, so I'm keeping it as anonymised as possible):
| index | name | category | date | checkup_complete |
|---|---|---|---|---|
| 1 | name1 | cat1 | date1 | Y |
| 2 | name1 | cat1 | date1 | N |
| 3 | name1 | cat2 | date1 | Y |
| 4 | name1 | cat1 | date1 | Y |
| 5 | name1 | cat1 | date2 | N |
| 6 | name2 | cat1 | date1 | Y |
| 7 | name3 | cat1 | date2 | Y |
| 8 | name2 | cat2 | date1 | Y |
| 9 | name3 | cat1 | date2 | N |
| 10 | name2 | cat2 | date1 | Y |
| 11 | name1 | cat1 | date1 | N |
A row is considered a duplicate if it fits the following criteria:
Within a given category, there are multiple rows with the same name, on the same date, with checkup_complete == "Y". Rows matching these criteria should be collapsed to a single record.
So based on this, the duplicates are indices 1 and 4, and 8 and 10. Indices 2 and 11 are not duplicates because their checkup_complete is "N".
The actual table is tens of thousands of rows, so is there a way I can use dplyr (which I'm already using in my overall data-frame workflow) to mark these rows as duplicates or not?
Based on the answer linked above, I suspect I would use group_by on category, name, and date, and mutate a dupe column. The question is: without creating another column (cond_dupe) and then filtering on checkup_complete AND dupe, is there a way of doing it in one go?
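For reference, the single-column approach I'm imagining looks something like this (an untested sketch; `df` is the table above and `dupe` is the column name I'm inventing):

```r
library(dplyr)

df %>%
  group_by(category, name, date) %>%
  # A row is a dupe only if it is a "Y" AND its group contains
  # more than one "Y" row for the same category/name/date.
  mutate(dupe = checkup_complete == "Y" &
                sum(checkup_complete == "Y") > 1) %>%
  ungroup()
```

On the sample data this should flag indices 1, 4, 8, and 10 as TRUE and leave 2 and 11 FALSE, since the condition on checkup_complete is folded into the same mutate rather than done in a separate filter step. Is this the idiomatic way, or is there something better?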