I would like to delete duplicates in a very large dataset (millions of rows) based on a condition. I thought about the following simplifying example to illustrate my prob:
test <- read.table(
text = "
A   1900    1   10  45tz    tztime1 01.06.1900
A   1900    2   9   45tz    tztime1 01.06.1900
A   1900    3   8   45tz    tztime1 01.06.1900
A   1900    4   7   45tz    tztime1 01.06.1900
A   1900    5   6   45tz    tztime1 01.06.1900
A   1900    6   5   45tz    tztime1 01.06.1900
A   1900    7   4   45tz    tztime1 01.06.1900
A   1900    7   10  45tz    tztime1 01.06.1900
A   1900    7   9   45tz    tztime1 01.06.1900
A   1900    8   3   45tz    tztime1 01.06.1900
A   1900    8   10  45tz    tztime1 01.06.1900
A   1900    8   9   45tz    tztime1 01.06.1900
A   2000    1   10  45tz    tztime2 01.06.2000
A   2000    2   9   45tz    tztime2 01.06.2000
A   2000    3   8   45tz    tztime2 01.06.2000
A   2000    3   10  45tz    tztime2 01.06.2000
A   2000    3   9   45tz    tztime2 01.06.2000
B   1800    1   10  52fd    tztime0 01.06.1800
B   1800    2   9   52fd    tztime0 01.06.1800
B   1800    3   8   52fd    tztime0 01.06.1800
B   1800    3   10  52fd    tztime0 01.06.1800
B   1800    3   9   52fd    tztime0 01.06.1800
B   1800    4   7   52fd    tztime0 01.06.1800
B   1900    1   10  52fd    tztime1 01.06.1900
B   1900    2   9   52fd    tztime1 01.06.1900
B   1900    2   10  52fd    tztime1 01.06.1900
B   1900    2   9   52fd    tztime1 01.06.1900
",header=TRUE)
library(data.table)
setDT(test)
names(test) <-  c("ID", "Year", "Count", "value", "A","B","C")
In this simplified dataset, I have two individuals (A and B), for different but possibly overlapping years. A Count is given, as well as a value.
I would like to delete the observations for each ID within each YEAR and Count group, that are duplicates and fullfill a certain condition (see below). For example for the group:
A   1900    7   4
A   1900    7   10
A   1900    7   9
I would like to delete all observations, whose value is larger than the minimum value within each group. In this case I would like to have only
A   1900    7   4
as a remainder.
Note that my real dataset is very large and has many more columns. Therefore if possible, I am looking for a solution which is memory-efficient.
I hope that was clear enough. If not, feel free to ask for any information that is missing.
Edit: my real dataset has a lot more columns than displayed here, so in the end I am looking for a solution which displays the information of all the columns (for example, assume in this case there are also column A, B and C as part of the dataset, which I have added in the latest edit. They are not really needed for the grouping/filtering, but still should be part of the final result). The currently proposed solution does not account for this.
 
     
    