I have a large data.table with >300k rows and 884 columns (see here for the data as a CSV, 700 MB).
I am trying to get labels for identical rows. This is what .GRP does wonderfully in data.table. Unfortunately, it takes forever to run and in most cases crashes the R session. Any ideas on how to split up the problem or speed up the solution would be greatly appreciated.
Here is an MWE with the data mentioned above:
library(data.table)

troutmat <- fread("troutmat.csv")

# label identical rows: .GRP assigns one integer per unique row
troutmat[, grp := .GRP, by = names(troutmat)]
This crashes the R session (R 4.1.1 with data.table 1.14.2 on a 16-core Windows server).
Happy to open a bug report, if needed.
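One direction I have considered (untested on the full data, and the block size of 100 is an arbitrary guess) is to collapse blocks of columns into dense integer ranks with frankv() and then group only the resulting narrow table, so that no single step keys on all 884 columns at once:

library(data.table)

troutmat <- fread("troutmat.csv")

# Collapse each block of ~100 columns into a dense integer rank:
# rows that are identical within a block get the same id.
blocks <- split(names(troutmat), ceiling(seq_along(troutmat) / 100))
ids <- as.data.table(lapply(blocks, function(cols)
  frankv(troutmat, cols = cols, ties.method = "dense")))

# Rows identical on every block id are identical on all 884 columns,
# so grouping the ~9 narrow id columns reproduces the full-row labels.
ids[, grp := .GRP, by = names(ids)]
troutmat[, grp := ids$grp]

I have no idea whether this actually avoids the crash or just moves the memory pressure into frankv(), which is part of what I am asking.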
EDIT: Essentially, this is the same question as data.table "key indices" or "group counter", but with a much larger dataset. I am trying to find a fast way to identify duplicated rows, in a way that tells me which row is a duplicate of which other row.
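For a small table, the labelling I am after looks like this:

library(data.table)

# toy data: rows 1 and 3 are identical, so they share a grp label
DT <- data.table(a = c(1, 2, 1), b = c("x", "y", "x"))
DT[, grp := .GRP, by = names(DT)]
DT
#    a b grp
# 1: 1 x   1
# 2: 2 y   2
# 3: 1 x   1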