I want to compare two very large data frames and build a combined data frame by matching on an ID column.
Part of my first data frame (input1):
ID  BGC_Class   Start   End BGC_Name    Similarity  MIBiG
GCA_000006785.2_ASM678v2    Bacteriocin 593677  606065  Streptolysin_S  100%    BGC0000566
GCA_000169475.1_ASM16947v1  Bacteriocin 633235  645623  Streptolysin_S  100%    BGC0000566
GCA_000433555.1_MGS126  Bacteriocin 524573  536961  Streptolysin_S  100%    BGC0000566
Part of the second data frame (input2):
ID  Species_name    Strain_name
GCA_000169475.1_ASM16947v1  [Ruminococcus]_gnavus   [Ruminococcus]_gnavus_ATCC_29149_strain=ATCC_29149_
GCA_000433555.1_MGS126  [Ruminococcus]_gnavus   [Ruminococcus]_gnavus_CAG:126__
I want to match the 'ID' columns of the two data frames and create a new data frame (results) that contains only the rows whose ID occurs in both, with the columns of both inputs side by side. So in the ideal case, the output data frame would be:
ID  Species_name    Strain_name BGC_Class   Start   End BGC_Name    Similarity  MIBiG
GCA_000169475.1_ASM16947v1  [Ruminococcus]_gnavus   [Ruminococcus]_gnavus_ATCC_29149_strain=ATCC_29149_ Bacteriocin 633235  645623  Streptolysin_S  100%    BGC0000566
GCA_000433555.1_MGS126  [Ruminococcus]_gnavus   [Ruminococcus]_gnavus_CAG:126__ Bacteriocin 524573  536961  Streptolysin_S  100%    BGC0000566
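As a minimal sketch of what I mean, here is the operation on the sample rows only, written as an inner join keyed on ID. The two toy data frames are hand-typed copies of the rows above, not the full files, and I am assuming that base R's merge() with by = "ID" is the right way to express this join.

```r
# Toy copies of the sample rows shown above (NOT the full input files).
input1 <- data.frame(
  ID = c("GCA_000006785.2_ASM678v2",
         "GCA_000169475.1_ASM16947v1",
         "GCA_000433555.1_MGS126"),
  BGC_Class = "Bacteriocin",
  Start = c(593677, 633235, 524573),
  End = c(606065, 645623, 536961),
  BGC_Name = "Streptolysin_S",
  Similarity = "100%",
  MIBiG = "BGC0000566",
  stringsAsFactors = FALSE
)
input2 <- data.frame(
  ID = c("GCA_000169475.1_ASM16947v1", "GCA_000433555.1_MGS126"),
  Species_name = "[Ruminococcus]_gnavus",
  Strain_name = c("[Ruminococcus]_gnavus_ATCC_29149_strain=ATCC_29149_",
                  "[Ruminococcus]_gnavus_CAG:126__"),
  stringsAsFactors = FALSE
)

# Inner join: keep only the IDs present in both data frames.
# Putting input2 first makes its columns come right after ID.
results <- merge(input2, input1, by = "ID")
print(results)
```

The key arguments that merge() documents are by, by.x and by.y. When the two inputs share no column name and no key is given, merge() falls back to a Cartesian product of all rows, which grows very quickly on large inputs.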
For that, I have tried in R:
results<-data.frame(merge(input1,input2$ID, by.input1 = "input1$ID", by.input2 = "input2$ID"))
and also:
results <- match(input1$ID, input2$ID)
But I am getting the same error in both cases:
Error: vector memory exhausted (limit reached?)
I am wondering if there is a memory-efficient way of doing this in R?
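One lower-memory idea I can imagine, sketched on trimmed toy data, is to treat the result of match() as a row lookup instead of as the final answer, since match() returns positions rather than a data frame. This assumes that every ID occurs at most once in input2 (match() only reports the first hit), and I have not verified it on the full files.

```r
# Trimmed toy data: only a few columns of the sample rows (NOT the full files).
input1 <- data.frame(
  ID = c("GCA_000006785.2_ASM678v2",
         "GCA_000169475.1_ASM16947v1",
         "GCA_000433555.1_MGS126"),
  Start = c(593677, 633235, 524573),
  End = c(606065, 645623, 536961),
  stringsAsFactors = FALSE
)
input2 <- data.frame(
  ID = c("GCA_000169475.1_ASM16947v1", "GCA_000433555.1_MGS126"),
  Species_name = "[Ruminococcus]_gnavus",
  stringsAsFactors = FALSE
)

# For each row of input1: position of its ID in input2, or NA if absent.
idx <- match(input1$ID, input2$ID)
keep <- !is.na(idx)

# Bind the matching input2 rows to the kept input1 rows, dropping the
# duplicate ID column from input1. Row order follows input1.
results <- cbind(
  input2[idx[keep], , drop = FALSE],
  input1[keep, setdiff(names(input1), "ID"), drop = FALSE]
)
rownames(results) <- NULL
print(results)
```

If an ID can appear several times in input2, this lookup would silently keep only the first match, so a real join would be needed in that case.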
If not, can it be done with awk/sed scripts on these large files? All comments are appreciated. Thank you.
NB: The original input files are here: https://sites.google.com/site/iicbbioinformatics/share
 
     
    