I have a SQL Server table with three columns:
Table1
col1 int
col2 int
col3 varchar
I have a unique constraint defined for all three columns (col1, col2, col3)
Now I have a .csv file from which I want to insert records into this table, and the .csv file may contain duplicate records.
I have looked at various options for avoiding duplicates in this scenario. Below are three options that are working well for me. Please take a look and share the pros/cons of each method so I can choose the best one.
Option #1:
Avoid duplicates in the first place, i.e. while adding objects to the collection from the .csv file. I have used a HashSet<T> for this and overridden the following methods on type T:
public override int GetHashCode()
{
    // Combine the three column values; unchecked prevents an overflow exception
    // if the project is compiled with checked arithmetic, and the null check
    // guards against a null col3.
    unchecked
    {
        return col1.GetHashCode() + col2.GetHashCode() + (col3 == null ? 0 : col3.GetHashCode());
    }
}
public override bool Equals(object obj)
{
    var other = obj as T;   // T here is the concrete item type that holds col1, col2, col3
    if (other == null)
    {
        return false;
    }
    return col1 == other.col1
        && col2 == other.col2
        && col3 == other.col3;
}
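For reference, this is roughly how the set gets filled while reading the file; the class name CsvRecord (my stand-in for T above), the hard-coded file name, and the assumption of a simple comma-separated layout with a header row and no quoted fields are all just illustrative:

    // Requires System.Collections.Generic, System.IO and System.Linq.
    // CsvRecord stands in for T above and carries the Equals/GetHashCode overrides.
    var uniqueRecords = new HashSet<CsvRecord>();
    foreach (var line in File.ReadLines("input.csv").Skip(1))   // Skip(1) drops the header row
    {
        var parts = line.Split(',');
        uniqueRecords.Add(new CsvRecord
        {
            col1 = int.Parse(parts[0]),
            col2 = int.Parse(parts[1]),
            col3 = parts[2]
        });
        // HashSet<T>.Add simply returns false for a duplicate, so nothing extra is needed here.
    }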
Option #2:
Using a List<T> instead of a HashSet<T> and removing the duplicates after all the objects have been added to the List<T>:
    List<T> distinctObjects = allObjects
        .GroupBy(x => new {x.col1, x.col2, x.col3})
        .Select(x => x.First()).ToList();
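(Side note: because T already overrides Equals/GetHashCode for Option #1, the same result can be had with a plain Distinct call, which picks those overrides up through EqualityComparer<T>.Default:)

    // Equivalent to the GroupBy above when T overrides Equals and GetHashCode
    List<T> distinctObjects = allObjects.Distinct().ToList();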
Option #3:
Removing duplicates after all the objects have been added to a DataTable:
public static DataTable RemoveDuplicatesRows(DataTable dataTable)
{
    // DataRowComparer.Default compares rows by the values of all their columns
    IEnumerable<DataRow> uniqueRows = dataTable.AsEnumerable().Distinct(DataRowComparer.Default);
    DataTable dataTable2 = uniqueRows.CopyToDataTable();
    return dataTable2;
}
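For completeness, this is roughly how the DataTable is built before calling the method above; the sample rows and the schema mirroring Table1 are only illustrative:

    // Requires System.Data (plus System.Data.DataSetExtensions for AsEnumerable above).
    var table = new DataTable("Table1");
    table.Columns.Add("col1", typeof(int));
    table.Columns.Add("col2", typeof(int));
    table.Columns.Add("col3", typeof(string));

    table.Rows.Add(1, 2, "abc");
    table.Rows.Add(1, 2, "abc");   // deliberate duplicate

    DataTable deduped = RemoveDuplicatesRows(table);   // deduped now holds a single row
    // Note: CopyToDataTable throws an InvalidOperationException if the source has no rows,
    // so an empty .csv needs to be handled before calling this.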
Although I have not compared their running times, I prefer Option #1 because it removes duplicates as the first step, so only the required records are carried forward.
Please share your views so I can choose the best one.
Thanks a lot!