I have a data frame of around 2.5 million records and a dictionary of around 10,000 records. For every row of the data frame I need to look up the matching list in the dictionary, compare it with the row's own list column, and take the first two values from the dictionary list that are not already present in the row's list.
Here is some dummy data.
# originally this dictionary has around 10,000 records
temp = {
  0: {
    'AT_1': ['AN_1_2', 'AN_1_0', 'AN_1_1', 'AN_1_3'],
    'AT_2': ['AN_2_1'],
    'AT_3': ['AN_3_0', 'AN_3_1']
  },
  1: {
    'AT_1': ['AN_1_1', 'AN_1_0'],
    'AT_2': ['AN_2_0', 'AN_2_1'],
    'AT_3': ['AN_3_1']
  }
}
# originally this dataframe has around 2.5 million records
Name  Group Points         Subpoints  Additional_Points
   x    0.0   AT_1          [AN_1_0]                NaN
   y    1.0   AT_2  [AN_2_0, AN_2_1]                NaN
   z    1.0   AT_1          [AN_1_1]                NaN
   w    0.0   AT_1               NaN                NaN
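For reproducibility, the sample frame can be built like this (a minimal sketch; I am assuming 'Subpoints' holds Python lists, with NaN marking a missing list):

import numpy as np
import pandas as pd

people = pd.DataFrame({
    'Name': ['x', 'y', 'z', 'w'],
    'Group': [0.0, 1.0, 1.0, 0.0],
    'Points': ['AT_1', 'AT_2', 'AT_1', 'AT_1'],
    # lists of subpoints; NaN where the row has no list
    'Subpoints': [['AN_1_0'], ['AN_2_0', 'AN_2_1'], ['AN_1_1'], np.nan],
    'Additional_Points': np.nan,
})

# note: temp's keys are ints (0, 1) while 'Group' is float (0.0, 1.0);
# the dictionary look-ups still work because 0 == 0.0 and they hash equally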
And the output should look like this:
# for each row, look up temp[Group][Points] and take the first two values that are not already present in 'Subpoints'
Name  Group Points         Subpoints Additional_Points
   x    0.0   AT_1          [AN_1_0]  [AN_1_2, AN_1_1]
   y    1.0   AT_2  [AN_2_0, AN_2_1]                []
   z    1.0   AT_1          [AN_1_1]          [AN_1_0]
   w    0.0   AT_1               NaN               NaN
Below is the code I am currently using. It works correctly, but it is very slow for 2.5 million records.
people['Additional_Points'] = people.apply(
    lambda x: [i for i in temp[x['Group']][x['Points']] if i not in x['Subpoints']][:2]
    if x.notna()['Subpoints'] else np.nan,
    axis=1
)
Kindly let me know if there is any way to make this whole operation faster.
