My data contains the names of persons and a list of cities they lived in. I want to group them together following these conditions:
- 
- first_nameand- last_nameare identical
 
- 
- or (if 1. doesn't hold) their last_nameare the same and they have lived in at least one identicalcity.
 
- or (if 1. doesn't hold) their 
The result should be a new column indicating the group id that each person belongs to.
The DataFrame df looks like this:
>>> df
      person_id  last_name first_name                     cities
    0       112     Dorsey      Nancy         [Moscow, New York]
    1       113     Harper        Max  [Munich, Paris, Shanghai]
    2       114    Mueller        Max    [New York, Los Angeles]
    3       115     Dorsey      Nancy          [New York, Miami]
    4       116     Harper    Maxwell            [Munich, Miami]
The new dataframe df_id should look like this. The order of id is irrelevant (i.e., which group gets id=1), but only observations that fulfill either condition 1 or 2 should get the same id.
>>> df_id
      person_id  last_name first_name                     cities  id
    0       112     Dorsey      Nancy         [Moscow, New York]   1
    1       113     Harper        Max  [Munich, Paris, Shanghai]   2
    2       114    Mueller        Max    [New York, Los Angeles]   3
    3       115     Dorsey      Nancy          [New York, Miami]   1
    4       116     Harper    Maxwell            [Munich, Miami]   2
My current code:
df= df.reset_index(drop=True)
#explode lists to rows
df_exploded = df.explode('cities')
# define id_counter and dictionary to person_id to id
id_counter = 1
id_matched = dict()
# define id function
def match_id(df):
  global id_counter
  # check if person_id already matched
  if df['person_id'] not in id_matched.keys():
  # get all persons with similar names (condition 1)
    select = df_expanded[(df_expanded['first_name']==df['first_name']) & df_expanded['last_name']==df['last_name'])]
    # get all persons with same last_name and city (condition 2)
    if select.empty:
      select_2 = df_expanded[(df_expanded['last_name']==df['last_name']) & (df_expanded['cities'] in df['cities'])]
      # create new id for this specific person
      if select_2.empty:
        id_matched[df['person_id']] = id_counter
      # create new id for group of person and record in dictionary
      else:
        select_list = select_2.unique().tolist()
        select_list.append(df['person_id'])
        for i in select_list:
          id_matched[i] = id_counter 
    # create new id for group of person and record in dictionary
    else:
      select_list = select.unique().tolist()
      select_list.append(df['person_id'])
      for i in select_list:
        id_matched[i] = id_counter
    # set next id
    id_counter += 1
# run function
df = df.progress_apply(match_id, axis=1)
# convert dict to DataFrame
df_id_matched = pd.DataFrame.from_dict(id_matched, orient='index', columns['id'])
                            .rename_axis('person_id').reset_index()
# merge back together with df to create df_id
Does anyone have a more efficient way to perform this task? The data set is huge and it would take several days...
Thanks in advance!
 
    