I have a set of people in a dataframe, and I need the list of people that do not occur in the main dataset. Currently I am checking against first and last name.
`data_to_check_dataset` is the input data that needs to be checked. It contains many columns, but currently I only need to check against `first_name` and `last_name`.

|  | first_name | last_name | ... |
|---|---|---|---|
| 0 | James | Apple | ... |
| 1 | Alice | test | ... |
| ... | ... | ... | ... |
| 10000 | Paul | test | ... |

Sometimes a data field can be entirely blank, in which case its values are read as NaN.

|  | first_name | last_name | ... |
|---|---|---|---|
| 0 | James comp | nan | ... |
| 1 | Paul ltd | nan | ... |
| ... | ... | ... | ... |
| 10000 | Paul other | nan | ... |

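
For illustration, here is a minimal sketch of how an entirely blank column can end up with a float dtype. It assumes the data is loaded with `pd.read_csv` (how the real data is loaded is not stated here), and the CSV text is made up for the example:

```python
import io

import pandas as pd

# Made-up CSV where last_name is blank on every row (an assumption for illustration).
csv_text = "first_name,last_name\nJames comp,\nPaul ltd,\nPaul other,\n"

df = pd.read_csv(io.StringIO(csv_text))

# With no non-missing values to infer a type from, the blank column is read
# as all-NaN, which pandas stores with a float64 dtype.
print(df.dtypes)
print(df["last_name"].isna().all())
```

If the same column holds strings in the other dataframe, the two key columns no longer share a dtype.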
The dataframe I am checking against is `current_people_dataset`. It contains many columns, and I have renamed the name columns to `first_name` and `last_name`. Its null values are blank for some reason, I think because

|  | first_name | last_name | ... |
|---|---|---|---|
| 0 | f_A | l_A | ... |
| 1 | B |  | ... |
| ... | ... | ... | ... |
| 900000 | paul | smith | ... |

The `data_to_check_dataset` is always smaller than the `current_people_dataset`.
Column ordering is not fixed and can change depending on where the data is loaded in from.
Currently I have been trying to adapt the code from here.

    new_people_names = (
        pd.merge(data_to_check_dataset, current_people_dataset, indicator=True, how='outer')
        .query('_merge == "left_only"')
        .drop('_merge', axis=1)
    )

This raises `ValueError: You are trying to merge on float64 and object columns. If you wish to proceed you should use pd.concat` when comparing columns.
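
One possible direction, sketched under assumptions rather than offered as a confirmed fix: restrict the merge to the two name columns (selected by name, so column order does not matter) and convert both sides to strings with blanks for missing values before merging. The helper `normalise_names` and the small example frames are hypothetical and made up for illustration. The sketch uses `how='left'` instead of `how='outer'`, and matching stays case-sensitive.

```python
import pandas as pd

KEY_COLS = ["first_name", "last_name"]


def normalise_names(df, cols):
    """Hypothetical helper: return only the key columns as comparable strings."""
    out = df[cols].copy()
    for col in cols:
        # Cast to pandas' string dtype, then turn missing values into blanks
        # so NaN on one side can match an empty string on the other.
        out[col] = out[col].astype("string").fillna("").str.strip()
    return out


# Made-up stand-ins for the real dataframes.
data_to_check_dataset = pd.DataFrame({
    "first_name": ["James comp", "paul", "Alice"],
    "last_name": [float("nan"), "smith", "test"],
    "other": [1, 2, 3],
})
current_people_dataset = pd.DataFrame({
    "last_name": ["smith", "", "l_A"],  # different column order on purpose
    "first_name": ["paul", "B", "f_A"],
})

left_keys = normalise_names(data_to_check_dataset, KEY_COLS)
# Drop duplicate keys so the left merge keeps exactly one row per input row.
right_keys = normalise_names(current_people_dataset, KEY_COLS).drop_duplicates()

merged = left_keys.merge(right_keys, on=KEY_COLS, how="left", indicator=True)

# Rows flagged left_only have no matching name pair in current_people_dataset.
is_new = (merged["_merge"] == "left_only").to_numpy()
new_people_names = data_to_check_dataset[is_new]
print(new_people_names)
```

Because the keys are normalised on copies, the rows returned in `new_people_names` keep their original values and all other columns.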