Scenario
I have a number of enterprise datasets that I must find missing links between, and one of the ways I use for finding potential matches is joining on first and last name. The complication is that we have a significant number of people who use their legal name in one dataset (employee records), but they use either a nickname or (worse yet) their middle name in others (i.e., EAD, training, PIV card, etc.). I am looking for a way to match up these potentially disparate names across the various datasets.
Simplified Example
Here is an overly simplified example of what I am trying to do, but I think it conveys my thought process. I begin with the employee table:
Employees table
| employee_id | first_name | last_name |
|---|---|---|
| 052451 | Robert | Armsden |
| 442896 | Jacob | Craxford |
| 054149 | Grant | Keeting |
| 025747 | Gabrielle | Renton |
| 071238 | Margaret | Seifenmacher |
and try to find the matching data from the PIV card dataset:
Cards table
| card_id | first_name | last_name |
|---|---|---|
| 1008571527 | Bobbie | Armsden |
| 1009599982 | Jake | Craxford |
| 1004786477 | Gabi | Renton |
| 1000628540 | Maggy | Seifenmacher |
Desired Result
After trying to match these datasets on first name and last name, I would like to end up with the following:
Employees_Cards table
| emp_employee_id | emp_first_name | emp_last_name | crd_card_id | crd_first_name | crd_last_name |
|---|---|---|---|---|---|
| 052451 | Robert | Armsden | 1008571527 | Bobbie | Armsden |
| 442896 | Jacob | Craxford | 1009599982 | Jake | Craxford |
| 054149 | Grant | Keeting | NULL | NULL | NULL |
| 025747 | Gabrielle | Renton | 1004786477 | Gabi | Renton |
| 071238 | Margaret | Seifenmacher | 1000628540 | Maggy | Seifenmacher |
As you can see, I would like to make the following matches:
Gabrielle -> Gabi
Jacob -> Jacob
Margaret -> Maggy
Robert -> Bobbie
My initial thought was to find a common names dataset along the lines of:
Name_Aliases table
| name1 | name2 | name3 | name4 |
|---|---|---|---|
| Gabrielle | Gabi | NULL | NULL |
| Jacob | Jake | NULL | NULL |
| Margaret | Maggy | Maggie | Meg |
| Michael | Mike | Mikey | Mick |
| Robert | Bobbie | Bob | Rob |
and use something like this for the JOIN:
CREATE TABLE employee_cards AS
SELECT
employees.employee_id AS emp_employee_ID,
employees.first_name AS emp_first_name,
employees.last_name AS emp_last_name,
cards.card_id AS crd_card_id,
cards.first_name AS crd_first_name,
cards.last_name AS crd_last_name
FROM employees
LEFT OUTER JOIN name_aliases
LEFT OUTER JOIN cards
ON employees.first_name IN (
nane_aliases.name1,
nane_aliases.name2,
nane_aliases.name3,
nane_aliases.name4
)
AND employees.last_name = cards.last_name;
This is where I got stuck, since I could not figure out how to tie the result of the first ON condition to the first names in the cards table.
Some Questions
As I ponder more deeply into this problem, I know that I am not the first person who has encountered this need for matching on common name variants. My initial search pointed me to things like fuzzysearch and soundex, but those are not quite what I need for the present scenario (although they may come in handy down the road). Given this, I have several questions for the community:
Downloadable Common Names Variants Dataset?
Has someone compiled or crowd-sourced a comprehensive names variants dataset along the lines of my name-aliases table above? My search landed me on a couple of sites that appear to have this kind of data, but neither one of them is downloadable for import into my local database.
I did find this SO discussion over a decade old, but it did not appear to be current: Database of common name aliases / nicknames of people
Also, I am unable to pay any money for this, so I hope there might be one hiding on github.
Better Way to Structure the Name_Aliases table?
Since each record in the name_aliases table can have two or more entries, is there a better way to set up the structure for this table, to make it infinitely flexible?
Match Any Column in Name_Aliases Table?
How do I set up the JOIN query to match up employees.first_name with any column in name_alises, and then finally match that with cards.first_name?
Better Solution?
Am I taking the wrong approach to solve this problem? Has someone come up with a more flexible and elegant approach to this using PostgreSQL?