I have a loop that compares street addresses. It uses fuzzy matching to tokenise the addresses, compare them, and return how close the match is; I have tried this with both fuzzywuzzy and rapidfuzz. The aim is to take all of my street addresses (30k or so) and match each variation of a street address to a structured street address in my dataset.
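To show the kind of score I mean, here is a minimal example with two made-up spellings of the same address (`token_sort_ratio` is the scorer I use below):

```python
from rapidfuzz import fuzz

# token_sort_ratio tokenises the strings, sorts the tokens, and
# returns a similarity score between 0 and 100
score = fuzz.token_sort_ratio("12 High Street, Leeds", "High Street 12 Leeds")
print(score)
```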
The end result would be a reference table with two columns (a few made-up example rows follow the list):
- Column A is the reference address
- Column B is an address where the match is good enough for it to be associated with Column A
- Column A can have many associated addresses
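For illustration, with made-up addresses a few rows might look like this; note the same Column A address appearing more than once:

| Column A (reference address) | Column B (matched address) |
| --- | --- |
| 12 High Street, Leeds | 12 High St, Leeds |
| 12 High Street, Leeds | 12, High Street, LEEDS |
| 45 Church Lane, York | 45 Church Ln, York |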
I am not a huge Python user, but I do know that for loops are the last resort for most problems (third answer). With that in mind, I have used for loops anyway. However, my loops will take approximately 235 hours to run, which is sub-optimal to say the least. I have created a reproducible example below. Can anyone see where I can make any tweaks? I have added a progress bar to give you an idea of the speed. You can increase the number of addresses by changing `range(20)` in the generation step.
import pandas as pd
from tqdm import tqdm
from faker import Faker
from rapidfuzz import process, fuzz
# GENERATE FAKE ADDRESSES FOR THE REPRODUCIBLE EXAMPLE -----------------------------------------------
fake = Faker()
# Build the whole DataFrame in one go instead of concatenating one row at a time
fake_addresses = pd.DataFrame({'add': [fake.address() for _ in range(20)]})
# COMPARE ADDRESSES ---------------------------------------------------------------------------------
# Here we are making a "dictionary" of the addresses: the left side is the reference address and
# the right side holds all the different variations of that address. For this example the
# similarity threshold is 0% so every pair is kept; normally I use 95%.
THRESHOLD = 0

reference = fake_addresses['add'].drop_duplicates()
# This nested loop is the slow part. I have added tqdm to show how long it takes when
# the number of addresses is increased dramatically.
rows = []
for address in tqdm(reference):
    for raw_address in reference:
        result = fuzz.token_sort_ratio(address, raw_address)
        # Keep the pair only when the score clears the threshold
        if result >= THRESHOLD:
            rows.append({'reference_address': address,
                         'matched_address': raw_address,
                         'matched_result': result})

# Build the result table once at the end rather than concatenating inside the loop
ref_addresses = pd.DataFrame(rows)
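In case it helps frame an answer: I have been reading about rapidfuzz's `process.cdist`, which appears to score every pair in a single call and can use multiple cores. Below is a sketch of what I think it might look like in place of my nested loop; I may well be misusing the API, and at 30k addresses the score matrix would be around 900 MB even at one byte per score, so treat this as an assumption rather than a working solution:

```python
import numpy as np
import pandas as pd
from rapidfuzz import process, fuzz

THRESHOLD = 95  # the reproducible example above uses 0; normally I want 95

# Reuses fake_addresses from the example above
addresses = fake_addresses['add'].drop_duplicates().to_list()

# Score every address against every other address in one call.
# score_cutoff zeroes out any score below the threshold, and
# workers=-1 spreads the comparisons across all CPU cores.
scores = process.cdist(addresses, addresses,
                       scorer=fuzz.token_sort_ratio,
                       score_cutoff=THRESHOLD,
                       dtype=np.uint8,
                       workers=-1)

# Keep only the pairs that cleared the threshold (the non-zero scores)
ref_idx, match_idx = np.nonzero(scores)
ref_addresses = pd.DataFrame({
    'reference_address': np.take(addresses, ref_idx),
    'matched_address': np.take(addresses, match_idx),
    'matched_result': scores[ref_idx, match_idx],
})
```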