I have two tables containing addresses (street, city, zipcode and two fields containing concatenated values of these), I would like to do fuzzy matching on Zipcode, but only for those cases which have exact same StrCity value. I have started with first selecting only addresses that are matching the StrCity from dictionary and then fuzzy matching, but there are two problems:
1) if matched by Zipcode, it doesn't take into account the street and city 2) if matched by Address (containing all of Zipcode, Street and City), it returns also possible values, where on the same zipcode there is another street that is close enough in terms of distance.
Probably I need something like doing two different matches at the same time (one fuzzy and one exact), but I am not sure how to implement it, while not killing my computer performance-wise.
Here is data sample of TableAd:
StrCity              ID      Zipcode Street       City     Address
BiałowiejskaWarszawa 5148676 01-459  Białowiejska Warszawa 01-459BiałowiejskaWarszawa
BukowińskaWarszawa   6423687 02-730  Bukowińska   Warszawa 02-730BukowińskaWarszawa
KanałowaWarszawa     6425093 03-536  Kanałowa     Warszawa 03-536KanałowaWarszawa
And the dictionary sample:
Zipcode Street   City     Address                StrCity
02-882  Agaty    Warszawa 02-882AgatyWarszawa    AgatyWarszawa
03-663  Kanałowa Warszawa 03-663KanałowaWarszawa KanałowaWarszawa
03-536  Kołowa   Warszawa 03-536KołowaWarszawa   KołowaWarszawa
Here is my current code:
TableMatch <- merge(TableAd, TableDict, by="StrCity")
TableMatch <- TableMatch[, -grep("y", colnames(TableMatch))]
names(TableMatch)[names(TableMatch)=="Zipcode.x"] <- "Zipcode"
names(TableMatch)[names(TableMatch)=="Address.x"] <- "Address"
ResultTable <- TableMatch %>% 
  stringdist_left_join(TableDict, by="Address", distance_col="dist", method="lv", max_dist=5, ignore_case = TRUE) %>%
  select(ID, Zipcode.x, Address.x, Address.y, dist) %>% 
  group_by(Address.x) %>% 
  # select best fit record
  top_n(-1, dist)
The problem I found specifically with an example provided above - the script verifies that strCity KanałowaWarszawa is present in dictionary, but Levenshtein distance of combined Address string is the same when changing the zipcode as when changing the street to Kołowa, which has the same zipcode as the one inspected. Here it returns both changes, but if there would be just 2 or 1 digits difference in zipcode, then it might incorrectly suggest replacing the street while zipcode should be changed.
Note: I am using packages purrr, dplyr and fuzzyjoin.