I have some data which looks like:
> df1
match1 match2 someData
1 AAPL Apple 100
2 GOOG Google Inc 105
3 MSFT Microsoft Inc 103
4 <NA> Nvidia 99
5 <NA> Amazon 101
6 INTC Intel 98
> df2
toMatch1 toMatch2 someOtherData
1 AAPL Apple 1000
2 INTC Intel 500
3 MSFT Microsoft 750
4 GOOG Google 1250
5 MMM 3M 1500
6 ATVI Activision 1000
7 AMZN Amazon 1750
8 NVDA Nvidia 2000
9 TSLA Tesla 500
10 FB Facebook 250
I want to join them together using an if statement. In the df1 data, under the column match1 there are NA values for two observations. I would like to create the if statement such that:
if column exists in
match1joinmatch1withtoMatch1using an "exact match"left_joinelse join using
match2with columntoMatch2using a "fuzzy match"stringdist_join
I can join using:
left_join(df1, df2, by = c("match1" = "toMatch1"))
stringdist_join(df1, df2, by = c("match2" = "toMatch2"))
However when I use the second approach I obtain:
> stringdist_join(df1, df2, by = c("match2" = "toMatch2"))
# A tibble: 4 x 6
match1 match2 someData toMatch1 toMatch2 someOtherData
<fct> <fct> <dbl> <fct> <fct> <dbl>
1 AAPL Apple 100 AAPL Apple 1000
2 NA Nvidia 99 NVDA Nvidia 2000
3 NA Amazon 101 AMZN Amazon 1750
4 INTC Intel 98 INTC Intel 500
But Google Inc in df1 should have been joined up with Google in df2 and it hasn't. The same for Microsoft Inc and Microsoft. It seems that the stringdist_join function is only looking for "exact matches" also...
My question is, I want to create an if statement to first look at the match1 and toMatch1 columns and if there are values in these columns join the data on an exact match. Otherwise look at the match2 and toMatch2 column and try to join the data using fuzzyjoin.
Data:
library(fuzzyjoin)
library(dplyr)
df1 <- data.frame(
match1 = c("AAPL", "GOOG", "MSFT", NA, NA, "INTC"),
match2 = c("Apple", "Google Inc", "Microsoft Inc", "Nvidia", "Amazon", "Intel"),
someData = c(100, 105, 103, 99, 101, 98)
)
df2 <- data.frame(
toMatch1 = c("AAPL", "INTC", "MSFT", "GOOG", "MMM", "ATVI", "AMZN", "NVDA", "TSLA", "FB"),
toMatch2 = c("Apple", "Intel", "Microsoft", "Google", "3M", "Activision", "Amazon", "Nvidia", "Tesla", "Facebook"),
someOtherData = c(1000, 500, 750, 1250, 1500, 1000, 1750, 2000, 500, 250)
)