I have a question about the fuzzyjoin package. I am very new to R, and I have read the README and worked through the examples at https://cran.r-project.org/web/packages/fuzzyjoin/index.html before asking.
I have a list of vernacular names that I want to match with plant species names. A simplified version of my data is shown below. data1 has a LocalName column containing many misspelled vernacular names. data2 is the reference table of correct local names and species that the matching should be based on.
data1 <- data.frame(Item=1:5, LocalName=c("BACTERIA F", "BAHIA", "BAIKEA", "BAIKIA", "BAIKIAEA SP")) 
data1
  Item   LocalName
1    1  BACTERIA F
2    2       BAHIA
3    3      BAIKEA
4    4      BAIKIA
5    5 BAIKIAEA SP
data2 <- data.frame(LocalName=c("ENGOKOM","BAHIA","BAIKIA","BANANIER","BALANITES"), Species=c("Barteria fistulosa","Mitragyna spp","Baikiaea spp", "Musa spp", "Balanites wilsoniana"))
data2
  LocalName              Species
1   ENGOKOM   Barteria fistulosa
2     BAHIA        Mitragyna spp
3    BAIKIA         Baikiaea spp
4  BANANIER             Musa spp
5 BALANITES Balanites wilsoniana
I tried the stringdist_left_join function, and it matched many species correctly. I am being conservative by setting max_dist=1 because many vernacular names in my list are very similar to one another.
library(dplyr)      # provides the %>% pipe
library(fuzzyjoin)
table <- data1 %>%
  stringdist_left_join(data2, by = c(LocalName = "LocalName"), max_dist = 1)
table
  Item LocalName.x LocalName.y       Species
1    1  BACTERIA F        <NA>          <NA>
2    2       BAHIA       BAHIA Mitragyna spp
3    3      BAIKEA      BAIKIA  Baikiaea spp
4    4      BAIKIA      BAIKIA  Baikiaea spp
5    5 BAIKIAEA SP        <NA>          <NA>
However, I have one question. As you can see in data1, Item 5 (BAIKIAEA SP) actually corresponds to the Species column of data2 rather than LocalName. I have many entries like this, where the LocalName in data1 is a typo of either a vernacular name or a species name, and I am not sure how to make stringdist_left_join match one column of data1 against two columns of data2. I tried modifying the code like this:
table <- data1%>%
stringdist_left_join(data2, by=c(LocalName="LocalName"|"Species"), max_dist=1)    
but it did not work and gave the error: Error in "LocalName" | "Species" : operations are possible only for numeric, logical or complex types. Does anyone know whether such matching is possible? Thanks in advance!
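To make the goal concrete, here is a rough sketch of the result I am after, written as a two-pass workaround. I do not know whether this is the intended way to do it. It assumes that `by` can map LocalName in data1 to a differently named column in data2, and that the `ignore_case` argument handles the upper/lower case difference between my names and the Species column. The names `by_local`, `unmatched`, `by_species` and `result` are only placeholders for illustration.

```r
library(dplyr)
library(fuzzyjoin)

data1 <- data.frame(
  Item = 1:5,
  LocalName = c("BACTERIA F", "BAHIA", "BAIKEA", "BAIKIA", "BAIKIAEA SP"),
  stringsAsFactors = FALSE
)
data2 <- data.frame(
  LocalName = c("ENGOKOM", "BAHIA", "BAIKIA", "BANANIER", "BALANITES"),
  Species = c("Barteria fistulosa", "Mitragyna spp", "Baikiaea spp",
              "Musa spp", "Balanites wilsoniana"),
  stringsAsFactors = FALSE
)

# Pass 1: keep only rows whose LocalName is within distance 1 of a
# vernacular name in data2.
by_local <- stringdist_inner_join(
  data1, data2,
  by = c(LocalName = "LocalName"),
  max_dist = 1
)

# Pass 2: take the rows that pass 1 could not match and compare them
# against the Species column instead, ignoring case.
unmatched <- anti_join(data1, by_local, by = "Item")
by_species <- stringdist_left_join(
  unmatched, data2,
  by = c(LocalName = "Species"),
  max_dist = 1,
  ignore_case = TRUE
)

# Stack both passes; rows unmatched in both passes keep NA.
result <- bind_rows(by_local, by_species) %>%
  arrange(Item)
result
```

With this, I would hope Item 5 picks up Baikiaea spp while Item 1 stays NA, but it takes two joins instead of one, which is why I am asking whether a single call can do it.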
 
    