I'm working with data from many different sources, so I'm creating a name bridge and a function to make it easier to join tables. One of the sources uses an umlaut in a value and (I think) the CSV exported from Excel isn't UTF-8 encoded, so I'm getting strange results.
Since I can't control how the other source compiles their data, I'd like to make a universal function that fixes all the weird encoding rules. I'll use Dennis Schröder as an example name.
One particular source uses the umlaut, and when I read it in with read.csv and view the table in RStudio, it shows up as Dennis Schr<f6>der. However, if I index the table to his value (table[i,j]), the console prints Dennis Schr\xf6der.
So in my name-bridge CSV, I added a row to map Dennis Schr\xf6der to Dennis Schroder. I read this name bridge in (with the argument allowEscapes = TRUE), and he shows up exactly the same in my name-bridge table. Great! I should be able to left_join this to the other source to change the name to just Dennis Schroder.
But unfortunately the names still don't map unless I don't trim the strings (I have to trim strings in general because other sources introduce whitespace). Here's the general function I use to fix names. dataframe is the other source's table, VarUse is the name column from dataframe that I want to fix, and correctionTable is my name bridge.
library(dplyr)
library(stringr)

nameUpdate <- dataframe %>%
  mutate(name = str_trim(VarUse, 'both')) %>%
  left_join(correctionTable, by = c('name' = 'WrongName'))
When I dig into the results of this mapping, I get the following:
- correctionTable[14,1] is my name-bridge input of "Dennis Schr\xf6der".
- nameUpdate[29,3] is the original name variable from the other source which reads "Dennis Schr\xf6der".
- nameUpdate[29,19] is the mutated name variable from the other source after using str_trim, which also reads "Dennis Schr\xf6der".
However, for some reason the str_trim version is not equal to the name-bridge value, so it won't map.
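One way to check whether two strings that print identically really are identical is to compare their declared encodings and their raw bytes. This is only a sketch: bridge_value and trimmed_value are hypothetical stand-ins for the two cells, and the idea that the trimmed value ends up re-encoded as UTF-8 is an assumption I haven't confirmed against my actual tables.

```r
# Hypothetical stand-ins for correctionTable[14,1] and nameUpdate[29,19].
# bridge_value holds the single byte 0xf6 with no declared encoding.
bridge_value <- "Dennis Schr\xf6der"

# Assumption: the other value is the same name re-encoded as UTF-8.
trimmed_value <- iconv(bridge_value, from = "latin1", to = "UTF-8")

# Declared encoding of each string.
Encoding(c(bridge_value, trimmed_value))

# Underlying bytes: one byte (f6) for the umlaut versus two (c3 b6).
charToRaw(bridge_value)
charToRaw(trimmed_value)

# FALSE here means the strings differ at the byte level, even if they print alike.
identical(charToRaw(bridge_value), charToRaw(trimmed_value))
```

Running the same Encoding() and charToRaw() calls on the real cells should show whether str_trim changed the bytes or only the encoding mark.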
In writing up this (non-reproducible, sorry) example, I figured out a workaround that combines a str_trim pass with an untrimmed one, but at this point I'm just confused about why the name doesn't get fixed after I use str_trim. The values look exactly the same.
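For reference, here is a minimal sketch of the kind of universal cleanup I have in mind: convert both the source names and the name-bridge keys to UTF-8 before trimming, so both sides of the join go through the same steps. It assumes the problem source is Latin-1 encoded, which I haven't verified. normalize_name and the sample data are hypothetical, and I use base R's match here instead of left_join only to keep the example self-contained.

```r
# Hypothetical helper: re-encode to UTF-8 first, then trim whitespace.
# Assumes the input bytes are Latin-1; a different source encoding would
# need a different `from` value.
normalize_name <- function(x, from = "latin1") {
  x <- iconv(x, from = from, to = "UTF-8")
  trimws(x, which = "both")
}

# Made-up source names, one with stray whitespace and a raw 0xf6 byte.
source_names <- c(" Dennis Schr\xf6der ", "Some Player")

# Made-up name bridge, with its key normalized the same way.
bridge <- data.frame(
  WrongName = normalize_name("Dennis Schr\xf6der"),
  RightName = "Dennis Schroder",
  stringsAsFactors = FALSE
)

cleaned <- normalize_name(source_names)

# Look up each cleaned name in the bridge; keep the cleaned name if no match.
idx <- match(cleaned, bridge$WrongName)
fixed <- ifelse(is.na(idx), cleaned, bridge$RightName[idx])
fixed
```

The same helper could be applied to both key columns inside mutate before the left_join, so neither side depends on how str_trim handles the original bytes.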
