I have a database containing people's names, but the names are encoded in three different formats, namely UTF-8, Latin-1 and ASCII. I imported the data with data.table's fread function, trying both the encoding = "UTF-8" and encoding = "Latin-1" options. However, the observations in ASCII are still displayed incorrectly. Here is an example of what the first few names look like:
                        NAMES         
  1:               NICOL<c0>S
  2:                CAR<d1>PS
  3:                 MU<d1>OZ
  4:                CATA<d1>O
I want to filter these observations. I tried filtering the data using something like:
data[grepl("\\<c0\\>", NAMES)]
However, this does not work: grepl("\\<c0\\>", NAMES) returns FALSE for every row. The only way I managed to get the desired match was:
data[grepl("À", data$NAMES, useBytes = TRUE)]
While I understand what is going on (for the most part), I don't understand why I have to put the literal character "À" inside grepl() instead of the text R displays, namely the <c0> in "NICOL<c0>S".
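To make the question concrete, here is a minimal sketch of what the displayed <c0> appears to be. It assumes the name holds the single byte 0xC0 (À in Latin-1) and that the string has been marked as UTF-8; the variable names are made up for illustration and the bytes are built by hand rather than read from the real file.

```r
# Assumption: "NICOL<c0>S" is really the bytes N I C O L 0xC0 S,
# i.e. one raw byte, not the five characters "<", "c", "0", ">" etc.
s <- rawToChar(c(charToRaw("NICOL"), as.raw(0xc0), charToRaw("S")))
Encoding(s) <- "UTF-8"  # mark it the way the import is assumed to have done

# Inspect the underlying bytes: seven bytes, the sixth being c0
bytes <- charToRaw(s)
print(bytes)

# The string contains no literal "<", so a pattern spelling out "<c0>" cannot match
has_angle <- grepl("<", s, fixed = TRUE, useBytes = TRUE)
print(has_angle)
```

If that assumption holds, <c0> would only be R's way of printing a byte it cannot interpret in the declared encoding, which would explain why a pattern written as "<c0>" never matches.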
Matching on the literal character is a problem because it does not always work, particularly with "Ñ". In the sample data above, rows 2 to 4 contain "Ñ" but are displayed as <d1>, and substituting Ñ into grepl() does not work. That is, the following code does not return the expected rows:
data[grepl("Ñ", data$NAMES, useBytes = TRUE)]
It would also be nice to know whether there is a better or more efficient way to do this. Thanks!
Reproducibility
To obtain these results, you can download a small sample here. Please be sure to import the data using fread("stack.csv", encoding = "UTF-8") to obtain the same results. This is necessary because the file only contains NAMES in ASCII format, so without encoding = "UTF-8" the data is imported differently: the grave accents are actually shown. For some reason, however, this does not carry over to the whole data set. Sorry I couldn't provide this through console commands; for some reason, doing so doesn't yield a replicable example.
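For readers without the file, the following is a rough console-only approximation of the sample. It rests on an assumption that is not confirmed for the real data: that the problematic names are Latin-1 bytes (0xC0 for À, 0xD1 for Ñ) which end up marked as UTF-8 on import. make_name is a hypothetical helper written just for this sketch, and base R is used instead of data.table so it runs without extra packages.

```r
# Hypothetical helper: ASCII prefix + one raw byte + ASCII suffix
make_name <- function(prefix, byte, suffix) {
  rawToChar(c(charToRaw(prefix), as.raw(byte), charToRaw(suffix)))
}

# Assumed byte content of the four sample names
names_raw <- c(
  make_name("NICOL", 0xc0, "S"),
  make_name("CAR",   0xd1, "PS"),
  make_name("MU",    0xd1, "OZ"),
  make_name("CATA",  0xd1, "O")
)
Encoding(names_raw) <- "UTF-8"  # mimic the mark applied by encoding = "UTF-8"

# None of these strings is valid UTF-8 under this assumption
is_valid <- validUTF8(names_raw)

# Match the raw byte 0xD1 directly instead of typing "Ñ"
enye_byte <- rawToChar(as.raw(0xd1))
has_enye  <- grepl(enye_byte, names_raw, fixed = TRUE, useBytes = TRUE)

# Reinterpret the bytes as Latin-1 and convert them to real UTF-8
names_fixed <- iconv(names_raw, from = "latin1", to = "UTF-8")
print(names_fixed)
```

Under the same assumption, validUTF8() could also serve as the filter itself, for example data[!validUTF8(NAMES)], to pick out the rows whose names are not valid UTF-8.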
Edit 1: added Reproducibility section.
