I have been given an export from a MySQL database whose encoding seems to have been muddled over time. It contains a mix of HTML character entities such as &uuml; and more problematic character sequences representing the same letters, such as Ã¼ and Ã­. My task is to bring some consistency back to the file and get everything into the correct Latin characters, e.g. ú and ó.
An example of the sort of string I am dealing with is:
50 Tattoo DesinfektionslÃ¶sungstÃ¼cher fÃ¼r FlÃ¤chen
which should equate to:
50 Tattoo Desinfektionslösungstücher für Flächen
Is there a method available in C#/.NET 4.5 that would successfully re-encode the likes of Ã¼ and Ã­ to UTF-8?
Else what approach would be advisable?
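For reference, here is a minimal sketch of the round-trip idea I am considering. It assumes (and this is only an assumption, not something I have confirmed about the export) that the damaged sequences are UTF-8 bytes that were at some point decoded as Windows-1252. The class and method names are hypothetical.

```csharp
using System;
using System.Net;
using System.Text;

public static class MojibakeRepair
{
    // Assumption: damaged text is UTF-8 that was misread as Windows-1252.
    // Returns the input unchanged if it does not fit that pattern.
    public static string Repair(string input)
    {
        // Strict encodings: throw instead of silently substituting '?'.
        // Note: on .NET Core / .NET 5+ code page 1252 must first be registered
        // via CodePagesEncodingProvider; on .NET Framework 4.5 it is built in.
        Encoding cp1252 = Encoding.GetEncoding(
            1252, EncoderFallback.ExceptionFallback, DecoderFallback.ExceptionFallback);
        Encoding strictUtf8 = new UTF8Encoding(false, true);

        try
        {
            // Recover the original bytes, then decode them as UTF-8.
            byte[] bytes = cp1252.GetBytes(input);
            return strictUtf8.GetString(bytes);
        }
        catch (ArgumentException)
        {
            // EncoderFallbackException and DecoderFallbackException both derive
            // from ArgumentException: the text was not (purely) mojibake.
            return input;
        }
    }

    // Repairs the byte-level damage first, then resolves entities like &uuml;.
    // Doing it in this order keeps already-correct letters out of the round trip.
    public static string Normalize(string input)
    {
        return WebUtility.HtmlDecode(Repair(input));
    }
}
```

A string that mixes already-correct accented letters with damaged ones would fail the strict decode and come back unchanged, so this would probably have to be applied per field or per word rather than to the whole file at once.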
Also, is the paragraph character ¶ in the example string above an actual paragraph character, or part of some other character combination?
In case I need to resort to find and replace, I have created the lookup table below; however, I am unsure how complete it is.
Ã‰ -> É
â€œ -> "
â€ -> "
Ã‡ -> Ç
Ãƒ -> Ã
é, 'é
Ãº -> ú
â€¢ -> -
Ã˜ -> Ø
Ãµ -> õ
Ã­ -> í
Ã¢ -> â
Ã£ -> ã
Ãª -> ê
Ã¡ -> á
Ã© -> é
Ã³ -> ó
â€“ -> –
Ã§ -> ç
Âª -> ª
Âº -> º
Ã  -> à
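
One way I could check the table for completeness is to generate the damaged form of each character I expect to see, rather than collecting entries by hand. The following is only a sketch under the same assumption as above (UTF-8 bytes misread as Windows-1252); `MojibakeTable`, `ToMojibake` and `Build` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class MojibakeTable
{
    // Assumption: the corruption is UTF-8 bytes read as Windows-1252.
    // Produces the damaged form of a correctly encoded string.
    public static string ToMojibake(string correct)
    {
        byte[] utf8 = Encoding.UTF8.GetBytes(correct);
        return Encoding.GetEncoding(1252).GetString(utf8);
    }

    // Builds damaged -> correct pairs for each character in the sample.
    // Assumes characters in the Basic Multilingual Plane (no surrogate pairs).
    public static Dictionary<string, string> Build(string expectedCharacters)
    {
        var table = new Dictionary<string, string>();
        foreach (char c in expectedCharacters)
        {
            string correct = c.ToString();
            string damaged = ToMojibake(correct);

            // ASCII characters survive the round trip unchanged; skip them.
            if (damaged != correct)
            {
                table[damaged] = correct;
            }
        }
        return table;
    }
}
```

Calling `MojibakeTable.Build` with a string of all the accented letters and punctuation expected in the data would give a generated table to compare against the hand-made one.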