There are several posts about encoding problems with HtmlAgilityPack, but none of them addresses this issue.
Because the website I am trying to parse contains non-ASCII characters such as €, ä and ü, I tried setting the encoding to Unicode:
using System.Text;
using HtmlAgilityPack;

public class WebpageDeserializer
{
    public WebpageDeserializer() {}

    /*
     * Example address: https://www.dslr-forum.de/showthread.php?t=1930368
     */
    public static void Deserialize(string address)
    {
        var web = new HtmlWeb();
        // Force the response to be decoded as Unicode (UTF-16 in .NET)
        web.OverrideEncoding = Encoding.Unicode;
        var htmlDoc = web.Load(address);
        // Further processing fails: the decoded text is not valid HTML (it looks more like Chinese characters)
    }
}
But now
htmlDoc.DocumentNode.InnerHtml
looks like this:
ℼ佄呃偙⁅瑨汭倠䉕䥌⁃ⴢ⼯㍗⽃䐯䑔堠呈䱍ㄠ〮吠慲獮瑩潩慮⽬䔯≎...
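As far as I can tell, this garbage is what you get when single-byte text is decoded as UTF-16LE, which is what Encoding.Unicode means in .NET: every two bytes are fused into one character. Here is a minimal sketch that reproduces the first few characters. It assumes the page starts with the bytes of `<!DOCTYPE html` (I have not inspected the raw response), and the MojibakeDemo class and MisdecodeAsUtf16 method are names I made up for this illustration:

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Hypothetical helper: takes ASCII text, gets its single-byte representation,
    // and then (wrongly) decodes those bytes as UTF-16LE, so each byte pair
    // becomes one UTF-16 code unit.
    public static string MisdecodeAsUtf16(string asciiText)
    {
        byte[] raw = Encoding.ASCII.GetBytes(asciiText);
        return Encoding.Unicode.GetString(raw);
    }

    public static void Main()
    {
        // Make sure the console can display the resulting characters.
        Console.OutputEncoding = Encoding.UTF8;

        // Assumed start of the page; prints ℼ佄呃偙⁅瑨汭
        Console.WriteLine(MisdecodeAsUtf16("<!DOCTYPE html"));
    }
}
```

The result matches the beginning of the InnerHtml output above, so the server is apparently not sending UTF-16.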
If I use UTF-8 or ISO-8859-1 instead, the € symbol (as well as ä, ö and ü) is replaced by �. How can I fix this?