When I load HTML into a DOMDocument, non-ASCII characters come out garbled.
In my project the HTML source is supplied by the user, so its content may vary greatly.
I'd like to find a secure way of parsing HTML content from various sources.
By secure I mainly mean:
- keeping strings consistent with the original
- being protected against invalid-encoding attacks
unless you think I should have additional concerns.
nodeValue gives the same result as textContent in this case.
I created this simplified function to clarify the issue:
<?php
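function print_content($html)
{
    // Parse the user-supplied markup exactly as received
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    // Extract the text of the element under test
    $div = $dom->getElementById('cyrillic_bit');
    $content = $div->textContent;
    // First line: internal encoding followed by the original markup
    print(mb_internal_encoding().' '.$html."\n");
    // Second line: detected encoding of the extracted text, then the text itself
    print(mb_detect_encoding($content, 'Windows-1251, UTF-8', true)." ");
    print($content."\n");
}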
$html = '<div id="cyrillic_bit">Дядо Коледа<br>Error</div>';
print_content($html);
?>
The output is:
UTF-8 <div id="cyrillic_bit">Дядо Коледа<br>Error</div>
UTF-8 ÐÑдо ÐоледаError
I'd like it to be:
UTF-8 <div id="cyrillic_bit">Дядо Коледа<br>Error</div>
UTF-8 Дядо КоледаError
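
To make the two requirements above concrete, here is a minimal sketch of the behaviour I am after. It rests on assumptions that are not part of my snippet above: the input is meant to be UTF-8, text_by_id is a hypothetical helper name, and converting non-ASCII characters to numeric entities before parsing is only one possible workaround, not necessarily the right one. It would also rewrite characters inside script or style elements, which may not be wanted.

```php
<?php
// Hypothetical helper: returns the text of the element with the given id,
// or null if the input is not valid UTF-8 or the element is missing.
function text_by_id(string $html, string $id): ?string
{
    // Reject invalid byte sequences up front instead of guessing an encoding.
    if (!mb_check_encoding($html, 'UTF-8')) {
        return null;
    }

    // Turn every non-ASCII character into a numeric entity (e.g. &#1044;),
    // so the parser only ever sees ASCII and cannot misread the charset.
    $ascii = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    // Collect parser warnings for sloppy user markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $dom->loadHTML($ascii);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $node = $dom->getElementById($id);
    return $node === null ? null : $node->textContent;
}

$html = '<div id="cyrillic_bit">Дядо Коледа<br>Error</div>';
print(text_by_id($html, 'cyrillic_bit') . "\n");
```

The explicit mb_check_encoding() call is there for the invalid-encoding concern: malformed input is refused rather than silently repaired.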
