I have an XML document that is UTF-8 encoded with no BOM and uses named entities in the element text. I use PowerShell to read the file, swap the named entities for numeric ones (as I don't always have access to the DTD or XSD files) and post the modified XML to a REST endpoint (which uses xdmp:document-insert).
For documents with accented characters in attribute values, I get "Invalid UTF-8 escape sequence" reported in the log file. An XML fragment is below:
... in Brazil (<xref ref-type="bibr" rid="i0892-1016-51-1-72-BrazilMinistériodoMeioAmbienteMMAInstitutoChicoMendesdeConservaçãodaBiodiversidadeICMBio1">Brazil Ministério do Meio Ambiente, Instituto Chico Mendes de Conservação da Biodiversidade 2014</xref>). This species builds....
Apart from using PowerShell to swap these characters to their numeric entity form, is there any XQuery code to deal with this, or a setting in MarkLogic? The characters on this occasion are Western European and the attributes are not used in indexes.
MarkLogic 8.0-6.7, Windows 10, PowerShell 5.1
Addition: Over the weekend I had a look around. On the MarkLogic side I pulled the call to xdmp:get-request-body outside the try-catch, and the error confirms your (Mads) suspicion. I then looked at the PowerShell script: it reads the text content as UTF-8 (Encode a string in UTF-8) but was clearly posting the text out in the default character set (1252?).
function getBody ($FilePath)
{
    # Read the raw bytes and decode them explicitly as UTF-8
    $fileContentBinary = [System.IO.File]::ReadAllBytes($FilePath)
    $enc               = [System.Text.Encoding]::GetEncoding("UTF-8")
    $encodedContent    = $enc.GetString($fileContentBinary)
    # elementReplace is defined elsewhere in the script (not shown)
    $encodedContent    = elementReplace $encodedContent
    return $encodedContent
}
function sendXml ($MLHost, $LocalFilePath, $SUPPLIER_REF, $credentials, $xsltTRANSFORMLABEL)
{
    Add-Content $logfile -Value ('Posting file ' + $LocalFilePath + ' to ' + $MLHost + ' for supplier ' + $SUPPLIER_REF)
    $filename        = (Split-Path $LocalFilePath -Leaf)
    $EndpointAddress = 'http://{0}:######/nps3/article/upload/?supplier={1}&filename={2}&transform={3}' -f $MLHost, $SUPPLIER_REF, $filename, $xsltTRANSFORMLABEL
    $boundary        = [System.Guid]::NewGuid().ToString()
    $bodyText        = makeBody $LocalFilePath
    $contentType     = 'multipart/form-data; boundary={0}' -f $boundary
    try {
        Invoke-RestMethod -Uri $EndpointAddress -Method PUT -ContentType $contentType -Body $bodyText -Credential $credentials
        # All OK, so delete the local file
        if (Test-Path $LocalFilePath) {
            Remove-Item $LocalFilePath
        }
    }
    catch {
        Add-Content $logfile -Value ('A problem was encountered inserting "' + $filename + '" --> ' + $_.Exception.Message)
    }
}
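To make the suspected mismatch concrete, here is a minimal sketch (not part of the script above) that encodes one word from the fragment both ways and compares the bytes. Using ISO-8859-1 as the stand-in for the "default" character set is an assumption; Windows-1252 gives the same single byte, E9, for é. The variable names are illustrative only.

```powershell
# 'Ministério', with the accented letter built from its code point so that
# the encoding of this script file cannot affect the result.
$text = 'Minist' + [char]0x00E9 + 'rio'

$utf8   = [System.Text.Encoding]::UTF8
$latin1 = [System.Text.Encoding]::GetEncoding('iso-8859-1')   # assumed stand-in for the default

$utf8Bytes   = $utf8.GetBytes($text)
$latin1Bytes = $latin1.GetBytes($text)

# Hex dump of each byte sequence for comparison
$utf8Hex   = ($utf8Bytes   | ForEach-Object { $_.ToString('X2') }) -join ' '
$latin1Hex = ($latin1Bytes | ForEach-Object { $_.ToString('X2') }) -join ' '
Write-Output "UTF-8      : $utf8Hex"
Write-Output "ISO-8859-1 : $latin1Hex"

# A strict UTF-8 decoder (throwOnInvalidBytes = $true) rejects the single-byte form:
# E9 announces a three-byte sequence, but the next byte is not a continuation byte.
$strictUtf8   = New-Object System.Text.UTF8Encoding($false, $true)
$decodeFailed = $false
try   { [void]$strictUtf8.GetString($latin1Bytes) }
catch { $decodeFailed = $true }
Write-Output "Strict UTF-8 decode of the single-byte form failed: $decodeFailed"
```

If the body leaves PowerShell in the single-byte form while the server expects UTF-8, an error such as the one in the log would be the likely result.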
I added $OutputEncoding = New-Object -TypeName System.Text.UTF8Encoding to the top of the PowerShell script (assuming it sets UTF-8 as the default character set for the session?) and also added a charset parameter to the $contentType statement:
$contentType = 'multipart/form-data; boundary={0} ; charset=utf-8' -f $boundary;
These changes appear to have corrected the issue. Does $OutputEncoding change the encoding for the entire session to UTF-8 if it is added at the top of the script?
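One way to take the string-to-bytes decision out of the cmdlet's hands is to encode the body yourself and pass a byte array to -Body, which Invoke-RestMethod accepts. The sketch below is a possible variation, not a confirmed fix: ConvertTo-Utf8Body is a hypothetical helper name, and the sample text stands in for the real multipart body. As far as the PowerShell documentation describes it, $OutputEncoding governs text piped to external (native) programs, so the charset parameter may be the change that actually mattered here.

```powershell
# Hypothetical helper (not in the original script): encode the body text
# as UTF-8 bytes before it is handed to Invoke-RestMethod.
function ConvertTo-Utf8Body ([string]$BodyText)
{
    # The leading comma stops PowerShell from unrolling the byte array on return
    return ,([System.Text.Encoding]::UTF8.GetBytes($BodyText))
}

$boundary    = [System.Guid]::NewGuid().ToString()
$contentType = 'multipart/form-data; boundary={0}; charset=utf-8' -f $boundary

# Sample text standing in for the real body: 'Conservação'
$sampleText = 'Conserva' + [char]0x00E7 + [char]0x00E3 + 'o'
$bodyBytes  = ConvertTo-Utf8Body $sampleText

# In sendXml the call would then look like this (endpoint and credentials as in the script above):
# Invoke-RestMethod -Uri $EndpointAddress -Method PUT -ContentType $contentType -Body $bodyBytes -Credential $credentials
```

With a byte array the bytes on the wire are exactly the ones produced by the UTF-8 encoder, whatever the session defaults are.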