Using the AssignFile and ReadLn() routines in Delphi XE2, I can only read and display Unicode characters (from a UTF8 encoded file) correctly when the system locale is English.
Where it fails
If I set the system locale for non-Unicode applications to Korean (codepage 949, I think) and repeat the same read, some of my UTF8 multi-byte sequences get replaced with $3F. This only happens with ReadLn, not with TFile.ReadAllText(aFilename, TEncoding.UTF8) or TFileStream.Read().
The test
I create a text file, UTF8 w/o BOM (Notepad++), with the following characters (hex equivalent shown on the second line):
테스트
ed 85 8c ec 8a a4 ed 8a b8
Write a Delphi XE2 Windows form application with a TMemo control:

    procedure TForm1.ReadFile(aFilename: string);
    var
      gFile: TextFile;
      gLine: RawByteString;
      gWideLine: string;
    begin
      AssignFile(gFile, aFilename);
      try
        Reset(gFile);
        Memo1.Clear;
        while not EOF(gFile) do
        begin
          ReadLn(gFile, gLine);
          gWideLine := UTF8ToWideString(gLine);
          Memo1.Lines.Add(gWideLine);
        end;
      finally
        CloseFile(gFile);
      end;
    end;

I inspect the contents of gLine before performing the UTF8ToWideString conversion, and under an English / US locale Windows it is:
$ED $85 $8C $EC $8A $A4 $ED $8A $B8
As an aside, if I read the same file with a BOM I get the correct 3 byte preamble and the output when the UTF8 decode is performed is the same. All OK so far!
Switch Windows 7 (x64) to use Korean as the codepage for applications without Unicode support (Region and Language --> Administrative tab --> Change system locale --> Korean (Korea)). Restart the computer.
Read the same file (UTF8 w/o BOM) with the above application, and gLine now has the hex value:
$3F $8C $EC $8A $A4 $3F $3F
Output in TMemo: ?�스??
Hypothesis: ReadLn() (and Read() for that matter) are attempting to map UTF8 sequences as Korean multibyte sequences (i.e. they try to interpret $ED $85, can't, and so substitute a question mark, $3F).
Use TFileStream to read in exactly the expected number of bytes (9 w/o BOM) and the hex in memory is now exactly:
$ED $85 $8C $EC $8A $A4 $ED $8A $B8
Output in TMemo: 테스트 (perfect!)
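For comparison, a raw read of this kind can be sketched roughly as below. This is only a minimal sketch under some assumptions: the function name ReadUtf8FileRaw is hypothetical, the whole file is assumed to fit in memory, and the file is assumed to have no BOM (a BOM would have to be skipped before decoding). Because the bytes never pass through TextFile, the system locale plays no part in the read.

```delphi
// Assumes: uses System.SysUtils, System.Classes;

// Hypothetical helper: reads the whole file as raw bytes, then decodes
// those bytes explicitly as UTF8.
function ReadUtf8FileRaw(const aFilename: string): string;
var
  Stream: TFileStream;
  Bytes: TBytes;
begin
  Result := '';
  Stream := TFileStream.Create(aFilename, fmOpenRead or fmShareDenyWrite);
  try
    // Read exactly as many bytes as the file contains (9 for the test file).
    SetLength(Bytes, Stream.Size);
    if Length(Bytes) > 0 then
      Stream.ReadBuffer(Bytes[0], Length(Bytes));
  finally
    Stream.Free;
  end;
  // Decode explicitly as UTF8; an empty file simply yields ''.
  if Length(Bytes) > 0 then
    Result := TEncoding.UTF8.GetString(Bytes);
end;
```

Calling Memo1.Lines.Text := ReadUtf8FileRaw(aFilename) would then show the decoded text, but it does not give line-by-line access, which is what the legacy routines need.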
Problem: Laziness - I've a lot of legacy routines that parse potentially large files line by line and I wanted to be sure I didn't need to write a routine to manually read until new lines for each of these files.
Question(s):
Why is Read() not returning the exact byte string as found in the file? Is it because I'm using a TextFile type and so Delphi is doing a degree of interpretation using the non-Unicode codepage?
Is there a built-in way to read a UTF8 encoded file line by line?
Update:
Just came across Rob Kennedy's solution to this post which reintroduces me to TStreamReader, which answers the question about graceful reading of UTF8 files line by line.
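A minimal sketch of what the line-by-line read might look like with TStreamReader, assuming the same TForm1 and Memo1 as in the test above. The method name ReadFileUtf8 is hypothetical, and this is not necessarily identical to the code in the linked solution.

```delphi
// Assumes: uses System.SysUtils, System.Classes;

// Hypothetical replacement for ReadFile: the encoding is stated explicitly,
// so the system locale for non-Unicode applications is not consulted.
procedure TForm1.ReadFileUtf8(const aFilename: string);
var
  Reader: TStreamReader;
begin
  Memo1.Clear;
  Reader := TStreamReader.Create(aFilename, TEncoding.UTF8);
  try
    while not Reader.EndOfStream do
      // ReadLine returns a UnicodeString, so no UTF8ToWideString call is needed.
      Memo1.Lines.Add(Reader.ReadLine);
  finally
    Reader.Free;
  end;
end;
```

The loop has the same shape as the ReadLn version, so legacy line-parsing routines could in principle be adapted by swapping the reader and keeping the per-line logic.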