Why does ReadLn mis-interpret UTF8 text when non-unicode page is Korean (949)?

Question

In Delphi XE2, using the AssignFile and ReadLn() routines, I can only read and display Unicode characters (from a UTF8 encoded file) when the system locale is English.

Where it fails
If I set the system locale for non-Unicode applications to Korean (codepage 949, I think) and repeat the same read, some of my UTF8 multi-byte sequences get replaced with $3F. This only happens with ReadLn, not with TFile.ReadAllText(aFilename, TEncoding.UTF8) or TFileStream.Read().

The test
I create a text file, UTF8 w/o BOM (Notepad++), with the following characters (hex equivalent shown on the second line):

테스트
ed 85 8c ec 8a a4 ed 8a b8

I write a Delphi XE2 Windows Forms application with a TMemo control:

    procedure TForm1.ReadFile(aFilename:string);
    var
      gFile     : TextFile;
      gLine     : RawByteString;
      gWideLine : string;
    begin
      AssignFile(gFile, aFilename);
      try
        Reset(gFile);
        Memo1.Clear;
        while not EOF(gFile) do
        begin
          ReadLn(gFile, gLine);
          gWideLine := UTF8ToWideString(gLine);
          Memo1.Lines.Add(gWideLine);
        end;
      finally
        CloseFile(gFile);
      end;
    end;

I inspect the contents of gLine before performing the UTF8ToWideString conversion, and under English / US locale Windows it is:

$ED $85 $8C $EC $8A $A4 $ED $8A $B8

As an aside, if I read the same file with a BOM I get the correct 3 byte preamble and the output when the UTF8 decode is performed is the same. All OK so far!

Switch Windows 7 (x64) to use Korean as the codepage for applications without Unicode support (Region and Language --> Administrative tab --> Change system locale --> Korean (Korea)). Restart the computer.
Read the same file (UTF8 w/o BOM) with the above application and gLine now has the hex value:

$3F $8C $EC $8A $A4 $3F $3F

Output in TMemo: ?�스??
Hypothesis: ReadLn() (and Read(), for that matter) attempts to interpret the UTF8 sequences as Korean multibyte sequences (i.e. it tries to interpret $ED $85, can't, and so substitutes a question mark, $3F).
Use TFileStream to read in exactly the expected number of bytes (9 w/o BOM) and the hex in memory is now exactly:

$ED $85 $8C $EC $8A $A4 $ED $8A $B8

Output in TMemo: 테스트 (perfect!)
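
The bytes seen under the Korean locale fit that hypothesis if the text is decoded as code page 949 and then encoded back to it. In code page 949, $ED appears to pair only with trail bytes in the range $A1..$FE, so $ED $85 and $ED $8A are invalid and become $3F; $8C $EC and $8A $A4 happen to be valid pairs and survive the round trip; the final $B8 is a lead byte with no trail byte and also becomes $3F. A minimal console sketch of that round trip follows. It assumes that TEncoding.GetEncoding(949) goes through the same Windows conversion behaviour as ReadLn, which is not confirmed here, and the program and variable names are made up for illustration:

```delphi
program Cp949RoundTrip;

{$APPTYPE CONSOLE}

uses
  System.SysUtils;

var
  lInput, lOutput: TBytes;
  lKorean: TEncoding;
  lText: string;
  i: Integer;
begin
  // UTF-8 bytes of the test string, copied from the question.
  lInput := TBytes.Create($ED, $85, $8C, $EC, $8A, $A4, $ED, $8A, $B8);

  // GetEncoding returns a new instance that the caller must free.
  lKorean := TEncoding.GetEncoding(949);
  try
    // Decode the bytes as if they were code page 949 text...
    lText := lKorean.GetString(lInput);
    // ...then encode the result back to code page 949.
    lOutput := lKorean.GetBytes(lText);
  finally
    lKorean.Free;
  end;

  // Print the round-tripped bytes for comparison with gLine.
  for i := 0 to High(lOutput) do
    Write('$', IntToHex(lOutput[i], 2), ' ');
  Writeln;
end.
```

If the assumption holds, the printed bytes should resemble the corrupted gLine regardless of the system locale, because the code page is named explicitly rather than taken from the system default.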

Problem: Laziness - I've a lot of legacy routines that parse potentially large files line by line and I wanted to be sure I didn't need to write a routine to manually read until new lines for each of these files.

Question(s):

Why is Read() not returning me the exact byte string as found in the file? Is it because I'm using a TextFile type and so Delphi is doing a degree of interpretation using the non-unicode codepage?
Is there a built in way to read a UTF8 encoded file line by line?

Update:

Just came across Rob Kennedy's solution to this post which reintroduces me to TStreamReader, which answers the question about graceful reading of UTF8 files line by line.

`Readln`, and indeed `Writeln`, don't properly support Unicode encodings. This question is related:
http://stackoverflow.com/questions/26255148/is-writeln-capable-of-supporting-unicode — David Heffernan, Mar 21 '15 at 18:19

score 8 · Accepted Answer · answered Mar 21 '15 at 23:45

Is there a built in way to read a UTF8 encoded file line by line?

Use TStreamReader. It has a ReadLine() method.

    procedure TForm1.ReadFile(aFilename:string);
    var
      gFile     : TStreamReader;
      gLine     : string;
    begin
      Memo1.Clear;
      gFile := TStreamReader.Create(aFilename, TEncoding.UTF8, True);
      try
        while not gFile.EndOfStream do
        begin
          gLine := gFile.ReadLine;
          Memo1.Lines.Add(gLine);
        end;
      finally
        gFile.Free;
      end;
    end;

With that said, this particular example can be greatly simplified:

    procedure TForm1.ReadFile(aFilename:string);
    begin
      Memo1.Lines.LoadFromFile(aFilename, TEncoding.UTF8);
    end;
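
For legacy routines that parse a file line by line, the TStreamReader loop can be written once and reused. A minimal sketch, assuming a hypothetical unit Utf8Lines and helper ForEachUtf8Line (neither is part of the RTL) that hands each decoded line to a callback:

```delphi
unit Utf8Lines;

interface

uses
  System.SysUtils, System.Classes;

// Hypothetical helper: calls aOnLine once for every line in the file.
procedure ForEachUtf8Line(const aFilename: string;
  const aOnLine: TProc<string>);

implementation

procedure ForEachUtf8Line(const aFilename: string;
  const aOnLine: TProc<string>);
var
  lReader: TStreamReader;
begin
  // Same constructor call as in the answer above: decode as UTF-8,
  // with BOM detection switched on.
  lReader := TStreamReader.Create(aFilename, TEncoding.UTF8, True);
  try
    while not lReader.EndOfStream do
      aOnLine(lReader.ReadLine);
  finally
    lReader.Free;
  end;
end;

end.
```

An existing parser could then be called from an anonymous method, for example:

```delphi
ForEachUtf8Line(aFilename,
  procedure(aLine: string)
  begin
    Memo1.Lines.Add(aLine);
  end);
```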

Thanks Remy for going the distance and creating a nice example of the TStreamReader solution. — Duncan, Mar 22 '15 at 21:49
