How do I ignore the UTF-8 Byte Order Marker in String comparisons?

Question

I'm having a problem comparing strings in a unit test in C# 4.0 using Visual Studio 2010. The same test case works properly in Visual Studio 2008 (with C# 3.0 and .NET 3.5).

Here's the relevant code snippet:

byte[] rawData = GetData();
string data = Encoding.UTF8.GetString(rawData);

Assert.AreEqual("Constant", data, false, CultureInfo.InvariantCulture);

While debugging this test, the data string appears to the naked eye to contain exactly the same string as the literal. When I called data.ToCharArray(), I noticed that the first character of the string is the value 65279 (U+FEFF), which is the Byte Order Mark. What I don't understand is why Encoding.UTF8.GetString() keeps this character around.

How do I get Encoding.UTF8.GetString() to not put the Byte Order Marker in the resulting string?

Update: The problem was that GetData(), which reads a file from disk, reads the data from the file using FileStream.readbytes(). I corrected this by using a StreamReader and converting the string to bytes using Encoding.UTF8.GetBytes(), which is what it should've been doing in the first place! Thanks for all the help.
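The difference described in the update can be reproduced with a temporary file. This is only a sketch: the original GetData() is not shown, so the ReadDemo class and its Lengths() method are hypothetical names, and File.ReadAllText stands in for the StreamReader the asker mentions (it uses one internally).

```csharp
using System;
using System.IO;
using System.Text;

static class ReadDemo
{
    // Hypothetical helper: writes "Constant" with a UTF-8 BOM,
    // then reads it back as raw bytes and as text.
    public static int[] Lengths()
    {
        string path = Path.GetTempFileName();
        try
        {
            // Encoding.UTF8 emits the EF BB BF preamble when writing.
            File.WriteAllText(path, "Constant", Encoding.UTF8);

            // The raw bytes still contain the BOM, so decoding them keeps U+FEFF.
            string viaBytes = Encoding.UTF8.GetString(File.ReadAllBytes(path));

            // File.ReadAllText detects the BOM and drops it.
            string viaReadAllText = File.ReadAllText(path, Encoding.UTF8);

            return new[] { viaBytes.Length, viaReadAllText.Length };
        }
        finally
        {
            File.Delete(path);
        }
    }

    static void Main()
    {
        int[] lengths = Lengths();
        Console.WriteLine(lengths[0]); // 9: U+FEFF plus the 8 letters
        Console.WriteLine(lengths[1]); // 8: BOM removed by the reader
    }
}
```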

Can you post a small, but complete, program that demonstrates the problem? — Lasse V. Karlsen, May 26 '10 at 17:14

score 18 · Accepted Answer · answered May 26 '10 at 17:15

Well, I assume it's because the raw binary data includes the BOM. You could always remove the BOM yourself after decoding, if you don't want it - but you should consider whether the byte array should contain the BOM to start with.
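A minimal sketch of removing the BOM after decoding, assuming only a single leading U+FEFF needs to go. The BomText class and StripLeadingBom method are hypothetical names used for illustration, not part of the framework.

```csharp
using System;
using System.Text;

static class BomText
{
    // Hypothetical helper: drops one leading U+FEFF if present,
    // and returns the input unchanged otherwise.
    public static string StripLeadingBom(string text)
    {
        if (!string.IsNullOrEmpty(text) && text[0] == '\uFEFF')
        {
            return text.Substring(1);
        }
        return text;
    }

    static void Main()
    {
        byte[] withBom = { 0xEF, 0xBB, 0xBF, 0x41 };
        string decoded = Encoding.UTF8.GetString(withBom);

        Console.WriteLine(decoded.Length);                  // 2
        Console.WriteLine(StripLeadingBom(decoded).Length); // 1
    }
}
```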

EDIT: Alternatively, you could use a StreamReader to perform the decoding. Here's an example, showing the same byte array being converted into two characters using Encoding.GetString or one character via a StreamReader:

using System;
using System.IO;
using System.Text;

class Test
{
    static void Main()
    {
        // UTF-8 BOM (EF BB BF) followed by the letter 'A'
        byte[] withBom = { 0xef, 0xbb, 0xbf, 0x41 };
        string viaEncoding = Encoding.UTF8.GetString(withBom);
        Console.WriteLine(viaEncoding.Length); // 2: the BOM is decoded as U+FEFF

        string viaStreamReader;
        using (StreamReader reader = new StreamReader
               (new MemoryStream(withBom), Encoding.UTF8))
        {
            viaStreamReader = reader.ReadToEnd();
        }
        Console.WriteLine(viaStreamReader.Length); // 1: StreamReader skips the BOM
    }
}

Jon Skeet

1,421,763
867
9,128
9,194

You're right that the raw data includes the BOM. It shouldn't, so I'm fixing that part. A philosophical follow-up question: Why does the `String.Equals` method take the BOM into account? Why isn't it simply ignored when doing a string comparison or treated as metadata and not as the "meat" of the string? – Skrud May 26 '10 at 17:39

@Skrud: You've got distinct character sequences. The raw String.Equals method compares ordinal sequences, with no further consideration. It's possible that some of the other string comparisons available (culture aware etc) may ignore BOMs - I'm not sure. Given that it's a strange character in some ways, I'm not really convinced it's appropriate to just ignore it arbitrarily. Put it this way: the equality failure showed that you had some bad data, so the behaviour has led to you improving your code. That's a good thing, no? – Jon Skeet May 26 '10 at 17:43

Absolutely. Which is the point of testing in the first place. :-) – Skrud May 26 '10 at 17:44
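The ordinal behaviour discussed in these comments can be checked directly. This is a rough sketch: the ComparisonDemo class and OrdinalEquals method are hypothetical names, and the culture-aware result is printed rather than asserted because it may depend on the runtime's collation rules.

```csharp
using System;

static class ComparisonDemo
{
    // Hypothetical helper: ordinal comparison looks at every char value,
    // so a leading U+FEFF makes the strings differ.
    public static bool OrdinalEquals(string a, string b)
    {
        return string.Equals(a, b, StringComparison.Ordinal);
    }

    static void Main()
    {
        string withBom = "\uFEFFConstant";

        Console.WriteLine(OrdinalEquals(withBom, "Constant")); // False

        // Culture-sensitive comparison may treat U+FEFF as ignorable;
        // the result is printed here without assuming either outcome.
        Console.WriteLine(
            string.Equals(withBom, "Constant", StringComparison.InvariantCulture));
    }
}
```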

score 13 · Answer 2 · answered May 27 '10 at 02:26

There is a slightly more efficient way to do it than creating StreamReader and MemoryStream:

1) If you know that there is always a BOM:

string viaEncoding = Encoding.UTF8.GetString(withBom, 3, withBom.Length - 3);

2) If you don't know, check:

string viaEncoding;
if (withBom.Length >= 3 && withBom[0] == 0xEF && withBom[1] == 0xBB && withBom[2] == 0xBF)
    viaEncoding = Encoding.UTF8.GetString(withBom, 3, withBom.Length - 3);
else
    viaEncoding = Encoding.UTF8.GetString(withBom);
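The same check can be written against Encoding.UTF8.GetPreamble() instead of hard-coded byte values. This is a sketch of one possible variation, not the answerer's code; the Utf8Decoder class and DecodeSkippingBom method are hypothetical names.

```csharp
using System;
using System.Text;

static class Utf8Decoder
{
    // Hypothetical helper: decodes UTF-8 bytes, skipping the preamble
    // (EF BB BF) when the array starts with it.
    public static string DecodeSkippingBom(byte[] bytes)
    {
        byte[] preamble = Encoding.UTF8.GetPreamble();
        int offset = 0;

        if (bytes.Length >= preamble.Length)
        {
            offset = preamble.Length;
            for (int i = 0; i < preamble.Length; i++)
            {
                if (bytes[i] != preamble[i])
                {
                    offset = 0; // not a BOM, decode from the start
                    break;
                }
            }
        }

        return Encoding.UTF8.GetString(bytes, offset, bytes.Length - offset);
    }

    static void Main()
    {
        byte[] withBom = { 0xEF, 0xBB, 0xBF, 0x41 };
        byte[] withoutBom = { 0x41 };

        Console.WriteLine(DecodeSkippingBom(withBom).Length);    // 1
        Console.WriteLine(DecodeSkippingBom(withoutBom).Length); // 1
    }
}
```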

score 0 · Answer 3 · answered Jul 14 '22 at 07:00

Unfortunately the BOM won't be removed with a simple Trim(). But it can be done as follows:

byte[] withBom = { 0xef, 0xbb, 0xbf, 0x41 };    
byte[] bom = { 0xef, 0xbb, 0xbf };
var text = System.Text.Encoding.UTF8.GetString(withBom);

Console.WriteLine($"Untrimmed: {text.Length}, {text}");
var trimmed = text.Trim(System.Text.Encoding.UTF8.GetString(bom).ToCharArray());
Console.WriteLine($"Trimmed: {trimmed.Length}, {trimmed}");

Output:

Untrimmed: 2, A
Trimmed: 1, A

score -4 · Answer 4 · answered May 26 '10 at 17:25

I believe the extra character is removed if you Trim() the decoded string

JoeGeeky

3,746
6
36
53
