How to reverse a string that contains surrogate pairs

Question

I have written this method to reverse a string:

public string Reverse(string s)
{
    if (string.IsNullOrEmpty(s))
        return s;

    TextElementEnumerator enumerator =
        StringInfo.GetTextElementEnumerator(s);

    var elements = new List<char>();
    while (enumerator.MoveNext())
    {
        var cs = enumerator.GetTextElement().ToCharArray();
        if (cs.Length > 1)
        {
            elements.AddRange(cs.Reverse());
        }
        else
        {
            elements.AddRange(cs);
        }
    }

    elements.Reverse();
    return string.Concat(elements);
}

Now, I don't want to start a discussion about how this code could be made more efficient or how there are one-liners that I could use instead. I'm aware that you can perform XORs and all sorts of other things to potentially improve this code. If I want to refactor the code later, I can do that easily because I have unit tests.

Currently, this correctly reverses BMP strings (including strings with accents like "Les Misérables") and strings that contain combining characters such as "Les Mise\u0301rables".

My tests that contain surrogate pairs work if they are expressed like this:

Assert.AreEqual("", _stringOperations.Reverse(""));

But if I express surrogate pairs like this

Assert.AreEqual("\u10000", _stringOperations.Reverse("\u10000"));

then the test fails. Is there an air-tight implementation that supports surrogate pairs as well?

If I have made any mistake above then please do point this out as I'm no Unicode expert.

looks like c# supports only \u digit digit digit digit http://msdn.microsoft.com/en-us/library/aa664669(v=vs.71).aspx — wiero, Mar 01 '14 at 13:14
That does seem to work but it goes into the if (cs.Length > 1) branch which suggests to me that this might be a different Unicode character to \u10000. The branch that it goes into is for composite characters, so it's probably a different character. Either way there still is a set of strings that can't be reversed using this method. We need a Unicode and C# expert to clear this all up. — Sachin Kainth, Mar 01 '14 at 14:02
I'm not really doing this for any particular purpose or project. I'm writing this code out of curiosity and so when I try to reverse \u10000 I don't actually care what it prints, I just want to correctly reverse it. — Sachin Kainth, Mar 01 '14 at 14:20

score 6 · Accepted Answer · edited Jun 20 '20 at 09:12

\u10000 is a string of two characters: က (Unicode code point U+1000) followed by a 0. You can confirm this by inspecting the value of s in your method. If you reverse two different characters, the result won't match the input anymore.

It seems you're after Unicode Character 'LINEAR B SYLLABLE B008 A' (U+10000) with hexadecimal code point 10000. From Unicode character escape sequences on MSDN:

\u hex-digit hex-digit hex-digit hex-digit

\U hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit hex-digit

So you'll have to use either four or eight digits.

Use \U00010000 (notice the capital U) or \uD800\uDC00 instead of \u10000.
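
The difference between the three spellings can be checked directly. The following is only a minimal sketch; the class name `EscapeDemo` and its method names are made up for illustration and are not part of the question's code.

```csharp
using System;

public static class EscapeDemo
{
    // "\u10000" parses as the four-digit escape \u1000 followed by a literal '0'.
    public static string FiveDigits() => "\u10000";

    // Eight-digit escape for the single code point U+10000.
    public static string EightDigits() => "\U00010000";

    // The same code point written as an explicit surrogate pair.
    public static string SurrogatePair() => "\uD800\uDC00";

    public static void Main()
    {
        string five = FiveDigits();
        Console.WriteLine(five.Length);          // 2 (two separate characters)
        Console.WriteLine(five[0] == '\u1000');  // True
        Console.WriteLine(five[1] == '0');       // True

        string eight = EightDigits();
        Console.WriteLine(eight.Length);                              // 2 (UTF-16 code units)
        Console.WriteLine(eight == SurrogatePair());                  // True
        Console.WriteLine(char.ConvertToUtf32(eight, 0) == 0x10000);  // True
    }
}
```

Both `"\u10000"` and `"\U00010000"` have a `Length` of 2, which is probably part of the confusion: the first is two independent characters, the second is one code point stored as two UTF-16 code units.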

answered Mar 01 '14 at 14:20

CodeCaster

Okay, you are right (+1), the unit tests for \U00010000 and \uD800\uDC00 both pass. This is great. But the issue still stands: "\u10000" is a valid string, yet it cannot be reversed using the method above? So, 1. Why is this? 2. What's the solution? – Sachin Kainth Mar 01 '14 at 14:28
I think I answered that in my first two sentences. If you reverse two different characters (`AB` -> `BA`), then they don't match anymore. This means your reversion works, but you need to test whether the string has been reversed, not whether it matches the original input. – CodeCaster Mar 01 '14 at 14:29
But they form a single character when combined in this way, right? Which means that a user would expect them to be reversed in such a way that the character, though made up of two different characters, would be reversed in its entirety. Therefore there must be a way of detecting this and reversing appropriately. – Sachin Kainth Mar 01 '14 at 14:31
No, they don't. `\u10000` does **not** represent the character you think it does, it is a string of **two** characters: `{\u1000} and 0`, because you can use either **four or eight digits**. The fifth character, `0`, is not part of the unicode escape sequence (try `\u1000X`). Inspect the value of `s` to confirm. – CodeCaster Mar 01 '14 at 14:32
You are right. Am I alone in finding this a little confusing? I still think a user would be stumped by this. Is \u10000 a surrogate pair? – Sachin Kainth Mar 01 '14 at 14:34
In my opinion it is not confusing. :) It enables you to use unicode characters in mid-sentence, just like your `Les Mise\u0301rables`. Where would the escape sequence stop when it is directly followed by valid hexadecimal digits (`\u1234a5c`)? Is `a5c` part of the unicode character, or part of the rest of the string? – CodeCaster Mar 01 '14 at 14:36
I am seeing what you mean now. Like in the cartoons when the lightbulb turns on above the character's head. :) – Sachin Kainth Mar 01 '14 at 14:37

Stefan Steiger · Answer 2 · 2016-07-13T15:33:31.837

Necromancing.
This happens because you use List<char>.Reverse instead of List<string>.Reverse:

// using System.Globalization;

TextElementEnumerator enumerator =
    StringInfo.GetTextElementEnumerator("Les Mise\u0301rables");

List<string> elements = new List<string>();
while (enumerator.MoveNext())
    elements.Add(enumerator.GetTextElement());

elements.Reverse();
string reversed = string.Concat(elements);  // selbarésiM seL

See Jon Skeet's pony video for more information: https://vimeo.com/7403673

Here's how you properly reverse a string (a string, not a sequence of chars):

public static class Test
{

    private static System.Collections.Generic.List<string> GraphemeClusters(string s)
    {
        System.Collections.Generic.List<string> ls = new System.Collections.Generic.List<string>();

        System.Globalization.TextElementEnumerator enumerator = System.Globalization.StringInfo.GetTextElementEnumerator(s);
        while (enumerator.MoveNext())
        {
            ls.Add((string)enumerator.Current);
        }

        return ls;
    }


    // Reverses the string one grapheme cluster (text element) at a time.
    private static string ReverseGraphemeClusters(string s)
    {
        if (string.IsNullOrEmpty(s) || s.Length == 1)
            return s;

        System.Collections.Generic.List<string> ls = GraphemeClusters(s);
        ls.Reverse();

        return string.Join("", ls.ToArray());
    }

    public static void TestMe()
    {
        string s = "Les Mise\u0301rables";
        string r = ReverseGraphemeClusters(s);

        // This would be wrong:
        // char[] a = s.ToCharArray();
        // System.Array.Reverse(a);
        // string r = new string(a);

        System.Console.WriteLine(r);
    }
}

Note that you need to know the difference between
- a character and a glyph
- a byte (8 bit) and a code point/rune (32 bit)
- a code point and a grapheme cluster [32+ bit] (aka grapheme)

Reference:

Character is an overloaded term that can mean many things.

A code point is the atomic unit of information. Text is a sequence of code points. Each code point is a number which is given meaning by the Unicode standard.

A grapheme is a sequence of one or more code points that are displayed as a single, graphical unit that a reader recognizes as a single element of the writing system. For example, both a and ä are graphemes, but they may consist of multiple code points (e.g. ä may be two code points, one for the base character a followed by one for the diaeresis; but there's also an alternative, legacy, single code point representing this grapheme). Some code points are never part of any grapheme (e.g. the zero-width non-joiner, or directional overrides).

A glyph is an image, usually stored in a font (which is a collection of glyphs), used to represent graphemes or parts thereof. Fonts may compose multiple glyphs into a single representation, for example, if the above ä is a single code point, a font may choose to render that as two separate, spatially overlaid glyphs. For OTF, the font's GSUB and GPOS tables contain substitution and positioning information to make this work. A font may contain multiple alternative glyphs for the same grapheme, too.
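
To make the distinction concrete in C#, the three units can be counted for the same string. This is a rough sketch only; `TextUnitsDemo` and its methods are hypothetical names introduced here, and the text element count relies on whatever `StringInfo` reports on your runtime.

```csharp
using System;
using System.Globalization;

public static class TextUnitsDemo
{
    // Number of UTF-16 code units (what string.Length reports).
    public static int CodeUnits(string s) => s.Length;

    // Number of Unicode code points: a valid surrogate pair counts once.
    public static int CodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsSurrogatePair(s, i))
                i++; // skip the low surrogate of the pair
            count++;
        }
        return count;
    }

    // Number of text elements as reported by StringInfo.
    public static int TextElements(string s) => new StringInfo(s).LengthInTextElements;

    public static void Main()
    {
        // 'a' + combining diaeresis (U+0308), then U+10000 as a surrogate pair.
        string s = "a\u0308\U00010000";
        Console.WriteLine(CodeUnits(s));    // 4
        Console.WriteLine(CodePoints(s));   // 3
        Console.WriteLine(TextElements(s)); // 2
    }
}
```

Reversing by the first unit breaks surrogate pairs, reversing by the second keeps pairs but detaches combining marks, and only the third matches what a reader perceives as characters.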

score 0 · Answer 3 · answered Mar 31 '14 at 15:06

This is a start. It might not be the fastest, but it does seem to work for what we have thrown at it.

// using System.Text;

internal static string ReverseItWithSurrogate(string stringToReverse)
{
    if (string.IsNullOrEmpty(stringToReverse))
        return stringToReverse;

    // This is the object that will hold our reversed string.
    var sb = new StringBuilder(stringToReverse.Length);

    // We start at the back and look at each char. If it is a low surrogate
    // and the one before it is a high surrogate, then we have a surrogate pair.
    for (int i = stringToReverse.Length - 1; i >= 0; i--)
    {
        // We can't check for a high surrogate if the low surrogate is at index 0.
        if (i > 0
            && char.IsLowSurrogate(stringToReverse[i])
            && char.IsHighSurrogate(stringToReverse[i - 1]))
        {
            // Keep the pair in its original high-then-low order.
            sb.Append(stringToReverse[i - 1]);
            sb.Append(stringToReverse[i]);

            // Skip the high surrogate that was just consumed.
            i--;
        }
        else
        {
            sb.Append(stringToReverse[i]);
        }
    }

    return sb.ToString();
}
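
A surrogate-aware loop like the one above keeps pairs intact, but it still separates a base character from a combining mark that follows it. The sketch below compares the two approaches side by side. `ReverseComparison` and both method names are hypothetical, and the text element version uses `StringInfo.ParseCombiningCharacters` rather than the enumerator shown in the other answers.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class ReverseComparison
{
    // Reverses by code point: surrogate pairs stay intact, combining marks do not.
    public static string ReverseCodePoints(string s)
    {
        var sb = new StringBuilder(s.Length);
        for (int i = s.Length - 1; i >= 0; i--)
        {
            if (i > 0 && char.IsSurrogatePair(s[i - 1], s[i]))
            {
                sb.Append(s[i - 1]).Append(s[i]);
                i--; // skip the high surrogate that was just copied
            }
            else
            {
                sb.Append(s[i]);
            }
        }
        return sb.ToString();
    }

    // Reverses by text element, so a base character keeps its combining marks.
    public static string ReverseTextElements(string s)
    {
        int[] starts = StringInfo.ParseCombiningCharacters(s);
        var sb = new StringBuilder(s.Length);
        for (int k = starts.Length - 1; k >= 0; k--)
        {
            int end = (k + 1 < starts.Length) ? starts[k + 1] : s.Length;
            sb.Append(s, starts[k], end - starts[k]);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // 'a', then 'e' + combining acute (U+0301), then U+10000.
        string s = "ae\u0301\U00010000";

        // The pair survives in both results; only the second keeps e + U+0301 together.
        Console.WriteLine(ReverseCodePoints(s) == "\U00010000" + "\u0301" + "ea");   // True
        Console.WriteLine(ReverseTextElements(s) == "\U00010000" + "e\u0301" + "a"); // True
    }
}
```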

Tom · Answer 4 · 2015-01-28T01:06:35.037

best viewed NOT in Chrome!

using System.Linq;
using System.Collections.Generic;
using System;
using System.Globalization;
using System.Diagnostics;
using System.Collections;
namespace OrisNumbers
{
    public static class IEnumeratorExtensions
    {
        public static IEnumerable<T> AsIEnumerable<T>(this IEnumerator iterator)
        {
            while (iterator.MoveNext())
            {
                yield return (T)iterator.Current;
            }
        }
    }
    class Program
    {
        static void Main(string[] args)
        {
            var s = "foo  bar mañana mañana" ;
            Debug.WriteLine(s);
            Debug.WriteLine(string.Join("", StringInfo.GetTextElementEnumerator(s.Normalize()).AsIEnumerable<string>().Reverse()));
            Console.Read();
        }
    }
}
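
The `Normalize()` call in the code above is worth a short illustration. With no arguments it uses normalization form C, which composes a base letter and a combining mark into a single precomposed code point where one exists. A minimal sketch, assuming the runtime has normalization data available (that is, it is not running in invariant globalization mode); `NormalizeDemo` and `Compose` are names invented for this example.

```csharp
using System;
using System.Text;

public static class NormalizeDemo
{
    // Explicit form C, which is also what the parameterless Normalize() uses.
    public static string Compose(string s) => s.Normalize(NormalizationForm.FormC);

    public static void Main()
    {
        string decomposed = "man\u0303ana";   // 'n' followed by combining tilde (U+0303)
        string composed = Compose(decomposed);

        Console.WriteLine(decomposed.Length);          // 7
        Console.WriteLine(composed.Length);            // 6
        Console.WriteLine(composed == "ma\u00F1ana");  // True
    }
}
```

Normalizing first does not replace the text element enumeration: sequences with no precomposed form, and surrogate pairs, still span more than one `char` afterwards.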
