How would you get an array of Unicode code points from a .NET String?

Question

I have a list of character range restrictions that I need to check a string against, but the char type in .NET is UTF-16, so some characters become wacky (surrogate) pairs instead. When I enumerate all the chars in a string, I don't get the 32-bit Unicode code points, and some comparisons against high values fail.

I understand Unicode well enough that I could parse the bytes myself if necessary, but I'm looking for a C#/.NET Framework BCL solution. So ...

How would you convert a string to an array (int[]) of 32-bit Unicode code points?

score 24 · Answer 1 · edited May 23 '17 at 11:54

You are asking about code points. In UTF-16 (C#'s char) there are only two possibilities:

The character is from the Basic Multilingual Plane, and is encoded by a single code unit.
The character is outside the BMP, and is encoded using a surrogate high-low pair of code units.

Therefore, assuming the string is valid, this returns an array of code points for a given string:

public static int[] ToCodePoints(string str)
{
    if (str == null)
        throw new ArgumentNullException("str");

    var codePoints = new List<int>(str.Length);
    for (int i = 0; i < str.Length; i++)
    {
        codePoints.Add(Char.ConvertToUtf32(str, i));
        if (Char.IsHighSurrogate(str[i]))
            i += 1;
    }

    return codePoints.ToArray();
}

An example with a surrogate pair and a ñ built from an n followed by a combining tilde:

ToCodePoints("\U0001F300 El Ni\u006E\u0303o");                        //  El Niño
// { 0x1f300, 0x20, 0x45, 0x6c, 0x20, 0x4e, 0x69, 0x6e, 0x303, 0x6f } //    E l   N i n ̃◌ o

Here's another example. These two code points represent a thirty-second note with an accent-staccato, both encoded as surrogate pairs:

ToCodePoints("\U0001D162\U0001D181");              // 
// { 0x1d162, 0x1d181 }                            //  ◌

When normalized to Form C (the default for Normalize()), the note is decomposed into a notehead, combining stem and combining flag, followed by the combining accent-staccato, all surrogate pairs:

ToCodePoints("\U0001D162\U0001D181".Normalize());  // 
// { 0x1d158, 0x1d165, 0x1d170, 0x1d181 }          //    ◌
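
One caveat on the "assuming the string is valid" part above: `Char.ConvertToUtf32(string, int)` throws an `ArgumentException` when the index lands on an unpaired surrogate. If you'd rather substitute U+FFFD than throw, a minimal sketch could look like this. `ToCodePointsLenient` and the class name are hypothetical names for illustration, not part of the BCL or of the method above:

```csharp
using System;
using System.Collections.Generic;

public static class LenientCodePoints
{
    // Hypothetical helper: like ToCodePoints, but an unpaired surrogate
    // becomes U+FFFD instead of causing an exception.
    public static int[] ToCodePointsLenient(string str)
    {
        if (str == null)
            throw new ArgumentNullException(nameof(str));

        var codePoints = new List<int>(str.Length);
        for (int i = 0; i < str.Length; i++)
        {
            if (char.IsSurrogatePair(str, i))
            {
                codePoints.Add(char.ConvertToUtf32(str, i));
                i++; // skip the low surrogate that was just consumed
            }
            else if (char.IsSurrogate(str[i]))
            {
                codePoints.Add(0xFFFD); // unpaired surrogate
            }
            else
            {
                codePoints.Add(str[i]); // ordinary BMP character
            }
        }
        return codePoints.ToArray();
    }
}
```

With this, `ToCodePointsLenient("a\uD83Cb")` gives `{ 0x61, 0xFFFD, 0x62 }`, whereas the strict `ToCodePoints` throws on the same input.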

Note that leppie's solution is not correct. The question is about code points, not text elements. A text element is a combination of code points that together form a single grapheme. For example, in the example above, the ñ in the string is represented by a Latin lowercase n followed by a combining tilde ̃◌. Leppie's solution discards any combining characters that cannot be normalized into a single code point.
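
To make the difference between text elements and code points concrete, here is a small sketch that counts both for the same string. The class and method names are made up for illustration; `StringInfo.GetTextElementEnumerator` and `char.IsSurrogatePair` are the real BCL calls:

```csharp
using System;
using System.Globalization;

public static class TextElementVsCodePoint
{
    // Counts text elements (grapheme-like units) as StringInfo sees them.
    public static int CountTextElements(string s)
    {
        int count = 0;
        TextElementEnumerator e = StringInfo.GetTextElementEnumerator(s);
        while (e.MoveNext())
            count++;
        return count;
    }

    // Counts code points: a surrogate pair counts once, every other char once.
    public static int CountCodePoints(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            count++;
            if (char.IsSurrogatePair(s, i))
                i++; // skip the low surrogate
        }
        return count;
    }
}
```

For `"El Ni\u006E\u0303o"` this counts 7 text elements but 8 code points, because the n and its combining tilde form one text element.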

I'd use `var codePoint = Char.ConvertToUtf32(...); if(codePoint > 0xFFFF) i++;` instead of `Char.IsHighSurrogate`. — CodesInChaos, Jan 26 '15 at 18:05
@CodesInChaos: I believe that would be equivalent. If and only if the first char is a high surrogate can you ever get a code point above `0xFFFF`, but please tell me if I'm mistaken. — Daniel A.A. Pelsmaeker, Jan 26 '15 at 18:09
You may want to add your _Devanagari syllable "ni"_ example here as well, i.e. a single text element consisting of two code points that do not unite to a single code point under any normalization form. The tilde n, `ñ`, can turn into one code point through (suitable) normalization. — Jeppe Stig Nielsen, Jan 26 '15 at 19:00
@JeppeStigNielsen I instead added an example of a single text element of two code points that are both surrogate pairs and expand into four code point surrogate pairs under normalization. — Daniel A.A. Pelsmaeker, Jan 26 '15 at 20:25

leppie · Accepted Answer · 2015-04-22T19:02:51.427

7

This answer is not correct. See @Virtlink's answer for the correct one.

static int[] ExtractScalars(string s)
{
  if (!s.IsNormalized())
  {
    s = s.Normalize();
  }

  List<int> chars = new List<int>((s.Length * 3) / 2);

  var ee = StringInfo.GetTextElementEnumerator(s);

  while (ee.MoveNext())
  {
    string e = ee.GetTextElement();
    chars.Add(char.ConvertToUtf32(e, 0));
  }

  return chars.ToArray();
}

Notes: Normalization is required to deal with composite characters.

edited Apr 22 '15 at 19:02

answered Mar 26 '09 at 20:28

▼: Your solution discards any modifier characters, and you are dealing with _text elements_ and not _code points_. For example, the result of `ExtractScalars("El Ni\u006E\u0303o")` converted back to a string would be `"El Nino"` instead of `"El Niño"`. – Daniel A.A. Pelsmaeker Jan 26 '15 at 17:14
@Virtlink: Interesting. From the docs it must have sounded like `char.ConvertToUtf32(string, int)` should deal with it. Edit: The damn docs claims it should! https://msdn.microsoft.com/en-us/library/z2ys180b(v=vs.110).aspx – leppie Jan 26 '15 at 17:23
@Virtlink: Ok, it does not deal with composite characters, but does for surrogate pairs. – leppie Jan 26 '15 at 17:34
I realize you may be looking at my strange use of ConvertToUtf32 overloads. Yeah, that's fixed now, but that wasn't the issue. It's about the difference between surrogate pairs and composite characters, and text elements and code points. Your code indeed handles surrogate pairs. – Daniel A.A. Pelsmaeker Jan 26 '15 at 17:34
@Virtlink: Fixed. Just Normalize the input ,if needed, to deal with composites. Your codepoints are in fact not normalized, not incorrect, but would be tricky :D Edit: The roundtrip works now. Thanks for pointing it out! – leppie Jan 26 '15 at 17:43
@leppie Only some combinations of base character and composite characters will turn into a single codepoint when normalized to FormC. So this answer is still incorrect. Something TextElement is simply not the right approach when you want a sequence of codepoints. – CodesInChaos Jan 26 '15 at 17:58
Yeah, I was just looking into that. For example, the Devanagari syllable "ni" is a composable character `\u0928\u093F` that doesn't turn into one code point when normalized. Also, if you have a latin character with multiple modifiers (e.g. `^` and `~`), that also doesn't get normalized into a single code point. You have to accept that your code deals with _text elements_ (combinations of code points that represent a single grapheme) and you discard all code points except the first by doing `ConvertToUtf32(e, 0)`. There is no way to make your code work with code points using text elements. – Daniel A.A. Pelsmaeker Jan 26 '15 at 18:04
An alternative strategy is this: `var bytes = Encoding.UTF32.GetBytes(s); var ints = new int[bytes.Length / 4]; for (var idx = 0; idx < ints.Length; ++idx) { ints[idx] = BitConverter.ToInt32(bytes, 4 * idx); }`. You can still normalize `s` first, of course. You can use `new UTF32Encoding(...)` if you want strange endianness. – Jeppe Stig Nielsen Jan 26 '15 at 18:11
@Virtlink: I see the issue now. Would have been nice if the second parameter was `ref int` to return the number of characters swallowed. – leppie Jan 26 '15 at 18:25
Yes, Virtlink is right, this is broken. If the string contains `"\u0928\u093F"`, the latter of those code points is swallowed. Both code points are in the BMP (plane 0), no surrogate pairs there obviously, but they still constitute one "text element". – Jeppe Stig Nielsen Jan 26 '15 at 18:43
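
The point made in the comments above, that Form C only composes some sequences, can be checked with a short sketch. `ComposesToSingleCodePoint` is a hypothetical helper name, and the sketch assumes the runtime has normalization data available (the default on Windows and on ICU-enabled .NET installs):

```csharp
using System;
using System.Text;

public static class NormalizationCheck
{
    // Returns true if the input collapses to exactly one code point
    // after normalization to Form C.
    public static bool ComposesToSingleCodePoint(string textElement)
    {
        string nfc = textElement.Normalize(NormalizationForm.FormC);
        return nfc.Length == 1
            || (nfc.Length == 2 && char.IsSurrogatePair(nfc, 0));
    }
}
```

`"n\u0303"` composes to the single code point U+00F1, so the helper returns true. `"\u0928\u093F"` stays two code points after normalization, so it returns false, and any approach that keeps only the first code point of each text element loses the second one.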

Nicholas Carey · Answer 3 · 2015-01-26T20:57:18.623

4

Doesn't seem like it should be much more complicated than this:

public static IEnumerable<int> Utf32CodePoints( this IEnumerable<char> s )
{
  bool      useBigEndian = !BitConverter.IsLittleEndian;
  Encoding  utf32        = new UTF32Encoding( useBigEndian , false , true ) ;
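  // Encoding.GetBytes has no IEnumerable<char> overload, so materialize the
  // chars first (ToArray requires System.Linq).
  byte[]    octets       = utf32.GetBytes( s.ToArray() ) ;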

  for ( int i = 0 ; i < octets.Length ; i+=4 )
  {
    int codePoint = BitConverter.ToInt32(octets,i);
    yield return codePoint;
  }

}

edited Jan 26 '15 at 20:57

answered Jan 26 '15 at 18:11

`BitConverter` uses native endianness, `Encoding.UTF32` uses little endian. So this will break on a big endian system. – CodesInChaos Jan 26 '15 at 18:15
I just want to say that I posted the same solution (virtually) as a comment to leppie's answer, _six seconds_ before you submitted your answer. And mentioned endianness trouble as well. – Jeppe Stig Nielsen Jan 26 '15 at 18:30
@JeppeStigNielsen: Clearly, great minds think alike :) – Nicholas Carey Jan 26 '15 at 21:05
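
If you want to sidestep the endianness question entirely, one option is to read the UTF-32 bytes with an explicit byte order instead of `BitConverter`. This is a sketch assuming a runtime where `System.Buffers.Binary.BinaryPrimitives` and `AsSpan` are available (.NET Core 2.1 or later, or the System.Memory package); the class name is made up:

```csharp
using System;
using System.Buffers.Binary;
using System.Text;

public static class Utf32Decoding
{
    public static int[] ToCodePoints(string s)
    {
        // Encoding.UTF32 always produces little-endian bytes.
        byte[] bytes = Encoding.UTF32.GetBytes(s);
        var codePoints = new int[bytes.Length / 4];
        for (int i = 0; i < codePoints.Length; i++)
        {
            // Read explicitly as little-endian, whatever the CPU's byte order is.
            codePoints[i] = BinaryPrimitives.ReadInt32LittleEndian(bytes.AsSpan(i * 4, 4));
        }
        return codePoints;
    }
}
```

Because both the encoder and the reader are pinned to little-endian, the result does not depend on `BitConverter.IsLittleEndian`.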

score 1 · Answer 4 · edited May 23 '17 at 12:34

I came up with the same approach suggested by Nicholas (and Jeppe), just shorter:

    public static IEnumerable<int> GetCodePoints(this string s) {
        var utf32 = new UTF32Encoding(!BitConverter.IsLittleEndian, false, true);
        var bytes = utf32.GetBytes(s);
        return Enumerable.Range(0, bytes.Length / 4).Select(i => BitConverter.ToInt32(bytes, i * 4));
    }

The enumeration was all I needed, but getting an array is trivial:

int[] codePoints = myString.GetCodePoints().ToArray();

answered Jul 19 '16 at 14:10

Rich Armstrong

This gave the same output as the accepted answer. Thanks! – Arundale Ramanathan Mar 09 '23 at 04:03

score 1 · Answer 5 · answered Jun 12 '20 at 06:44

This solution produces the same results as the solution by Daniel A.A. Pelsmaeker but is a little bit shorter:

public static int[] ToCodePoints(string s)
{
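    // Buffer.BlockCopy copies raw bytes, so encode in the machine's own byte
    // order (Encoding.UTF32 is always little-endian).
    byte[] utf32bytes = new UTF32Encoding(!BitConverter.IsLittleEndian, false).GetBytes(s);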
    int[] codepoints = new int[utf32bytes.Length / 4];
    Buffer.BlockCopy(utf32bytes, 0, codepoints, 0, utf32bytes.Length);
    return codepoints;
}

eikuh

This gives the same output as the accepted answer even for ZWJ sequences. Thanks! – Arundale Ramanathan Mar 09 '23 at 02:58

score 0 · Answer 6 · answered Mar 09 '23 at 04:03

Another solution from here:

    public static int[] GetCodePoints(string input)
    {
        var cp_lst = new ArrayList();
        for (var i = 0; i < input.Length; i += char.IsSurrogatePair(input, i) ? 2 : 1) {
            int codepoint = char.ConvertToUtf32(input, i);
            cp_lst.Add(codepoint);
            //Console.WriteLine(codepoint);
        }
        return (int[]) cp_lst.ToArray(typeof(int));
    }
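
On newer runtimes there is also a built-in way to walk a string by code point, which does the same surrogate-pair stepping as the loop above. A minimal sketch, assuming .NET Core 3.0 or later where `string.EnumerateRunes()` and `System.Text.Rune` exist (they are not in .NET Framework); the class name is made up for illustration:

```csharp
using System;
using System.Linq;
using System.Text;

public static class RuneCodePoints
{
    public static int[] ToCodePoints(string s)
    {
        if (s == null)
            throw new ArgumentNullException(nameof(s));

        // Each Rune is one Unicode scalar value; Value is its code point as an int.
        return s.EnumerateRunes().Select(rune => rune.Value).ToArray();
    }
}
```

Unlike `char.ConvertToUtf32`, the rune enumerator does not throw on an unpaired surrogate; it yields U+FFFD in its place.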
