Can a valid Unicode string contain FFFF? Is Java/CharacterIterator broken?

Question

Here's an excerpt from java.text.CharacterIterator documentation:

This interface defines a protocol for bidirectional iteration over text. The iterator iterates over a bounded sequence of characters. [...] The methods previous() and next() are used for iteration. They return DONE if [...], signaling that the iterator has reached the end of the sequence.

static final char DONE: Constant that is returned when the iterator has reached either the end or the beginning of the text. The value is \uFFFF, the "not a character" value which should not occur in any valid Unicode string.

The last sentence of that description (italicized in the original) is what I'm having trouble understanding. From my tests, a Java String can most certainly contain \uFFFF, and there doesn't seem to be any problem with it, except with the prescribed CharacterIterator traversal idiom, which breaks because of a false positive (e.g. next() returns '\uFFFF' == DONE when it's not really "done").

Here's a snippet to illustrate the "problem" (see also on ideone.com):

import java.text.*;
public class CharacterIteratorTest {

    // this is the prescribed traversal idiom from the documentation
    public static void traverseForward(CharacterIterator iter) {
        for (char c = iter.first(); c != CharacterIterator.DONE; c = iter.next()) {
            System.out.print(c);
        }
    }

    public static void main(String[] args) {
        String s = "abc\uFFFFdef";

        System.out.println(s);
        // abc?def

        System.out.println(s.indexOf('\uFFFF'));
        // 3
        
        traverseForward(new StringCharacterIterator(s));
        // abc
    }
}

So what is going on here?

Is the prescribed traversal idiom "broken" because it makes the wrong assumption about \uFFFF?
Is the StringCharacterIterator implementation "broken" because it doesn't e.g. throw an IllegalArgumentException if in fact \uFFFF is forbidden in valid Unicode strings?
Is it actually true that valid Unicode strings should not contain \uFFFF?
If that's true, then is Java "broken" for violating the Unicode specification by (for the most part) allowing String to contain \uFFFF anyway?

Link: http://www.mail-archive.com/java-dev@lucene.apache.org/msg15483.html "(LUCENE-1241) `0xffff char` is not a string terminator" — polygenelubricants, Aug 14 '10 at 09:13
Link: http://icu-project.org/apiref/icu4c/utf_8h.html - "ICU APIs designed before ICU 2.4 usually define service-specific "done" values, mostly `0xffff`. Those may need to be distinguished from actual `U+ffff` text contents by calling functions like..." — polygenelubricants, Aug 14 '10 at 09:14
Java is "broken" beyond repair when it comes to dealing with Unicode, but not for the nitpicky reason you mention. Java was created before Unicode 3.1 was out, so there weren't more than 2**16 code points. That is why they very stupidly chose char to be 16 bits, and now we have a fully broken set of methods like *length()* and *charAt(...)* that will keep confusing people for decades. This is a **MUCH** more serious, deeply, sadly and profoundly broken aspect of the String class than the nitpicky one you're mentioning. — NoozNooz42, Aug 14 '10 at 14:44
but as always... Sending love to you for focusing on the big picture and not on obscure JLS details. **THAT** is time well-spent as always ;) — NoozNooz42, Aug 14 '10 at 14:47
@NoozNooz42: Well, back in the day, UCS-2 *was* a good way of dealing with Unicode. And UTF-16 is a reasonable step from there. Just don't assume that `char` actually holds a character (or codepoint – as a character in the usual [human] sense would be a grapheme, which can be arbitrarily long), but instead holds a single UTF-16 code unit. It's not broken, it's just a certain way of dealing with Unicode. It's the same with Windows, by the way. — Joey, Aug 14 '10 at 17:30

score 29 · Accepted Answer · edited May 23 '17 at 12:00

EDIT (2013-12-17): Peter O. brings up an excellent point below, which renders this answer wrong. Old answer below, for historical accuracy.

Answering your questions:

Is the prescribed traversal idiom "broken" because it makes the wrong assumption about \uFFFF?

No. U+FFFF is a so-called non-character. From Section 16.7 of the Unicode Standard:

Noncharacters are code points that are permanently reserved in the Unicode Standard for internal use. They are forbidden for use in open interchange of Unicode text data.

...

The Unicode Standard sets aside 66 noncharacter code points. The last two code points of each plane are noncharacters: U+FFFE and U+FFFF on the BMP, U+1FFFE and U+1FFFF on Plane 1, and so on, up to U+10FFFE and U+10FFFF on Plane 16, for a total of 34 code points. In addition, there is a contiguous range of another 32 noncharacter code points in the BMP: U+FDD0..U+FDEF.
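The arithmetic in that passage (34 plane-final code points plus 32 in U+FDD0..U+FDEF) can be checked with a short sketch. The class and method names here are made up for illustration and are not part of any Java API:

```java
class NoncharacterCount {
    // True for the code points the quoted passage lists as noncharacters.
    static boolean isNoncharacter(int cp) {
        if (cp < 0 || cp > Character.MAX_CODE_POINT) {
            return false; // not a Unicode code point at all
        }
        // The contiguous block of 32 in the BMP: U+FDD0..U+FDEF.
        if (cp >= 0xFDD0 && cp <= 0xFDEF) {
            return true;
        }
        // The last two code points of each plane: low 16 bits are FFFE or FFFF.
        return (cp & 0xFFFE) == 0xFFFE;
    }

    // Walks every code point from U+0000 to U+10FFFF and counts the matches.
    static int countNoncharacters() {
        int count = 0;
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            if (isNoncharacter(cp)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countNoncharacters()); // 66
    }
}
```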

Is the StringCharacterIterator implementation "broken" because it doesn't e.g. throw an IllegalArgumentException if in fact \uFFFF is forbidden in valid Unicode strings?

Not quite. Applications are allowed to use those code points internally in any way they want. Quoting the standard again:

Applications are free to use any of these noncharacter code points internally but should never attempt to exchange them. If a noncharacter is received in open interchange, an application is not required to interpret it in any way. It is good practice, however, to recognize it as a noncharacter and to take appropriate action, such as replacing it with U+FFFD REPLACEMENT CHARACTER, to indicate the problem in the text. It is not recommended to simply delete noncharacter code points from such text, because of the potential security issues caused by deleting uninterpreted characters.

So while you should never encounter such a string from the user, another application or a file, you may well put it into a Java String if you know what you're doing (this basically means that you cannot use the CharacterIterator on that string, though).
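The practice the quoted passage describes, replacing a received noncharacter with U+FFFD rather than deleting it, might look roughly like the following. This is only a sketch: `replaceNoncharacters` and `isNoncharacter` are hypothetical helpers, not JDK methods, and whether an application should sanitize input this way is its own decision.

```java
class NoncharacterSanitizer {
    // Hypothetical helper: true for U+FDD0..U+FDEF and the last two code points of each plane.
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    // Returns a copy of input with every noncharacter replaced by U+FFFD.
    // Nothing is deleted, so the number of code points stays the same.
    static String replaceNoncharacters(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);           // handles surrogate pairs
            out.appendCodePoint(isNoncharacter(cp) ? 0xFFFD : cp);
            i += Character.charCount(cp);            // advance by 1 or 2 chars
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String cleaned = replaceNoncharacters("abc\uFFFFdef");
        System.out.println(cleaned.equals("abc\uFFFDdef")); // true
    }
}
```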

Is it actually true that valid Unicode strings should not contain \uFFFF?

As quoted above, any string used for interchange must not contain them. Within your application you're free to use them in whatever way you want.

Of course, a Java char, being just a 16-bit unsigned integer, doesn't really care about the value it holds either.

If that's true, then is Java "broken" for violating the Unicode specification by (for the most part) allowing String to contain \uFFFF anyway?

No. In fact, the section on noncharacters even suggests the use of U+FFFF as sentinel value:

In effect, noncharacters can be thought of as application-internal private-use code points. Unlike the private-use characters discussed in Section 16.5, Private-Use Characters, which are assigned characters and which are intended for use in open interchange, subject to interpretation by private agreement, noncharacters are permanently reserved (unassigned) and have no interpretation whatsoever outside of their possible application-internal private uses.

U+FFFF and U+10FFFF. These two noncharacter code points have the attribute of being associated with the largest code unit values for particular Unicode encoding forms. In UTF-16, U+FFFF is associated with the largest 16-bit code unit value, FFFF₁₆. U+10FFFF is associated with the largest legal UTF-32 32-bit code unit value, 10FFFF₁₆. This attribute renders these two noncharacter code points useful for internal purposes as sentinels. For example, they might be used to indicate the end of a list, to represent a value in an index guaranteed to be higher than any valid character value, and so on.

CharacterIterator follows this in that it returns U+FFFF when no more characters are available. Of course, this means that if you have another use for that code point in your application you may consider using a different non-character for that purpose since U+FFFF is already taken – at least if you're using CharacterIterator.

Peter O. · Answer 2 · 2021-11-15T09:54:18.113

Some of these answers have changed in the meantime.

The Unicode Consortium recently issued Corrigendum 9 that clarifies the role of noncharacters, including U+FFFF, in Unicode strings. It states that while noncharacters are intended for internal use, they can occur legally in Unicode strings.

That means the statement "The value is \uFFFF, the 'not a character' value which should not occur in any valid Unicode string." is now incorrect, since U+FFFF can occur in valid Unicode strings.

Accordingly:

Is the StringCharacterIterator implementation "broken" because it doesn't throw an exception if \uFFFF is forbidden in valid Unicode strings? Since U+FFFF is valid, this doesn't apply here. But an implementation has wide flexibility in signaling an error when it encounters text that's illegal for other reasons, such as unpaired surrogate code points, which still remain illegal (see conformance clause C10 in chapter 3 of the Unicode Standard).
Is it true that valid Unicode strings should not contain \uFFFF? U+FFFF is not illegal in a valid Unicode string. However, U+FFFF is reserved as a noncharacter and so will generally not occur in meaningful text. The corrigendum deleted the text saying that noncharacters "should never be interchanged"; it notes that interchange happens "anytime a Unicode string crosses an API boundary", which includes the StringCharacterIterator API at issue here.
If that's true, then is Java "broken" for violating the Unicode specification by allowing String to contain \uFFFF anyway? The specification for java.lang.String says "A String represents a string in the UTF-16 format." U+FFFF is legal in a Unicode string, so Java doesn't violate Unicode by allowing a String to contain U+FFFF.
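
Since U+FFFF can legitimately appear in the text, one way to keep using a CharacterIterator is to test the index instead of the returned value. The following is a minimal sketch, relying on the documented contract that next() leaves the index at getEndIndex() once it runs past the last character; the class name is made up for illustration.

```java
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

class IndexBoundedTraversal {
    // Collects every char the iterator yields, including a real U+FFFF.
    // The loop ends when the index reaches the end, not when DONE is seen.
    static String traverseForward(CharacterIterator iter) {
        StringBuilder seen = new StringBuilder();
        for (char c = iter.first(); iter.getIndex() < iter.getEndIndex(); c = iter.next()) {
            seen.append(c);
        }
        return seen.toString();
    }

    public static void main(String[] args) {
        String s = "abc\uFFFFdef";
        String visited = traverseForward(new StringCharacterIterator(s));
        System.out.println(visited.length());  // 7
        System.out.println(visited.equals(s)); // true
    }
}
```

With the value-based idiom from the question, the same input stops after "abc".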

In general, a higher-level protocol can impose its own rules on top of the Unicode Standard, on the question of which characters are allowed in documents accepted by the protocol. This is the case, for example, in the XML specification. In general, U+FFFF (and other Unicode scalar values) can validly appear in a text string unless a higher-level protocol (such as XML) specifies otherwise. Indeed, there is a current effort (as of November 15, 2021) to restrict the use of Unicode bidirectional override characters in certain programming languages such as Rust, to reduce security attacks due to visual confusion.

Good point. I won't edit my older answer now, though, but the accept should shift, I guess. — Joey, Dec 16 '13 at 23:55
But is String "broken" in that `"\uFFFF".getBytes("UTF-8");` doesn't throw an error, but returns the appropriate bytes: EF BF BF (on Java 7 anyway)? — theory, Jul 18 '14 at 05:08
@theory: No, not in that case: here, the UTF-8 conversion uses the string's length, rather than a noncharacter such as U+FFFF, to determine the end of the string. — Peter O., Jul 18 '14 at 13:14
What about XML containing data referencing unicode. Is it still invalid XML if this internal unicode character is occurring, because in XML context this is not allowed? Or should also XML libraries handle this case? — jan, Nov 15 '21 at 09:15
@jan : XML imposes its own rules on top of the Unicode Standard, on the question of which characters are allowed in XML documents. In general, U+FFFF (and other Unicode scalar values) can validly appear in a text string unless a higher-level protocol (such as XML) specifies otherwise. Indeed, there is a current effort to restrict the use of Unicode bidirectional override characters in certain programming languages such as Rust, to reduce security attacks due to visual confusion. — Peter O., Nov 15 '21 at 09:46

score 3 · Answer 3 · answered Aug 19 '10 at 14:17

Is the StringCharacterIterator implementation "broken" because it doesn't e.g. throw an IllegalArgumentException if in fact \uFFFF is forbidden in valid Unicode strings?

Not strictly according to Unicode, but it's inconsistent with the rest of Java's string handling interfaces, and that inconsistency could have very unpleasant effects. Think of all the security holes we've had from string processing that does vs. doesn't treat \0 as a terminator.

I would strongly avoid the CharacterIterator interface.

score 2 · Answer 4 · answered Aug 14 '10 at 09:45

Yes, CharacterIterator's use of 0xFFFF as the DONE value is a bit of an anomaly. But it all makes sense from the perspective of efficient text processing.

The String class does not forbid the 0xFFFF "non-character" and other reserved or unmapped Unicode codepoints. To do so would require the String constructors to check each supplied char value. It would also present problems with handling text containing Unicode code points defined in a future (with respect to the JVM) version of Unicode.

On the other hand, the CharacterIterator interface is designed to allow iteration by calling just one method, i.e. next(). They have decided to use a distinguished char value to indicate "no more", because the other alternatives are:

throwing an exception (which is too expensive), or
using int as the return type, which makes life more complicated for the caller.
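
To make the second alternative concrete, here is a rough sketch of an iterator whose next() returns int, with -1 as the end marker (the same convention java.io.Reader.read() uses). `IntReturningIterator` is a hypothetical class written for this illustration, not something in the JDK.

```java
class IntReturningIterator {
    // End marker outside the char range 0..0xFFFF, so it can never collide with text.
    static final int END = -1;

    private final String text;
    private int pos = 0;

    IntReturningIterator(String text) {
        this.text = text;
    }

    // Returns the next UTF-16 code unit as an int, or END when none are left.
    int next() {
        return pos < text.length() ? text.charAt(pos++) : END;
    }

    // Counts the code units visited; the caller has to work with int, not char.
    static int countUnits(String s) {
        IntReturningIterator it = new IntReturningIterator(s);
        int n = 0;
        for (int c = it.next(); c != END; c = it.next()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(countUnits("abc\uFFFFdef")); // 7
    }
}
```

The cost is the one the answer names: every caller receives an int and has to cast back to char before using the value as text.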

If the CharacterIterator is used for "real" Unicode text, then the fact that you cannot include 0xFFFF is not an issue. Valid Unicode text doesn't contain this code point. (In fact, the reason for 0xFFFF being reserved as a non-character is to support applications where Unicode text is represented as strings terminated by a non-character value. The use of 0xFFFF as a character would break that completely.)

The bottom line is:

if you want strict Unicode strings then don't use String, and
if you want to iterate over Java Strings that contain 0xFFFF values, then don't use a CharacterIterator.

Future assigned code points are no problem, as the set of non-character code points which are explicitly forbidden for interchange is unchanging and fixed. — Joey, Aug 14 '10 at 12:01
@Joey - Actually, PeterO's answer directly contradicts your statement about "unchanging and fixed". Non-character codepoints are >>now<< officially permitted for interchange. Officially. — Stephen C, Nov 03 '15 at 02:29
