Here's an excerpt from java.text.CharacterIterator documentation:
This interface defines a protocol for bidirectional iteration over text. The iterator iterates over a bounded sequence of characters. [...] The methods previous() and next() are used for iteration. They return DONE if [...], signaling that the iterator has reached the end of the sequence.
static final char DONE: Constant that is returned when the iterator has reached either the end or the beginning of the text. The value is \uFFFF, the "not a character" value which should not occur in any valid Unicode string.
The final claim in that description, that \uFFFF should not occur in any valid Unicode string, is what I'm having trouble understanding. From my tests, a Java String can most certainly contain \uFFFF, and there doesn't seem to be any problem with it. The one obvious exception is the prescribed CharacterIterator traversal idiom, which breaks because of a false positive (e.g. next() returns '\uFFFF' == DONE when it's not really "done").
Here's a snippet to illustrate the "problem" (see also on ideone.com):
    import java.text.*;

    public class CharacterIteratorTest {

        // this is the prescribed traversal idiom from the documentation
        public static void traverseForward(CharacterIterator iter) {
            for (char c = iter.first(); c != CharacterIterator.DONE; c = iter.next()) {
                System.out.print(c);
            }
        }

        public static void main(String[] args) {
            String s = "abc\uFFFFdef";

            System.out.println(s);
            // abc?def

            System.out.println(s.indexOf('\uFFFF'));
            // 3

            traverseForward(new StringCharacterIterator(s));
            // abc
        }
    }
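For comparison, here is a minimal sketch of a traversal that checks the iterator's index against getEndIndex() instead of comparing each character to DONE. It is not from the documentation; the class name IndexBasedTraversal and the helper traverseByIndex are hypothetical names of my own. With the same input it appears to visit every character, which suggests the false positive comes from the sentinel comparison rather than from the String itself:

```java
import java.text.CharacterIterator;
import java.text.StringCharacterIterator;

public class IndexBasedTraversal {

    // Hypothetical helper (not part of the JDK): collects the characters
    // visited when the loop condition compares the current index with
    // getEndIndex() rather than comparing each char with DONE.
    public static String traverseByIndex(CharacterIterator iter) {
        StringBuilder visited = new StringBuilder();
        for (char c = iter.first(); iter.getIndex() < iter.getEndIndex(); c = iter.next()) {
            visited.append(c);
        }
        return visited.toString();
    }

    public static void main(String[] args) {
        String s = "abc\uFFFFdef";
        String visited = traverseByIndex(new StringCharacterIterator(s));

        System.out.println(visited.length());
        // 7

        System.out.println(visited.equals(s));
        // true
    }
}
```

This relies on next() moving the index to getEndIndex() once it steps past the last character, which is how the CharacterIterator documentation describes it.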
So what is going on here?
- Is the prescribed traversal idiom "broken" because it makes the wrong assumption about \uFFFF?
- Is the StringCharacterIterator implementation "broken" because it doesn't, e.g., throw an IllegalArgumentException if \uFFFF is in fact forbidden in valid Unicode strings?
- Is it actually true that valid Unicode strings should not contain \uFFFF?
- If that's true, then is Java "broken" for violating the Unicode specification by (for the most part) allowing String to contain \uFFFF anyway?