This question is an extension of Do C++11 regular expressions work with UTF-8 strings?
#include <regex>
if (std::regex_match ("中", std::regex("中") )) // "\u4e2d" also works
std::cout << "matched\n";
The program is compiled on Mac Mountain Lion with clang++ with the following options:
clang++ -std=c++0x -stdlib=libc++
The code above works. This is a standard range regex "[一-龠々〆ヵヶ]" for matching any Japanese Kanji or Chinese character. It works in Javascript and Ruby, but I can't seem to get ranges working in C++11, even with using a similar version [\u4E00-\u9fa0]. The code below does not match the string.
if (std::regex_match ("中", std::regex("[一-龠々〆ヵヶ]")))
std::cout << "range matched\n";
Changing locale hasn't helped either. Any ideas?
EDIT
So I have found that all ranges work if you add a + to the end. In this case [一-龠々〆ヵヶ]+, but if you add {1} [一-龠々〆ヵヶ]{1} it does not work. Moreover, it seems to overreach it's boundaries. It won't match latin characters, but it will match は which is \u306f and ぁ which is \u3041. They both lie below \u4E00
nhahtdh also suggested regex_search which also works without adding + but it still runs into the same problem as above by pulling values outside of its range. Played with the locales a bit as well. Mark Ransom suggests it treats the UTF-8 string as a dumb set of bytes, I think this is possibly what it is doing.
Further pushing the theory that UTF-8 is getting jumbled some how, [a-z]{1} and [a-z]+ matches a, but only [一-龠々〆ヵヶ]+ matches any of the characters, not [一-龠々〆ヵヶ]{1}.