Does `std::wregex` support utf-16/unicode or only UCS-2?

Question

With C++11, the regex library was introduced into the standard library.

On the Windows/MSVC platform, wchar_t has a size of 2 bytes (16 bits), and wchar_t* is normally UTF-16 when interfacing with the system/platform (e.g. CreateFileW).

However, it seems that std::regex isn't utf-8 or does not support it, so I'm wondering whether std::wregex supports UTF-16 or just UCS-2?

I do not find any mention of this (Unicode or the like) in the documentation. In other languages normalization takes place.

The question is:

Does std::wregex represent UCS-2 when wchar_t has a size of 2?

AFAIK no `std::regex` implementation *properly* supports Unicode. Also, they are *all* extremely slow. Don’t use them, use something like re2 instead. See https://www.reddit.com/r/cpp/comments/e16s1m/what_is_wrong_with_stdregex/ and https://reddit.com/r/cpp/comments/aetf17/stdregex_replacestdchronohigh_resolution_clocknow/edsfwe1/ — Konrad Rudolph, Nov 27 '19 at 09:50
@KonradRudolph speed may be of lesser concern sometimes, or for some parts of the code, but it's a good comment though. — darune, Nov 27 '19 at 09:51
It’s a concern when it’s eight hundred times (!!!) slower than other implementations. — Konrad Rudolph, Nov 27 '19 at 09:52
Windows uses UTF16 and Visual C++ itself supports standard UTF16 strings, ie `u16string`. C++ itself doesn't have special handling for UTF8 until C++20 and UTF8 strings are treated as char arrays - regex would work if you first encoded your non-English strings as UTF8. — Panagiotis Kanavos, Nov 27 '19 at 12:28
Now [std::wregex](http://www.cplusplus.com/reference/regex/wregex/) is a `typedef basic_regex<wchar_t> wregex;` so it works with `wchar_t`, ie UCS-2. That shouldn't be a concern unless you want to handle Chinese or emojis. You can probably create a `basic_regex<char16_t>` to handle UTF16 strings. For whatever reason though, that's left as an exercise to the reader — Panagiotis Kanavos, Nov 27 '19 at 12:33
@PanagiotisKanavos I do want that, but that's another concern. Why not post it as an answer, though? — darune, Nov 27 '19 at 12:37
`I do not find any mention of this (unicode or the like) in the documentation.` depends on the documentation you're looking at. C++20 will bring basic UTF8 support and `char8_t`, with conversions going out to C++23. `char8_t` alone will be a headache for any code that assumed UTF8==char. [String and Character Literals](https://learn.microsoft.com/en-us/cpp/cpp/string-and-character-literals-cpp?view=vs-2019) in the Visual C++ docs explains how Unicode is handled today, and how the compiler handles conversions between `wchar_t` and `char16_t` — Panagiotis Kanavos, Nov 27 '19 at 12:40
@darune nope. No way. Every time I try to post an answer about C++ and Unicode I get roasted because I don't know all the intricacies of the committee process. I've worked far too long with native Unicode languages like C#, Java and Javascript. Besides, the actual answer to your question is `Not before 2023 at least`. — Panagiotis Kanavos, Nov 27 '19 at 12:42
@darune and as I said, the committee left `basic_regex<char16_t>` out of the library itself. They must have had a reason, eg too tricky conversions? Edge cases? They plan to replace `std::regex` altogether? — Panagiotis Kanavos, Nov 27 '19 at 12:45

score 1 · Accepted Answer · edited Jun 20 '20 at 09:12

The C++ standard doesn't impose any encoding on std::string and std::wstring. They're simply sequences of CharT. Only std::u8string, std::u16string and std::u32string have a defined encoding.

Similarly, std::regex and std::wregex are just std::basic_regex instantiated for char and wchar_t. Their constructors accept a std::basic_string, and whatever encoding that string uses is also the encoding the std::basic_regex works with. So what you said, "std::regex isn't utf-8 or does not support it", is wrong. If the current locale is UTF-8 then std::regex and std::string will be UTF-8 (yes, modern Windows does support a UTF-8 locale).

On Windows std::wstring uses UTF-16, so std::wregex also uses UTF-16. UCS-2 is deprecated and no one uses it anymore. You don't even need to differentiate between them, since UCS-2 is just a subset of UTF-16, unless you use some very old tool that cuts in the middle of a surrogate pair. String searches in UTF-16 work exactly the same as in UCS-2 because UTF-16 is self-synchronizing: a well-formed needle string can never match starting from the middle of a character in the haystack. The same applies to UTF-8. If a tool doesn't understand UTF-16, then it's highly likely that it doesn't know that UTF-8 is variable-length either, and will truncate UTF-8 in the middle of a character as well.

> Self-synchronization: The leading bytes and the continuation bytes do not share values (continuation bytes start with 10 while single bytes start with 0 and longer lead bytes start with 11). This means a search will not accidentally find the sequence for one character starting in the middle of another character. It also means the start of a character can be found from a random position by backing up at most 3 bytes to find the leading byte. An incorrect character will not be decoded if a stream starts mid-sequence, and a shorter sequence will never appear inside a longer one.
>
> https://en.wikipedia.org/wiki/UTF-8#Description

The only things you need to care about are avoiding truncation in the middle of a character and normalizing the string before matching if necessary. The former issue can be avoided in UCS-2-only regex engines if you never put characters outside the BMP in a character class, as noted in the comments. Use a group instead.

> In other languages normalization takes place.

This is wrong. Some languages may do normalization before matching a regex, but that definitely doesn't apply to all "other languages"

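If you want a little more assurance, you could use std::basic_regex<char8_t> and std::basic_regex<char16_t> for UTF-8 and UTF-16 respectively. Note, however, that the standard only provides std::regex_traits specializations for char and wchar_t, so you would have to supply your own traits class. You'll still need a UTF-16-aware engine, though; otherwise that will still only work for regex strings that contain nothing but plain words.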

The better solution may be switching to another library, such as ICU regex. You can check Comparison of regular expression engines for some suggestions. It even has a column indicating native UTF-16 support for each library.

After some more testing, it seems my main problem is matching a group, e.g. an emoji with UTF-16. It doesn't match, no matter whether I invert the group. — darune, Nov 28 '19 at 07:37
@darune you mean character class, like `[]`? because a group like `()` should be matched without any problem. A class is more tricky because the engine must be UTF-16 aware — phuclv, Nov 28 '19 at 08:17
It seems I meant character class, not group - but really good point — darune, Apr 14 '20 at 06:48
