I am looking for simple practical C++ examples on how to use ICU.
The ICU home page is not helpful in this regard. 
I am not interested in the what and why of Unicode.
The few demos are not self-contained, compilable examples (where are the includes?).
I am looking for something like a 'Hello, World' of:
- How to open and read a file encoded in UTF-8
- How to use STL / Boost string functions to manipulate UTF-8 encoded strings
- etc.
 
    
- Did you see this question: http://stackoverflow.com/questions/402283/stdwstring-vs-stdstring – yasouser May 15 '11 at 20:37
2 Answers
There's no special way to read a UTF-8 file unless you need to process a byte order mark (BOM). Because of the way UTF-8 encoding works, functions that read ANSI strings can also read UTF-8 strings.
The following code will read the contents of a file (ANSI or UTF-8) and do a couple of conversions.
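#include <fstream>
#include <string>
#include <unicode/unistr.h>
#include <unicode/ustring.h>
int main(int argc, char** argv) {
    std::ifstream f("...");
    std::string s;
    while (std::getline(f, s)) {
        // at this point s contains one line of raw bytes,
        // which may be ANSI or UTF-8 encoded
        // convert std::string to ICU's UnicodeString (UTF-16);
        // bytes that are not valid UTF-8 become U+FFFD
        icu::UnicodeString ucs = icu::UnicodeString::fromUTF8(icu::StringPiece(s));
        // convert UnicodeString to std::wstring with u_strToWCS,
        // which handles both 16-bit and 32-bit wchar_t
        std::wstring ws(ucs.length(), L'\0');
        int32_t n = 0;
        UErrorCode status = U_ZERO_ERROR;
        u_strToWCS(&ws[0], ucs.length(), &n, ucs.getBuffer(), ucs.length(), &status);
        if (U_SUCCESS(status)) ws.resize(n); else ws.clear();
    }
}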
Take a look at the online API reference.
If you want to use ICU through Boost, see Boost.Locale.
 
    
- This code is wrong for any platform where wchar_t is not 16-bit, as ucs.getBuffer() always returns a pointer to UTF-16 data. – wjl Jun 07 '11 at 23:08
- Is `std::getline` sufficient ? I'm assuming it wouldn't recognize a `U+2028` for instance ? – lmat - Reinstate Monica Mar 06 '14 at 21:54
- @LimitedAtonement - `std::getline()` doesn't know anything about character encoding; it simply reads a string of bytes until it sees a `\n`. `UnicodeString::fromUTF8()` is responsible for recognizing that a series of bytes represents a Unicode code point and convert them accordingly. In this case, the UTF-8 representation of U+2028 is `E2 80 A8`. `std::getline()` will have no problem reading those bytes. – Ferruccio Mar 06 '14 at 22:52
- @Ferruccio the comment "at this point s contains a line of text...UTF-8 encoded" is incorrect then? Any (multi-byte) character with a `0x0a` (`\n`) byte in it will be slaughtered, right ? (I guess I'll have to TIAS :) ) – lmat - Reinstate Monica Mar 07 '14 at 14:30
- @Ferruccio It appears that I can stand to be corrected. `\n` in UTF-8 only occurs in the single-byte character `0x0a`. The problem exists if the input text is UTF-16, however, I think. – lmat - Reinstate Monica Mar 07 '14 at 14:38
- @LimitedAtonement - you could always use `std::wifstream` to process UTF16 data, but that would require that you know its format before opening the file. – Ferruccio Mar 07 '14 at 15:41
- Removed downvote because the question is fixed. Thanks! – Cheers and hth. - Alf Jul 06 '17 at 17:57
- ICU has a function u_strToWCS which can convert a UnicodeString to a std::wstring – Superfly Jon Feb 24 '23 at 14:20
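The point settled in the comments above (the byte `0x0a` never appears inside a multi-byte UTF-8 sequence, so splitting on `\n` is safe) can be checked with a small standard-library sketch. The names `split_lines` and `count_code_points` are hypothetical, and the counting function assumes well-formed UTF-8.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits a byte buffer on '\n', exactly as a
// std::getline loop over a file would.
inline std::vector<std::string> split_lines(const std::string& bytes) {
    std::istringstream in(bytes);
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line))
        lines.push_back(line);
    return lines;
}

// Hypothetical helper: counts code points in well-formed UTF-8 by
// skipping continuation bytes, which always look like 10xxxxxx.
inline std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char byte : utf8)
        if ((byte & 0xC0) != 0x80)
            ++n;
    return n;
}
```

Given the bytes for `a`, U+2028 (`E2 80 A8`), `b`, `\n`, `c`, `split_lines` yields two lines, and the first one keeps all five bytes, which `count_code_points` reports as three code points. As noted in the comments, none of this holds for UTF-16 input.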

- ICU ≠ Boost, so you will find examples of how to use ICU functions to manipulate strings, but not Boost ones.
- Which samples are you looking at? There are samples within the ICU source tree, under icu/source/samples. I think the converter samples there open and close UTF-8 files. See also icu/source/extras/uconv, which is an 'iconv'-like application.
- More samples are at http://source.icu-project.org/repos/icu/icuapps/trunk/
Hope this helps.
 
    