
C++ - STD - Wstring Vs STD - String - Stack Overflow


31/8/2022 c++ - std::wstring VS std::string - Stack Overflow

std::wstring VS std::string
Asked 13 years, 8 months ago Modified 10 months ago Viewed 381k times

I am not able to understand the differences between std::string and std::wstring. I know
wstring supports wide characters such as Unicode characters. I have got the following
questions:

1. When should I use std::wstring over std::string ?

2. Can std::string hold the entire ASCII character set, including the special characters?
3. Is std::wstring supported by all popular C++ compilers?

4. What is exactly a "wide character"?

c++ string unicode c++-faq wstring

Share Follow edited Jan 31, 2013 at 21:33 asked Dec 31, 2008 at 4:08
Rapptz Appu
20.4k 5 71 86

13 The ASCII character set doesn't have a lot of "special" characters, the most exotic is probably `
(backquote). std::string can hold about 0.025% of all Unicode characters (usually, 8 bit char) – MSalters
Jan 2, 2009 at 14:24

4 Good information about wide characters and which type to use can be found here:
programmers.stackexchange.com/questions/102205/… – Yariv Mar 14, 2012 at 11:19

15 Well, and since we are in 2012, utf8everywhere.org was written. It pretty much answers all questions
about rights and wrongs with C++/Windows. – Pavel Radzivilovsky Jun 21, 2012 at 4:19

53 @MSalters: std::string can hold 100% of all Unicode characters, even if CHAR_BIT is 8. It depends on the
encoding of std::string, which may be UTF-8 on the system level (like almost everywhere except for
windows) or on your application level. Native narrow encoding doesn't support Unicode? No problem,
just don't use it, use UTF-8 instead. – Yakov Galka Jun 22, 2012 at 10:19

8 Great reading on this topic: utf8everywhere.org – Timothy Shields Aug 5, 2013 at 18:29

12 Answers

string ? wstring ?

std::string is a basic_string templated on a char, and std::wstring on a wchar_t.

char vs. wchar_t



https://stackoverflow.com/questions/402283/stdwstring-vs-stdstring 1/16

char is supposed to hold a character, usually an 8-bit character. wchar_t is supposed to
hold a wide character, and then, things get tricky: On Linux, a wchar_t is 4 bytes, while on
Windows, it's 2 bytes.

What about Unicode, then?


The problem is that neither char nor wchar_t is directly tied to unicode.

On Linux?

Let's take a Linux OS: My Ubuntu system is already unicode aware. When I work with a char
string, it is natively encoded in UTF-8 (i.e. Unicode string of chars). The following code:

#include <cstddef>
#include <cstring>
#include <cwchar>
#include <iostream>

int main()
{
    const char text[] = "olé";

    std::cout << "sizeof(char) : " << sizeof(char) << "\n";
    std::cout << "text : " << text << "\n";
    std::cout << "sizeof(text) : " << sizeof(text) << "\n";
    std::cout << "strlen(text) : " << std::strlen(text) << "\n";

    // Print the numeric value of every char (byte) in the narrow string.
    std::cout << "text(ordinals) :";
    for (std::size_t i = 0, iMax = std::strlen(text); i < iMax; ++i)
    {
        unsigned char c = static_cast<unsigned char>(text[i]);
        std::cout << " " << static_cast<unsigned int>(c);
    }
    std::cout << "\n\n";

    // - - -

    const wchar_t wtext[] = L"olé";

    std::cout << "sizeof(wchar_t) : " << sizeof(wchar_t) << "\n";
    //std::cout << "wtext : " << wtext << "\n"; <- error
    std::cout << "wtext : UNABLE TO CONVERT NATIVELY." << "\n";
    std::wcout << L"wtext : " << wtext << "\n";
    std::cout << "sizeof(wtext) : " << sizeof(wtext) << "\n";
    std::cout << "wcslen(wtext) : " << std::wcslen(wtext) << "\n";

    // Print the numeric value of every wchar_t element in the wide string.
    std::cout << "wtext(ordinals) :";
    for (std::size_t i = 0, iMax = std::wcslen(wtext); i < iMax; ++i)
    {
        // unsigned long is wide enough for both 2-byte and 4-byte wchar_t.
        unsigned long wc = static_cast<unsigned long>(wtext[i]);
        std::cout << " " << wc;
    }
    std::cout << "\n\n";
}

outputs the following text:



sizeof(char) : 1
text : olé
sizeof(text) : 5
strlen(text) : 4
text(ordinals) : 111 108 195 169

sizeof(wchar_t) : 4
wtext : UNABLE TO CONVERT NATIVELY.
wtext : ol�
sizeof(wtext) : 16
wcslen(wtext) : 3
wtext(ordinals) : 111 108 233

You'll see the "olé" text in char is really constructed by four chars: 111, 108, 195 and 169 (not
counting the trailing zero). (I'll let you study the wchar_t code as an exercise)

So, when working with a char on Linux, you should usually end up using Unicode without
even knowing it. And as std::string works with char , so std::string is already unicode-
ready.

Note that std::string , like the C string API, will consider the "olé" string to have 4
characters, not three. So you should be cautious when truncating/playing with unicode chars
because some combination of chars is forbidden in UTF-8.
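The byte-versus-character distinction above can be sketched as follows. This is only an illustration that assumes the input is valid UTF-8; the helper names utf8_code_points and utf8_truncate are made up for this example and are not part of the standard library.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 string by
// skipping continuation bytes (bit pattern 10xxxxxx). Assumes valid UTF-8.
std::size_t utf8_code_points(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char c : s) {
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Hypothetical helper: keeps at most max_bytes bytes, but never cuts in the
// middle of a multi-byte sequence. Assumes valid UTF-8.
std::string utf8_truncate(const std::string& s, std::size_t max_bytes)
{
    if (s.size() <= max_bytes) {
        return s;
    }
    std::size_t end = max_bytes;
    // Back up while the first dropped byte is a continuation byte.
    while (end > 0 && (static_cast<unsigned char>(s[end]) & 0xC0) == 0x80) {
        --end;
    }
    return s.substr(0, end);
}
```

With the bytes of "olé" written out as "ol\xC3\xA9", size() reports 4 while utf8_code_points reports 3, and truncating to 3 bytes yields "ol" rather than a string ending in a dangling 0xC3 byte.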

On Windows?
On Windows, this is a bit different. Win32 had to support a lot of applications working with
char and with the different charsets/codepages produced all over the world, before the advent of
Unicode.

So their solution was an interesting one: If an application works with char , then the char
strings are encoded/printed/shown on GUI labels using the local charset/codepage on the
machine, which could not be UTF-8 for a long time. For example, "olé" would be "olé" in a
French-localized Windows, but would be something different on a Cyrillic-localized
Windows ("olй" if you use Windows-1251). Thus, "historical apps" will usually still work the
same old way.

For Unicode based applications, Windows uses wchar_t , which is 2-bytes wide, and is
encoded in UTF-16, which is Unicode encoded on 2-bytes characters (or at the very least,
UCS-2, which just lacks surrogate-pairs and thus characters outside the BMP (>= 64K)).

Applications using char are said to be "multibyte" (because each glyph is composed of one or
more chars), while applications using wchar_t are said to be "widechar" (because each glyph is
composed of one or two wchar_t). See the MultiByteToWideChar and WideCharToMultiByte
Win32 conversion APIs for more info.

Thus, if you work on Windows, you badly want to use wchar_t (unless you use a framework
hiding that, like GTK or QT...). The fact is that behind the scenes, Windows works with
wchar_t strings, so even historical applications will have their char strings converted in
wchar_t when using API like SetWindowText() (low level API function to set the label on a
Win32 GUI).

Memory issues?

UTF-32 is 4 bytes per character, so there is not much to add, except that a UTF-8 text and
a UTF-16 text will always use less or the same amount of memory than a UTF-32 text (and
usually less).

If there is a memory issue, then you should know that for most western languages, UTF-8
text will use less memory than the same UTF-16 one.

Still, for other languages (Chinese, Japanese, etc.), the memory used will be either the same, or
slightly larger for UTF-8 than for UTF-16.

All in all, UTF-16 will mostly use 2 and occasionally 4 bytes per character (unless you're
dealing with some kind of esoteric language glyphs (Klingon? Elvish?)), while UTF-8 will spend
from 1 to 4 bytes.

See https://en.wikipedia.org/wiki/UTF-8#Compared_to_UTF-16 for more info.
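To make the memory comparison concrete, here is a small sketch of how many bytes a single code point needs in each encoding. The function names are invented for this illustration, and the input is assumed to be a valid Unicode scalar value (no surrogates, nothing above U+10FFFF).

```cpp
#include <cstddef>

// Hypothetical helper: bytes needed to encode one code point in UTF-8.
std::size_t utf8_bytes(char32_t cp)
{
    if (cp < 0x80)    return 1;  // ASCII
    if (cp < 0x800)   return 2;  // e.g. most accented Latin letters
    if (cp < 0x10000) return 3;  // rest of the BMP, e.g. most CJK
    return 4;                    // outside the BMP
}

// Hypothetical helper: bytes needed in UTF-16 (one code unit or a surrogate pair).
std::size_t utf16_bytes(char32_t cp)
{
    return cp < 0x10000 ? 2 : 4;
}

// Hypothetical helper: UTF-32 always uses 4 bytes per code point.
std::size_t utf32_bytes(char32_t)
{
    return 4;
}
```

For 'a' (U+0061) this gives 1, 2 and 4 bytes; for 'é' (U+00E9) 2, 2 and 4; for '中' (U+4E2D) 3, 2 and 4; and for a code point outside the BMP such as U+1F600, 4 bytes in all three.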

Conclusion
1. When I should use std::wstring over std::string?

On Linux? Almost never (§). On Windows? Almost always (§). On cross-platform code?
Depends on your toolkit...
(§) : unless you use a toolkit/framework saying otherwise

2. Can std::string hold all the ASCII character set including special characters?
Notice: A std::string is suitable for holding a 'binary' buffer, where a std::wstring is
not!

On Linux? Yes. On Windows? Only special characters available for the current locale of
the Windows user.

Edit (After a comment from Johann Gerell): a std::string will be enough to handle
all char -based strings (each char being a number from 0 to 255). But:

1. ASCII is supposed to go from 0 to 127. Higher chars are NOT ASCII.
2. a char from 0 to 127 will be held correctly
3. a char from 128 to 255 will have a meaning depending on your encoding
(unicode, non-unicode, etc.), but a std::string will be able to hold all Unicode glyphs as long as
they are encoded in UTF-8.
3. Is std::wstring supported by almost all popular C++ compilers?

Mostly, with the exception of GCC based compilers that are ported to Windows. It works
on my g++ 4.3.2 (under Linux), and I used Unicode API on Win32 since Visual C++ 6.
4. What is exactly a wide character?

In C/C++, it's a character type written wchar_t which is larger than the simple char
character type. It is meant to hold characters whose indices (like
Unicode glyphs) are larger than 255 (or 127, depending...).


Share Follow edited Oct 31, 2021 at 18:18 answered Dec 31, 2008 at 12:47
Deduplicator paercebal
43.5k 6 62 110 79.3k 37 128 158

4 @gnud: Perhaps wchar_t was supposed to be enough to handle all UCS-2 chars (most UTF-16 chars)
before the advent of UTF-16... Or perhaps Microsoft did have other priorities than POSIX, like giving
easy access to Unicode without modifying the codepaged use of char on Win32. – paercebal Jan 2,
2009 at 20:49

6 @Sorin Sbarnea: UTF-8 could take 1-6 bytes, but apparently the standard limits it to 1-4. See
en.wikipedia.org/wiki/UTF8#Description for more information. – paercebal Jan 13, 2010 at 13:10

9 While this example produces different results on Linux and Windows, the C++ program contains
implementation-defined behavior as to whether olé is encoded as UTF-8 or not. Furthermore, the
reason you cannot natively stream wchar_t * to std::cout is because the types are
incompatible resulting in an ill-formed program and it has nothing to do with the use of encodings.
It's worth pointing out that whether you use std::string or std::wstring depends on your
own encoding preference rather than the platform, especially if you want your code to be portable.
– John Leidegren Aug 9, 2012 at 9:37

15 Windows actually uses UTF-16 and has for quite some time; older versions of Windows did use
UCS-2 but this is not the case any longer. My only issue here is the conclusion that std::wstring
should be used on Windows because it's a better fit for the Unicode Windows API which I think is
fallacious. If your only concern was calling into the Unicode Windows API and not marshalling strings
then sure but I don't buy this as the general case. – John Leidegren Aug 9, 2012 at 18:15

18 @ John Leidegren : If your only concern was calling into the Unicode Windows API
and not marshalling strings then sure : Then, we agree. I'm coding in C++, not JavaScript.
Avoiding useless marshalling or any other potentially costly processing at runtime when it can be
done at compile time is at the heart of that language. Coding against WinAPI and using
std::string is just an unjustified waste of runtime resources. You find it fallacious, and it's Ok, as it
is your viewpoint. My own is that I won't write code with pessimization on Windows just because it
looks better from the Linux side. – paercebal Aug 9, 2012 at 19:48

I recommend avoiding std::wstring on Windows or elsewhere, except when required by the
interface, or anywhere near Windows API calls and respective encoding conversions as a
syntactic sugar.

My view is summarized in http://utf8everywhere.org of which I am a co-author.

Unless your application is API-call-centric, e.g. mainly UI application, the suggestion is to store
Unicode strings in std::string and encoded in UTF-8, performing conversion near API calls. The
benefits outlined in the article outweigh the apparent annoyance of conversion, especially in
complex applications. This is doubly so for multi-platform and library development.

And now, answering your questions:

1. A few weak reasons. It exists for historical reasons, where widechars were believed to be
the proper way of supporting Unicode. It is now used to interface APIs that prefer UTF-16
strings. I use them only in the direct vicinity of such API calls.
2. This has nothing to do with std::string. It can hold whatever encoding you put in it. The
only question is how you treat its content. My recommendation is UTF-8, so it will be able

to hold all Unicode characters correctly. It's a common practice on Linux, but I think
Windows programs should do it also.
3. No.

4. Wide character is a confusing name. In the early days of Unicode, there was a belief that a
character can be encoded in two bytes, hence the name. Today, it stands for "any part of
the character that is two bytes long". UTF-16 is seen as a sequence of such byte pairs (aka
Wide characters). A character in UTF-16 takes either one or two pairs.
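The "one or two pairs" rule can be sketched like this. The function name to_utf16 is hypothetical, and the sketch assumes its argument is a valid Unicode scalar value; it does no validation.

```cpp
#include <string>

// Hypothetical helper: encodes one code point as UTF-16 code units.
// Code points in the BMP take one 16-bit unit; all others take a
// surrogate pair (two 16-bit units). Assumes a valid scalar value.
std::u16string to_utf16(char32_t cp)
{
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;  // leaves a 20-bit value
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
    }
    return out;
}
```

For example, U+00E9 becomes the single unit 0x00E9, while U+1F600 becomes the two units 0xD83D 0xDE00.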

Share Follow edited Nov 8, 2018 at 12:39 answered Dec 29, 2009 at 16:14
user1741137 Pavel Radzivilovsky
4,671 1 17 25 18.5k 4 56 67

Here is my explanation of string encodings in the context of JavaScript: github.com/duzun/string-encode.js/blob/master/… – DUzun Sep 23, 2020 at 21:42

I think your idea of using wstring only on API calls is interesting, but I am a bit confused about getting
data in to the program; right now I am using a stringstream to pipe the data from a fstream into, is it
safe to assume that the C++ standard library is capable of detecting that a text file is UTF-8 and will
construct a string in the right encoding automatically? Or will it interpret the text file as 8 bit chars and
return garbled text? Do the standards say anything about this? – jrh Dec 13, 2020 at 16:03

1 @jrh: The C++ standard library does not check file types or handle encodings. If you stream a UTF8 file
into a std::string , you'll end up with a std::string that contains UTF8, with the pros and cons
that entails. If you stream a UTF8 file into a std::wstring , then you end up with garbage. (Similarly,
streaming a UTF16 file into a std::string produces garbage, but std::wstring would be valid, at
least on Windows) – Mooing Duck Apr 29 at 15:28
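As a minimal sketch of the point made in the comment above: reading a stream into a std::string simply copies bytes, so UTF-8 input stays UTF-8. The helper name read_all is made up for this example; for a real file one would typically pass a std::ifstream opened with std::ios::binary.

```cpp
#include <istream>
#include <iterator>
#include <sstream>  // std::istringstream, handy for trying this out without a file
#include <string>

// Hypothetical helper: copies every remaining byte of the stream into a
// std::string. No encoding detection or conversion takes place.
std::string read_all(std::istream& in)
{
    return std::string(std::istreambuf_iterator<char>(in),
                       std::istreambuf_iterator<char>());
}
```

Feeding it a stream that holds the four UTF-8 bytes of "olé" gives back a std::string of size 4 whose last two bytes are 0xC3 and 0xA9, unchanged.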

1 @MooingDuck yes, I later found that to be the case. On a related note one of the very unfortunate
parts of the standard library is that exception messages are always char* not wchar*, which is
unfortunate in Windows if your exception message has to e.g., include a unicode file name / key / etc.,
or "Failed to parse '견고한 논리' as integer". That does add to the reasoning of "use UTF-8 as much as
possible" because if you used wchars for most of the program instead you'd have to convert to UTF-8
to store an exception message, and that conversion itself can sadly throw an exception. – jrh Apr 29 at
16:08

An important reason not to do this conversion is that WCHAR strings can contain unpaired surrogates.
Filenames with unpaired surrogates exist in the wild (Cygwin uses them, for instance), but are rare
enough that they may be missed in testing. A malicious party could create one to crash your program,
or even do worse if, e.g., a failed conversion doesn't write a terminating NUL. You can work around this
by using a UTF-8 compatible encoding that can roundtrip surrogates, but many Unicode libraries don't
provide that, and of course it isn't UTF-8 so it violates your UTF-8 everywhere advice. – benrg May 16 at
0:23

So, every reader here now should have a clear understanding about the facts, the situation. If
not, then you must read paercebal's outstandingly comprehensive answer [btw: thanks!].
My pragmatic conclusion is shockingly simple: all that C++ (and STL) "character encoding"
stuff is substantially broken and useless. Blame it on Microsoft or not, that will not help
anyway.

My solution, after in-depth investigation, much frustration and the consequential experiences,
is the following:

1. accept, that you have to be responsible on your own for the encoding and conversion
stuff (and you will see that much of it is rather trivial)

2. use std::string for any UTF-8 encoded strings (just a typedef std::string UTF8String )

3. accept that such a UTF8String object is just a dumb, but cheap container. Never ever
access and/or manipulate characters in it directly (no search, replace, and so on). You
could, but you really, really do not want to waste your time writing text
manipulation algorithms for multi-byte strings! Even if other people already did such
stupid things, don't do that! Let it be! (Well, there are scenarios where it makes sense...
just use the ICU library for those).

4. use std::wstring for UCS-2 encoded strings ( typedef std::wstring UCS2String ). This is a
compromise, and a concession to the mess that the WIN32 API introduced. UCS-2 is
sufficient for most of us (more on that later...).

5. use UCS2String instances whenever a character-by-character access is required (read,
manipulate, and so on). Any character-based processing should be done in a NON-multibyte
representation. It is simple, fast, easy.

6. add two utility functions to convert back & forth between UTF-8 and UCS-2:

UCS2String ConvertToUCS2( const UTF8String &str );


UTF8String ConvertToUTF8( const UCS2String &str );

The conversions are straightforward, google should help here ...
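One possible shape for these two functions is sketched below. It is not the answerer's implementation, only an illustration under stated assumptions: code points outside the BMP and malformed sequences are replaced by U+FFFD (a choice made for this sketch), and overlong encodings and encoded surrogates are not rejected.

```cpp
#include <cstddef>
#include <string>

typedef std::string  UTF8String;  // typedef names taken from the answer
typedef std::wstring UCS2String;

// Sketch: decodes UTF-8 into UCS-2 code units, one wchar_t per BMP code point.
UCS2String ConvertToUCS2(const UTF8String& str)
{
    UCS2String out;
    std::size_t i = 0;
    while (i < str.size()) {
        unsigned char lead = static_cast<unsigned char>(str[i++]);
        unsigned long cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else { out.push_back(static_cast<wchar_t>(0xFFFD)); continue; }  // invalid lead byte

        bool ok = true;
        for (std::size_t k = 0; k < extra; ++k) {
            if (i >= str.size() ||
                (static_cast<unsigned char>(str[i]) & 0xC0) != 0x80) {
                ok = false;  // truncated or malformed sequence
                break;
            }
            cp = (cp << 6) | (static_cast<unsigned char>(str[i]) & 0x3F);
            ++i;
        }
        if (!ok || cp > 0xFFFF) {
            cp = 0xFFFD;  // not representable in UCS-2
        }
        out.push_back(static_cast<wchar_t>(cp));
    }
    return out;
}

// Sketch: encodes UCS-2 code units as UTF-8 (1 to 3 bytes each).
UTF8String ConvertToUTF8(const UCS2String& str)
{
    UTF8String out;
    for (std::size_t i = 0; i < str.size(); ++i) {
        unsigned long cp = static_cast<unsigned long>(str[i]);
        if (cp > 0xFFFF) {
            cp = 0xFFFD;  // only possible where wchar_t is wider than 16 bits
        }
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

Converting the UTF-8 bytes of "olé" gives three elements with the last one equal to 0xE9, and converting back restores the original four bytes. Production code would more likely rely on a tested library or, on Windows, on MultiByteToWideChar and WideCharToMultiByte.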

That's it. Use UTF8String wherever memory is precious and for all UTF-8 I/O. Use UCS2String
wherever the string must be parsed and/or manipulated. You can convert between those two
representations any time.

Alternatives & Improvements

conversions from & to single-byte character encodings (e.g. ISO-8859-1) can be realized
with help of plain translation tables, e.g. const wchar_t tt_iso88951[256] =
{0,1,2,...}; and appropriate code for conversion to & from UCS2.

if UCS-2 is not sufficient, then switch to UCS-4 ( typedef std::basic_string<uint32_t>
UCS2String )

ICU or other unicode libraries?

For advanced stuff.

Share Follow edited Nov 7, 2011 at 6:31 answered Nov 7, 2011 at 6:07
Frunsi
7,029 5 35 42

1 Dang, it's not good to know that native Unicode support isn't there. – Mihai Danila Dec 15, 2013 at
16:59

@Frunsi, I'm curious to know if you've tried Glib::ustring and if so, what are your thoughts?
– Caroline Beltran Sep 19, 2014 at 19:44


@CarolineBeltran: I know Glib, but I never used it, and I probably will never even use it, because it is
rather limited to a rather unspecific target platform (unixoid systems...). Its windows port is based on
external win2unix-layer, and there IMHO is no OSX-compatibility-layer at all. All this stuff is directing
clearly into a wrong direction, at least for my code (on this arch level...) ;-) So, Glib is not an option
– Frunsi Sep 20, 2014 at 5:01

13 Search, replace, and so on works just fine on UTF-8 strings (a part of the byte sequence representing a
character can never be misinterpreted as another character). In fact, UTF-16 and UTF-32 don't make
this any easier at all: all three encodings are multibyte encodings in practice, because a user-perceived
character (grapheme cluster) can be any number of unicode codepoints long! The pragmatic solution is
to use UTF-8 for everything, and convert to UTF-16 only when dealing with the Windows API. – Daniel
Oct 17, 2014 at 10:49

6 @Frunsi: Search and replace works just as fine with UTF-8 as with UTF-32. It's precisely because proper
Unicode-aware text processing needs to deal with multi-codepoint 'characters' anyways, that using a
variable length encoding like UTF-8 doesn't make string processing any more complicated. So just use
UTF-8 everywhere. Normal C string functions will work fine on UTF-8 (and correspond to ordinal
comparisons on the Unicode string), and if you need anything more language-aware, you'll have to call
into a Unicode library anyways, UTF-16/32 can't save you from that. – Daniel Oct 23, 2014 at 10:16
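A short sketch of the claim in the two comments above: byte-wise search and replace on UTF-8 text works because the encoding of one character can never appear inside the encoding of another. The helper replace_all is made up for this example, and it assumes both strings use the same Unicode normalization form.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: replaces every occurrence of `from` with `to`,
// comparing raw bytes. With valid UTF-8 on both sides, a match always
// starts and ends on a character boundary.
std::string replace_all(std::string text, const std::string& from, const std::string& to)
{
    if (from.empty()) {
        return text;
    }
    std::size_t pos = 0;
    while ((pos = text.find(from, pos)) != std::string::npos) {
        text.replace(pos, from.size(), to);
        pos += to.size();  // continue after the inserted text
    }
    return text;
}
```

Replacing the two bytes of "é" (0xC3 0xA9) with "e" in the UTF-8 string "olé olé" yields "ole ole", and std::string::find reports the first match at byte offset 2.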

1. When you want to have wide characters stored in your string. How wide depends on the
implementation. Visual C++ defaults to 16 bit if I remember correctly, while GCC defaults
depending on the target. It's 32 bits long here. Please note wchar_t (wide character type)
has nothing to do with unicode. It's merely guaranteed that it can store all the members
of the largest character set that the implementation supports by its locales, and at least as
long as char. You can store unicode strings fine into std::string using the utf-8
encoding too. But it won't understand the meaning of unicode code points. So
str.size() won't give you the amount of logical characters in your string, but merely the
amount of char or wchar_t elements stored in that string/wstring. For that reason, the
gtk/glib C++ wrapper folks have developed a Glib::ustring class that can handle utf-8.
If your wchar_t is 32 bits long, then you can use utf-32 as a unicode encoding, and you
can store and handle unicode strings using a fixed (utf-32 is fixed length) encoding. This
means your wstring's s.size() function will then return the right amount of wchar_t
elements and logical characters.

2. Yes, char is always at least 8 bit long, which means it can store all ASCII values.

3. Yes, all major compilers support it.

Share Follow edited Dec 31, 2008 at 12:00 answered Dec 31, 2008 at 11:48
Johannes Schaub - litb
485k 125 874 1197

I'm curious about #2. I thought 7 bits would be technically valid too? Or is it required to be able to
store anything past 7-bit ASCII chars? – jalf Dec 31, 2008 at 12:11

1 yes, jalf. c89 specifies minimal ranges for basic types in its documentation of limits.h (for unsigned char,
that's 0..255 min), and a pure binary system for integer types. it follows char, unsigned char and signed
char have minimum bit lengths of 8. c++ inherits those rules. – Johannes Schaub - litb Dec 31, 2008 at
12:26

17 "This means your wstring's s.size() function will then return the right amount of wchar_t elements and
logical characters." This is not entirely accurate, even for Unicode. It would be more accurate to say

codepoint than "logical character", even in UTF-32 a given character may be composed of multiple
codepoints. – Logan Capaldo May 16, 2010 at 17:26
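The comment's point can be illustrated with a tiny sketch: even in UTF-32, size() counts code points, and one user-perceived character may consist of several of them. The two function names are invented for this example.

```cpp
#include <string>

// "é" as a single precomposed code point, U+00E9.
std::u32string precomposed_e_acute()
{
    return U"\u00E9";
}

// The same user-perceived character written as 'e' followed by
// U+0301 COMBINING ACUTE ACCENT.
std::u32string decomposed_e_acute()
{
    return U"e\u0301";
}
```

The first string has size() 1 and the second has size() 2, and the two compare unequal even though they normally render identically. Treating them as the same text requires Unicode normalization, which the standard library does not provide.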

Are you guys in essence saying that C++ doesn't have native support for the Unicode character set?
– Mihai Danila Dec 15, 2013 at 16:56

1 "But it won't understand the meaning of unicode code points." On windows, neither does
std::wstring . – Deduplicator Jan 8, 2015 at 22:20

I frequently use std::string to hold utf-8 characters without any problems at all. I heartily
recommend doing this when interfacing with API's which use utf-8 as the native string type as
well.

For example, I use utf-8 when interfacing my code with the Tcl interpreter.

The major caveat is that the length of the std::string is no longer the number of characters in the
string.

Share Follow answered Dec 31, 2008 at 4:33


Juan

1 Juan : Do you mean that std::string can hold all unicode characters but the length will report
incorrectly? Is there a reason that it is reporting incorrect length? – Appu Dec 31, 2008 at 4:35

4 When using the utf-8 encoding, a single unicode character may be made up of multiple bytes. This is
why utf-8 encoding is smaller when using mostly characters from the standard ascii set. You need to use
special functions (or roll your own) to measure the number of unicode characters. – Juan Dec 31, 2008
at 4:39

2 (Windows specific) Most functions will expect that a string using bytes is ASCII and 2 bytes is Unicode,
older versions MBCS. Which means if you are storing 8 bit unicode that you will have to convert to 16
bit unicode to call a standard windows function (unless you are only using ASCII portion).
– Greg Domjan Dec 31, 2008 at 4:58

2 Not only will a std::string report the length incorrectly, but it will also output the wrong string. If some
Unicode character is represented in UTF-8 as multiple bytes, which std::string thinks of as its own
characters, then your typical std::string manipulation routines will probably output the several strange
characters that result from the misinterpretation of the one correct character. – Mihai Danila Dec 15,
2013 at 17:01

2 I suggest changing the answer to indicate that strings should be thought of as only containers of bytes,
and, if the bytes are some Unicode encoding (UTF-8, UTF-16, ...), then you should use specific libraries
that understand that. The standard string-based APIs (length, substr, etc.) will all fail miserably with
multibyte characters. If this update is made, I will remove my downvote. – Mihai Danila Oct 7, 2014 at
14:19

A good question! I think DATA ENCODING (sometimes a CHARSET also involved) is a
MEMORY EXPRESSION MECHANISM in order to save data to a file or transfer data via a
network, so I answer this question as:

1. When I should use std::wstring over std::string?


If the programming platform or API function is a single-byte one, and we want to process or
parse some Unicode data, e.g. read from a Windows .REG file or a 2-byte network stream, we
should declare a std::wstring variable to process it easily. E.g.: wstring ws=L"中国a" (6 octets
of memory with a 2-byte wchar_t: 0x4E2D 0x56FD 0x0061); we can use ws[0] to get character '中', ws[1] to get
character '国' and ws[2] to get character 'a', etc.
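The indexing described above can be checked with a small sketch. The helper element_values is made up for this example; the wide literal uses \u escapes so that the result does not depend on the source file's encoding.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: returns the numeric value of each wchar_t element.
std::vector<unsigned long> element_values(const std::wstring& ws)
{
    std::vector<unsigned long> values;
    for (wchar_t wc : ws) {
        values.push_back(static_cast<unsigned long>(wc));
    }
    return values;
}
```

Calling element_values(L"\u4E2D\u56FDa") gives three values: 0x4E2D, 0x56FD and 0x61. All three characters are in the BMP, so this holds with both 2-byte and 4-byte wchar_t; a character outside the BMP would occupy two elements where wchar_t is 16 bits wide.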

2. Can std::string hold the entire ASCII character set, including the special characters?

Yes. But notice: American ASCII, means each 0x00~0xFF octet stands for one character,
including printable text such as "123abc&*_&" and you said special one, mostly print it as a '.'
avoid confusing editors or terminals. And some other countries extend their own "ASCII"
charset, e.g. Chinese, use 2 octets to stand for one character.

3. Is std::wstring supported by all popular C++ compilers?

Maybe, or mostly. I have used: VC++6 and GCC 3.3, YES

4. What is exactly a "wide character"?

a wide character mostly indicates using 2 octets or 4 octets to hold all countries' characters. 2
octet UCS2 is a representative sample, and further e.g. English 'a', its memory is 2 octet of
0x0061(vs in ASCII 'a's memory is 1 octet 0x61)

Share Follow edited Nov 8, 2018 at 13:16 answered Oct 29, 2013 at 9:56
user1741137 Leiyi.China
4,671 1 17 25 147 1 4

1. When you want to store 'wide' (Unicode) characters.

2. Yes: 255 of them (excluding 0).

3. Yes.
4. Here's an introductory article: http://www.joelonsoftware.com/articles/Unicode.html

Share Follow answered Dec 31, 2008 at 4:16


ChrisW
53.8k 12 113 214

12 std::string can hold 0 just fine (just be careful if you call the c_str() method) – Mr Fooz Dec 31, 2008 at
4:40
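A minimal sketch of the comment above, showing that a std::string can store a 0 byte and why c_str() needs care. The function names are invented for this illustration.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Builds a three-element string: 'a', '\0', 'b'. The length has to be
// passed explicitly, otherwise the constructor stops at the first NUL.
std::string with_embedded_nul()
{
    return std::string("a\0b", 3);
}

// Length as seen by a C API that treats the first NUL as the terminator.
std::size_t c_length(const std::string& s)
{
    return std::strlen(s.c_str());
}
```

Here with_embedded_nul().size() is 3, but c_length reports 1, because anything that consumes the c_str() pointer as a C string stops at the embedded NUL.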

3 And strictly speaking, a char isn't guaranteed to be 8 bits. :) Your link in #4 is a must-read, but I don't
think it answers the question. A wide character is strictly nothing to do with unicode. It is simply a wider
character. (How much wider depends on OS, but typically 16 or 32 bit) – jalf Dec 31, 2008 at 12:08

There are some very good answers here, but I think there are a couple of things I can add
regarding Windows/Visual Studio. This is based on my experience with VS2015. On Linux,
basically the answer is to use UTF-8 encoded std::string everywhere. On Windows/VS it
gets more complex. Here is why. Windows expects strings stored using chars to be encoded

using the locale codepage. This is almost always the ASCII character set followed by 128 other
special characters depending on your location. Let me just state that this is not just when
using the Windows API, there are three other major places where these strings interact with
standard C++. These are string literals, output to std::cout using << and passing a filename
to std::fstream .

I will be up front here that I am a programmer, not a language specialist. I appreciate that
UCS2 and UTF-16 are not the same, but for my purposes they are close enough to be
interchangeable and I use them as such here. I'm not actually sure which Windows uses, but I
generally don't need to know either. I've stated UCS2 in this answer, so sorry in advance if I
upset anyone with my ignorance of this matter and I'm happy to change it if I have things
wrong.

String literals
If you enter string literals that contain only characters that can be represented by your
codepage then VS stores them in your file with 1 byte per character encoding based on your
codepage. Note that if you change your codepage or give your source to another developer
using a different code page then I think (but haven't tested) that the character will end up
different. If you run your code on a computer using a different code page then I'm not sure if
the character will change too.

If you enter any string literals that cannot be represented by your codepage then VS will ask
you to save the file as Unicode. The file will then be encoded as UTF-8. This means that all
non-ASCII characters (including those which are on your codepage) will be represented by 2 or
more bytes. This means if you give your source to someone else the source will look the same.
However, before passing the source to the compiler, VS converts the UTF-8 encoded text to
code page encoded text and any characters missing from the code page are replaced with ? .

The only way to guarantee correctly representing a Unicode string literal in VS is to precede
the string literal with an L making it a wide string literal. In this case VS will convert the UTF-8
encoded text from the file into UCS2. You then need to pass this string literal into a
std::wstring constructor or you need to convert it to utf-8 and put it in a std::string . Or if
you want you can use the Windows API functions to encode it using your code page to put it
in a std::string , but then you may as well have not used a wide string literal.

std::cout
When outputting to the console using << you can only use std::string , not std::wstring
and the text must be encoded using your locale codepage. If you have a std::wstring then
you must convert it using one of the Windows API functions and any characters not on your
codepage get replaced by ? (maybe you can change the character, I can't remember).

std::fstream filenames
Windows OS uses UCS2/UTF-16 for its filenames so whatever your codepage, you can have
files with any Unicode character. But this means that to access or create files with characters
not on your codepage you must use std::wstring. There is no other way. This is a Microsoft
specific extension to std::fstream so probably won't compile on other systems. If you use
std::string then you can only utilise filenames that only include characters on your codepage.

Your options
If you are just working on Linux then you probably didn't get this far. Just use UTF-8
std::string everywhere.

If you are just working on Windows just use UCS2 std::wstring everywhere. Some purists
may say use UTF-8 then convert when needed, but why bother with the hassle?

If you are cross platform then it's a mess to be frank. If you try to use UTF-8 everywhere on
Windows then you need to be really careful with your string literals and output to the console.
You can easily corrupt your strings there. If you use std::wstring everywhere on Linux then
you may not have access to the wide version of std::fstream , so you have to do the
conversion, but there is no risk of corruption. So personally I think this is a better option. Many
would disagree, but I'm not alone - it's the path taken by wxWidgets for example.

Another option could be to typedef unicodestring as std::string on Linux and
std::wstring on Windows, and have a macro called UNI() which prefixes L on Windows and
nothing on Linux, then the code

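#include <fstream>
#include <iostream>
#include <string>

#ifdef _WIN32
#include <Windows.h> // only available (and only needed) on Windows

typedef std::wstring unicodestring;
#define UNI(text) L ## text

// Convert the wide text to the console's output codepage.
// Characters missing from that codepage are replaced by a default character.
std::string formatForConsole(const unicodestring &str)
{
    const UINT codePage = GetConsoleOutputCP();
    const int length = static_cast<int>(str.size());
    // First call: ask how many bytes the converted text needs.
    const int size = WideCharToMultiByte(codePage, 0, str.data(), length,
                                         NULL, 0, NULL, NULL);
    if (size <= 0)
        return std::string();
    std::string result(size, '\0');
    // Second call: do the conversion into the buffer.
    WideCharToMultiByte(codePage, 0, str.data(), length,
                        &result[0], size, NULL, NULL);
    return result;
}
#else
typedef std::string unicodestring;
#define UNI(text) text

// On Linux the string is assumed to be UTF-8 already, so it is passed through.
std::string formatForConsole(const unicodestring &str)
{
    return str;
}
#endif

int main()
{
    unicodestring fileName(UNI("fileName"));
    std::ofstream fout;
    // On Windows this relies on the Microsoft-specific wide-string open().
    fout.open(fileName);
    std::cout << formatForConsole(fileName) << std::endl;
    return 0;
}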

would be fine on either platform I think.


Answers
So, to answer your questions:

1) If you are programming for Windows, then all the time. If cross platform, then maybe all the
time, unless you want to deal with possible corruption issues on Windows or write some code
with platform-specific #ifdefs to work around the differences. If just using Linux, then never.

2) Yes. In addition, on Linux you can use it for all Unicode too. On Windows you can only use it
for all Unicode if you choose to manually encode using UTF-8. But the Windows API and
standard C++ classes will expect the std::string to be encoded using the locale codepage.
This includes all ASCII plus another 128 characters which change depending on the codepage
your computer is set up to use.

3) I believe so, but if not then it is just a simple typedef of a std::basic_string using wchar_t
instead of char.

4) A wide character is a character type which is bigger than the 1-byte standard char type. On
Windows it is 2 bytes, on Linux it is 4 bytes.

Share Follow edited Aug 17, 2018 at 13:36 answered Aug 17, 2018 at 13:17
Phil Rosenberg
1,387 1 12 22

1 Regarding "However, before passing the source to the compiler, VS converts the UTF-8 encoded text to
code page encoded text and any characters missing from the code page are replaced with ?." -> I don't
think that this is true when the compiler uses UTF-8 encoding (use /utf-8 ). – Roi Danton Jan 14, 2019
at 9:42

I was not aware of this as an option. From this link docs.microsoft.com/en-us/cpp/build/reference/… it
seems there is no tick box to select in project properties, you must add it as an additional command
line option. Good spot! – Phil Rosenberg Jan 15, 2019 at 10:57

Applications that are not satisfied with only 256 different characters have the options of either
using wide characters (more than 8 bits) or a variable-length encoding (a multibyte encoding
in C++ terminology) such as UTF-8. Wide characters generally require more space than a
variable-length encoding, but are faster to process. Multi-language applications that process
large amounts of text usually use wide characters when processing the text, but convert it to
UTF-8 when storing it to disk.

The only difference between a string and a wstring is the data type of the characters they
store. A string stores char s whose size is guaranteed to be at least 8 bits, so you can use
strings for processing e.g. ASCII, ISO-8859-15, or UTF-8 text. The standard says nothing about
the character set or encoding.

Practically every compiler uses a character set whose first 128 characters correspond with
ASCII. This is also the case with compilers that use UTF-8 encoding. The important thing to be
aware of when using strings in UTF-8 or some other variable-length encoding, is that the
indices and lengths are measured in bytes, not characters.

The data type of a wstring is wchar_t , whose size is not defined in the standard, except that it
has to be at least as large as a char, usually 16 bits or 32 bits. wstring can be used for
processing text in the implementation defined wide-character encoding. Because the
encoding is not defined in the standard, it is not straightforward to convert between strings
and wstrings. One cannot assume wstrings to have a fixed-length encoding either.

If you don't need multi-language support, you might be fine with using only regular strings.
On the other hand, if you're writing a graphical application, it is often the case that the API
supports only wide characters. Then you probably want to use the same wide characters when
processing the text. Keep in mind that UTF-16 is a variable-length encoding, meaning that you
cannot assume length() to return the number of characters. If the API uses a fixed-length
encoding, such as UCS-2, processing becomes easy. Converting between wide characters and
UTF-8 is difficult to do in a portable way, but then again, your user interface API probably
supports the conversion.

Share Follow edited Oct 12, 2015 at 22:57 answered Sep 11, 2011 at 9:28
Seppo Enarvi
2,920 3 28 25

So, paraphrasing the first paragraph: Application needing more than 256 characters need to use a
multibyte-encoding or a maybe_multibyte-encoding. – Deduplicator Oct 10, 2015 at 12:44

Generally 16 and 32 bit encodings such as UCS-2 and UCS-4 are not called multibyte encodings,
though. The C++ standard distinguishes between multibyte encodings and wide characters. A wide
character representation uses a fixed number (generally more than 8) bits per character. Encodings that
use a single byte to encode the most common characters, and multiple bytes to encode the rest of the
character set, are called multibyte encodings. – Seppo Enarvi Oct 12, 2015 at 21:16

Sorry, sloppy comment. Should have said variable-length encoding. UTF-16 is a variable-length-
encoding, just like UTF-8. Pretending it isn't is a bad idea. – Deduplicator Oct 12, 2015 at 21:23

That's a good point. There's no reason why wstrings couldn't be used to store UTF-16 (instead of UCS-
2), but then the convenience of a fixed-length encoding is lost. – Seppo Enarvi Oct 12, 2015 at 22:13

1. when you want to use Unicode strings and not just ASCII, helpful for internationalisation
2. yes, but it doesn't play well with 0
3. not aware of any that don't
4. wide character is the compiler specific way of handling the fixed length representation of
a Unicode character, for MSVC it is a 2 byte character, for gcc I understand it is 4 bytes.
and a +1 for http://www.joelonsoftware.com/articles/Unicode.html

Share Follow answered Dec 31, 2008 at 4:16


Greg Domjan
13.7k 6 41 59

1 2. An std::string can hold a NULL character just fine. It can also hold utf-8 and wide characters as well.
– Juan Dec 31, 2008 at 4:29

@Juan : That put me into confusion again. If std::string can keep unicode characters, what is special with
std::wstring? – Appu Dec 31, 2008 at 4:33

1 @Appu: std::string can hold UTF-8 unicode characters. There are a number of unicode standards
targeted at different character widths. UTF-8 is 8 bits wide. There's also UTF-16 and UTF-32 at 16 and 32
bits wide respectively – Greg D Dec 31, 2008 at 4:40

With a std::wstring, each Unicode character can be one wchar_t when using the fixed-length encodings,
for example if you choose to use the Joel on Software approach that Greg links to. Then the length of the
wstring is exactly the number of Unicode characters in the string. But it takes up more space – Juan Dec 31,
2008 at 4:43

I didn't say it could not hold a 0 '\0', and what I meant by doesn't play well is that some methods may
not give you an expected result containing all the data of the wstring. So harsh on the down votes.
– Greg Domjan Dec 31, 2008 at 4:53

1) As mentioned by Greg, wstring is helpful for internationalization, that's when you will be
releasing your product in languages other than English

4) Check this out for wide character http://en.wikipedia.org/wiki/Wide_character

Share Follow answered Dec 31, 2008 at 4:24


Raghu
190 4 10

When should you NOT use wide-characters?

When you're writing code before the year 1990.

Obviously, I'm being flip, but really, it's the 21st century now. 127 characters have long since
ceased to be sufficient. Yes, you can use UTF8, but why bother with the headaches?

Share Follow answered Jun 10, 2009 at 23:26


dave

17 @dave: I don't know what headache UTF-8 creates that is greater than that of wide chars (UTF-16).
In UTF-16, you also have multi-character characters. – Pavel Radzivilovsky Dec 29, 2009 at 16:08

The problem is that if you're anywhere but an English-speaking country you OUGHT to use wchar_t. Not to
mention that some alphabets have way more characters than you can fit into a byte. We were there, on
DOS. Codepage schizophrenia, no, thanks, no more.. – Swift - Friday Pie Nov 26, 2016 at 23:02

1 @Swift The problem with wchar_t is that its size and meaning are OS-specific. It just swaps the old
problems with new ones. Whereas a char is a char regardless of OS (on similar platforms, at least).
So we might as well just use UTF-8, pack everything into sequences of char s, and lament how C++
leaves us completely on our own without any standard methods for measuring, indexing, finding etc
within such sequences. – underscore_d May 21, 2017 at 14:16

1 @Swift You seem to have it completely backwards. wchar_t is a fixed-width data type, so an array of
10 wchar_t will always occupy sizeof(wchar_t) * 10 platform bytes. And UTF-16 is a variable-width
encoding in which characters may be made up of 1 or 2 16-bit codepoints (and s/16/8/g for UTF-8).
– underscore_d May 21, 2017 at 14:42

1 @SteveHollasch wchar_t representation of string on Windows would encode characters greater than
FFFF as a special surrogate pair, others would take only one wchar_t element. So that representation will
not be compatible with representation created by gnu compiler (where all characters less than FFFF will
have zero word in front of them). What is stored in wchar_t is determined by programmer and
compiler, not by some agreement – Swift - Friday Pie Nov 5, 2017 at 0:33
