How to get STL std::string to work with Unicode on Windows?
At my company we have a cross-platform (Linux and Windows) library that contains our own extension of the STL std::string. This class provides all sorts of functionality on top of the string: split, format, to/from base64, etc. Recently we were given the requirement of making this string Unicode "friendly"; basically, it needs to support characters from Chinese, Japanese, Arabic, etc. After initial research this seems fine on the Linux side, since everything is inherently UTF-8; however, I am having trouble with the Windows side. Is there a trick to getting the STL std::string to work as UTF-8 on Windows? Is it even possible? Is there a better way? Ideally we would keep ourselves based on std::string, since that is what the string class is based on in Linux.
Thank you,
Comments (9)
There are several misconceptions in your question.
Neither C++ nor the STL deal with encodings.
std::string is essentially a string of bytes, not characters, so you should have no problem stuffing UTF-8 encoded Unicode into it. However, keep in mind that all string functions also work on bytes, so myString.length() will give you the number of bytes, not the number of characters.
Linux is not inherently UTF-8. Most distributions nowadays default to UTF-8, but it should not be relied upon.
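To make the bytes-versus-characters point concrete, here is a minimal sketch. The helper name utf8_code_points is made up for illustration, and it assumes the string already holds valid UTF-8 (it does no validation).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts Unicode code points in a string that is assumed to hold valid UTF-8.
// Continuation bytes have the bit pattern 10xxxxxx; every other byte starts
// a new code point.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Usage: the two-character Chinese greeting below occupies six bytes in UTF-8.
void example_length() {
    const std::string hello = "\xE4\xBD\xA0\xE5\xA5\xBD";  // U+4F60 U+597D
    assert(hello.length() == 6);          // bytes
    assert(utf8_code_points(hello) == 2); // code points
}
```

Note that a code point is still not necessarily what a user perceives as one character (combining marks, for instance), so this only narrows the gap.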
Yes - by being more aware of locales and encodings.
Windows has two function calls for everything that requires text: a FoobarA() and a FoobarW(). The *W() functions take UTF-16 encoded strings; the *A() functions take strings in the current code page. However, Windows has traditionally not supported UTF-8 as that code page, so you can't use UTF-8 directly with the *A() functions, nor would you want to depend on users having set it. If you want "Unicode" in Windows, use the Unicode-capable (*W) functions. There are tutorials out there; searching for "Unicode Windows tutorial" should get you some.
If you are storing UTF-8 data in a std::string, then before you pass it to Windows, convert it to UTF-16 (Windows provides functions for this).
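On Windows itself you would normally let the OS do this conversion (MultiByteToWideChar with CP_UTF8). To show what the conversion involves, here is a portable, hand-rolled sketch; the name utf8_to_utf16 is hypothetical, and std::u16string stands in for a Windows wide string so the code compiles anywhere.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Decodes UTF-8 and re-encodes it as UTF-16. Throws on bad lead bytes,
// truncated sequences, bad continuation bytes, surrogates and values above
// U+10FFFF. For brevity it does NOT reject overlong encodings.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (in.size() - i - 1 < extra)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point");
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points outside the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// Usage: one BMP character becomes one UTF-16 code unit.
void example_convert() {
    const std::u16string wide = utf8_to_utf16("\xE4\xBD\xA0");  // U+4F60
    assert(wide.size() == 1 && wide[0] == 0x4F60);
}
```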
Many of these problems arise from C/C++ being generally encoding-agnostic. char isn't really a character; it's just an integral type. Even using char arrays to store UTF-8 data can get you into trouble if you need to access individual code units, as the signedness of char is implementation-defined. A statement like str[x] < 0x80, used to tell single-byte from multi-byte characters, can quickly introduce a bug. (That statement is always true if char is signed.) A UTF-8 code unit is an unsigned integer in the range 0-255. That maps to the C type uint8_t exactly, although unsigned char works as well. Ideally, then, I'd make a UTF-8 string an array of uint8_t, but due to old APIs this is rarely done.
Some people have recommended wchar_t, claiming it to be "a Unicode character type" or something like that. Again, the standard is just as agnostic here as before, as C is meant to work anywhere, and anywhere might not be using Unicode. Thus, wchar_t is no more Unicode than char. The standard only requires wchar_t to be able to represent every member of the largest extended character set among the supported locales.
On Linux, a wchar_t represents a UTF-32 code unit / code point and is thus 4 bytes. On Windows, however, it is a UTF-16 code unit and only 2 bytes. (I would have said that does not conform to the above, since 2 bytes cannot represent all of Unicode, but that's the way it works.) This difference in size and in data encoding clearly puts a strain on portability. The Unicode standard itself recommends against wchar_t if you need portability. (§5.2)
The end lesson: I find it easiest to store all my data in some well-declared format (typically UTF-8, usually in std::string, though I'd really like something better). The important thing here is not the UTF-8 part but that I know my strings are UTF-8. If I'm passing them to some other API, I must also know whether that API expects UTF-8 strings. If it doesn't, I must convert them. (Thus, if I speak to the Windows API, I must convert strings to UTF-16 first.) A UTF-8 text string is an "orange", and a "latin1" text string is an "apple". A char array that doesn't know what encoding it is in is a recipe for disaster.
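The signedness trap with str[x] < 0x80 is easy to sidestep by converting to unsigned char before comparing. A small sketch (the function name is mine, not from any library):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True when the byte at index i is part of a multi-byte UTF-8 sequence,
// i.e. it is not plain ASCII. The cast matters: where char is signed,
// bytes 0x80..0xFF are negative and would always compare below 0x80.
bool is_multibyte_unit(const std::string& s, std::size_t i) {
    return static_cast<unsigned char>(s[i]) >= 0x80;
}

// Usage: 'a' is ASCII; the two bytes of U+00E9 are not.
void example_signedness() {
    const std::string s = "a\xC3\xA9";
    assert(!is_multibyte_unit(s, 0));
    assert(is_multibyte_unit(s, 1));
    assert(is_multibyte_unit(s, 2));
}
```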
Putting UTF-8 data into a std::string should be fine regardless of platform. The problem on Windows is that almost nothing else expects or works with UTF-8; it expects and works with UTF-16 instead. You can switch to a std::wstring, which will store UTF-16 (at least on most Windows compilers), or you can write other routines that accept UTF-8 (probably by converting to UTF-16 and then passing through to the OS).
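A rough sketch of what such "routines that accept UTF-8" might look like for file access. The names fopen_utf8, remove_utf8 and widen are invented here; the Windows branch relies on MultiByteToWideChar, _wfopen and _wremove, and elsewhere the path is passed through unchanged on the assumption that the system treats it as UTF-8.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h>

// Converts UTF-8 to UTF-16 via the Win32 API; returns an empty string on failure.
static std::wstring widen(const std::string& utf8) {
    if (utf8.empty()) return std::wstring();
    const int len = static_cast<int>(utf8.size());
    const int n = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                      utf8.data(), len, nullptr, 0);
    if (n <= 0) return std::wstring();
    std::wstring out(static_cast<std::size_t>(n), L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), len, &out[0], n);
    return out;
}
#endif

// Opens a file whose path is given as UTF-8 on every platform.
std::FILE* fopen_utf8(const std::string& path, const std::string& mode) {
#ifdef _WIN32
    return _wfopen(widen(path).c_str(), widen(mode).c_str());
#else
    return std::fopen(path.c_str(), mode.c_str());
#endif
}

// Deletes a file whose path is given as UTF-8; returns 0 on success.
int remove_utf8(const std::string& path) {
#ifdef _WIN32
    return _wremove(widen(path).c_str());
#else
    return std::remove(path.c_str());
#endif
}

// Usage: create and delete a file with a non-ASCII name.
void example_file() {
    const std::string name = "caf\xC3\xA9.txt";
    std::FILE* f = fopen_utf8(name, "w");
    assert(f != nullptr);
    std::fclose(f);
    assert(remove_utf8(name) == 0);
}
```

The point of the design is that the rest of the code base only ever sees std::string in UTF-8; the UTF-16 conversion stays inside these thin wrappers.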
Have you looked at std::wstring? It's a version of std::basic_string for wchar_t rather than the char that std::string uses.
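For what it's worth, the relationship between the two types can be checked at compile time. The helper wide_payload_bytes below is just an illustrative name; its result depends on the platform's sizeof(wchar_t).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <type_traits>

// std::string and std::wstring are the same template with different code unit types.
static_assert(std::is_same<std::string, std::basic_string<char>>::value,
              "std::string is basic_string<char>");
static_assert(std::is_same<std::wstring, std::basic_string<wchar_t>>::value,
              "std::wstring is basic_string<wchar_t>");

// Bytes occupied by the code units of a wide string (terminator excluded).
// The result is platform-dependent because sizeof(wchar_t) is.
std::size_t wide_payload_bytes(const std::wstring& s) {
    return s.size() * sizeof(wchar_t);
}

// Usage: three code units, whose byte size varies with sizeof(wchar_t).
void example_wide() {
    const std::wstring w = L"abc";
    assert(w.size() == 3);
    assert(wide_payload_bytes(w) == 3 * sizeof(wchar_t));
}
```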
No, there is no way to make Windows treat "narrow" strings as UTF-8.
Here is what works best for me in this situation (cross-platform application that has Windows and Linux builds).
Other approaches that I tried but don't like much:
typedef std::basic_string<TCHAR> tstring; then use tstring in the business code. Wrappers/overloads can be made to streamline conversion between std::string and tstring, but it still adds a lot of pain.
std::wstring everywhere. This does not help much, since wchar_t is 16 bits on Windows, so you either have to restrict yourself to the BMP or go to a lot of complications to make the code dealing with Unicode cross-platform. In the latter case, all benefits over UTF-8 evaporate.
CString in the platform-specific portion; std::string in the cross-platform portion. This is actually a variant of what I recommend above. CString is in many respects superior to std::string (in my opinion), but it introduces an additional dependency and is thus not always acceptable or convenient.
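To illustrate the "complications" with 16-bit code units: anything outside the BMP takes two units (a surrogate pair), so even counting characters needs special handling. A sketch, using std::u16string as a portable stand-in for a Windows std::wstring; the function name is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts code points in UTF-16 text. A high surrogate (0xD800..0xDBFF)
// followed by a low surrogate (0xDC00..0xDFFF) is one code point; an
// unpaired surrogate is simply counted as one unit here.
std::size_t utf16_code_points(const std::u16string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        ++count;
        const bool high = s[i] >= 0xD800 && s[i] <= 0xDBFF;
        const bool low_follows = i + 1 < s.size() &&
                                 s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF;
        if (high && low_follows) {
            ++i;  // skip the low surrogate that completes the pair
        }
    }
    return count;
}

// Usage: U+1F600 is outside the BMP, so it needs two UTF-16 code units.
void example_surrogates() {
    std::u16string s;
    s.push_back(0xD83D);
    s.push_back(0xDE00);
    assert(s.size() == 2);
    assert(utf16_code_points(s) == 1);
}
```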
If you want to avoid headaches, don't use the STL string types at all. C++ knows nothing about Unicode or encodings, so to be portable, it's better to use a library that is tailored for Unicode support, e.g. the ICU library. ICU uses UTF-16 strings by default, so no conversion is required on Windows, and it supports conversions to many other important encodings like UTF-8. Also try to use cross-platform libraries like Boost.Filesystem for things like path manipulation (boost::wpath). Avoid std::string and std::fstream.
In the Windows API and C runtime library, char* parameters are interpreted as being encoded in the "ANSI" code page. The problem is that UTF-8 isn't supported as an ANSI code page, which I find incredibly annoying.
I'm in a similar situation, being in the middle of porting software from Windows to Linux while also making it Unicode-aware. The approach we've taken for this is:
This is also the approach Poco has taken.
It's really platform dependent; Unicode is a headache. It depends on which compiler you use. For older ones from MS (VS2010 or older), you would need to use the API described on MSDN.
For VS2015:
according to their docs. I can't check that one.
For mingw, gcc, etc.:
The output contains the proper file name...
You should consider using QString and QByteArray; they have good Unicode support.