How do I get the STL std::string to work with Unicode on Windows?

Posted on 2024-09-10 05:24:27

At my company we have a cross-platform (Linux and Windows) library that contains our own extension of the STL std::string. This class provides all sorts of functionality on top of the string: split, format, to/from base64, etc. Recently we were given the requirement of making this string Unicode "friendly"; basically, it needs to support characters from Chinese, Japanese, Arabic, etc. After initial research this seems fine on the Linux side, since everything is inherently UTF-8, but I am having trouble with the Windows side. Is there a trick to getting the STL std::string to work as UTF-8 on Windows? Is it even possible? Is there a better way? Ideally we would keep ourselves based on std::string, since that is what the string class is based on in Linux.

Thank you,

Comments (9)

第几種人 2024-09-17 05:24:27

There are several misconceptions in your question.

  • Neither C++ nor the STL deals with encodings.

  • std::string is essentially a string of bytes, not characters. So you should have no problem stuffing UTF-8 encoded Unicode into it. However, keep in mind that all string functions also work on bytes, so myString.length() will give you the number of bytes, not the number of characters.

  • Linux is not inherently UTF-8. Most distributions nowadays default to UTF-8, but it should not be relied upon.
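
To illustrate the point about length() counting bytes, here is a minimal sketch. The helper name utf8_code_points is made up for this example and is not part of the standard library, and it assumes the input is already valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a standard function): counts Unicode code points
// in a UTF-8 string by skipping continuation bytes, which all have the bit
// pattern 10xxxxxx. Assumes the input is valid UTF-8.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;  // lead byte or ASCII byte: starts a new code point
        }
    }
    return count;
}

// The two-character text U+4F60 U+597D written as raw UTF-8 bytes, so the
// example does not depend on the encoding of the source file.
const std::string kHello = "\xE4\xBD\xA0\xE5\xA5\xBD";
```

With these definitions, kHello.length() is 6 (bytes) while utf8_code_points(kHello) is 2. Note that code points are still not the same thing as user-perceived characters; combining sequences need a proper Unicode library.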

心房的律动 2024-09-17 05:24:27

Yes - by being more aware of locales and encodings.

Windows has two function calls for everything that requires text: a FoobarA() and a FoobarW(). The *W() functions take UTF-16 encoded strings; the *A() functions take strings in the current code page. However, Windows doesn't support a UTF-8 code page, so you can't directly use UTF-8 with the *A() functions, nor would you want to depend on a particular code page being set by users. If you want "Unicode" in Windows, use the Unicode-capable (*W) functions. There are tutorials out there; Googling "Unicode Windows tutorial" should get you some.

If you are storing UTF-8 data in a std::string, then before you pass it off to Windows, convert it to UTF-16 (Windows provides functions for doing so), and then pass it to Windows.
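
As a rough illustration of what that conversion involves, here is a portable sketch. On Windows itself you would more likely call the system function (MultiByteToWideChar with CP_UTF8 is the usual choice); the hand-rolled utf8_to_utf16 below is a hypothetical helper written for this example so that it compiles on any platform. It is deliberately simplified and does not reject overlong forms or encoded surrogates.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 and re-encodes it as UTF-16.
// Throws std::invalid_argument on truncated or malformed sequences.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (in.size() - i <= extra) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) {
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        i += extra + 1;

        if (cp > 0x10FFFF) {
            throw std::invalid_argument("code point out of range");
        }
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Outside the BMP: split into a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

The result holds UTF-16 code units. On Windows, where wchar_t is a 16-bit UTF-16 code unit, the same data is what the *W() functions expect.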

Many of these problems arise from C/C++ being generally encoding-agnostic. char isn't really a character; it's just an integral type. Even using char arrays to store UTF-8 data can get you into trouble if you need to access individual code units, as the signedness of char is implementation-defined. A statement like str[x] < 0x80 to check for multiple-byte characters can quickly introduce a bug. (That statement is always true if char is signed.) A UTF-8 code unit is an unsigned integer with a range of 0-255. That maps exactly to the C type uint8_t, although unsigned char works as well. Ideally, then, I'd make a UTF-8 string an array of uint8_t values, but due to old APIs this is rarely done.
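
A small sketch of the signedness pitfall and one way around it. The function name is_ascii_byte is invented for this example; the only technique shown is casting to unsigned char before comparing.

```cpp
#include <cstddef>
#include <string>

// The naive form `s[i] < 0x80` is always true where char is signed, because
// bytes 0x80..0xFF are then read as negative values. Converting to
// unsigned char first makes the comparison behave the same on every platform.
// Hypothetical helper name; the caller must ensure i < s.size().
bool is_ascii_byte(const std::string& s, std::size_t i) {
    return static_cast<unsigned char>(s[i]) < 0x80;
}
```

For example, with the two-byte UTF-8 encoding of U+00E9 ("\xC3\xA9"), is_ascii_byte returns false for both bytes regardless of whether char is signed.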

Some people have recommended wchar_t, claiming it to be "A Unicode character type" or something like that. Again, here the standard is just as agnostic as before, as C is meant to work anywhere, and anywhere might not be using Unicode. Thus, wchar_t is no more Unicode than char. The standard states:

which is an integer type whose range of values can represent distinct codes for all members of the largest extended character set specified among the supported locales

In Linux, a wchar_t represents a UTF-32 code unit / code point. It is thus 4 bytes. However, in Windows, it's a UTF-16 code unit and is only 2 bytes. (I would have said that this does not conform to the above, since 2 bytes cannot represent all of Unicode, but that's the way it works.) This difference in size and in data encoding clearly puts a strain on portability. The Unicode standard itself recommends against wchar_t if you need portability. (§5.2)
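
To make the size difference concrete, here is a small sketch using the fixed-width C++11 types char16_t and char32_t, whose widths do not vary by platform the way wchar_t does. The constant and function names are made up for this example.

```cpp
#include <cstddef>
#include <string>

// U+1F600 lies outside the Basic Multilingual Plane.
const std::u16string kAsUtf16 = u"\U0001F600";  // stored as a surrogate pair
const std::u32string kAsUtf32 = U"\U0001F600";  // stored as a single code unit

// Hypothetical helper: how many wchar_t elements one non-BMP character needs.
// 2 where wchar_t is 16 bits wide (e.g. Windows), 1 where it is 32 bits wide
// (e.g. typical Linux toolchains).
constexpr std::size_t wchar_units_for_non_bmp() {
    return sizeof(wchar_t) == 2 ? 2 : 1;
}
```

Here kAsUtf16.size() is 2 and kAsUtf32.size() is 1 on every conforming compiler, whereas the length of the equivalent wide literal depends on the platform's wchar_t.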

The end lesson: I find it easiest to store all my data in some well-declared format. (Typically UTF-8, usually in std::strings, but I'd really like something better.) The important thing here is not the UTF-8 part, but rather that I know that my strings are UTF-8. If I'm passing them to some other API, I must also know that that API expects UTF-8 strings. If it doesn't, then I must convert them. (Thus, if I speak to the Windows API, I must convert strings to UTF-16 first.) A UTF-8 text string is an "orange", and a "latin1" text string is an "apple". A char array that doesn't know what encoding it is in is a recipe for disaster.

地狱即天堂 2024-09-17 05:24:27

Putting UTF-8 encoded text into a std::string should be fine regardless of platform. The problem on Windows is that almost nothing else expects or works with UTF-8; it expects and works with UTF-16 instead. You can switch to a std::wstring, which will store UTF-16 (at least on most Windows compilers), or you can write other routines that will accept UTF-8 (probably by converting to UTF-16 and then passing through to the OS).

野稚 2024-09-17 05:24:27

Have you looked at std::wstring? It's a version of std::basic_string for wchar_t rather than the char that std::string uses.

鹿港小镇 2024-09-17 05:24:27

No, there is no way to make Windows treat "narrow" strings as UTF-8.

Here is what works best for me in this situation (cross-platform application that has Windows and Linux builds).

  • Use std::string in the cross-platform portion of the code. Assume that it always contains UTF-8 strings.
  • In the Windows portion of the code, use the "wide" versions of the Windows API explicitly, i.e. write CreateFileW instead of CreateFile. This avoids a dependency on the build system configuration.
  • In the platform abstraction layer, convert between UTF-8 and UTF-16 where needed (MultiByteToWideChar/WideCharToMultiByte).
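
The three points above could be sketched roughly as follows. This is only an illustration under assumptions: open_file_utf8 and widen are hypothetical names, and the example uses the C runtime's _wfopen rather than CreateFileW to keep it short. The Windows branch is compiled only when _WIN32 is defined.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

#ifdef _WIN32
#include <windows.h>

// Hypothetical helper: UTF-8 to UTF-16 via the Win32 API.
// Returns an empty string on failure or for empty input.
static std::wstring widen(const std::string& utf8) {
    if (utf8.empty()) return std::wstring();
    const int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                        utf8.data(), static_cast<int>(utf8.size()),
                                        nullptr, 0);
    if (len <= 0) return std::wstring();
    std::wstring wide(static_cast<std::size_t>(len), L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), static_cast<int>(utf8.size()),
                        &wide[0], len);
    return wide;
}
#endif

// Hypothetical wrapper for the platform abstraction layer: callers always
// pass UTF-8, and the conversion happens only at the Windows boundary.
// Returns nullptr if the file cannot be opened.
std::FILE* open_file_utf8(const std::string& path, const std::string& mode) {
#ifdef _WIN32
    const std::wstring wide_path = widen(path);
    const std::wstring wide_mode = widen(mode);
    if (wide_path.empty() || wide_mode.empty()) return nullptr;
    return _wfopen(wide_path.c_str(), wide_mode.c_str());
#else
    // Assumes the filesystem accepts UTF-8 byte strings as-is, which is the
    // common case on Linux but is not guaranteed.
    return std::fopen(path.c_str(), mode.c_str());
#endif
}
```

The cross-platform code then only ever sees std::string and never mentions wchar_t or TCHAR.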

Other approaches that I tried but don't like much:

  • typedef std::basic_string<TCHAR> tstring; then use tstring in the business code. Wrappers/overloads can be made to streamline conversion between std::string and tstring, but it still adds a lot of pain.
  • Use std::wstring everywhere. This does not help much, since wchar_t is 16 bits on Windows, so you either have to restrict yourself to the BMP or go to a lot of trouble to make the code dealing with Unicode cross-platform. In the latter case, all benefits over UTF-8 evaporate.
  • Use ATL/WTL/MFC CString in the platform-specific portion; use std::string in the cross-platform portion. This is actually a variant of what I recommend above. CString is in many aspects superior to std::string (in my opinion). But it introduces an additional dependency and is thus not always acceptable or convenient.

国产ˉ祖宗 2024-09-17 05:24:27

If you want to avoid headaches, don't use the STL string types at all. C++ knows nothing about Unicode or encodings, so to be portable, it's better to use a library that is tailored for Unicode support, e.g. the ICU library. ICU uses UTF-16 strings by default, so no conversion is required, and it supports conversions to many other important encodings like UTF-8. Also try to use cross-platform libraries like Boost.Filesystem for things like path manipulations (boost::wpath). Avoid std::string and std::fstream.

骑趴 2024-09-17 05:24:27

In the Windows API and C runtime library, char* parameters are interpreted as being encoded in the "ANSI" code page. The problem is that UTF-8 isn't supported as an ANSI code page, which I find incredibly annoying.

I'm in a similar situation, being in the middle of porting software from Windows to Linux while also making it Unicode-aware. The approach we've taken for this is:

  • Use UTF-8 as the default encoding for strings.
  • In Windows-specific code, always call the "W" version of functions, converting string arguments between UTF-8 and UTF-16 as necessary.

This is also the approach Poco has taken.

大海や 2024-09-17 05:24:27

It really is platform dependent; Unicode is a headache. It depends on which compiler you use. For older ones from MS (VS2010 or older), you would need to use the API described in MSDN.

For VS2015:

std::string _old = u8"D:\\Folder\\This \xe2\x80\x93 by ABC.txt"s;

That is according to their docs; I can't check that one.

For mingw, gcc, etc.:

std::string _old = u8"D:\\Folder\\This \xe2\x80\x93 by ABC.txt";
std::cout << _old.data();

The output contains the proper file name.

耳钉梦 2024-09-17 05:24:27

You should consider using QString and QByteArray; they have good Unicode support.
