Confused about C++'s std::wstring, UTF-16, UTF-8 and displaying strings in a Windows GUI

Posted on 2024-08-26 13:23:08

I'm working on an English-only C++ program for Windows where we were told "always use std::wstring", but it seems like nobody on the team really has much of an understanding beyond that.

I already read the question titled "std::wstring VS std::string". It was very helpful, but I still don't quite understand how to apply all of that information to my problem.

The program I'm working on displays data in a Windows GUI. That data is persisted as XML. We often transform that XML using XSLT into HTML or XSL:FO for reporting purposes.

My feeling based on what I have read is that the HTML should be encoded as UTF-8. I know very little about GUI development, but the little bit I have read indicates that the GUI stuff is all based on UTF-16 encoded strings.

I'm trying to understand where this leaves me. Say we decide that all of our persisted data should be UTF-8 encoded XML. Does this mean that in order to display persisted data in a UI component, I should really be performing some sort of explicit UTF-8 to UTF-16 transcoding process?

I suspect my explanation could use clarification, so I'll try to provide that if you have any questions.

Comments (5)

从来不烧饼 2024-09-02 13:23:08

Windows from NT4 onwards is based on Unicode encoded strings, yes. Early versions were based on UCS-2, which is the predecessor of UTF-16, and thus does not support all of the characters that UTF-16 does. Later versions are based on UTF-16. Not all OSes are based on UTF-16/UCS-2, though. *nix systems, for instance, are based on UTF-8 instead.

UTF-8 is a very good choice for storing data persistently. It is supported in essentially every Unicode environment, and it strikes a good balance between data size and lossless data compatibility.

Yes, you would have to parse the XML, extract the necessary information from it, and decode and transform it into something the UI can use.
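To make the "decode and transform" step a bit more concrete, here is a minimal sketch of turning UTF-8 bytes (as they would come out of the XML) into UTF-16 code units. The function name `utf8_to_utf16` is made up for illustration, and it uses portable `std::u16string` rather than `std::wstring`; on Windows you would more likely call `MultiByteToWideChar` with `CP_UTF8` or use a library such as ICU instead of hand-rolling this.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode UTF-8 bytes into UTF-16 code units.
// Malformed input (bad lead byte, truncation, overlong forms, encoded
// surrogates, values above U+10FFFF) is rejected with an exception.
std::u16string utf8_to_utf16(const std::string& in) {
    static const std::uint32_t min_cp[] = {0x0, 0x80, 0x800, 0x10000};
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0;  // number of continuation bytes expected
        if (b0 < 0x80)                { cp = b0;        extra = 0; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; extra = 1; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; extra = 2; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        i += extra + 1;

        if (cp < min_cp[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid code point in UTF-8 input");

        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Outside the 16-bit range: emit a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

For English-only text every byte is below 0x80, so each byte maps straight to one 16-bit unit; the other branches only matter once a name or address with non-ASCII characters shows up.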

别想她 2024-09-02 13:23:08

On Windows, std::wstring is in effect UCS-2: wchar_t is two bytes wide there, and the string class treats every 16-bit unit as one character. It's important to understand that UCS-2 is not the same as UTF-16! UTF-16 allows "surrogate pairs" in order to represent characters which are outside of the two-byte range, but UCS-2 uses exactly two bytes for each character, period.

The best rule for your situation is to do your transcoding when you read from and write to the disk. Once it's in memory, keep it in UCS-2 format. Windows APIs will read it as if it were UTF-16. That is to say, while std::wstring doesn't understand the concept of surrogate pairs, if you manually create them (which you won't, if your only language is English), Windows will read them.
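A small sketch of what "doesn't understand surrogate pairs" means in practice: the string's `size()` counts 16-bit units, not characters. The helper name `count_code_points` is hypothetical, and `std::u16string` is used here as a portable stand-in for a 16-bit `std::wstring`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count Unicode code points in a UTF-16 string.
// A high surrogate followed by a low surrogate counts as one code point;
// every other unit (including an unpaired surrogate) counts as one.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        const bool is_high = (u >= 0xD800 && u <= 0xDBFF);
        if (is_high && i + 1 < s.size() &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            ++i;  // skip the low half of the pair
        }
        ++n;
    }
    return n;
}
```

With `u"A\U0001F600"` (the letter A followed by one character outside the 16-bit range), `size()` reports three units while the helper reports two code points. For plain English text the two numbers are always equal, which is why the distinction rarely bites in that case.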

Whenever you're reading data in from, or writing it out to, serialization formats (such as XML) in the modern day, you'll probably need to do transcoding. It's an unpleasant and very unfortunate fact of life, but inevitable since UTF-8 is a variable-width encoding and most character-based operations in C++ are done as arrays, for which you need consistent spacing.

Higher-level frameworks, such as .NET, obscure most of the details, but behind the scenes, they're handling the transcoding in the same fashion: changing variable-width data to fixed-width strings, manipulating them, and then changing them back into variable-width encodings when required for output.

内心激荡 2024-09-02 13:23:08

AFAIK when you work with std::wstring on Windows in C++ and store using UTF-8 in files (which sounds good and reasonable), then you have to convert the data to UTF-8 when writing to a file, and convert back to UTF-16 when reading from a file. Check out this link: Writing UTF-8 Files in C++.
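For the write direction, a rough sketch of the UTF-16 to UTF-8 conversion might look like the following. The name `utf16_to_utf8` is invented for this example and it works on portable `std::u16string`; on Windows the usual route would be `WideCharToMultiByte` with `CP_UTF8`, so treat this only as an illustration of what that conversion does.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: encode UTF-16 code units as UTF-8 bytes.
// Unpaired surrogates are rejected with an exception.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: must be followed by a low surrogate.
            if (i + 1 >= in.size() || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF)
                throw std::invalid_argument("unpaired high surrogate");
            const std::uint32_t lo = in[i + 1];
            cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
            ++i;
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::invalid_argument("unpaired low surrogate");
        }

        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

The resulting `std::string` is just bytes, so it can be written to a `std::ofstream` opened in binary mode without any further translation.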

I would stick with the Visual Studio default of project -> Properties -> Configuration Properties -> General -> Character Set -> Use Unicode Character Set, use the wchar_t type (i.e. with std::wstring) and not use the TCHAR type. (E.g. I would just use the wcslen version of strlen and not _tcslen.)

很糊涂小朋友 2024-09-02 13:23:08

One advantage to using std::wstring on Windows for GUI-related strings is that internally all Windows API calls use and operate on UTF-16. You may have noticed that there are two versions of every Win32 API call that takes string arguments, for example "MessageBoxA" and "MessageBoxW". Both definitions exist in <windows.h>, and in fact you can call whichever you want, but if <windows.h> is included with Unicode support enabled, then the following will happen:

#define MessageBox MessageBoxW

Then you get into TCHARs and other Microsoft tricks that try to make it easier to deal with APIs that have both an ANSI and a Unicode version. In short, you can call either, but under the hood the Windows kernel is Unicode-based, so you'll be paying the cost of converting to Unicode for each string-accepting Win32 API call if you don't use the wide-char version.
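The A/W pattern can be imitated in a few lines of portable code. Everything below is a stand-in: `ShowTextA`, `ShowTextW`, `ShowText`, `DEMO_UNICODE` and `g_conversions` are hypothetical names, not Win32 ones, and the narrow-to-wide step assumes ASCII-only input, whereas the real "A" functions convert from the active ANSI code page.

```cpp
#include <string>

// Counts how many narrow-to-wide conversions the "A" entry point performed.
int g_conversions = 0;

// Hypothetical wide entry point: works on the text directly.
std::wstring ShowTextW(const wchar_t* text) {
    return std::wstring(text);
}

// Hypothetical narrow entry point: widens its argument, then forwards to
// the wide version. The widening is the extra cost paid on every call.
std::wstring ShowTextA(const char* text) {
    std::wstring wide;
    for (const char* p = text; *p != '\0'; ++p) {
        // ASCII-only assumption: each byte becomes one wide character.
        wide.push_back(static_cast<wchar_t>(static_cast<unsigned char>(*p)));
    }
    ++g_conversions;
    return ShowTextW(wide.c_str());
}

// The unsuffixed name is only a macro that picks one of the two.
#define DEMO_UNICODE 1
#ifdef DEMO_UNICODE
#define ShowText ShowTextW
#else
#define ShowText ShowTextA
#endif
```

With `DEMO_UNICODE` defined, `ShowText(L"hi")` resolves to `ShowTextW` and no conversion happens, while an explicit `ShowTextA("hi")` gives the same result after one conversion.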

UTF-16 and Windows kernel use

狂之美人 2024-09-02 13:23:08

Even if you say you only have English in your data, you're probably wrong. Since we're in a global world now, names, addresses and the like contain foreign characters. OK, I do not know what type of data you have, but generally I would say build your application to support Unicode both for storing data and for displaying it to the user. That would suggest using XML with UTF-8 for storage and the Unicode versions of the Windows calls when you do GUI work. And since the Windows GUI uses UTF-16, where each code unit is 16 bits, I would suggest storing the data in the application in a 16-bit-wide string. I would guess your compiler for Windows has std::wstring as 16-bit for just this purpose.

So then you have to do a lot of conversion between UTF-16 and UTF-8. Do that with some existing library, like for instance ICU.
