C++ strings: UTF-8 or 16-bit encoding?

Posted 2024-07-04 18:52:34

I'm still trying to decide whether my (home) project should use UTF-8 strings (implemented in terms of std::string with additional UTF-8-specific functions when necessary) or some 16-bit string (implemented as std::wstring). The project is a programming language and environment (like VB, it's a combination of both).

There are a few wishes/constraints:

  • It would be cool if it could run on limited hardware, such as computers with limited memory.
  • I want the code to run on Windows, Mac and (if resources allow) Linux.
  • I'll be using wxWidgets as my GUI layer, but I want the code that interacts with that toolkit confined to a corner of the codebase (I will have non-GUI executables).
  • I would like to avoid working with two different kinds of strings when working with user-visible text and with the application's data.

Currently, I'm working with std::string, with the intent of using UTF-8 manipulation functions only when necessary. It requires less memory, and seems to be the direction many applications are going anyway.

If you recommend a 16-bit encoding, which one: UTF-16? UCS-2? Another one?

Comments (8)

蓝礼 2024-07-11 18:52:34

UTF-16 is still a variable-length character encoding (there are more than 2^16 Unicode code points), so you can't do O(1) indexing by character. If you're doing lots of that sort of thing, you're not saving anything in speed over UTF-8. On the other hand, if your text includes a lot of code points in the U+0800-U+FFFF range, where UTF-8 needs three bytes and UTF-16 needs two, UTF-16 can be a substantial improvement in size. UCS-2 is a variation on UTF-16 that is fixed length, at the cost of prohibiting any code point at or above 2^16.

Without knowing more about your requirements, I would personally go for UTF-8. It's the easiest to deal with for all the reasons others have already listed.
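The size trade-off can be checked per code point. Here is a minimal sketch, not from the answer itself; the function names `utf8_bytes` and `utf16_bytes` are made up for illustration, and both assume a valid Unicode scalar value as input:

```cpp
#include <cassert>
#include <cstddef>

// Number of bytes needed to encode one code point in UTF-8.
// Assumes cp is a valid scalar value (<= 0x10FFFF, not a surrogate).
std::size_t utf8_bytes(char32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x80)    return 1;  // ASCII
    if (cp < 0x800)   return 2;  // e.g. Latin-1 letters, Greek, Cyrillic
    if (cp < 0x10000) return 3;  // rest of the Basic Multilingual Plane, e.g. CJK
    return 4;                    // supplementary planes
}

// Number of bytes needed to encode the same code point in UTF-16.
std::size_t utf16_bytes(char32_t cp) {
    assert(cp <= 0x10FFFF);
    return cp < 0x10000 ? 2 : 4; // one 16-bit unit, or a surrogate pair
}
```

So ASCII-heavy text (source code, markup) is half the size in UTF-8, CJK-heavy text is a third smaller in UTF-16, and anything outside the BMP costs four bytes either way.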

牵你的手,一向走下去 2024-07-11 18:52:34

I have never found any reasons to use anything else than UTF-8 to be honest.

み青杉依旧 2024-07-11 18:52:34

If you decide to go with UTF-8 encoding, check out this library: http://utfcpp.sourceforge.net/

It may make your life much easier.

夏至、离别 2024-07-11 18:52:34

I've actually written a widely used application (5 million+ users), so every kilobyte used adds up, literally. Despite that, I just stuck to wxString. I've configured it to be derived from std::wstring, so I can pass them to functions expecting a wstring const&.

Please note that std::wstring is native Unicode on the Mac: wchar_t is 4 bytes there, so no surrogate pairs are needed for characters at or above U+10000. The big advantage of this is that i++ gets you the next code point, always. On Win32, where wchar_t is 16 bits, that is true in only 99.9% of the cases. As a fellow programmer, you'll understand how little 99.9% is.
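For comparison, this is roughly what "get the next character" looks like on a UTF-8 std::string. It is only a sketch: `next_code_point` and `count_code_points` are hypothetical helper names, and the code assumes well-formed UTF-8 and does no validation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Decodes the code point that starts at byte offset i and advances i past it.
// Assumes s is well-formed UTF-8 and i sits on a sequence boundary.
char32_t next_code_point(const std::string& s, std::size_t& i) {
    assert(i < s.size());
    unsigned char lead = static_cast<unsigned char>(s[i++]);
    if (lead < 0x80) return lead;                  // 1-byte sequence (ASCII)

    // Number of continuation bytes implied by the lead byte.
    int extra = (lead >= 0xF0) ? 3 : (lead >= 0xE0) ? 2 : 1;
    char32_t cp = lead & (0x3F >> extra);          // payload bits of the lead byte
    while (extra-- > 0 && i < s.size()) {
        cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
    }
    return cp;
}

// Counts code points, which is generally not the same as s.size().
std::size_t count_code_points(const std::string& s) {
    std::size_t i = 0, n = 0;
    while (i < s.size()) {
        next_code_point(s, i);
        ++n;
    }
    return n;
}
```

With a 4-byte wchar_t the same loop is just `++i`, which is the convenience described above. Neither version handles combining sequences, so even "one code point" is not always one user-visible character.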

But if you're not convinced, write the function to uppercase a std::string[UTF-8] and a std::wstring. Those 2 functions will tell you which way is insanity.

Your on-disk format is another matter. For portability, that should be UTF-8. There's no endianness concern in UTF-8, nor a discussion over the width (2/4). This may be why many programs appear to use UTF-8.

On a slightly unrelated note, please read up on Unicode string comparisons and normalization. Otherwise you'll end up with the same bug as .NET, where you can have two variables föö and föö differing only in (invisible) normalization.
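To make the normalization problem concrete, here is a small sketch with the two spellings of "föö" written out as raw UTF-8 bytes. The variable and function names are invented for the example. A real fix would normalize both strings (for instance with a library such as ICU) before comparing, which is not shown here.

```cpp
#include <cassert>
#include <string>

// Precomposed form (NFC): each ö is the single code point U+00F6.
const std::string foo_nfc = "f\xC3\xB6\xC3\xB6";
// Decomposed form (NFD): each ö is 'o' followed by U+0308 COMBINING DIAERESIS.
const std::string foo_nfd = "fo\xCC\x88o\xCC\x88";

// Both strings render identically, but std::string compares bytes only.
void demo_normalization_mismatch() {
    assert(foo_nfc != foo_nfd);   // different byte sequences
    assert(foo_nfc.size() == 5);  // 1 + 2 + 2 bytes
    assert(foo_nfd.size() == 7);  // 1 + (1 + 2) + (1 + 2) bytes
}
```

For a programming language this matters for identifiers: without a normalization step, these two spellings would be treated as different names.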

习ぎ惯性依靠 2024-07-11 18:52:34

I would recommend UTF-16 for any kind of data manipulation and UI.
The Mac OS X and Win32 APIs use UTF-16, as do wxWidgets, Qt, ICU, Xerces, and others.
UTF-8 might be better for data interchange and storage.
See http://unicode.org/notes/tn12/.

But whatever you choose, I would definitely recommend against std::string with UTF-8 "only when necessary".

Go all the way with UTF-16 or UTF-8, but do not mix and match, that is asking for trouble.
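One way to avoid mixing is to convert exactly once, at the boundary between storage and the rest of the program. The following is a sketch of such a conversion under the assumption that the input is well-formed UTF-8; `append_utf16` and `utf8_to_utf16` are illustrative names, not part of any library mentioned here.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Appends one code point as UTF-16, using a surrogate pair at or above U+10000.
// Assumes cp is a valid scalar value.
void append_utf16(std::u16string& out, char32_t cp) {
    assert(cp <= 0x10FFFF);
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));  // low surrogate
    }
}

// Converts well-formed UTF-8 to UTF-16. The input is not validated.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char lead = static_cast<unsigned char>(in[i++]);
        char32_t cp = lead;  // covers the 1-byte (ASCII) case
        int extra = 0;       // continuation bytes still to read
        if (lead >= 0xF0)      { extra = 3; cp = lead & 0x07; }
        else if (lead >= 0xE0) { extra = 2; cp = lead & 0x0F; }
        else if (lead >= 0xC0) { extra = 1; cp = lead & 0x1F; }
        while (extra-- > 0 && i < in.size()) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i++]) & 0x3F);
        }
        append_utf16(out, cp);
    }
    return out;
}
```

If files are UTF-8 and the in-memory type is UTF-16, a function like this would be called only in the load path, with its inverse in the save path, so no other code ever sees both encodings.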

韬韬不绝 2024-07-11 18:52:34

MicroATX is pretty much a standard PC motherboard format, and most boards support 4-8 GB of RAM. If you're talking picoATX, maybe you're limited to 1-2 GB of RAM. Even then, that's plenty for a development environment. I'd still stick with UTF-8 for the reasons mentioned above, but memory shouldn't be your concern.

花心好男孩 2024-07-11 18:52:34

From what I've read, it's better to use a 16-bit encoding internally unless you're short on memory. It fits the characters of almost all living languages in a single 16-bit code unit.

I'd also look at ICU. If you're not going to be using certain STL features of strings, using the ICU string types might be better for you.

柠北森屋 2024-07-11 18:52:34

Have you considered using wxString? If I remember correctly, it can do UTF-8 <-> Unicode conversions, which will make things a bit easier when you have to pass strings to and from the UI.
