UTF usage in C++ code

Posted on 2024-07-06 20:21:13

What is the difference between UTF and UCS?

What are the best ways to represent non-European character sets (using UTF) in C++ strings? I would like to know your recommendations for:

  • Internal representation inside the code
    • For string manipulation at run-time
    • For using the string for display purposes
  • Best storage representation (i.e. in a file)
  • Best on-wire transport format (transfer between applications that may be on different architectures and have different standard locales)

Comments (5)

岛歌少女 2024-07-13 20:21:13

What is the difference between UTF and UCS?

UCS encodings are fixed width, and are named for how many bytes are used for each character. For example, UCS-2 requires 2 bytes per character. Characters with code points outside the available range can't be encoded in a UCS encoding.

UTF encodings are variable width, and are named for the minimum number of bits needed to store a character. For example, UTF-16 requires at least 16 bits (2 bytes) per character. Characters with large code points are encoded using a larger number of bytes -- 4 bytes in UTF-16 for "astral" characters, i.e. those outside the basic multilingual plane (BMP).
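To make the fixed-width versus variable-width distinction concrete, here is a minimal sketch of a UTF-16 encoder. The function name `encode_utf16` is made up for illustration, and the code assumes its input is a valid code point (at most U+10FFFF and not itself a surrogate).

```cpp
#include <vector>

// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
// Assumes cp is a valid scalar value (<= 0x10FFFF, not a surrogate).
std::vector<char16_t> encode_utf16(char32_t cp) {
    if (cp < 0x10000) {
        // Inside the BMP: one 16-bit unit, the same value UCS-2 would store.
        return { static_cast<char16_t>(cp) };
    }
    // Outside the BMP: split the 20-bit offset into a surrogate pair.
    char32_t v = cp - 0x10000;
    char16_t high = static_cast<char16_t>(0xD800 + (v >> 10));
    char16_t low  = static_cast<char16_t>(0xDC00 + (v & 0x3FF));
    return { high, low };
}
```

A BMP character such as U+4F60 comes back as a single unit, while U+1F600 comes back as two units (4 bytes), which is exactly the case UCS-2 cannot represent.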

  • Internal representation inside the code
  • Best storage representation (i.e. in a file)
  • Best on-wire transport format (transfer between applications that may be on different architectures and have different standard locales)

For modern systems, the most reasonable storage and transport encoding is UTF-8. There are special cases where others might be appropriate -- UTF-7 for old mail servers, UTF-16 for poorly-written text editors -- but UTF-8 is most common.
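As a rough illustration of what writing UTF-8 to a file or socket involves, the sketch below encodes a single code point into its byte sequence. `encode_utf8` is a hypothetical helper, not a standard library function, and it assumes the input is a valid code point.

```cpp
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-8 bytes.
// Assumes cp is a valid scalar value (<= 0x10FFFF, not a surrogate).
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {                       // 1 byte: plain ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: rest of the BMP
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: outside the BMP
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

The output is a sequence of single bytes, so it reads the same on any architecture, and ASCII input stays byte-for-byte unchanged.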

The preferred internal representation depends on your platform. On Windows it is UTF-16; on UNIX-like systems it is usually UCS-4 (a 32-bit wchar_t). Each has its good points:

  • UTF-16 strings never use more memory than a UCS-4 string. If you store many large strings with characters primarily in the basic multi-lingual plane (BMP), UTF-16 will require much less space than UCS-4. Outside the BMP, it will use the same amount.
  • UCS-4 is easier to reason about. Because a UTF-16 character outside the BMP is encoded as a "surrogate pair" of two 16-bit units, it can be challenging to correctly split or render a string. UCS-4 text does not have this issue. UCS-4 also acts much like ASCII text in "char" arrays, so existing text algorithms can be ported easily.
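The trade-off in the two points above can be sketched as follows. With UTF-16 the number of characters is not simply the number of 16-bit units, so code has to recognise surrogates. `count_code_points` is an illustrative name, and the sketch assumes well-formed UTF-16 (every surrogate correctly paired).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in a UTF-16 string.
// Assumes well-formed input, i.e. every surrogate is part of a valid pair.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (char16_t u : s) {
        // A low surrogate (0xDC00..0xDFFF) is the second half of a pair,
        // so it does not start a new character.
        if (u < 0xDC00 || u > 0xDFFF) {
            ++n;
        }
    }
    return n;
}
```

For the text "a" followed by U+1F600, a std::u16string holds 3 units but 2 characters, whereas the same text in a std::u32string has size() equal to 2 with no scanning needed.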

Finally, some systems use UTF-8 as an internal format. This is good if you need to inter-operate with existing ASCII- or ISO-8859-based systems, because NUL (zero) bytes never appear in the middle of UTF-8 text, whereas they do in UTF-16 and UCS-4.

许久 2024-07-13 20:21:13

I would suggest:

  • For representation in code, wchar_t or equivalent.
  • For storage representation, UTF-8.
  • For wire representation, UTF-8.

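The advantage of UTF-8 in storage and wire situations is that machine endianness is not a factor. The advantage of using a fixed-size character such as wchar_t in code is that you can easily find out the length of a string without having to scan it. (This holds only where wchar_t is wide enough for every code point, as with a 32-bit wchar_t; a 16-bit wchar_t holds UTF-16, where characters outside the BMP take two units.)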
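For comparison, this is roughly what "scanning" means for a UTF-8 string: the character count has to be computed by walking the bytes. `utf8_length` is a made-up name, and the sketch assumes the input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in a UTF-8 string.
// Assumes the input is valid UTF-8.
std::size_t utf8_length(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char c : s) {
        // Continuation bytes look like 10xxxxxx; every other byte
        // starts a new code point.
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

A two-character CJK string stored as the six bytes E4 BD A0 E5 A5 BD has size() equal to 6, but utf8_length reports 2.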

淡写薰衣草的香 2024-07-13 20:21:13

UTC is Coordinated Universal Time, not a character set (I didn't find any charset called UTC).

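For internal representation, you may want to use wchar_t for each character, and std::wstring for strings. Where wchar_t is 2 bytes wide (as on Windows), every BMP character occupies exactly one element, so seeking and random access will be fast. Note that the size of wchar_t is implementation-defined; it is 4 bytes on most UNIX-like systems.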
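Since the width of wchar_t varies by platform, it can be worth checking rather than assuming. The two helpers below are illustrative names only; they report how wide wchar_t is and whether one element can hold any code point up to U+10FFFF.

```cpp
#include <climits>
#include <limits>

// Hypothetical helper: number of bits in wchar_t on this platform.
constexpr unsigned wchar_bits() {
    return static_cast<unsigned>(sizeof(wchar_t) * CHAR_BIT);
}

// Hypothetical helper: true when one wchar_t can hold any code point
// up to U+10FFFF, i.e. when std::wstring is effectively fixed width.
constexpr bool wchar_holds_any_code_point() {
    return static_cast<unsigned long>(std::numeric_limits<wchar_t>::max())
           >= 0x10FFFFul;
}
```

Code that relies on one element per character could guard itself with static_assert(wchar_holds_any_code_point(), "...") so that a build with 16-bit wchar_t fails loudly instead of mishandling surrogates.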

For storage, if most of the data are not ASCII (i.e. code >= 128), you may want to use UTF-16, which is almost the same as a serialized wstring on platforms where wchar_t is 16 bits wide.

Since UTF-16 can be little endian or big endian, for wire transport, try to convert it to UTF-8, which is architecture-independent.
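The endianness issue can be seen by serializing UTF-16 by hand. This is a small sketch with an invented function name, `utf16_to_bytes`, that writes each 16-bit unit in an explicitly chosen byte order.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: serialize UTF-16 code units to bytes in an
// explicit byte order, independent of the host architecture.
std::vector<std::uint8_t> utf16_to_bytes(const std::u16string& s,
                                         bool big_endian) {
    std::vector<std::uint8_t> out;
    out.reserve(s.size() * 2);
    for (char16_t u : s) {
        std::uint8_t hi = static_cast<std::uint8_t>(u >> 8);
        std::uint8_t lo = static_cast<std::uint8_t>(u & 0xFF);
        if (big_endian) {
            out.push_back(hi);
            out.push_back(lo);
        } else {
            out.push_back(lo);
            out.push_back(hi);
        }
    }
    return out;
}
```

The letter "A" becomes 00 41 in big-endian order and 41 00 in little-endian order, so the receiver must know which was used (or read a byte order mark). In UTF-8 the same letter is the single byte 41 either way.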

把昨日还给我 2024-07-13 20:21:13

For the internal representation inside the code, you had better write both European and non-European characters as universal character names:

\uNNNN

Characters in the range \u0020 to \u007E, and a little bit of whitespace (e.g. end of line), can be written as ordinary characters. If you write anything at or above \u0080 as an ordinary character, the code will compile correctly only under your own code page (e.g. OK in France but breaking in Russia, OK in Russia but breaking in Japan, OK in China but breaking in the US, etc.).
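A minimal sketch of the \uNNNN approach, using a char32_t literal so the result does not depend on the compiler's narrow execution character set. The function name `greeting` is invented for the example.

```cpp
#include <string>

// Hypothetical example: two CJK characters (U+4F60, U+597D) spelled with
// universal character names, so the source file itself stays pure ASCII
// and compiles the same way under any code page.
inline std::u32string greeting() {
    return U"\u4F60\u597D";
}
```

The returned string has two elements, 0x4F60 and 0x597D, regardless of the locale or code page of the machine that compiled it.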
