UTF usage in C++ code
What is the difference between UTF and UCS?
What are the best ways to represent non-European character sets (using UTF) in C++ strings? I would like to know your recommendations for:
- Internal representation inside the code
- String manipulation at run-time
- Using the string for display purposes
- Best storage representation (i.e. in a file)
- Best on-wire transport format (transfer between applications that may be on different architectures and have a different standard locale)
UCS encodings are fixed width, and are marked by how many bytes are used for each character. For example, UCS-2 requires 2 bytes per character. Characters with code points outside the available range can't be encoded in a UCS encoding.
UTF encodings are marked by the size in bits of their code unit, and UTF-8 and UTF-16 are variable width. For example, UTF-16 requires at least 16 bits (2 bytes) per character. Characters with large code points are encoded using a larger number of bytes -- 4 bytes for astral characters (those above U+FFFF) in UTF-16.
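To make the 2-byte versus 4-byte distinction concrete, here is a minimal sketch of UTF-16 encoding for a single code point. The function name encode_utf16 is made up for this illustration, and the sketch assumes the input is a valid Unicode scalar value (no validation is done).

```cpp
#include <string>

// Hypothetical helper: encode one code point as UTF-16 code units.
// Assumes cp is a valid scalar value (at most 0x10FFFF, not a surrogate).
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one 16-bit code unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Astral character: split the remaining 20 bits into a surrogate pair.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));   // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF))); // low surrogate
    }
    return out;
}
```

With this sketch, U+4E2D yields one code unit, while U+1F600 yields the pair 0xD83D 0xDE00, i.e. 4 bytes.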
For modern systems, the most reasonable storage and transport encoding is UTF-8. There are special cases where others might be appropriate -- UTF-7 for old mail servers, UTF-16 for poorly-written text editors -- but UTF-8 is most common.
Preferred internal representation will depend on your platform. On Windows, it is UTF-16. On UNIX-like systems, it is usually UCS-4 (a 32-bit wchar_t). Each has its good points.
Finally, some systems use UTF-8 as an internal format. This is good if you need to inter-operate with existing ASCII- or ISO-8859-based systems, because NUL (zero) bytes never appear in the middle of UTF-8 text, whereas they do appear in UTF-16 and UCS-4.
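The point about zero bytes can be checked with a small UTF-8 encoder. This is only a sketch: encode_utf8 is a name invented here, and the input is assumed to be a valid scalar value.

```cpp
#include <string>

// Hypothetical helper: encode one code point as UTF-8 bytes.
// Assumes cp is a valid scalar value (at most 0x10FFFF, not a surrogate).
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);                          // 0xxxxxxx (ASCII unchanged)
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));           // 1110xxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 11110xxx
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Every byte of a multi-byte sequence has its high bit set, so the only way to get a zero byte is to encode U+0000 itself. That is why C-style, NUL-terminated string functions keep working on UTF-8 data.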
Have you read Joel Spolsky's article on The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)?
I would suggest wchar_t or equivalent in code.
The advantage of UTF-8 in storage and wire situations is that machine endianness is not a factor. The advantage of using a fixed-size character such as wchar_t in code is that you can easily find out the length of a string without having to scan it.
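As a rough illustration of the length argument, the sketch below counts code points in a UTF-8 string, which requires looking at every byte. The function name count_code_points_utf8 is hypothetical, and the input is assumed to be well-formed UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count code points in well-formed UTF-8 text.
// Continuation bytes have the form 10xxxxxx and are skipped, so only
// the first byte of each encoded character is counted.
std::size_t count_code_points_utf8(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

By contrast, a std::u32string holding the same text reports its code point count directly through size(). Note that this only holds for a truly fixed-width type: where wchar_t is 16 bits (as on Windows), characters above U+FFFF occupy two units, so std::wstring::size() counts code units rather than characters.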
UTC is Coordinated Universal Time, not a character set (I didn't find any charset called UTC).
For internal representation, you may want to use wchar_t for each character, and std::wstring for strings. wchar_t has a fixed size per code unit (2 bytes on Windows, typically 4 bytes on UNIX-like systems), so seeking and random access will be fast.
For storage, if most of the data are not ASCII (i.e. code >= 128), you may want to use UTF-16, which on platforms with a 2-byte wchar_t is almost the same as serialized wstring and wchar_t.
Since UTF-16 can be little endian or big endian, for wire transport, try to convert it to UTF-8, which is architecture-independent.
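The endianness concern can be seen by serializing the same UTF-16 text in both byte orders. This is a minimal sketch; to_utf16be and to_utf16le are names made up for the example, and no byte order mark is written.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: serialize UTF-16 code units most significant byte first.
std::vector<std::uint8_t> to_utf16be(const std::u16string& s) {
    std::vector<std::uint8_t> out;
    out.reserve(s.size() * 2);
    for (char16_t unit : s) {
        out.push_back(static_cast<std::uint8_t>(unit >> 8));
        out.push_back(static_cast<std::uint8_t>(unit & 0xFF));
    }
    return out;
}

// Hypothetical helper: serialize UTF-16 code units least significant byte first.
std::vector<std::uint8_t> to_utf16le(const std::u16string& s) {
    std::vector<std::uint8_t> out;
    out.reserve(s.size() * 2);
    for (char16_t unit : s) {
        out.push_back(static_cast<std::uint8_t>(unit & 0xFF));
        out.push_back(static_cast<std::uint8_t>(unit >> 8));
    }
    return out;
}
```

For U+4E2D the two functions produce 4E 2D and 2D 4E respectively, so sender and receiver must agree on the order (or use a byte order mark). UTF-8 is defined as a byte sequence, so the same text always serializes to the same bytes.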
For the representation inside the source code, you'd better write both European and non-European characters as universal character names of the form \uNNNN.
Characters in the range \u0020 to \u007E, and a little bit of whitespace (e.g. end of line), can be written as ordinary characters. For anything from \u0080 upward, if you write it as an ordinary character, it will compile correctly only in your code page (e.g. OK in France but breaking in Russia, OK in Russia but breaking in Japan, OK in China but breaking in the US, etc.).
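A short sketch of what this looks like in practice, using the char16_t and char32_t string literals available since C++11. The function names are invented for this example; the point is that the source file itself stays pure ASCII.

```cpp
#include <string>

// Hypothetical example: U+4E2D U+6587 written with \u escapes, so the
// source file contains only ASCII and compiles the same under any code page.
std::u32string sample_text_utf32() {
    return U"\u4E2D\u6587";
}

// Hypothetical example: U+00E9 (e with acute accent) as a UTF-16 literal.
std::u16string sample_text_utf16() {
    return u"\u00E9";
}
```

Code points above U+FFFF need the eight-digit form \UNNNNNNNN instead of \uNNNN.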