Is there any reason not to use UTF-8, UTF-16, etc. for everything?
I know the web has mostly been standardizing towards UTF-8 lately, and I was just wondering if there is any place where using UTF-8 would be a bad thing. I've heard the argument that UTF-8, 16, etc. may use more space, but in the end it has been negligible.
Also, what about in Windows programs, Linux shell and things of that nature -- can you safely use UTF-8 there?
If UTF-32 is available, prefer that over the other versions for processing.
If your platform supports UTF-32/UCS-4 Unicode natively, then the "compressed" versions UTF-8 and UTF-16 may be slower, because they use a varying number of bytes for each character (character sequence), which makes it impossible to do a direct lookup in a string by index, while UTF-32 uses a flat 32 bits for each character, speeding up some string operations a lot.
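To illustrate the point, here is a minimal sketch using Python's built-in codecs (the sample string is arbitrary): UTF-32 spends a fixed 4 bytes per code point, so the nth code point sits at a fixed byte offset, while UTF-8 and UTF-16 use a variable number of bytes per code point.

    text = "a€𝄞"  # 1-byte, 3-byte and 4-byte characters in UTF-8

    for codec in ("utf-8", "utf-16-le", "utf-32-le"):
        # bytes used per code point, then the total size
        sizes = [len(ch.encode(codec)) for ch in text]
        print(codec, sizes, "total:", len(text.encode(codec)), "bytes")

Only the UTF-32 column is constant, which is what makes indexing by code point a simple multiplication.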
Of course, if you are programming in a very restricted environment like, say, embedded systems, and can be certain there will only ever be ASCII or ISO 8859-x characters around, then you can choose those charsets for efficiency and speed. But in general, stick with the Unicode Transformation Formats.
When you need to write a program (performing string manipulations) that needs to be very fast and you are sure you won't need exotic characters, UTF-8 may not be the best idea. In every other situation, UTF-8 should be the standard.
UTF-8 works well with almost all recent software, even on Windows.
There is an argument to be made that adding unnecessary conversions is adding complexity for little benefit. So if your inputs and your outputs use the same format then there is an argument for working in that format too.
Both UTF-8 and UTF-16 are relatively well-designed multi-unit encodings. A smaller sequence of code units never appears as a sub-sequence of a longer sequence and a decoder that detects an error can resume decoding at the next valid code unit.
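A small sketch of that recovery property in Python (the corrupted byte position is arbitrary): damaging one byte in a UTF-8 stream only affects that position, and decoding picks up again at the next valid code unit.

    good = "résumé".encode("utf-8")
    corrupted = good[:2] + b"\xff" + good[3:]   # overwrite one continuation byte

    # the damaged position becomes U+FFFD replacement characters;
    # the rest of the string decodes normally
    print(corrupted.decode("utf-8", errors="replace"))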
Some argue that UTF-32 is "better" because it uses one code unit for every Unicode code point. What makes this more questionable, though, is that there is not a 1:1 mapping between Unicode code points and what most users would regard as "characters". So being able to rapidly get the nth code point from a sequence is less useful than it would first appear.
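For example (a quick Python demonstration using a combining accent), both strings below render as "é" for most users, yet they contain different numbers of code points:

    import unicodedata

    precomposed = "\u00e9"   # 'é' as a single code point
    decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

    print(len(precomposed), len(decomposed))                         # 1 vs 2 code points
    print(unicodedata.normalize("NFC", decomposed) == precomposed)   # True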
Windows and Unix-like systems took different approaches to the introduction of Unicode. Both approaches had their pros and cons.
Windows introduced 16-bit Unicode (initially UCS-2, later UTF-16) by introducing a parallel set of APIs. Applications or frameworks that wanted Unicode support had to switch to the new APIs. This was further complicated by the fact that while Windows NT offered Unicode support in all APIs, Windows 9x only offered it in a subset.
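As a concrete illustration (a Windows-only sketch using Python's ctypes; the message text is arbitrary), many Win32 calls still exist as such an ANSI/wide pair, with the "W" variant taking UTF-16 strings:

    import ctypes

    user32 = ctypes.windll.user32   # only available on Windows

    # MessageBoxA takes byte strings in the current ANSI code page,
    # MessageBoxW takes UTF-16 (wide) strings.
    user32.MessageBoxA(None, b"ANSI text", b"A variant", 0)
    user32.MessageBoxW(None, "Unicode text \u2713", "W variant", 0)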
On the filesystem side, Windows NT's native NTFS filesystem used 16-bit Unicode filenames from the start. For the FAT filesystem, which pre-dated Windows NT, Unicode was introduced as part of long filename support. Similarly for CDs, the Joliet extension added Unicode long filenames.
Unix-like systems, on the other hand, introduced Unicode by using UTF-8 and treating it like any other extended-ASCII character set. Filenames on Unix filesystems have always been sequences of bytes, where the meaning assigned to those bytes is down to the user's environment.
There are pros and cons to both approaches. The Unix approach allowed even non-Unicode-aware programs to handle Unicode text to some extent. On the other hand, it meant users essentially had to choose between a "Unicode" environment, where everything was UTF-8 and any pre-Unicode files needed conversion, and a "legacy" environment, where Unicode was not supported.
Some programming languages or frameworks will attempt to settle on an encoding and convert everything to that encoding. This is, however, complicated by the fact that on both Windows and Unix-like systems a program may encounter strings from the operating system that do not pass validation for their nominal encoding. This can happen for a number of reasons, including legacy data from pre-transition software, truncation that does not take account of multi-unit encodings, use of nominally text strings to pass non-text data, and just plain old errors.
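One way this shows up in practice (a sketch assuming a POSIX-style system; the filename bytes are hypothetical): Python represents a filename that is not valid UTF-8 using the "surrogateescape" error handler, so the original bytes can still be recovered even though the string never validated as UTF-8.

    import os

    raw_name = b"report-\xff.txt"            # not valid UTF-8
    as_str = os.fsdecode(raw_name)           # decodes with errors="surrogateescape"

    print(repr(as_str))                      # 'report-\udcff.txt'
    print(os.fsencode(as_str) == raw_name)   # True: original bytes round-trip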
It is well known that UTF-8 works best for file storage and network transport. But people debate whether UTF-16/32 are better for processing. One major argument is that UTF-16 is still variable length and even UTF-32 is still not one code point per character, so how are they better than UTF-8? My opinion is that UTF-16 is a very good compromise.
First, the characters outside the BMP, which need two code units (a surrogate pair) in UTF-16, are extremely rarely used. The Chinese characters (and some other Asian characters) in that range are basically dead ones. Ordinary people won't use them at all, except for experts who use them to digitize ancient books. So UTF-32 will be a waste most of the time. Don't worry too much about those characters: they won't make your software look bad even if you don't handle them properly, as long as your software is not aimed at those special users.
Second, we often need string memory allocation to be related to character count. For example, a database string column for 10 characters (assuming we store Unicode strings in normalized form) will be 20 bytes for UTF-16. In most cases it will work just like that, except in extreme cases it will hold only 5-8 characters. But for UTF-8, the common byte length of one character is 1-3 for Western languages and 3-5 for Asian languages. This means we need 10-50 bytes even for the common cases. More data, more processing.
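A rough size comparison (a sketch with arbitrary sample strings; real ratios depend on the text), counting encoded bytes for UTF-8 versus UTF-16:

    samples = {
        "Western": "encoding",       # ASCII letters: 1 byte each in UTF-8
        "Accented": "Überraschung",  # mostly 1 byte, the umlaut takes 2 in UTF-8
        "Chinese": "统一码字符编码",   # BMP CJK: 3 bytes each in UTF-8, 2 in UTF-16
    }

    for label, text in samples.items():
        print(f"{label}: {len(text)} chars, "
              f"utf-8 {len(text.encode('utf-8'))} bytes, "
              f"utf-16 {len(text.encode('utf-16-le'))} bytes")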