Why use UTF-32 instead of UTF-16 if we have surrogate pairs?
If I understand correctly, UTF-32 can handle every character in the universe. So can UTF-16, through the use of surrogate pairs. So is there any good reason to use UTF-32 instead of UTF-16?
7 Answers
In UTF-32 a Unicode character is always represented by 4 bytes, so parsing code is easier to write than for a UTF-16 string, where a character is represented by a varying number of bytes. On the downside, a UTF-32 character always requires 4 bytes, which can be wasteful if you are working mostly with, say, English characters. So whether to use UTF-16 or UTF-32 is a design choice that depends on your requirements.
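To make the parsing difference concrete, here is a minimal sketch in C (the function names are made up for illustration, and the input is assumed to be well formed): counting code points in UTF-32 is just the unit count, while UTF-16 has to look for high surrogates as it scans.

    #include <stddef.h>
    #include <stdint.h>

    /* UTF-32: every code point is exactly one 32-bit code unit. */
    size_t count_codepoints_utf32(const uint32_t *s, size_t units) {
        (void)s;              /* the content doesn't matter for the count */
        return units;
    }

    /* UTF-16: a code point is one unit, or two when the unit is a
       high surrogate (0xD800-0xDBFF) starting a surrogate pair. */
    size_t count_codepoints_utf16(const uint16_t *s, size_t units) {
        size_t count = 0;
        for (size_t i = 0; i < units; i++) {
            if (s[i] >= 0xD800 && s[i] <= 0xDBFF)
                i++;          /* skip the low surrogate that follows */
            count++;
        }
        return count;
    }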
Someone might prefer to deal with UTF-32 instead of UTF-16 because dealing with surrogate pairs is pretty much always handling 'special-cases', and having to deal with those special cases means you have areas where bugs may creep in because you deal with them incorrectly (or more likely just forget to deal with them at all).
If the increased memory usage of UTF-32 is not an issue, the reduced complexity might be enough of an advantage to choose it.
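To show what that special case looks like, here is a minimal sketch of decoding one code point from a UTF-16 buffer (hypothetical helper, well-formed input assumed, no handling of unpaired surrogates); the surrogate branch is exactly the code that is easy to get wrong or to forget.

    #include <stddef.h>
    #include <stdint.h>

    /* Decode the code point starting at s[*i] and advance *i past it. */
    uint32_t next_codepoint_utf16(const uint16_t *s, size_t *i) {
        uint16_t hi = s[(*i)++];
        if (hi >= 0xD800 && hi <= 0xDBFF) {      /* high surrogate */
            uint16_t lo = s[(*i)++];             /* low surrogate follows */
            return 0x10000 + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
        }
        return hi;                               /* BMP character: one unit */
    }

With UTF-32 the whole body collapses to a single return s[(*i)++], which is why the special case disappears.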
Here is some good documentation from the Unicode Consortium too:
Comparison of the Advantages of UTF-32, UTF-16, and UTF-8
Short answer: no.
Longer answer: yes, for compatibility with other things that didn't get the memo.
Less sarcastic answer: When you care more about speed of indexing than about space usage, or as an intermediate format of some sort, or on machines where alignment issues were more important than cache issues, or...
UTF-8 can also represent any Unicode character!
If your text is mostly English, you can save a lot of space by using UTF-8, but indexing characters is not O(1), because some characters take up more than just one byte.
If space is not as important to your situation as speed is, UTF-32 would suit you better, because indexing is O(1).
UTF-16 can be better than UTF-8 for non-English text, because in UTF-8 some characters take up 3 bytes, whereas in UTF-16 they'd only take up two.
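As a sketch of that size argument (hypothetical helpers, not from the answer): the UTF-8 length of a code point follows directly from its value, and everything from U+0800 up to U+FFFF, which covers most CJK text for example, takes 3 bytes in UTF-8 but only 2 in UTF-16.

    #include <stdint.h>

    /* Bytes needed to encode one code point in UTF-8. */
    int utf8_len(uint32_t cp) {
        if (cp < 0x80)    return 1;   /* ASCII */
        if (cp < 0x800)   return 2;   /* Latin supplements, Greek, Cyrillic, ... */
        if (cp < 0x10000) return 3;   /* rest of the BMP, e.g. most CJK */
        return 4;                     /* supplementary planes */
    }

    /* Bytes needed in UTF-16: 2 for the BMP, 4 (a surrogate pair) above it. */
    int utf16_len(uint32_t cp) {
        return cp < 0x10000 ? 2 : 4;
    }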
There are probably a few good reasons, but one would be to speed up indexing / searching, e.g. in databases and the like.
With UTF-32 you know that each character is 4 bytes. With UTF-16 you don't know what length any particular character will be.
For example, say you have a function that returns the nth char of a string.
If you are coding in a language that has direct memory access, say C, then in UTF-32 this function may be as simple as some pointer arithmetic (s + 4*index), which would be O(1). If you are using UTF-16 though, you would have to walk the string, decoding as you go, which would be O(n).
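A rough sketch of that contrast in C (hypothetical function names, bounds checks omitted, well-formed input assumed): the UTF-32 version is a single array index, while the UTF-16 version has to walk and decode.

    #include <stddef.h>
    #include <stdint.h>

    /* UTF-32: the nth code point is simply s[n] -- O(1). */
    uint32_t nth_codepoint_utf32(const uint32_t *s, size_t n) {
        return s[n];
    }

    /* UTF-16: walk from the start, skipping surrogate pairs -- O(n). */
    uint32_t nth_codepoint_utf16(const uint16_t *s, size_t n) {
        size_t i = 0;
        for (size_t seen = 0; seen < n; seen++)
            i += (s[i] >= 0xD800 && s[i] <= 0xDBFF) ? 2 : 1;   /* pair = 2 units */
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF)
            return 0x10000 + (((uint32_t)(s[i] - 0xD800) << 10) | (uint32_t)(s[i + 1] - 0xDC00));
        return s[i];
    }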
In general, you just use the string datatype/encoding of the underlying platform, which is often (Windows, Java, Cocoa...) UTF-16 and sometimes UTF-8 or UTF-32. This is mostly for historical reasons; there is little difference between the three Unicode encodings: all three are well defined, fast and robust, and all of them can encode every Unicode code point sequence. The unique feature of UTF-32, that it is a fixed-width encoding (meaning that each code point is represented by exactly one code unit), is of little use in practice: your memory management layer needs to know about the number and width of code units, while users are interested in abstract characters and graphemes. As the Unicode standard mentions, Unicode applications have to deal with combining characters, ligatures and so on anyway, and the handling of surrogate pairs, despite being conceptually different, can be done within the same technical framework.
If I were to reinvent the world, I'd probably go for UTF-32 because it is simply the least complex encoding, but as it stands the differences are too small to be of practical concern.
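As a small illustration of why a fixed-width encoding still doesn't give "one unit per character" (example added for illustration, not from the answer): the user-perceived character "é" can be written as U+0065 followed by the combining acute accent U+0301, which is two code points, and therefore two code units even in UTF-32.

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        /* "é" as a base letter plus a combining accent: two code points,
           hence two UTF-32 code units, but one grapheme for the user. */
        uint32_t e_acute[] = { 0x0065, 0x0301 };
        printf("UTF-32 code units: %zu\n", sizeof e_acute / sizeof e_acute[0]);
        return 0;
    }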