If we have surrogate pairs, why use UTF-32 instead of UTF-16?

Posted on 2024-07-14 12:53:48

If I understand correctly, UTF-32 can handle every character in the universe. So can UTF-16, through the use of surrogate pairs. So is there any good reason to use UTF-32 instead of UTF-16?

Comments (7)

我爱人 2024-07-21 12:53:48

In UTF-32, a Unicode character is always represented by 4 bytes, so parsing code is easier to write than for a UTF-16 string, where a character takes a varying number of bytes (2 or 4). On the downside, a UTF-32 character always requires 4 bytes, which can be wasteful if you are working mostly with, say, English characters. So whether to use UTF-16 or UTF-32 is a design choice that depends on your requirements.
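To make the fixed-width versus variable-width point concrete, here is a minimal Python sketch. The helper name `code_unit_counts` and the sample string are illustrative assumptions, not part of the original answer; only the standard `str.encode` codecs are used.

```python
def code_unit_counts(text):
    """For each character, count the code units it needs in UTF-16 and UTF-32."""
    return {
        # UTF-16 code units are 2 bytes wide, UTF-32 code units are 4 bytes wide.
        "utf-16": [len(ch.encode("utf-16-le")) // 2 for ch in text],
        "utf-32": [len(ch.encode("utf-32-le")) // 4 for ch in text],
    }


# 'a', U+00E9 (e with acute), U+20AC (euro sign), U+1F600 (an emoji outside the BMP)
sample = "a\u00e9\u20ac\U0001F600"
counts = code_unit_counts(sample)

print(counts["utf-16"])  # [1, 1, 1, 2]  (the emoji needs a surrogate pair)
print(counts["utf-32"])  # [1, 1, 1, 1]  (always exactly one code unit)

# The cost of fixed width: plain English text doubles in size.
utf16_size = len("hello".encode("utf-16-le"))  # 10 bytes
utf32_size = len("hello".encode("utf-32-le"))  # 20 bytes
print(utf16_size, utf32_size)
```

The `-le` codec variants are used so that no byte order mark is added to the byte counts.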

萌吟 2024-07-21 12:53:48

Someone might prefer to deal with UTF-32 instead of UTF-16 because dealing with surrogate pairs is pretty much always handling 'special-cases', and having to deal with those special cases means you have areas where bugs may creep in because you deal with them incorrectly (or more likely just forget to deal with them at all).

If the increased memory usage of UTF-32 is not an issue, the reduced complexity might be enough of an advantage to choose it.
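As a rough illustration of the "special case" in question, the sketch below encodes and decodes surrogate pairs by hand. The function names `to_utf16_units` and `decode_utf16_units` are hypothetical, and the decoder simply passes unpaired surrogates through instead of reporting an error, which a production decoder would probably not do.

```python
def to_utf16_units(code_point):
    """Return the UTF-16 code units for one code point."""
    if code_point < 0x10000:
        return [code_point]
    v = code_point - 0x10000
    # High surrogate carries the top 10 bits, low surrogate the bottom 10 bits.
    return [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]


def decode_utf16_units(units):
    """Turn a list of UTF-16 code units back into code points."""
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        has_next = i + 1 < len(units)
        if 0xD800 <= unit <= 0xDBFF and has_next and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # The special case: two code units form one code point.
            code_points.append(0x10000 + ((unit - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            # BMP code point (unpaired surrogates are passed through in this sketch).
            code_points.append(unit)
            i += 1
    return code_points


units = [0x48] + to_utf16_units(0x1F600)   # 'H' followed by U+1F600
print([hex(u) for u in units])             # ['0x48', '0xd83d', '0xde00']
print(len(units))                          # 3 code units...
print(len(decode_utf16_units(units)))      # ...but only 2 code points
```

Code that forgets the pair branch and treats every code unit as a character would report three characters here, which is the kind of bug the answer warns about.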

人│生佛魔见 2024-07-21 12:53:48

The Unicode Consortium also has good documentation on this.

Comparison of the Advantages of UTF-32, UTF-16, and UTF-8

Copyright © 1991–2009 Unicode, Inc. The Unicode Standard, Version 5.2

On the face of it, UTF-32 would seem to be the obvious choice of Unicode encoding forms for an internal processing code because it is a fixed-width encoding form. It can be conformantly bound to the C and C++ wchar_t, which means that such programming languages may offer built-in support and ready-made string APIs that programmers can take advantage of. However, UTF-16 has many countervailing advantages that may lead implementers to choose it instead as an internal processing code.
While all three encoding forms need at most 4 bytes (or 32 bits) of data for each character, in practice UTF-32 in almost all cases for real data sets occupies twice the storage that UTF-16 requires. Therefore, a common strategy is to have internal string storage use UTF-16 or UTF-8 but to use UTF-32 when manipulating individual characters.

UTF-32 Versus UTF-16. On average, more than 99 percent of all UTF-16 data is expressed using single code units. This includes nearly all of the typical characters that software needs to handle with special operations on text—for example, format control characters. As a consequence, most text scanning operations do not need to unpack UTF-16 surrogate pairs at all, but rather can safely treat them as an opaque part of a character string.
For many operations, UTF-16 is as easy to handle as UTF-32, and the performance of UTF-16 as a processing code tends to be quite good. UTF-16 is the internal processing code of choice for a majority of implementations supporting Unicode. Other than for Unix platforms, UTF-16 provides the right mix of compact size with the ability to handle the occasional character outside the BMP.
UTF-32 has somewhat of an advantage when it comes to simplicity of software coding design and maintenance. Because the character handling is fixed width, UTF-32 processing does not require maintaining branches in the software to test and process the double code unit elements required for supplementary characters by UTF-16. Conversely, 32-bit indices into large tables are not particularly memory efficient. To avoid the large memory penalties of such indices, Unicode tables are often handled as multistage tables (see “Multistage Tables” in Section 5.1, Transcoding to Other Standards). In such cases, the 32-bit code point values are sliced into smaller ranges to permit segmented access to the tables. This is true even in typical UTF-32 implementations.
The performance of UTF-32 as a processing code may actually be worse than the performance of UTF-16 for the same data, because the additional memory overhead means that cache limits will be exceeded more often and memory paging will occur more frequently. For systems with processor designs that impose penalties for 16-bit aligned access but have very large memories, this effect may be less noticeable.
In any event, Unicode code points do not necessarily match user expectations for “characters.” For example, the following are not represented by a single code point: a combining character sequence; a conjoining jamo sequence for Korean; or the Devanagari conjunct “ksha.” Because some Unicode text processing must be aware of and handle such sequences of characters as text elements, the fixed-width encoding form advantage of UTF-32 is somewhat offset by the inherently variable-width nature of processing text elements. See Unicode Technical Standard #18, “Unicode Regular Expressions,” for an example where commonly implemented processes deal with inherently variable-width text elements owing to user expectations of the identity of a “character.”
UTF-8. UTF-8 is reasonably compact in terms of the number of bytes used. It is really only at a significant size disadvantage when used for East Asian implementations such as Chinese, Japanese, and Korean, which use Han ideographs or Hangul syllables requiring three-byte code unit sequences in UTF-8. UTF-8 is also significantly less efficient in terms of processing than the other encoding forms.
Binary Sorting. A binary sort of UTF-8 strings gives the same ordering as a binary sort of Unicode code points. This is obviously the same order as for a binary sort of UTF-32 strings. All three encoding forms give the same results for binary string comparisons or string sorting when dealing only with BMP characters (in the range U+0000..U+FFFF). However, when dealing with supplementary characters (in the range U+10000..U+10FFFF), UTF-16 binary order does not match Unicode code point order. This can lead to complications when trying to interoperate with binary sorted lists, for example between UTF-16 systems and UTF-8 or UTF-32 systems. However, for data that is sorted according to the conventions of a specific language or locale rather than using binary order, data will be ordered the same, regardless of the encoding form.
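The binary sorting remark at the end of the excerpt can be checked with a small Python sketch. The two sample characters and the helper `utf16_units` are choices made here for illustration; the sketch relies on Python comparing `str` values by code point and compares big-endian bytes (or code unit lists) so that byte order matches numeric order.

```python
def utf16_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


bmp = "\uFFFD"             # U+FFFD, near the top of the BMP
supp = "\U00010000"        # U+10000, the first supplementary code point
items = [supp, bmp]

by_code_point = sorted(items)                                   # str compares by code point
by_utf8 = sorted(items, key=lambda s: s.encode("utf-8"))        # EF BF BD vs F0 90 80 80
by_utf32 = sorted(items, key=lambda s: s.encode("utf-32-be"))   # 0000FFFD vs 00010000
by_utf16 = sorted(items, key=utf16_units)                       # [FFFD] vs [D800, DC00]

print(by_code_point == [bmp, supp])  # True
print(by_utf8 == by_code_point)      # True
print(by_utf32 == by_code_point)     # True
print(by_utf16 == by_code_point)     # False: D800 sorts before FFFD
```

The mismatch arises because surrogate code units (D800..DFFF) sit numerically below the BMP range E000..FFFF, even though the code points they encode are all above U+FFFF.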

醉南桥 2024-07-21 12:53:48

Short answer: no.

Longer answer: yes, for compatibility with other things that didn't get the memo.

Less sarcastic answer: When you care more about speed of indexing than about space usage, or as an intermediate format of some sort, or on machines where alignment issues are more important than cache issues, or...

时光暖心i 2024-07-21 12:53:48

UTF-8 can also represent any Unicode character!

If your text is mostly English, you can save a lot of space by using UTF-8, but indexing characters is not O(1), because some characters take up more than one byte.

If space is not as important to your situation as speed is, UTF-32 would suit you better, because indexing is O(1).

UTF-16 can be better than UTF-8 for non-English text, because in UTF-8 some characters take up 3 bytes, whereas in UTF-16 they take up only two bytes.
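The size trade-off described above can be measured directly. This is a small sketch; the helper name `encoded_sizes` and the two sample strings are assumptions chosen for illustration.

```python
def encoded_sizes(text):
    """Return the byte length of text in each encoding form (no byte order mark)."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


english = "hello world"                  # 11 ASCII characters
chinese = "\u4f60\u597d\u4e16\u754c"     # 4 Han ideographs from the BMP

english_sizes = encoded_sizes(english)
chinese_sizes = encoded_sizes(chinese)

print(english_sizes)  # {'utf-8': 11, 'utf-16-le': 22, 'utf-32-le': 44}
print(chinese_sizes)  # {'utf-8': 12, 'utf-16-le': 8, 'utf-32-le': 16}
```

For the ASCII sample UTF-8 is the smallest, while for the Han sample UTF-16 is the smallest. UTF-32 is the largest in both cases.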

会发光的星星闪亮亮i 2024-07-21 12:53:48

There are probably a few good reasons, but one would be to speed up indexing and searching, e.g. in databases and the like.

With UTF-32 you know that each character is 4 bytes. With UTF-16 you don't know what length any particular character will be.

For example, you have a function that returns the nth char of a string:

char getChar(int index, String s );

If you are coding in a language that has direct memory access, say C, then in UTF-32 this function may be as simple as some pointer arithmetic (s+(4*index)), which is O(1).

If you are using UTF-16 though, you would have to walk the string, decoding as you went, which would be O(n).
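The difference can be sketched in Python by working on raw encoded bytes. The function names `nth_utf32` and `nth_utf16` are hypothetical stand-ins for the `getChar` function above, and both assume well-formed little-endian input with a valid index.

```python
import struct


def nth_utf32(data, index):
    """O(1): code point number `index` sits at byte offset 4 * index."""
    return struct.unpack_from("<I", data, 4 * index)[0]


def nth_utf16(data, index):
    """O(n): walk from the start, because a character may be 2 or 4 bytes."""
    pos = 0
    for _ in range(index):
        unit = struct.unpack_from("<H", data, pos)[0]
        # A high surrogate means this character occupies two code units.
        pos += 4 if 0xD800 <= unit <= 0xDBFF else 2
    unit = struct.unpack_from("<H", data, pos)[0]
    if 0xD800 <= unit <= 0xDBFF:
        low = struct.unpack_from("<H", data, pos + 2)[0]
        return 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
    return unit


text = "a\U0001F600b"                    # 'a', U+1F600, 'b'
utf32 = text.encode("utf-32-le")
utf16 = text.encode("utf-16-le")

print(hex(nth_utf32(utf32, 2)))          # 0x62 ('b')
print(hex(nth_utf16(utf16, 2)))          # 0x62 ('b'), found by walking
print(hex(nth_utf16(utf16, 1)))          # 0x1f600

# Naive fixed-offset indexing into UTF-16 lands in the middle of the pair.
naive = struct.unpack_from("<H", utf16, 2 * 2)[0]
print(hex(naive))                        # 0xde00, a lone low surrogate
```

Implementations that need fast indexing over UTF-16 or UTF-8 sometimes keep a side index of offsets instead of switching to UTF-32, but that is a separate design decision from the one sketched here.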

自此以后,行同陌路 2024-07-21 12:53:48

In general, you just use the string datatype/encoding of the underlying platform, which is often (Windows, Java, Cocoa...) UTF-16 and sometimes UTF-8 or UTF-32. This is mostly for historical reasons; there is little difference between the three Unicode encodings: all three are well-defined, fast and robust, and all of them can encode every Unicode code point sequence. The unique feature of UTF-32, that it is a fixed-width encoding (meaning that each code point is represented by exactly one code unit), is of little use in practice: your memory management layer needs to know about the number and width of code units, and users are interested in abstract characters and graphemes. As mentioned by the Unicode standard, Unicode applications have to deal with combining characters, ligatures and so on anyway, and the handling of surrogate pairs, despite being conceptually different, can be done within the same technical framework.

If I were to reinvent the world, I'd probably go for UTF-32 because it is simply the least complex encoding, but as it stands the differences are too small to be of practical concern.
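To illustrate why fixed-width code points still do not give fixed-width "characters", here is a small Python sketch using the standard `unicodedata` module. The helper `count_without_combining_marks` is a deliberately rough assumption made for this example: it only skips characters with a non-zero combining class and is not a full grapheme cluster segmentation.

```python
import unicodedata


def count_without_combining_marks(text):
    """Rough count of user-perceived characters: skip combining marks."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


decomposed = "e\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

code_points = len(decomposed)                        # 2 code points
utf32_bytes = len(decomposed.encode("utf-32-le"))    # 8 bytes, i.e. 2 UTF-32 code units
perceived = count_without_combining_marks(decomposed)

print(code_points, utf32_bytes, perceived)           # 2 8 1

# Normalization can merge this particular pair into one precomposed code point,
# but not every combining sequence has a precomposed form.
composed = unicodedata.normalize("NFC", decomposed)
print(len(composed), composed == "\u00e9")           # 1 True
```

Even in UTF-32, then, "the nth character the user sees" is not simply the nth code unit, which is the point made in the answer above.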
