What is the best UTF?
I am very confused about the UTFs in Unicode.
There are UTF-8, UTF-16 and UTF-32.
My questions are:
- Which UTF supports all the Unicode blocks?
- What is the best UTF (performance, size, etc.), and why?
- What are the differences between these three UTFs?
- What are endianness and the byte order mark (BOM)?
Thanks
All UTF encodings support all Unicode blocks: every UTF encoding can represent every Unicode codepoint. However, some non-UTF, older encodings, such as UCS-2 (which is like UTF-16, but lacks surrogate pairs, and thus lacks the ability to encode codepoints above 65535/U+FFFF), may not.
For textual data that is mostly English and/or just ASCII, UTF-8 is by far the most space-efficient. However, UTF-8 is sometimes less space-efficient than UTF-16 and UTF-32 where most of the codepoints used are high (such as large bodies of CJK text).
UTF-8 encodes each Unicode codepoint from one to four bytes. The Unicode values 0 to 127, which are the same as they are in ASCII, are encoded like they are in ASCII. Bytes with values 128 to 255 are used for multi-byte codepoints.
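Those four length classes line up with codepoint ranges, which is easy to check with Python's built-in encoder (a quick illustrative sketch; the sample characters are arbitrary):

```python
# Byte length of the UTF-8 encoding grows with the codepoint value.
samples = {
    "A": 1,           # U+0041, ASCII range (U+0000..U+007F)
    "\u00e9": 2,      # é,  U+0080..U+07FF
    "\u4e2d": 3,      # 中, U+0800..U+FFFF
    "\U0001f600": 4,  # 😀, U+10000..U+10FFFF
}
for ch, expected in samples.items():
    encoded = ch.encode("utf-8")
    assert len(encoded) == expected
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
```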
UTF-16 encodes each Unicode codepoint in either two bytes (one UTF-16 value) or four bytes (two UTF-16 values). Anything in the Basic Multilingual Plane (Unicode codepoints 0 to 65535, or U+0000 to U+FFFF) is encoded with one UTF-16 value. Codepoints from higher planes use two UTF-16 values, through a technique called 'surrogate pairs'.
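The one-versus-two code unit split, and the surrogate-pair arithmetic, can be sketched in a few lines of Python (the choice of U+1F600 as the non-BMP example is arbitrary):

```python
# BMP codepoints take one 16-bit code unit; higher planes take two
# (a surrogate pair).
bmp = "\u1234"         # inside the BMP
astral = "\U0001f600"  # U+1F600, outside the BMP

assert len(bmp.encode("utf-16-be")) == 2     # one code unit
assert len(astral.encode("utf-16-be")) == 4  # two code units

# The surrogate pair for U+1F600 can be computed by hand:
cp = 0x1F600 - 0x10000
high = 0xD800 + (cp >> 10)   # high surrogate: 0xD83D
low = 0xDC00 + (cp & 0x3FF)  # low surrogate:  0xDE00
print(f"{high:04X} {low:04X}")  # D83D DE00
```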
UTF-32 is not a variable-length encoding for Unicode; all Unicode codepoint values are encoded as-is. This means that U+10FFFF is encoded as 0x0010FFFF.
Endianness is how a piece of data, a particular CPU architecture, or a protocol orders the bytes of multi-byte data types. Little-endian systems (such as x86-32 and x86-64 CPUs) put the least-significant byte first, and big-endian systems (such as SPARC, older PowerPC, and many networking protocols) put the most-significant byte first.
In a little-endian encoding or system, the 32-bit value 0x12345678 is stored or transmitted as 0x78 0x56 0x34 0x12. In a big-endian encoding or system, it is stored or transmitted as 0x12 0x34 0x56 0x78.
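Python's struct module can reproduce exactly this example, since its format prefixes select the byte order (a small sketch):

```python
import struct

value = 0x12345678
little = struct.pack("<I", value)  # least-significant byte first
big = struct.pack(">I", value)     # most-significant byte first

assert little == b"\x78\x56\x34\x12"
assert big == b"\x12\x34\x56\x78"
```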
A byte order mark is used in UTF-16 and UTF-32 to signal which endianness the text is to be interpreted as. Unicode does this in a clever way: U+FEFF is a valid codepoint, used for the byte order mark, while U+FFFE is not. Therefore, if a file starts with 0xFF 0xFE, it can be assumed that the rest of the file is stored in a little-endian byte ordering.
A byte order mark in UTF-8 is technically possible, but is meaningless in the context of endianness for obvious reasons. However, a stream that begins with the UTF-8-encoded BOM almost certainly is UTF-8, and so the BOM can be used for identification.
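That sniffing logic can be sketched with the BOM constants from Python's codecs module (the function name sniff_bom is my own; note that the UTF-32 little-endian BOM, FF FE 00 00, begins with the UTF-16 little-endian BOM, FF FE, so it has to be tested first):

```python
import codecs

def sniff_bom(data):
    """Guess an encoding from a leading BOM; return None if there isn't one."""
    # UTF-32 must come before UTF-16, because its little-endian BOM
    # (FF FE 00 00) starts with the UTF-16 little-endian BOM (FF FE).
    for bom, name in [
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]:
        if data.startswith(bom):
            return name
    return None

assert sniff_bom(b"\xff\xfe" + "hi".encode("utf-16-le")) == "utf-16-le"
assert sniff_bom(b"\xef\xbb\xbfhello") == "utf-8"
assert sniff_bom(b"plain text") is None
```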
Benefits of UTF-8
Benefits of UTF-16
Benefits of UTF-32
“Answer me these questions four, as all were answered long before.”
You really should have asked one question, not four. But here are the answers.
All UTF transforms by definition support all Unicode code points. That is something you needn’t worry about. The only problem is that some systems are really UCS-2 yet claim they are UTF-16, and UCS-2 is severely broken in several fundamental ways:
For what encoding is used internally by seven different programming languages, see slide 7 on Feature Support Summary in my OSCON talk from last week entitled “Unicode Support Shootout”. It varies a great deal.
UTF-8 is the best serialization transform of a stream of logical Unicode code points because, in no particular order:
I use UTF-8 whenever I can get away with it.
I have already given properties of UTF-8, so here are some for the other two:
You always use O(N) access in a strlen-style function anyway, so I am not sure how important this is. My impression is that we almost always process our strings in sequential, not random, order, in which case this ceases to be a concern. Yes, it takes more memory, but only marginally so in the long run.
That's why I've come to talk about there being a UTF-16 Curse. The only thing worse than The UTF-16 Curse is The UCS-2 Curse.
Endianness and the whole BOM thing are problems that curse both UTF-16 and UTF-32 alike. If you use UTF-8, you will not ever have to worry about these.
I sure do hope that you are using logical (that is, abstract) code points internally with all your APIs, and worrying about serialization only for external interchange alone. Anything that makes you get at code units instead of code points is far far more hassle than it’s worth, no matter whether those code units are 8 bits wide or 16 bits wide. You want a code-point interface, not a code-unit interface. Now that your API uses code points instead of code units, the actual underlying representation no longer matters. It is important that this be hidden.
Category Errors
Let me add that everyone talking about ASCII versus Unicode is making a category error. Unicode is very much NOT “like ASCII but with more characters.” That might describe ISO 10646, but it does not describe Unicode. Unicode is not merely a particular repertoire but rules for handling them. Not just more characters, but rather more characters that have particular rules accompanying them. Unicode characters without Unicode rules are no longer Unicode characters.
If you use an ASCII mindset to handle Unicode text, you will get all kinds of brokenness, again and again. It doesn't work. As just one example of this, it is because of this misunderstanding that the Python pattern-matching library, re, does the wrong thing completely when matching case-insensitively. It blindly assumes two code points count as the same if both have the same lowercase. That is an ASCII mindset, which is why it fails. You just cannot treat Unicode that way, because if you do you break the rules and it is no longer Unicode. It's just a mess.
For example, Unicode defines U+03C3 GREEK SMALL LETTER SIGMA and U+03C2 GREEK SMALL LETTER FINAL SIGMA as case-insensitive versions of each other. (This is called Unicode casefolding.) But since they don't change when blindly mapped to lowercase and compared, that comparison fails. You just can't do it that way. You can't fix it in the general case by switching the lowercase comparison to an uppercase one, either. Using casemapping when you need to use casefolding betrays a shaky understanding of the whole works.
(And that's nothing: Python 2 is broken even worse. I recommend against using Python 2 for Unicode; use Python 3 if you want to do Unicode in Python. For Pythonistas, the solution I recommend for Python's innumerably many Unicode regex issues is Matthew Barnett's marvelous regex library for Python 2 and Python 3. It is really quite neat, and it actually gets Unicode casefolding right, amongst many other Unicode things that the standard re gets miserably wrong.)
REMEMBER: Unicode is not just more characters: Unicode is rules for handling more characters. One either learns to work with Unicode, or else one works against it, and if one works against it, then it works against you.
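The sigma example is easy to verify in Python 3, where str.casefold() implements Unicode casefolding and str.lower() is plain lowercase casemapping:

```python
sigma = "\u03c3"        # GREEK SMALL LETTER SIGMA
final_sigma = "\u03c2"  # GREEK SMALL LETTER FINAL SIGMA

# Lowercasing both and comparing (the ASCII mindset) fails:
# both are already lowercase, so they stay distinct.
assert sigma.lower() != final_sigma.lower()

# Unicode casefolding correctly treats them as equivalent.
assert sigma.casefold() == final_sigma.casefold()
```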
All of them support all Unicode code points.
They have different performance characteristics - for example, UTF-8 is more compact for ASCII characters, whereas UTF-32 makes it easier to deal with the whole of Unicode, including values outside the Basic Multilingual Plane (i.e. above U+FFFF). Due to its variable width per character, UTF-8 strings are hard to use to get to a particular character index in the binary encoding - you have to scan through. The same is true for UTF-16 unless you know that there are no non-BMP characters.
It's probably easiest to look at the Wikipedia articles for UTF-8, UTF-16 and UTF-32.
Endianness determines (for UTF-16 and UTF-32) whether the most significant byte comes first and the least significant byte comes last, or vice versa. For example, if you want to represent U+1234 in UTF-16, that can either be { 0x12, 0x34 } or { 0x34, 0x12 }. A byte order mark indicates which endianess you're dealing with. UTF-8 doesn't have different endiannesses, but seeing a UTF-8 BOM at the start of a file is a good indicator that it is UTF-8.
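The U+1234 example can be checked directly with Python's explicit-endianness codecs (a quick sketch):

```python
ch = "\u1234"
assert ch.encode("utf-16-be") == b"\x12\x34"  # big-endian: MSB first
assert ch.encode("utf-16-le") == b"\x34\x12"  # little-endian: LSB first

# The plain "utf-16" codec writes a BOM first, then uses the native
# byte order; on a little-endian machine that gives b"\xff\xfe\x34\x12".
print(ch.encode("utf-16"))
```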
Some good questions here and already a couple good answers. I might be able to add something useful.
As said before, all three cover the full set of possible codepoints, U+0000 to U+10FFFF.
Depends on the text, but here are some details that might be of interest. UTF-8 uses 1 to 4 bytes per char; UTF-16 uses 2 or 4; UTF-32 always uses 4. A useful thing to note is this. If you use UTF-8, then English text will be encoded with the vast majority of characters in one byte each, but Chinese needs 3 bytes per character. Using UTF-16, English and Chinese will both require 2. So basically UTF-8 is a win for English; UTF-16 is a win for Chinese.
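A concrete comparison, using an arbitrary English word and its Chinese counterpart 你好:

```python
english = "hello"
chinese = "\u4f60\u597d"  # 你好

# UTF-8: 1 byte per ASCII char, 3 bytes per common CJK char.
assert len(english.encode("utf-8")) == 5
assert len(chinese.encode("utf-8")) == 6

# UTF-16: 2 bytes per BMP char, for both scripts.
assert len(english.encode("utf-16-le")) == 10
assert len(chinese.encode("utf-16-le")) == 4
```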
The main difference is mentioned in the answer to #2 above, or as Jon Skeet says, see the Wikipedia articles.
Endianness: For UTF-16 and UTF-32 this refers to the order in which the bytes appear; for example in UTF-16, the character U+1234 can be encoded either as 12 34 (big endian), or 34 12 (little endian). The BOM, or byte order mark, is interesting. Let's say you have a file encoded in UTF-16, but you don't know whether it is big or little endian, and you notice the first two bytes of the file are FE FF. If this were big-endian the character would be U+FEFF; if little endian, it would signify U+FFFE. But here's the thing: in Unicode the codepoint U+FFFE is a noncharacter, permanently guaranteed never to be assigned: there is no character there! Therefore we can tell the encoding must be big-endian. The FEFF character is harmless here; it is the ZERO WIDTH NO-BREAK SPACE (invisible, basically). Similarly, if the file began with FF FE we know it is little endian.
Not sure if I added anything to the other answers, but I have found the English vs. Chinese concrete analysis useful in explaining this to others in the past.
One way of looking at it is as a trade-off between size and complexity. Generally they increase in the number of bytes needed to encode text, but decrease in the complexity of decoding the scheme used to represent characters. Therefore, UTF-8 is usually small but can be complex to decode, whereas UTF-32 takes up more bytes but is easy to decode (though it is rarely used; UTF-16 is more common).
With this in mind UTF-8 is often chosen for network transmission, as it has smaller size. Whereas UTF-16 is chosen where easier decoding is more important than storage size.
BOMs are intended as information at the beginning of files which describes which encoding has been used. This information is often missing though.
Joel Spolsky wrote a nice introductory article about Unicode:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)