What is the best UTF?
I am very confused about the UTFs in Unicode.
There are UTF-8, UTF-16 and UTF-32.
My questions are:
- Which UTF supports all the Unicode blocks?
- What is the best UTF (performance, size, etc.), and why?
- What are the differences between these three UTFs?
- What are endianness and the byte order mark (BOM)?
Thanks
All UTF encodings support all Unicode blocks: every UTF encoding can represent every Unicode codepoint. However, some non-UTF, older encodings, such as UCS-2 (which is like UTF-16, but lacks surrogate pairs, and thus lacks the ability to encode codepoints above 65535/U+FFFF), may not.
For textual data that is mostly English and/or just ASCII, UTF-8 is by far the most space-efficient. However, UTF-8 is sometimes less space-efficient than UTF-16 and UTF-32 where most of the codepoints used are high (such as large bodies of CJK text).
UTF-8 encodes each Unicode codepoint from one to four bytes. The Unicode values 0 to 127, which are the same as they are in ASCII, are encoded like they are in ASCII. Bytes with values 128 to 255 are used for multi-byte codepoints.
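Those four length classes line up with codepoint ranges, which is easy to check with Python's built-in encoder (a quick illustrative sketch; the sample characters are arbitrary):

```python
# Byte length of the UTF-8 encoding grows with the codepoint value.
samples = {
    "A": 1,           # U+0041, ASCII range (U+0000..U+007F)
    "\u00e9": 2,      # é,  U+0080..U+07FF
    "\u4e2d": 3,      # 中, U+0800..U+FFFF
    "\U0001f600": 4,  # 😀, U+10000..U+10FFFF
}
for ch, expected in samples.items():
    encoded = ch.encode("utf-8")
    assert len(encoded) == expected
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")
```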
UTF-16 encodes each Unicode codepoint in either two bytes (one UTF-16 value) or four bytes (two UTF-16 values). Anything in the Basic Multilingual Plane (Unicode codepoints 0 to 65535, or U+0000 to U+FFFF) is encoded with one UTF-16 value. Codepoints from higher planes use two UTF-16 values, through a technique called 'surrogate pairs'.
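The one-versus-two code unit split, and the surrogate-pair arithmetic, can be sketched in a few lines of Python (the choice of U+1F600 as the non-BMP example is arbitrary):

```python
# BMP codepoints take one 16-bit code unit; higher planes take two
# (a surrogate pair).
bmp = "\u1234"         # inside the BMP
astral = "\U0001f600"  # U+1F600, outside the BMP

assert len(bmp.encode("utf-16-be")) == 2     # one code unit
assert len(astral.encode("utf-16-be")) == 4  # two code units

# The surrogate pair for U+1F600 can be computed by hand:
cp = 0x1F600 - 0x10000
high = 0xD800 + (cp >> 10)   # high surrogate: 0xD83D
low = 0xDC00 + (cp & 0x3FF)  # low surrogate:  0xDE00
print(f"{high:04X} {low:04X}")  # D83D DE00
```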
UTF-32 is not a variable-length encoding for Unicode; all Unicode codepoint values are encoded as-is. This means that U+10FFFF is encoded as 0x0010FFFF.
Endianness is how a piece of data, a particular CPU architecture, or a protocol orders the bytes of multi-byte data types. Little-endian systems (such as x86-32 and x86-64 CPUs) put the least-significant byte first, and big-endian systems (such as SPARC, older PowerPC, and many networking protocols) put the most-significant byte first.
In a little-endian encoding or system, the 32-bit value 0x12345678 is stored or transmitted as 0x78 0x56 0x34 0x12. In a big-endian encoding or system, it is stored or transmitted as 0x12 0x34 0x56 0x78.
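Python's struct module can reproduce exactly this example, since its format prefixes select the byte order (a small sketch):

```python
import struct

value = 0x12345678
little = struct.pack("<I", value)  # least-significant byte first
big = struct.pack(">I", value)     # most-significant byte first

assert little == b"\x78\x56\x34\x12"
assert big == b"\x12\x34\x56\x78"
```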
A byte order mark is used in UTF-16 and UTF-32 to signal which endianness the text is to be interpreted as. Unicode does this in a clever way: U+FEFF is a valid codepoint, used for the byte order mark, while U+FFFE is not. Therefore, if a file starts with 0xFF 0xFE, it can be assumed that the rest of the file is stored in a little-endian byte ordering.
A byte order mark in UTF-8 is technically possible, but is meaningless in the context of endianness for obvious reasons. However, a stream that begins with the UTF-8-encoded BOM almost certainly is UTF-8, and so the BOM can be used for identification.
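That sniffing logic can be sketched with the BOM constants from Python's codecs module (the function name sniff_bom is my own; note that the UTF-32 little-endian BOM, FF FE 00 00, begins with the UTF-16 little-endian BOM, FF FE, so it has to be tested first):

```python
import codecs

def sniff_bom(data):
    """Guess an encoding from a leading BOM; return None if there isn't one."""
    # UTF-32 must come before UTF-16, because its little-endian BOM
    # (FF FE 00 00) starts with the UTF-16 little-endian BOM (FF FE).
    for bom, name in [
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]:
        if data.startswith(bom):
            return name
    return None

assert sniff_bom(b"\xff\xfe" + "hi".encode("utf-16-le")) == "utf-16-le"
assert sniff_bom(b"\xef\xbb\xbfhello") == "utf-8"
assert sniff_bom(b"plain text") is None
```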
Benefits of UTF-8
Benefits of UTF-16
Benefits of UTF-32
“Answer me these questions four, as all were answered long before.”
You really should have asked one question, not four. But here are the answers.
All UTF transforms by definition support all Unicode code points. That is something you needn’t worry about. The only problem is that some systems are really UCS-2 yet claim they are UTF-16, and UCS-2 is severely broken in several fundamental ways:
For what encoding is used internally by seven different programming languages, see slide 7 on Feature Support Summary in my OSCON talk from last week entitled “Unicode Support Shootout”. It varies a great deal.
UTF-8 is the best serialization transform of a stream of logical Unicode code points because, in no particular order:
I use UTF-8 whenever I can get away with it.
I have already given properties of UTF-8, so here are some for the other two:
You always use O(N) access in a strlen-style function anyway, so I am not sure how important this is. My impression is that we almost always process our strings in sequential, not random, order, in which case this ceases to be a concern. Yes, it takes more memory, but only marginally so in the long run.
That's why I've come to talk about there being a UTF-16 Curse. The only thing worse than The UTF-16 Curse is The UCS-2 Curse.
Endianness and the whole BOM thing are problems that curse both UTF-16 and UTF-32 alike. If you use UTF-8, you will not ever have to worry about these.
I sure do hope that you are using logical (that is, abstract) code points internally with all your APIs, and worrying about serialization only for external interchange alone. Anything that makes you get at code units instead of code points is far far more hassle than it’s worth, no matter whether those code units are 8 bits wide or 16 bits wide. You want a code-point interface, not a code-unit interface. Now that your API uses code points instead of code units, the actual underlying representation no longer matters. It is important that this be hidden.
Category Errors
Let me add that everyone talking about ASCII versus Unicode is making a category error. Unicode is very much NOT “like ASCII but with more characters.” That might describe ISO 10646, but it does not describe Unicode. Unicode is not merely a particular repertoire but rules for handling them. Not just more characters, but rather more characters that have particular rules accompanying them. Unicode characters without Unicode rules are no longer Unicode characters.
If you use an ASCII mindset to handle Unicode text, you will get all kinds of brokenness, again and again. It doesn't work. As just one example of this, it is because of this misunderstanding that the Python pattern-matching library, re, does the wrong thing completely when matching case-insensitively. It blindly assumes two code points count as the same if both have the same lowercase. That is an ASCII mindset, which is why it fails. You just cannot treat Unicode that way, because if you do you break the rules and it is no longer Unicode. It's just a mess.
For example, Unicode defines U+03C3 GREEK SMALL LETTER SIGMA and U+03C2 GREEK SMALL LETTER FINAL SIGMA as case-insensitive versions of each other. (This is called Unicode casefolding.) But since they don't change when blindly mapped to lowercase and compared, that comparison fails. You just can't do it that way. You can't fix it in the general case by switching the lowercase comparison to an uppercase one, either. Using casemapping when you need to use casefolding betrays a shaky understanding of the whole works.
(And that's nothing: Python 2 is broken even worse. I recommend against using Python 2 for Unicode; use Python 3 if you want to do Unicode in Python. For Pythonistas, the solution I recommend for Python's innumerably many Unicode regex issues is Matthew Barnett's marvelous regex library for Python 2 and Python 3. It is really quite neat, and it actually gets Unicode casefolding right, amongst many other Unicode things that the standard re gets miserably wrong.)
REMEMBER: Unicode is not just more characters: Unicode is rules for handling more characters. One either learns to work with Unicode, or else one works against it, and if one works against it, then it works against you.
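The sigma example is easy to verify in Python 3, where str.casefold() implements Unicode casefolding and str.lower() is plain lowercase casemapping:

```python
sigma = "\u03c3"        # GREEK SMALL LETTER SIGMA
final_sigma = "\u03c2"  # GREEK SMALL LETTER FINAL SIGMA

# Lowercasing both and comparing (the ASCII mindset) fails:
# both are already lowercase, so they stay distinct.
assert sigma.lower() != final_sigma.lower()

# Unicode casefolding correctly treats them as equivalent.
assert sigma.casefold() == final_sigma.casefold()
```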
All of them support all Unicode code points.
They have different performance characteristics - for example, UTF-8 is more compact for ASCII characters, whereas UTF-32 makes it easier to deal with the whole of Unicode, including values outside the Basic Multilingual Plane (i.e. above U+FFFF). Due to its variable width per character, UTF-8 strings are hard to use to get to a particular character index in the binary encoding - you have to scan through. The same is true for UTF-16 unless you know that there are no non-BMP characters.
It's probably easiest to look at the Wikipedia articles for UTF-8, UTF-16 and UTF-32.
Endianness determines (for UTF-16 and UTF-32) whether the most significant byte comes first and the least significant byte comes last, or vice versa. For example, if you want to represent U+1234 in UTF-16, that can either be { 0x12, 0x34 } or { 0x34, 0x12 }. A byte order mark indicates which endianess you're dealing with. UTF-8 doesn't have different endiannesses, but seeing a UTF-8 BOM at the start of a file is a good indicator that it is UTF-8.
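The U+1234 example can be checked directly with Python's explicit-endianness codecs (a quick sketch):

```python
ch = "\u1234"
assert ch.encode("utf-16-be") == b"\x12\x34"  # big-endian: MSB first
assert ch.encode("utf-16-le") == b"\x34\x12"  # little-endian: LSB first

# The plain "utf-16" codec writes a BOM first, then uses the native
# byte order; on a little-endian machine that gives b"\xff\xfe\x34\x12".
print(ch.encode("utf-16"))
```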
Some good questions here and already a couple good answers. I might be able to add something useful.
As said before, all three cover the full set of possible codepoints, U+0000 to U+10FFFF.
Depends on the text, but here are some details that might be of interest. UTF-8 uses 1 to 4 bytes per char; UTF-16 uses 2 or 4; UTF-32 always uses 4. A useful thing to note is this. If you use UTF-8, then English text will be encoded with the vast majority of characters in one byte each, but Chinese needs 3 bytes per character. Using UTF-16, English and Chinese will both require 2. So basically UTF-8 is a win for English; UTF-16 is a win for Chinese.
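A concrete comparison, using an arbitrary English word and its Chinese counterpart 你好:

```python
english = "hello"
chinese = "\u4f60\u597d"  # 你好

# UTF-8: 1 byte per ASCII char, 3 bytes per common CJK char.
assert len(english.encode("utf-8")) == 5
assert len(chinese.encode("utf-8")) == 6

# UTF-16: 2 bytes per BMP char, for both scripts.
assert len(english.encode("utf-16-le")) == 10
assert len(chinese.encode("utf-16-le")) == 4
```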
The main difference is mentioned in the answer to #2 above, or as Jon Skeet says, see the Wikipedia articles.
Endianness: For UTF-16 and UTF-32 this refers to the order in which the bytes appear; for example in UTF-16, the character U+1234 can be encoded either as 12 34 (big endian), or 34 12 (little endian). The BOM, or byte order mark, is interesting. Let's say you have a file encoded in UTF-16, but you don't know whether it is big or little endian, and you notice the first two bytes of the file are FE FF. If this were big-endian the character would be U+FEFF; if little endian, it would signify U+FFFE. But here's the thing: in Unicode the codepoint U+FFFE is a noncharacter, permanently guaranteed never to be assigned: there is no character there! Therefore we can tell the encoding must be big-endian. The FEFF character is harmless here; it is the ZERO WIDTH NO-BREAK SPACE (invisible, basically). Similarly, if the file began with FF FE we know it is little endian.
Not sure if I added anything to the other answers, but I have found the English vs. Chinese concrete analysis useful in explaining this to others in the past.
One way of looking at it is as a trade-off between size and complexity. Generally they increase in the number of bytes needed to encode text, but decrease in the complexity of decoding the scheme used to represent characters. Therefore, UTF-8 is usually small but can be complex to decode, whereas UTF-32 takes up more bytes but is easy to decode (though it is rarely used; UTF-16 is more common).
With this in mind UTF-8 is often chosen for network transmission, as it has smaller size. Whereas UTF-16 is chosen where easier decoding is more important than storage size.
BOMs are intended as information at the beginning of files which describes which encoding has been used. This information is often missing though.
Joel Spolsky wrote a nice introductory article about Unicode:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)