Is there any reason to prefer UTF-16 over UTF-8?
Examining the attributes of UTF-16 and UTF-8, I can't find any reason to prefer UTF-16.
However, checking out Java and C#, it looks like strings and chars there default to UTF-16. I was thinking that it might be for historic reasons, or perhaps for performance reasons, but couldn't find any information.
Does anyone know why these languages chose UTF-16? And is there any valid reason for me to do that as well?
EDIT: Meanwhile I've also found this answer, which seems relevant and has some interesting links.
8 Answers
East Asian languages typically require less storage in UTF-16 (2 bytes is enough for 99% of East-Asian language characters) than UTF-8 (typically 3 bytes is required).
Of course, for Western languages, UTF-8 is usually smaller (1 byte instead of 2). For mixed files like HTML (where there's a lot of markup) it's much of a muchness.
Processing of UTF-16 for user-mode applications is slightly easier than processing UTF-8, because surrogate pairs behave in almost the same way that combining characters behave. So UTF-16 can usually be processed as a fixed-size encoding.
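A rough way to check the size claims above is to encode a couple of sample strings with Java's standard charsets and compare the byte counts. This is only an illustrative sketch; the sample strings are made up:

    import java.nio.charset.StandardCharsets;

    public class EncodingSizeDemo {
        public static void main(String[] args) {
            // Illustrative samples: BMP East Asian text vs. plain ASCII.
            String eastAsian = "日本語のテキスト";  // 2 bytes per char in UTF-16, 3 in UTF-8
            String western = "plain ASCII text";    // 2 bytes per char in UTF-16, 1 in UTF-8

            for (String s : new String[] { eastAsian, western }) {
                int utf8 = s.getBytes(StandardCharsets.UTF_8).length;
                // UTF_16BE avoids the 2-byte byte order mark that StandardCharsets.UTF_16 prepends.
                int utf16 = s.getBytes(StandardCharsets.UTF_16BE).length;
                System.out.printf("%s -> UTF-8: %d bytes, UTF-16: %d bytes%n", s, utf8, utf16);
            }
        }
    }

The East Asian sample comes out at 24 bytes in UTF-8 versus 16 in UTF-16; the ASCII sample at 16 versus 32.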
@Oak: this is too long for a comment...
I don't know about C# (and would be really surprised: it would mean they just copied Java too much) but for Java it's simple: Java was conceived before Unicode 3.1 came out.
Hence there were fewer than 65537 codepoints, so every Unicode codepoint still fit in 16 bits, and so the Java char was born.
Of course this led to crazy issues that still affect Java programmers (like me) today: a method charAt which in some cases returns neither a Unicode character nor a Unicode codepoint, and a method codePointAt (added in Java 5) which takes an argument that is not the number of codepoints you want to skip! (You have to supply codePointAt with the number of Java chars you want to skip, which makes it one of the least understood methods in the String class.)
So, yup, this is definitely crazy and confuses most Java programmers (most aren't even aware of these issues) and, yup, it is for historical reasons. At least, that's the excuse that comes up when people get mad about this issue: but it's because Unicode 3.1 wasn't out yet.
:)
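To make the charAt/codePointAt confusion concrete, here is a small Java sketch; the string is a made-up example containing one character outside the BMP (U+1D11E, which needs a surrogate pair):

    public class CodePointDemo {
        public static void main(String[] args) {
            // "a", MUSICAL SYMBOL G CLEF (U+1D11E, encoded as a surrogate pair), "b"
            String s = "a\uD834\uDD1Eb";

            System.out.println(s.length());                        // 4 chars...
            System.out.println(s.codePointCount(0, s.length()));   // ...but only 3 codepoints

            // charAt(1) returns just the high surrogate, which is not a character on its own.
            System.out.println(Integer.toHexString(s.charAt(1)));      // d834
            // codePointAt(1) returns the full codepoint, but its argument is a char index:
            System.out.println(Integer.toHexString(s.codePointAt(1))); // 1d11e
            // an index in the middle of the pair yields the lone low surrogate.
            System.out.println(Integer.toHexString(s.codePointAt(2))); // dd1e

            // Iterating codepoint by codepoint means stepping by Character.charCount:
            for (int i = 0; i < s.length(); i += Character.charCount(s.codePointAt(i))) {
                System.out.printf("U+%04X%n", s.codePointAt(i));
            }
        }
    }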
I imagine C# using UTF-16 derives from the Windows NT family of operating systems using UTF-16 internally.
I imagine there are two main reasons why Windows NT uses UTF-16 internally:
A 32-bit encoding wastes a lot of space to encode each character.
UTF-8 is harder to decode than UTF-16. In UTF-16, characters are either a Basic Multilingual Plane character (2 bytes) or a Surrogate Pair (4 bytes). UTF-8 characters can be anywhere between 1 and 4 bytes.
Contrary to what other people have answered - you cannot treat UTF-16 as UCS-2. If you want to correctly iterate over actual characters in a string, you have to use Unicode-friendly iteration functions. For example, in C# you need to use StringInfo.GetTextElementEnumerator(). For further information, this page on Wikipedia is worth reading: http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings
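The iteration example above is C#-specific; a comparable way to do it in Java (an assumption on my part, not something from the answer) is java.text.BreakIterator, which walks user-perceived text elements rather than chars:

    import java.text.BreakIterator;

    public class TextElementDemo {
        public static void main(String[] args) {
            // 'e' + COMBINING ACUTE ACCENT, then U+1D11E (outside the BMP):
            // 4 UTF-16 chars and 3 codepoints, but only 2 text elements.
            String s = "e\u0301\uD834\uDD1E";

            BreakIterator it = BreakIterator.getCharacterInstance();
            it.setText(s);
            for (int start = it.first(), end = it.next();
                 end != BreakIterator.DONE;
                 start = end, end = it.next()) {
                System.out.println("text element: " + s.substring(start, end));
            }
        }
    }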
It depends on the expected character sets. If you expect heavy use of Unicode code points outside of the 7-bit ASCII range then you might find that UTF-16 will be more compact than UTF-8, since some UTF-8 sequences are more than two bytes long.
Also, for efficiency reasons, Java and C# do not take surrogate pairs into account when indexing strings. This would break down completely when using code points that are represented with UTF-8 sequences that take up an odd number of bytes.
UTF-16 can be more efficient for representing characters in some languages such as Chinese, Japanese and Korean, where most characters can be represented in one 16-bit word. Some rarely used characters may require two 16-bit words. UTF-8 is generally much more efficient for representing characters from Western European character sets - UTF-8 and ASCII are equivalent over the ASCII range (0-127) - but less efficient with Asian languages, requiring three or four bytes to represent characters that can be represented with two bytes in UTF-16.
UTF-16 has an advantage as an in-memory format for Java/C# in that every character in the Basic Multilingual Plane can be represented in 16 bits (see Joe's answer) and some of the disadvantages of UTF-16 (e.g. confusing code relying on \0 terminators) are less relevant.
If we're talking about plain text alone, UTF-16 can be more compact in some languages, Japanese (about 20%) and Chinese (about 40%) being prime examples. As soon as you're comparing HTML documents, the advantage goes completely the other way, since UTF-16 is going to waste a byte for every ASCII character.
As for simplicity or efficiency: if you implement Unicode correctly in an editor application, complexity will be similar because UTF-16 does not always encode codepoints as a single number anyway, and single codepoints are generally not the right way to segment text.
Given that in the most common applications, UTF-16 is less compact, and equally complex to implement, the singular reason to prefer UTF-16 over UTF-8 is if you have a completely closed ecosystem where you are regularly storing or transporting plain text entirely in complex writing systems, without compression.
After compression with zstd or LZMA2, even for 100% Chinese plain text, the advantage is completely wiped out; with gzip the UTF-16 advantage is about 4% on Chinese text with around 3000 unique graphemes.
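The exact percentages obviously depend on the corpus, but the comparison is easy to reproduce. A minimal Java sketch using the JDK's Deflater (the algorithm behind gzip) is shown below; the Chinese sample string is a placeholder, so substitute a real corpus for meaningful numbers:

    import java.nio.charset.StandardCharsets;
    import java.util.zip.Deflater;

    public class CompressionComparison {
        // Deflate-compressed size of the given bytes, counting output without storing it.
        static int compressedSize(byte[] input) {
            Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
            deflater.setInput(input);
            deflater.finish();
            byte[] buf = new byte[8192];
            int total = 0;
            while (!deflater.finished()) {
                total += deflater.deflate(buf);
            }
            deflater.end();
            return total;
        }

        public static void main(String[] args) {
            // Placeholder text; use a real corpus for a meaningful measurement.
            String text = "统一码让所有语言的文本都能在同一个系统里处理。".repeat(500);

            byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
            byte[] utf16 = text.getBytes(StandardCharsets.UTF_16BE);

            System.out.printf("raw size:        UTF-8 %d, UTF-16 %d%n", utf8.length, utf16.length);
            System.out.printf("compressed size: UTF-8 %d, UTF-16 %d%n",
                              compressedSize(utf8), compressedSize(utf16));
        }
    }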
For many (most?) applications, you will be dealing only with characters in the Basic Multilingual Plane, so you can treat UTF-16 as a fixed-length encoding.
So you avoid all the complexity of variable-length encodings like UTF-8.
Short answer:
Because Sun and Microsoft were early adopters of Unicode.
Long answer:
Sometime in the late 1980s, people started to realise that a universal character set was desirable, but what bit width to use was a matter of controversy. In 1989 ISO proposed a draft of ISO 10646 which offered multiple encoding modes: one mode would use 32 bits per character and be able to encode everything without any mode switching; another would use 16 bits per character but have an escape system for switching; yet another would use a byte-based encoding that had a number of design flaws.
A number of major software vendors did not like the ISO 10646 draft, seeing it as too complicated. They backed an alternative scheme called Unicode. Unicode 1.0, a fixed-width 16-bit encoding, was published in October 1991. The software vendors were able to convince the national standards bodies to vote down the ISO 10646 draft, and ISO was pushed into unification with Unicode.
So that was where we were in the early 1990s: a number of major software vendors had collaborated to design a new fixed-width encoding and were adopting it in their flagship products, including Windows NT and Java.
Meanwhile, X/Open were looking for a better way to encode Unicode in extended ASCII contexts. UTF-1, the draft's byte-based encoding, sucked for several reasons: it was slow to process because it required calculations modulo a number that was not a power of 2, it was not self-synchronizing, and shorter sequences could appear as sub-sequences of longer ones. These efforts resulted in what we now know as UTF-8, but they happened away from the main line of Unicode development. UTF-8 was developed in 1992 and presented in 1993 at a Usenix conference, but it does not seem to have been considered a proper standard until 1996.
This is the environment in which Windows NT (released July 1993) and Java (released January 1996) were designed and released. Unicode was a simple fixed-width encoding and hence the obvious choice for an internal processing format. Java did adopt a modified form of UTF-8, but only as a storage format.
There was pressure to encode more characters, and in July 1996 Unicode 2.0 was introduced. The code space was expanded to just over 20 bits and Unicode was no longer a fixed-width 16-bit encoding. Instead there was a choice of a fixed-width 32-bit encoding or variable-width encodings with 8- and 16-bit units.
No one wanted to take the space hit of 32-bit code units or the compatibility hit of changing their encoding unit size for the second time, so the systems that had been designed around the original 16-bit Unicode generally ended up using UTF-16. Sure, there was some risk that text processing could miscount or mangle the new characters, but meh, it was still the lesser evil.
The .NET Framework, and with it C#, were introduced somewhat later, in 2002, but by this time Microsoft was already deeply committed to 16-bit code units. Their operating system APIs used them. Their file systems used them. Their executable formats used them.
Unix and the Internet on the other hand, stayed largely byte-based all the way through. UTF-8 was treated as just another encoding, gradually replacing the previous legacy encodings.