Difference between UTF-8 and UTF-16?
Why do we need these?
MessageDigest md = MessageDigest.getInstance("SHA-256"); // may throw NoSuchAlgorithmException
String text = "This is some text";
md.update(text.getBytes(StandardCharsets.UTF_8)); // or StandardCharsets.UTF_16 if needed
byte[] digest = md.digest();
I believe there are a lot of good articles about this around the Web, but here is a short summary.
Both UTF-8 and UTF-16 are variable-length encodings. However, in UTF-8 a character occupies a minimum of 8 bits, while in UTF-16 a character occupies a minimum of 16 bits.
Main UTF-8 pros:
Main UTF-8 cons:
Main UTF-16 pros:
Many APIs use a 16-bit char as the primitive component of the string.
Main UTF-16 cons:
In general, UTF-16 is usually better for in-memory representation because BE/LE is irrelevant there (just use native order) and indexing is faster (just don't forget to handle surrogate pairs properly). UTF-8, on the other hand, is extremely good for text files and network protocols because there is no BE/LE issue and null-termination often comes in handy, as well as ASCII-compatibility.
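The size trade-off described above can be seen concretely in Java by comparing encoded lengths; the sample strings below are illustrative:

```java
import java.nio.charset.StandardCharsets;

public class EncodedLengths {
    public static void main(String[] args) {
        String ascii = "hello"; // 5 ASCII characters
        String cjk = "你好";     // 2 CJK characters, both inside the BMP

        // UTF-8: one byte per ASCII character, three bytes per BMP CJK character
        System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length);    // 5
        System.out.println(cjk.getBytes(StandardCharsets.UTF_8).length);      // 6

        // UTF-16LE (no byte-order mark): two bytes per BMP character
        System.out.println(ascii.getBytes(StandardCharsets.UTF_16LE).length); // 10
        System.out.println(cjk.getBytes(StandardCharsets.UTF_16LE).length);   // 4
    }
}
```

Note that `StandardCharsets.UTF_16` (without the LE/BE suffix) prepends a two-byte byte-order mark when encoding, which is why the explicit-endianness variant is used here.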
They're simply different schemes for representing Unicode characters.
Both are variable-length - UTF-16 uses 2 bytes for all characters in the basic multilingual plane (BMP) which contains most characters in common use.
UTF-8 uses between 1 and 3 bytes for characters in the BMP, and up to 4 for characters in the current Unicode range of U+0000 to U+10FFFF; the original design was extensible up to U+7FFFFFFF if that ever became necessary... but notably all ASCII characters are represented in a single byte each.
For the purposes of a message digest it won't matter which of these you pick, so long as everyone who tries to recreate the digest uses the same option.
See this page for more about UTF-8 and Unicode.
(Note that each Java char is a UTF-16 code unit, which covers exactly the BMP; to represent characters above U+FFFF you need to use surrogate pairs in Java.)
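The point about everyone agreeing on an encoding can be demonstrated directly: the same text hashed under UTF-8 and UTF-16 yields different digests, because the digest sees bytes, not characters. A minimal sketch:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Arrays;

public class DigestEncoding {
    public static void main(String[] args) throws Exception {
        String text = "This is some text";

        // Hash the UTF-8 byte sequence of the string
        byte[] utf8Digest = MessageDigest.getInstance("SHA-256")
                .digest(text.getBytes(StandardCharsets.UTF_8));

        // Hash the UTF-16 byte sequence of the same string
        byte[] utf16Digest = MessageDigest.getInstance("SHA-256")
                .digest(text.getBytes(StandardCharsets.UTF_16));

        // Same text, different byte sequences, therefore different hashes
        System.out.println(Arrays.equals(utf8Digest, utf16Digest)); // false
    }
}
```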
Security: Use only UTF-8
There have been at least a couple of security vulnerabilities in implementations of UTF-16. See Wikipedia for details.
WHATWG and W3C have now declared that only UTF-8 is to be used on the Web.
Other groups are saying the same.
So while UTF-16 may continue being used internally by some systems such as Java and Windows, what little use of UTF-16 you may have seen in the past for data files, data exchange, and such, will likely fade away entirely.
This is unrelated to UTF-8/16 in general (although it does convert to UTF-16, and the BE/LE part can be set with a single line), yet below is the fastest way to convert a String to byte[]. For instance: good exactly for the case provided (hash code). String.getBytes(enc) is relatively slow.
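The answer's original snippet is not reproduced above. As an illustrative substitute (not the answer's code), here is one commonly cited speed-up for this conversion: passing a cached Charset constant to getBytes instead of a charset name, which skips the per-call name lookup and avoids the checked UnsupportedEncodingException:

```java
import java.nio.charset.StandardCharsets;

public class FastBytes {
    public static void main(String[] args) throws Exception {
        String text = "This is some text";

        // Name-based variant: looks the charset up by name and declares a checked exception
        byte[] byName = text.getBytes("UTF-16LE");

        // Charset-based variant: reuses the cached Charset object, no checked exception
        byte[] byCharset = text.getBytes(StandardCharsets.UTF_16LE);

        // Both produce the same bytes; only the call overhead differs
        System.out.println(java.util.Arrays.equals(byName, byCharset)); // true
    }
}
```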
UTF-8 is a sequence of 8 bit bytes, while UTF-16 is a sequence of 16 bit units (hereafter referred to as words).
In UTF-8, code points with values 0 to 0x7F are encoded directly as single bytes, code points with values 0x80 to 0x7FF as two bytes, code points with values 0x800 to 0xFFFF as three bytes, and code points with values 0x10000 to 0x10FFFF as four bytes.
In UTF-16, code points 0x0000 to 0xFFFF (note: values 0xD800 to 0xDFFF are not valid Unicode code points) are encoded directly as single words. Code points with values 0x10000 to 0x10FFFF are encoded as two words; these two-word sequences are known as surrogate pairs.
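The surrogate-pair split described above can be observed in Java through the standard Character class, which converts a code point into its UTF-16 code units:

```java
public class SurrogatePairs {
    public static void main(String[] args) {
        int codePoint = 0x1F600; // 😀, a code point outside the BMP

        // Character.toChars yields the two-word surrogate pair for non-BMP code points
        char[] pair = Character.toChars(codePoint);
        System.out.println(pair.length); // 2
        System.out.printf("U+%04X U+%04X%n", (int) pair[0], (int) pair[1]); // U+D83D U+DE00

        // A BMP code point encodes as a single word
        System.out.println(Character.toChars(0x4F60).length); // 1
    }
}
```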
Because history is messy. Different companies and organisations have different priorities and ideas, and once a format decision is made, it tends to stick around.
Back in 1989 the ISO had proposed a Universal character set as a draft of ISO 10646, but the major software vendors did not like it, seeing it as over-complicated. They devised their own system called Unicode, a fixed-width 16-bit encoding. The software companies convinced a sufficient number of national standards bodies to vote down the draft of ISO 10646 and ISO was pushed into unification with Unicode.
This original 16-bit Unicode was adopted as the native internal format by a number of major software products. Two of the most notable were Java (released in 1996) and Windows NT (released in 1993). A string in Java or NT is, at its most fundamental, a sequence of 16-bit values.
There was a need to encode Unicode in byte-orientated "extended ASCII" environments. The ISO had proposed a standard "UTF-1" for this, but people didn't like it, it was slow to implement because it involved modulo operators and the encoded data had some undesirable properties.
X/Open circulated a proposal for a new standard for encoding Unicode/UCS values in extended ASCII environments. This was altered slightly by the Plan 9 developers to become what we now know as UTF-8.
Eventually, the software vendors had to concede that 16 bits was not enough. In particular, China was pressing heavily for support for historic Chinese characters that were too numerous to encode in 16 bits.
The end result was Unicode 2.0, which expanded the code space to just over 20 bits and introduced UTF-16. At the same time, Unicode 2.0 also elevated UTF-8 to be a formal part of the standard. Finally it introduced UTF-32, a new fixed width encoding.
In practice due to compatibility and efficiency considerations, relatively few systems adopted UTF-32. Those systems that had adopted the original 16 bit Unicode (e.g. Windows, Java) moved to UTF-16, while those that had remained byte orientated (e.g. Unix, the Internet) continued their gradual move from legacy 8 bit encodings to UTF-8.
A simple way to differentiate UTF-8 and UTF-16 is to identify what they have in common.
Other than sharing the same Unicode number for a given character, each is its own format.
UTF-8 tries to represent each Unicode number assigned to a character with one byte (if it is ASCII), otherwise with two, three, or four bytes.
UTF-16 starts by representing each Unicode number assigned to a character with two bytes. If two bytes are not sufficient, it uses four bytes (a surrogate pair).
Theoretically, UTF-16 is more space-efficient, but in practice UTF-8 is, because most characters being processed (some 98% of data) are ASCII, which UTF-8 represents with a single byte while UTF-16 needs two.
Also, UTF-8 is a superset of the ASCII encoding, so data from every app that produces ASCII is also accepted by a UTF-8 processor. This is not true for UTF-16: UTF-16 is not backward compatible with ASCII, which is a big hurdle for its adoption.
Another point to note is that all of Unicode as it stands fits in at most 4 bytes of UTF-8 (considering all languages of the world). This is the same maximum as UTF-16, so UTF-16 offers no real saving in space compared to UTF-8 (https://stackoverflow.com/a/8505038/3343801).
So, people use UTF-8 wherever possible.
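The ASCII-compatibility point can be checked directly: bytes written as plain ASCII decode unchanged under UTF-8, but a UTF-16 decoder pairs them into 16-bit units and produces something else entirely. A minimal sketch:

```java
import java.nio.charset.StandardCharsets;

public class AsciiSuperset {
    public static void main(String[] args) {
        byte[] asciiBytes = "plain ASCII".getBytes(StandardCharsets.US_ASCII);

        // UTF-8 decodes legacy ASCII bytes unchanged
        String asUtf8 = new String(asciiBytes, StandardCharsets.UTF_8);
        System.out.println(asUtf8.equals("plain ASCII")); // true

        // UTF-16LE interprets each byte pair as one 16-bit unit, so the text is mangled
        String asUtf16 = new String(asciiBytes, StandardCharsets.UTF_16LE);
        System.out.println(asUtf16.equals("plain ASCII")); // false
    }
}
```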