Difference between UTF-8 and UTF-16?


Difference between UTF-8 and UTF-16?
Why do we need these?

MessageDigest md = MessageDigest.getInstance("SHA-256");
String text = "This is some text";

md.update(text.getBytes("UTF-8")); // Change this to "UTF-16" if needed
byte[] digest = md.digest();
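
For illustration, here is a minimal, self-contained sketch (the class name DigestDemo and the Base64 output are made up for this example) showing that the encoding you pick changes the bytes fed into the digest, and therefore the digest itself. Note that Java's "UTF-16" charset also prepends a byte order mark when encoding:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;

public class DigestDemo {
    public static void main(String[] args) throws Exception {
        String text = "This is some text";

        // Same string, two different byte sequences, hence two different digests.
        byte[] utf8Digest = MessageDigest.getInstance("SHA-256")
                .digest(text.getBytes(StandardCharsets.UTF_8));
        byte[] utf16Digest = MessageDigest.getInstance("SHA-256")
                .digest(text.getBytes(StandardCharsets.UTF_16)); // BOM + UTF-16BE bytes

        System.out.println("UTF-8 : " + Base64.getEncoder().encodeToString(utf8Digest));
        System.out.println("UTF-16: " + Base64.getEncoder().encodeToString(utf16Digest));
    }
}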


Comments (6)

拒绝两难 2024-10-18 08:44:53


I believe there are a lot of good articles about this around the Web, but here is a short summary.

Both UTF-8 and UTF-16 are variable-length encodings. However, in UTF-8 a character may occupy as little as 8 bits, while in UTF-16 a character starts at 16 bits.

Main UTF-8 pros:

  • Basic ASCII characters like digits, Latin characters with no accents, etc. occupy one byte, identical to the US-ASCII representation. This way all US-ASCII strings become valid UTF-8, which provides decent backwards compatibility in many cases.
  • No null bytes, which allows the use of null-terminated strings; this introduces a great deal of backwards compatibility too.
  • UTF-8 is independent of byte order, so you don't have to worry about Big Endian / Little Endian issue.

Main UTF-8 cons:

  • Many common characters have different lengths, which badly slows down indexing by code point and calculating a code point count.
  • Even though byte order doesn't matter, sometimes UTF-8 still has a BOM (byte order mark) which serves to notify that the text is encoded in UTF-8, and it breaks compatibility with ASCII software even if the text contains only ASCII characters. Microsoft software (like Notepad) especially likes to add a BOM to UTF-8.

Main UTF-16 pros:

  • BMP (Basic Multilingual Plane) characters, including Latin, Cyrillic, most Chinese (the PRC made support for some code points outside the BMP mandatory), and most Japanese, can be represented with 2 bytes. This speeds up indexing and calculating the code point count in case the text does not contain supplementary characters.
  • Even if the text has supplementary characters, they are still represented by pairs of 16-bit values, which means that the total length is still divisible by two and allows the use of a 16-bit char as the primitive component of the string.

Main UTF-16 cons:

  • Lots of null bytes in US-ASCII strings, which means no null-terminated strings and a lot of wasted memory.
  • Using it as a fixed-length encoding “mostly works” in many common scenarios (especially in the US / EU / countries with Cyrillic alphabets / Israel / Arab countries / Iran and many others), often leading to broken support where it doesn't. This means programmers have to be aware of surrogate pairs and handle them properly in cases where it matters!
  • It's variable length, so counting or indexing code points is costly, though less so than with UTF-8.

In general, UTF-16 is usually better for in-memory representation because BE/LE is irrelevant there (just use native order) and indexing is faster (just don't forget to handle surrogate pairs properly). UTF-8, on the other hand, is extremely good for text files and network protocols because there is no BE/LE issue and null-termination often comes in handy, as well as ASCII-compatibility.
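
As a rough illustration of the size trade-offs described above, here is a small sketch (the class name EncodedLengths is made up); UTF_16BE is used so that Java's UTF-16 encoder does not add a BOM:

import java.nio.charset.StandardCharsets;

public class EncodedLengths {
    public static void main(String[] args) {
        // ASCII-only text: 1 byte per character in UTF-8, 2 bytes in UTF-16.
        String ascii = "hello";
        // BMP text outside ASCII (Cyrillic): 2 bytes per character in both encodings.
        String cyrillic = "привет";
        // A supplementary character (U+1F600): 4 bytes in both encodings
        // (a surrogate pair in UTF-16).
        String emoji = "\uD83D\uDE00";

        for (String s : new String[] { ascii, cyrillic, emoji }) {
            System.out.printf("%s -> UTF-8: %d bytes, UTF-16BE: %d bytes%n",
                    s,
                    s.getBytes(StandardCharsets.UTF_8).length,
                    s.getBytes(StandardCharsets.UTF_16BE).length);
        }
    }
}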

音盲 2024-10-18 08:44:53


They're simply different schemes for representing Unicode characters.

Both are variable-length - UTF-16 uses 2 bytes for all characters in the basic multilingual plane (BMP) which contains most characters in common use.

UTF-8 uses between 1 and 3 bytes for characters in the BMP and 4 bytes for the rest of the current Unicode range of U+0000 to U+10FFFF; the original design was extensible up to U+7FFFFFFF if that ever became necessary... but notably all ASCII characters are represented in a single byte each.

For the purposes of a message digest it won't matter which of these you pick, so long as everyone who tries to recreate the digest uses the same option.

See this page for more about UTF-8 and Unicode.

(Note that Java char values are UTF-16 code units, which cover exactly the BMP; to represent characters above U+FFFF you need to use surrogate pairs in Java.)
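
To make that parenthetical concrete, here is a hedged sketch (the class name SurrogateDemo is made up) showing how a character above U+FFFF appears as a surrogate pair in a Java String:

public class SurrogateDemo {
    public static void main(String[] args) {
        // U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP, so in Java
        // (which stores strings as UTF-16 code units) it takes two chars.
        String clef = new String(Character.toChars(0x1D11E));

        System.out.println(clef.length());                          // 2 code units
        System.out.println(clef.codePointCount(0, clef.length()));  // 1 code point
        System.out.println(Character.isHighSurrogate(clef.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(clef.charAt(1)));  // true
    }
}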

不必你懂 2024-10-18 08:44:53


Security: Use only UTF-8

Difference between UTF-8 and UTF-16? Why do we need these?

There have been at least a couple of security vulnerabilities in implementations of UTF-16. See Wikipedia for details.

WHATWG and W3C have now declared that only UTF-8 is to be used on the Web.

The [security] problems outlined here go away when exclusively using UTF-8, which is one of the many reasons that is now the mandatory encoding for all things.

Other groups are saying the same.

So while UTF-16 may continue being used internally by some systems such as Java and Windows, what little use of UTF-16 you may have seen in the past for data files, data exchange, and such, will likely fade away entirely.

梦毁影碎の 2024-10-18 08:44:53


This is somewhat unrelated to UTF-8/16 in general (although it does convert to UTF-16, and the BE/LE part can be set with a single line), but below is the fastest way to convert a String to byte[]. It fits exactly the case provided (hashing), since String.getBytes(enc) is relatively slow.

import java.nio.ByteBuffer;

static byte[] toBytes(String s) {
    // Two bytes for each UTF-16 code unit of the string.
    byte[] b = new byte[s.length() * 2];
    // Writes the chars in the buffer's byte order (big-endian by default).
    ByteBuffer.wrap(b).asCharBuffer().put(s);
    return b;
}
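
Note that a ByteBuffer's default byte order is big-endian, so this helper effectively produces UTF-16BE bytes without a BOM, which is not byte-for-byte identical to String.getBytes("UTF-16") (that variant adds a BOM). A hedged usage sketch for the hashing case from the question:

MessageDigest md = MessageDigest.getInstance("SHA-256");
md.update(toBytes("This is some text")); // UTF-16BE code units, no BOM
byte[] digest = md.digest();
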
一抹淡然 2024-10-18 08:44:53


Difference between UTF-8 and UTF-16?

UTF-8 is a sequence of 8-bit bytes, while UTF-16 is a sequence of 16-bit units (hereafter referred to as words).

In UTF-8, code points with values 0 to 0x7F are encoded directly as single bytes, code points 0x80 to 0x7FF as two bytes, code points 0x800 to 0xFFFF as three bytes, and code points 0x10000 to 0x10FFFF as four bytes.

In UTF-16, code points 0x0000 to 0xFFFF (note: values 0xD800 to 0xDFFF are reserved for surrogates and are not valid characters) are encoded directly as single words. Code points with values 0x10000 to 0x10FFFF are encoded as two words. These two-word sequences are known as surrogate pairs.
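
A small sketch mirroring those ranges (the helper names utf8Bytes and utf16Words are made up; the surrogate range 0xD800 to 0xDFFF is ignored for brevity):

// Bytes needed to encode a Unicode code point in UTF-8.
static int utf8Bytes(int codePoint) {
    if (codePoint <= 0x7F)   return 1;
    if (codePoint <= 0x7FF)  return 2;
    if (codePoint <= 0xFFFF) return 3;
    return 4;                            // up to 0x10FFFF
}

// 16-bit words needed to encode a Unicode code point in UTF-16.
static int utf16Words(int codePoint) {
    return codePoint <= 0xFFFF ? 1 : 2;  // two words = a surrogate pair
}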

Why do we need these?

Because history is messy. Different companies and organisations have different priorities and ideas, and once a format decision is made, it tends to stick around.

Back in 1989 the ISO had proposed a Universal character set as a draft of ISO 10646, but the major software vendors did not like it, seeing it as over-complicated. They devised their own system called Unicode, a fixed-width 16-bit encoding. The software companies convinced a sufficient number of national standards bodies to vote down the draft of ISO 10646 and ISO was pushed into unification with Unicode.

This original 16-bit Unicode was adopted as the native internal format by a number of major software products. Two of the most notable were Java (released in 1996) and Windows NT (released in 1993). A string in Java or NT is, at its most fundamental, a sequence of 16-bit values.

There was a need to encode Unicode in byte-orientated "extended ASCII" environments. The ISO had proposed a standard, "UTF-1", for this, but people didn't like it: it was slow to implement because it involved modulo operators, and the encoded data had some undesirable properties.

X/Open circulated a proposal for a new standard for encoding Unicode/UCS values in extended ASCII environments. This was altered slightly by the Plan 9 developers to become what we now know as UTF-8.

Eventually, the software vendors had to concede that 16 bits was not enough. In particular, China was pressing heavily for support for historic Chinese characters that were too numerous to encode in 16 bits.

The end result was Unicode 2.0, which expanded the code space to just over 20 bits and introduced UTF-16. At the same time, Unicode 2.0 also elevated UTF-8 to be a formal part of the standard. Finally it introduced UTF-32, a new fixed width encoding.

In practice due to compatibility and efficiency considerations, relatively few systems adopted UTF-32. Those systems that had adopted the original 16 bit Unicode (e.g. Windows, Java) moved to UTF-16, while those that had remained byte orientated (e.g. Unix, the Internet) continued their gradual move from legacy 8 bit encodings to UTF-8.

乖乖哒 2024-10-18 08:44:53


A simple way to differentiate UTF-8 and UTF-16 is to start from what they have in common.

Other than sharing the same Unicode number for a given character, each is its own format.

UTF-8 represents each Unicode number assigned to a character with one byte (if it is ASCII), otherwise with two, three, or four bytes.

UTF-16 represents each Unicode number assigned to a character with two bytes to start with; if two bytes are not sufficient, it uses four bytes (a surrogate pair).

Theoretically UTF-16 can be more space efficient, but in practice UTF-8 usually is, because most of the text being processed is ASCII, which UTF-8 represents with a single byte and UTF-16 with two.

Also, UTF-8 is a superset of the ASCII encoding, so any data that is valid ASCII is also valid UTF-8 and is accepted by a UTF-8 processor. This is not true for UTF-16: a UTF-16 decoder cannot read plain ASCII, and this has been a big hurdle for UTF-16 adoption.

Another point to note: all of Unicode today fits in at most 4 bytes of UTF-8 (considering all languages of the world). That is the same maximum as UTF-16, so there is no real space saving in the worst case compared to UTF-8 (https://stackoverflow.com/a/8505038/3343801).

So, people use UTF-8 wherever possible.
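
A quick sketch of the ASCII-superset point (the class name AsciiCompat is made up): the same ASCII bytes decode cleanly as UTF-8 but not as UTF-16:

import java.nio.charset.StandardCharsets;

public class AsciiCompat {
    public static void main(String[] args) {
        byte[] ascii = "hello, world".getBytes(StandardCharsets.US_ASCII);

        // Interpreted as UTF-8, the ASCII bytes decode back to the original text...
        System.out.println(new String(ascii, StandardCharsets.UTF_8));    // hello, world
        // ...but interpreted as UTF-16 they pair up into unrelated characters.
        System.out.println(new String(ascii, StandardCharsets.UTF_16BE)); // unrelated characters
    }
}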
