How many bytes does one Unicode character take?

Posted on 2024-10-21 09:35:40 | 289 words | 3 views | 0 comments

I am a bit confused about encodings. As far as I know old ASCII characters took one byte per character. How many bytes does a Unicode character require?

I assume that one Unicode character can contain every possible character from any language - am I correct? So how many bytes does it need per character?

And what do UTF-7, UTF-6, UTF-16 etc. mean? Are they different versions of Unicode?

I read the Wikipedia article about Unicode but it is quite difficult for me. I am looking forward to seeing a simple answer.

Comments (12)

提赋 2024-10-28 09:35:40

Strangely enough, nobody pointed out how to calculate how many bytes one Unicode character takes. Here is the rule for UTF-8 encoded strings:

Binary    Hex          Comments
0xxxxxxx  0x00..0x7F   Only byte of a 1-byte character encoding
10xxxxxx  0x80..0xBF   Continuation byte: one of 1-3 bytes following the first
110xxxxx  0xC0..0xDF   First byte of a 2-byte character encoding
1110xxxx  0xE0..0xEF   First byte of a 3-byte character encoding
11110xxx  0xF0..0xF7   First byte of a 4-byte character encoding

So the quick answer is: it takes 1 to 4 bytes, and the first byte indicates how many bytes the character takes up.
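
To make the table concrete, here is a minimal Python sketch that reads the length off the first byte. The function names `utf8_sequence_length` and `count_characters` are made up for this illustration, and the code assumes the input is already well-formed UTF-8 (it does not check the continuation bytes).

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return the total length in bytes of a UTF-8 sequence, given its first byte."""
    if first_byte < 0x80:   # 0xxxxxxx: single-byte character
        return 1
    if first_byte < 0xC0:   # 10xxxxxx: continuation byte, cannot start a character
        raise ValueError("continuation byte cannot start a character")
    if first_byte < 0xE0:   # 110xxxxx: first byte of a 2-byte character
        return 2
    if first_byte < 0xF0:   # 1110xxxx: first byte of a 3-byte character
        return 3
    if first_byte < 0xF8:   # 11110xxx: first byte of a 4-byte character
        return 4
    raise ValueError("not a valid UTF-8 first byte")


def count_characters(data: bytes) -> int:
    """Count code points in well-formed UTF-8 by hopping from first byte to first byte."""
    index = 0
    count = 0
    while index < len(data):
        index += utf8_sequence_length(data[index])
        count += 1
    return count


euro = "\u20ac".encode("utf-8")           # bytes E2 82 AC
print(utf8_sequence_length(euro[0]))      # length read from the first byte alone
print(count_characters("a\u00a9\u20ac".encode("utf-8")))
```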

洋洋洒洒 2024-10-28 09:35:40

You won't see a simple answer because there isn't one.

First, Unicode doesn't contain "every character from every language", although it sure does try.

Unicode itself is a mapping: it defines codepoints, and a codepoint is a number that is usually associated with a character. I say usually because there are concepts like combining characters. You may be familiar with things like accents or umlauts. Those can be used with another character, such as an a or a u, to create a new logical character. A character can therefore consist of one or more codepoints.

To be useful in computing systems we need to choose a representation for this information. Those are the various Unicode encodings, such as UTF-8, UTF-16LE, UTF-32 etc. They are distinguished largely by the size of their code units. UTF-32 is the simplest encoding: it has a 32-bit code unit, which means an individual codepoint fits comfortably into a code unit. The other encodings will have situations where a codepoint needs multiple code units, or where that particular codepoint can't be represented in the encoding at all (this is a problem, for instance, with UCS-2).

Because of the flexibility of combining characters, even within a given encoding the number of bytes per character can vary depending on the character and the normalization form. A normalization form is a convention for dealing with characters which have more than one representation (you can say "an 'a' with an accent", which is 2 codepoints, one of which is a combining character, or "accented 'a'", which is one codepoint).
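
A small Python sketch of that last point, using the standard `unicodedata` module. The choice of "à" as the sample character and the helper name `utf8_len` are just assumptions for the example.

```python
import unicodedata

composed = "\u00e0"      # LATIN SMALL LETTER A WITH GRAVE: one codepoint
decomposed = "a\u0300"   # 'a' followed by COMBINING GRAVE ACCENT: two codepoints


def utf8_len(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# Same logical character, different codepoint counts and byte counts.
print(len(composed), utf8_len(composed))
print(len(decomposed), utf8_len(decomposed))

# Normalization converts between the two representations.
print(unicodedata.normalize("NFC", decomposed) == composed)
print(unicodedata.normalize("NFD", composed) == decomposed)
```

Both strings render the same, so which byte count you get depends on which form the text happens to be stored in.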

鱼窥荷 2024-10-28 09:35:40

I know this question is old and already has an accepted answer, but I want to offer a few examples (hoping it'll be useful to someone).

As far as I know old ASCII characters took one byte per character.

Right. Actually, since ASCII is a 7-bit encoding, it supports 128 codes (95 of which are printable), so it only uses half of the 256 values a byte can hold.

How many bytes does a Unicode character require?

Unicode just maps characters to codepoints. It doesn't define how to encode them. A text file does not contain Unicode characters, but bytes/octets that may represent Unicode characters.

I assume that one Unicode character can contain every possible character from any language - am I correct?

No. But almost. So basically yes. But still no.

So how many bytes does it need per character?

Same as your 2nd question.

And what do UTF-7, UTF-6, UTF-16 etc mean? Are they some kind of Unicode versions?

No, those are encodings. They define how bytes/octets should represent Unicode characters.

A couple of examples. If some of those cannot be displayed in your browser (probably because the font doesn't support them), go to http://codepoints.net/U+1F6AA (replace 1F6AA with the codepoint in hex) to see an image.

    • U+0061 LATIN SMALL LETTER A: a
      • Nº: 97
      • UTF-8: 61
      • UTF-16: 00 61
    • U+00A9 COPYRIGHT SIGN: ©
      • Nº: 169
      • UTF-8: C2 A9
      • UTF-16: 00 A9
    • U+00AE REGISTERED SIGN: ®
      • Nº: 174
      • UTF-8: C2 AE
      • UTF-16: 00 AE
    • U+1337 ETHIOPIC SYLLABLE PHWA: ጷ
      • Nº: 4919
      • UTF-8: E1 8C B7
      • UTF-16: 13 37
    • U+2014 EM DASH:
      • Nº: 8212
      • UTF-8: E2 80 94
      • UTF-16: 20 14
    • U+2030 PER MILLE SIGN: ‰
      • Nº: 8240
      • UTF-8: E2 80 B0
      • UTF-16: 20 30
    • U+20AC EURO SIGN: €
      • Nº: 8364
      • UTF-8: E2 82 AC
      • UTF-16: 20 AC
    • U+2122 TRADE MARK SIGN: ™
      • Nº: 8482
      • UTF-8: E2 84 A2
      • UTF-16: 21 22
    • U+2603 SNOWMAN: ☃
      • Nº: 9731
      • UTF-8: E2 98 83
      • UTF-16: 26 03
    • U+260E BLACK TELEPHONE: ☎
      • Nº: 9742
      • UTF-8: E2 98 8E
      • UTF-16: 26 0E
    • U+2614 UMBRELLA WITH RAIN DROPS: ☔
      • Nº: 9748
      • UTF-8: E2 98 94
      • UTF-16: 26 14
    • U+263A WHITE SMILING FACE: ☺
      • Nº: 9786
      • UTF-8: E2 98 BA
      • UTF-16: 26 3A
    • U+2691 BLACK FLAG: ⚑
      • Nº: 9873
      • UTF-8: E2 9A 91
      • UTF-16: 26 91
    • U+269B ATOM SYMBOL: ⚛
      • Nº: 9883
      • UTF-8: E2 9A 9B
      • UTF-16: 26 9B
    • U+2708 AIRPLANE: ✈
      • Nº: 9992
      • UTF-8: E2 9C 88
      • UTF-16: 27 08
    • U+271E SHADOWED WHITE LATIN CROSS: ✞
      • Nº: 10014
      • UTF-8: E2 9C 9E
      • UTF-16: 27 1E
    • U+3020 POSTAL MARK FACE: 〠
      • Nº: 12320
      • UTF-8: E3 80 A0
      • UTF-16: 30 20
    • U+8089 CJK UNIFIED IDEOGRAPH-8089: 肉
      • Nº: 32905
      • UTF-8: E8 82 89
      • UTF-16: 80 89
    • U+1F4A9 PILE OF POO: 💩
      • Nº: 128169
      • UTF-8: F0 9F 92 A9
      • UTF-16: D8 3D DC A9
    • U+1F680 ROCKET: 🚀
      • Nº: 128640
      • UTF-8: F0 9F 9A 80
      • UTF-16: D8 3D DE 80
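
If you want to reproduce entries like these yourself, here is a minimal Python sketch. The function name `show_encodings` is made up for this example; it assumes Python 3.8+ (for `bytes.hex` with a separator) and uses big-endian UTF-16 without a BOM, which matches the byte order shown in the list.

```python
def show_encodings(code_point: int) -> dict:
    """Return the UTF-8 and UTF-16 (big-endian, no BOM) bytes of a code point as hex."""
    ch = chr(code_point)
    return {
        "utf-8": ch.encode("utf-8").hex(" ").upper(),
        "utf-16": ch.encode("utf-16-be").hex(" ").upper(),
    }


for cp in (0x0061, 0x00A9, 0x20AC, 0x1F4A9):
    encodings = show_encodings(cp)
    print(f"U+{cp:04X}  Nº: {cp}  UTF-8: {encodings['utf-8']}  UTF-16: {encodings['utf-16']}")
```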

Okay I'm getting carried away...

Fun facts:

粉红×色少女 2024-10-28 09:35:40

Simply speaking, Unicode is a standard which assigns one number (called a code point) to every character in the world (it's still a work in progress).

Now you need to represent these code points using bytes; that's called character encoding. UTF-8, UTF-16, UTF-6 are ways of representing those characters.

UTF-8 is a multibyte character encoding. Characters take 1 to 4 bytes (the original design allowed up to 6 bytes, but RFC 3629 restricts UTF-8 to 4).

In UTF-32, each character takes 4 bytes.

UTF-16 uses one 16-bit code unit for each character in the part of Unicode called the BMP (which is enough for most practical purposes) and two 16-bit code units, a surrogate pair, for everything else. Java uses this encoding in its strings.

维持三分热 2024-10-28 09:35:40

In Unicode, every character is represented by an integer from zero to 0x10FFFF. Doing this naively in 32-bit integers is called the UTF-32 encoding. To be less wasteful, UTF-8 and UTF-16 are encodings that require less space for the lower codepoints.

Note that what is called UTF-16 in implementations is often really just UCS-2: the subset of codepoints that UTF-16 can fit in 16 bits.

The storage requirements are as follows.

In UTF-8:

1 byte:       0 -     7F  (ASCII)
2 bytes:     80 -    7FF  (all European plus some Middle Eastern)
3 bytes:    800 -   FFFF  (multilingual plane incl. the top 1792 and private-use)
4 bytes:  10000 - 10FFFF

In UTF-16:

2 bytes:      0 -   D7FF, E000 - FFFF  (multilingual plane; D800 - DFFF are reserved for surrogates)
4 bytes:  10000 - 10FFFF

In UTF-32:

4 bytes:      0 - 10FFFF

10FFFF is the last unicode codepoint by definition, and it's defined that way because it's UTF-16's technical limit.

All of these codepoints also fit into UTF-8's 4-byte form, and the idea behind UTF-8's encoding also works for 5- and 6-byte sequences, which would cover codepoints up to 7FFFFFFF, i.e. half of what UTF-32 can.
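
The storage requirements can be written as a small lookup. This is only a sketch and `encoded_size` is a name invented here. It rejects the surrogate range D800 - DFFF, because those values are reserved for UTF-16 pairs and are not characters themselves, and it treats E000 - FFFF as two bytes in UTF-16, since those codepoints fit in a single 16-bit unit.

```python
def encoded_size(code_point: int) -> dict:
    """Bytes needed to store one code point in UTF-8, UTF-16 and UTF-32."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode range 0..10FFFF")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates are not encodable characters")

    if code_point <= 0x7F:
        utf8 = 1
    elif code_point <= 0x7FF:
        utf8 = 2
    elif code_point <= 0xFFFF:
        utf8 = 3
    else:
        utf8 = 4

    # One 16-bit unit inside the BMP, a surrogate pair (two units) above it.
    utf16 = 2 if code_point <= 0xFFFF else 4

    return {"utf-8": utf8, "utf-16": utf16, "utf-32": 4}


for cp in (0x41, 0x3B1, 0x20AC, 0xFFFD, 0x1F680):
    print(f"U+{cp:04X}", encoded_size(cp))
```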

挽清梦 2024-10-28 09:35:40

There is a great tool for calculating the bytes of any string in UTF-8: http://mothereff.in/byte-counter

Update: @mathias has made the code public: https://github.com/mathiasbynens/mothereff.in/blob/master/byte-counter/eff.js
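
If you just need the number locally, the same kind of count is a one-liner in Python. To be clear, this is not the code of the linked tool, only a sketch of the idea, and `utf8_byte_count` is a name chosen for the example.

```python
def utf8_byte_count(text: str) -> int:
    """Number of bytes the string occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


for sample in ("hello", "na\u00efve", "\u20ac100"):
    # Compare the code point count with the UTF-8 byte count.
    print(repr(sample), len(sample), utf8_byte_count(sample))
```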

淡淡的优雅 2024-10-28 09:35:40

In Unicode the answer is not easily given. The problem, as you already pointed out, is the encodings.

Given any English sentence without diacritic characters, the answer for UTF-8 would be as many bytes as characters and for UTF-16 it would be number of characters times two.

The only encoding where (as of now) we can make a general statement about the size is UTF-32. There it's always 32 bits per character, even though I imagine that code points are prepared for a future UTF-64 :)

At least two things make it so difficult:

  1. composed characters, where instead of using the precomposed character that already carries the accent/diacritic (À), a user decides to combine the base character with a combining accent (A followed by a combining grave accent).
  2. multi-byte sequences. The UTF encodings use sequences of code units to encode more bits than the number in their name would suggest. E.g. UTF-8 designates certain first bytes which on their own are invalid, but which, when followed by valid continuation bytes, describe a character beyond the 8-bit range of 0..255. See the Examples and Overlong Encodings sections in the Wikipedia article on UTF-8.
    • The example given there is the € character: code point U+20AC is encoded as the three-byte sequence E2 82 AC, and padding it out to the four-byte sequence F0 82 82 AC would carry the same value.
    • I originally claimed that both are valid, to show how complicated the answer is when talking about "Unicode" and not about a specific encoding of Unicode, such as UTF-8 or UTF-16. As pointed out in a comment, this is not the case and was based on a misunderstanding on my part. The updated Wikipedia article reads: "Longer encodings are called overlong and are not valid UTF-8 representations of the code point."
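
The overlong point is easy to check with Python's built-in strict UTF-8 decoder. This is a small sketch and `is_valid_utf8` is a helper name made up for it.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode as UTF-8 under Python's strict decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


shortest = bytes.fromhex("E282AC")     # the three-byte encoding of U+20AC
overlong = bytes.fromhex("F08282AC")   # the four-byte overlong form

print(is_valid_utf8(shortest))   # accepted
print(is_valid_utf8(overlong))   # rejected as an overlong encoding
```
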
〆一缕阳光ご 2024-10-28 09:35:40

Well I just pulled up the Wikipedia page on it too, and in the intro portion I saw "Unicode can be implemented by different character encodings. The most commonly used encodings are UTF-8 (which uses one byte for any ASCII characters, which have the same code values in both UTF-8 and ASCII encoding, and up to four bytes for other characters), the now-obsolete UCS-2 (which uses two bytes for each character but cannot encode every character in the current Unicode standard)"

As this quote demonstrates, your problem is that you are assuming Unicode is a single way of encoding characters. There are actually multiple encodings of Unicode, and, again in that quote, one of them (UTF-8) even uses 1 byte per character for ASCII text, just like what you are used to.

So the simple answer you want is that it varies.

孤星 2024-10-28 09:35:40

Unicode is a standard which provides a unique number for every character. These unique numbers are called code points, and the goal is to assign one to every character existing in the world (some are still to be added).

For different purposes, you might need to represent these code points in bytes (most programming languages do so), and this is where character encoding comes in.

UTF-8, UTF-16, UTF-32 and so on are all character encodings, and Unicode's code points are represented in these encodings, in different ways.

UTF-8 is a variable-width encoding: a character encoded in it occupies 1 to 4 bytes inclusive;

UTF-16 is also variable-width: a character encoded in it takes either 2 or 4 bytes (one or two 16-bit code units). A single code unit covers only the part of Unicode called the BMP (Basic Multilingual Plane), which is enough for almost all cases; everything else needs two code units. Java uses UTF-16 encoding for its strings and characters;

UTF-32 is fixed-width: each character takes exactly 4 bytes (32 bits).

听闻余生 2024-10-28 09:35:40

For UTF-16, a character needs four bytes (two code units) if its first code unit is in the range 0xD800 to 0xDBFF; such a pair of code units is called a "surrogate pair." More specifically, a surrogate pair has the form:

[0xD800 - 0xDBFF]  [0xDC00 - 0xDFFF]

where [...] indicates a two-byte code unit with the given range. Anything <= 0xD7FF is one code unit (two bytes). Anything >= 0xE000 is also one code unit (two bytes); only an unpaired code unit in the range 0xD800 to 0xDFFF is invalid.

See http://unicodebook.readthedocs.io/unicode_encodings.html, section 7.5.
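
For the curious, here is a minimal Python sketch of how a code point above 0xFFFF maps to a surrogate pair and back. The function names `to_surrogate_pair` and `from_surrogate_pair` are invented for this example.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a code point in 0x10000..0x10FFFF into (high, low) UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need a surrogate pair")
    offset = code_point - 0x10000        # 20 bits remain after the subtraction
    high = 0xD800 + (offset >> 10)       # top 10 bits -> 0xD800..0xDBFF
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> 0xDC00..0xDFFF
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a (high, low) surrogate pair into the original code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F4A9)
print(f"{high:04X} {low:04X}")
print(f"{from_surrogate_pair(high, low):X}")
```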

笙痞 2024-10-28 09:35:40

Check out this Unicode code converter. For example, enter 0x2009, where 2009 is the Unicode number for thin space, in the "0x... notation" field, and click Convert. The hexadecimal number E2 80 89 (3 bytes) appears in the "UTF-8 code units" field.

染柒℉ 2024-10-28 09:35:40

From Wiki:

UTF-8, an 8-bit variable-width encoding which maximizes compatibility with ASCII;

UTF-16, a 16-bit, variable-width encoding;

UTF-32, a 32-bit, fixed-width encoding.

These are the three most popular encodings.

  • In UTF-8, each character is encoded into 1 to 4 bytes (the dominant encoding),
  • in UTF-16, each character is encoded into one or two 16-bit words, and
  • in UTF-32, every character is encoded as a single 32-bit word.
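
A quick way to see those unit counts for yourself in Python. This is a sketch, and `code_units` is a name made up here. The little-endian codec variants are used only so that no BOM is added to the output.

```python
def code_units(ch: str) -> dict:
    """Count the code units one character needs in each of the three encodings."""
    return {
        "utf-8": len(ch.encode("utf-8")),             # 8-bit units (bytes)
        "utf-16": len(ch.encode("utf-16-le")) // 2,   # 16-bit words
        "utf-32": len(ch.encode("utf-32-le")) // 4,   # 32-bit words
    }


for ch in ("a", "\u00e9", "\u20ac", "\U0001F680"):
    print(f"U+{ord(ch):04X}", code_units(ch))
```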