How many bytes does one Unicode character take?
I am a bit confused about encodings. As far as I know old ASCII characters took one byte per character. How many bytes does a Unicode character require?
I assume that one Unicode character can contain every possible character from any language - am I correct? So how many bytes does it need per character?
And what do UTF-7, UTF-6, UTF-16 etc. mean? Are they different versions of Unicode?
I read the Wikipedia article about Unicode but it is quite difficult for me. I am looking forward to seeing a simple answer.
Strangely enough, nobody pointed out how to calculate how many bytes one Unicode char takes. Here is the rule for UTF-8 encoded strings:

Binary      Hex           Comments
0xxxxxxx    0x00..0x7F    Only byte of a 1-byte character encoding
10xxxxxx    0x80..0xBF    Continuation byte: one of 1-3 bytes following the first
110xxxxx    0xC0..0xDF    First byte of a 2-byte character encoding
1110xxxx    0xE0..0xEF    First byte of a 3-byte character encoding
11110xxx    0xF0..0xF7    First byte of a 4-byte character encoding

So the quick answer is: it takes 1 to 4 bytes, depending on the first one, which will indicate how many bytes it'll take up.
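That first-byte rule can be sketched in Python (the function name utf8_char_len is my own, not a standard API):

```python
def utf8_char_len(first_byte: int) -> int:
    """Number of bytes in a UTF-8 sequence, judged from its first byte alone."""
    if first_byte < 0x80:   # 0xxxxxxx: plain ASCII
        return 1
    if first_byte < 0xC0:   # 10xxxxxx: continuation byte, not a start byte
        raise ValueError("continuation byte, not the start of a character")
    if first_byte < 0xE0:   # 110xxxxx
        return 2
    if first_byte < 0xF0:   # 1110xxxx
        return 3
    return 4                # 11110xxx

# Cross-check against Python's own UTF-8 encoder.
for ch in ("a", "é", "€", "🚪"):
    encoded = ch.encode("utf-8")
    assert utf8_char_len(encoded[0]) == len(encoded)
    print(ch, len(encoded))
```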
You won't see a simple answer because there isn't one.
First, Unicode doesn't contain "every character from every language", although it sure does try.
Unicode itself is a mapping. It defines codepoints, and a codepoint is a number, usually associated with a character. I say usually because there are concepts like combining characters. You may be familiar with things like accents or umlauts. Those can be used with another character, such as an a or a u, to create a new logical character. A character therefore can consist of 1 or more codepoints.

To be useful in computing systems we need to choose a representation for this information. Those are the various Unicode encodings, such as UTF-8, UTF-16LE, UTF-32 etc. They are distinguished largely by the size of their code units. UTF-32 is the simplest encoding: it has a code unit that is 32 bits, which means an individual codepoint fits comfortably into a code unit. The other encodings will have situations where a codepoint needs multiple code units, or where a particular codepoint can't be represented in the encoding at all (this is a problem, for instance, with UCS-2).
Because of the flexibility of combining characters, even within a given encoding the number of bytes per character can vary depending on the character and the normalization form. This is a protocol for dealing with characters which have more than one representation (you can say "an 'a' with an accent", which is 2 codepoints, one of which is a combining char, or "accented 'a'", which is one codepoint).
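That two-representations point can be seen directly with Python's unicodedata module (a small sketch):

```python
import unicodedata

composed = "\u00E1"     # 'á' as one codepoint: LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"  # 'a' plus COMBINING ACUTE ACCENT: two codepoints

# Normalization converts between the two forms of the same logical character.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed

# Their byte counts in UTF-8 differ even though they render identically.
print(len(composed.encode("utf-8")))    # 2 bytes
print(len(decomposed.encode("utf-8")))  # 3 bytes
```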
I know this question is old and already has an accepted answer, but I want to offer a few examples (hoping it'll be useful to someone).
Right. Actually, since ASCII is a 7-bit encoding, it supports 128 codes (95 of which are printable), so it only uses half of a byte's range (if that makes any sense).
Unicode just maps characters to codepoints. It doesn't define how to encode them. A text file does not contain Unicode characters, but bytes/octets that may represent Unicode characters.
No. But almost. So basically yes. But still no.
Same as your 2nd question.
No, those are encodings. They define how bytes/octets should represent Unicode characters.
A couple of examples. If some of those cannot be displayed in your browser (probably because the font doesn't support them), go to http://codepoints.net/U+1F6AA (replace 1F6AA with the codepoint in hex) to see an image.

a
©
®
ጷ
—
‰
€
™
☃
☎
☔
☺
⚑
⚛
✈
✞
〠
肉
(two astral-plane characters, e.g. emoji, whose glyphs were lost in transcription; each takes 4 bytes in UTF-8)
Okay I'm getting carried away...
Fun facts:
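The per-character byte counts of some of the examples above can be checked like this (a small sketch):

```python
# Print codepoint and UTF-8 byte length for a few of the example characters.
for ch in "a©®€™☃肉":
    print(f"U+{ord(ch):04X} {ch!r}: {len(ch.encode('utf-8'))} bytes in UTF-8")
```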
Simply speaking, Unicode is a standard which assigns one number (called a code point) to every character of the world (it's still a work in progress). Now you need to represent these code points using bytes, and that's called character encoding.

UTF-8, UTF-16, UTF-32 and so on are ways of representing those characters.

UTF-8 is a multibyte character encoding: characters can take 1 to 4 bytes (the original design allowed up to 6, but code points are now capped at U+10FFFF, so 4 are enough).

UTF-32 uses 4 bytes per character.

UTF-16 uses 16-bit code units: one unit covers the part of Unicode called the BMP (which is enough for all practical purposes), and characters outside it take two units. Java uses this encoding in its strings.
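These size differences are easy to observe in Python (a sketch; the little-endian variants are used so a BOM doesn't inflate the byte counts):

```python
# Compare the encoded size of one ASCII, one BMP, and one astral character.
for text in ("a", "€", "🚪"):
    for enc in ("utf-8", "utf-16-le", "utf-32-le"):
        print(f"{text!r} in {enc}: {len(text.encode(enc))} bytes")
```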
In Unicode, every character is represented by an integer from zero to 0x10FFFF. Doing this naively in 32-bit integers is called the UTF-32 encoding. To be less wasteful, UTF-8 and UTF-16 are encodings that require less space for the lower codepoints.
Note that what is called UTF-16 in implementations is often really just UCS2: the subset of codepoints that UTF-16 can fit in a single 16-bit code unit.
The storage requirements are as follows.

In UTF-8:
1 byte:  codepoints 0000 to 007F
2 bytes: codepoints 0080 to 07FF
3 bytes: codepoints 0800 to FFFF
4 bytes: codepoints 10000 to 10FFFF

In UTF-16:
2 bytes: codepoints 0000 to FFFF
4 bytes: codepoints 10000 to 10FFFF

In UTF-32:
4 bytes for every codepoint.
10FFFF is the last Unicode codepoint by definition, and it's defined that way because it's UTF-16's technical limit.

It is also the largest codepoint UTF-8 can encode in 4 bytes, but the idea behind UTF-8's encoding also works for 5- and 6-byte encodings to cover codepoints up to 7FFFFFFF, i.e. half of what UTF-32 can.
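Those ranges can be spot-checked at their boundaries (a sketch; again using the little-endian encodings to avoid the BOM):

```python
# Byte counts at the boundary codepoints of the ranges listed above.
for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    ch = chr(cp)
    print(f"U+{cp:06X}: UTF-8={len(ch.encode('utf-8'))} "
          f"UTF-16={len(ch.encode('utf-16-le'))} "
          f"UTF-32={len(ch.encode('utf-32-le'))}")
```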
There is a great tool for calculating the bytes of any string in UTF-8: http://mothereff.in/byte-counter
Update: @mathias has made the code public: https://github.com/mathiasbynens/mothereff.in/blob/master/byte-counter/eff.js
In Unicode the answer is not easily given. The problem, as you already pointed out, is the encodings.
Given any English sentence without diacritic characters, the answer for UTF-8 would be as many bytes as characters, and for UTF-16 it would be the number of characters times two.

The only encoding where (as of now) we can make a statement about the size is UTF-32. There it's always 32 bits per character, even though I imagine that code points are prepared for a future UTF-64 :)
What makes it so difficult are at least two things:
U+20AC
can be represented either as three-byte sequenceE2 82 AC
or four-byte sequenceF0 82 82 AC
.Both are valid, and this shows how complicated the answer is when talking about "Unicode" and not about a specific encoding of Unicode, such as UTF-8 or UTF-16.Strictly speaking, as pointed out in a comment, this doesn't seem to be the case any longer or was even based on a misunderstanding on my part. The quote from the updated Wikipedia article reads: Longer encodings are called overlong and are not valid UTF-8 representations of the code point.好吧,我也刚刚打开了维基百科页面,在介绍部分我看到“Unicode 可以通过不同的字符编码来实现。最常用的编码是 UTF-8(它对任何 ASCII 字符使用一个字节,其中有UTF-8 和 ASCII 编码中的代码值相同,其他字符最多为四个字节),现在已过时的 UCS-2(每个字符使用两个字节,但无法对当前 Unicode 标准中的每个字符进行编码)”
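The overlong-encoding note above can be checked empirically: modern UTF-8 decoders accept only the shortest form (a sketch in Python):

```python
# The canonical 3-byte encoding of U+20AC (the euro sign) decodes fine...
assert b"\xe2\x82\xac".decode("utf-8") == "\u20ac"

# ...but the overlong 4-byte form F0 82 82 AC is rejected as invalid UTF-8.
try:
    b"\xf0\x82\x82\xac".decode("utf-8")
except UnicodeDecodeError as exc:
    print("rejected:", exc)
```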
Well I just pulled up the Wikipedia page on it too, and in the intro portion I saw "Unicode can be implemented by different character encodings. The most commonly used encodings are UTF-8 (which uses one byte for any ASCII characters, which have the same code values in both UTF-8 and ASCII encoding, and up to four bytes for other characters), the now-obsolete UCS-2 (which uses two bytes for each character but cannot encode every character in the current Unicode standard)"
As this quote demonstrates, your problem is that you are assuming Unicode is a single way of encoding characters. There are actually multiple forms of Unicode, and, again per that quote, one of them even uses 1 byte per character for the ASCII range, just like what you are used to.
So your simple answer that you want is that it varies.
Unicode is a standard which provides a unique number for every character existing in the world (some are still to be added). These unique numbers are called code points.

For different purposes, you might need to represent these code points in bytes (most programming languages do so), and here's where character encoding kicks in. UTF-8, UTF-16, UTF-32 and so on are all character encodings, and Unicode's code points are represented in these encodings, in different ways.

UTF-8 encoding has a variable-width length: characters encoded in it can occupy 1 to 4 bytes inclusive.

UTF-16 has a variable length: characters encoded in it take either 2 or 4 bytes (one or two 16-bit code units). One unit covers the part of Unicode called the BMP (Basic Multilingual Plane), which is enough for almost all cases. Java uses UTF-16 encoding for its strings and characters.

UTF-32 has a fixed length: each character takes exactly 4 bytes (32 bits).
For UTF-16, the character needs four bytes (two code units) if it starts with 0xD800 or greater; such a character is called a "surrogate pair". More specifically, a surrogate pair has the form:

[0xD800-0xDBFF] [0xDC00-0xDFFF]
where [...] indicates a two-byte code unit with the given range. Anything <= 0xD7FF is one code unit (two bytes). Anything >= 0xE000 is likewise one code unit (two bytes); code units in the range 0xD800-0xDFFF are only valid as halves of a surrogate pair.
See http://unicodebook.readthedocs.io/unicode_encodings.html, section 7.5.
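The surrogate-pair arithmetic above can be verified in Python, using U+1F6AA as an example (a sketch):

```python
ch = "\U0001F6AA"  # U+1F6AA, a codepoint outside the BMP
units = ch.encode("utf-16-be")
print(units.hex(" ", 2))  # two 16-bit code units: d83d deaa

# The lead unit falls in 0xD800-0xDBFF, the trail unit in 0xDC00-0xDFFF.
lead = int.from_bytes(units[:2], "big")
trail = int.from_bytes(units[2:], "big")
assert 0xD800 <= lead <= 0xDBFF
assert 0xDC00 <= trail <= 0xDFFF
```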
Check out this Unicode code converter. For example, enter 0x2009, where 2009 is the Unicode number for thin space, in the "0x... notation" field, and click Convert. The hexadecimal number E2 80 89 (3 bytes) appears in the "UTF-8 code units" field.
(3 bytes) appears in the "UTF-8 code units" field.来自维基:
这是三种最流行的不同编码。
From Wiki:
These are the three most popular encodings.