UTF-8 continuation bytes
I'm trying to figure out what "continuation bytes" are in the UTF-8 encoding (for curiosity's sake).
Wikipedia introduces this term in the UTF-8 article without defining it at all, and a Google search returns no useful information either. I'm about to jump into the official specification, but would prefer to read a high-level summary first.
Comments (3)
A continuation byte in UTF-8 is any byte where the top two bits are 10. They are the subsequent bytes in multi-byte sequences. The following table may help:

Code point range      UTF-8 byte sequence (binary)
U+0000 to U+007F      0xxxxxxx
U+0080 to U+07FF      110xxxxx 10xxxxxx
U+0800 to U+FFFF      1110xxxx 10xxxxxx 10xxxxxx
U+10000 to U+10FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Here you can see how the Unicode code points map to UTF-8 multi-byte byte sequences.

The basic rules are:
- If a byte starts with a 0 bit, it's a single-byte value less than 128.
- If it starts with 11, it's the first byte of a multi-byte sequence, and the number of 1 bits at the start indicates how many bytes there are in total (110xxxxx has two bytes, 1110xxxx has three and 11110xxx has four).
- If it starts with 10, it's a continuation byte.

This distinction allows quite handy processing, such as being able to back up from any byte in a sequence to find the first byte of that code point: just search backwards until you find a byte that does not begin with the bits 10.

Similarly, it can be used for a UTF-8 strlen by counting only the non-10xxxxxx bytes.
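The rules above can be sketched in a few lines of Python. This is only an illustration that assumes the input is already valid UTF-8; the function names (is_continuation, sequence_length, code_point_start, utf8_strlen) are made up for this example and do not come from any library.

```python
def is_continuation(byte: int) -> bool:
    """True if the top two bits of the byte are 10."""
    return (byte & 0xC0) == 0x80


def sequence_length(first_byte: int) -> int:
    """Total number of bytes in the sequence that begins with first_byte."""
    if (first_byte & 0x80) == 0x00:   # 0xxxxxxx
        return 1
    if (first_byte & 0xE0) == 0xC0:   # 110xxxxx
        return 2
    if (first_byte & 0xF0) == 0xE0:   # 1110xxxx
        return 3
    if (first_byte & 0xF8) == 0xF0:   # 11110xxx
        return 4
    raise ValueError("not a valid first byte")


def code_point_start(data: bytes, index: int) -> int:
    """Back up from index to the first byte of the code point containing it."""
    while index > 0 and is_continuation(data[index]):
        index -= 1
    return index


def utf8_strlen(data: bytes) -> int:
    """Count code points by counting every byte that is not 10xxxxxx."""
    return sum(1 for byte in data if not is_continuation(byte))


# "a" (1 byte), U+20AC (3 bytes) and U+1F600 (4 bytes): 8 bytes in total.
data = "a\u20ac\U0001F600".encode("utf-8")

print(utf8_strlen(data))           # 3
print(code_point_start(data, 6))   # 4
print(sequence_length(data[4]))    # 4
```

Neither the backward search nor the counting trick validates the input; on malformed data (for example a stray continuation byte) they would still return a result, just not a meaningful one.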
In short, continuation bytes are all the bytes of a sequence other than the first byte (or the only byte, for a single-byte character). In UTF-8, continuation bytes start with the bits 10, which puts them in the range 0x80 to 0xBF.
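Because "10" here means the two high bits of the byte rather than a byte value, a quick sketch can list exactly which byte values qualify. The variable names are illustrative only.

```python
# Collect every byte value whose top two bits are 10 (mask 0xC0, pattern 0x80).
continuation_values = [b for b in range(256) if (b & 0xC0) == 0x80]

print(len(continuation_values))                                    # 64
print(hex(continuation_values[0]), hex(continuation_values[-1]))   # 0x80 0xbf

# U+00E9 encodes as the two bytes C3 A9; the second one is a continuation byte.
encoded = "\u00e9".encode("utf-8")
print([hex(b) for b in encoded])                                   # ['0xc3', '0xa9']
print(all(b in continuation_values for b in encoded[1:]))          # True
```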
“Continuation byte” isn’t a defined term; it is the ordinary English word “continuation” combined with the term “byte.” If used as a pseudo-term, it may confuse the reader.
The Unicode Standard uses this expression in one place only, Ch. 5, clause 5.22: “For example, consider the first three bytes of a four-byte UTF-8 sequence, followed by a byte which cannot be a valid continuation byte: .” In this context, the meaning is clear: it’s just a byte that continues something, namely a sequence of bytes.
The Wikipedia page apparently uses “continuation byte” to mean any byte in the UTF-8 encoding except the first byte of the encoded form of a character.