UTF-8 continuation bytes

Posted 2025-01-06 19:14:29

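I'm trying to figure out what "continuation bytes" are in the UTF-8 encoding (out of curiosity).

Wikipedia introduces this term in its UTF-8 article but does not define it at all.

A Google search returns nothing useful either. I'm about to dive into the official specification, but I would prefer to read a high-level summary first.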

白鸥掠海 2025-01-13 19:14:29

A continuation byte in UTF-8 is any byte where the top two bits are 10.

They are the subsequent bytes in multi-byte sequences. The following table may help:

Unicode code points  Encoding  Binary value
-------------------  --------  ------------
 U+000000-U+00007f   0xxxxxxx  0xxxxxxx

 U+000080-U+0007ff   110yyyxx  00000yyy xxxxxxxx
                     10xxxxxx

 U+000800-U+00ffff   1110yyyy  yyyyyyyy xxxxxxxx
                     10yyyyxx
                     10xxxxxx

 U+010000-U+10ffff   11110zzz  000zzzzz yyyyyyyy xxxxxxxx
                     10zzyyyy
                     10yyyyxx
                     10xxxxxx

Here you can see how Unicode code points map to UTF-8 byte sequences, and how the bits of each code point's binary value are distributed across those bytes.
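
To make the table concrete, here is a minimal Python sketch that builds the bytes by hand using the same bit layout. The function name `encode_utf8` is made up for illustration, and the surrogate check is an addition the table does not show (UTF-8 does not encode U+D800 to U+DFFF). In practice you would just call `str.encode("utf-8")`.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one code point by hand, following the bit layout in the table."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        # Out of range, or a surrogate (not shown in the table, but not encodable).
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x80:
        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 110yyyxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 1110yyyy 10yyyyxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110zzz 10zzyyyy 10yyyyxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```

Every byte after the first is built as `0x80 | (six payload bits)`, which is exactly the 10xxxxxx continuation pattern.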

The basic rules are these:

If a byte starts with a 0 bit, it's a single-byte value less than 128.
If it starts with 11, it's the first byte of a multi-byte sequence, and the number of 1 bits at the start indicates how many bytes there are in total (110xxxxx has two bytes, 1110xxxx has three, and 11110xxx has four).
If it starts with 10, it's a continuation byte.
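
The three rules above can be sketched as bit-mask tests. This is an illustrative Python helper (the name `sequence_length` and the convention of returning 0 for a continuation byte are my own choices). It only inspects the leading bits, so it is not a full validator: a few byte values that match these patterns, such as 0xC0 or 0xF5, never occur in well-formed UTF-8.

```python
def sequence_length(byte: int) -> int:
    """Classify one byte by its leading bits.

    Returns the total length of the sequence the byte starts (1 to 4),
    or 0 if the byte is a continuation byte.
    """
    if (byte & 0x80) == 0x00:   # 0xxxxxxx: single byte, value below 128
        return 1
    if (byte & 0xC0) == 0x80:   # 10xxxxxx: continuation byte
        return 0
    if (byte & 0xE0) == 0xC0:   # 110xxxxx: first of two bytes
        return 2
    if (byte & 0xF0) == 0xE0:   # 1110xxxx: first of three bytes
        return 3
    if (byte & 0xF8) == 0xF0:   # 11110xxx: first of four bytes
        return 4
    raise ValueError(f"byte {byte:#04x} cannot appear in UTF-8")
```

For example, the encoding of U+00E9 is the two bytes C3 A9: `sequence_length(0xC3)` reports a two-byte sequence and `sequence_length(0xA9)` reports a continuation byte.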

This distinction allows quite handy processing, such as being able to back up from any byte in a sequence to find the first byte of that code point. Just search backwards until you find a byte that does not begin with the bits 10.
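
As a rough sketch of that backward search in Python (the function name `code_point_start` is hypothetical, and the input is assumed to be well-formed UTF-8):

```python
def code_point_start(data: bytes, index: int) -> int:
    """Return the index of the first byte of the code point containing data[index]."""
    # Step back over continuation bytes (10xxxxxx) until a lead byte is reached.
    while index > 0 and (data[index] & 0xC0) == 0x80:
        index -= 1
    return index
```

With `data = "a\u20acb".encode("utf-8")` (the bytes 61 E2 82 AC 62), starting from index 3 (the AC byte) the search steps back to index 1, where the three-byte sequence for U+20AC begins.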

Similarly, it can be used for a UTF-8 strlen that counts code points rather than bytes: count only the bytes that are not of the form 10xxxxxx.
