Why is UTF-8 encoded the way it is?
If I understood correctly, UTF-8 uses the following pattern to let the computer know how many bytes are going to be used to encode a character:
| Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|
| 0xxxxxxx | | | |
| 110xxxxx | 10xxxxxx | | |
| 1110xxxx | 10xxxxxx | 10xxxxxx | |
| 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
Etc. But aren't there more compact patterns? For instance, what is stopping us from using something like this:
| Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|
| 0xxxxxxx | | | |
| 10xxxxxx | xxxxxxxx | | |
| 110xxxxx | xxxxxxxx | xxxxxxxx | |
| 1110xxxx | xxxxxxxx | xxxxxxxx | xxxxxxxx |
Comments (1)
Your proposed encoding wouldn't be self-synchronizing. If you landed in the middle of a stream on an `xxxxxxxx` byte, you'd have no idea whether it's in the middle of a character or not. If that random byte happened to be `10xxxxxx`, you could mistake it for the start of a character. The only way to avoid this mistake is to read the entire stream, error-free, from the beginning.

It's an explicit goal for UTF-8 to be self-synchronizing. If you land anywhere in a UTF-8 stream, you know whether you're in the middle of a character or not, and need to read at most 3 bytes to find the next start of a full character.
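
To make the resynchronization step concrete, here is a minimal Python sketch. The helper names `is_continuation` and `next_char_start` are made up for illustration and are not part of any library; the sketch also assumes the input is well-formed UTF-8. It relies only on the fact that every continuation byte has the form `10xxxxxx`, so skipping those bytes from any offset lands on the start of a character.

```python
def is_continuation(byte: int) -> bool:
    """True if the byte has the form 10xxxxxx (a UTF-8 continuation byte)."""
    return (byte & 0b1100_0000) == 0b1000_0000


def next_char_start(data: bytes, pos: int) -> int:
    """Return the index of the first character start at or after pos.

    Hypothetical helper: assumes data is well-formed UTF-8.
    """
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


# Characters of 1, 2, 3, 4 and 1 bytes respectively (11 bytes in total).
data = "aé€😀z".encode("utf-8")

for pos in range(len(data)):
    start = next_char_start(data, pos)
    # Never more than 3 bytes are skipped before reaching a character start.
    assert start - pos <= 3
    # Decoding from the resynchronized position succeeds without errors.
    data[start:].decode("utf-8")
```

With the proposed scheme, no such check is possible: a trailing `xxxxxxxx` byte can take any value, so nothing distinguishes it from a lead byte.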