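First, not all Unicode representations are variable length. UTF-32 and UCS-2 are fixed length (UCS-2 can only represent the BMP). UTF-8 and UTF-16 are each variable length in their own way: UTF-8 uses one to four bytes per code point, and UTF-16 uses one or two 16-bit code units.
Second, if you read the specification, you will learn that the sequences are self-describing. In UTF-8, a byte value that can start a sequence can never appear as a second, third, or fourth byte, because continuation bytes always have the form 10xxxxxx. Ditto for the surrogate pairs that represent non-BMP characters in UTF-16: high surrogates (0xD800-0xDBFF) and low surrogates (0xDC00-0xDFFF) come from disjoint ranges.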
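To make the "self-describing" point concrete, here is a minimal Python sketch. The function names are my own, not from any library, and it assumes the input is already valid UTF-8. It shows that you can land on an arbitrary byte and still find where the character starts, and that a UTF-16 code unit tells you by its value alone which half of a surrogate pair it is.

```python
def is_continuation(byte: int) -> bool:
    """True for UTF-8 continuation bytes, which always look like 10xxxxxx."""
    return (byte & 0xC0) == 0x80


def char_start(data: bytes, index: int) -> int:
    """Back up from an arbitrary index to the first byte of its character.

    Assumes `data` is valid UTF-8 (hypothetical helper, no validation).
    """
    while index > 0 and is_continuation(data[index]):
        index -= 1
    return index


def is_high_surrogate(unit: int) -> bool:
    """True if a UTF-16 code unit is the first half of a surrogate pair."""
    return 0xD800 <= unit <= 0xDBFF


def is_low_surrogate(unit: int) -> bool:
    """True if a UTF-16 code unit is the second half of a surrogate pair."""
    return 0xDC00 <= unit <= 0xDFFF


data = "a€b".encode("utf-8")   # b'a\xe2\x82\xacb'
start = char_start(data, 2)    # index 2 is in the middle of the euro sign
print(start, data[start:start + 3].decode("utf-8"))
```

Because lead bytes and continuation bytes never overlap, the backward scan needs at most three steps for valid input.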
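A commonly used encoding is UTF-8. It is structured so that the leading bits of a character's first byte tell you how many more bytes follow: 0xxxxxxx is a single byte on its own, while 110xxxxx, 1110xxxx, and 11110xxx start two-, three-, and four-byte sequences, and every following byte looks like 10xxxxxx.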
See http://en.wikipedia.org/wiki/UTF-8#Design for a nice diagram.
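As a rough illustration of how a reader of the byte stream knows how far each character extends, here is a small Python sketch that reads the length off the first byte. `sequence_length` and `split_characters` are names I made up for this example. It only looks at the lead-byte bit patterns and does not do the full validation a real decoder needs (overlong forms, surrogates, truncated input).

```python
def sequence_length(lead: int) -> int:
    """Return the total number of bytes in the sequence started by `lead`."""
    if (lead & 0x80) == 0x00:   # 0xxxxxxx: single byte (ASCII)
        return 1
    if (lead & 0xE0) == 0xC0:   # 110xxxxx: one more byte follows
        return 2
    if (lead & 0xF0) == 0xE0:   # 1110xxxx: two more bytes follow
        return 3
    if (lead & 0xF8) == 0xF0:   # 11110xxx: three more bytes follow
        return 4
    # 10xxxxxx is a continuation byte; 11111xxx never appears in UTF-8.
    raise ValueError(f"0x{lead:02X} cannot start a UTF-8 sequence")


def split_characters(data: bytes) -> list:
    """Split UTF-8 bytes into one bytes object per encoded character.

    Sketch only: assumes `data` is valid, complete UTF-8.
    """
    chars = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        chars.append(data[i:i + n])
        i += n
    return chars


for chunk in split_characters("aé€😀".encode("utf-8")):
    print(len(chunk), chunk.decode("utf-8"))
```

In everyday code you would just call `bytes.decode("utf-8")`; the point here is only that the length information sits in the first byte.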