\r\n as part of a UTF-8 character?
Is it possible for some UTF-8 encoded character to include the bytes 0x0D 0x0A as part of its encoding? If yes, which characters are they?
(The task I'm trying to solve is reading a UTF-8 text file from a certain point rather than from the very beginning.)
Comments (3)
No, every byte of a multibyte-encoded code point always has its most significant bit set.
Bytes with values 0-127 in a UTF-8 stream map uniquely to ASCII characters.
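The claim above can be checked by brute force. The following is only a sketch: the helper names `has_ascii_byte` and `non_ascii_code_points` are made up for illustration. It encodes every non-ASCII code point and looks for any byte below 0x80.

```python
def has_ascii_byte(ch: str) -> bool:
    """Return True if any byte of ch's UTF-8 encoding is in the range 0x00-0x7F."""
    return any(b < 0x80 for b in ch.encode("utf-8"))


def non_ascii_code_points():
    """Yield every non-ASCII code point; surrogates are skipped because they cannot be encoded."""
    for cp in range(0x80, 0x110000):
        if not 0xD800 <= cp <= 0xDFFF:
            yield chr(cp)


# Collect any multibyte character whose encoding contains an ASCII-range byte.
offenders = [ch for ch in non_ascii_code_points() if has_ascii_byte(ch)]
print(len(offenders))  # prints 0
```

Since 0x0D and 0x0A are both below 0x80, an empty result means neither byte can occur inside a multibyte sequence.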
No, every character in the ASCII range 0-127 is represented "as is" in UTF-8 text. Every byte of a multibyte character has its eighth (most significant) bit set. This is one of the advantages of UTF-8.
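For the task in the question (starting to read somewhere in the middle of a file), this property means a 0x0A byte is always a real line feed, so it is safe to search for it from any offset. Below is a minimal sketch that works on an in-memory `bytes` object rather than a file. The function names `align_to_char_start` and `next_line_start` are hypothetical, and the continuation-byte test relies on the standard UTF-8 bit pattern `10xxxxxx`.

```python
def align_to_char_start(data: bytes, pos: int) -> int:
    """Advance pos past any UTF-8 continuation bytes (bit pattern 10xxxxxx)."""
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


def next_line_start(data: bytes, pos: int) -> int:
    """Return the index just after the next line feed at or after pos, or len(data)."""
    nl = data.find(b"\n", pos)
    return len(data) if nl == -1 else nl + 1


# Sample text: 'a', U+00E9, CR, LF, U+0D0A, 'c'
data = "a\u00e9\r\n\u0d0a" "c".encode("utf-8")

# Offset 2 lands inside the two-byte encoding of U+00E9; move to the next character.
start = align_to_char_start(data, 2)

# From any offset, the next 0x0A byte is a genuine line terminator.
line_start = next_line_start(data, 0)
rest = data[line_start:].decode("utf-8")
```

With a real file, the same idea would apply after `seek()` on a file opened in binary mode: either skip continuation bytes to reach a character boundary, or scan forward to the next line feed and decode from there.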
The single Unicode code point U+0D0A is represented in UTF-8 as the three bytes 0xE0 0xB4 0x8A. The two Unicode code points U+000D U+000A are represented in UTF-8 as the two bytes 0x0D 0x0A.
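Both encodings are easy to confirm with Python's built-in `str.encode`. The variable names below are only illustrative.

```python
# One code point, U+0D0A, encodes to three bytes, all with the high bit set.
single = "\u0d0a".encode("utf-8")

# Two code points, U+000D followed by U+000A, encode to the two ASCII bytes CR LF.
pair = "\u000d\u000a".encode("utf-8")

print(single.hex(" "))  # e0 b4 8a
print(pair.hex(" "))    # 0d 0a
```

Neither 0x0D nor 0x0A appears in the three-byte sequence, so the two cases cannot be confused when scanning raw bytes.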