Manually converting Unicode codepoints into UTF-8 and UTF-16
I have a university programming exam coming up, and one section is on Unicode.
I have checked all over for answers to this, and my lecturer is useless so that's no help, so this is a last resort for you guys to possibly help.
The question will be something like:
The string 'mЖ丽' has these Unicode codepoints: U+006D, U+0416 and U+4E3D. With answers written in hexadecimal, manually encode the string into UTF-8 and UTF-16.
Any help at all will be greatly appreciated, as I am trying to get my head round this.
Comments (3)
Wow. On the one hand I'm thrilled to know that university courses are teaching to the reality that character encodings are hard work, but actually knowing the UTF-8 encoding rules sounds like expecting a lot. (Will it help students pass the Turkey test?)
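The clearest description I've seen so far of the rules for encoding UCS codepoints to UTF-8 is from the utf-8(7) manpage on many Linux systems:
0x00000000 - 0x0000007F: 0xxxxxxx
0x00000080 - 0x000007FF: 110xxxxx 10xxxxxx
0x00000800 - 0x0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
0x00010000 - 0x001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
It might be easier to remember a 'compressed' version of the chart: initial bytes of mangled codepoints start with a 1, followed by padding 1+0 (one or more 1 bits, then a 0). Subsequent bytes start with 10.
You can derive the ranges by taking note of how much space you can fill with the bits allowed in the new representation (for example, a two-byte sequence has 5 + 6 = 11 payload bits, so it covers codepoints up to 0x7FF).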
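The range derivation can be sketched in a few lines of Python. The names `LEAD_BITS` and `max_codepoint` are made up for this illustration; the sketch only counts payload bits and does not account for the Unicode limit of U+10FFFF.

```python
# Payload bits in the lead byte for each sequence length; every
# continuation byte (10xxxxxx) contributes 6 more payload bits.
LEAD_BITS = {1: 7, 2: 5, 3: 4, 4: 3}


def max_codepoint(n_bytes: int) -> int:
    """Largest value that fits in the bit pattern of an n-byte UTF-8 sequence."""
    payload_bits = LEAD_BITS[n_bytes] + 6 * (n_bytes - 1)
    return (1 << payload_bits) - 1


for n in range(1, 5):
    print(n, hex(max_codepoint(n)))
```

Each upper bound is one less than the start of the next range in the chart.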
I know I could remember the rules to derive the chart easier than the chart itself. Here's hoping you're good at remembering rules too. :)
Update
Once you have built the chart above, you can convert input Unicode codepoints to UTF-8 by finding their range, converting from hexadecimal to binary, inserting the bits according to the rules above, then converting back to hex. Take 0x4E3E as an example.
This fits in the 0x00000800 - 0x0000FFFF range (0x4E3E < 0xFFFF), so the representation will be of the form:
1110xxxx 10xxxxxx 10xxxxxx
0x4E3E is 100111000111110b. Drop the bits into the x above (start from the right, we'll fill in missing bits at the start with 0):
1110x100 10111000 10111110
There is an x spot left over at the start, fill it in with 0:
11100100 10111000 10111110
Convert from bits to hex:
0xE4 0xB8 0xBE
The U+4E3D from the question differs only in the final six bits and comes out as 0xE4 0xB8 0xBD.
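The same bit-dropping steps can be mimicked with string operations. This is only a sketch: `fill_template` is a hypothetical helper, and it assumes you have already picked the right template for the codepoint's range.

```python
def fill_template(codepoint: int, template: str) -> str:
    """Fill the x slots of a UTF-8 bit template, zero-padding on the left."""
    slots = template.count("x")
    bits = format(codepoint, "b").zfill(slots)  # pad missing leading bits with 0
    if len(bits) > slots:
        raise ValueError("codepoint does not fit in this template")
    remaining = iter(bits)
    return "".join(next(remaining) if ch == "x" else ch for ch in template)


filled = fill_template(0x4E3E, "1110xxxx 10xxxxxx 10xxxxxx")
hex_bytes = " ".join(format(int(group, 2), "02X") for group in filled.split())
print(filled)     # 11100100 10111000 10111110
print(hex_bytes)  # E4 B8 BE
```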
The descriptions on Wikipedia for UTF-8 and UTF-16 are good.
Procedures for your example string:
UTF-8
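UTF-8 uses up to 4 bytes to represent Unicode codepoints. For the 1-byte case, use the following pattern:
1-byte UTF-8 = 0xxxxxxx (7 data bits, U+0000 to U+007F)
The initial byte of a 2-, 3- or 4-byte UTF-8 sequence starts with 2, 3 or 4 one bits, followed by a zero bit. Follow-on bytes always start with the two-bit pattern 10, leaving 6 bits for data:
2-byte UTF-8 = 110xxxxx 10xxxxxx (5+6 = 11 data bits, U+0080 to U+07FF)
3-byte UTF-8 = 1110xxxx 10xxxxxx 10xxxxxx (4+6+6 = 16 data bits, U+0800 to U+FFFF)
4-byte UTF-8 = 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (3+6+6+6 = 21 data bits, U+10000 to U+10FFFF)
Your codepoints are U+006D, U+0416 and U+4E3D, requiring 1-, 2- and 3-byte UTF-8 sequences, respectively. Convert to binary and assign the bits:
U+006D = 1101101 -> 01101101 = 6D
U+0416 = 10000 010110 -> 11010000 10010110 = D0 96
U+4E3D = 0100 111000 111101 -> 11100100 10111000 10111101 = E4 B8 BD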
Final byte sequence: 6D D0 96 E4 B8 BD
or, if nul-terminated strings are desired: 6D D0 96 E4 B8 BD 00
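For checking the bit assignment by machine, here is a minimal shift-and-mask sketch of the patterns above. `utf8_encode` is a name chosen for this example, and the function assumes a valid codepoint (it does not reject surrogates or values above U+10FFFF).

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one codepoint with the 1- to 4-byte UTF-8 bit patterns."""
    if cp < 0x80:       # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


encoded = b"".join(utf8_encode(cp) for cp in (0x006D, 0x0416, 0x4E3D))
print(" ".join(f"{b:02X}" for b in encoded))  # 6D D0 96 E4 B8 BD
```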
UTF-16
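UTF-16 uses 2 or 4 bytes to represent Unicode codepoints. Algorithm:
U+0000 to U+D7FF and U+E000 to U+FFFF are encoded as a single 16-bit word equal to the codepoint.
U+D800 to U+DFFF are not valid codepoints; they are reserved for the two words of 4-byte UTF-16.
U+10000 to U+10FFFF are encoded in 4 bytes: subtract 10000hex from the codepoint, express the result as 20-bit binary, then place the upper and lower 10 bits into the pattern 110110xxxxxxxxxx 110111xxxxxxxxxx.
Using your codepoints:
U+006D = 006Dhex
U+0416 = 0416hex
U+4E3D = 4E3Dhex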
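A minimal Python sketch of the UTF-16 rule, producing 16-bit words rather than bytes (byte order is a separate step). The function name `utf16_units` is an assumption made for this example.

```python
from typing import List


def utf16_units(cp: int) -> List[int]:
    """Return the one or two 16-bit words that encode a codepoint in UTF-16."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable codepoint")
    if cp < 0x10000:
        return [cp]                      # 2-byte case: the word is the codepoint
    v = cp - 0x10000                     # 20-bit value
    return [0xD800 | (v >> 10),          # 110110 + upper 10 bits
            0xDC00 | (v & 0x3FF)]        # 110111 + lower 10 bits


units = [u for cp in (0x006D, 0x0416, 0x4E3D) for u in utf16_units(cp)]
print(" ".join(f"{u:04X}" for u in units))  # 006D 0416 4E3D
```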
Now, we have one more issue. Some machines store the two bytes of a 16-bit word least significant byte first (so-called little-endian machines) and some store most significant byte first (big-endian machines). UTF-16 uses the codepoint U+FEFF (called the byte order mark or BOM) to help a machine determine if a byte stream contains big- or little-endian UTF-16:
big-endian = FE FF 00 6D 04 16 4E 3D
little-endian = FF FE 6D 00 16 04 3D 4E
With nul-termination, U+0000 = 0000hex:
big-endian = FE FF 00 6D 04 16 4E 3D 00 00
little-endian = FF FE 6D 00 16 04 3D 4E 00 00
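To see how byte order changes the stream, this sketch serialises the 16-bit words with a leading BOM in both orders. `utf16_bytes` is a hypothetical helper; the word list is the one worked out for the question's string.

```python
def utf16_bytes(units, byteorder: str, bom: bool = True) -> bytes:
    """Serialise 16-bit words as bytes, optionally prefixed with U+FEFF."""
    words = ([0xFEFF] if bom else []) + list(units)
    return b"".join(w.to_bytes(2, byteorder) for w in words)


units = [0x006D, 0x0416, 0x4E3D]
big = utf16_bytes(units, "big")
little = utf16_bytes(units, "little")
print("big-endian   ", " ".join(f"{b:02X}" for b in big))
print("little-endian", " ".join(f"{b:02X}" for b in little))
```

A reader that sees FE FF first knows the stream is big-endian; FF FE means little-endian.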
Since your instructor didn't give a codepoint that required 4-byte UTF-16, here's one example:
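As a step-by-step sketch of the 4-byte case, the snippet below walks through one codepoint. U+1D11E is chosen here purely for illustration and is not taken from the question.

```python
cp = 0x1D11E                         # illustrative codepoint (an assumption)
offset = cp - 0x10000                # subtract 10000hex -> 0x0D11E
bits = format(offset, "020b")        # express as 20-bit binary
high = int("110110" + bits[:10], 2)  # first word: 110110 + upper 10 bits
low = int("110111" + bits[10:], 2)   # second word: 110111 + lower 10 bits

print(bits[:10], bits[10:])          # 0000110100 0100011110
print(f"{high:04X} {low:04X}")       # D834 DD1E
```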
The following program will do the necessary work. It may not be "manual" enough for your purposes, but at a minimum you can check your work.
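One way to run such a check is with Python's built-in codecs. This is a minimal sketch and need not match the program this answer refers to; `hexdump` is a helper name made up here. Note that the `utf-16-be` and `utf-16-le` codecs do not emit a BOM.

```python
text = "m\u0416\u4e3d"  # the string from the question: U+006D, U+0416, U+4E3D


def hexdump(data: bytes) -> str:
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


for codec in ("utf-8", "utf-16-be", "utf-16-le"):
    print(f"{codec:10}", hexdump(text.encode(codec)))
```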