Manually converting Unicode codepoints to UTF-8 and UTF-16

Posted 2024-11-14 07:15:08

I have a university programming exam coming up, and one section is on Unicode.

I have looked everywhere for answers to this, and my lecturer has been no help, so this is a last resort in the hope that you can help.

The question will be something like:

The string 'mЖ丽' has the Unicode codepoints U+006D, U+0416 and
U+4E3D. Writing your answers in hexadecimal, manually encode the
string into UTF-8 and UTF-16.

Any help at all will be greatly appreciated as I am trying to get my head round this.

3 Answers

oО清风挽发oО 2024-11-21 07:15:08

Wow. On the one hand I'm thrilled to know that university courses are teaching to the reality that character encodings are hard work, but actually knowing the UTF-8 encoding rules sounds like expecting a lot. (Will it help students pass the Turkey test?)

The clearest description I've seen so far of the rules for encoding UCS codepoints to UTF-8 is the one in the utf-8(7) manpage on many Linux systems:

Encoding
   The following byte sequences are used to represent a
   character.  The sequence to be used depends on the UCS code
   number of the character:

   0x00000000 - 0x0000007F:
       0xxxxxxx

   0x00000080 - 0x000007FF:
       110xxxxx 10xxxxxx

   0x00000800 - 0x0000FFFF:
       1110xxxx 10xxxxxx 10xxxxxx

   0x00010000 - 0x001FFFFF:
       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

   [... removed obsolete five and six byte forms ...]

   The xxx bit positions are filled with the bits of the
   character code number in binary representation.  Only the
   shortest possible multibyte sequence which can represent the
   code number of the character can be used.

   The UCS code values 0xd800–0xdfff (UTF-16 surrogates) as well
   as 0xfffe and 0xffff (UCS noncharacters) should not appear in
   conforming UTF-8 streams.

It might be easier to remember a 'compressed' version of the chart:

The initial byte of a mangled (multi-byte) codepoint starts with a 1 for each byte in the sequence, followed by a padding 0. Subsequent bytes start with 10.

0x80      5 bits, one byte
0x800     4 bits, two bytes
0x10000   3 bits, three bytes

You can derive the ranges by taking note of how much space you can fill with the bits allowed in the new representation:

2**(5+1*6) == 2048       == 0x800
2**(4+2*6) == 65536      == 0x10000
2**(3+3*6) == 2097152    == 0x200000
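
If you want to sanity-check that arithmetic, a quick throwaway Perl snippet along these lines (just a sketch) prints the same three thresholds:

#!/usr/bin/perl
use strict;
use warnings;

# Data bits in the lead byte plus 6 data bits per continuation byte.
printf "2-byte limit: %d (0x%X)\n", 2**(5 + 1*6), 2**(5 + 1*6);   # 2048    == 0x800
printf "3-byte limit: %d (0x%X)\n", 2**(4 + 2*6), 2**(4 + 2*6);   # 65536   == 0x10000
printf "4-byte limit: %d (0x%X)\n", 2**(3 + 3*6), 2**(3 + 3*6);   # 2097152 == 0x200000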

I know I could remember the rules to derive the chart more easily than the chart itself. Here's hoping you're good at remembering rules too. :)

Update

Once you have built the chart above, you can convert input Unicode codepoints to UTF-8 by finding their range, converting from hexadecimal to binary, inserting the bits according to the rules above, then converting back to hex:

U+4E3E

This fits in the 0x00000800 - 0x0000FFFF range (0x0800 <= 0x4E3E <= 0xFFFF), so the representation will be of the form:

   1110xxxx 10xxxxxx 10xxxxxx

0x4E3E is 100111000111110 in binary. Drop the bits into the x slots above (start from the right; we'll fill in the missing bits at the start with 0):

   1110x100 10111000 10111110

There is an x spot left over at the start; fill it in with 0:

   11100100 10111000 10111110

Convert from bits to hex:

   0xE4 0xB8 0xBE
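
If you want to check a conversion like this mechanically, here is a small Perl sketch that applies the same chart with shifts and masks; codepoint_to_utf8 is just an illustrative name, not a standard function:

#!/usr/bin/perl
use strict;
use warnings;

# Manual UTF-8 encoding following the chart above; returns the bytes as hex.
sub codepoint_to_utf8 {
    my ($cp) = @_;
    return sprintf "%02X", $cp if $cp < 0x80;              # 0xxxxxxx
    if ($cp < 0x800) {                                      # 110xxxxx 10xxxxxx
        return sprintf "%02X %02X",
            0xC0 | ($cp >> 6), 0x80 | ($cp & 0x3F);
    }
    if ($cp < 0x10000) {                                    # 1110xxxx 10xxxxxx 10xxxxxx
        return sprintf "%02X %02X %02X",
            0xE0 | ($cp >> 12), 0x80 | (($cp >> 6) & 0x3F), 0x80 | ($cp & 0x3F);
    }
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return sprintf "%02X %02X %02X %02X",
        0xF0 | ($cp >> 18),         0x80 | (($cp >> 12) & 0x3F),
        0x80 | (($cp >> 6) & 0x3F), 0x80 | ($cp & 0x3F);
}

print codepoint_to_utf8(0x4E3E), "\n";   # prints "E4 B8 BE", matching the result above
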
溺ぐ爱和你が 2024-11-21 07:15:08

The descriptions on Wikipedia for UTF-8 and UTF-16 are good:

Procedures for your example string:

UTF-8

UTF-8 uses up to 4 bytes to represent Unicode codepoints. For the 1-byte case, use the following pattern:

1-byte UTF-8 = 0xxxxxxx (binary) = 7 bits = codepoints 0x00-0x7F

The initial byte of 2-, 3- and 4-byte UTF-8 starts with 2, 3 or 4 one bits, followed by a zero bit. Follow-on bytes always start with the two-bit pattern 10, leaving 6 bits for data:

2-byte UTF-8 = 110xxxxx 10xxxxxx (binary) = 5+6 (11) bits = codepoints 0x80-0x7FF
3-byte UTF-8 = 1110xxxx 10xxxxxx 10xxxxxx (binary) = 4+6+6 (16) bits = codepoints 0x800-0xFFFF
4-byte UTF-8 = 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (binary) = 3+6+6+6 (21) bits = codepoints 0x10000-0x10FFFF

Unicode codepoints are undefined beyond 0x10FFFF.

Your codepoints are U+006D, U+0416 and U+4E3D, requiring 1-, 2- and 3-byte UTF-8 sequences, respectively. Convert to binary and assign the bits:

U+006D = 1101101 (binary) = 01101101 (binary) = 0x6D
U+0416 = 10000 010110 (binary) = 11010000 10010110 (binary) = 0xD0 0x96
U+4E3D = 0100 111000 111101 (binary) = 11100100 10111000 10111101 (binary) = 0xE4 0xB8 0xBD

Final byte sequence:

6D D0 96 E4 B8 BD

or if nul-terminated strings are desired:

6D D0 96 E4 B8 BD 00

UTF-16

UTF-16 uses 2 or 4 bytes to represent Unicode codepoints. Algorithm:

U+0000 to U+D7FF use the 2-byte values 0x0000 to 0xD7FF
U+D800 to U+DFFF are invalid codepoints reserved for 4-byte UTF-16
U+E000 to U+FFFF use the 2-byte values 0xE000 to 0xFFFF

U+10000 to U+10FFFF use 4-byte UTF-16, encoded as follows:

  1. Subtract 0x10000 from the codepoint.
  2. Express the result as 20-bit binary.
  3. Use the pattern 110110xxxxxxxxxx 110111xxxxxxxxxx (binary) to encode the upper and lower 10 bits into two 16-bit words.

Using your codepoints:

U+006D = 0x006D
U+0416 = 0x0416
U+4E3D = 0x4E3D

Now, we have one more issue. Some machines store the two bytes of a 16-bit word least significant byte first (so-called little-endian machines) and some store most significant byte first (big-endian machines). UTF-16 uses the codepoint U+FEFF (called the byte order mark or BOM) to help a machine determine if a byte stream contains big- or little-endian UTF-16:

big-endian = FE FF 00 6D 04 16 4E 3D
little-endian = FF FE 6D 00 16 04 3D 4E

With nul-termination, U+0000 = 0x0000:

big-endian = FE FF 00 6D 04 16 4E 3D 00 00
little-endian = FF FE 6D 00 16 04 3D 4E 00 00
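
To see that byte-order difference in code, a rough Perl sketch using pack's 16-bit big-endian ("n*") and little-endian ("v*") templates reproduces the two non-terminated sequences above:

#!/usr/bin/perl
use strict;
use warnings;

# BOM first, then the three 16-bit code units for "mЖ丽".
my @units = (0xFEFF, 0x006D, 0x0416, 0x4E3D);

my $be = pack "n*", @units;   # 16-bit units, big-endian
my $le = pack "v*", @units;   # 16-bit units, little-endian

printf "big-endian    = %s\n", join " ", map { sprintf "%02X", $_ } unpack "C*", $be;
printf "little-endian = %s\n", join " ", map { sprintf "%02X", $_ } unpack "C*", $le;
# big-endian    = FE FF 00 6D 04 16 4E 3D
# little-endian = FF FE 6D 00 16 04 3D 4E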

Since your instructor didn't give a codepoint that required 4-byte UTF-16, here's one example:

U+1F031 = 0x1F031 - 0x10000 = 0xF031 = 0000111100 0000110001 (binary) =
1101100000111100 1101110000110001 (binary) = 0xD83C 0xDC31
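
To check a surrogate-pair calculation like this one, here is a small Perl sketch that follows the same steps; codepoint_to_utf16_units is just an illustrative name:

#!/usr/bin/perl
use strict;
use warnings;

# Manual UTF-16 encoding: codepoints up to U+FFFF map to one 16-bit unit
# (assuming no lone surrogates), anything above becomes a surrogate pair.
sub codepoint_to_utf16_units {
    my ($cp) = @_;
    return sprintf "%04X", $cp if $cp <= 0xFFFF;
    my $v    = $cp - 0x10000;            # step 1: subtract 0x10000
    my $high = 0xD800 | ($v >> 10);      # step 3: upper 10 bits -> 110110xxxxxxxxxx
    my $low  = 0xDC00 | ($v & 0x3FF);    #         lower 10 bits -> 110111xxxxxxxxxx
    return sprintf "%04X %04X", $high, $low;
}

print codepoint_to_utf16_units($_), "\n" for 0x006D, 0x0416, 0x4E3D, 0x1F031;
# prints 006D, 0416, 4E3D and "D83C DC31" for the 4-byte example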

撩人痒 2024-11-21 07:15:08


The following program will do the necessary work. It may not be "manual" enough for your purposes, but at a minimum you can check your work.

#!/usr/bin/perl

use 5.012;
use strict;
use utf8;
use autodie;
use warnings;
use warnings    qw< FATAL utf8 >;
no warnings     qw< uninitialized >;
use open        qw< :std :utf8 >;
use charnames   qw< :full >;
use feature     qw< unicode_strings >;

use Encode              qw< encode decode >;
use Unicode::Normalize  qw< NFD NFC >;

my ($x) = "mЖ丽";

open(U8,">:encoding(utf8)","/tmp/utf8-out");
print U8 $x;
close(U8);
open(U16,">:encoding(utf16)","/tmp/utf16-out");
print U16 $x;
close(U16);
system("od -t x1 /tmp/utf8-out");
my $u8 = encode("utf-8",$x);
print "utf-8: 0x".unpack("H*",$u8)."\n";

system("od -t x1 /tmp/utf16-out");
my $u16 = encode("utf-16",$x);
print "utf-16: 0x".unpack("H*",$u16)."\n";