A question about the behavior of "utf-8"

Posted on 2024-10-20 02:09:29

#!/usr/bin/env perl
use warnings;
use 5.012;
use Encode qw(encode);

no warnings qw(utf8);

my $c = "\x{ffff}";

my $utf_8 = encode( 'utf-8', $c );
my $utf8 = encode( 'utf8', $c );

say "utf-8 :  @{[ unpack '(B8)*', $utf_8 ]}";
say "utf8  :  @{[ unpack '(B8)*', $utf8 ]}";

# utf-8 :  11101111 10111111 10111101
# utf8  :  11101111 10111111 10111111

Does "utf-8" encode this way in order to automatically fix my code point to the last interchangeable code point (of the first plane)?

Comments (1)

仅此而已 2024-10-27 02:09:29

See the UTF-8 vs. utf8 vs. UTF8 section of the Encode docs.

To summarize, Perl has two different UTF-8 encodings. Its native encoding is called utf8, and basically allows any codepoint, regardless of what the Unicode standard says about that codepoint.

The other encoding is called utf-8 (a.k.a. utf-8-strict). It accepts only code points that the Unicode standard considers valid for interchange.
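
To see which of the two encodings a given name resolves to, you can ask Encode for the canonical name. This is only a small sketch; the list of spellings and the `%canonical` hash are chosen here for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(find_encoding);

# Map a few spellings to the canonical name Encode resolves them to.
# The spellings tried here are just illustrative.
my %canonical;
for my $name (qw(utf-8 UTF-8 utf_8 utf8 UTF8)) {
    my $enc = find_encoding($name)
        or die "unknown encoding: $name";
    $canonical{$name} = $enc->name;
    printf "%-6s => %s\n", $name, $enc->name;
}
```

The hyphenated (or underscored) spellings should report utf-8-strict, and the unhyphenated ones should report utf8.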

\x{FFFF} is a noncharacter according to Unicode, so the strict encoding in your output rejects it. But Perl's utf8 encoding doesn't care about that.
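
As a rough check, you can build the three-byte UTF-8 bit pattern by hand and compare it with what the lax utf8 encoding produces. The helper `three_byte_pattern` is a hypothetical name made up for this sketch, and it only handles the three-byte range U+0800..U+FFFF.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(encode);

no warnings qw(utf8);    # U+FFFF is a noncharacter

# Hypothetical helper: UTF-8 bit pattern for a code point in U+0800..U+FFFF,
# using the layout 1110xxxx 10xxxxxx 10xxxxxx.
sub three_byte_pattern {
    my ($cp) = @_;
    die "outside the three-byte range" if $cp < 0x800 || $cp > 0xFFFF;
    my @bytes = (
        0xE0 | ( $cp >> 12 ),             # top 4 bits
        0x80 | ( ( $cp >> 6 ) & 0x3F ),   # middle 6 bits
        0x80 | ( $cp & 0x3F ),            # low 6 bits
    );
    return join ' ', map { sprintf '%08b', $_ } @bytes;
}

my $by_hand = three_byte_pattern(0xFFFF);
my $lax     = join ' ', unpack '(B8)*', encode( 'utf8', chr(0xFFFF) );
my $fffd    = three_byte_pattern(0xFFFD);

print "U+FFFF by hand : $by_hand\n";
print "utf8 (lax)     : $lax\n";
print "U+FFFD by hand : $fffd\n";
```

The first two lines should match the "utf8" line in the question, and the last one should match the "utf-8" line, i.e. the bytes of U+FFFD.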

By default, the encode function replaces any character that does not exist in the destination charset with a substitution character (see the Handling Malformed Data section). For utf-8, that substitution character is U+FFFD (REPLACEMENT CHARACTER), which is encoded in UTF-8 as 11101111 10111111 10111101 (binary).
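
If you would rather get an error than a silent substitution, you can pass a CHECK argument to encode. The sketch below uses a lone surrogate (U+D800) as the test character; that choice is an assumption made for illustration, picked because strict UTF-8 rejects surrogates.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(encode);

no warnings qw(utf8);    # silence warnings about the surrogate below

# Illustrative test character (an assumption, not from the question):
# a lone surrogate, which the strict encoding does not accept.
my $c = chr(0xD800);

# Default CHECK: the offending character is replaced by U+FFFD.
my $substituted = encode( 'UTF-8', $c );
my $hex = join ' ', map { sprintf '%02X', $_ } unpack 'C*', $substituted;
print "default  : $hex\n";

# CHECK = FB_CROAK: encode dies instead of substituting.
# encode may modify its input when CHECK is set, so work on a copy.
my $copy = $c;
my $ok   = eval { encode( 'UTF-8', $copy, Encode::FB_CROAK() ); 1 };
my $err  = $ok ? '' : $@;
print "FB_CROAK : ", ( $ok ? "encoded without error\n" : "died: $err" );
```

With the default you should see EF BF BD, the same bytes as in the question's "utf-8" line. With FB_CROAK the call should die, so you can catch the problem yourself.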
