Question about the behavior of "utf-8"
#!/usr/bin/env perl
use warnings;
use 5.012;
use Encode qw(encode);
no warnings qw(utf8);
my $c = "\x{ffff}";
my $utf_8 = encode( 'utf-8', $c );
my $utf8 = encode( 'utf8', $c );
say "utf-8 : @{[ unpack '(B8)*', $utf_8 ]}";
say "utf8 : @{[ unpack '(B8)*', $utf8 ]}";
# utf-8 : 11101111 10111111 10111101
# utf8 : 11101111 10111111 10111111
Does "utf-8" encode this way in order to automatically fix my code point to the last interchangeable code point (of the first plane)?
See the "UTF-8 vs. utf8 vs. UTF8" section of the Encode docs.

To summarize, Perl has two different UTF-8 encodings. Its native encoding is called "utf8", and basically allows any code point, regardless of what the Unicode standard says about that code point.

The other encoding is called "utf-8" (a.k.a. "utf-8-strict"). This allows only code points that the Unicode standard considers valid for interchange.

\x{FFFF} is a noncharacter according to Unicode, so it is not valid for interchange. But Perl's "utf8" encoding doesn't care about that.

By default, the encode function replaces any character that does not exist in the destination charset with a substitution character (see the "Handling Malformed Data" section). For "utf-8", that substitution character is U+FFFD (REPLACEMENT CHARACTER), which is encoded in UTF-8 as 11101111 10111111 10111101 (binary). That is exactly the byte sequence in your "utf-8" output, so the code point is being replaced, not adjusted to a neighbouring one.
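If silent substitution is not what you want, a minimal sketch along the following lines may help. It assumes an Encode version that behaves like the one in the question (strict "UTF-8" rejecting U+FFFF); other versions may accept the code point, so the script reports whichever happens rather than assuming a result. The helper name hex_bytes is made up for illustration. It uses find_encoding to show which encoding each name resolves to, and passes Encode::FB_CROAK as the CHECK argument so that encode dies instead of substituting U+FFFD.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use 5.012;
use Encode qw(encode find_encoding);
no warnings qw(utf8);

# hex_bytes is a hypothetical helper: render a byte string as hex pairs.
sub hex_bytes { return uc join ' ', unpack '(H2)*', $_[0] }

# Canonical names behind the two spellings.
say 'UTF-8 resolves to: ', find_encoding('UTF-8')->name;
say 'utf8  resolves to: ', find_encoding('utf8')->name;

my $c = "\x{FFFF}";

# Lax encoding: U+FFFF is encoded as-is.
my $lax = encode( 'utf8', $c );
say 'utf8  : ', hex_bytes($lax);

# Strict encoding with FB_CROAK: die instead of substituting U+FFFD.
# LEAVE_SRC keeps encode() from modifying $c in place.
my $strict = eval {
    encode( 'UTF-8', $c, Encode::FB_CROAK() | Encode::LEAVE_SRC() );
};
if ( defined $strict ) {
    say 'UTF-8 : ', hex_bytes($strict), ' (accepted by this Encode version)';
}
else {
    chomp( my $err = $@ );
    say "UTF-8 : rejected ($err)";
}
```

Wrapping the strict call in eval lets the caller decide what to do with a rejected code point, instead of discovering U+FFFD in the output later.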