我怎样才能捕获“Unicode 非字符”警告？

发布于 2024-10-19 11:51:28 字数 415 浏览 9 评论 0原文

我怎样才能捕捉到“Unicode非字符0xffff对于交换是非法的”警告？

#!/usr/bin/env perl
use warnings;
use 5.012;
use Try::Tiny;

use warnings FATAL => qw(all);

my $character;

try {
    $character = "\x{ffff}";
} catch {
    die "---------- caught error ----------\n";
};

say "something";

输出：

# Unicode non-character 0xffff is illegal for interchange at ./perl1.pl line 11.

原文

How could I catch the "Unicode non-character 0xffff is illegal for interchange"-warning?

#!/usr/bin/env perl
use warnings;
use 5.012;
use Try::Tiny;

use warnings FATAL => qw(all);

my $character;

try {
    $character = "\x{ffff}";
} catch {
    die "---------- caught error ----------\n";
};

say "something";

Output:

# Unicode non-character 0xffff is illegal for interchange at ./perl1.pl line 11.

分享到QQ

分享到微博

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

需要登录才能够评论，你可以免费注册一个本站的账号。

﹉夏雨初晴づ 2024-10-26 11:51:28

一个Perl 5.10.0 ⋯ 5.13.8 Bug

我假设您实际上并不想“捕获”此警告，而是想生存或忽略它。如果你真的想抓住它，那么，可能有更简单的方法。

但首先要知道的是，不存在非法代码点，只有不可互换的代码点。

您只需在需要使用完整 Unicode 范围（或更多）的范围内使用无警告“utf8”即可。 无需为此使用eval。所需的只是范围内的警告抑制。即使对于较新的 perls 来说这是不必要的。

因此，不要这样

$char = chr(0xFFFE);

写：（在较旧的 perls 上）：

$char = do { no warnings "utf8"; chr(0xFFFE) };

这也是涉及此类字符的模式匹配的情况：

 $did_match = do { no warnings "utf8" ; $char =~ $char);

将导致警告或致命错误，具体取决于您的 perl 的旧版本，或者根本没有任何结果，具体取决于您的 perl 的新版本佩尔群岛

您可以仅在重要的版本上以这种方式禁用与 utf8 相关的警告：

no if $^V < 5.13.9, qw<warnings utf8>;

“在下一个版本中修复”

真正有趣的是他们（阅读：Perl5 Porters，特别是 Karl Williamson）已经修复了需要没有警告“utf8” 保护只是为了与任何代码点一起工作。您可能需要小心的只是输出。注意：

% perl5.10.0 -Mwarnings=FATAL,all -E 'my $char = chr(0xFFFE); say "Ok"'
Unicode character 0xfffe is illegal at -e line 1.

% perl5.11.3 -Mwarnings=FATAL,all -E 'my $char = chr(0xFFFE); say "Ok"'
Unicode non-character 0xfffe is illegal for interchange at -e line 1.

% perl5.12.0 -Mwarnings=FATAL,all -E 'my $char = chr(0xFFFE); say "Ok"'
Unicode non-character 0xfffe is illegal for interchange at -e line 1.

% perl5.12.3 -Mwarnings=FATAL,all -E 'my $char = chr(0xFFFE); say "Ok"'
Unicode non-character 0xfffe is illegal for interchange at -e line 1.

% perl5.13.0 -Mwarnings=FATAL,all -E 'my $char = chr(0xFFFE); say "Ok"'
Unicode non-character 0xfffe is illegal for interchange at -e line 1.

% perl5.13.8 -Mwarnings=FATAL,all -E 'my $char = chr(0xFFFE); say "Ok"'
Unicode non-character 0xfffe is illegal for interchange at -e line 1.

% perl5.13.9 -Mwarnings=FATAL,all -E 'my $char = chr(0xFFFE); say "Ok"'
Ok

% perl5.13.10 -Mwarnings=FATAL,all -E 'my $char = chr(0xFFFE); say "Ok"'
Ok

最安全的做法是在您需要的地方放置no warnings "utf8"。但不需要eval！

从 5.13.10 开始，因此在 5.14 中，utf8 警告分为三个子类别：UTF-16 的 surrogate、如下所述的 nonchar 和 non_unicode< /code> 用于超级，也在下面定义。

All-Perl 交换是安全的

不过，您可能不想抑制输出中的“非法交换”警告，因为这是事实。好吧，除非您使用 Perl 的 "utf8" 编码，这与它的 "UTF-8" 编码不同，这很奇怪。 "utf8" 编码比正式标准更宽松，因为它允许我们做比其他方式更有趣的事情。

但是，当且仅当您拥有 100% 纯 Perl 数据路径时，您仍然可以使用您想要的任何代码点，包括高达 ᴍᴀxɪɴᴛ 的非 unicode 代码点。这在 32 位机器上是 0x7FFF_FFFF，而在 64 位机器上则大得难以形容：0xFFFF_FFFF_FFFF_FFFF！这不仅仅是一个超级；而是一个超级。这是一个超巨型！

% perl -Mwarnings -CS -E 'my $a = chr(0xFFFF_FFFF); say $a ' | 
  perl -Mwarnings -CS -nlE 'say "got ord ", ord'
Code point 0xFFFFFFFF is not Unicode, may not be portable at -e line 1.
got ord 4294967295

% perl -Mwarnings -CS -E 'no warnings "utf8"; my $a = chr(0xFFFF_FFFF); say $a' |
 perl -Mwarnings -CS -nlE 'say "got ord ", ord'
got ord 4294967295

% perl -Mwarnings -CS -E 'no warnings "utf8"; my $a = chr(0xFFFF_FFFF_FFFF_FFFF); say $a' |
  perl -Mwarnings -CS -nlE 'say "got ord ", ord'
Hexadecimal number > 0xffffffff non-portable at -e line 1.
got ord 18446744073709551615

% perl -Mwarnings -CS -E 'no warnings qw[ utf8 portable ]; my $a = chr(0xFFFF_FFFF_FFFF_FFFF);  say $a ' |
  perl -Mwarnings -CS -nlE 'say "got ord ", ord'
got ord 18446744073709551615

请注意，在 32 位机器上，最后一个会产生以下内容：

Integer overflow in hexadecimal number at -e line 1.
got ord 4294967295

各种非法交换的非字符

有几种（实际上是相当多）不同类别的代码点对于交换是非法的。

任何满足 (ord(ᴄᴏᴅᴇᴘᴏɪɴᴛ) & 0xFFFE) == 0xFFFE 为真的代码点。这涵盖了所有可能平面中的最后两个代码点。由于它跨越 17 个平面，因此 Unicode 定义了 34 个这样的代码点。这些不是字符，尽管它们是 Unicode 代码点。我们将这些称为“Penults”。它们属于 5.13.10 或更高版本的 nonchar 警告类别。
从 U+FDD0 开始的 32 个代码点。这些保证是非字符，尽管它们当然仍然是 Unicode 代码点。与之前的 Penult 集一样，这些也属于 5.13.10 或更高版本的 nonchar 警告类别。
1024 个高代理项和 1024 个低代理项，它们是作为斜率而被雕刻出来的，以使 UTF-16 对于所有尝试 UCS-2 而不是 UTF-8 或 UTF-32 的愚蠢系统成为可能。这削弱了有效 Unicode 代码点的范围，将它们限制为仅前 21 位。 代理仍然是代码点。它们只是无法互换，因为它们不能总是由聪明的 UTF-16 正确表示。在 5.13.10 或更高版本中，这些由 surrogate 警告子类控制。
除此之外，我们现在已经超出了 Unicode 范围。我将这些称为“超级”。在 32 位计算机上，除了 Unicode 提供的标准 21 位之外，您仍然拥有（10 或）11 位。 Perl 可以很好地使用这些。这给出了您可以在 Perl 程序中使用的总共 2**32 个代码点（好吧，或者至少 2**31，由于有符号溢出）。您将获得一百万个 Unicode 代码点，但随后您将获得超出 Perl 中可以使用的数十亿个超级代码点。如果您运行的是 5.13.10 或更高版本，则可以通过 non_unicode warnings 子类控制对这些内容的访问。
即使在超级范围内，Perl 仍然遵循有关 Penult 的规则。在 32 位机器上有 480 个这样的 Superpenult，而在 64 位机器上则更多。
如果您真的想要以不可移植的方式播放它，那么如果您有本机 64 位整数，那么您还有比 supers 提供的高出 32 或 33 位的位。您现在有 18 万亿、446 万亿、744 万亿、730 亿、7.09 亿、55.1 万和 616 个字符。您拥有一整艾字节不同的代码点！这远远超出了超级的范围，我将称它们为Hypermegas。好的，所以这些不是很便携，因为它们需要真正的 64 位平台。它们有点陌生，所以也许我们应该写成Ὑπέρμεγας来吓跑人们。 :) 请注意，禁止倒数的规则仍然适用于 hypermegas。

测试程序

我编写了一个小程序来证明这些代码点很酷。

testing Penults             passed all 34 codepoints
testing Super_penults       passed all 480 codepoints
testing Noncharacters       passed all 32 codepoints
testing Low_surrogates      passed all 1024 codepoints
testing High_surrogates     passed all 1024 codepoints
testing Supers              passed all 8 codepoints
testing Ὑπέρμεγας            passed all 10 codepoints

注意：上面的最后一行显示了 SO 的地狱突出显示代码中的另一个愚蠢的错误。注意到最后一个 WɪᴋɪWᴏʀᴅ 了，\p{Greek} 被排除在着色方案之外了吗？这意味着他们只寻找大写的 ASCII 标识符。 已经过时了！如果您不打算正确使用 \p{Uppercase} 之类的内容，为什么还要费心接受 ᴜɴɪᴄᴏᴅᴇ 呢？正如您将在我的程序中看到的那样，我有一个 @ὑπέρμεγας 数组，我们 ᴍᴏᴅᴇʀɴ ᴘʀᴏɢʀᴀᴍᴍɪɴɢ ʟᴀɴɢᴜᴀɢᴇs 可以很好地处理这个问题。 ☺

显然我没有运行所有的超级或超级。在 32 位机器上，您只能获得 4 个经过测试的 hyper。我也没有测试任何超级惩罚。

这是测试程序，它可以在 5.10 及更高版本的所有版本上正常运行。

#!/usr/bin/env perl
#
# hypertest - show how to safely use code points not legal for interchange in Perl
# 
# Tom Christiansen
# [email protected]
# Sat Feb 26 16:38:44 MST 2011

use utf8;
use 5.10.0;
use strict;
use if $] > 5.010, "autodie";
use warnings FATAL => "all";

use Carp;

binmode(STDOUT, ":utf8");
END { close STDOUT }

$\ = "\n";

sub ghex(_);

my @penults = map { 
    (0x01_0000 * $_) + 0xfffE, 
    (0x01_0000 * $_) + 0xfffF, 
} 0x00 .. 0x10;

my @super_penults = map { 
    (0x01_0000 * $_) + 0xfffE, 
    (0x01_0000 * $_) + 0xfffF, 
} 0x10 .. 0xFF;

my @low_surrogates  = map { 0xDC00 + $_ } 0x000 .. 0x3FF;
my @high_surrogates = map { 0xD800 + $_ } 0x000 .. 0x3FF;

my @noncharacters = map { 0xFDD0 + $_ } 0x00 .. 0x1F;

my @supers = ( 
    0x0011_0000,  0x0100_0000,  0x1000_0000,  0x1F00_0000,  
    0x1FFF_FFFF,  0x3FFF_FFFF,  0x7FFF_FFFF,  0x7FFF_FFFF,  
);

# these should always work anywhere 
my @ὑπέρμεγας = ( 
    0x8000_0000,   0xF000_0000,   
    0x3FFF_FFFF,   0xFFFF_FFFF,  
);

####
# now we go fishing for 64-bit ὑπέρμεγας
####

eval q{
    use warnings FATAL => "overflow";
    no  warnings "portable";
    push @ὑπέρμεγας => ( 
        0x01_0000_0000, 
        0x01_FFFF_FF00,
    );
};
eval q{
    use warnings FATAL => "overflow";
    no  warnings "portable";
    push @ὑπέρμεγας => (
        0x0001_0000_0000_0000,
        0x001F_0000_0000_0000,
        0x7FFF_FFFF_FFFF_FFFF,
        0xFFFF_FFFF_FFFF_FFFF,
    );
};

# more than 64??
eval q{
    use warnings FATAL => "overflow";
    no  warnings "portable";
    push @ὑπέρμεγας => (
        0x01_0001_0000_0000_0000,
        0x01_7FFF_FFFF_FFFF_FFFF,
        0x01_FFFF_FFFF_FFFF_FFFF,
    );
    1;
};


my @testpairs = (
    penults         => \@penults,
    super_penults   => \@super_penults,
    noncharacters   => \@noncharacters ,
    low_surrogates  => \@low_surrogates,
    high_surrogates => \@high_surrogates,
    supers          => \@supers,
    ὑπέρμεγας       => \@ὑπέρμεγας,   
);

while (my($name, $aref) = splice(@testpairs, 0, 2)) {
    printf "testing %-20s", ucfirst $name;

    my(@passed, @failed);

    for my $codepoint (@$aref) {

        use warnings FATAL => "all";

        my $char = do {
            # next line not needed under 5.13.9 or better: HURRAY!
            no warnings "utf8";
            chr(0xFFFF) && chr($codepoint);
        };

        my $regex_ok = do {
            # next line not needed under 5.13.9 or better: HURRAY!
            no warnings "utf8";
            $char =~ $char;
            1;
        };

        my $status = defined($char) && $regex_ok;

        push @{ $status ? \@passed : \@failed }, $codepoint;
    }

    my $total  = @$aref;
    my $passed = @passed;
    my $failed = @failed;

    given($total) {
        when ($passed)  { print "passed all $total codepoints" }
        when ($failed)  { print "failed all $total codepoints" }
        default         {
            print "of $total codepoints, failed $failed and passed $passed";
            my $flist = join(", ", map { ghex } @failed);
            my $plist = join(", ", map { ghex } @passed);
            print "\tpassed: $plist";
            print "\tfailed: $flist";
        }
    }

}

sub ghex(_) {
    my $num = shift();
    my $hex = sprintf("%X", $num);
    return $hex if length($hex) < 5;
    my $flip = reverse $hex;
    $flip =~ s<
        ( \p{ahex} \p{ahex} \p{ahex} \p{ahex} )
        (?= \p{ahex} )
        (?! \p{ahex}* \. )
    ><${1}_>gx;
    return "0x" . reverse($flip);
}

A Perl 5.10.0 ⋯ 5.13.8 Bug

I’m going to assume that you don’t actually want to “catch” this warning, but rather to survive or ignore it. If you really want to catch it, well, there may be easier ways to do that.

But the first thing to know is that there is no such thing as an illegal code point, only code points not valid for interchange.

You just have to use a no warnings "utf8" for the scope of where you need to use the full Unicode range (or more). There is no need to use an eval for this. All it takes is a scoped warning suppression. Even that it is unnecessary on newer perls.

So instead of this:

$char = chr(0xFFFE);

write (on older perls):

$char = do { no warnings "utf8"; chr(0xFFFE) };

This is also the situation with pattern matches involving such a character:

 $did_match = do { no warnings "utf8" ; $char =~ $char);

will cause a warning or a fatal, depending on how old your perl, or nothing at all, depending on how new your perl is.

You can disable utf8-related warnings only on releases where it matters this way:

no if $^V < 5.13.9, qw<warnings utf8>;

‘Fixed in the Next Release’

The really interesting thing is that they (read: Perl5 Porters, and in particular, Karl Williamson) have fixed the bug that requires a no warnings "utf8" guard just to work with any code point at all. It is only the output where you may have to be careful. Watch:

% perl5.10.0 -Mwarnings=FATAL,all -E 'my $char = chr(0xFFFE); say "Ok"'
Unicode character 0xfffe is illegal at -e line 1.

% perl5.11.3 -Mwarnings=FATAL,all -E 'my $char = chr(0xFFFE); say "Ok"'
Unicode non-character 0xfffe is illegal for interchange at -e line 1.

% perl5.12.0 -Mwarnings=FATAL,all -E 'my $char = chr(0xFFFE); say "Ok"'
Unicode non-character 0xfffe is illegal for interchange at -e line 1.

% perl5.12.3 -Mwarnings=FATAL,all -E 'my $char = chr(0xFFFE); say "Ok"'
Unicode non-character 0xfffe is illegal for interchange at -e line 1.

% perl5.13.0 -Mwarnings=FATAL,all -E 'my $char = chr(0xFFFE); say "Ok"'
Unicode non-character 0xfffe is illegal for interchange at -e line 1.

% perl5.13.8 -Mwarnings=FATAL,all -E 'my $char = chr(0xFFFE); say "Ok"'
Unicode non-character 0xfffe is illegal for interchange at -e line 1.

% perl5.13.9 -Mwarnings=FATAL,all -E 'my $char = chr(0xFFFE); say "Ok"'
Ok

% perl5.13.10 -Mwarnings=FATAL,all -E 'my $char = chr(0xFFFE); say "Ok"'
Ok

The safest thing to do is put no warnings "utf8" in just the places you need it. But there is no need of an eval!

As of 5.13.10, and hence in 5.14, there are three subcategories of utf8 warnings: surrogate for UTF‑16, nonchar as described below, and non_unicode for supers, also defined below.

An All‐Perl Interchange is Safe

You probably don’t want to suppress the “illegal for interchange” warnings on output, though, because this is true. Well, unless you’re using Perl’s "utf8" encoding, which isn’t the same as its "UTF‑8" encoding, oddly enough. The "utf8" encoding is laxer than the formal standard, because it allows us to do more interesting things than we otherwise could.

However, if and only if you have a 100% pure-perl datapath, you can still use any code point you want, including non-unicode code points up to ᴍᴀxɪɴᴛ. That’s 0x7FFF_FFFF on 32‑bit machines, and something unspeakably huge on 64‑bit machines: 0xFFFF_FFFF_FFFF_FFFF! That’s not just a super; it’s a hypermega!

% perl -Mwarnings -CS -E 'my $a = chr(0xFFFF_FFFF); say $a ' | 
  perl -Mwarnings -CS -nlE 'say "got ord ", ord'
Code point 0xFFFFFFFF is not Unicode, may not be portable at -e line 1.
got ord 4294967295

% perl -Mwarnings -CS -E 'no warnings "utf8"; my $a = chr(0xFFFF_FFFF); say $a' |
 perl -Mwarnings -CS -nlE 'say "got ord ", ord'
got ord 4294967295

% perl -Mwarnings -CS -E 'no warnings "utf8"; my $a = chr(0xFFFF_FFFF_FFFF_FFFF); say $a' |
  perl -Mwarnings -CS -nlE 'say "got ord ", ord'
Hexadecimal number > 0xffffffff non-portable at -e line 1.
got ord 18446744073709551615

% perl -Mwarnings -CS -E 'no warnings qw[ utf8 portable ]; my $a = chr(0xFFFF_FFFF_FFFF_FFFF);  say $a ' |
  perl -Mwarnings -CS -nlE 'say "got ord ", ord'
got ord 18446744073709551615

Note that on a 32‑bit machine, that last one produces this:

Integer overflow in hexadecimal number at -e line 1.
got ord 4294967295

Varieties of Noncharacters Illegal for Interchange

There are several — quite a few, actually — different classes of code points that are not legal for interchange.

Any code point such that (ord(ᴄᴏᴅᴇᴘᴏɪɴᴛ) & 0xFFFE) == 0xFFFE is true. This covers the last two code points in all possible planes. As it spans 17 planes, Unicode defines therefore 34 such code points. Those are not characters, although they are Unicode code points. Let’s call these the Penults. They fall under the nonchar warning class on 5.13.10 or better.
The 32 code points starting at U+FDD0. These are guaranteed to be Noncharacters, although of course they are still Unicode code points. Like the previous penult set, these too fall under the nonchar warning class on 5.13.10 or better.
The 1024 high surrogates and the 1024 low surrogates, which were carved out as slop to make UTF‑16 possible for all those dumb systems that tried UCS‑2 instead of UTF‑8 or UTF‑32. This cripples the range of valid Unicode code points, restricting them to only the first 21 bits worth. SURROGATES ARE STILL CODE POINTS. They just are not valid for interchange, because they cannot always be correctly represented by brain-dead-clever UTF‑16. Under 5.13.10 or better, these are controlled by the surrogate warning subclass.
Beyond that, we’re now above the Unicode range. I’ll call these Supers. On a 32‑bit machine, you still have (10 or) 11 bits of them beyond the standard 21 bits that Unicode gives you. Perl can use these just fine. That gives 2**32 total code points you can use in your Perl program (well, or 2**31 at least, due to signed overflow). You get a million Unicode code points, but then you get a couple of billion Super code points beyond those that you can use in Perl. If you are running 5.13.10 or better, you can control access to these via the non_unicode warnings subclass.
Perl still follows the rules about Penults even up in the Super range. There are 480 such Superpenults on a 32‑bit machine, and rather more of them on a 64‑bit one.
If you really want to play it nonportably, then if you have native 64‑bit ints, you have another 32 or 33 bits above what the supers give you. You now have 18 quintillion 446 quadrillion 744 trillion 73 billion 709 million 551 thousand and 616 characters. You have a whole exabyte of distinct code points! That’s far beyond super that I’m going to call them Hypermegas. Ok, so these aren’t very portable, since they require a truly 64‑bit platform. They’re a bit foreign, so maybe we should write that Ὑπέρμεγας to scare people away. :) Note that the rules against penults still apply to hypermegas.

The Test Program

I wrote a little program that proves that these code points are cool.

testing Penults             passed all 34 codepoints
testing Super_penults       passed all 480 codepoints
testing Noncharacters       passed all 32 codepoints
testing Low_surrogates      passed all 1024 codepoints
testing High_surrogates     passed all 1024 codepoints
testing Supers              passed all 8 codepoints
testing Ὑπέρμεγας            passed all 10 codepoints

NOTE: That last line above shows a Yet Another Stupid Bug in SO’s infernal highlighting code. Notice the last WɪᴋɪWᴏʀᴅ up there, the \p{Greek} one, got left out of the colorization scheme? That means they are only looking for capitalized ASCII identifiers. Très passé! Why bother accepting ᴜɴɪᴄᴏᴅᴇ if you aren’t going to use things like \p{Uppercase} correctly? As you’ll see in my program where I have a @ὑπέρμεγας array, us ᴍᴏᴅᴇʀɴ ᴘʀᴏɢʀᴀᴍᴍɪɴɢ ʟᴀɴɢᴜᴀɢᴇs handle this perfectly fine. ☺

I obviously didn’t run all the supers or the hypers. And on 32‑bit machine, you’ll only get 4 of the tested hypers. I also didn’t test any of the hyperpenults.

Here’s the testing program, which runs cleanly on all version from 5.10 and up.

#!/usr/bin/env perl
#
# hypertest - show how to safely use code points not legal for interchange in Perl
# 
# Tom Christiansen
# [email protected]
# Sat Feb 26 16:38:44 MST 2011

use utf8;
use 5.10.0;
use strict;
use if $] > 5.010, "autodie";
use warnings FATAL => "all";

use Carp;

binmode(STDOUT, ":utf8");
END { close STDOUT }

$\ = "\n";

sub ghex(_);

my @penults = map { 
    (0x01_0000 * $_) + 0xfffE, 
    (0x01_0000 * $_) + 0xfffF, 
} 0x00 .. 0x10;

my @super_penults = map { 
    (0x01_0000 * $_) + 0xfffE, 
    (0x01_0000 * $_) + 0xfffF, 
} 0x10 .. 0xFF;

my @low_surrogates  = map { 0xDC00 + $_ } 0x000 .. 0x3FF;
my @high_surrogates = map { 0xD800 + $_ } 0x000 .. 0x3FF;

my @noncharacters = map { 0xFDD0 + $_ } 0x00 .. 0x1F;

my @supers = ( 
    0x0011_0000,  0x0100_0000,  0x1000_0000,  0x1F00_0000,  
    0x1FFF_FFFF,  0x3FFF_FFFF,  0x7FFF_FFFF,  0x7FFF_FFFF,  
);

# these should always work anywhere 
my @ὑπέρμεγας = ( 
    0x8000_0000,   0xF000_0000,   
    0x3FFF_FFFF,   0xFFFF_FFFF,  
);

####
# now we go fishing for 64-bit ὑπέρμεγας
####

eval q{
    use warnings FATAL => "overflow";
    no  warnings "portable";
    push @ὑπέρμεγας => ( 
        0x01_0000_0000, 
        0x01_FFFF_FF00,
    );
};
eval q{
    use warnings FATAL => "overflow";
    no  warnings "portable";
    push @ὑπέρμεγας => (
        0x0001_0000_0000_0000,
        0x001F_0000_0000_0000,
        0x7FFF_FFFF_FFFF_FFFF,
        0xFFFF_FFFF_FFFF_FFFF,
    );
};

# more than 64??
eval q{
    use warnings FATAL => "overflow";
    no  warnings "portable";
    push @ὑπέρμεγας => (
        0x01_0001_0000_0000_0000,
        0x01_7FFF_FFFF_FFFF_FFFF,
        0x01_FFFF_FFFF_FFFF_FFFF,
    );
    1;
};


my @testpairs = (
    penults         => \@penults,
    super_penults   => \@super_penults,
    noncharacters   => \@noncharacters ,
    low_surrogates  => \@low_surrogates,
    high_surrogates => \@high_surrogates,
    supers          => \@supers,
    ὑπέρμεγας       => \@ὑπέρμεγας,   
);

while (my($name, $aref) = splice(@testpairs, 0, 2)) {
    printf "testing %-20s", ucfirst $name;

    my(@passed, @failed);

    for my $codepoint (@$aref) {

        use warnings FATAL => "all";

        my $char = do {
            # next line not needed under 5.13.9 or better: HURRAY!
            no warnings "utf8";
            chr(0xFFFF) && chr($codepoint);
        };

        my $regex_ok = do {
            # next line not needed under 5.13.9 or better: HURRAY!
            no warnings "utf8";
            $char =~ $char;
            1;
        };

        my $status = defined($char) && $regex_ok;

        push @{ $status ? \@passed : \@failed }, $codepoint;
    }

    my $total  = @$aref;
    my $passed = @passed;
    my $failed = @failed;

    given($total) {
        when ($passed)  { print "passed all $total codepoints" }
        when ($failed)  { print "failed all $total codepoints" }
        default         {
            print "of $total codepoints, failed $failed and passed $passed";
            my $flist = join(", ", map { ghex } @failed);
            my $plist = join(", ", map { ghex } @passed);
            print "\tpassed: $plist";
            print "\tfailed: $flist";
        }
    }

}

sub ghex(_) {
    my $num = shift();
    my $hex = sprintf("%X", $num);
    return $hex if length($hex) < 5;
    my $flip = reverse $hex;
    $flip =~ s<
        ( \p{ahex} \p{ahex} \p{ahex} \p{ahex} )
        (?= \p{ahex} )
        (?! \p{ahex}* \. )
    ><${1}_>gx;
    return "0x" . reverse($flip);
}

回复收藏 0 原文

萌酱 2024-10-26 11:51:28

这是一个编译时错误，类似于忘记关闭正则表达式。如果您将该片段的编译延迟到运行时，您可以捕获它：

#!/usr/bin/env perl
use warnings;

use warnings FATAL => qw(all);

my $character;

eval q{
    $character = "\x{ffff}";
};
if ($@) {
    die "---------- caught error ----------\n";
}

print "something\n";

输出：

---------- caught error  ----------

如果您在 eval 之后删除 q，您将获得与脚本相同的行为现在，因为 eval {...}; if($@) {...} 与 try {...} catch {...}; 相同，但带有 q它是字符串的评估，这是完全不同的。

更新：
正如 Tom 指出也许应该在您设置或获取这些值的位置周围的狭窄范围内禁用该警告，并使用 no warnings qw(utf8) 。您可能仍然希望将 utf8 警告捕获为输出错误（或将数据发送到程序外部的任何其他内容）：

#!/usr/bin/env perl
use warnings FATAL => qw(all);

my $character;

eval {
    no warnings qw(utf8);
    $character = "\x{ffff}";
};
if ($@) {
    die "---------- caught error  ----------\n";
}

print "something\n";
eval {
    print "something $character else\n";
};
if ($@) {
    die "---------- caught output error  ----------\n";
}

输出：

something
---------- caught output error  ----------

It's a compile-time error, similar to forgetting to close a regex. If you delay the compilation of that piece to runtime, you can catch it:

#!/usr/bin/env perl
use warnings;

use warnings FATAL => qw(all);

my $character;

eval q{
    $character = "\x{ffff}";
};
if ($@) {
    die "---------- caught error ----------\n";
}

print "something\n";

Output:

---------- caught error  ----------

If you remove the q after eval, you'll get the same behavior as your script does now, since eval {...}; if($@) {...} is the same as try {...} catch {...};, but with the q it's an eval of a string, which is totally different.

UPDATE:
As Tom points out, you should probably just disable that warning with no warnings qw(utf8) in a narrow scope around the spot you're setting or getting those kinds of values. You may still want to catch utf8 warnings as errors on output (or anything else that sends the data outside your program):

#!/usr/bin/env perl
use warnings FATAL => qw(all);

my $character;

eval {
    no warnings qw(utf8);
    $character = "\x{ffff}";
};
if ($@) {
    die "---------- caught error  ----------\n";
}

print "something\n";
eval {
    print "something $character else\n";
};
if ($@) {
    die "---------- caught output error  ----------\n";
}

Output:

something
---------- caught output error  ----------

回复收藏 0 原文

~没有更多了~

关于作者

终止放荡

暂无简介

文章

27 人气

关注发私信

友情链接

文江博客

我怎样才能捕获“Unicode 非字符”警告？

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（2）

一个Perl 5.10.0 ⋯ 5.13.8 Bug

“在下一个版本中修复”

All-Perl 交换是安全的

各种非法交换的非字符

测试程序

A Perl 5.10.0 ⋯ 5.13.8 Bug

‘Fixed in the Next Release’

An All‐Perl Interchange is Safe

Varieties of Noncharacters Illegal for Interchange

The Test Program

关于作者

相关话题

热门标签

推荐作者

佚名

今天

゛时过境迁

达拉崩吧

呆萌少年

孤者何惧

友情链接

我怎样才能捕获“Unicode 非字符”警告？

如果你对这篇内容有疑问，欢迎到本站社区发帖提问 参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。

发布评论

评论（2）

一个Perl 5.10.0 ⋯ 5.13.8 Bug

“在下一个版本中修复”

All-Perl 交换是安全的

各种非法交换的非字符

测试程序

A Perl 5.10.0 ⋯ 5.13.8 Bug

‘Fixed in the Next Release’

An All‐Perl Interchange is Safe

Varieties of Noncharacters Illegal for Interchange

The Test Program

关于作者

相关话题

热门标签

推荐作者

佚名

今天

゛时过境迁

达拉崩吧

呆萌少年

孤者何惧

友情链接

如果你对这篇内容有疑问，欢迎到本站社区发帖提问参与讨论，获取更多帮助，或者扫码二维码加入 Web 技术交流群。