我怎样才能捕获“Unicode 非字符”警告?
我怎样才能捕捉到“Unicode非字符0xffff对于交换是非法的”警告?
#!/usr/bin/env perl
use warnings;
use 5.012;
use Try::Tiny;
use warnings FATAL => qw(all);
my $character;
try {
$character = "\x{ffff}";
} catch {
die "---------- caught error ----------\n";
};
say "something";
输出:
# Unicode non-character 0xffff is illegal for interchange at ./perl1.pl line 11.
How could I catch the "Unicode non-character 0xffff is illegal for interchange"-warning?
#!/usr/bin/env perl
use warnings;
use 5.012;
use Try::Tiny;
use warnings FATAL => qw(all);
my $character;
try {
$character = "\x{ffff}";
} catch {
die "---------- caught error ----------\n";
};
say "something";
Output:
# Unicode non-character 0xffff is illegal for interchange at ./perl1.pl line 11.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(2)
一个Perl 5.10.0 ⋯ 5.13.8 Bug
我假设您实际上并不想“捕获”此警告,而是想生存或忽略它。如果你真的想抓住它,那么,可能有更简单的方法。
但首先要知道的是,不存在非法代码点,只有不可互换的代码点。
您只需在需要使用完整 Unicode 范围(或更多)的范围内使用
无警告“utf8”
即可。 无需为此使用eval
。所需的只是范围内的警告抑制。即使对于较新的 perls 来说这是不必要的。因此,不要这样
写:(在较旧的 perls 上):
这也是涉及此类字符的模式匹配的情况:
将导致警告或致命错误,具体取决于您的 perl 的旧版本,或者根本没有任何结果,具体取决于您的 perl 的新版本佩尔群岛
您可以仅在重要的版本上以这种方式禁用与 utf8 相关的警告:
“在下一个版本中修复”
真正有趣的是他们(阅读:Perl5 Porters,特别是 Karl Williamson)已经修复了需要
没有警告“utf8”
保护只是为了与任何代码点一起工作。您可能需要小心的只是输出。注意:最安全的做法是在您需要的地方放置
no warnings "utf8"
。但不需要eval
!从 5.13.10 开始,因此在 5.14 中,utf8 警告分为三个子类别:UTF-16 的
surrogate
、如下所述的nonchar
和non_unicode< /code> 用于超级,也在下面定义。
All-Perl 交换是安全的
不过,您可能不想抑制输出中的“非法交换”警告,因为这是事实。好吧,除非您使用 Perl 的
"utf8"
编码,这与它的"UTF-8"
编码不同,这很奇怪。"utf8"
编码比正式标准更宽松,因为它允许我们做比其他方式更有趣的事情。但是,当且仅当您拥有 100% 纯 Perl 数据路径时,您仍然可以使用您想要的任何代码点,包括高达 ᴍᴀxɪɴᴛ 的非 unicode 代码点。这在 32 位机器上是 0x7FFF_FFFF,而在 64 位机器上则大得难以形容:0xFFFF_FFFF_FFFF_FFFF!这不仅仅是一个超级;而是一个超级。这是一个超巨型!
请注意,在 32 位机器上,最后一个会产生以下内容:
各种非法交换的非字符
有几种(实际上是相当多)不同类别的代码点对于交换是非法的。
任何满足
(ord(ᴄᴏᴅᴇᴘᴏɪɴᴛ) & 0xFFFE) == 0xFFFE
为真的代码点。这涵盖了所有可能平面中的最后两个代码点。由于它跨越 17 个平面,因此 Unicode 定义了 34 个这样的代码点。这些不是字符,尽管它们是 Unicode 代码点。我们将这些称为“Penults”。它们属于 5.13.10 或更高版本的nonchar
警告类别。从 U+FDD0 开始的 32 个代码点。这些保证是非字符,尽管它们当然仍然是 Unicode 代码点。与之前的 Penult 集一样,这些也属于 5.13.10 或更高版本的
nonchar
警告类别。1024 个高代理项和 1024 个低代理项,它们是作为斜率而被雕刻出来的,以使 UTF-16 对于所有尝试 UCS-2 而不是 UTF-8 或 UTF-32 的愚蠢系统成为可能。这削弱了有效 Unicode 代码点的范围,将它们限制为仅前 21 位。 代理仍然是代码点。它们只是无法互换,因为它们不能总是由聪明的 UTF-16 正确表示。在 5.13.10 或更高版本中,这些由
surrogate
警告子类控制。除此之外,我们现在已经超出了 Unicode 范围。我将这些称为“超级”。在 32 位计算机上,除了 Unicode 提供的标准 21 位之外,您仍然拥有(10 或)11 位。 Perl 可以很好地使用这些。这给出了您可以在 Perl 程序中使用的总共 2**32 个代码点(好吧,或者至少 2**31,由于有符号溢出)。您将获得一百万个 Unicode 代码点,但随后您将获得超出 Perl 中可以使用的数十亿个超级代码点。如果您运行的是 5.13.10 或更高版本,则可以通过
non_unicode
warnings 子类控制对这些内容的访问。即使在超级范围内,Perl 仍然遵循有关 Penult 的规则。在 32 位机器上有 480 个这样的 Superpenult,而在 64 位机器上则更多。
如果您真的想要以不可移植的方式播放它,那么如果您有本机 64 位整数,那么您还有比 supers 提供的高出 32 或 33 位的位。您现在有 18 万亿、446 万亿、744 万亿、730 亿、7.09 亿、55.1 万和 616 个字符。您拥有一整艾字节不同的代码点!这远远超出了超级的范围,我将称它们为Hypermegas。好的,所以这些不是很便携,因为它们需要真正的 64 位平台。它们有点陌生,所以也许我们应该写成Ὑπέρμεγας来吓跑人们。 :) 请注意,禁止倒数的规则仍然适用于 hypermegas。
测试程序
我编写了一个小程序来证明这些代码点很酷。
注意:上面的最后一行显示了 SO 的地狱突出显示代码中的另一个愚蠢的错误。注意到最后一个 WɪᴋɪWᴏʀᴅ 了,
\p{Greek}
被排除在着色方案之外了吗?这意味着他们只寻找大写的 ASCII 标识符。 已经过时了!如果您不打算正确使用\p{Uppercase}
之类的内容,为什么还要费心接受 ᴜɴɪᴄᴏᴅᴇ 呢?正如您将在我的程序中看到的那样,我有一个@ὑπέρμεγας
数组,我们 ᴍᴏᴅᴇʀɴ ᴘʀᴏɢʀᴀᴍᴍɪɴɢ ʟᴀɴɢᴜᴀɢᴇs 可以很好地处理这个问题。 ☺显然我没有运行所有的超级或超级。在 32 位机器上,您只能获得 4 个经过测试的 hyper。我也没有测试任何超级惩罚。
这是测试程序,它可以在 5.10 及更高版本的所有版本上正常运行。
A Perl 5.10.0 ⋯ 5.13.8 Bug
I’m going to assume that you don’t actually want to “catch” this warning, but rather to survive or ignore it. If you really want to catch it, well, there may be easier ways to do that.
But the first thing to know is that there is no such thing as an illegal code point, only code points not valid for interchange.
You just have to use a
no warnings "utf8"
for the scope of where you need to use the full Unicode range (or more). There is no need to use aneval
for this. All it takes is a scoped warning suppression. Even that it is unnecessary on newer perls.So instead of this:
write (on older perls):
This is also the situation with pattern matches involving such a character:
will cause a warning or a fatal, depending on how old your perl, or nothing at all, depending on how new your perl is.
You can disable utf8-related warnings only on releases where it matters this way:
‘Fixed in the Next Release’
The really interesting thing is that they (read: Perl5 Porters, and in particular, Karl Williamson) have fixed the bug that requires a
no warnings "utf8"
guard just to work with any code point at all. It is only the output where you may have to be careful. Watch:The safest thing to do is put
no warnings "utf8"
in just the places you need it. But there is no need of aneval
!As of 5.13.10, and hence in 5.14, there are three subcategories of utf8 warnings:
surrogate
for UTF‑16,nonchar
as described below, andnon_unicode
for supers, also defined below.An All‐Perl Interchange is Safe
You probably don’t want to suppress the “illegal for interchange” warnings on output, though, because this is true. Well, unless you’re using Perl’s
"utf8"
encoding, which isn’t the same as its"UTF‑8"
encoding, oddly enough. The"utf8"
encoding is laxer than the formal standard, because it allows us to do more interesting things than we otherwise could.However, if and only if you have a 100% pure-perl datapath, you can still use any code point you want, including non-unicode code points up to ᴍᴀxɪɴᴛ. That’s 0x7FFF_FFFF on 32‑bit machines, and something unspeakably huge on 64‑bit machines: 0xFFFF_FFFF_FFFF_FFFF! That’s not just a super; it’s a hypermega!
Note that on a 32‑bit machine, that last one produces this:
Varieties of Noncharacters Illegal for Interchange
There are several — quite a few, actually — different classes of code points that are not legal for interchange.
Any code point such that
(ord(ᴄᴏᴅᴇᴘᴏɪɴᴛ) & 0xFFFE) == 0xFFFE
is true. This covers the last two code points in all possible planes. As it spans 17 planes, Unicode defines therefore 34 such code points. Those are not characters, although they are Unicode code points. Let’s call these the Penults. They fall under thenonchar
warning class on 5.13.10 or better.The 32 code points starting at U+FDD0. These are guaranteed to be Noncharacters, although of course they are still Unicode code points. Like the previous penult set, these too fall under the
nonchar
warning class on 5.13.10 or better.The 1024 high surrogates and the 1024 low surrogates, which were carved out as slop to make UTF‑16 possible for all those dumb systems that tried UCS‑2 instead of UTF‑8 or UTF‑32. This cripples the range of valid Unicode code points, restricting them to only the first 21 bits worth. SURROGATES ARE STILL CODE POINTS. They just are not valid for interchange, because they cannot always be correctly represented by brain-dead-clever UTF‑16. Under 5.13.10 or better, these are controlled by the
surrogate
warning subclass.Beyond that, we’re now above the Unicode range. I’ll call these Supers. On a 32‑bit machine, you still have (10 or) 11 bits of them beyond the standard 21 bits that Unicode gives you. Perl can use these just fine. That gives 2**32 total code points you can use in your Perl program (well, or 2**31 at least, due to signed overflow). You get a million Unicode code points, but then you get a couple of billion Super code points beyond those that you can use in Perl. If you are running 5.13.10 or better, you can control access to these via the
non_unicode
warnings subclass.Perl still follows the rules about Penults even up in the Super range. There are 480 such Superpenults on a 32‑bit machine, and rather more of them on a 64‑bit one.
If you really want to play it nonportably, then if you have native 64‑bit ints, you have another 32 or 33 bits above what the supers give you. You now have 18 quintillion 446 quadrillion 744 trillion 73 billion 709 million 551 thousand and 616 characters. You have a whole exabyte of distinct code points! That’s far beyond super that I’m going to call them Hypermegas. Ok, so these aren’t very portable, since they require a truly 64‑bit platform. They’re a bit foreign, so maybe we should write that Ὑπέρμεγας to scare people away. :) Note that the rules against penults still apply to hypermegas.
The Test Program
I wrote a little program that proves that these code points are cool.
NOTE: That last line above shows a Yet Another Stupid Bug in SO’s infernal highlighting code. Notice the last WɪᴋɪWᴏʀᴅ up there, the
\p{Greek}
one, got left out of the colorization scheme? That means they are only looking for capitalized ASCII identifiers. Très passé! Why bother accepting ᴜɴɪᴄᴏᴅᴇ if you aren’t going to use things like\p{Uppercase}
correctly? As you’ll see in my program where I have a@ὑπέρμεγας
array, us ᴍᴏᴅᴇʀɴ ᴘʀᴏɢʀᴀᴍᴍɪɴɢ ʟᴀɴɢᴜᴀɢᴇs handle this perfectly fine. ☺I obviously didn’t run all the supers or the hypers. And on 32‑bit machine, you’ll only get 4 of the tested hypers. I also didn’t test any of the hyperpenults.
Here’s the testing program, which runs cleanly on all version from 5.10 and up.
这是一个编译时错误,类似于忘记关闭正则表达式。如果您将该片段的编译延迟到运行时,您可以捕获它:
输出:
如果您在
eval
之后删除q
,您将获得与脚本相同的行为现在,因为eval {...}; if($@) {...}
与try {...} catch {...};
相同,但带有q
它是字符串的评估,这是完全不同的。更新:
正如 Tom 指出也许应该在您设置或获取这些值的位置周围的狭窄范围内禁用该警告,并使用
no warnings qw(utf8)
。您可能仍然希望将 utf8 警告捕获为输出错误(或将数据发送到程序外部的任何其他内容):输出:
It's a compile-time error, similar to forgetting to close a regex. If you delay the compilation of that piece to runtime, you can catch it:
Output:
If you remove the
q
aftereval
, you'll get the same behavior as your script does now, sinceeval {...}; if($@) {...}
is the same astry {...} catch {...};
, but with theq
it's an eval of a string, which is totally different.UPDATE:
As Tom points out, you should probably just disable that warning with
no warnings qw(utf8)
in a narrow scope around the spot you're setting or getting those kinds of values. You may still want to catch utf8 warnings as errors on output (or anything else that sends the data outside your program):Output: