Does \w match all alphanumeric characters defined in the Unicode standard?
Does Perl's \w match all alphanumeric characters defined in the Unicode standard?
For example, will \w match all (say) Chinese and Russian alphanumeric characters?
I wrote a simple test script (see below), which suggests that \w does indeed match "as expected" for the non-ASCII alphanumeric characters I tested. But the testing is obviously far from exhaustive.
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                    # the string literals below are UTF-8 source text
binmode(STDOUT, ':utf8');
my @ok;
$ok[0] = "abcdefghijklmnopqrstuvwxyz";
$ok[1] = "éèëáàåäöčśžłíżńęøáýąóæšćôı";
$ok[2] = "şźüęłâi̇ółńśłŕíáυσνχατςęςη";
$ok[3] = "τσιαιγολοχβςανنيرحبالтераб";
$ok[4] = "иневоаслкłјиневоцедањеволс";
$ok[5] = "рглсывызтоμςόκιναςόγο";
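# Every test string must consist solely of \w characters.
foreach my $ok (@ok) {
    die "found a string that is not all \\w characters\n" unless $ok =~ /^\w+$/;
}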
According to perldoc perlunicode, character classes in regular expressions match against Unicode character properties, so it looks like the answer to your question is "yes".
However, you might want to use the \p{} construct to directly access specific Unicode character properties. You can probably use \p{L} (or, shorter, \pL) for letters and \pN for numbers, and feel a little more confident that you'll get exactly what you want.
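As a rough illustration of the difference between \w and the explicit property classes, here is a minimal sketch. The helper names letters_and_numbers_only and word_chars_only are made up for this example, and `use v5.14` is assumed so that the /u flag and Unicode string semantics are available.

```perl
use v5.14;
use warnings;

# True if every character of $s is a Unicode letter (\pL) or number (\pN).
sub letters_and_numbers_only {
    my ($s) = @_;
    return $s =~ /\A[\pL\pN]+\z/ ? 1 : 0;
}

# True if every character of $s is a word character (\w) under Unicode rules.
sub word_chars_only {
    my ($s) = @_;
    return $s =~ /\A\w+\z/u ? 1 : 0;
}

binmode(STDOUT, ':encoding(UTF-8)');

# "cafe" with e-acute, Russian "privet", an identifier with an underscore,
# and "x" followed by SUPERSCRIPT TWO.
my @samples = (
    "caf\x{E9}",
    "\x{43F}\x{440}\x{438}\x{432}\x{435}\x{442}",
    "foo_bar",
    "x\x{B2}",
);

for my $s (@samples) {
    printf "%-8s letters/numbers=%d word=%d\n",
        $s, letters_and_numbers_only($s), word_chars_only($s);
}
```

The two classes disagree on the last two samples: foo_bar passes \w+ but not [\pL\pN]+ because of the underscore, while the string ending in SUPERSCRIPT TWO passes [\pL\pN]+ but not \w+.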
Yes and no.
If you want all alphanumerics, you want [\p{Alphabetic}\p{GC=Number}]. \w contains both more and less than that. It specifically excludes any \pN that is neither \p{Nd} nor \p{Nl}, like the superscripts, subscripts, and fractions. Those are \p{GC=Other_Number}, and are not included in \w.
Because, unlike most regex systems, Perl complies with Requirement 1.2a, "Compatibility Properties", from UTS #18 on Unicode Regular Expressions, a \w in a regex (assuming you have Unicode strings) matches any single code point that has any of the following four properties:
1. \p{Alphabetic}
2. \p{GC=Mark}
3. \p{GC=Connector_Punctuation}
4. \p{GC=Decimal_Number}
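To see how the four properties line up with \w on a few concrete code points, something like the following sketch can be used. The helper names has_listed_property and is_word_char are hypothetical, the sample list is an arbitrary choice, and `use v5.14` is assumed for the /u flag.

```perl
use v5.14;
use warnings;

# True if the single character $ch has at least one of the four listed
# properties.
sub has_listed_property {
    my ($ch) = @_;
    return $ch =~ /\A[\p{Alphabetic}\p{GC=Mark}\p{GC=Connector_Punctuation}\p{GC=Decimal_Number}]\z/ ? 1 : 0;
}

# True if the single character $ch matches \w under Unicode rules.
sub is_word_char {
    my ($ch) = @_;
    return $ch =~ /\A\w\z/u ? 1 : 0;
}

my @samples = (
    [ 0x4E2D, 'CJK ideograph' ],
    [ 0x0436, 'CYRILLIC SMALL LETTER ZHE' ],
    [ 0x005F, 'LOW LINE' ],
    [ 0x0301, 'COMBINING ACUTE ACCENT' ],
    [ 0x0663, 'ARABIC-INDIC DIGIT THREE' ],
    [ 0x2167, 'ROMAN NUMERAL EIGHT' ],
    [ 0x00B2, 'SUPERSCRIPT TWO' ],
    [ 0x00BD, 'VULGAR FRACTION ONE HALF' ],
);

for my $sample (@samples) {
    my ($cp, $name) = @$sample;
    printf "U+%04X %-26s \\w=%d listed=%d\n",
        $cp, $name, is_word_char(chr $cp), has_listed_property(chr $cp);
}
```

The first six samples are matched by both checks; SUPERSCRIPT TWO and VULGAR FRACTION ONE HALF, which are \p{GC=Other_Number}, are matched by neither.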
Number 4 above can be expressed in any of these ways, which are all considered equivalent:
\p{Digit}
\p{General_Category=Decimal_Number}
\p{GC=Decimal_Number}
\p{Decimal_Number}
\p{Nd}
\p{Numeric_Type=Decimal}
\p{Nt=De}
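A quick way to convince yourself that these spellings agree is to run all of them against the same characters. The sketch below does that for a handful of code points; spelling_flags is a made-up helper name, and the sample code points are an arbitrary choice rather than an exhaustive check.

```perl
use v5.14;
use warnings;

# The seven spellings listed above, each anchored to a single character.
my @spellings = (
    qr/\A\p{Digit}\z/,
    qr/\A\p{General_Category=Decimal_Number}\z/,
    qr/\A\p{GC=Decimal_Number}\z/,
    qr/\A\p{Decimal_Number}\z/,
    qr/\A\p{Nd}\z/,
    qr/\A\p{Numeric_Type=Decimal}\z/,
    qr/\A\p{Nt=De}\z/,
);

# Returns one 0/1 flag per spelling, in list order, for code point $cp.
sub spelling_flags {
    my ($cp) = @_;
    my $ch = chr $cp;
    return join '', map { $ch =~ $_ ? 1 : 0 } @spellings;
}

# DIGIT SEVEN, ARABIC-INDIC DIGIT THREE, FULLWIDTH DIGIT FIVE,
# LATIN CAPITAL LETTER A, ROMAN NUMERAL EIGHT
for my $cp (0x0037, 0x0663, 0xFF15, 0x0041, 0x2167) {
    printf "U+%04X %s\n", $cp, spelling_flags($cp);
}
```

For each code point the seven flags should be identical: all 1 for the three decimal digits, all 0 for the letter and the Roman numeral.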
Note that \p{Digit} is not the same as \p{Numeric_Type=Digit}. For example, code point U+00B2, SUPERSCRIPT TWO, has only the \p{Numeric_Type=Digit} property and not plain \p{Digit}. That is because it is considered a \p{Other_Number}, or \p{No}. It does, however, have the \p{Numeric_Value=2} property, as you would imagine.
It's really point number 1 above, \p{Alphabetic}, that gives people the most trouble. That's because they too often mistakenly think it is somehow the same as \p{Letter} (\pL), but it is not.
Alphabetic includes much more than letters. It takes in \p{Other_Alphabetic}, which covers some but not all of \p{GC=Mark}; all of \p{Lowercase} (which is not the same as \p{GC=Ll}, because it adds \p{Other_Lowercase}); and all of \p{Uppercase} (which is not the same as \p{GC=Lu}, because it adds \p{Other_Uppercase}).
That's how it pulls in \p{GC=Letter_Number} characters like Roman numerals, and also all the circled letters, which are of type \p{Other_Symbol} and belong to \p{Block=Enclosed_Alphanumerics}.
Aren't you glad we get to use \w? :)
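The distinctions between \p{Digit} and \p{Numeric_Type=Digit}, and between \p{Letter} and \p{Alphabetic}, are easier to see on specific code points. Here is a small sketch that reports which of a fixed set of properties a code point has. The helper name properties_of and the label 'word' (standing for \w) are invented for this example, and `use v5.14` is assumed for the /u flag.

```perl
use v5.14;
use warnings;

# Label => single-character pattern, checked in this order.
my @props = (
    [ 'Digit'              => qr/\A\p{Digit}\z/ ],
    [ 'Numeric_Type=Digit' => qr/\A\p{Numeric_Type=Digit}\z/ ],
    [ 'Numeric_Value=2'    => qr/\A\p{Numeric_Value=2}\z/ ],
    [ 'Letter'             => qr/\A\p{Letter}\z/ ],
    [ 'Alphabetic'         => qr/\A\p{Alphabetic}\z/ ],
    [ 'Mark'               => qr/\A\p{GC=Mark}\z/ ],
    [ 'Letter_Number'      => qr/\A\p{GC=Letter_Number}\z/ ],
    [ 'Other_Number'       => qr/\A\p{GC=Other_Number}\z/ ],
    [ 'Other_Symbol'       => qr/\A\p{GC=Other_Symbol}\z/ ],
    [ 'word'               => qr/\A\w\z/u ],
);

# Comma-separated labels of the properties that code point $cp has.
sub properties_of {
    my ($cp) = @_;
    my $ch = chr $cp;
    return join ',', map { $_->[0] } grep { $ch =~ $_->[1] } @props;
}

# SUPERSCRIPT TWO, CIRCLED LATIN CAPITAL LETTER A, ROMAN NUMERAL EIGHT,
# COMBINING ACUTE ACCENT, COMBINING GREEK YPOGEGRAMMENI
for my $cp (0x00B2, 0x24B6, 0x2167, 0x0301, 0x0345) {
    printf "U+%04X %s\n", $cp, properties_of($cp);
}
```

SUPERSCRIPT TWO comes out as Numeric_Type=Digit, Numeric_Value=2 and Other_Number, with no Digit and no word match. The circled letter and the Roman numeral are Alphabetic without being Letter. The two combining marks show that only some marks are Alphabetic, although both match \w.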
In particular, \w also matches the underscore character. Underscore is a word character.
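If the underscore is not wanted, one common idiom is the double negative [^\W_], which reads as "not a non-word character and not an underscore". A minimal sketch follows; word_chars_without_underscore is a hypothetical helper name, and `use v5.14` is assumed for the /u flag.

```perl
use v5.14;
use warnings;

# True if $s consists only of word characters other than the ASCII underscore.
sub word_chars_without_underscore {
    my ($s) = @_;
    return $s =~ /\A[^\W_]+\z/u ? 1 : 0;
}

binmode(STDOUT, ':encoding(UTF-8)');

# The last sample joins two letters with U+203F UNDERTIE.
for my $s ("foobar", "foo_bar", "a\x{203F}b") {
    printf "%-8s %d\n", $s, word_chars_without_underscore($s);
}
```

This only removes the ASCII underscore. Other connector punctuation, such as U+203F UNDERTIE in the last sample, is still accepted because it remains a word character.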