Does \w match all alphanumeric characters defined in the Unicode standard?

Posted on 2024-10-30 17:15:28

Does Perl's \w match all alphanumeric characters defined in the Unicode standard?

For example, will \w match all (say) Chinese and Russian alphanumeric characters?

I wrote a simple test script (see below) which suggests that \w does indeed match "as expected" for the non-ASCII alphanumeric characters I tested. But the testing is obviously far from exhaustive.

#!/usr/bin/perl

use strict;
use warnings;
use utf8;    # the source file itself contains UTF-8 string literals

binmode(STDOUT, ':utf8');

my @ok;
$ok[0] = "abcdefghijklmnopqrstuvwxyz";
$ok[1] = "éèëáàåäöčśžłíżńęøáýąóæšćôı";
$ok[2] = "şźüęłâi̇ółńśłŕíáυσνχατςęςη";
$ok[3] = "τσιαιγολοχβςανنيرحبالтераб";
$ok[4] = "иневоаслкłјиневоцедањеволс";
$ok[5] = "рглсывызтоμςόκιναςόγο";

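# Die on the first test string that contains a character \w does not match.
foreach my $ok (@ok) {
    die "\\w did not match every character of a test string\n" unless $ok =~ /^\w+$/;
}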

Comments (3)

青春有你 2024-11-06 17:15:28

perldoc perlunicode says

Character classes in regular expressions match characters instead of bytes and match against the character properties specified in the Unicode properties database. \w can be used to match a Japanese ideograph, for instance.

So it looks like the answer to your question is "yes".

However, you might want to use the \p{} construct to directly access specific Unicode character properties. You can probably use \p{L} (or, shorter, \pL) for letters and \pN for numbers and feel a little more confident that you'll get exactly what you want.
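
A minimal sketch of that difference, assuming Perl 5.14 or later. The `classify` helper and the sample strings are illustrative only and do not come from perlunicode. It checks a Chinese word, a Russian word, a run of digits, and an identifier-like string against \w, \pL, and \pN:

```perl
#!/usr/bin/perl
use v5.14;
use warnings;

# Hypothetical helper: report whether every character of $str matches
# \w, \pL (letters) and \pN (numbers) respectively.
sub classify {
    my ($str) = @_;
    my %result = (
        word   => ($str =~ /\A\w+\z/u ? 1 : 0),
        letter => ($str =~ /\A\pL+\z/ ? 1 : 0),
        number => ($str =~ /\A\pN+\z/ ? 1 : 0),
    );
    return \%result;
}

# Sample strings, written with escapes so the file stays pure ASCII.
my %samples = (
    chinese => "\x{4E2D}\x{6587}",                               # two CJK ideographs
    russian => "\x{43F}\x{440}\x{438}\x{432}\x{435}\x{442}",     # six Cyrillic letters
    digits  => "2024",
    mixed   => "abc_123",
);

for my $name (sort keys %samples) {
    my $r = classify($samples{$name});
    printf "%-8s word=%d letter=%d number=%d\n",
        $name, $r->{word}, $r->{letter}, $r->{number};
}
```

The `mixed` row is the interesting one: \w accepts the whole string, while neither \pL nor \pN does, because \w is a union of several properties rather than a synonym for either.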

爱情眠于流年 2024-11-06 17:15:28

Yes and no.

If you want all alphanumerics, you want [\p{Alphabetic}\p{GC=Number}]. The \w contains both more and less than that. It specifically excludes any \pN which is neither \p{Nd} nor \p{Nl}, like the superscripts, subscripts, and fractions. Those are \p{GC=Other_Number}, and are not included in \w.
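
To see "both more and less" concretely, here is a small sketch assuming Perl 5.14 or later. The `compare` helper and the choice of sample code points are mine, not part of any library. It tests each code point against \w and against the bracketed class:

```perl
#!/usr/bin/perl
use v5.14;
use warnings;

# The class suggested above for "all alphanumerics".
my $alnum = qr/[\p{Alphabetic}\p{GC=Number}]/;

# Hypothetical helper: for one code point, return two 0/1 flags:
# (matches \w, matches the alphanumeric class).
sub compare {
    my ($cp) = @_;
    my $ch       = chr $cp;
    my $is_word  = $ch =~ /\A\w\z/u    ? 1 : 0;
    my $is_alnum = $ch =~ /\A$alnum\z/ ? 1 : 0;
    return ($is_word, $is_alnum);
}

my @samples = (
    [0x0041, 'LATIN CAPITAL LETTER A'],
    [0x0663, 'ARABIC-INDIC DIGIT THREE'],
    [0x2167, 'ROMAN NUMERAL EIGHT'],
    [0x00B2, 'SUPERSCRIPT TWO'],
    [0x00BD, 'VULGAR FRACTION ONE HALF'],
    [0x005F, 'LOW LINE'],
    [0x0301, 'COMBINING ACUTE ACCENT'],
);

for my $s (@samples) {
    my ($is_word, $is_alnum) = compare($s->[0]);
    printf "U+%04X %-26s \\w=%d alnum=%d\n", $s->[0], $s->[1], $is_word, $is_alnum;
}
```

The superscript and the fraction land on the "alnum only" side, while the underscore and the combining accent land on the "\w only" side.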

Unlike most regex systems, Perl complies with Requirement 1.2a, “Compatibility Properties”, from UTS #18 on Unicode Regular Expressions. So, assuming you have Unicode strings, a \w in a regex matches any single code point that has any of the following four properties:

  1. \p{Alphabetic}
  2. \p{GC=Mark}
  3. \p{GC=Connector_Punctuation}
  4. \p{GC=Decimal_Number}
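
As a spot check of that list, the following sketch (assuming Perl 5.14 or later; the `agrees` helper and the sample code points are hypothetical choices of mine) combines the four properties into one bracketed class and compares it with \w:

```perl
#!/usr/bin/perl
use v5.14;
use warnings;

# The four properties from the list, combined into a single class.
# Alphabetic is a binary property, so it takes no GC= prefix.
my $four = qr/[\p{Alphabetic}\p{GC=Mark}\p{GC=Connector_Punctuation}\p{GC=Decimal_Number}]/;

# Hypothetical helper: true when \w and the combined class give the
# same answer for the given code point.
sub agrees {
    my ($cp) = @_;
    my $ch      = chr $cp;
    my $is_word = $ch =~ /\A\w\z/u   ? 1 : 0;
    my $in_four = $ch =~ /\A$four\z/ ? 1 : 0;
    return $is_word == $in_four;
}

# A mix of code points that should and should not be word characters.
my @samples = (0x0041, 0x00E9, 0x005F, 0x0301, 0x0663, 0x4E2D,
               0x2167, 0x00B2, 0x0020, 0x002D);

for my $cp (@samples) {
    printf "U+%04X %s\n", $cp, agrees($cp) ? 'agrees' : 'DIFFERS';
}
```

Treat this as a spot check rather than a proof. As far as I can tell from perluniprops, recent perls also count the \p{Join_Control} code points as \w, so results for U+200C and U+200D may differ from the four-property class depending on the Perl version.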

Number 4 above can be expressed in any of these ways, which are all considered equivalent:

  • \p{Digit}
  • \p{General_Category=Decimal_Number}
  • \p{GC=Decimal_Number}
  • \p{Decimal_Number}
  • \p{Nd}
  • \p{Numeric_Type=Decimal}
  • \p{Nt=De}
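
A quick way to convince yourself that these spellings are interchangeable is to run them all against the same characters. This is a sketch assuming Perl 5.14 or later; `count_matches` and the sample code points are illustrative names and choices, not anything standard:

```perl
#!/usr/bin/perl
use v5.14;
use warnings;

# One compiled pattern per spelling from the list above.
my @spellings = (
    qr/\A\p{Digit}\z/,
    qr/\A\p{General_Category=Decimal_Number}\z/,
    qr/\A\p{GC=Decimal_Number}\z/,
    qr/\A\p{Decimal_Number}\z/,
    qr/\A\p{Nd}\z/,
    qr/\A\p{Numeric_Type=Decimal}\z/,
    qr/\A\p{Nt=De}\z/,
);

# Hypothetical helper: how many of the spellings match this code point.
# If the spellings are equivalent, the answer is always all or none.
sub count_matches {
    my ($cp) = @_;
    my $ch = chr $cp;
    return scalar grep { $ch =~ $_ } @spellings;
}

my @samples = (
    [0x0037, 'DIGIT SEVEN'],
    [0x0663, 'ARABIC-INDIC DIGIT THREE'],
    [0x0966, 'DEVANAGARI DIGIT ZERO'],
    [0x00B2, 'SUPERSCRIPT TWO'],
    [0x2167, 'ROMAN NUMERAL EIGHT'],
    [0x0041, 'LATIN CAPITAL LETTER A'],
);

for my $s (@samples) {
    printf "U+%04X %-26s matched by %d of %d spellings\n",
        $s->[0], $s->[1], count_matches($s->[0]), scalar @spellings;
}
```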

Note that \p{Digit} is not the same as \p{Numeric_Type=Digit}. For example, code point U+00B2, SUPERSCRIPT TWO, has only the \p{Numeric_Type=Digit} property and not plain \p{Digit}. That is because it is considered a \p{Other_Number}, or \p{No}. It does, however, have the \p{Numeric_Value=2} property, as you would imagine.
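
The SUPERSCRIPT TWO case can be checked directly. This is a sketch assuming Perl 5.14 or later; the `describe` helper is a made-up name for illustration:

```perl
#!/usr/bin/perl
use v5.14;
use warnings;

# Hypothetical helper: test one character against the properties
# discussed above and return the results as 0/1 flags.
sub describe {
    my ($ch) = @_;
    my %flags = (
        digit        => ($ch =~ /\A\p{Digit}\z/              ? 1 : 0),
        nt_digit     => ($ch =~ /\A\p{Numeric_Type=Digit}\z/ ? 1 : 0),
        other_number => ($ch =~ /\A\p{Other_Number}\z/       ? 1 : 0),
        value_two    => ($ch =~ /\A\p{Numeric_Value=2}\z/    ? 1 : 0),
        word         => ($ch =~ /\A\w\z/u                    ? 1 : 0),
    );
    return \%flags;
}

my %chars = (
    'DIGIT TWO'       => "2",
    'SUPERSCRIPT TWO' => "\x{B2}",
);

for my $name (sort keys %chars) {
    my $f = describe($chars{$name});
    printf "%-16s %s\n", $name,
        join ' ', map { "$_=$f->{$_}" } sort keys %$f;
}
```

Both characters carry the numeric value 2, but only the plain digit is a \p{Digit} and therefore a \w character.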

It’s really point number 1 above, \p{Alphabetic}, that gives people the most trouble. That’s because they too often mistakenly think it is somehow the same as \p{Letter} (\pL), but it is not.

Alphabetic includes much more than that. It takes in all of \p{Lowercase} (which is not the same as \p{GC=Ll} because it adds \p{Other_Lowercase}) and all of \p{Uppercase} (which is not the same as \p{GC=Lu} because it adds \p{Other_Uppercase}), and through the \p{Other_Alphabetic} property it also includes some but not all \p{GC=Mark}.

That’s how it pulls in \p{GC=Letter_Number} like Roman numerals and also all the circled letters, which are of type \p{Other_Symbol} and \p{Block=Enclosed_Alphanumerics}.
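
Here is a sketch of how \p{Alphabetic} and \p{Letter} diverge, assuming Perl 5.14 or later. The `letter_vs_alpha` helper and the sample code points are my own picks for illustration:

```perl
#!/usr/bin/perl
use v5.14;
use warnings;

# Hypothetical helper: return 0/1 flags for \pL, \p{Alphabetic} and \w.
sub letter_vs_alpha {
    my ($cp) = @_;
    my $ch = chr $cp;
    my %flags = (
        letter     => ($ch =~ /\A\pL\z/            ? 1 : 0),
        alphabetic => ($ch =~ /\A\p{Alphabetic}\z/ ? 1 : 0),
        word       => ($ch =~ /\A\w\z/u            ? 1 : 0),
    );
    return \%flags;
}

my @samples = (
    [0x0041, 'LATIN CAPITAL LETTER A'],          # a plain letter
    [0x2167, 'ROMAN NUMERAL EIGHT'],             # GC=Letter_Number
    [0x24B6, 'CIRCLED LATIN CAPITAL LETTER A'],  # GC=Other_Symbol
    [0x093E, 'DEVANAGARI VOWEL SIGN AA'],        # a mark that is alphabetic
    [0x0301, 'COMBINING ACUTE ACCENT'],          # a mark that is not
);

for my $s (@samples) {
    my $f = letter_vs_alpha($s->[0]);
    printf "U+%04X %-32s L=%d Alphabetic=%d \\w=%d\n",
        $s->[0], $s->[1], $f->{letter}, $f->{alphabetic}, $f->{word};
}
```

The last two rows show the "some but not all marks" point: both are \w because every mark is, but only the vowel sign is also Alphabetic.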

Aren’t you glad we get to use \w? :)

最美不过初阳 2024-11-06 17:15:28

In particular \w also matches the underscore character.

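#!/usr/bin/perl -w
my $name = 'Arun_Kumar';
# Anchored, so the underscore itself must match \w; a bare /\w+/ would succeed on "Arun" alone.
print $name =~ /^\w+$/ ? "Underscore is a word character\n" : "Underscore is not a word character\n";

$ perl underscore.pl
Underscore is a word character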
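
The underscore is a word character because its general category is Connector_Punctuation, so it is not the only such character. A small sketch, assuming Perl 5.14 or later, with `connector_and_word` as a made-up helper name and sample code points of my choosing:

```perl
#!/usr/bin/perl
use v5.14;
use warnings;

# Hypothetical helper: return (is Connector_Punctuation, matches \w)
# as 0/1 flags for one code point.
sub connector_and_word {
    my ($cp) = @_;
    my $ch = chr $cp;
    my $is_pc   = $ch =~ /\A\p{Pc}\z/ ? 1 : 0;
    my $is_word = $ch =~ /\A\w\z/u    ? 1 : 0;
    return ($is_pc, $is_word);
}

my @samples = (
    [0x005F, 'LOW LINE'],
    [0x203F, 'UNDERTIE'],
    [0xFF3F, 'FULLWIDTH LOW LINE'],
    [0x002D, 'HYPHEN-MINUS'],    # dash punctuation, for contrast
);

for my $s (@samples) {
    my ($is_pc, $is_word) = connector_and_word($s->[0]);
    printf "U+%04X %-20s Pc=%d \\w=%d\n", $s->[0], $s->[1], $is_pc, $is_word;
}
```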
