How do I get a list of all Unicode characters that have a given property?

Posted 2024-07-29 01:35:36 · 234 characters · 12 views · 0 comments

Without looping over the entire range of Unicode characters, how can I get a list of characters that have a given property? In particular, I want a list of all characters that are digits (i.e. those that match /\d/). I have looked at Unicode::UCD, and it is useful for determining the properties of a given character, but there doesn't seem to be a way to get a list of the characters that have a given property out of it.

Comments (4)

破晓 2024-08-05 01:35:36

The list of Unicode characters for each class is generated from the Unicode spec when you compile Perl, and is typically stored in /usr/lib/perl-YOURPERLVERSION/unicore/lib/gc_sc/

For example, the list of Unicode character ranges that match IsDigit (a.k.a. \d) is stored in the file /usr/lib/perl-YOURPERLVERSION/unicore/lib/gc_sc/Digit.pl
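
As an alternative to reading those files directly, here is a minimal sketch that asks Unicode::UCD for the same ranges. It assumes Perl 5.16 or later, where Unicode::UCD provides prop_invlist; the property name "Nd" (General_Category=Decimal_Number) and the @ranges variable are choices made for this illustration, not something taken from the answer above.

```perl
use strict;
use warnings;
use Unicode::UCD qw(prop_invlist);

# prop_invlist returns an inversion list: elements alternate between the
# first code point of a range that HAS the property and the first code
# point after it that does NOT.
my @invlist = prop_invlist("Nd");

my @ranges;
for (my $i = 0; $i < @invlist; $i += 2) {
    my $start = $invlist[$i];
    # If the list has an odd length, the last range runs to the end of Unicode.
    my $end = defined $invlist[$i + 1] ? $invlist[$i + 1] - 1 : 0x10FFFF;
    push @ranges, [$start, $end];
}

# Print each range as U+XXXX..U+XXXX
printf "U+%04X..U+%04X\n", @$_ for @ranges;
```

Only the ranges that carry the property are visited, so nothing here walks the whole Unicode range.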

面如桃花 2024-08-05 01:35:36

Even better than unicore/lib/gc_sc/Digit.pl is unicore/To/Digit.pl. It is a direct mapping of Unicode digit characters (well, really their hexadecimal code points) to their numeric values. This means that instead of:

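use Unicode::Digits qw/digits_to_int/;

my @digits;
# Each line is expected to hold the start and end of a range as hex code points.
for (split "\n", require "unicore/lib/gc_sc/Digit.pl") {
    my ($s, $e) = map hex, split;
    for (my $ord = $s; $ord <= $e; $ord++) {
        my $chr = chr $ord;
        # Group each digit character under its numeric value (0 .. 9).
        push @{$digits[digits_to_int $chr]}, $chr;
    }
}

# Turn each group into a compiled character class.
for my $i (0 .. 9) {
    my $re = join '', "[", @{$digits[$i]}, "]";
    $digits[$i] = qr/$re/;
}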

I can say:

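my @digits;
# Each line is expected to hold a hex code point and its numeric value.
for (split "\n", require "unicore/To/Digit.pl") {
    my ($ord, $val) = split;
    my $chr = chr hex $ord;
    push @{$digits[$val]}, $chr;
}

# Turn each group into a compiled character class.
for my $i (0 .. 9) {
    my $re = join '', "[", @{$digits[$i]}, "]";
    $digits[$i] = qr/$re/;
}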

Or even better:

my @digits;
# Collect \x{...} escapes per numeric value, then compile one class per value.
for (split "\n", require "unicore/To/Digit.pl") {
    my ($ord, $val) = split;
    $digits[$val] .= "\\x{$ord}";
}
@digits = map { qr/[$_]/ } @digits;
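
The same per-value character classes can probably be built without depending on the layout of the unicore files, which are internal to Perl. The following is only a sketch under that assumption: it uses prop_invlist and num from Unicode::UCD (Perl 5.16 or later), and the @class variable is a name introduced for this example.

```perl
use strict;
use warnings;
use Unicode::UCD qw(prop_invlist num);

# Inversion list for decimal digits: pairs of (range start, first code
# point after the range).
my @invlist = prop_invlist("Nd");

my @class;    # $class[$v] collects \x{...} escapes for digits with value $v
for (my $i = 0; $i < @invlist; $i += 2) {
    my $end = defined $invlist[$i + 1] ? $invlist[$i + 1] - 1 : 0x10FFFF;
    for my $ord ($invlist[$i] .. $end) {
        my $val = num(chr $ord);    # numeric value of this digit, 0 .. 9
        $class[$val] .= sprintf "\\x{%X}", $ord;
    }
}

# One compiled character class per numeric value.
my @digits = map { qr/[$_]/ } @class;

print "U+0663 is a three\n" if "\x{0663}" =~ $digits[3];
```

The loop only visits code points that are already known to be decimal digits, so it stays small compared with a scan over all of Unicode.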

海夕 2024-08-05 01:35:36

Which characters /\d/ matches depends entirely on your regexp implementation (although the standard 0-9 are guaranteed). In the case of Perl, the locale in use defines which characters are considered alphabetic and which are considered digits.
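
A quick way to see what /\d/ accepts on a particular Perl is to test a non-ASCII digit. This sketch assumes Perl 5.14 or later, where the /a modifier restricts \d to ASCII 0-9; the variable names are made up for the example.

```perl
use strict;
use warnings;

my $arabic_three = "\x{0663}";    # ARABIC-INDIC DIGIT THREE

# Plain \d uses Unicode rules for a code point above 255.
my $unicode_match = $arabic_three =~ /\d/ ? 1 : 0;

# With /a, \d is limited to the ASCII digits 0-9.
my $ascii_match = $arabic_three =~ /\d/a ? 1 : 0;

print "\\d matches U+0663: $unicode_match\n";
print "\\d with /a matches U+0663: $ascii_match\n";
```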

吐个泡泡 2024-08-05 01:35:36

There is no way to do that without iterating through all the characters.
(If you create a huge string containing all of them and use a regexp, you still have to do the loop at least once to create the string.)
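
For comparison, the brute-force scan described here might look like the sketch below. It tests every code point from 0 to 0x10FFFF against /\d/ once; how many matches it finds depends on the Unicode version bundled with your Perl, so no count is given.

```perl
use strict;
use warnings;
no warnings 'utf8';    # stay quiet about surrogates and non-characters

# Keep every code point whose character matches \d.
my @digit_ords = grep { chr($_) =~ /\d/ } 0 .. 0x10FFFF;

printf "found %d digit code points, first is U+%04X\n",
    scalar @digit_ords, $digit_ords[0];
```

The result could be cached after the first run, since it only changes when the Perl installation does.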
