Converting a UTF-8 string to numeric values in Perl

Posted 2024-09-15 11:32:31

For example,

my $str = '中國c'; # Chinese language of china

I want to print out the numeric values

20013,22283,99

薯片软お妹 2024-09-22 11:32:32

unpack will be more efficient than split and ord, because it doesn't have to make a bunch of temporary 1-character strings:

use utf8;

my $str = '中國c'; # Chinese language of china

my @codepoints = unpack 'U*', $str;

print join(',', @codepoints) . "\n"; # prints 20013,22283,99

A quick benchmark shows it's about 3 times faster than split+ord:

use utf8;
use Benchmark 'cmpthese';

my $str = '中國中國中國中國中國中國中國中國中國中國中國中國中國中國c';

cmpthese(0, {
  'unpack'     => sub { my @codepoints = unpack 'U*', $str; },
  'split-map'  => sub { my @codepoints = map { ord } split //, $str },
  'split-for'  => sub { my @cp; for my $c (split(//, $str)) { push @cp, ord($c) } },
  'split-for2' => sub { my $cp; for my $c (split(//, $str)) { $cp = ord($c) } },
});

Results:

               Rate  split-map  split-for split-for2     unpack
split-map   85423/s         --        -7%       -32%       -67%
split-for   91950/s         8%         --       -27%       -64%
split-for2 125550/s        47%        37%         --       -51%
unpack     256941/s       201%       179%       105%         --

The difference is less pronounced with a shorter string, but unpack is still more than twice as fast. (split-for2 is a bit faster than the other splits because it doesn't build a list of codepoints.)
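As a rough sketch of how the unpack approach might be packaged for reuse, the snippet below wraps it in two helpers. The names codepoints and from_codepoints are hypothetical and not part of the answer above; the second one uses pack 'U*' to rebuild the string, which gives a quick round-trip check.

```perl
use strict;
use warnings;
use utf8;    # the literal below is written in UTF-8

# Hypothetical helper: return the code points of a character string.
sub codepoints {
    my ($text) = @_;
    return unpack 'U*', $text;
}

# Hypothetical helper: rebuild a character string from code points.
sub from_codepoints {
    my (@points) = @_;
    return pack 'U*', @points;
}

my @points = codepoints('中國c');
my $csv    = join ',', @points;

# Round trip: packing the code points again should give the same string.
my $same = from_codepoints(@points) eq '中國c' ? 1 : 0;

print "$csv\n";               # prints 20013,22283,99
print "round trip: $same\n";  # prints round trip: 1
```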

奢华的一滴泪 2024-09-22 11:32:32

See perldoc -f ord:

foreach my $c (split(//, $str))
{
    print ord($c), "\n";
}

Or compressed into a single line: my @chars = map { ord } split //, $str;

Data::Dumpered, this produces:

对岸观火 2024-09-22 11:32:32

To have UTF-8 literals in your source code recognized as such, you must put use utf8; at the top of the script:

$ perl
use utf8;
my $str = '中國c'; # Chinese language of china
foreach my $c (split(//, $str))
{
    print ord($c), "\n";
}
__END__
20013
22283
99

or more tersely,

print join ',', map ord, split //, $str;
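Note that use utf8; only affects literals in the source file. If the string instead arrives as raw bytes (from a file, socket, or command line), it would need decoding first. The following is a minimal sketch of that case, assuming the input is UTF-8 encoded; the helper name codepoints_csv is made up for illustration, and the byte string is spelled out in hex escapes so the example does not depend on the source encoding.

```perl
use strict;
use warnings;
use Encode qw(decode);

# UTF-8 bytes for '中國c', as they might arrive from a file or socket.
my $bytes = "\xE4\xB8\xAD\xE5\x9C\x8B\x63";

# Hypothetical helper: comma-separated ord() of each character in a string.
sub codepoints_csv {
    my ($text) = @_;
    return join ',', map { ord($_) } split //, $text;
}

# Without decoding, every byte is treated as its own character.
my $undecoded = codepoints_csv($bytes);

# After decoding, each element is a full Unicode code point.
my $decoded = codepoints_csv(decode('UTF-8', $bytes));

print "$undecoded\n";  # prints 228,184,173,229,156,139,99
print "$decoded\n";    # prints 20013,22283,99
```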

烟凡古楼 2024-09-22 11:32:32

http://www.perl.com/pub/2012/04/perlunicook-standard-preamble.html

#!/usr/bin/env perl


 use utf8;      # so literals and identifiers can be in UTF-8
 use v5.12;     # or later to get "unicode_strings" feature
 use strict;    # quote strings, declare variables
 use warnings;  # on by default
 use warnings  qw(FATAL utf8);    # fatalize encoding glitches
 use open      qw(:std :utf8);    # undeclared streams in UTF-8
 # use charnames qw(:full :short);  # unneeded in v5.16

# http://perldoc.perl.org/functions/sprintf.html
# vector flag
# This flag tells Perl to interpret the supplied string as a vector of integers, one for each character in the string. 

my $str = '中國c';

printf "%*vd\n", ",", $str;
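To illustrate the vector flag a little further, here is a small sketch using sprintf so the results can be captured in variables. It shows the default dot separator, the custom separator supplied through the * flag, and a hexadecimal variant; the variable names are only for illustration.

```perl
use strict;
use warnings;
use utf8;    # the literal below is written in UTF-8

my $str = '中國c';

# Default separator for the vector flag is a dot.
my $dotted = sprintf '%vd', $str;

# The * flag takes the separator from the argument list.
my $commas = sprintf '%*vd', ',', $str;

# Any integer conversion works, for example uppercase hex.
my $hex = sprintf '%*vX', ':', $str;

print "$dotted\n";  # prints 20013.22283.99
print "$commas\n";  # prints 20013,22283,99
print "$hex\n";     # prints 4E2D:570B:63
```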
