Perl array syntax



I have files with 1e8 numbers in them, the values ranging from 1 to 999. I need to read through each file and save a report of how many of each number were found in each file. I thought that setting up a constant array of all zeros and then incrementing, using the number I just read as the index, would give me the answer. Perl syntax for doing this is not what I expected. Not all numbers will necessarily occur. Maybe hashes are the way to go but the array will likely only have a few holes in it. Any help? Thanks.
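
For what it's worth, here is a minimal sketch of the array-as-counter approach the question describes (assuming one number per line; the file name numbers.txt is hypothetical):

use strict;
use warnings;

my @count;                                   # indices 1..999 serve as counters
open my $fh, '<', 'numbers.txt' or die "Can't open numbers.txt: $!";
while ( my $n = <$fh> ) {
    chomp $n;
    $count[$n]++;                            # the number itself is the index
}
close $fh;

# Report only the values that actually occurred (holes stay undef).
for my $value ( 1 .. 999 ) {
    print "$value $count[$value]\n" if defined $count[$value];
}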


3 Comments

晚风撩人 2024-12-22 04:49:30


Perl syntax for doing this is not what I expected.

Show your work.

Not all numbers will necessarily occur.

Skip the zeros (simple if block; see http://perldoc.perl.org/perlintro.html).

Maybe hashes are the way to go but the array will likely only have a few holes in it.

Yes, hashes are a natural fit; see perlfaq4 and search for "count", "uniq", and "duplicate".
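
For reference, a minimal sketch of the hash-counting idiom pointed to above (the file name numbers.txt is hypothetical):

use strict;
use warnings;

my %count;
open my $fh, '<', 'numbers.txt' or die "Can't open numbers.txt: $!";
while ( my $n = <$fh> ) {
    chomp $n;
    $count{$n}++;                            # keys autovivify, so no holes
}
close $fh;

for my $value ( sort { $a <=> $b } keys %count ) {
    print "$value $count{$value}\n";
}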

多像笑话 2024-12-22 04:49:30


Tom Duff (creator of Duff's Device): "If your code is too slow, you must make it faster. If no better algorithm is available, you must trim cycles."

I disagree that this is a situation where a hash is the best fit. Sure, a hash is the idiomatic fit: it's the method mentioned in perlfaq4, and it doesn't waste elements in your counter container. But he's talking about 100_000_000 integers between 1 and 999. The number of elements used in a counter container is insignificant. The number of iterations necessary to get a count is very significant, and 100,000,000 iterations takes a lot of time.

If we use an array instead, we throw away the element indexed at zero, and if all of the integers happen to be the same value, we throw away 998 more elements. Is that such a big deal? On the other hand, Big-O notation only tells part of the story: indexing into an array and indexing into a hash are both O(1) operations, and where 'n' is the total number of integers (100,000,000), the array approach and the hash approach are both O(n). So it comes down to which approach is more computationally efficient per step. Even though an array lookup and a hash lookup are both O(1), it turns out the hash lookup requires significantly more cycles to perform.

Iterating over 100,000,000 integers and incrementing counters takes time. But it turns out that it takes more time doing so inside of a hash than it does in an array. I know this is sacrilege from a "common idioms" standpoint. But this may be a very special case where computational efficiency is more important than idiomatic code, and more important than the very slightly larger memory footprint of using an array as a counter.

Here is some code demonstrating what I'm talking about:

use strict;
use warnings;
use Benchmark  qw/ cmpthese /;
use List::Util qw/ max min  /;
use Test::More;
use Readonly;



# Full benchmark size, plus a smaller size for the sanity tests.
Readonly my $datasize  => 100_000_000;
Readonly my $test_size => 100_000;

# Both subs return a hashref, so dereference it before taking keys.
my @array_results = keys %{ count_in_array( $test_size ) };
my @hash_results  = keys %{ count_in_hash(  $test_size ) };

is( max( @array_results ), 999, "max( keys count_in_array() ) in range." );
is( min( @array_results ),   1, "min( keys count_in_array() ) in range." );
is( max( @hash_results  ), 999, "max( keys  count_in_hash() ) in range." );
is( min( @hash_results  ),   1, "min( keys  count_in_hash() ) in range." );
done_testing();



# Benchmark both approaches at full size; called with no argument,
# the subs default to $datasize iterations.
cmpthese( 5, {
    array => sub{ my $results = count_in_array() },
    hash  => sub{ my $results = count_in_hash()  },
} );




sub count_in_array {
    my $iterations = shift // $datasize;   # optional count, used by the tests
    my @container;
    for ( 1 .. $iterations ) {
        $container[ int( rand( 999 ) ) + 1 ]++;
    }
    # Squash the holes: keep only the slots that were actually
    # incremented (untouched slots are undef and would warn).
    return {
        map  { $_ => $container[$_] }
        grep { defined $container[$_] }
        1 .. $#container
    };
}



sub count_in_hash {
    my $iterations = shift // $datasize;   # optional count, used by the tests
    my %container;
    for ( 1 .. $iterations ) {
        $container{ int( rand( 999 ) ) + 1 }++;
    }
    return \%container;
}

And here are the results of that benchmark:

ok 1 - max( keys count_in_array() ) in range.
ok 2 - min( keys count_in_array() ) in range.
ok 3 - max( keys  count_in_hash() ) in range.
ok 4 - min( keys  count_in_hash() ) in range.
1..4
      s/iter  hash array
hash    24.9    --  -42%
array   14.5   72%    --

That's a big win for the array approach (it is 72% faster). If that's still too slow, trim cycles by using Inline::C to rewrite the array approach using an array of ints. That will be an order of magnitude faster still.
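
As a rough, untested sketch of that Inline::C suggestion (the function name count_buffer and the whitespace-separated input format are my assumptions, not anything from the original post):

use strict;
use warnings;
use Inline C => <<'END_C';
/* Count whitespace-separated integers (1..999) in a string buffer
   using a plain C int array; return an array ref of counts
   indexed by value. */
SV* count_buffer( char* buf ) {
    int  counts[1000] = { 0 };
    int  i, n;
    char *p = buf;
    AV  *av;
    while ( *p ) {
        while ( *p && ( *p < '0' || *p > '9' ) ) p++;  /* skip separators */
        if ( !*p ) break;
        n = 0;
        while ( *p >= '0' && *p <= '9' ) n = n * 10 + ( *p++ - '0' );
        if ( n >= 1 && n <= 999 ) counts[n]++;
    }
    av = newAV();
    for ( i = 0; i <= 999; i++ ) av_push( av, newSViv( counts[i] ) );
    return newRV_noinc( (SV*)av );
}
END_C

# Slurp the whole file (roughly 400 MB at 1e8 short numbers) and make
# a single call into C; reading in chunks would cut the memory cost.
local $/;
open my $fh, '<', 'numbers.txt' or die "Can't open numbers.txt: $!";
my $counts = count_buffer( scalar <$fh> );
close $fh;
print "$_ $counts->[$_]\n" for grep { $counts->[$_] } 1 .. 999;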

This could be that critical 3% where optimization is warranted.

So what do we do to mitigate the effects of breaking away from a commonly recognized idiom? We make sure to document what we're doing so that future readers (including ourselves in the future) understand what is being done, and why it's being done in a way that isn't as familiar to us.

Just my .02.

我不会写诗 2024-12-22 04:49:30


Depending on the quality of your system's sort utility, at the command line:

sort giant-file.txt | uniq -c > giant_file_counts.txt