Can you reserve the size of a hash table in Perl?

Posted 2024-11-18 22:07:16


I'm creating a hash table of unknown size in Perl.

The hash table maps a string to a reference to an array.

The main loop of my application adds 5-10 elements to the hash table in each iteration. As the hash table fills up, things start to slow down drastically. From observation, once there are ~50k keys in the hash table, adding keys slows down by roughly a factor of 20.

I postulate that the hash table has become full, and collisions are occurring. I would like to 'reserve' the size of the hash table, but I'm unsure how.


The hash in question is %hNgramsToWord.

For each word, every ngram of length 1 up to the full length of the word is added as a key, mapping to a reference to an array of the words that contain that ngram.

For example:

AddToNgramHash("Hello");

[h, e, l, l, o, he, el, ll, lo, hel, llo, hell, ello, hello ] are all added as keys, mapping to "hello"

# Add every ngram of $word as a key of %hNgramsToWord, appending $word
# to that ngram's word list. Returns the number of distinct keys so far.
sub AddToNgramHash($) {
    my $word = shift;
    my @aNgrams = MakeNgrams($word);
    foreach my $ngram (@aNgrams) {
       # Copy the ngram's existing word list (if any) into a fresh array,
       # append the new word, and store a reference to the copy back.
       my @aWords;
       if (defined($hNgramsToWord{$ngram})) {
          @aWords = @{$hNgramsToWord{$ngram}};
       }
       push(@aWords, $word);
       $hNgramsToWord{$ngram} = \@aWords;
    }
    return scalar keys %hNgramsToWord;
}

# Return every substring (ngram) of $word, for all lengths 1..length($word).
sub MakeNgrams($) {
    my $word = shift;
    my $len = length($word);
    my @aNgrams;
    for my $ngs (1 .. $len) {          # current ngram length
        for (0 .. $len - $ngs) {       # starting offset within the word
            my $ngram = substr($word, $_, $ngs);
            push(@aNgrams, $ngram);
        }
    }
    return @aNgrams;
}
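
For reference, a small driver (hypothetical, not part of the original post) that exercises these two subs:

our %hNgramsToWord;   # the hash the subs above populate

my $count = AddToNgramHash("hello");
print "$count distinct ngram keys\n";   # "hello" yields 14 distinct ngrams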


Answer (抽个烟儿, 2024-11-25 22:07:16):


You can set the number of buckets for a hash like so:

keys(%hash) = 128;

The number will be rounded up to a power of two.
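
For example, a minimal sketch (hash name hypothetical) that preallocates buckets before a bulk load:

my %hNgramsToWord;
keys(%hNgramsToWord) = 100_000;   # request room for 100k keys; Perl rounds
                                  # the bucket count up to a power of two (131072)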

That said, it is very unlikely that the slowdown you see is due to excess hash collisions, since Perl will dynamically expand the number of buckets as needed. And since 5.8.2, it will even detect pathological data that results in a given bucket being overused and reconfigure the hashing function for that hash.
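
If you want to check bucket usage yourself, Hash::Util exposes it on newer Perls; a sketch (num_buckets and used_buckets shipped with Perl 5.26, so availability depends on your version):

use Hash::Util qw(num_buckets used_buckets);

my %h = map { $_ => 1 } 1 .. 1000;
printf "%d of %d buckets used\n", used_buckets(%h), num_buckets(%h);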

Show your code, and we will likely be able to help find the real problem.

A demonstration with a large number of hash keys (don't let it run until you are out of memory...):

use strict;
use warnings;

my $start = time();
my %hash;

# Every second, report the average insertion rate, the key count, and the
# bucket usage. Note: scalar(%hash) reports "used/allocated" buckets on
# Perls before 5.26; on 5.26+ it returns the key count instead.
$SIG{ALRM} = sub {
    alarm 1;    # re-arm the timer
    printf(
        "%.0f keys/s; %d keys, %s buckets used\n",
        keys(%hash) / (time() - $start),
        scalar(keys(%hash)),
        scalar(%hash)
    );
};
alarm 1;
$hash{rand()}++ while 1;    # insert random keys until interrupted

Once there are a LOT of keys, you will notice a perceptible slowdown when it needs to expand the number of buckets, but it still maintains a pretty even pace.

Looking at your code, the more words are loaded, the more work it has to do for each word: every insertion copies the ngram's entire existing word list into a new array before storing it back, so the cost per ngram grows with the length of that list.

You can fix it by changing this:

   my @aWords;
   if(defined($hNgramsToWord{$ngram})) {
      @aWords = @{$hNgramsToWord{$ngram}};
   }
   push (@aWords, $word);
   $hNgramsToWord{$ngram} = \@aWords;

to this:

   push @{ $hNgramsToWord{$ngram} }, $word;

No need to copy the array on every insertion, and no need to check whether the ngram already has an entry: push will autovivify an array reference for you.
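
Applied to the question's sub, the whole loop then reduces to a sketch like:

sub AddToNgramHash($) {
    my $word = shift;
    foreach my $ngram (MakeNgrams($word)) {
        # Autovivifies $hNgramsToWord{$ngram} as an array ref on first
        # use, then appends in place: no per-insert copying of the list.
        push @{ $hNgramsToWord{$ngram} }, $word;
    }
    return scalar keys %hNgramsToWord;
}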
