Very fast lookups in Perl: can a hash be saved and reloaded?

Posted 2025-01-08 07:07:34


I have about 100 million rows such as:

A : value of A
B : value of B
|
|
|
Z : value of Z  up to 100 million unique entries

Currently, each time I run my program I load the entire file into a hash, which takes some time. At run time I need access to the value of A, B, etc., given that I know the key A, B, etc.

I am wondering if I can build the hash once and store it as a binary data structure, or index the file. What would be possible in Perl with the least programming?

Thanks!
-Abhi

Comments (3)

半世蒼涼 2025-01-15 07:07:34


I suggest an on-disk key/value database. Thanks to Perl's tie function, they can be used exactly like normal in-memory hashes. If your hash is very large, they'll be faster than Perl's hashes for reading and writing, and they handle saving to and loading from disk automatically.

BerkeleyDB is an old favourite:

use BerkeleyDB;
# Make %db an on-disk database stored in database.dbm. Create file if needed
tie my %db, 'BerkeleyDB::Hash', -Filename => "database.dbm", -Flags => DB_CREATE
    or die "Couldn't tie database: $BerkeleyDB::Error";

$db{foo} = 1;            # set a value
print $db{foo}, "\n";    # get a value
for my $key (keys %db) {
    print "$key -> $db{$key}\n";  # iterate over all entries
}

%db = ();  # wipe the database

Changes to the database are automatically saved to disk and will persist through multiple invocations of your script.

Check the perldoc for options, but the most important are:

# Increase memory allocation for database (increases performance), e.g. 640 MB
tie my %db, 'BerkeleyDB::Hash', -Filename => $filename, -CacheSize => 640*1024*1024;

# Open database in readonly mode
tie my %db, 'BerkeleyDB::Hash', -Filename => $filename, -Flags => DB_RDONLY;

A more complex but much faster database library would be Tokyo Cabinet, and there are of course many other options (this is Perl after all...)
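
To build the database once from the original file, a one-time loader along these lines should work (a minimal sketch; the input file name input.txt and the "KEY : value" line format are assumptions based on the question):

use strict;
use warnings;
use BerkeleyDB;

# Build (or reuse) the on-disk database
tie my %db, 'BerkeleyDB::Hash',
    -Filename => "database.dbm",
    -Flags    => DB_CREATE
    or die "Couldn't tie database: $BerkeleyDB::Error";

open my $fh, '<', 'input.txt' or die "Can't open input.txt: $!";
while (my $line = <$fh>) {
    chomp $line;
    # Split "KEY : value" on the first colon only
    my ($key, $value) = split /\s*:\s*/, $line, 2;
    $db{$key} = $value if defined $value;
}
close $fh;
untie %db;

After this one-time pass, later runs can tie the same file with DB_RDONLY and look up any key immediately.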

请止步禁区 2025-01-15 07:07:34


Have a look at Storable - it should do what you want and is extremely simple to use:

use Storable;
store \%table, 'file';        # serialize the hash to a binary file
$hashref = retrieve('file');  # load it back on a later run

This only helps if your program is actually limited by CPU speed, of course. Since your data structure is very simple, you may be parsing it faster than you can read it from disk. Storable isn't going to help you much in that case.
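
One common pattern is to parse the raw file once, cache the hash with store, and have every later run just retrieve it (a minimal sketch; the file names input.txt and table.storable are assumptions):

use strict;
use warnings;
use Storable qw(store retrieve);

my $cache = 'table.storable';
my %table;

if (-e $cache) {
    # Later runs: a single fast binary load instead of re-parsing
    %table = %{ retrieve($cache) };
}
else {
    # First run: parse the raw "KEY : value" file, then cache it
    open my $fh, '<', 'input.txt' or die "Can't open input.txt: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my ($k, $v) = split /\s*:\s*/, $line, 2;
        $table{$k} = $v if defined $v;
    }
    close $fh;
    store \%table, $cache;
}

Note that retrieve still reads the whole structure into memory, so this speeds up loading but does not reduce the memory footprint.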

只是偏爱你 2025-01-15 07:07:34


I recommend Tie::File: it is included in the Perl core, and it does not load your entire data structure into memory; instead it reads individual records from disk as needed.
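
Tie::File maps the lines of a file to a Perl array, so it avoids loading everything, but a lookup by key still means scanning lines; a minimal sketch (input.txt and the "KEY : value" layout are assumptions):

use strict;
use warnings;
use Tie::File;

# Tie @lines to the file; records are fetched from disk on demand
tie my @lines, 'Tie::File', 'input.txt'
    or die "Can't tie input.txt: $!";

# Linear scan for one key -- O(n) per lookup, so this suits
# occasional lookups rather than millions of them
my $wanted = 'A';
for my $line (@lines) {
    if ($line =~ /^\Q$wanted\E\s*:\s*(.*)$/) {
        print "$wanted -> $1\n";
        last;
    }
}
untie @lines;

For frequent keyed lookups, the DBM-style answers above will scale better.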
