SQLite optimization for millions of entries?

Posted 2024-09-24 19:37:15


Comments (3)

云雾 2024-10-01 19:37:15


I'd be inclined to use a hash instead of SQLite to do what you want to do. A hash is optimized to test for existence without the need to keep the values in any sorted order and with no need to keep a redundant copy of the datum in an index. The hash algorithm applied to the datum yields the location where it would be stored, if it did exist; you can seek to that location and see if it's there. I don't think you'd need to keep the hash table in RAM.

Here's how you might take a hybrid hash/SQLite approach.

Create a SQLite table

CREATE TABLE STORE (
    id     INTEGER PRIMARY KEY,
    BUCKET INTEGER,
    URL    TEXT,        -- not indexed; each URL stored only once
    status
);
CREATE INDEX store_bucket_idx ON STORE (BUCKET);

You could have three of these tables, STORE1, STORE2, and STORE3 if you want to keep them separate by status.

Let's assume that there will be 250,000,001 distinct buckets in each store. (You can experiment with this number; make it a prime number).

Find a hashing algorithm that takes two inputs, the URL string and 250,000,001, and returns a number between 1 and 250,000,001.
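
The answer doesn't name a particular algorithm; as one possibility, here is a minimal Perl sketch that uses Digest::MD5 (an assumption on my part) to map a URL string and a bucket count to a number in that range:

use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Map a URL string and a bucket count to a bucket number in 1 .. $buckets.
sub bucket_for {
    my ($url, $buckets) = @_;
    # The first 8 hex digits of the MD5 form a 32-bit integer; reduce it
    # modulo the bucket count and shift into the 1-based range.
    return (hex(substr(md5_hex($url), 0, 8)) % $buckets) + 1;
}

print bucket_for("http://example.com/", 250_000_001), "\n";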

When you get a URL, feed it to the hashing algorithm and it will tell you which BUCKET to look in:

SELECT * FROM STORE WHERE BUCKET = {the value returned by your hash function};

Your index on the BUCKET field will quickly return the rows, and you can examine the URLs. If the current URL is not one of them, add it:

INSERT INTO STORE (BUCKET, URL) VALUES ({your hash return value}, theURL);

SQLite will be indexing integer values, which I think will be more efficient than indexing the URL. And the URL will be stored only once.
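
As a rough end-to-end sketch (my addition, not part of the answer), the lookup-then-insert step could look like this in Perl with DBI/DBD::SQLite, assuming the STORE table above has been created; the file name is illustrative, and the bucket_for helper from the earlier sketch is repeated so the snippet runs on its own:

use strict;
use warnings;
use DBI;
use Digest::MD5 qw(md5_hex);

my $buckets = 250_000_001;

# Same helper as sketched earlier, repeated for self-containment.
sub bucket_for {
    my ($url) = @_;
    return (hex(substr(md5_hex($url), 0, 8)) % $buckets) + 1;
}

my $dbh = DBI->connect("dbi:SQLite:dbname=store.db", "", "",
                       { RaiseError => 1, AutoCommit => 1 });

my $select = $dbh->prepare("SELECT URL FROM STORE WHERE BUCKET = ?");
my $insert = $dbh->prepare("INSERT INTO STORE (BUCKET, URL) VALUES (?, ?)");

# Return 1 if the URL was new and has been stored, 0 if it was already present.
sub add_if_new {
    my ($url) = @_;
    my $bucket = bucket_for($url);
    $select->execute($bucket);
    while (my ($seen) = $select->fetchrow_array) {
        if ($seen eq $url) {
            $select->finish;
            return 0;                 # duplicate: already in this bucket
        }
    }
    return $insert->execute($bucket, $url) ? 1 : 0;
}

print add_if_new("http://example.com/") ? "stored\n" : "already seen\n";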

习ぎ惯性依靠 2024-10-01 19:37:15


If $db is undefined, opening the database is failing, and you should inspect $! and $BerkeleyDB::Error to see why.

Have you created the database already? If not, you need -Flags => DB_CREATE.

Working example:

use strict;
use warnings;
use BerkeleyDB;

my $dbFolder = '/home/ysth/bdbtmp/';

my $db  = BerkeleyDB::Hash->new (
    -Filename => "$dbFolder/fred.db", 
    -Flags => DB_CREATE,
) or die "couldn't create: $!, $BerkeleyDB::Error.\n";

my $status = $db->db_put("apple", "red");

I couldn't get BerkeleyDB::Env to do anything useful, though; whatever I tried, the constructor returned undef.
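
As a small follow-up sketch (my addition, not from the original answer): once the handle above is open, the existence test the question is really about is a db_get on the same hash, which returns 0 when the key is present and fills in the value:

my $value;
if ($db->db_get("apple", $value) == 0) {
    print "apple already stored, value is '$value'\n";
} else {
    print "apple not seen yet\n";
}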

美男兮 2024-10-01 19:37:15


I don't know if this is optimal, but you could set up your SQLite DB such that the "good" table has a unique constraint on the URL column. You probably don't have enough RAM to do the comparisons in Perl (naive solution would be to create a hash where the URLs are the keys, but if you have a billion pages you'll need an awful lot of memory).

When it comes time to do an insert, the database will enforce the uniqueness and throw some kind of error when it tries to insert a duplicated URL. You can catch this error and ignore it, as long as DBI returns different error values for different error messages.
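
A minimal Perl/DBI sketch of that approach (my addition; the database file, table, and column names are illustrative). With RaiseError disabled, a duplicate insert simply makes execute() return false, which can be treated as "already seen":

use strict;
use warnings;
use DBI;

my $dbh = DBI->connect("dbi:SQLite:dbname=pages.db", "", "",
                       { RaiseError => 0, PrintError => 0, AutoCommit => 1 });

# The "good" table with a unique constraint on the URL column.
$dbh->do(q{
    CREATE TABLE IF NOT EXISTS good (
        id  INTEGER PRIMARY KEY,
        URL TEXT NOT NULL UNIQUE
    )
});

my $insert = $dbh->prepare("INSERT INTO good (URL) VALUES (?)");

sub remember {
    my ($url) = @_;
    # A duplicate URL violates the UNIQUE constraint; execute() then returns
    # false and $dbh->errstr holds the constraint error, which we ignore.
    return $insert->execute($url) ? 1 : 0;
}

print remember("http://example.com/") ? "new\n" : "duplicate\n";
print remember("http://example.com/") ? "new\n" : "duplicate\n";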
