BerkeleyDB - B-Tree vs. Hash Table

Posted 2024-10-01 14:33:41


I am trying to understand what should drive the choice of access method when using BerkeleyDB: B-tree versus hash table.
A hash table provides O(1) lookup, but inserts are expensive (using linear/extensible hashing we get amortized O(1) inserts). B-trees provide O(log N) (base B) lookup and insert times. A B-tree can also support range queries and allows access in sorted order.

  1. Apart from these considerations, what else should be factored in?
  2. If I don't need to support range queries, can I just use the Hash access method?
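To illustrate the range-query point in question 2, here is a minimal Python sketch. The dict stands in for the Hash access method and a sorted list with `bisect` stands in for a B-tree's ordered keys; both stand-ins are illustrative assumptions, not BerkeleyDB itself:

```python
import bisect

# A hash table (Python dict) gives O(1) point lookups but has no key order,
# so a range query must scan every key.
hash_index = {"cherry": 3, "apple": 1, "banana": 2, "durian": 4}
range_scan = sorted(k for k in hash_index if "apple" <= k <= "cherry")

# A sorted index (stand-in for a B-tree's ordered leaves) answers the same
# range query with two binary searches plus a slice.
keys = sorted(hash_index)                     # ['apple', 'banana', 'cherry', 'durian']
lo = bisect.bisect_left(keys, "apple")
hi = bisect.bisect_right(keys, "cherry")
range_btree = keys[lo:hi]

print(range_scan)     # ['apple', 'banana', 'cherry']
print(range_btree)    # ['apple', 'banana', 'cherry']
```

The scan is O(n) in the total number of keys; the sorted-index version is O(log n) plus the size of the result, which is why ordered access methods are the natural fit for range queries.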


遗弃M 2024-10-08 14:33:41


When your data sets get very large, B-trees are still better, because the majority of the internal metadata may still fit in cache. Hashes, by their nature (a uniform random distribution of data), are inherently cache-unfriendly. That is, once the total size of the data set exceeds the working memory size, hash performance drops off a cliff, while B-tree performance degrades gracefully (logarithmically, actually).
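The access patterns behind this answer can be gestured at in a small Python sketch: visiting the same array sequentially (as consecutive B-tree keys would be) versus in uniformly random order (as hashed keys are). Interpreter overhead dominates the absolute numbers, and the real effect at disk/page-cache granularity is far larger, so treat this purely as an illustration of the two access patterns:

```python
import random
import time

N = 1_000_000
data = list(range(N))

seq_order = range(N)                      # B-tree-like: neighbouring keys visited together
rand_order = random.sample(range(N), N)   # hash-like: keys scattered uniformly at random

def touch(order):
    """Sum the array in the given visiting order, timing the pass."""
    start = time.perf_counter()
    total = 0
    for i in order:
        total += data[i]
    return total, time.perf_counter() - start

seq_sum, seq_t = touch(seq_order)
rand_sum, rand_t = touch(rand_order)
assert seq_sum == rand_sum    # identical work, different locality
print(f"sequential: {seq_t:.3f}s  random: {rand_t:.3f}s")
```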

对你而言 2024-10-08 14:33:41


It depends on your data set and keys. On small data sets your benchmarks will be close to the same; on larger data sets it can vary depending on the type of keys and how much data you have. Usually a B-tree is better, until the B-tree metadata exceeds your cache and it ends up doing lots of I/O, in which case hash is better. Also, as you pointed out, B-tree inserts are more expensive: if you find you will be doing lots of inserts and few reads, hash may be better; if you do few inserts but lots of reads, a B-tree may be better.

If you are really concerned about performance, you could try both methods and run your own benchmarks =]
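A minimal harness along the lines this answer suggests might look like the following Python sketch. The empty dict is only a stand-in: the idea is that any dict-like store (for example, a handle for either BerkeleyDB access method, wrapped to support subscripting) could be dropped into `bench` and the two timings compared on your own keys:

```python
import random
import time

def bench(store, keys):
    """Time a bulk insert and then 1000 random lookups against a dict-like store."""
    t0 = time.perf_counter()
    for k in keys:
        store[k] = 1
    t_insert = time.perf_counter() - t0

    probe = random.sample(keys, 1000)
    t0 = time.perf_counter()
    hits = sum(1 for k in probe if k in store)
    t_lookup = time.perf_counter() - t0
    return t_insert, t_lookup, hits

keys = [f"key-{i:08d}" for i in range(100_000)]
random.shuffle(keys)                      # simulate keys arriving in random order
t_ins, t_look, hits = bench({}, keys)
assert hits == 1000                       # every probed key should be found
print(f"insert: {t_ins:.3f}s  1000 lookups: {t_look:.3f}s")
```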

耶耶耶 2024-10-08 14:33:41


For many applications, a database is accessed at random, interactively
or with "transactions". This might happen if you have data coming in
from a web server. However, you often have to populate a large
database all at once, as a "batch" operation. This might happen if you
are doing a data analysis project, or migrating an old database to a
new format.

When you are populating a database all at once, a B-Tree or other
sorted index is preferable because it allows the batch insertions to
be done much more efficiently. This is accomplished by sorting the
keys before putting them into the database. Populating a BerkeleyDB
database with 10 million entries might take an hour when the entries
are unsorted, because every access is a cache miss. But when the
entries are sorted, the same procedure might take only ten minutes.
The proximity of consecutive keys means you'll be utilizing various
caches for almost all of the insertions. Sorting can be done very
quickly, so the whole operation could be sped up by several times just
by sorting the data before inserting it. With hashtable indexing,
because you don't know in advance which keys will end up next to each
other, this optimization is not possible.
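The sorted-batch-insert effect can be sketched with Python's stdlib sqlite3, whose primary-key index is also B-tree backed. This is an in-memory analogue, so the gap will be much smaller than in the disk-bound runs reported below; it only illustrates the mechanism, and no timing claim is asserted:

```python
import random
import sqlite3
import time

def populate(keys):
    """Bulk-insert keys into a fresh table with a B-tree primary-key index."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v INTEGER)")
    t0 = time.perf_counter()
    con.executemany("INSERT INTO kv VALUES (?, 1)", ((k,) for k in keys))
    con.commit()
    elapsed = time.perf_counter() - t0
    n = con.execute("SELECT COUNT(*) FROM kv").fetchone()[0]
    con.close()
    return n, elapsed

keys = [f"{i:09d}" for i in range(200_000)]
shuffled = keys[:]
random.shuffle(shuffled)

n_shuf, t_shuf = populate(shuffled)          # keys arrive in random order
n_sort, t_sort = populate(sorted(shuffled))  # same keys, pre-sorted
assert n_shuf == n_sort == 200_000
print(f"shuffled: {t_shuf:.2f}s  sorted: {t_sort:.2f}s")
```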

Update: I decided to provide an actual example. It is based on the
following script, "db-test":

#!/usr/bin/perl
# Usage: db-test BerkeleyDB::Btree|BerkeleyDB::Hash < input
use warnings;
use strict;
use BerkeleyDB;
my %hash;
unlink "test.db";                 # start from a fresh database file
# Tie %hash to test.db using the access-method class named on the command line.
tie %hash, (shift), -Filename=>"test.db", -Flags=>DB_CREATE or die;
while(<>) { $hash{$_}=1; }        # one record per input line; the line is the key
untie %hash;

We can test it with a Wikipedia dump index file of 16 million entries. (I'm running this on an 800MHz 2-core laptop, with 3G of memory)

$ >enw.tab bunzip2 <enwiki-20151102-pages-articles-multistream-index.txt.bz2
$ wc -l enw.tab
16050432 enw.tab
$ du -shL enw.tab
698M    enw.tab
$ time shuf enw.tab > test-shuf
  16.05s user 6.65s system 67% cpu 33.604 total
$ time sort enw.tab > test-sort
  70.99s user 10.77s system 114% cpu 1:11.47 total
$ time ./db-test BerkeleyDB::Btree < test-shuf
  682.75s user 368.58s system 42% cpu 40:57.92 total
$ du -sh test.db
1.3G    test.db
$ time ./db-test BerkeleyDB::Btree < test-sort
  378.10s user 10.55s system 91% cpu 7:03.34 total
$ du -sh test.db
923M    test.db
$ time ./db-test BerkeleyDB::Hash < test-shuf
  672.21s user 387.18s system 39% cpu 44:11.73 total
$ du -sh test.db
1.1G    test.db
$ time ./db-test BerkeleyDB::Hash < test-sort
  665.94s user 376.65s system 36% cpu 46:58.66 total
$ du -sh test.db
1.1G    test.db

You can see that pre-sorting the Btree keys drops the insertion time
down from 41 minutes to 7 minutes. Sorting takes only 1 minute, so
there's a big net gain - the database creation goes 5x faster. With
the Hash format, the creation times are equally slow whether the input
is sorted or not. Also note that the database file size is smaller for
the sorted insertions; presumably this has to do with tree balancing.

The speedup must be due to some kind of caching, but I'm not sure
where. It is likely that we have fewer cache misses in the kernel's
page cache with sorted insertions. This would be consistent with the
CPU usage numbers - when there is a page cache miss, then the process
has to wait while the page is retrieved from disk, so the CPU usage is
lower.

I ran the same tests with two smaller files as well, for comparison.

File       | WP index         | Wikt. words       | /usr/share/dict/words
Entries    | 16e6             | 4.7e6             | 1.2e5
Size       | 700M             | 65M               | 1.1M
shuf time  | 34s              | 4s                | 0.06s
sort time  | 1:10s            | 6s                | 0.12s
-------------------------------------------------------------------------
           | total  DB   CPU  | total  DB   CPU   | total DB    CPU
           | time  size  usage| time  size  usage | time  size  usage
-------------------------------------------------------------------------
Btree shuf | 41m,  1.3G, 42%  | 5:00s, 180M, 88%  | 6.4s, 3.9M, 86%
      sort | 7m,   920M, 91%  | 1:50s, 120M, 99%  | 2.9s, 2.6M, 97%
Hash  shuf | 44m,  1.1G, 39%  | 5:30s, 129M, 87%  | 6.2s, 2.4M, 98%
      sort | 47m,  1.1G, 36%  | 5:30s, 129M, 86%  | 6.2s, 2.4M, 94%
-------------------------------------------------------------------------
Speedup    | 5x               | 2.7x              | 2.2x

With the largest dataset, sorted insertions give us a 5x speedup.
With the smallest, we still get a 2x speedup - even though the data
fits easily into memory, so that CPU usage is always high. This seems
to imply that we are benefiting from another source of efficiency in
addition to the page cache, and that the 5x speedup was actually due
in equal parts to page cache and something else - perhaps the better
tree balancing?

In any case, I tend to prefer the Btree format because it allows
faster batch operations. Even if the final database is accessed at
random, I use batch operations for development, testing, and
maintenance. Life is easier if I can find a way to speed these up.

云朵有点甜 2024-10-08 14:33:41


To quote the two main authors of Berkeley DB in this write-up of the architecture:

The main difference between Btree and Hash access methods is that
Btree offers locality of reference for keys, while Hash does not. This
implies that Btree is the right access method for almost all data
sets; however, the Hash access method is appropriate for data sets so
large that not even the Btree indexing structures fit into memory. At
that point, it's better to use the memory for data than for indexing
structures. This trade-off made a lot more sense in 1990 when main
memory was typically much smaller than today.

So perhaps on embedded devices and in specialized use cases a hash table may work. B-trees are used in modern filesystems like Btrfs, and the B-tree is pretty much the ideal data structure for building either databases or filesystems.
