B+tree node size
I'm planning on writing a simple key/value store with a file architecture similar to CouchDB, i.e. an append-only b+tree.
I've read everything I can find on B+trees and also everything I can find on CouchDB's internals, but I haven't had time to work my way through the source code (it's in a very different language, which makes that a project in its own right).
So I have a question about the sizing of B+tree nodes: given that key length is variable, is it better to keep the nodes the same length (in bytes), or to give them the same number of keys/child-pointers regardless of how big they become?
I realise that in conventional databases the B+tree nodes are kept at a fixed length in bytes (e.g. 8K) because space in the data files is managed in fixed-size pages. But in an append-only file scheme where the documents can be any length and updated tree nodes are written after them, there seems to be no advantage to having a fixed-size node.
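To make the two options concrete, here's a minimal sketch of the split policies in Python; the encoding and all names (`MAX_NODE_BYTES`, `MAX_KEYS`, `serialized_size`) are illustrative assumptions, not CouchDB's actual on-disk format:

```python
MAX_NODE_BYTES = 4096   # policy A: fixed byte budget per node
MAX_KEYS = 32           # policy B: fixed fan-out per node

def serialized_size(entries):
    # Assume a simple length-prefixed encoding: a 4-byte key length,
    # the key bytes, and an 8-byte child pointer (file offset) per entry.
    return sum(4 + len(key) + 8 for key, _child in entries)

def needs_split_fixed_bytes(entries):
    # Policy A: split when the encoded node would overflow the byte budget.
    # Node size on disk is bounded, but fan-out varies with key length.
    return serialized_size(entries) > MAX_NODE_BYTES

def needs_split_fixed_fanout(entries):
    # Policy B: split when the entry count exceeds a fixed fan-out.
    # Tree shape is predictable, but node size on disk varies freely --
    # which an append-only file tolerates, since nodes are never
    # rewritten in place.
    return len(entries) > MAX_KEYS
```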
1 Answer
The goal of a b-tree is to minimize the number of disk accesses. If the file system cluster size is 4k, then the ideal size for the nodes is 4k. Also, the nodes should be properly aligned. A misaligned node will cause two clusters to be read, reducing performance.
With a log-based storage scheme, choosing a 4k node size is probably the worst choice unless gaps are created in the log to improve alignment. Otherwise, 99.98% of the time a node is stored across two clusters. With a 2k node size, the odds of this happening are just under 50%. However, there's a problem with a small node size: the average height of the b-tree increases, and the time spent reading a disk cluster is not fully utilized.
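Those percentages fall out of simple offset arithmetic. A quick sanity check in Python, assuming nodes are appended at uniformly random byte offsets in the log:

```python
def straddle_probability(node_size, cluster=4096):
    """Probability that a node written at a uniformly random byte
    offset crosses a cluster boundary."""
    if node_size > cluster:
        return 1.0
    # The node stays within one cluster only when it starts at one of
    # the first (cluster - node_size + 1) offsets within a cluster.
    fits = cluster - node_size + 1
    return 1 - fits / cluster

print(straddle_probability(4096))  # ~0.9998 -> "99.98% of the time"
print(straddle_probability(2048))  # ~0.4998 -> "just under 50%"
```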
Larger node sizes reduce the height of the tree, but past a point they too increase the number of disk accesses, since a single node then spans multiple clusters. Larger nodes also increase the overhead of maintaining the entries within the node. Imagine a b-tree where the node size is large enough to encapsulate the entire database: you'd have to embed a better data structure within the node itself, perhaps another b-tree?
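To put rough numbers on the height trade-off, a back-of-the-envelope in Python (the entry count and fan-outs are arbitrary, for illustration only):

```python
import math

def btree_height(num_entries, fanout):
    # Each level multiplies capacity by the fan-out, so the height
    # grows with the log (base fan-out) of the entry count.
    return math.ceil(math.log(num_entries, fanout))

N = 100_000_000  # 100M entries, purely illustrative
for fanout in (16, 128, 1024):
    print(fanout, btree_height(N, fanout))
# fan-out   16 -> height 7
# fan-out  128 -> height 4
# fan-out 1024 -> height 3
```

Bigger nodes buy a shallower tree, but only up to the point where reading one node already costs several cluster reads.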
I spent some time prototyping a b-tree implementation over an append-only log format and eventually rejected the concept altogether. To compensate for performance losses due to node/cluster misalignment, you need to have a very large cache. A more traditional storage approach can make better use of the RAM.
The final blow was when I evaluated the performance of randomly-ordered inserts. This kills the performance of any disk-backed storage system, but log-based formats suffer far more. A write of even the smallest entry forces several nodes to be written to the log, and the internal nodes are invalidated shortly after being written. As a result, the log rapidly fills up with garbage.
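A rough illustration of why, assuming a copy-on-write b-tree in the log (the height, node size, and entry size below are made-up numbers, not measurements):

```python
def log_bytes_per_insert(height, node_size):
    # In an append-only (copy-on-write) b-tree, one insert rewrites the
    # entire root-to-leaf path: `height` fresh node images are appended,
    # and the previous images of those nodes all become garbage at once.
    return height * node_size

# A 100-byte entry in a 4-level tree with 4k nodes appends ~16 KiB of
# log -- a write amplification of roughly 160x, before the cleaner
# ever runs.
print(log_bytes_per_insert(4, 4096))  # 16384
```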
BerkeleyDB-JE (BDB-JE) is also log-based, and I studied its performance characteristics too. It suffers the same problem my prototype did -- rapid accumulation of garbage. BDB-JE has several "cleaner" threads which re-append surviving records to the log, but the random order is preserved, so the newly "cleaned" log files quickly fill up with garbage of their own. The overall performance of the system degrades to the point where the only thing running is the cleaner, and it's hogging all system resources.
Log-based formats are very attractive because one can quickly implement a robust database. The Achilles' heel is the cleaner, which is non-trivial. Caching strategies are also tricky to get right.