Red-Black Tree vs. B-Tree
I have a project in which I have to achieve fast search, insert, and delete operations on data ranging from megabytes to terabytes. I have been studying data structures lately and analyzing them. To be specific, I want to introduce 3 cases and ask questions about them:
The data is far more than the memory can handle at one go (sample range: 10-15 terabytes). In this case, I would store the data structure on disk.
The data is small relative to the system's memory, so it can be stored and operated on entirely in memory for speed.
The data exceeds the available memory, but assume it is smaller than the largest possible contiguous chunk in the paging file. In this case, I store the data structure in a file on disk and memory-map that file.
The conclusions I have drawn are:
For case 1, I should use a B-tree for faster access, since it reduces the latency caused by disk rotation.
For case 2, I should use a red-black tree for faster access, since the data is in memory and the number of elements that need to be scanned in the worst case would be smaller than with a B-tree.
For case 3, I am doubtful: the page file is on disk and is accessed through native OS I/O, so would a B-tree be the better option, or a red-black tree?
I want to know where the above three conclusions are right, where they go wrong, and how I can improve performance in each of the three cases.
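To put rough numbers on my conclusions 1 and 2, here is a small sketch of the worst-case height comparison (the key count and fanout below are made-up example figures):

```cpp
#include <cmath>
#include <cstdio>

int main() {
    const double n = 1e9;   // example: 10^9 keys
    const double B = 1024;  // example B-tree fanout (keys per node)

    // Worst-case height bounds: a red-black tree is at most ~2*log2(n) deep,
    // while a B-tree with fanout B is ~log_B(n) deep.
    const double rb_height = 2.0 * std::log2(n);
    const double bt_height = std::log(n) / std::log(B);

    std::printf("red-black tree: up to ~%.0f node visits\n", rb_height);
    std::printf("B-tree (B=%.0f): ~%.0f node visits (disk reads if on disk)\n",
                B, bt_height);
    return 0;
}
```

On disk, each node visit is a separate read, which is what drives conclusion 1.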
I am using C++, with a red-black tree and a B-tree that I have both written from scratch. I am using the Boost library for file mapping.
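For reference, a minimal sketch of the kind of mapping I mean in case 3, assuming the Boost.Interprocess component is the one in use (the file name is a placeholder and the file is assumed to already exist with the right size):

```cpp
#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/mapped_region.hpp>
#include <cstdint>
#include <cstdio>

namespace bip = boost::interprocess;

int main() {
    // Hypothetical pre-sized file holding the on-disk tree nodes.
    const char* path = "tree.dat";

    // Map the whole file read/write; the OS pages it in and out on demand.
    bip::file_mapping  mapping(path, bip::read_write);
    bip::mapped_region region(mapping, bip::read_write);

    auto* base = static_cast<std::uint8_t*>(region.get_address());
    std::printf("mapped %zu bytes at %p\n", region.get_size(),
                static_cast<void*>(base));

    // Tree nodes would be laid out inside [base, base + region.get_size())
    // using offsets rather than raw pointers, so the layout survives remapping.
    return 0;
}
```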
Update 1: I was reading through this post on Stack Overflow and got some really good insights, which make me feel that the kind of comparisons I have made in these cases may be faulty. A link was posted in the most-voted answer: http://idlebox.net/2007/stx-btree/stx-btree-0.8.3/doxygen-html/speedtest.html
Comments (2)
A red/black tree is more or less equivalent to a 2-3-4 tree, which is just a type of B-tree. The worst-case performance is identical, provided you do a binary search of the B-tree node values.
The obvious disadvantage of a B-tree is wasted space, but depending on the language/memory allocator used, you may find that a 2-3-4 tree uses less space than a red-black tree on average. For instance, in 32-bit Java, there's approximately an 8-byte overhead per object. (It also depends a lot on the allocator; IIRC phkmalloc rounds up small allocations to a power-of-2 size.)
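To make the space comparison concrete, here is a minimal sketch with hypothetical 64-bit node layouts (not your actual structures); the exact numbers depend on padding and the allocator:

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical node layouts, just to illustrate per-key overhead.
// A typical red-black node carries three pointers plus a color per key.
struct RBNode {
    std::int64_t key;
    RBNode*      left;
    RBNode*      right;
    RBNode*      parent;
    bool         red;
};

// A B-tree node with fanout B amortizes its B child pointers over B-1 keys.
template <int B>
struct BTreeNode {
    std::int64_t keys[B - 1];
    BTreeNode*   children[B];
    int          num_keys;
};

int main() {
    std::printf("red-black node: %zu bytes per key\n", sizeof(RBNode));
    std::printf("B-tree node (B=64): %zu bytes total, ~%zu bytes per key when full\n",
                sizeof(BTreeNode<64>), sizeof(BTreeNode<64>) / 63);
    return 0;
}
```

The "wasted space" shows up when B-tree nodes are only partially full; when they are close to full, the per-key overhead is lower than a pointer-heavy binary node.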
To answer your cases: for simplicity, I'd go with a B-tree and run some benchmarks on various node sizes.
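As for the in-node binary search, here is a minimal sketch (the node layout and names are hypothetical, not taken from your implementation); benchmarking node sizes then amounts to varying the maximum number of keys per node and measuring:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical B-tree node: sorted keys plus one child pointer per gap.
struct Node {
    std::vector<std::int64_t> keys;      // sorted, size <= max keys per node
    std::vector<Node*>        children;  // empty for leaves, else keys.size() + 1
};

// Binary search inside one node: returns the index of the first key >= k,
// which is also the index of the child to descend into when k is absent.
inline std::size_t search_in_node(const Node& n, std::int64_t k) {
    return static_cast<std::size_t>(
        std::lower_bound(n.keys.begin(), n.keys.end(), k) - n.keys.begin());
}

// Descend from the root: O(log_B n) nodes visited, O(log2 B) comparisons each,
// so the total comparison count matches a balanced binary tree.
const Node* find(const Node* n, std::int64_t k) {
    while (n) {
        std::size_t i = search_in_node(*n, k);
        if (i < n->keys.size() && n->keys[i] == k) return n;
        if (n->children.empty()) return nullptr;   // leaf, key not present
        n = n->children[i];
    }
    return nullptr;
}
```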
Robert Sedgewick's experimental data (freely available online) shows that the red capacity is constantly moving up the black tree and being swapped out at the top of the black tree; the quadratic cost of this is identical to that of other tree algorithms.
The red-black tree algorithm suffers from a global red waterfall (red fountain).
The B-tree and the red-black tree have the same fundamental quadratic swapping cost, which is exactly identical to that of a very simple static list sort.
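As a concrete baseline for that "very simple static list sorting", maintaining a plain sorted array works: each insert shifts up to n elements, so n inserts cost O(n^2) element moves in total. A tiny sketch, for illustration only:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

// Keep a plain sorted array ("static list") up to date by shifting on insert.
// Each insert moves O(n) elements, so n inserts cost O(n^2) moves in total.
void insert_sorted(std::vector<std::int64_t>& list, std::int64_t key) {
    auto pos = std::lower_bound(list.begin(), list.end(), key);
    list.insert(pos, key);  // shifts every later element one slot to the right
}

int main() {
    std::vector<std::int64_t> list;
    for (std::int64_t k : {5, 1, 4, 2, 3}) insert_sorted(list, k);
    for (std::int64_t k : list) std::printf("%lld ", static_cast<long long>(k));
    std::printf("\n");
    return 0;
}
```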
In order to visualize the invisible hidden entropy flow, you must interpret everything as the generalized red-green-black tree. Each branch in the 6-B-tree (or larger) is decomposed this way.
All the extra capacity must be interpreted as green.
The central node must be black, and the outermost nodes must be colored red.
The central black node is the invariant support (backbone) and it must not be touched except when explicitly needed. The outermost red nodes must be empty, because they are the main swap-out channels.
The red Carnot channel must always be larger than the green channel; if the green channel exceeds 1/2, no swapping is possible.
Because this is an abstract entropy flow, you must ignore individual permutations and extract only the capacity flow; energy transparently flows through random particles.
Because the green capacity condenses down to the root, the red capacity is constantly pushed up to the top of the tree by the green border; this is the global red waterfall.
The B-tree swap-out channel is upside-down. You must recursively merge-split 2 vertical black rows from top to root to generate the basic Carnot swap-out flow. The strong restriction on black-node motion adds an extra, redundant red-motion cost. If you try to optimize one subsystem, the entropic cost naturally flows out to the other subsystem.
The quadratic swapping cost of a binary tree is identical to the cost of swapping out the abstract height (volume) growth, which is simple gravity. The shape of a binary tree is exactly identical to the logarithmic Newton gravity visualization.
If you have a very large static list, the swapping cost is 0. Such a list is always as large as the sample space. Compressing the large list always has a Shannon-Huffman entropic cost. This is a very simple hidden Carnot I/O cost.
If you put all the buckets together, you get a very simple static list, which is well known as the Young diagram. This static list is far too small, so it always has an invariant swapping cost. Any tree is a very simple 2-dimensional static list.
The cost of compressing the height of a 2-dimensional Young diagram is always invariant.
The main cost of tree algorithms is not the algorithm cost (green cost); their main cost is the extra hidden Carnot swap-out I/O, which requires local full sorting with large-list compression-decompression (red cost).