数据库索引 B 树和列表

发布于 2024-12-19 21:02:06 字数 358 浏览 3 评论 0原文

任何人都可以解释为什么数据库倾向于使用 B 树索引而不是有序元素的链接列表。

我的想法是这样的:在B+树(大多数数据库使用)上,非叶节点是指向其他节点的指针的集合。每个集合(节点)都是一个有序列表。叶节点是所有数据指针所在的位置,是数据指针簇的链表。

非叶节点仅用于查找目标数据指针所在的正确叶节点。既然叶子节点就像一个链表,那么为什么不直接去掉树元素而只拥有链表呢?可以提供元数据,它给出每个叶节点簇的最小值和最大值,因此应用程序可以读取元数据并找到数据指针所在的正确叶。

需要明确的是,搜索随机访问有序列表的最有效算法是二分搜索,其性能为 O(log n),与 B 树相同。使用链表而不是树的好处是它们不需要平衡。

这种结构是否可行。

Can anyone explain why databases tend to use b-tree indexes rather than a linked list of ordered elements.

My thinking is this: On a B+ Tree (used by most databases), the none-leaf nodes are a collection of pointers to other nodes. Each collection (node) is a ordered list. The leaf nodes, which is where all the data pointers are, is a linked list of clusters of data pointers.

The non-leaf nodes are just used to find the correct leaf node in which your target data pointer lives. So as the leaf nodes are just like a linked list, then why not just do away with the tree elements and just have the linked list. Meta data can be provided which gives the minimum and maximum value of each leaf node cluster, so the application can just read the meta data and find the correct leaf where the data pointer lives.

Just to be clear that the most efficent algorithm for searching an random accessed ordered list is an binary search which has a performance of O(log n) which is the same as a b-tree. The benifit of using a linked list rather than a tree is that they don't need to be ballanced.

Is this structure feasible.

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(3

墨落成白 2024-12-26 21:02:06

经过一些研究和论文阅读后,我找到了答案。

为了应对数百万条记录等大量数据,索引必须组织成集群。簇是磁盘上一组连续的扇区,可以快速读入内存。它们的长度通常约为 4096 字节。

每个集群都可以包含一堆索引,这些索引可以指向磁盘上的其他集群或数据。因此,如果我们有一个链表索引,则索引的每个元素将由单个簇(例如 100)中包含的索引集合组成。

那么,当我们寻找一条特定的记录时,我们如何知道它位于哪个簇上。我们执行二分搜索来查找有问题的集群 [O(log n)]。

然而,要进行二分搜索,我们需要知道每个簇中值的范围,因此我们需要元数据来表示每个簇的最小值和最大值以及该簇的位置。这太棒了。除非每个集群可以包含100个索引,并且我们的元数据也保存在单个集群上(为了速度),那么我们的元数据只能指向100个集群。

如果我们想要超过 100 个集群,会发生什么情况?我们必须有两个元数据索引,每个索引指向 100 个簇(10 000 条记录)。好吧,这还不够。让我们添加另一个元数据集群,现在我们可以访问 1 000 000 条记录。那么我们如何知道需要查询三个元数据集群中的哪一个才能找到目标数据集群。我们可以先搜索一个,然后再搜索另一个,但这无法扩展。因此,我添加了另一个元元数据集群来指示我应该查询三个元数据集群中的哪一个来查找目标数据集群。现在我有一棵树了!

这就是数据库使用树的原因。这不是速度,而是索引的大小以及索引引用其他索引的需要。上面我描述的是B+Tree——子节点包含对其他子节点或叶节点的引用,叶节点包含对磁盘上数据的引用。

唷!

After some research and paper reading I found the answer.

In order to cope with large amounts of data such a millions of records, indexes have to be organised into clusters. A cluster is a continuous group of sectors on a disk that can be read into memory quickly. These are usually about 4096 bytes long.

Each one of these clusters can contain a bunch of indexes which can point to other clusters or data on a disk. So if we had a linked list index, each element of the index would be made up of the collection of indexes contained in a single cluster (say 100).

So, when we are looking for a specific record, how do we know which cluster it is on. We perform a binary search to find the cluster in question [O(log n)].

However, to do a binary search we need to know where the range of values in each clusters, so we need meta-data that says the min and max value of each cluster and where that cluster is. This is great. Except if each cluster can contain 100 indexes, and our meta data is also held on a single cluster (for speed) , then our meta data can only point to 100 clusters.

What happens if we want more than 100 clusters. We have to have two meta-data indexes, each pointing to 100 clusters (10 000 records). Well that’s not enough. Lets add another meta-data cluster and we can now access 1 000 000 records. So how do we know which one of the three meta-data clusters we need to query in order to find our target data cluster. We could search one then the other, but that doesn’t scale. So I add another meta-meta-data cluster to indicate which one of the three meta-data clusters I should query to find the target data cluster. Now I have a tree!

So that’s why databases use trees. It’s not the speed it’s the size of the indexes and the need to have indexes referencing other indexes. What I have described above is a B+Tree – child nodes contain references to other child nodes or leaf nodes, and leaf nodes contain references to data on disk.

Phew!

江湖正好 2024-12-26 21:02:06

我想我在 SQL 索引教程的第一章中回答了这个问题: http://use- the-index-luke.com/sql/anatomy

总结关于您的特定问题的最重要部分:

--来自“叶节点”

索引的主要目的是提供有序的
索引数据的表示。然而,这是不可能的
按顺序存储数据,因为插入语句需要
移动以下条目以为新条目腾出空间。但在移动
大量数据非常耗时,所以插入
声明会非常慢。问题的解决方案是建立一个
独立于内存中物理顺序的逻辑顺序。

——来自“B树”:

索引叶节点以任意顺序存储 - 的位置
磁盘与逻辑位置不对应
索引顺序。它就像一个页面打乱的电话簿。如果
您在中搜索“Smith”,但首先在“Robinson”中打开它
无论如何,史密斯的回归绝不是理所当然的。
数据库需要第二个结构来快速找到条目
打乱的页面:平衡搜索树——简而言之:B-Tree。

I guess I answered that question in Chapter 1 of my SQL Indexing Tutorial: http://use-the-index-luke.com/sql/anatomy

To summarize the most important parts, with respect to your particular question:

-- from "The Leaf Nodes"

The primary purpose of an index is to provide an ordered
representation of the indexed data. It is, however, not possible to
store the data sequentially because an insert statement would need to
move the following entries to make room for the new one. But moving
large amounts of data is very time-consuming, so that the insert
statement would be very slow. The problem's solution is to establish a
logical order that is independent of physical order in memory.

-- from "The B-Tree":

The index leaf nodes are stored in an arbitrary order—the position on
the disk does not correspond to the logical position according to the
index order. It is like a telephone directory with shuffled pages. If
you search for “Smith” in but open it at “Robinson” in the first
place, it is by no means granted that Smith comes farther back.
Databases need a second structure to quickly find the entry among the
shuffled pages: a balanced search tree—in short: B-Tree.

毁梦 2024-12-26 21:02:06

链接列表通常不是按键值排序,而是按插入时刻排序:插入是在列表末尾完成的,每个新条目都包含指向列表中前一个条目的指针。

它们通常作为堆结构实现。

这有两个主要好处:

  • 它们非常易于管理(您只需要每个元素的指针)

  • 如果使用与索引结合使用,您可以克服顺序访问的问题。

相反,如果您使用按键值排序的列表,您将可以轻松访问(二分搜索),但每次编辑、删除、插入新元素时都会遇到问题:您实际上必须在执行操作后保持列表有序,使得算法更加复杂且耗时。

B+ 树是更好的结构,具有您所说的所有属性以及其他优点:

  • 您可以以与单次搜索相同的成本进行分组搜索(按键值的间隔):因为叶子中的元素会自动排序,这要归功于插入算法在链表中是不可能的,因为它需要在列表上进行多次线性搜索。

  • 成本与包含的元素数量成对数,特别是因为这些结构保持平衡,访问成本不取决于您正在寻找的特定值(非常有用)。

  • 这些结构在更新、插入或删除操作中非常高效。

Linked lists are usually not ordered by key value, but by the moment of insertion: insertion is done at the end of list and each new entry contains a pointer to the previous entry of the list.

They are usually implemented as heap structures.

This has 2 main benefits:

  • they are very easy to manage (you just need a pointer for each element)

  • if used in combination with an index you can overcome the problem of sequential access.

If instead you use an ordered list, by key value, you will have ease of access (binary search), but encounter problems each time you edit, delete, insert a new element: you must infact keep your list ordered after performing operation, making algorithms more complex and time consuming.

B+ trees are better structures, having all the properties you stated, and other advantages:

  • you can make group searches (by intervals of key values) with same cost of a single search: since elements in the leafs result automatically ordered thanks to the insertion algorithm, which is not possible in linked lists cause it would require many linear searches over the list.

  • cost is logarithmic with number of elements contained and especially since these structures are kept balanced cost of access does not depend on the particulare value you are looking for (very usefull).

  • these structures are very efficient in update, insert or delete operations.

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文