大于内存的数据结构及其通常如何处理
假设我有一个基于文件的数据结构,例如 B+ 树。 我的理解是数据应该存储在磁盘上,但索引通常加载在内存中。 如果您有一个很大的文件,甚至其索引都无法放入内存怎么办? 通常如何处理? 其次,由于索引是一棵树,而不是线性数据集,那么它通常如何在磁盘上布局?
我基本上很好奇它在现实世界的项目(例如 Berkeley DB)中是如何完成的。 显然我对大方向感兴趣。 我希望得到一个想法,这样当我深入研究数据库书籍的 B 树部分时(或者从多年前的 CS XYZ 中回忆起我的记忆),我就有了一些背景信息
Say I have a file based data structure such as a B+ Tree. My understanding is that the data is expected to be stored on disk, but the index is usually loaded in memory. What if you have such a large file that even its index doesn't fit into memory? How is that typically handled? Secondly, since the index is a tree, not a linear set of data, how is it usually laid out on disk?
I'm basically curious about how it is done in real-world projects (such as Berkeley DB). Obviously I'm interested in broad strokes. I'm hoping to get an idea so I have some context when I dig into the B-Tree section of my database book (or jog my memory from CS XYZ from years ago)
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(3)
B 树适用于基于页面的系统,其中给定的节点适合页面。 要在 B 树中查找条目,只需一次加载一页,因此您可以这样做。
即使更新它们也不需要同时在内存中存在大量页面 - 我想最困难的操作是在重新组织节点时进行删除,但如果仔细实现,即使这也可以用相对较少的页面来完成记忆。
B-trees are intended for page-based systems, where a given node fits into a page. To find an entry in a B-tree, it is only necessary to load in one page at a time, so you can do that.
Even updating them doesn't require a large number of pages being in memory at the same time - I imagine the most difficult operation is a delete when nodes are being reorganised, but if it's implemented carefully even that could be done with relatively few pages in memory.
您可能想看看 SQLite。 代码库比 Berkeley DB 小得多,它是公共领域,它的组织和注释非常清晰,源代码文档非常出色。 教会了我很多关于现实世界中 btree 的知识
You might want to take a look at SQLite. The code base is much smaller than Berkeley DB, it's public domain, it's very clearly organized and commented, and the out-of-source documentation is excellent. Taught me a lot about btrees in the real world
要回答你的第一个问题,太大而无法放入内存的数据结构通常被分为“页面”,通常所有页面的大小相同,并且每个页面包含数据结构的一部分,以使用您加载和卸载的数据页。
另一个常见选项(在 RDBMS 中不常用,但在 XML 和媒体文件等中很常见)是流式传输,您可以通过加载下一部分并丢弃上一个部分来按顺序处理数据。
这也回答了你的第二个问题,如果你使用分页,那么文件结构是一系列相同大小的页面,如果你使用流式传输,那么数据应该按照你要使用它的顺序排列(在对于树来说,它可能是 DFS 或 BFS 顺序,具体取决于您的应用程序)。
To answer your first question, a data structure that is too large to fit into memory is usually divided into "pages", usually all pages are the same size and each page contains part of the data structure, to use the data you load and unload pages.
Another common option (that is not commonly used in RDBMS but is common with things like XML and media files) is streaming, where you process the data in order by loading the next section and discarding the previous one.
And that also answers your second question, if you use paging than the file structure is a sequence of pages of the same size, if you use streaming than the data should be laid out in the order you're going to use it (in the case of a tree it's likely going to be either DFS or BFS order, depending on your application).