Why can't a DBMS rely on the operating system's buffer pool?
Stonebraker's paper (Operating System Support for Database Management) explains that "the overhead to fetch a block from the buffer pool manager usually includes that of a system call and a core-to-core move." Forget about the buffer-replacement strategy, etc. The only point I question is the quoted claim.
My understanding is that when a DBMS wants to read a block x it issues a common read instruction. There should be no difference from that of any other application requesting a read.
I'm not looking for generic answers (I got them, and read papers). I seek a detailed answer of the described problem.
See Does a file read from a Java application invoke a system call?
Reading from your other question, and working forward:
When the DBMS must bring a page in from disk, at least one system call is involved. At this point most DBMSs place the page into their own buffer. (The page also ends up in the OS's buffer, but that's unimportant.)
So, we have one system call. However, we can avoid any further system calls. This is possible because the DBMS is caching pages in its own memory space. The first thing the DBMS will do when it decides it needs a page is check and see if it has it in its cache. If it does, it retrieves it from there without ever invoking a system call.
The DBMS is free to expire pages in its cache in whatever way is most beneficial for its IO needs. The OS's cache is expired in a more general way since the OS has other things to worry about. One example of this is that a DBMS will typically use a great deal of memory to cache pages as it knows that disk IO is one of the most expensive things it can do. The OS won't do this as it has to balance the cost of disk IO against having memory for other applications to use.
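The hit-or-miss logic described above can be sketched as a toy buffer pool. This is a minimal illustration, not any real DBMS's code; names like `BufferPool` and `get_page` are my own, and real pools add pinning, dirty-page tracking, and an eviction policy:

```python
import os

class BufferPool:
    """Toy buffer pool: pages cached in the process's own address space."""

    PAGE_SIZE = 4096

    def __init__(self, fd):
        self.fd = fd
        self.pages = {}  # page number -> bytes, held in user-space memory

    def get_page(self, page_no):
        page = self.pages.get(page_no)
        if page is None:
            # Cache miss: one read() system call, plus the
            # kernel-to-user ("core-to-core") copy it implies.
            page = os.pread(self.fd, self.PAGE_SIZE, page_no * self.PAGE_SIZE)
            self.pages[page_no] = page
        # Cache hit: a plain memory access, no system call at all.
        return page
```

Once a page is in `self.pages`, repeated reads of it never cross the kernel boundary again — which is exactly the saving the answer describes.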
The operating system disk i/o must be generalised to work for a variety of situations. The DBMS can sometimes gain significant performance using less general code that is optimised to its own needs.
The DBMS does its own caching, so doesn't want to work through the O/S caching. It "owns" the patch of disk, so it doesn't need to worry about sharing with other processes.
Update
The link to the paper is a help.
Firstly, the paper is almost thirty years old and is referring to long-obsolete hardware. Notwithstanding that, it makes quite interesting reading.
Next, understand that disk i/o is a layered process. It was in 1981 and is even more so now. At the lowest point, a device driver will issue physical read/write instructions to the hardware. Above that may be the o/s kernel code, then the o/s user space code, then the application. Between a C program's fread() and the disk heads moving, there are at least three or four levels, and there might be considerably more. A DBMS seeking to improve performance might bypass some of those layers and talk directly to the kernel, or even lower.
I recall some years ago installing Oracle on a Sun box. It had an option to dedicate a disk as a "raw" partition, where Oracle would format the disk in its own manner and then talk straight to the device driver. The O/S had no access to the disk at all.
It's mainly a performance issue. A dbms has highly specific and unusual I/O demands.
The OS may have any number of processes doing I/O and filling its buffers with the assorted cached data that this produces.
And of course there is the issue of size and what gets cached (a dbms may be able to perform better caching for its needs than the more generic device buffer caching).
And then there is the issue that a generic “block” may in fact amount to a considerably larger I/O burden (this depends on partitioning and the like) than what a dbms ideally would like to bear; its own cache may be tuned to work better with the layout of the data on the disk and thereby be able to minimise I/O.
A further thing is the issue of indexes and similar means to speed up queries, which of course works rather better if the cache actually knows what these mean in the first place.
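To illustrate that last point, here is a hypothetical eviction policy that knows what a page *means* and treats index pages as more valuable than plain data pages. The `kind` tag is information only the DBMS has; to a generic OS block cache, every block looks the same:

```python
import collections

class TypedLRUCache:
    """Hypothetical sketch: LRU eviction that prefers to evict data
    pages and keeps index pages resident as long as possible."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pages = collections.OrderedDict()  # page_no -> (kind, bytes)

    def put(self, page_no, kind, data):
        self.pages[page_no] = (kind, data)
        self.pages.move_to_end(page_no)  # mark as most recently used
        if len(self.pages) > self.capacity:
            self._evict()

    def _evict(self):
        # Prefer the least-recently-used *data* page...
        for page_no, (kind, _) in self.pages.items():
            if kind == "data":
                del self.pages[page_no]
                return
        # ...and fall back to plain LRU only if every page is an index page.
        self.pages.popitem(last=False)
```

A generic cache in the kernel cannot make this distinction, which is one reason an in-process, schema-aware cache can win.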
The real issue is that the file buffer cache is not in the filesystem used by the DBMS; it's in the kernel and shared by all of the filesystems resident in the system. Any memory read out of the kernel must be copied into user space: this is the core-to-core move you read about.
Beyond this, some other reasons you can't rely on the system buffer pool:
I know this is old, but it came up as unanswered.
Essentially:
So ... to get data from the kernel address space to the DBMS's address space, a system call or page fault is unavoidable.
You're correct that accessing data from the OS buffer pool manager is no more expensive than a normal read() call. (In fact, it's done with a normal read call.) However, Stonebraker is not talking about that. He's specifically discussing the caching needs of DBMSes, after the data has been read from the disk and is present in RAM.
In essence, he's saying that the OS's buffer pool cache is too slow for the DBMS to use because it's stored in a different address space. He's suggesting using a local cache in the same process (and therefore same address space), which can give you a significant speedup for applications like DBMSes which hit the cache heavily, because it will eliminate that syscall overhead.
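A rough way to see the overhead he is describing is a micro-benchmark of my own devising (not from the paper): fetch the same 4 KiB block repeatedly via `os.pread()` — a system call, even when the block already sits in the OS page cache — versus a dict lookup in the process's own address space.

```python
import os, tempfile, timeit

PAGE = 4096
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(PAGE))
    path = f.name

fd = os.open(path, os.O_RDONLY)
os.pread(fd, PAGE, 0)                      # warm the OS page cache
local_cache = {0: os.pread(fd, PAGE, 0)}   # warm our in-process cache

# Both fetch the same cached block; only the first crosses the kernel boundary.
syscall_time = timeit.timeit(lambda: os.pread(fd, PAGE, 0), number=10_000)
cache_time = timeit.timeit(lambda: local_cache[0], number=10_000)
print(f"pread: {syscall_time:.4f}s  local cache: {cache_time:.4f}s")

os.close(fd)
os.unlink(path)
```

On any system the local-cache lookups come out far cheaper, since each `pread()` pays for the mode switch and the kernel-to-user copy even on a page-cache hit.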
Here's the exact paragraph where he discusses using a local cache in the same process:
He also mentions multi-core issues in the excerpt you quote above. Similar effects apply here, because if you can have just one cache per core, you may be able to avoid the slowdowns from CPU cache flushes when multiple CPUs are reading and writing the same data.
BTW, I believe Stonebraker's 1981 paper is actually pre-mmap. He mentions it as future work: "The trend toward providing the file system as a part of shared virtual memory (e.g., Pilot [16]) may provide a solution to this problem."