Explanation for tiny reads (overlapped, buffered) outperforming large contiguous reads?
(apologies for the somewhat lengthy intro)
During development of an application which prefaults an entire large file (>400MB) into the buffer cache for speeding up the actual run later, I tested whether reading 4MB at a time still had any noticeable benefits over reading only 1MB chunks at a time. Surprisingly, the smaller requests actually turned out to be faster. This seemed counter-intuitive, so I ran a more extensive test.
The buffer cache was purged before running the tests (just for laughs, I did one run with the file in the buffers, too. The buffer cache delivers upwards of 2GB/s regardless of request size, though with a surprising +/- 30% random variance).
All reads used overlapped ReadFile with the same target buffer (the handle was opened with FILE_FLAG_OVERLAPPED and without FILE_FLAG_NO_BUFFERING). The harddisk used is somewhat elderly but fully functional; NTFS has a cluster size of 8kB. The disk was defragmented after an initial run (6 fragments vs. unfragmented, zero difference). For better figures, I used a larger file too; the numbers below are for reading 1GB.
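A minimal sketch of that setup, assuming a hypothetical file name, a 1 MB chunk size, and one event-signalled OVERLAPPED per request (the question varies the chunk size and does not show its actual code), could look roughly like this:

    #include <windows.h>
    #include <vector>

    int main()
    {
        const DWORD chunk = 1 << 20;   // 1 MB per request; the tests vary this from 16 kB to 4 MB

        // Buffered (no FILE_FLAG_NO_BUFFERING), overlapped handle, as described above.
        HANDLE h = CreateFileA("big.bin", GENERIC_READ, FILE_SHARE_READ, nullptr,
                               OPEN_EXISTING,
                               FILE_ATTRIBUTE_NORMAL | FILE_FLAG_OVERLAPPED, nullptr);
        if (h == INVALID_HANDLE_VALUE) return 1;

        LARGE_INTEGER size;
        GetFileSizeEx(h, &size);
        const ULONGLONG count = ((ULONGLONG)size.QuadPart + chunk - 1) / chunk;

        std::vector<char>       buffer(chunk);   // same target buffer for every request
        std::vector<OVERLAPPED> ov(count);
        std::vector<HANDLE>     events(count);

        // Submit phase ("submit time" = until this loop finishes).
        for (ULONGLONG i = 0; i < count; ++i) {
            ZeroMemory(&ov[i], sizeof(OVERLAPPED));
            ULONGLONG offset = i * (ULONGLONG)chunk;
            ov[i].Offset     = (DWORD)(offset & 0xFFFFFFFFu);
            ov[i].OffsetHigh = (DWORD)(offset >> 32);
            events[i] = CreateEventA(nullptr, TRUE, FALSE, nullptr);
            ov[i].hEvent = events[i];
            if (!ReadFile(h, buffer.data(), chunk, nullptr, &ov[i]) &&
                GetLastError() != ERROR_IO_PENDING)
                return 2;                        // a genuine error, not just "pending"
        }

        // Completion phase ("completion time" = until this loop finishes).
        for (ULONGLONG i = 0; i < count; ++i) {
            DWORD transferred = 0;
            GetOverlappedResult(h, &ov[i], &transferred, TRUE);   // wait for request i
            CloseHandle(events[i]);
        }

        CloseHandle(h);
        return 0;
    }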
The results were really surprising:
4MB x 256 : 5ms per request, completion 25.8s @ ~40 MB/s
1MB x 1024 : 11.7ms per request, completion 23.3s @ ~43 MB/s
32kB x 32768 : 12.6ms per request, completion 15.5s @ ~66 MB/s
16kB x 65536 : 12.8ms per request, completion 13.5s @ ~75 MB/s
So, this suggests that submitting tens of thousands of requests two clusters in length is actually better than submitting a few hundred large, contiguous reads. The submit time (time before ReadFile returns) does go up substantially as the number of requests goes up, but the asynchronous completion time nearly halves.
Kernel CPU time is around 5-6% in every case (on a quadcore, so one should really say 20-30%) while the asynchronous reads are completing, which is a surprising amount of CPU -- apparently the OS does a non-negligible amount of busy waiting, too. 30% CPU for 25 seconds at 2.6 GHz, that's quite a few cycles for doing "nothing".
Any idea how this can be explained? Maybe someone here has a deeper insight into the inner workings of Windows overlapped IO? Or is there something substantially wrong with the idea that you can use ReadFile for reading a megabyte of data?
I can see how an IO scheduler would be able to optimize multiple requests by minimizing seeks, especially when requests are random access (which they aren't!). I can also see how a harddisk would be able to perform a similar optimization given a few requests in the NCQ.
However, we're talking about ridiculous numbers of ridiculously small requests -- which nevertheless outperform what appears to be sensible by a factor of 2.
Sidenote: The clear winner is memory mapping. I'm almost inclined to add "unsurprisingly" because I am a big fan of memory mapping, but in this case, it actually does surprise me, as the "requests" are even smaller and the OS should be even less able to predict and schedule the IO. I didn't test memory mapping at first because it seemed counter-intuitive that it might be able to compete even remotely. So much for your intuition, heh.
Mapping/unmapping a view repeatedly at different offsets takes practically zero time. Using a 16MB view and faulting every page with a simple for() loop reading a single byte per page completes in 9.2 secs @ ~111 MB/s. CPU usage is under 3% (one core) at all times. Same computer, same disk, same everything.
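A sketch of that prefault loop, assuming the same hypothetical file name as in the earlier sketch and a 4 kB page size:

    #include <windows.h>

    int main()
    {
        HANDLE file = CreateFileA("big.bin", GENERIC_READ, FILE_SHARE_READ, nullptr,
                                  OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
        if (file == INVALID_HANDLE_VALUE) return 1;

        LARGE_INTEGER size;
        GetFileSizeEx(file, &size);

        HANDLE mapping = CreateFileMappingA(file, nullptr, PAGE_READONLY, 0, 0, nullptr);
        if (!mapping) return 2;

        const ULONGLONG viewSize = 16ull << 20;  // 16 MB view, remapped at increasing offsets
        const SIZE_T    page     = 4096;
        volatile char   sink     = 0;            // keeps the per-page reads from being optimized out

        for (ULONGLONG offset = 0; offset < (ULONGLONG)size.QuadPart; offset += viewSize) {
            ULONGLONG remaining = (ULONGLONG)size.QuadPart - offset;
            SIZE_T    len       = (SIZE_T)(remaining < viewSize ? remaining : viewSize);

            const char* view = (const char*)MapViewOfFile(mapping, FILE_MAP_READ,
                                                          (DWORD)(offset >> 32),
                                                          (DWORD)(offset & 0xFFFFFFFFu),
                                                          len);
            if (!view) return 3;

            for (SIZE_T p = 0; p < len; p += page)   // fault every page, reading one byte per page
                sink += view[p];

            UnmapViewOfFile(view);                   // near-zero cost, per the observation above
        }

        CloseHandle(mapping);
        CloseHandle(file);
        return 0;
    }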
It also appears that Windows loads 8 pages into the buffer cache at a time, although only one page is actually created. Faulting every 8th page runs at the same speed and loads the same amount of data from disk, but shows lower "physical memory" and "system cache" metrics and only 1/8 of the page faults. Subsequent reads prove that the pages are nevertheless definitively in the buffer cache (no delay, no disk activity).
(Possibly very, very distantly related to Memory-Mapped File is Faster on Huge Sequential Read?)
To make it a bit more illustrative: [MB/s-per-block-size charts omitted]
Update:
Using FILE_FLAG_SEQUENTIAL_SCAN seems to somewhat "balance" reads of 128k, improving performance by 100%. On the other hand, it severely impacts reads of 512k and 256k (you have to wonder why?) and has no real effect on anything else. The MB/s graph of the smaller block sizes arguably seems a little more "even", but there is no difference in runtime.
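For reference, the only change in this variant is the extra hint flag on the open call (same hypothetical file name as in the earlier sketch):

    // FILE_FLAG_SEQUENTIAL_SCAN is only a hint to the cache manager's read-ahead;
    // everything else about the overlapped setup stays the same.
    HANDLE h = CreateFileA("big.bin", GENERIC_READ, FILE_SHARE_READ, nullptr,
                           OPEN_EXISTING,
                           FILE_ATTRIBUTE_NORMAL | FILE_FLAG_OVERLAPPED |
                               FILE_FLAG_SEQUENTIAL_SCAN,
                           nullptr);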
I may have found an explanation for smaller block sizes performing better, too. As you know, asynchronous requests may run synchronously if the OS can serve the request immediately, i.e. from the buffers (and for a variety of version-specific technical limitations).
When accounting for actual asynchronous vs. "immediate" asynchronous reads, one notices that upwards of 256k, Windows runs every asynchronous request asynchronously. The smaller the block size, the more requests are being served "immediately", even when the data is not immediately available (i.e. ReadFile simply runs synchronously). I cannot make out a clear pattern (such as "the first 100 requests" or "more than 1000 requests"), but there seems to be an inverse correlation between request size and synchronicity. At a block size of 8k, every asynchronous request is served synchronously.
Buffered synchronous transfers are, for some reason, twice as fast as asynchronous transfers (no idea why), hence the smaller the request sizes, the faster the overall transfer, because more transfers are done synchronously.
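The synchronous-vs-asynchronous split described above can be counted from the ReadFile return value on an overlapped handle, roughly like this (a hypothetical helper assumed to stand in for the plain ReadFile call in the earlier submit loop):

    #include <windows.h>

    // Submits one overlapped read and records whether it completed inline.
    // Returns false only on a genuine error (anything other than ERROR_IO_PENDING).
    bool SubmitAndClassify(HANDLE h, void* buf, DWORD bytes, OVERLAPPED* ov,
                           ULONGLONG* servedInline, ULONGLONG* wentAsync)
    {
        if (ReadFile(h, buf, bytes, nullptr, ov)) {
            ++*servedInline;              // completed synchronously ("immediately")
            return true;
        }
        if (GetLastError() == ERROR_IO_PENDING) {
            ++*wentAsync;                 // queued; completes asynchronously later
            return true;
        }
        return false;                     // real failure
    }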
For memory mapped prefaulting, FILE_FLAG_SEQUENTIAL_SCAN causes a slightly different shape of the performance graph (there is a "notch" which is moved a bit backwards), but the total time taken is exactly identical (again, this is surprising, but I can't help it).
Update 2:
Unbuffered IO makes the performance graphs for the 1M, 4M, and 512k request test cases somewhat higher and more "spiky", with maximums in the 90s of GB/s but harsh minimums too; the overall runtime for 1GB is within +/- 0.5s of the buffered run. (The test cases with smaller buffer sizes do complete significantly faster, but only because with more than 2558 in-flight requests, ERROR_WORKING_SET_QUOTA is returned.) Measured CPU usage is zero in all unbuffered cases, which is unsurprising, since any IO that happens runs via DMA.
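A sketch of the unbuffered variant, assuming a 512 kB chunk and the same hypothetical file name as before. FILE_FLAG_NO_BUFFERING requires the buffer address, the transfer size, and the file offset all to be sector-aligned, which a VirtualAlloc'd buffer and sector-multiple sizes and offsets satisfy:

    #include <windows.h>

    int main()
    {
        const DWORD chunk = 512 * 1024;   // must be a multiple of the volume's sector size

        HANDLE h = CreateFileA("big.bin", GENERIC_READ, FILE_SHARE_READ, nullptr,
                               OPEN_EXISTING,
                               FILE_FLAG_OVERLAPPED | FILE_FLAG_NO_BUFFERING, nullptr);
        if (h == INVALID_HANDLE_VALUE) return 1;

        // VirtualAlloc returns page-aligned memory, which satisfies the alignment rule.
        void* buffer = VirtualAlloc(nullptr, chunk, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
        if (!buffer) return 2;

        OVERLAPPED ov = {};
        ov.hEvent = CreateEventA(nullptr, TRUE, FALSE, nullptr);
        // Offset 0 here; subsequent requests would advance by 'chunk', staying aligned.

        if (!ReadFile(h, buffer, chunk, nullptr, &ov) &&
            GetLastError() != ERROR_IO_PENDING)
            return 3;

        DWORD transferred = 0;
        GetOverlappedResult(h, &ov, &transferred, TRUE);   // DMA straight into 'buffer'

        CloseHandle(ov.hEvent);
        VirtualFree(buffer, 0, MEM_RELEASE);
        CloseHandle(h);
        return 0;
    }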
Another very interesting observation with FILE_FLAG_NO_BUFFERING is that it significantly changes API behaviour. CancelIO does not work any more, at least not in the sense of cancelling IO. With unbuffered in-flight requests, CancelIO will simply block until all requests have finished. A lawyer would probably argue that the function cannot be held liable for neglecting its duty, because there are no more in-flight requests left when it returns, so in some way it has done what was asked -- but my understanding of "cancel" is somewhat different.
With buffered, overlapped IO, CancelIO will simply cut the rope; all in-flight operations terminate immediately, as one would expect.
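For reference, the cancellation call in question (the Win32 export is spelled CancelIo; it cancels all pending I/O issued by the calling thread on the given handle, while CancelIoEx can also target another thread's requests or a single OVERLAPPED):

    // Attempt to cancel everything this thread has in flight on 'h' (handle from the
    // earlier sketch). With buffered, overlapped I/O the pending reads then complete
    // quickly with ERROR_OPERATION_ABORTED; with FILE_FLAG_NO_BUFFERING the
    // observation above is that this call blocks until the requests actually finish.
    if (!CancelIo(h))
        ; // handle the error (e.g. inspect GetLastError())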
Yet another funny thing is that the process is unkillable until all requests have finished or failed. This kind of makes sense if the OS is doing DMA into that address space, but it's a stunning "feature" nevertheless.
1 Answer:
I'm not a filesystem expert, but I think there are a couple of things going on here. First off, w.r.t. your comment about memory mapping being the winner: this isn't totally surprising, since the NT cache manager is based on memory mapping -- by doing the memory mapping yourself, you're duplicating the cache manager's behavior without the additional memory copies.
When you read sequentially from the file, the cache manager attempts to pre-fetch the data for you - so it's likely that you are seeing the effect of readahead in the cache manager. At some point the cache manager stops prefetching reads (or rather at some point the prefetched data isn't sufficient to satisfy your reads and so the cache manager has to stall). That may account for the slowdown on larger I/Os that you're seeing.
Have you tried adding FILE_FLAG_SEQUENTIAL_SCAN to your CreateFile flags? That instructs the prefetcher to be even more aggressive.
This may be counter-intuitive, but traditionally the fastest way to read data off the disk is to use asynchronous I/O and FILE_FLAG_NO_BUFFERING. When you do that, the I/O goes directly from the disk driver into your I/O buffers with nothing to get in the way (assuming that the segments of the file are contiguous - if they're not, the filesystem will have to issue several disk reads to satisfy the application read request). Of course it also means that you lose the built-in prefetch logic and have to roll your own. But with FILE_FLAG_NO_BUFFERING you have complete control of your I/O pipeline.
One other thing to remember: when you're doing asynchronous I/O, it's important to ensure that you always have an I/O request outstanding -- otherwise you lose potential time between when the last I/O completes and the next I/O is started.
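A rough sketch of that "keep requests outstanding" pattern combined with the FILE_FLAG_NO_BUFFERING approach suggested above: a small ring of in-flight reads where each completion immediately resubmits the next offset. Queue depth, chunk size, and the file name are illustrative choices, not values from the answer:

    #include <windows.h>

    int main()
    {
        const DWORD chunk = 1 << 20;        // 1 MB per request, a multiple of the sector size
        const int   depth = 8;              // how many requests to keep in flight

        HANDLE h = CreateFileA("big.bin", GENERIC_READ, FILE_SHARE_READ, nullptr,
                               OPEN_EXISTING,
                               FILE_FLAG_OVERLAPPED | FILE_FLAG_NO_BUFFERING, nullptr);
        if (h == INVALID_HANDLE_VALUE) return 1;

        LARGE_INTEGER size;
        GetFileSizeEx(h, &size);

        OVERLAPPED ov[depth];
        HANDLE     events[depth];
        void*      buf[depth];
        bool       busy[depth] = {};
        ULONGLONG  next = 0;                // next file offset to read

        // Queue one read at the current offset into slot s (if anything is left to read).
        auto submit = [&](int s) {
            if (next >= (ULONGLONG)size.QuadPart) return;
            ZeroMemory(&ov[s], sizeof(OVERLAPPED));
            ov[s].Offset     = (DWORD)(next & 0xFFFFFFFFu);
            ov[s].OffsetHigh = (DWORD)(next >> 32);
            ov[s].hEvent     = events[s];
            next += chunk;
            if (ReadFile(h, buf[s], chunk, nullptr, &ov[s]) ||
                GetLastError() == ERROR_IO_PENDING)
                busy[s] = true;             // request is now in flight
        };

        for (int s = 0; s < depth; ++s) {   // prime the pipeline
            events[s] = CreateEventA(nullptr, TRUE, FALSE, nullptr);
            buf[s]    = VirtualAlloc(nullptr, chunk, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
            submit(s);
        }

        for (;;) {                          // drain and refill so a slot never sits idle for long
            bool any = false;
            for (int s = 0; s < depth; ++s) {
                if (!busy[s]) continue;
                any = true;
                DWORD transferred = 0;
                GetOverlappedResult(h, &ov[s], &transferred, TRUE);  // wait for slot s
                busy[s] = false;
                // ...consume buf[s] here...
                submit(s);                  // immediately start the next read
            }
            if (!any) break;
        }

        for (int s = 0; s < depth; ++s) {
            CloseHandle(events[s]);
            VirtualFree(buf[s], 0, MEM_RELEASE);
        }
        CloseHandle(h);
        return 0;
    }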