Why is file I/O in large chunks slower than file I/O in small chunks?
If you call ReadFile once with something like 32 MB as the size, it takes noticeably longer than if you read the equivalent number of bytes with a smaller chunk size, like 32 KB.
Why?
(No, my disk is not busy.)
Edit 1:
Forgot to mention: I'm doing this with FILE_FLAG_NO_BUFFERING!
Edit 2:
Weird...
I don't have access to my old machine anymore (PATA), but when I tested it there, it took around 2 times as long, sometimes more. On my new machine (SATA), I'm only getting a ~25% difference.
Here's a piece of code to test:
#include <stdlib.h>   /* malloc, free */
#include <windows.h>
#include <tchar.h>
#include <stdio.h>

int main()
{
    HANDLE hFile = CreateFile(_T("\\\\.\\C:"), GENERIC_READ,
        FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
        OPEN_EXISTING, FILE_FLAG_NO_BUFFERING /*(redundant)*/, NULL);
    __try
    {
        const size_t chunkSize = 64 * 1024;
        const size_t bufferSize = 32 * 1024 * 1024;
        void *pBuffer = malloc(bufferSize);

        /* One large read of bufferSize bytes starting at offset 0. */
        DWORD start = GetTickCount();
        ULONGLONG totalRead = 0;
        OVERLAPPED overlapped = { 0 };
        DWORD nr = 0;
        ReadFile(hFile, pBuffer, (DWORD)bufferSize, &nr, &overlapped);
        totalRead += nr;
        _tprintf(_T("Large read: %lu for %llu bytes\n"),
            GetTickCount() - start, totalRead);

        /* The same range again, this time in chunkSize pieces. */
        totalRead = 0;
        start = GetTickCount();
        overlapped.Offset = 0;
        for (size_t j = 0; j < bufferSize / chunkSize; j++)
        {
            DWORD nr = 0;
            ReadFile(hFile, pBuffer, (DWORD)chunkSize, &nr, &overlapped);
            totalRead += nr;
            overlapped.Offset += (DWORD)chunkSize;
        }
        _tprintf(_T("Small reads: %lu for %llu bytes\n"),
            GetTickCount() - start, totalRead);
        fflush(stdout);
        free(pBuffer);
    }
    __finally { CloseHandle(hFile); }
    return 0;
}
Result:
Large read: 1076 for 67108864 bytes
Small reads: 842 for 67108864 bytes
Any ideas?
Comments (6)
Your test includes the time it takes to read in file metadata, specifically the mapping of file data to disk. If you close the file handle and re-open it, you should get similar timings for each. I tested this locally to make sure.
The effect is probably more severe with heavy fragmentation, as you have to read in more file-to-disk mappings.
EDIT: To be clear, I ran this change locally and saw nearly identical times with large and small reads. When reusing the same file handle, I saw timings similar to those in the original question.
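To make the "fresh handle per measurement" idea concrete, here is a minimal sketch. It is not the code I ran: it uses portable stdio instead of CreateFile/ReadFile (so the OS cache is still involved, unlike with FILE_FLAG_NO_BUFFERING), and the function name read_with_fresh_handle is made up for illustration. The point is only that each measurement opens and closes its own handle, so neither run inherits state from the other.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads up to `total` bytes from `path` in pieces of
 * `chunk` bytes through a stream that is opened and closed inside the call,
 * so every measurement starts from a fresh handle.
 * Returns the number of bytes actually read (0 on failure). */
size_t read_with_fresh_handle(const char *path, size_t chunk, size_t total)
{
    assert(chunk > 0);

    char *buffer = malloc(chunk);
    if (buffer == NULL)
        return 0;

    FILE *file = fopen(path, "rb");
    if (file == NULL) {
        free(buffer);
        return 0;
    }
    /* Turn off stdio's own buffering so each fread maps to a read request
     * of roughly the size we asked for. */
    setvbuf(file, NULL, _IONBF, 0);

    size_t done = 0;
    while (done < total) {
        size_t want = (total - done < chunk) ? total - done : chunk;
        size_t got = fread(buffer, 1, want, file);
        done += got;
        if (got < want)
            break;              /* end of file or read error */
    }

    fclose(file);
    free(buffer);
    return done;
}
```

To compare, time one call with chunk equal to total and a second call with a small chunk, wrapping each call in whatever timer you already use (GetTickCount in the question).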
This is not specific to Windows. I did some tests a while back with the C++ iostream library and found there was an optimum buffer size for reads, above which performance degraded. Unfortunately, I no longer have the tests, and I can't remember what the size was :-). As to why, there are a lot of possible issues, such as a large buffer causing paging in other applications running at the same time (as the buffer can't be paged).
When you perform the 1024 * 32 KB reads, are you reading into the same memory block over and over, or are you also allocating a total of 32 MB to read into and filling the entire 32 MB?
If you're reading the smaller reads into the same 32 KB block of memory, then the time difference is probably simply that Windows doesn't have to scavenge up the additional memory.
Update based on the FILE_FLAG_NO_BUFFERING addition to the question: I'm not 100% certain, but I believe that when FILE_FLAG_NO_BUFFERING is used, Windows will lock the buffer into physical memory so it can allow the device driver to deal with physical addresses (such as to DMA directly into the buffer). It could (I believe) do this by breaking up a large request into smaller requests, but I suspect that Microsoft might have the philosophy that "if you ask for FILE_FLAG_NO_BUFFERING then we assume you know what you're doing and we're not going to get in your way".
Of course, locking 32 MB all at once instead of 32 KB at a time will require more resources. So this would be kind of like my initial guess, but at the physical memory level rather than the virtual memory level.
However, since I don't work for MS and don't have access to the Windows source, I'm going by vague recollection from times when I worked more closely with the Windows kernel and device driver model (so this is more or less speculation).
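To put rough numbers on "more resources", here is a small sketch that counts how many pages a buffer touches, which is the number of pages that would have to be pinned if the whole buffer were locked at once. The 4096-byte page size used in the example figures is an assumption (typical for x86), and pages_spanned is a hypothetical helper, not a Windows API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of pages touched by a buffer that starts at
 * `address` and is `length` bytes long, for a given page size. */
size_t pages_spanned(uintptr_t address, size_t length, size_t page_size)
{
    assert(page_size > 0);
    if (length == 0)
        return 0;

    uintptr_t first_page = address / page_size;
    uintptr_t last_page = (address + length - 1) / page_size;
    return (size_t)(last_page - first_page + 1);
}
```

With 4096-byte pages, a page-aligned 32 MB buffer spans 8192 pages, while a page-aligned 32 KB buffer spans only 8. Whether Windows really locks all of them in one go is exactly the part I'm unsure about.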
When you have specified FILE_FLAG_NO_BUFFERING, the operating system will not buffer the I/O. So each time you call the read function, it makes a system call that fetches the data from the disk. To read a file of a fixed size with a smaller buffer, more system calls are needed, which means more transitions from user space to kernel space, and a disk I/O is initiated each time. If you use a larger block size instead, fewer system calls are required to read the same file, so there are fewer user-to-kernel transitions and fewer disk I/O operations are initiated. This is why a larger block will generally require less time to read.
Try reading the file only 1 byte at a time without buffering, then try 4096-byte blocks, and see the difference.
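The call-count argument is simple arithmetic; a minimal sketch, with read_calls_needed being a made-up name for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: how many read calls it takes to fetch `total_bytes`
 * when each call asks for at most `chunk_bytes` (rounded up, so a partial
 * last chunk still costs one call). */
size_t read_calls_needed(size_t total_bytes, size_t chunk_bytes)
{
    assert(chunk_bytes > 0);
    return total_bytes / chunk_bytes + (total_bytes % chunk_bytes != 0);
}
```

For the sizes in the question, 32 MB read as one block is 1 call, and 32 MB in 32 KB chunks is 1024 calls. This only counts calls; it says nothing about how long each one takes.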
A possible explanation in my opinion would be command queueing with FILE_FLAG_NO_BUFFERING, since this does direct DMA reads at a low level.
A single large request will of course still necessarily be broken into sub-requests, but those will likely be sent more or less one after another (because the driver needs to lock the pages and will in all likelihood be reluctant to lock several megabytes lest it hit the quota).
On the other hand, if you throw a dozen or two dozen requests at the driver, it will just forward them to the disk, and the disk can take advantage of NCQ.
Well, that's what I'm thinking might be the reason anyway (this does not explain why the exact same phenomenon happens with buffered reads though, as in the Q that I linked to above).
What you are probably observing is that when using smaller blocks, the second block of data can be read while the first is being processed, then the third can be read while the second is being processed, and so on, so that the speed limit is the slower of the physical read time and the processing time. If it takes the same amount of time to process one block as to read the next, the speed could be double what it would be if processing and reading were separate. When using larger blocks, the amount of data that is read while the first block is being processed will be limited to an amount smaller than the block size. When the code is ready for the next block of data, part of it will have been read but some of it will not; the code will thus have to wait while the remainder of the data is fetched.
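A toy model of that overlap, as a sketch only: it assumes every chunk takes the same time to read and the same time to process, and that the read of the next chunk can start as soon as the previous read finishes. The function names are made up for illustration.

```c
#include <assert.h>

/* No overlap: every chunk is read, then processed, one after the other. */
unsigned long sequential_time(unsigned long chunks,
                              unsigned long read_ms,
                              unsigned long process_ms)
{
    return chunks * (read_ms + process_ms);
}

/* Overlap: chunk i+1 is read while chunk i is processed. The first read and
 * the last processing step cannot overlap with anything; in between, each
 * step costs whichever of the two stages is slower. */
unsigned long overlapped_time(unsigned long chunks,
                              unsigned long read_ms,
                              unsigned long process_ms)
{
    assert(read_ms > 0 || process_ms > 0);
    if (chunks == 0)
        return 0;

    unsigned long slower = (read_ms > process_ms) ? read_ms : process_ms;
    return read_ms + (chunks - 1) * slower + process_ms;
}
```

With made-up figures: if one huge block takes 1024 ms to read and 1024 ms to process, overlapped_time(1, 1024, 1024) is 2048 ms because nothing can overlap. Split into 1024 chunks of 1 ms each, overlapped_time(1024, 1, 1) is 1025 ms, close to the factor of two mentioned above.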