Improving write speed for high-speed file copying?
I've been trying to find out the fastest way to code a file copy routine to copy a large file onto RAID 5 hardware.
The average file size is around 2 GB.
There are two Windows boxes (both running Win2k3). The first box is the source, where the large file is located, and the second box has the RAID 5 storage.
http://blogs.technet.com/askperf/archive/2007/05/08/slow-large-file-copy-issues.aspx
The above link clearly explains why Windows copy, robocopy, and other common copy utilities suffer in write performance.
Hence, I've written a C/C++ program that uses the CreateFile, ReadFile, and WriteFile APIs with the NO_BUFFERING and WRITE_THROUGH flags. The program mimics ESEUTIL.exe in the sense that it uses two threads, one for reading and one for writing. The reader thread reads 256 KB from the source and fills a buffer; once 16 such 256 KB blocks are filled, the writer thread writes the contents of the buffer to the destination file. In other words, the writer thread writes 8 MB of data in one shot. The program allocates 32 such 8 MB blocks, so reading and writing can happen in parallel.
Details of ESEUTIL.exe can be found in the link above.
Note: I am taking care of the data alignment requirements that come with NO_BUFFERING.
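For reference, a minimal sketch of the kind of unbuffered, write-through open and sector-aligned buffer described above (the path and chunk size are illustrative placeholders, not the exact values from my program):

// Sketch only: open the destination unbuffered/write-through and write one
// sector-aligned chunk. The path and sizes are placeholders for illustration.
#include <windows.h>
#include <cstdio>

int main()
{
    const DWORD chunkSize = 8 * 1024 * 1024;   // 8 MB write unit; must be a multiple of the sector size

    HANDLE hDst = CreateFileA("D:\\dest.bin", GENERIC_WRITE, 0, NULL, CREATE_ALWAYS,
                              FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH, NULL);
    if (hDst == INVALID_HANDLE_VALUE) { printf("CreateFile failed: %lu\n", GetLastError()); return 1; }

    // VirtualAlloc returns page-aligned memory, which satisfies the sector-alignment
    // requirement that FILE_FLAG_NO_BUFFERING imposes on buffers.
    void* buffer = VirtualAlloc(NULL, chunkSize, MEM_COMMIT, PAGE_READWRITE);
    if (!buffer) { CloseHandle(hDst); return 1; }

    DWORD written = 0;
    if (!WriteFile(hDst, buffer, chunkSize, &written, NULL))
        printf("WriteFile failed: %lu\n", GetLastError());

    VirtualFree(buffer, 0, MEM_RELEASE);
    CloseHandle(hDst);
    return 0;
}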
I used benchmarking utilities like ATTO and found that our RAID 5 hardware has a write speed of 44 MB per second when writing 8 MB chunks, which is around 2.57 GB per minute.
But my program is able to achieve only 1.4 GB per minute.
Can anyone please help me identify what the problem is? Are there faster APIs available than CreateFile, ReadFile, and WriteFile?
You should use async IO to get the best performance. That is, open the file with FILE_FLAG_OVERLAPPED and use the LPOVERLAPPED argument of WriteFile. You may or may not get better performance with FILE_FLAG_NO_BUFFERING; you will have to test to see. FILE_FLAG_NO_BUFFERING will generally give you more consistent speeds and better streaming behavior, and it avoids polluting your disk cache with data that you may not need again, but it isn't necessarily faster overall.
You should also test to see what the best size is for each block of IO. In my experience there is a huge performance difference between copying a file 4 KB at a time and copying it 1 MB at a time.
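For illustration, a minimal sketch of a single overlapped write under those flags (the path, block size, and error handling are simplified; a real copy loop would keep several such requests in flight):

// Sketch: one overlapped write. In a real copy you would queue several at once.
// The path and block size are placeholders.
#include <windows.h>
#include <cstdio>

int main()
{
    HANDLE h = CreateFileA("D:\\dest.bin", GENERIC_WRITE, 0, NULL, CREATE_ALWAYS,
                           FILE_FLAG_OVERLAPPED | FILE_FLAG_NO_BUFFERING, NULL);
    if (h == INVALID_HANDLE_VALUE) return 1;

    const DWORD blockSize = 1024 * 1024;                  // try different power-of-2 sizes
    void* buf = VirtualAlloc(NULL, blockSize, MEM_COMMIT, PAGE_READWRITE); // sector-aligned

    OVERLAPPED ov = {};
    ov.hEvent = CreateEvent(NULL, TRUE, FALSE, NULL);
    ov.Offset = 0;                                        // byte offset of this block in the file

    if (!WriteFile(h, buf, blockSize, NULL, &ov) && GetLastError() != ERROR_IO_PENDING)
    {
        printf("WriteFile failed: %lu\n", GetLastError());
    }
    else
    {
        DWORD written = 0;
        GetOverlappedResult(h, &ov, &written, TRUE);      // wait for this write to complete
    }

    CloseHandle(ov.hEvent);
    VirtualFree(buf, 0, MEM_RELEASE);
    CloseHandle(h);
    return 0;
}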
In my past testing of this (a few years ago) I found that block sizes below about 64kB were dominated by overhead, and total throughput continued to improve with larger block sizes up to about 512KB. I wouldn't be surprised if with today's drives you needed to use block sizes larger than 1MB to get maximum throughput.
The numbers you are currently using appear to be reasonable, but may not be optimal. Also I'm fairly certain that FILE_FLAG_WRITE_THROUGH prevents the use of the on-disk cache and thus will cost you a fair bit of performance.
You need to also be aware that copying files using CreateFile/WriteFile will not copy metadata such as timestamps or alternate data streams on NTFS. You will have to deal with these things on your own.
Actually, replacing CopyFile with your own code is quite a lot of work.
Addendum:
I should probably mention that when I tried this with software RAID 0 on Windows NT 3.0 (about 10 years ago), the speed was VERY sensitive to the alignment in memory of the buffers. It turned out that at the time, the SCSI drivers had to use a special algorithm for doing DMA from a scatter/gather list when the DMA was more than 16 physical regions of memory (64 KB). To get guaranteed optimal performance required physically contiguous allocations - which is something that only drivers can request. This was basically a workaround for a bug in the DMA controller of a popular chipset back then, and is unlikely to still be an issue.
BUT - I would still strongly suggest that you test ALL powers-of-2 block sizes from 32 KB to 32 MB to see which is faster. And you might consider testing to see if some buffers are consistently faster than others - it's not unheard of.
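A rough harness for that kind of block-size sweep could look like the following (the target path and total amount written are placeholders; each size is a power of two, so it also satisfies the NO_BUFFERING alignment rule):

// Sketch: time synchronous unbuffered writes at each power-of-2 block size
// from 32 KB to 32 MB. Path and total size are placeholders.
#include <windows.h>
#include <cstdio>

int main()
{
    const ULONGLONG totalBytes = 512ULL * 1024 * 1024;    // write 512 MB per block size

    for (DWORD blockSize = 32 * 1024; blockSize <= 32 * 1024 * 1024; blockSize *= 2)
    {
        HANDLE h = CreateFileA("D:\\blocksize_test.bin", GENERIC_WRITE, 0, NULL,
                               CREATE_ALWAYS, FILE_FLAG_NO_BUFFERING, NULL);
        if (h == INVALID_HANDLE_VALUE) return 1;

        void* buf = VirtualAlloc(NULL, blockSize, MEM_COMMIT, PAGE_READWRITE);
        DWORD start = GetTickCount();

        for (ULONGLONG done = 0; done < totalBytes; done += blockSize)
        {
            DWORD written = 0;
            if (!WriteFile(h, buf, blockSize, &written, NULL)) break;
        }

        printf("block %7lu KB : %lu ms\n", blockSize / 1024, GetTickCount() - start);

        VirtualFree(buf, 0, MEM_RELEASE);
        CloseHandle(h);
    }
    return 0;
}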
A while back I wrote a blog posting about async file I/O and how it often tends to actually end up being synchronous unless you do everything just right (http://www.lenholgate.com/blog/2008/02/when-are-asynchronous-file-writes-not-asynchronous.html).
The key points are that even when you're using FILE_FLAG_OVERLAPPED and FILE_FLAG_NO_BUFFERING, you still need to pre-extend the file so that your async writes don't need to extend the file as they go; for security reasons, file extension is always synchronous. To pre-extend, you need to do the following:
Enable the SE_MANAGE_VOLUME_NAME privilege.
Seek to the desired file length with SetFilePointerEx().
Set the end of file with SetEndOfFile().
Call SetFileValidData().
Then...
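A sketch of that pre-extension sequence, assuming the process is allowed to enable SE_MANAGE_VOLUME_NAME (e.g. running as an administrator); the path and size are placeholders and most error handling is omitted:

// Sketch: pre-extend the destination file so later async writes don't have to extend it.
#include <windows.h>
#pragma comment(lib, "advapi32.lib")

int main()
{
    // Enable the SE_MANAGE_VOLUME_NAME privilege (required by SetFileValidData).
    HANDLE token;
    OpenProcessToken(GetCurrentProcess(), TOKEN_ADJUST_PRIVILEGES, &token);
    TOKEN_PRIVILEGES tp = {};
    tp.PrivilegeCount = 1;
    tp.Privileges[0].Attributes = SE_PRIVILEGE_ENABLED;
    LookupPrivilegeValueA(NULL, "SeManageVolumePrivilege", &tp.Privileges[0].Luid);
    AdjustTokenPrivileges(token, FALSE, &tp, 0, NULL, NULL);
    CloseHandle(token);

    // Create the destination and set its length up front.
    HANDLE h = CreateFileA("D:\\dest.bin", GENERIC_WRITE, 0, NULL,
                           CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE) return 1;

    LARGE_INTEGER size;
    size.QuadPart = 2LL * 1024 * 1024 * 1024;             // e.g. a 2 GB destination file
    SetFilePointerEx(h, size, NULL, FILE_BEGIN);          // seek to the desired length
    SetEndOfFile(h);                                      // extend the file to that length
    SetFileValidData(h, size.QuadPart);                   // mark the range as valid data

    CloseHandle(h);
    // Reopen afterwards with FILE_FLAG_OVERLAPPED | FILE_FLAG_NO_BUFFERING for the copy.
    return 0;
}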
How fast can you read the source file if you don't write the destination?
Is the source file fragmented? Fragmented reads can be an order of magnitude slower than contiguous reads. You can use the "contig" utility to make it contiguous:
http://technet.microsoft.com/en-us/sysinternals/bb897428.aspx
How fast is the network connecting the two machines?
Have you tried just writing dummy data, without reading it first, like ATTO does?
Do you have more than one read or write request in flight at a time?
What's the stripe size of your RAID-5 array? Writing a full stripe at a time is the fastest way to write to RAID-5.
Just remember that a hard disk buffers data coming from the platters and going to the platters. Most disk drives will try to optimize the read requests to keep the platters rotating and minimize head movement. The drives try to absorb as much data from the Host before writing to the platters so that the Host can be disconnected as soon as possible.
Your performance also depends on the I/O bus traffic on the PC as well as the traffic between the disk and the host. There are other factors to consider, such as system tasks and programs running at the same time. You may not be able to achieve the exact performance of your measuring tool, and remember that these timings have an error factor due to the overheads mentioned above.
If your platform has DMA controllers, try using these.
If write speed is that important, why not consider RAID 0 for your hardware configuration?
Benchmarking the hardware using ATTO shows a write speed of 2.57 GB per minute (8 MB chunk writes), so why can't a copy tool come close to it? Something like 2 GB per minute is what we are looking at; we've been able to achieve only ~1.5 GB per minute so far.
The right way to do this is with unbuffered, fully asynchronous I/O. You will want to issue multiple I/Os to keep a queue going; this lets the file system, driver, and RAID 5 subsystem manage the I/Os more optimally.
You can also open multiple files and issue reads and writes to multiple files.
NOTE! The optimal number of outstanding I/Os, and how you interleave reads and writes, will depend greatly on the storage subsystem itself. Your program will need to be highly parameterized so you can tune it.
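As a sketch of keeping a queue of unbuffered, overlapped writes going (the queue depth, block size, and path are exactly the kind of parameters you would want to tune):

// Sketch: keep several unbuffered, overlapped writes outstanding at once.
// Queue depth, block size, and file name are tunable placeholders.
// Note: writes that extend the file may complete synchronously - see the
// earlier point about pre-extending the destination.
#include <windows.h>

int main()
{
    const int   depth     = 4;                    // outstanding I/Os to keep in flight
    const DWORD blockSize = 8 * 1024 * 1024;      // per-request write size

    HANDLE h = CreateFileA("D:\\dest.bin", GENERIC_WRITE, 0, NULL, CREATE_ALWAYS,
                           FILE_FLAG_OVERLAPPED | FILE_FLAG_NO_BUFFERING, NULL);
    if (h == INVALID_HANDLE_VALUE) return 1;

    OVERLAPPED ov[depth] = {};
    HANDLE     events[depth];
    void*      bufs[depth];

    for (int i = 0; i < depth; ++i)
    {
        bufs[i] = VirtualAlloc(NULL, blockSize, MEM_COMMIT, PAGE_READWRITE);
        events[i] = CreateEvent(NULL, TRUE, FALSE, NULL);
        ov[i].hEvent = events[i];
        ov[i].Offset = (DWORD)i * blockSize;      // each request targets its own region

        if (!WriteFile(h, bufs[i], blockSize, NULL, &ov[i]) &&
            GetLastError() != ERROR_IO_PENDING)
            return 1;
    }

    // Wait until every queued write has completed.
    WaitForMultipleObjects(depth, events, TRUE, INFINITE);

    for (int i = 0; i < depth; ++i)
    {
        CloseHandle(events[i]);
        VirtualFree(bufs[i], 0, MEM_RELEASE);
    }
    CloseHandle(h);
    return 0;
}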
Note - I believe that Robocopy has been improved - have you tried it?
I did some tests and have some results.
The tests were performed over 100 Mbps and 1 Gbps NICs. The source machine is a Win2K3 server (SATA) and the target machine is a Win2K3 server (RAID 5).
I ran 3 tests:
1) Network Reader -> This program just reads files across the network. The purpose of the program is to find the maximum network read speed. I am performing non-buffered reads using CreateFile & ReadFile.
2) Disk Writer -> This program benchmarks the RAID 5 speed by writing data. NON BUFFERED writes are performed using CreateFile & WriteFile.
3) Blitz Copy -> This program is the file copy engine. It copies files across the network. The logic of this program was discussed in the initial question. I am using synchronous I/O with NO_BUFFERING Reads & Writes. The APIs used are CreateFile, ReadFile & WriteFile.
Below are the results:
NETWORK READER:-
100 Mbps NIC
Took 148344 ms to read 768 MB with chunk size 8 KB.
Took 89359 ms to read 768 MB with chunk size 64 KB
Took 82625 ms to read 768 MB with chunk size 128 KB
Took 79594 ms to read 768 MB with chunk size 256 KB
Took 78687 ms to read 768 MB with chunk size 512 KB
Took 79078 ms to read 768 MB with chunk size 1024 KB
Took 78594 ms to read 768 MB with chunk size 2048 KB
Took 78406 ms to read 768 MB with chunk size 4096 KB
Took 78281 ms to read 768 MB with chunk size 8192 KB
1 Gbps NIC
Took 206203 ms to read 5120 MB (5GB) with chunk size 8 KB
Took 77860 ms to read 5120 MB with chunk size 64 KB
Took 74531 ms to read 5120 MB with chunk size 128 KB
Took 68656 ms to read 5120 MB with chunk size 256 KB
Took 64922 ms to read 5120 MB with chunk size 512 KB
Took 66312 ms to read 5120 MB with chunk size 1024 KB
Took 68688 ms to read 5120 MB with chunk size 2048 KB
Took 64922 ms to read 5120 MB with chunk size 4096 KB
Took 66047 ms to read 5120 MB with chunk size 8192 KB
DISK WRITER:-
Write performed on RAID 5 With NO_BUFFERING & WRITE_THROUGH
Writing 2048MB (2GB) of data with chunk size 4MB took 68328ms.
Writing 2048MB of data with chunk size 8MB took 55985ms.
Writing 2048MB of data with chunk size 16MB took 49569ms.
Writing 2048MB of data with chunk size 32MB took 47281ms.
Write performed on RAID 5 With NO_BUFFERING only
Writing 2048MB (2GB) of data with chunk size 4MB took 57484ms.
Writing 2048MB of data with chunk size 8MB took 52594ms.
Writing 2048MB of data with chunk size 16MB took 49125ms.
Writing 2048MB of data with chunk size 32MB took 46360ms.
Write performance degrades as the chunk size is reduced, and the WRITE_THROUGH flag introduces some performance hit.
BLITZ COPY:-
1 Gbps NIC, Copying 60 GB of files with NO_BUFFERING
Time taken to complete copy: 2236735 ms, i.e., 37.2 mins.
The speed is ~97 GB per hour.
100 Mbps NIC, Copying 60 GB of files with NO_BUFFERING
Time taken to complete copy: 7337219 ms, i.e., 122 mins.
The speed is ~30 GB per hour.
I did try the 10-FileCopy program by Jeffrey Richter, which uses async I/O with NO_BUFFERING. But the results were poor. I guess the reason could be that the chunk size is 256 KB... 256 KB writes on RAID 5 are terribly slow.
Comparing with robocopy:
100 Mbps NIC : Blitz Copy and robocopy perform @ ~30 GB per hour.
1 Gbps NIC : Blitz Copy goes @ ~97 GB per hour while robocopy @ ~50 GB per hour.