What is the most efficient way to copy many files programmatically?

Posted 2025-01-25 22:27:44


Once upon a time, we had a bash script that worked out a list of files that needed to be copied based on some criteria (basically a filtered version of cp -rf).
This was too slow and was replaced by a C++ program.

What the C++ program does is essentially:

foreach file
   read entire file into buffer
   write entire file

The program uses the POSIX calls open(), read() and write() to avoid the buffering and other overheads of iostream and fopen, fread & fwrite.
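
For reference, a minimal sketch of what such a copy loop looks like (the buffer size, error handling, and function name are illustrative assumptions, not the actual program):

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <stdexcept>
#include <string>
#include <vector>

// Copy one file using plain POSIX open/read/write.
// Error handling here is deliberately minimal for the sketch.
void copyFile(const std::string& src, const std::string& dest)
{
    int srcFd = open(src.c_str(), O_RDONLY);
    if (srcFd < 0) throw std::runtime_error("open failed: " + src);

    int destFd = open(dest.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (destFd < 0) { close(srcFd); throw std::runtime_error("open failed: " + dest); }

    std::vector<char> buffer(1 << 20); // 1 MiB chunks; worth tuning per disk
    ssize_t n;
    while ((n = read(srcFd, buffer.data(), buffer.size())) > 0)
    {
        ssize_t off = 0;
        while (off < n) // write() may be partial, so loop until done
        {
            ssize_t w = write(destFd, buffer.data() + off, n - off);
            if (w < 0) { close(srcFd); close(destFd); throw std::runtime_error("write failed"); }
            off += w;
        }
    }
    close(srcFd);
    close(destFd);
}
```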

Is it possible to improve on this?

Notes:

  • I am assuming these are not sparse files
  • I am assuming GNU/Linux
  • I am not assuming a particular filesystem is available
  • I am not assuming prior knowledge of whether the source and destination are on the same disk.
  • I am not assuming prior knowledge of the kind of disk, SSD, HDD maybe even NFS or sshfs.
  • We can assume the source files are on the same disk as each other.
  • We can assume the destination files will also be on the same disk as each other.
  • We cannot assume whether the source and destination are on the same disk or not.

I think the answer is yes but it is quite nuanced.

Copying speed is of course limited by disk I/O, not CPU.

But how can we be sure to optimise our use of disk IO?

Maybe the disk has the equivalent of multiple read or write heads available? (perhaps an SSD?)
In which case performing multiple copies in parallel will help.

Can we determine and exploit this somehow?


This is surely well-trodden territory, so rather than re-inventing the wheel straight away (though that is always fun), it would be nice to hear what others have tried or would recommend.
Otherwise I will try various things and answer my own question sometime in the distant future.

This is what my evolving answer looks like so far...

If the source and destination are different physical disks then
we can at least read and write at the same time with something like:

writer thread
  read from write queue
  write file

reader thread
   foreach file
   read file
   queue write on writer thread
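
The reader/writer pipeline above can be sketched with C++11 threads. This is a hedged sketch, not the real program: the file I/O is stubbed with in-memory strings so that the queue and condition-variable structure is the point, and all names are illustrative:

```cpp
#include <condition_variable>
#include <map>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

struct WriteJob { std::string name; std::string data; };

// A simple bounded-by-nothing job queue: reader pushes, writer pops.
class WriteQueue {
    std::queue<WriteJob> jobs_;
    std::mutex m_;
    std::condition_variable cv_;
    bool done_ = false;
public:
    void push(WriteJob job) {
        { std::lock_guard<std::mutex> lk(m_); jobs_.push(std::move(job)); }
        cv_.notify_one();
    }
    void finish() { // reader signals there is nothing more to come
        { std::lock_guard<std::mutex> lk(m_); done_ = true; }
        cv_.notify_all();
    }
    bool pop(WriteJob& out) { // returns false once finished and drained
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [&] { return !jobs_.empty() || done_; });
        if (jobs_.empty()) return false;
        out = std::move(jobs_.front());
        jobs_.pop();
        return true;
    }
};

// "Reads" each source file and hands it to a writer thread via the queue.
std::map<std::string, std::string>
copyAll(const std::vector<WriteJob>& sourceFiles)
{
    WriteQueue queue;
    std::map<std::string, std::string> destination;

    std::thread writer([&] {           // writer thread: drain queue, "write"
        WriteJob job;
        while (queue.pop(job))
            destination[job.name] = job.data;
    });

    for (const auto& f : sourceFiles)  // reader: "read" each file, enqueue
        queue.push(f);
    queue.finish();

    writer.join();
    return destination;
}
```

In the real version the reader would read() into the job buffer and the writer would open()/write() the destination, so reads and writes to the two physical disks overlap.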

If the source and destination are on the same physical disk and we happen to be on a filesystem
with copy on write semantics (like xfs or btrfs) we can potentially avoid actually copying the file at all.
This is apparently called "reflinking".
The cp command supports this using --reflink=auto.

See also:

From this question

and https://github.com/coreutils/coreutils/blob/master/src/copy.c

it looks as if this is done using an ioctl as in:

ioctl (dest_fd, FICLONE, src_fd);

So a quick win is probably:

try FICLONE on first file.
If it succeeds then:
   foreach file
      srcFD = open(src);
      destFD = open(dest);
      ioctl(destFD,FICLONE,srcFD);
else
   do it the other way - perhaps in parallel

In terms of low-level system APIs we have:

  • copy_file_range
  • ioctl FICLONE
  • sendfile

I am not clear when to choose one over the other, except that copy_file_range is not safe to use with some filesystems, notably procfs.
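
A hedged sketch of using copy_file_range() (Linux >= 4.5, glibc >= 2.27). The call may copy fewer bytes than requested, so it has to be looped, and on error (e.g. EXDEV across filesystems on older kernels, or an unsupported filesystem) the caller should fall back to a plain read/write copy; the function name and return convention are illustrative:

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <string>

// Returns true on success; false means "fall back to another method".
bool copyWithCopyFileRange(const std::string& src, const std::string& dest)
{
    int srcFd = open(src.c_str(), O_RDONLY);
    if (srcFd < 0) return false;

    struct stat st{};
    if (fstat(srcFd, &st) != 0) { close(srcFd); return false; }

    int destFd = open(dest.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (destFd < 0) { close(srcFd); return false; }

    off_t remaining = st.st_size;
    bool ok = true;
    while (remaining > 0)
    {
        // The kernel advances both file offsets when the offset
        // pointers are null, so the loop just tracks bytes left.
        ssize_t n = copy_file_range(srcFd, nullptr, destFd, nullptr,
                                    remaining, 0);
        if (n <= 0) { ok = false; break; } // error or no progress: give up
        remaining -= n;
    }
    close(srcFd);
    close(destFd);
    return ok;
}
```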

This answer gives some advice and suggests sendfile() is intended for sockets, but in fact this is only true for kernels before 2.6.33.

https://www.reddit.com/r/kernel/comments/4b5czd/what_is_the_difference_between_splice_sendfile/

copy_file_range() is useful for copying one file to another (within
the same filesystem) without actually copying anything until either
file is modified (copy-on-write or COW).

splice() only works if one of the file descriptors refer to a pipe. So
you can use for e.g. socket-to-pipe or pipe-to-file without copying
the data into userspace. But you can't do file-to-file copies with it.

sendfile() only works if the source file descriptor refers to
something that can be mmap()ed (i.e. mostly normal files) and before
2.6.33 the destination must be a socket.


There is also a suggestion in a comment that reading multiple files then writing multiple files will result in better performance.
This could use some explanation.
My guess is that it tries to exploit the heuristic that the source files and destination files will be close together on the disk.
I think the parallel reader and writer thread version could perhaps do the same.
The problem with such a design is it cannot exploit any performance gain from the low level system copy APIs.


Comments (2)

如痴如狂 2025-02-01 22:27:44


The general answer is: Measure before trying another strategy.

For HDD this is probably your answer: https://unix.stackexchange.com/questions/124527/speed-up-copying-1000000-small-files

吐个泡泡 2025-02-01 22:27:44


Ultimately I did not determine the "most efficient" way but I did end up with a solution that was sufficiently fast for my needs.

  1. generate a list of files to copy and store it

  2. copy files in parallel using OpenMP

    #pragma omp parallel for
    for (auto iter = filesToCopy.begin(); iter < filesToCopy.end(); ++iter)
    {
       copyFile(*iter);
    }
    
  3. copy each file using copy_file_range()

  4. fall back to using splice() with a pipe() when compiling for older platforms that do not support copy_file_range().

Reflinking, as supported by copy_file_range(), avoids copying entirely when the source and destination are on the same filesystem, and is a massive win.
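
The splice() fallback in step 4 can be sketched as follows. splice() cannot do file-to-file directly, but file -> pipe -> file works, moving the data inside the kernel without a userspace buffer; this is Linux-only and the function name and chunk size are illustrative:

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <unistd.h>
#include <string>

// Copy src to dest via an intermediate pipe using splice().
// Returns true on success; false means "fall back to read/write".
bool copyWithSplice(const std::string& src, const std::string& dest)
{
    int srcFd = open(src.c_str(), O_RDONLY);
    if (srcFd < 0) return false;
    int destFd = open(dest.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (destFd < 0) { close(srcFd); return false; }

    int pipeFds[2];
    if (pipe(pipeFds) != 0) { close(srcFd); close(destFd); return false; }

    bool ok = true;
    for (;;)
    {
        // file -> pipe: at most 64 KiB per round (default pipe capacity)
        ssize_t n = splice(srcFd, nullptr, pipeFds[1], nullptr, 1 << 16, 0);
        if (n < 0) { ok = false; break; }
        if (n == 0) break; // EOF
        while (n > 0)      // pipe -> file; may drain in several chunks
        {
            ssize_t w = splice(pipeFds[0], nullptr, destFd, nullptr, n, 0);
            if (w <= 0) { ok = false; break; }
            n -= w;
        }
        if (!ok) break;
    }
    close(pipeFds[0]); close(pipeFds[1]);
    close(srcFd); close(destFd);
    return ok;
}
```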
