cpio vs. tar and cp
I just learned that cpio has three modes: copy-out, copy-in and pass-through.
I was wondering what are the advantages and disadvantages of cpio under copy-out and copy-in modes over tar. When is it better to use cpio and when to use tar?
Similar question for cpio under pass-through mode versus cp.
Thanks and regards!
3 Answers
I see no reason to use cpio for any reason other than ripping opened RPM files, via disrpm or rpm2cpio (http://www.rpm.org/max-rpm/s1-rpm-miscellania-rpm2cpio.html), but there may be corner cases in which cpio is preferable to tar.
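For reference, that rpm2cpio route typically looks something like this (package.rpm is a placeholder file name):
# unpack the payload of an RPM into the current directory
rpm2cpio package.rpm | cpio -idmv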
History and popularity
Both tar and cpio are competing archive formats that were introduced in Version 7 Unix in 1979 and then included in POSIX.1-1988, though only tar remained in the next standard, POSIX.1-2001.
Cpio's file format has changed several times and has not remained fully compatible between versions. For example, there is now an ASCII-encoded representation of the file header information that was originally binary.
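As a concrete illustration (a sketch assuming GNU cpio; archive.cpio is a placeholder name), the format is chosen explicitly when the archive is created, and -H newc selects the portable ASCII format:
# create a cpio archive in the "new ASCII" (newc) format from a list of names on stdin
find . -print | cpio -o -H newc > archive.cpio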
Tar is more universally known, has become more versatile over the years, and is more likely to be supported on a given system. Cpio is still used in a few areas, such as the Red Hat package format (RPM), though RPM v5 (which is admittedly obscure) uses xar instead of cpio.
Both live on most Unix-like systems, though tar is more common, as Debian's install stats show.
Modes
Copy-out: This is for archive creation, akin to
tar -pc
Copy-in: This is for archive extraction, akin to
tar -px
Pass-through: This is basically both of the above, akin to
tar -pc … | tar -px
but in a single command (and therefore microscopically faster). It's similar to cp -pdr, though both cpio and (especially) tar have more customizability. Also consider rsync -a, which people often forget since it's more typically used across a network connection. A rough command-line sketch of each mode follows below.
I have not compared their performance, but I expect they'll be quite similar in CPU, memory, and archive size (after compression).
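For concreteness, here is a rough sketch of what each mode looks like on the command line (/tmp/tree.cpio and /new/location are placeholder names; GNU cpio option spellings are assumed):
# copy-out: read a list of path names from stdin and write an archive to stdout
find . -depth -print | cpio -o > /tmp/tree.cpio
# copy-in: extract from stdin, creating directories (-d) and preserving mtimes (-m)
cpio -idm < /tmp/tree.cpio
# pass-through: copy the listed files straight into another directory tree
find . -depth -print | cpio -pdm /new/location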
TAR(1) is just as good as cpio(1), if not better. One can argue that it is, in fact, better than cpio because it is ubiquitous and vetted. There's got to be a reason why we have tarballs everywhere.
Why is cpio better than tar? A number of reasons.
When scripting, it has much better control over which files are and are not copied, since you must explicitly list the files you want copied. For example, which of the following is easier to read and understand?
find . -type f -name '*.sh' -print | cpio -o | gzip >sh.cpio.gz
or on Solaris:
find . -type f -name '*.sh' -print >/tmp/includeme
tar -cf - . -I /tmp/includeme | gzip >sh.tar.gz
or with gnutar:
find . -type f -name '*.sh' -print >/tmp/includeme
tar -cf - . --files-from=/tmp/includeme | gzip >sh.tar.gz
A couple of specific notes here: for large lists of files, you can't put find in reverse quotes; the command-line length will be overrun; you must use an intermediate file. Separate find and tar commands are inherently slower, since the actions are done serially.
Consider this more complex case where you want a tree completely packaged up, but some files in one tar, and the remaining files in another.
find . -depth -print >/tmp/files
egrep '\.sh$' /tmp/files | cpio -o | gzip >with.cpio.gz
egrep -v '\.sh$' /tmp/files | cpio -o | gzip >without.cpio.gz
or under Solaris:
find . -depth -print >/tmp/files
egrep '\.sh$' /tmp/files >/tmp/with
tar -cf - . -I /tmp/with | gzip >with.tar.gz
tar -cf - . /tmp/without | gzip >without.tar.gz
## ^^-- no there's no missing argument here. It's just empty that way
or with gnutar:
find . -depth -print >/tmp/files
egrep '\.sh$' /tmp/files >/tmp/with
tar -cf - . -I /tmp/with | gzip >with.tar.gz
tar -cf - . -X /tmp/without | gzip >without.tar.gz
Again, some notes: Separate find and tar commands are inherently slower. Creating more intermediate files creates more clutter. gnutar feels a little cleaner, but the command-line options are inherently incompatible!
If you need to copy a lot of files from one machine to another in a hurry across a busy network, you can run multiple cpio's in parallel. For example:
find . -depth -print >/tmp/files
split /tmp/files /tmp/files
for F in /tmp/files?? ; do
  cat $F | cpio -o | ssh destination "cd /target && cpio -idum" &
done
Note that it would help if you could split the input into even-sized pieces. I created a utility called 'npipe' to do this. npipe would read lines from stdin, and create N output pipes and feed the lines to them as each line was consumed. This way, if the first entry was a large file that took 10 minutes to transfer and the rest were small files that took 2 minutes to transfer, you wouldn't get stalled waiting for the large file plus another dozen small files queued up behind it. This way you end up splitting by demand, not strictly by number of lines or bytes in the list of files. Similar functionality could be accomplished with gnu-xargs' parallel forking capability, except that puts arguments on the command-line instead of streaming them to stdin (a rough xargs sketch appears at the end of this answer).
find . -depth -print >/tmp/files
npipe -4 /tmp/files 'cpio -o | ssh destination "cd /target && cpio -idum"'
How is this faster? Why not use NFS? Why not use rsync? NFS is inherently very slow, but more importantly, the use of any single tool is inherently single-threaded. rsync reads in the source tree and writes to the destination tree one file at a time. If you have a multi-processor machine (at the time I was using 16 CPUs per machine), parallel writing became very important. I got the copy of an 8GB tree down to 30 minutes; that's 4.6MB/sec! Sure it sounds slow since a 100Mbit network can easily do 5-10MB/sec, but it's the inode creation time that makes it slow; there were easily 500,000 files in this tree. So if inode creation is the bottleneck, then I needed to parallelize that operation. By comparison, copying the files in a single-threaded manner would take 4 hours. That's 8x faster!
A secondary reason that this was faster is that parallel TCP pipes are less vulnerable to a lost packet here and there. If one pipe gets stalled because of a lost packet, the others will generally not be affected. I'm not really sure how much this made a difference, but for finely multi-threaded kernels, this can again be more efficient since the workload can be spread across all those idle CPUs.
In my experience, cpio does an overall better job than tar, as well as being more argument-portable (arguments don't change between versions of cpio!), though it may not be found on some systems (it's not installed by default on RedHat), but then again Solaris doesn't come with gzip by default either.
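As a footnote to the gnu-xargs alternative mentioned above, here is a rough sketch (not the author's npipe tool) using GNU xargs -P. The batch size of 5000 names and the 4 parallel jobs are arbitrary choices, and destination and /target are the same placeholders as in the examples above; unlike npipe, this splits the list by count rather than by demand:
find . -depth -print >/tmp/files
# fan the name list out to up to 4 concurrent cpio|ssh pipelines, 5000 names per batch
# (like the plain -print pipelines above, this assumes file names without whitespace or quotes)
xargs -n 5000 -P 4 sh -c 'printf "%s\n" "$@" | cpio -o | ssh destination "cd /target && cpio -idum"' _ </tmp/files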