A more efficient way to find and tar millions of files
I've got a job running on my server at the command line prompt for two days now:
find data/ -name filepattern-*2009* -exec tar uf 2009.tar {} ;
It is taking forever, and then some. Yes, there are millions of files in the target directory. (Each file is a measly 8 bytes in a well hashed directory structure.) But just running...
find data/ -name filepattern-*2009* -print > filesOfInterest.txt
...takes only two hours or so. At the rate my job is running, it won't be finished for a couple of weeks. That seems unreasonable. Is there a more efficient way to do this? Maybe with a more complicated bash script?
A secondary question is "why is my current approach so slow?"
9 Answers
One option is to use cpio to generate a tar-format archive:
cpio works natively with a list of filenames from stdin, rather than a top-level directory, which makes it an ideal tool for this situation.
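A sketch of that pipeline, assuming GNU cpio (-o is copy-out mode, which reads file names from stdin, and --format=ustar writes a tar-compatible archive); the archive name and pattern are taken from the question:

find data/ -name 'filepattern-*2009*' -print | cpio -o --format=ustar > 2009.tar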
If you already did the second command that created the file list, just use the -T option to tell tar to read the file names from that saved file list. Running 1 tar command vs N tar commands will be a lot better.
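For example, assuming filesOfInterest.txt is the list produced by the second command in the question and a tar that supports -T/--files-from (GNU tar does):

tar -uf 2009.tar -T filesOfInterest.txt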
Here's a find-tar combination that can do what you want without the use of xargs or exec (which should result in a noticeable speed-up):
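One plausible form of that combination, assuming GNU find and GNU tar (--null makes -T read NUL-separated names, -T - reads them from the pipe, and --no-recursion keeps tar from descending into directories on its own):

find data/ -name 'filepattern-*2009*' -print0 | tar --null --no-recursion -T - -cf 2009.tar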
There is xargs for this:
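Roughly, with the pattern quoted so the shell does not expand it and NUL-separated names so odd filenames survive:

find data/ -name 'filepattern-*2009*' -print0 | xargs -0 tar -uf 2009.tar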
Guessing why it is slow is hard, as there is not much information: what is the structure of the directory, what filesystem do you use, and how was it configured when it was created? Having millions of files in a single directory is quite a hard situation for most filesystems.
To correctly handle file names with weird (but legal) characters (such as newlines, ...) you should write your file list to filesOfInterest.txt using find's -print0:
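For instance, assuming GNU tar (whose --null option makes -T accept the NUL-separated list):

find data/ -name 'filepattern-*2009*' -print0 > filesOfInterest.txt
tar -uf 2009.tar --null -T filesOfInterest.txt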
The way you currently have things, you are invoking the tar command every single time it finds a file, which is, not surprisingly, slow. Instead of paying the two hours of printing plus the time it takes to open the tar archive, check whether the files are out of date, and add them to the archive, you are effectively multiplying those costs together. You might have better success invoking the tar command once, after you have batched together all the names, possibly using xargs to achieve the invocation. By the way, I hope you are using 'filepattern-*2009*' and not filepattern-*2009*, as the stars will be expanded by the shell without quotes.
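To illustrate both points, the pattern quoted and the names batched (here via find's standard -exec ... {} + form, which, much like the xargs pipeline shown earlier, passes many files to each tar invocation):

find data/ -name 'filepattern-*2009*' -exec tar -uf 2009.tar {} +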
There is a utility for this called tarsplitter. Invoked as sketched below the project link, it will use 8 threads to archive the files matching "folder/*.json" into an output archive of "archive.tar".
https://github.com/AQUAOSOTech/tarsplitter
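A sketch of that invocation; the exact flags (-m for mode, -i for the input pattern, -o for the output archive, -p for the thread count) are assumptions based on the tarsplitter README and should be checked against the version you install:

tarsplitter -m archive -i folder/*.json -o archive.tar -p 8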
I was struggling with Linux for a long time before I found a much easier and potentially faster solution using Python's tarfile library.
Here is my code sample:
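The original code sample is not reproduced on this page; a minimal sketch of the approach, assuming the question's data/ layout and 2009 pattern (tarfile in append mode plus a recursive glob), might look like this:

import glob
import tarfile

# "a" (append) creates 2009.tar if it does not exist yet,
# otherwise it adds to the existing archive.
with tarfile.open("2009.tar", "a") as archive:
    # Recursive glob over the hashed directory tree for the
    # filepattern-*2009* files from the question.
    for path in glob.glob("data/**/filepattern-*2009*", recursive=True):
        archive.add(path)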
This took a total of about 12 seconds to find 16222 filepaths and create the archive; however, that time was predominantly taken up by simply searching for the filepaths. It took just 7 seconds to create the tar archive with 16000 filepaths. With some multithreading this could be much faster.
If you're looking for a multithreaded implementation, I've made one and placed it here:
Of course, you need to make sure that the values of max_threads and filepaths_per_thread are optimized; it takes time to create threads, so the time may actually increase for certain values. A final thing to note is that since we are using append mode, we are automatically creating a new archive with the designated name if one does not already exist. However, if one does already exist, it will simply add to the preexisting archive, not reset it or make a new one.
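The linked multithreaded implementation is not shown here; a rough sketch of how max_threads and filepaths_per_thread could be used is below. Having each worker append to its own part archive is my assumption (a single tarfile object is not safe to share across threads without locking):

import glob
import tarfile
from concurrent.futures import ThreadPoolExecutor

max_threads = 8            # hypothetical tuning values
filepaths_per_thread = 2000

def archive_chunk(index, chunk):
    # Each worker writes its own part archive; append mode creates it if missing.
    with tarfile.open("2009-part%d.tar" % index, "a") as part:
        for path in chunk:
            part.add(path)

filepaths = glob.glob("data/**/filepattern-*2009*", recursive=True)
chunks = [filepaths[i:i + filepaths_per_thread]
          for i in range(0, len(filepaths), filepaths_per_thread)]

with ThreadPoolExecutor(max_workers=max_threads) as pool:
    for i, chunk in enumerate(chunks):
        pool.submit(archive_chunk, i, chunk)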
Simplest (it also removes the files after the archive is created):
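The command itself is missing from this page; a plausible one-liner, assuming GNU tar (its --remove-files option deletes each file once it has been written to the archive):

find data/ -name 'filepattern-*2009*' -print0 | tar --null --no-recursion -T - -cf 2009.tar --remove-files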