More efficient way to find and tar millions of files

Posted 2024-08-30 04:00:44

I've got a job running on my server at the command line prompt for two days now:

find data/ -name filepattern-*2009* -exec tar uf 2009.tar {} \;

It is taking forever, and then some. Yes, there are millions of files in the target directory. (Each file is a measly 8 bytes in a well-hashed directory structure.) But just running...

find data/ -name filepattern-*2009* -print > filesOfInterest.txt

...takes only two hours or so. At the rate my job is running, it won't be finished for a couple of weeks. That seems unreasonable. Is there a more efficient way to do this? Maybe with a more complicated bash script?

A secondary question is "why is my current approach so slow?"


Comments (9)

静谧 2024-09-06 04:00:44

One option is to use cpio to generate a tar-format archive:

$ find data/ -name "filepattern-*2009*" | cpio -ov --format=ustar > 2009.tar

cpio works natively with a list of filenames from stdin, rather than a top-level directory, which makes it an ideal tool for this situation.
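
If you want to sanity-check the result, tar itself can list the archive that cpio wrote, since the output is ustar format (a quick check, assuming a tar that reads ustar, which both GNU and BSD tar do):

tar tvf 2009.tar | head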

半透明的墙 2024-09-06 04:00:44

If you already did the second command that created the file list, just use the -T option to tell tar to read the file names from that saved file list. Running 1 tar command vs N tar commands will be a lot better.
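
For example, reusing the list from the question (assuming GNU tar, where -T/--files-from reads names from a file, and u keeps the update semantics of the original command):

tar uf 2009.tar -T filesOfInterest.txt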

神魇的王 2024-09-06 04:00:44

Here's a find-tar combination that can do what you want without the use of xargs or exec (which should result in a noticeable speed-up):

tar --version    # tar (GNU tar) 1.14 

# FreeBSD find (on Mac OS X)
find -x data -name "filepattern-*2009*" -print0 | tar --null --no-recursion -uf 2009.tar --files-from -

# for GNU find use -xdev instead of -x
gfind data -xdev -name "filepattern-*2009*" -print0 | tar --null --no-recursion -uf 2009.tar --files-from -

# added: set permissions via tar
find -x data -name "filepattern-*2009*" -print0 | \
    tar --null --no-recursion --owner=... --group=... --mode=... -uf 2009.tar --files-from -
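
With GNU tar, --null makes --files-from read the NUL-terminated names produced by find's -print0, and --no-recursion stops tar from descending into any directory the list happens to name, so each entry is added exactly once.
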
夏了南城 2024-09-06 04:00:44

There is xargs for this:

find data/ -name 'filepattern-*2009*' -print0 | xargs -0 tar uf 2009.tar

Guessing why it is slow is hard, as there is not much information. What is the structure of the directory, what filesystem do you use, and how was it configured when created? Having millions of files in a single directory is quite a hard situation for most filesystems.
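
One caveat: when the argument list is too long, xargs will invoke tar more than once; the u (update/append) flag makes repeated invocations safe, whereas c would truncate the archive each time.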

苏璃陌 2024-09-06 04:00:44

To correctly handle file names with weird (but legal) characters (such as newlines, ...) you should write your file list to filesOfInterest.txt using find's -print0:

find -x data -name "filepattern-*2009*" -print0 > filesOfInterest.txt
tar --null --no-recursion -uf 2009.tar --files-from filesOfInterest.txt 

半葬歌 2024-09-06 04:00:44

The way you currently have things, you are invoking the tar command every single time find locates a file, which is unsurprisingly slow. Instead of taking the two hours to print plus the amount of time it takes to open the tar archive, see if the files are out of date, and add them to the archive, you are actually multiplying those times together. You might have better success invoking the tar command once, after you have batched together all the names, possibly using xargs to achieve the invocation. By the way, I hope you are using 'filepattern-*2009*' and not filepattern-*2009*, as the stars will be expanded by the shell when unquoted.
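
Putting those two points together, a single quoted invocation might look like this (a sketch; it mirrors the xargs answer above):

find data/ -name 'filepattern-*2009*' -print0 | xargs -0 tar uf 2009.tar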

丶情人眼里出诗心の 2024-09-06 04:00:44

There is a utility for this called tarsplitter.

tarsplitter -m archive -i folder/*.json -o archive.tar -p 8

will use 8 threads to archive the files matching "folder/*.json" into an output archive named "archive.tar".

https://github.com/AQUAOSOTech/tarsplitter

只涨不跌 2024-09-06 04:00:44

I was struggling with Linux for a long time before I found a much easier and potentially faster solution using Python's tarfile library.

  1. Use glob.glob to search for the desired filepaths
  2. Create a new archive in append mode
  3. Add each filepath to this archive
  4. Close the archive

Here is my code sample:

import tarfile
import glob
from tqdm import tqdm

filepaths = glob.glob("Images/7 *.jpeg")
n = len(filepaths)
print("{} files found.".format(n))
print("Creating Archive...")
# Append mode ("a") only works on uncompressed archives, so despite the
# .tar.gz name this writes a plain (uncompressed) tar file.
out = tarfile.open("Images.tar.gz", mode="a")
for filepath in tqdm(filepaths, "Appending files to the archive..."):
  try:
    out.add(filepath)
  except Exception:
    print("Failed to add: {}".format(filepath))

print("Closing the archive...")
out.close()

This took a total of about 12 seconds to find 16222 filepaths and create the archive; however, the time was predominantly taken up by simply searching for the filepaths. It took just 7 seconds to create the tar archive with 16000 filepaths. With some multithreading, this could be much faster.

If you're looking for a multithreaded implementation, I've made one and placed it here:

import tarfile
import glob
from tqdm import tqdm
import threading

filepaths = glob.glob("Images/7 *.jpeg")
n = len(filepaths)
print("{} files found.".format(n))
print("Creating Archive...")
# As above, append mode keeps the archive uncompressed.
out = tarfile.open("Images.tar.gz", mode="a")

def add(filepath):
  try:
    out.add(filepath)
  except Exception:
    print("Failed to add: {}".format(filepath))

def add_multiple(filepaths):
  for filepath in filepaths:
    add(filepath)

max_threads = 16
filepaths_per_thread = 16

interval = max_threads * filepaths_per_thread

# Process the file list in batches of max_threads * filepaths_per_thread,
# giving each thread a contiguous slice of filepaths_per_thread names.
# Caveat: tarfile objects are not documented as thread-safe, so concurrent
# add() calls on one archive may interleave.
for i in tqdm(range(0, n, interval), "Appending files to the archive..."):
  threads = [threading.Thread(target = add_multiple, args = (filepaths[j:j + filepaths_per_thread],)) for j in range(i, min([n, i + interval]), filepaths_per_thread)]
  for thread in threads:
    thread.start()
  for thread in threads:
    thread.join()

print("Closing the archive...")
out.close()

Of course, you need to make sure that the values of max_threads and filepaths_per_thread are optimized; it takes time to create threads, so the time may actually increase for certain values. A final thing to note is that since we are using append mode, we are automatically creating a new archive with the designated name if one does not already exist. However, if one does already exist, it will simply add to the preexisting archive, not reset it or make a new one.
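
Assuming the first script above is saved as make_archive.py (a hypothetical name), a quick way to run it and spot-check the result from the shell:

python3 make_archive.py
tar tf Images.tar.gz | wc -l    # tar auto-detects the format when listing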

送舟行 2024-09-06 04:00:44

Simplest (also removes the files after archive creation):

find *.1  -exec tar czf '{}.tgz' '{}' --remove-files \;
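
Note that this invokes tar once per match, producing a separate compressed .tgz archive for each file rather than one combined archive.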