How can I merge multiple files into one without intermediate files?

Posted on 2024-09-30 08:07:15

Comments (8)

独孤求败 2024-10-07 08:07:15

If you don't need random access into the final big file (i.e., you just read it through once from start to finish), you can make your hundreds of intermediate files appear as one. Where you would normally do

$ consume big-file.txt

instead do

$ consume <(cat file1 file2 ... fileN)

This uses Unix process substitution, sometimes also called "anonymous named pipes."

You may also be able to save time and space by splitting your input and doing the processing at the same time; GNU Parallel has a --pipe switch that will do precisely this. It can also reassemble the outputs back into one big file, potentially using less scratch space as it only needs to keep number-of-cores pieces on disk at once. If you are literally running your hundreds of processes at the same time, Parallel will greatly improve your efficiency by letting you tune the amount of parallelism to your machine. I highly recommend it.
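For example, here is a rough sketch of that --pipe workflow (my addition, not part of the original answer); process_lines is a hypothetical line-oriented filter that reads stdin and writes stdout:

$ cat file1 file2 ... fileN | parallel --pipe --block 64M --keep-order process_lines > big-result.txt

--pipe chops stdin into records on newline boundaries, --block sets the approximate chunk size handed to each job, and --keep-order reassembles the outputs in input order.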

临走之时 2024-10-07 08:07:15

When concatenating files back together, you could delete the small files as they get appended:

for file in file1 file2 file3 ... fileN; do
  cat "$file" >> bigFile && rm "$file"
done

This would avoid needing double the space.

There is no other way of magically concatenating the files. The filesystem API simply doesn't have a function that does that.

我只土不豪 2024-10-07 08:07:15

I believe this is the fastest way to cat all the files contained in the same folder:

$ ls [path to folder] | while read -r p; do cat "$p"; done
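As a side note (my addition, not part of the answer), parsing the output of ls breaks on unusual filenames; a shell glob avoids that and hands everything to a single cat process:

$ cat [path to folder]/* > bigFile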

淡淡绿茶香 2024-10-07 08:07:15

Maybe dd would be faster because you wouldn't have to pass stuff between cat and the shell. Something like:

mv file1 newBigFile
# GNU dd counts seek= in output blocks by default; oflag=seek_bytes makes it a byte
# offset, and conv=notrunc keeps the data already in newBigFile
dd if=file2 of=newBigFile bs=1M conv=notrunc oflag=seek_bytes seek=$(stat -c %s newBigFile)
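As an aside (my sketch, not the answerer's), GNU dd can also open the output in append mode, which sidesteps the seek arithmetic entirely and extends naturally to many files; the rm only runs if the append succeeded:

mv file1 newBigFile
for f in file2 file3 ... fileN; do
  dd if="$f" of=newBigFile bs=1M oflag=append conv=notrunc status=none && rm "$f"
done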

反话 2024-10-07 08:07:15

all I really need is for the hundreds of files to reappear as 1 file...

The reason it isn't practical to just join files that way at a filesystem level is that text files don't usually fill a disk block exactly, so the data in subsequent files would have to be moved up to fill in the gaps, causing a bunch of reads/writes anyway.

魄砕の薆 2024-10-07 08:07:15

Is it possible for you to simply not split the file? Instead, process the file in chunks by setting the file pointer in each of your parallel workers. If the file needs to be processed in a line-oriented way, that makes it trickier, but it can still be done. Each worker needs to understand that rather than starting exactly at the offset you give it, it must first scan forward byte by byte to the position just past the next newline. Each worker must also understand that it does not process exactly the set number of bytes you give it, but must process up to the first newline after the byte range it was allocated.

The actual allocation and setting of the file pointer is pretty straightforward. If there are n workers, each one processes file_size/n bytes, and its file pointer starts at worker_number * (file_size/n).

Is there some reason that kind of plan is not sufficient?
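For what it's worth (my addition, not part of the answer), GNU Parallel's --pipepart implements essentially this scheme for a seekable file: it hands each job a byte range aligned on newlines without ever splitting the file on disk. process_chunk below is a hypothetical stand-in for the real worker command:

$ parallel --pipepart -a big-file.txt --block 100M --keep-order process_chunk > results.txt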

坠似风落 2024-10-07 08:07:15

Fast, but not free, solution? Get an SSD or PCIe-based flash storage. If this is something that has to be done on a regular basis, increasing disk IO speed is going to be the most cost-effective and fastest speedup you can get.

假情假意假温柔 2024-10-07 08:07:15

There is such a thing as too much concurrency.

A better way of doing this would be to use random-access reads into the file over the desired ranges, never actually split it up, and process only as many files as there are physical CPUs/cores in the machine. That is, unless that is swamping the disk with IOPS as well; then you should cut back until the disk isn't the bottleneck.

Either way, all the naive splitting/copying/deleting generates tonnes of IOPS, and there is no way around the physics of it.

A transparent solution, probably more work than it's worth unless this is an ongoing daily problem, is to write a custom FUSE filesystem that represents a single file as multiple files. There are lots of examples of handling archive file contents as individual files that would show you the basics of how to do this.
