Fast concatenation of multiple files on Linux
I am using Python multiprocessing to generate a temporary output file per process. They can be several GBs in size and I make several tens of these. These temporary files need to be concatenated to form the desired output, and this step is proving to be a bottleneck (and a parallelism killer). Is there a Linux tool that will create the concatenated file by modifying the file-system metadata rather than actually copying the content? As long as it works on any Linux system it would be acceptable to me; a file-system-specific solution won't be of much help.
I am not OS or CS trained, but in theory it seems it should be possible to create a new inode, copy over the inode pointer structure from the inodes of the files I want to concatenate, and then unlink those inodes. Is there any utility that will do this? Given the surfeit of well-thought-out Unix utilities, I fully expected such a tool to exist, but could not find anything. Hence my question on SO. The file system is on a block device, a hard disk actually, in case this information matters. I don't have the confidence to write this on my own, as I have never done any systems-level programming before, so any pointers (to C/Python code snippets) will be very helpful.
6 Answers
Even if there were such a tool, it could only work if all files except the last were guaranteed to have a size that is a multiple of the filesystem's block size.
If you control how the data is written into the temporary files, and you know how large each one will be, you can instead do the following:
1. Before starting the multiprocessing, create the final output file and grow it to its final size by fseek()ing to the end; this will create a sparse file.
2. Start multiprocessing, handing each process the FD and the offset into its particular slice of the file.
This way, the processes will collaboratively fill the single output file, removing the need to cat them together later.
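A minimal Python sketch of this scheme, assuming each worker produces exactly CHUNK bytes; the output path, worker count, and the generate_data stand-in are all made up for illustration:

```python
import os
from multiprocessing import Process

OUTPUT = "final.out"        # hypothetical output path
NWORKERS = 4
CHUNK = 64 * 1024 * 1024    # assumed, known-in-advance size of each slice

def generate_data(index):
    # Stand-in for the real per-process computation; must return CHUNK bytes.
    return bytes([index % 256]) * CHUNK

def worker(index):
    # Each process reopens the shared output and writes only inside its own slice.
    with open(OUTPUT, "r+b") as f:
        f.seek(index * CHUNK)
        f.write(generate_data(index))

if __name__ == "__main__":
    # Pre-create the output at its final size; truncate() plays the role of the
    # fseek() trick and yields a sparse file on Linux filesystems.
    with open(OUTPUT, "wb") as f:
        f.truncate(NWORKERS * CHUNK)
    procs = [Process(target=worker, args=(i,)) for i in range(NWORKERS)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

Each worker reopens the file by path instead of inheriting an FD, which is equivalent for this purpose and avoids sharing file offsets between processes.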
EDIT
If you can't predict the size of the individual files, but the consumer of the final file can work with sequential (as opposed to random-access) input, you can feed cat tmpfile1 .. tmpfileN to the consumer, either on stdin or via named pipes (using bash's Process Substitution).
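A small sketch of the stdin variant driven from Python; the ./consumer program and the tmpfile* naming scheme are placeholders:

```python
import glob
import subprocess

# Stream the temp files straight into the consumer's stdin, without ever
# materialising a merged copy on disk.
tmpfiles = sorted(glob.glob("tmpfile*"))                       # assumed naming scheme
cat = subprocess.Popen(["cat", *tmpfiles], stdout=subprocess.PIPE)
consumer = subprocess.Popen(["./consumer"], stdin=cat.stdout)  # hypothetical program
cat.stdout.close()   # so cat receives SIGPIPE if the consumer exits early
consumer.wait()
cat.wait()
```

In the shell this is simply cat tmpfile1 .. tmpfileN | consumer; the named-pipe / process-substitution form is only needed when the consumer insists on a file name rather than reading stdin.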
You indicate that you don't know in advance the size of each temporary file. With this in mind, I think your best bet is to write a FUSE filesystem that would present the chunks as a single large file, while keeping them as individual files on the underlying filesystem.
In this solution, your producing and consuming apps remain unchanged. The producers write out a bunch of files that the FUSE layer makes appear as a single file. This virtual file is then presented to the consumer.
FUSE has bindings for a bunch of languages, including Python. If you look at some examples here or here (these are for different bindings), this requires surprisingly little code.
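A rough, read-only sketch of that idea using the fusepy binding (pip install fusepy); the single virtual file name "joined", the command-line interface, and the permissions are all assumptions, not part of the original answer:

```python
#!/usr/bin/env python
"""Present several part files as one virtual file via FUSE (fusepy sketch)."""
import errno
import os
import stat
import sys

from fuse import FUSE, FuseOSError, Operations  # fusepy

class ConcatFS(Operations):
    def __init__(self, parts):
        self.parts = parts
        self.sizes = [os.path.getsize(p) for p in parts]
        self.total = sum(self.sizes)

    def getattr(self, path, fh=None):
        if path == "/":
            return dict(st_mode=stat.S_IFDIR | 0o755, st_nlink=2)
        if path == "/joined":
            return dict(st_mode=stat.S_IFREG | 0o444, st_nlink=1, st_size=self.total)
        raise FuseOSError(errno.ENOENT)

    def readdir(self, path, fh):
        return [".", "..", "joined"]

    def read(self, path, size, offset, fh):
        # Skip the parts that lie entirely before the requested offset, then
        # stitch together the bytes that fall inside the read window.
        out = bytearray()
        for part, psize in zip(self.parts, self.sizes):
            if offset >= psize:
                offset -= psize
                continue
            with open(part, "rb") as f:
                f.seek(offset)
                piece = f.read(min(size, psize - offset))
            out += piece
            size -= len(piece)
            offset = 0
            if size <= 0:
                break
        return bytes(out)

if __name__ == "__main__":
    mountpoint, parts = sys.argv[1], sys.argv[2:]
    FUSE(ConcatFS(parts), mountpoint, foreground=True, nothreads=True, ro=True)
```

After mounting, e.g. python concatfs.py /mnt/virtual tmpfile1 tmpfile2 ..., the consumer reads /mnt/virtual/joined as if it were a single ordinary file.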
For 4 files (xaa, xab, xac, xad), a fast concatenation in bash (as root):
(Let's suppose that loop0, loop1, loop2, loop3 are the names of the new device files.)
Put http://pastebin.com/PtEDQH7G into a "join_us" script file. Then you can use it like this:
Then (if this big file is a film) you can give its ownership to a normal user (chown itsme /dev/mapper/joined) and then he/she can play it via: mplayer /dev/mapper/joined
The cleanup after these (as root): remove the mapper device and detach the loop devices.
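The linked script is not reproduced here, but the technique it relies on (attaching each file to a loop device and stitching the loop devices together with a device-mapper "linear" table) can be sketched roughly as follows. Run it as root; it assumes every part's size is a multiple of 512 bytes, echoing the block-size caveat from the first answer:

```python
#!/usr/bin/env python
"""Rough sketch: join files via loop devices + device-mapper (run as root)."""
import os
import subprocess
import sys

def join_files(parts, name="joined"):
    table_lines = []
    start = 0  # running offset within the joined device, in 512-byte sectors
    for path in parts:
        sectors = os.path.getsize(path) // 512   # assumes size % 512 == 0
        # Attach the file to the first free loop device and capture its name.
        loopdev = subprocess.check_output(
            ["losetup", "-f", "--show", path], text=True).strip()
        table_lines.append(f"{start} {sectors} linear {loopdev} 0")
        start += sectors
    # dmsetup reads a multi-line mapping table from stdin.
    subprocess.run(["dmsetup", "create", name],
                   input="\n".join(table_lines) + "\n", text=True, check=True)
    return f"/dev/mapper/{name}"

if __name__ == "__main__":
    print(join_files(sys.argv[1:]))   # e.g. join.py xaa xab xac xad
```

Cleanup is the reverse: dmsetup remove joined, then losetup -d on each loop device that was attached.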
I don't think so; file data is block-aligned, so it would only be possible if you were OK with leaving some zeros (or unknown bytes) between one file's tail and the next file's head.
Instead of concatenating these files, I'd suggest redesigning the analysis tool to support reading from multiple files. Take log files for example: many log analyzers support reading one log file per day.
EDIT
@san: Since, as you say, you can't control the code that consumes the files, you can concatenate the separate files on the fly by using named pipes:
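A sketch of that named-pipe approach driven from Python, assuming a hypothetical analysis tool ./analyze that takes a single input path; the FIFO path and the tmpfile* naming scheme are made up:

```python
import glob
import os
import shlex
import subprocess

fifo = "/tmp/merged.fifo"               # hypothetical FIFO path
parts = sorted(glob.glob("tmpfile*"))   # assumed naming scheme for the temp files

os.mkfifo(fifo)
try:
    # Run the feeder through a shell so that opening the FIFO for writing
    # (which blocks until a reader appears) happens in the child process.
    cmd = "cat " + " ".join(shlex.quote(p) for p in parts) + " > " + shlex.quote(fifo)
    feeder = subprocess.Popen(cmd, shell=True)
    # The unmodified analysis tool just sees an ordinary-looking input path.
    subprocess.run(["./analyze", fifo], check=True)
    feeder.wait()
finally:
    os.unlink(fifo)
```

The shell equivalent is roughly: mkfifo /tmp/merged.fifo; cat tmpfile* > /tmp/merged.fifo & ./analyze /tmp/merged.fifo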
No, there is no such tool or syscall.
You might investigate if it's possible for each process to write directly into the final file. Say process 1 writes bytes 0-X, process 2 writes X-2X and so on.
A potential alternative is to cat all your temp files into a named pipe and then use that named pipe as input to your single-input program, as long as that program just reads its input sequentially and doesn't seek.