bash: processing a list of files in chunks
The setting:
I have a few hundred files, named something like input0.dat, input1.dat, ..., input150.dat, which I need to process using some command cmd (which basically merges the contents of all files). cmd takes the output filename as its first argument, followed by a list of all input filenames:
./cmd output.dat input1.dat input2.dat [...] input150.dat
The problem:
The problem is that the script can only handle about 10 files at a time due to memory issues (don't blame me for that). Thus, instead of using bash wildcard expansion like
./cmd output.dat *dat
I need to do something like
./cmd temp_output0.dat file0.dat file1.dat [...] file9.dat
[...]
./cmd temp_outputN.dat fileN0.dat fileN1.dat [...] fileN9.dat
Afterwards I can merge the temporary outputs:
./cmd output.dat temp_output0.dat [...] temp_outputN.dat
How do I script this efficiently in bash?
I tried, without success, e.g.
for filename in `echo *dat | xargs -n 3`; do [...]; done
The problem is that this again processes all files at once, because the output lines of xargs get concatenated.
EDIT: Note that I need to specify an output filename as the first command line argument when calling cmd!
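To make the failure in the attempt above concrete, here is a small sketch (the letters a to g are placeholders for file names, not the real inputs). It suggests that the grouping produced by xargs is lost as soon as the command substitution is word-split, whereas reading the same output line by line keeps each group together.

```bash
#!/usr/bin/env bash
# Demonstration only: the letters a..g stand in for file names.

# Unquoted command substitution: the shell re-splits the xargs output on all
# whitespace (newlines included), so the loop runs once per word.
count_words=0
for chunk in $(echo a b c d e f g | xargs -n 3); do
    count_words=$((count_words + 1))
done

# Reading line by line keeps each group of up to three names together.
count_lines=0
last_chunk=""
while read -r chunk; do
    count_lines=$((count_lines + 1))
    last_chunk=$chunk
done < <(echo a b c d e f g | xargs -n 3)

echo "for loop iterations:   $count_words"
echo "while-read iterations: $count_lines"
```

The second loop reads from a process substitution, so it needs bash rather than plain sh.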
Comments (5)
Edit: without a pipe or process substitution (requires bash). This is able to deal with files that have spaces in their names. Use a bash array and extract it in slices:
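The snippet that belonged here is not preserved on this page, so the following is only a minimal sketch of the array-slice idea, not the answerer's original code. The merge function is a hypothetical stand-in for ./cmd (it just concatenates), the chunk size of 3 and the .tmp names are assumptions, and everything runs in a scratch directory.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for ./cmd: first argument is the output file,
# the remaining arguments are inputs whose contents get concatenated.
merge() { local out=$1; shift; cat -- "$@" > "$out"; }

workdir=$(mktemp -d)
cd "$workdir" || exit 1
# Sample inputs with spaces in their names.
for n in 0 1 2 3 4 5 6; do echo "data $n" > "input $n.dat"; done

files=(input*.dat)      # glob into an array; spaces in names are preserved
chunk=3                 # files per call (the question needs about 10)
temps=()
for ((i = 0; i < ${#files[@]}; i += chunk)); do
    tmp="temp_output$((i / chunk)).tmp"
    merge "$tmp" "${files[@]:i:chunk}"   # slice: up to $chunk elements from index i
    temps+=("$tmp")
done
merge output.dat "${temps[@]}"           # final pass over the temporary outputs
rm -f -- "${temps[@]}"
```

The temporary files use a .tmp suffix so that they can never match the input glob. If the number of temporary files itself exceeds what cmd can handle, the final merge would need to be chunked in the same way.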
Using a fifo (this is not capable of dealing with spaces in filenames):
You need to use a fifo to keep the value of the i variable, as well as the set of files for the final concatenation. If you want, you can run the inner invocation of ./cmd in the background and put a wait before the last invocation of cmd:
Update:
If you want to avoid using a fifo entirely, you can use process substitution to emulate it, so rewriting the first one as:
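The original code is missing from this page; what follows is a sketch of how a process-substitution version might look, under the same assumptions as the text (no spaces in file names). The merge function is a hypothetical stand-in for ./cmd, and the chunk size of 3 and the .tmp names are illustrative.

```bash
#!/usr/bin/env bash
merge() { local out=$1; shift; cat -- "$@" > "$out"; }   # hypothetical stand-in for ./cmd

workdir=$(mktemp -d)
cd "$workdir" || exit 1
for n in 0 1 2 3 4 5 6; do echo "data $n" > "input$n.dat"; done

i=0
opfiles=""
# Each line produced by xargs holds up to 3 names. The loop reads from a
# process substitution rather than a pipe, so it runs in the current shell
# and i and opfiles are still set after "done".
while read -r chunk; do
    # $chunk is deliberately unquoted: word splitting turns the line back
    # into separate names, which is why names containing spaces break.
    merge "temp_output$i.tmp" $chunk
    opfiles="$opfiles temp_output$i.tmp"
    i=$((i + 1))
done < <(printf '%s\n' input*.dat | xargs -n 3)

merge output.dat $opfiles    # unquoted on purpose, same caveat as above
rm -f $opfiles
```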
Again, this avoids piping into the while loop and reads from a redirection instead, so that the opfiles variable is kept after the while loop.
Try the following, it should work for you:
EDIT: In response to your comment:
That would send no more than three files at a time to ./cmd, while going over all files from file00.dat to file99.dat, and having 10 different output files, output1.dat to output9.dat.
I know that this question was answered and accepted a long time ago, but I find that there is a simpler solution than those offered so far.
For more fine-grained control, or to manipulate your string further, use the following form (substitute bash to your liking):
To parallelize the output (say, on 2 threads):
NOTE: This will not work for files that have spaces in them.
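The commands of this answer did not survive on this page. As a rough sketch of the kind of xargs call it seems to describe, the example below combines chunking (-n), a bash -c wrapper for finer control, and two parallel jobs (-P 2, supported by GNU and BSD xargs). The merge function is a hypothetical stand-in for ./cmd, and naming each temporary file after the first input of its chunk is my own assumption.

```bash
#!/usr/bin/env bash
merge() { local out=$1; shift; cat -- "$@" > "$out"; }   # hypothetical stand-in for ./cmd
export -f merge     # make the function visible to the bash started by xargs

workdir=$(mktemp -d)
cd "$workdir" || exit 1
for n in 0 1 2 3 4 5 6; do echo "data $n" > "input$n.dat"; done

# -n 3: at most 3 names per invocation; -P 2: up to 2 invocations at once.
# Inside bash -c, "_" fills $0 and the names arrive as "$@". The temporary
# output is named after the first input of each chunk so that parallel jobs
# never write to the same file.
printf '%s\n' input*.dat |
    xargs -n 3 -P 2 bash -c 'merge "chunk_${1%.dat}.tmp" "$@"' _

merge output.dat chunk_*.tmp      # glob order follows the first input of each chunk
rm -f chunk_*.tmp
```

As the note above says, this splits on whitespace, so it is only suitable for file names without spaces.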
I'm using this quick solution I found in the bash manpage. It looks like others exist too. Unlike xargs -n, this should handle spaces in filenames properly.
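The snippet itself is not preserved here, so this is only a guess at the technique: slicing the positional parameters with "${@:offset:length}", which the bash manual documents under parameter expansion. The merge function is a hypothetical stand-in for ./cmd, and the chunk size and file names are illustrative.

```bash
#!/usr/bin/env bash
merge() { local out=$1; shift; cat -- "$@" > "$out"; }   # hypothetical stand-in for ./cmd

workdir=$(mktemp -d)
cd "$workdir" || exit 1
# Sample inputs with spaces in their names.
for n in 0 1 2 3 4 5 6; do echo "data $n" > "input $n.dat"; done

set -- input*.dat       # load the file list into "$@"
chunk=3
i=0
temps=()
while (( $# > 0 )); do
    tmp="temp_output$i.tmp"
    merge "$tmp" "${@:1:chunk}"                # first $chunk positional parameters
    temps+=("$tmp")
    shift "$(( $# < chunk ? $# : chunk ))"     # never shift by more than what is left
    i=$((i + 1))
done
merge output.dat "${temps[@]}"
rm -f -- "${temps[@]}"
```

Because the names stay quoted inside "$@" the whole time, spaces in filenames are not a problem.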
GNU Parallel is excellent at "chunking things up" and generating input/output filenames and counters. This will take 3 files at a time (-N3) and generate an intermediate output file that is sequentially numbered and contains the merged contents. And it does it in parallel for you, making use of all those CPU cores that you paid Intel so handsomely for:
To see it in action, use the --dry-run option:
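The command and its sample output are missing from this page. The sketch below shows roughly what such an invocation could look like, using the -N, {#}, -k and --dry-run features of GNU Parallel; the temp_output names are assumptions, and cat stands in for ./cmd in the real run. The block is skipped when GNU Parallel is not installed.

```bash
#!/usr/bin/env bash
workdir=$(mktemp -d)
cd "$workdir" || exit 1
for n in 0 1 2 3 4 5 6; do echo "data $n" > "input$n.dat"; done

plan=""
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # -N3: pass up to 3 arguments per job; {#}: job sequence number;
    # {}: the arguments of this job; -k: keep output in job order.
    # --dry-run only prints the commands, so ./cmd need not exist here.
    plan=$(parallel -k -N3 --dry-run ./cmd temp_output{#}.dat {} ::: input*.dat)
    printf '%s\n' "$plan"

    # Real run, with cat standing in for ./cmd (an assumption for this demo).
    parallel -N3 'cat {} > temp_output{#}.tmp' ::: input*.dat
    cat temp_output*.tmp > output.dat
fi
```

The final step of merging the intermediate files is done here with a plain cat; with the real cmd it would be one more ./cmd call on the numbered outputs.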