bash: process a list of files in chunks

Posted on 2024-12-28 06:21:04

The setting:

I have a few hundred files, named something like input0.dat, input1.dat, ..., input150.dat, which I need to process using some command cmd (which basically merges the contents of all files). cmd takes the output filename as its first argument, followed by the list of all input filenames:

./cmd output.dat input1.dat input2.dat [...] input150.dat

The problem:

The problem is that the script can only handle about 10 files at a time due to memory issues (don't blame me for that). Thus, instead of using bash wildcard expansion like

./cmd output.dat *dat

I need to do something like

./cmd temp_output0.dat file0.dat file1.dat [...] file9.dat
[...]
./cmd temp_outputN.dat fileN0.dat fileN1.dat [...] fileN9.dat

Afterwards I can merge the temporary outputs:

./cmd output.dat temp_output0.dat [...] temp_outputN.dat

How do I script this efficiently in bash?

I tried, without success, e.g.

for filename in `echo *dat | xargs -n 3`; do [...]; done

The problem is that the grouping is lost: the output lines of xargs get joined by word splitting, so the loop sees one flat list of all the filenames again.

EDIT: Note that I need to specify an output filename as the first command-line argument when calling cmd!

叹沉浮 2025-01-04 06:21:04

Edit: a version without a pipe or process substitution (requires bash). It copes with spaces in filenames. Put the names in a bash array and take slices of three:

i=0
infiles=(*dat)
opfiles=()
# The echo makes this a dry run that only prints the commands;
# remove it to actually call ./cmd.
while ((${#infiles[@]})); do
    threefiles=("${infiles[@]:0:3}")
    echo ./cmd tmp_output$i.dat "${threefiles[@]}"
    opfiles+=("tmp_output$i.dat")
    ((i++))
    infiles=("${infiles[@]:3}")
done
echo ./cmd output.dat "${opfiles[@]}"
# Remove the temporary files (they only exist once the echo is dropped).
rm -f "${opfiles[@]}"
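One thing the slice approach leaves open: with 150 inputs and a limit of about 10 files per call, the first pass already produces 15 temporary files, which is again more than cmd can take in the final merge. Here is a minimal sketch that repeats the chunked merge until the list is short enough. It assumes cmd simply concatenates its inputs in order; the cmd function below is only a stand-in for the real ./cmd, and merge_in_chunks is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Stand-in for the real ./cmd (assumption): the first argument is the output
# file, the remaining arguments are inputs whose contents are concatenated.
cmd() { local out=$1; shift; cat -- "$@" > "$out"; }

# merge_in_chunks SIZE OUTPUT INPUT...
# Merges the INPUT files into OUTPUT without ever passing more than SIZE
# inputs to cmd, repeating the chunked pass as often as needed.
merge_in_chunks() {
    local size=$1 out=$2; shift 2
    # SIZE must be at least 2, otherwise a pass never shortens the list.
    (( size >= 2 && $# > 0 )) || return 2
    local -a current=("$@") next=() temps=()
    local i t
    while (( ${#current[@]} > size )); do
        next=()
        for (( i = 0; i < ${#current[@]}; i += size )); do
            t=$(mktemp) || return 1
            cmd "$t" "${current[@]:i:size}" || return 1
            next+=("$t")
            temps+=("$t")
        done
        current=("${next[@]}")
    done
    cmd "$out" "${current[@]}" || return 1
    # Only the temporary files are removed, never the original inputs.
    (( ${#temps[@]} )) && rm -f -- "${temps[@]}"
    return 0
}
```

It would be called as merge_in_chunks 10 output.dat input*.dat. Using mktemp for the intermediate names also keeps them from matching the *dat glob on a later run.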

Using a fifo (this cannot deal with spaces in filenames):

i=0
opfiles=
mkfifo /tmp/foo
echo *dat | xargs -n 3 >/tmp/foo&
while read threefiles; do
    ./cmd tmp_output$i.dat $threefiles
    opfiles="$opfiles tmp_output$i.dat"
    ((i++)) 
done </tmp/foo
rm -f /tmp/foo
wait
./cmd output.dat $opfiles
rm $opfiles

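You need the fifo, rather than a pipe into while, so that the loop runs in the current shell. Otherwise the value of i and the list of temporary files in opfiles would be lost when the loop ends.

If you want, you can run the inner invocations of ./cmd in the background; the wait before the final invocation of cmd then makes sure they have all finished: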

i=0
opfiles=
mkfifo /tmp/foo
echo *dat | xargs -n 3 >/tmp/foo&
while read threefiles; do
    ./cmd tmp_output$i.dat $threefiles&
    opfiles="$opfiles tmp_output$i.dat"
    ((i++)) 
done </tmp/foo
rm -f /tmp/foo
wait
./cmd output.dat $opfiles
rm $opfiles

Update: if you want to avoid using a fifo entirely, you can use process substitution to emulate it, rewriting the first fifo version as:

i=0
opfiles=()
while read threefiles; do
    ./cmd tmp_output$i.dat $threefiles
    opfiles+=("tmp_output$i.dat")
    ((i++)) 
done < <(echo *dat | xargs -n 3)
./cmd output.dat "${opfiles[@]}"
rm "${opfiles[@]}"

This again avoids piping into the while loop; reading from a redirection instead keeps the opfiles variable available after the loop.
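To see why the redirection matters, here is a small self-contained check that uses only a counter and nothing from the scripts above. It assumes bash with default settings (the lastpipe option is off), where each part of a pipeline runs in a subshell.

```shell
#!/usr/bin/env bash
# Piping into while: the loop body runs in a subshell, so the increments
# are lost once the pipeline finishes.
count_piped=0
printf '%s\n' a b c | while read -r line; do
    count_piped=$((count_piped + 1))
done

# Redirecting from a process substitution: the loop runs in the current
# shell, so the variable keeps its value after the loop.
count_redirected=0
while read -r line; do
    count_redirected=$((count_redirected + 1))
done < <(printf '%s\n' a b c)

echo "piped=$count_piped redirected=$count_redirected"
# prints: piped=0 redirected=3
```

The same applies to i and opfiles in the loops above: anything assigned inside a piped while loop is gone afterwards.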

无人问我粥可暖 2025-01-04 06:21:04

Try the following, it should work for you:

echo *dat | xargs -n3 ./cmd output.dat

EDIT: In response to your comment:

for i in {0..9}; do
    echo file${i}*.dat | xargs -n3 ./cmd output${i}.dat
done

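That sends no more than three files at a time to ./cmd while going over all files from file00.dat to file99.dat, and uses 10 different output files, output0.dat to output9.dat. Note that xargs calls ./cmd several times per output file (once per group of three), so this only works if cmd adds to an existing output file instead of overwriting it.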

故笙诉离歌 2025-01-04 06:21:04

I know that this question was answered and accepted a long time ago, but I find that there is a simpler solution than those offered so far.

find -name '*.dat' | xargs -n3 | xargs -n3 your_command

For more fine-grained control, or to manipulate the argument string further, use the following form (substitute bash for sh if you prefer):

find -name '*.dat' | xargs -n3 | xargs -n3 -I{} sh -c 'your_command {}'

To run the commands in parallel (say, 2 processes at a time):

find -name '*.dat' | xargs -n3 | xargs -P2 -n3 -I{} sh -c 'your_command {}'

NOTE: This will not work for filenames that contain spaces.
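The pipelines above break on spaces because xargs splits its input on whitespace by default. Here is a sketch of a NUL-delimited variant. It assumes your find supports -print0 and your xargs supports -0 (the GNU versions do; neither option is required by POSIX). The demo file names are made up, and the sh -c body only prints how many names it received, so replace it with your command.

```shell
#!/usr/bin/env bash
# Demo directory with file names that contain spaces (hypothetical names).
demo=$(mktemp -d)
touch "$demo/a 1.dat" "$demo/b 2.dat" "$demo/c 3.dat" "$demo/d 4.dat"

# -print0 and -0 pass the names NUL-delimited, so spaces survive intact.
# Each group of up to three names reaches the inner shell as "$@";
# the trailing `sh` only fills $0.
groups=$(find "$demo" -name '*.dat' -print0 |
         xargs -0 -n3 sh -c 'echo "$# files"' sh)
echo "$groups"
rm -r "$demo"
```

With four files this runs the inner command twice, once with three names and once with one. The order in which find lists the files is not specified, so sort the list first if the merge order matters.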

人间不值得 2025-01-04 06:21:04

I'm using this quick solution I found from the bash manpage. It looks like others exist too. Unlike xargs -n, this should handle spaces in filenames properly.

ls *dat | while readarray -tn 10 tenfiles && ((${#tenfiles[@]}))
do
  cmd output.dat "${tenfiles[@]}"
done
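The loop above writes every chunk to the same output.dat. Here is a sketch of the same readarray idea with one temporary output per chunk and a final merge. It assumes cmd concatenates its inputs; the cmd function is a stand-in for the real ./cmd, and the demo files and the chunk size of 2 are made up for illustration.

```shell
#!/usr/bin/env bash
# Stand-in for the real ./cmd (assumption): concatenates its inputs into $1.
cmd() { local out=$1; shift; cat -- "$@" > "$out"; }

# Demo inputs with spaces in their names (hypothetical).
demo=$(mktemp -d)
for n in 1 2 3 4 5; do echo "$n" > "$demo/in $n.dat"; done

i=0
opfiles=()
# readarray -n 2 reads at most two lines per call; the loop stops when a
# call reads nothing. Reading from a process substitution instead of a
# pipe keeps i and opfiles alive after the loop.
while readarray -t -n 2 chunk && (( ${#chunk[@]} )); do
    cmd "$demo/tmp_output$i" "${chunk[@]}"
    opfiles+=("$demo/tmp_output$i")
    i=$((i + 1))
done < <(printf '%s\n' "$demo"/*.dat)

# Final merge of the per-chunk outputs, then clean up.
cmd "$demo/output" "${opfiles[@]}"
rm -- "${opfiles[@]}"
cat "$demo/output"
```

The temporary names deliberately do not end in .dat, so they cannot be picked up by the *.dat glob. Like the original, this still breaks on filenames that contain newlines.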

陈年往事 2025-01-04 06:21:04

GNU Parallel is excellent at "chunking things up" and generating input/output filenames and counters. This will take 3 files at a time (-N3) and generate an intermediate output file that is sequentially numbered and contains the merged contents. And it does it in parallel for you - making use of all those CPU cores that you paid Intel so handsomely for:

parallel -N3 cmd output.{#} {} ::: {1..150}.dat

To see it in action, use the --dry-run option:

parallel --dry-run -N3 cmd output.{#} {} ::: {1..150}.dat

Sample Output

cmd output.1 1.dat 2.dat 3.dat
cmd output.2 4.dat 5.dat 6.dat
cmd output.3 7.dat 8.dat 9.dat
cmd output.4 10.dat 11.dat 12.dat
cmd output.5 13.dat 14.dat 15.dat
cmd output.6 16.dat 17.dat 18.dat
cmd output.7 19.dat 20.dat 21.dat
...
...
cmd output.49 145.dat 146.dat 147.dat
cmd output.50 148.dat 149.dat 150.dat
