Wait for bash background jobs in script to be finished
To maximize CPU usage (I run things on a Debian Lenny in EC2) I have a simple script to launch jobs in parallel:
#!/bin/bash
for i in apache-200901*.log; do echo "Processing $i ..."; do_something_important; done &
for i in apache-200902*.log; do echo "Processing $i ..."; do_something_important; done &
for i in apache-200903*.log; do echo "Processing $i ..."; do_something_important; done &
for i in apache-200904*.log; do echo "Processing $i ..."; do_something_important; done &
...
I'm quite satisfied with this working solution; however, I couldn't figure out how to write further code to be executed only once ALL of the loops have been completed.
Is there a way to do this?
There's a bash builtin command for that: wait. With no arguments, it blocks until all of the current shell's child processes have exited.
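Applied to the script above, a minimal sketch (the sleep jobs stand in for the real for-loops):

```shell
#!/bin/bash
# Two stand-in background jobs, each representing one of the for-loops.
sleep 0.2 &
sleep 0.1 &

# With no arguments, wait blocks until every child process has exited.
wait
echo "All loops finished."
```

In the original script, wait would simply go after the last `for ... &` line.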
Using GNU Parallel will make your script even shorter and possibly more efficient:
This will run one job per CPU core and continue to do that until all files are processed.
Your solution will basically split the jobs into groups before running; here, 32 jobs would run as 4 groups of 8.
GNU Parallel instead spawns a new process when one finishes - keeping the CPUs active and thus saving time:
To learn more, watch the intro videos:
https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1
Your command line will love you for it.
I had to do this recently and ended up with the following solution:
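A sketch of that kind of solution (requires bash 4.3+ for wait -n; the sleep jobs are stand-ins for the real work):

```shell
#!/bin/bash
set -e

(   # sub-shell: start the background jobs and reap them one by one
    sleep 0.2 &
    sleep 0.1 &
    while true; do
        wait -n || {
            code="$?"
            # 127 means no background jobs are left: overall success.
            if [ "$code" -eq 127 ]; then
                exit 0
            fi
            exit "$code"   # a job failed: propagate its exit code
        }
    done
)
echo "all background jobs succeeded"
```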
Here's how it works: wait -n exits as soon as one of the (potentially many) background jobs exits. It always evaluates to true, and the loop goes on until wait -n returns 127, which means the last background job has exited; in that case we ignore that exit code and exit the sub-shell with code 0. With set -e, this guarantees that the script terminates early and passes through the exit code of any failed background job.
A minimal example with wait $(jobs -p):
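A sketch of what that minimal example might look like (the job bodies are stand-ins, and the order of the per-job output lines can vary):

```shell
#!/bin/bash
for n in 1 2 3; do
    ( sleep 0.1; echo "job $n done" ) &
done

# jobs -p prints the PIDs of this shell's background jobs;
# wait then blocks until each of those PIDs has exited.
wait $(jobs -p)
echo "all done"
```

Plain wait with no arguments would also work here; wait $(jobs -p) just makes the set of awaited PIDs explicit.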
If you just want to wait for all the jobs and return, use the following one-liner.

N.B.: wait returns as soon as any one of several jobs is complete.
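One way to write such a one-liner (an assumed reconstruction: waiting on each job's PID in turn, so the loop only returns once every job has finished, even if some finish out of order):

```shell
#!/bin/bash
sleep 0.2 & sleep 0.1 &

# Wait for each background job's PID individually; waiting on an
# already-exited job still returns immediately with its status.
for p in $(jobs -p); do wait "$p"; done
echo "returned after all jobs"
```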
This is my crude solution:
The idea is to use "jobs" to see how many children are active in the background and wait until this number drops (a child exits). Once a child exits, the next task can be started.
As you can see, there is also a bit of extra logic to avoid running the same experiments/commands multiple times. It does the job for me. However, this logic could be either skipped or further improved (e.g., checking file creation timestamps, input parameters, etc.).
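A sketch of that kind of throttling loop; MAX_JOBS, the ".done" marker files, and the stand-in command are all assumptions, not the author's original code:

```shell
#!/bin/bash
MAX_JOBS=4

# Stand-in for the real per-file command.
do_something_important() { echo "Processing $1 ..."; }

for f in apache-2009*.log; do
    # Extra logic: skip files already processed in an earlier run.
    [ -e "$f.done" ] && continue

    # Throttle: poll "jobs" until the number of running children drops.
    while [ "$(jobs -rp | wc -l)" -ge "$MAX_JOBS" ]; do
        sleep 1
    done

    # Mark the file done only if the command succeeds.
    ( do_something_important "$f" && touch "$f.done" ) &
done
wait   # block until the remaining children have exited
```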