Locking the output file of a shell script invoked many times in parallel

Posted 2024-10-21 12:19:49


I have close to a million files over which I want to run a shell script and append the result to a single file.

For example, suppose I just want to run wc on the files.
So that it runs fast, I can parallelize it with xargs. But I do not want the scripts to step on each other when writing the output. It is probably better to write to a few separate files rather than one, and then cat them later. But I still want the number of such temporary output files to be significantly smaller than the number of input files. Is there a way to get the kind of locking I want, or is that always ensured by default?
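One way to get that kind of locking with plain xargs is to serialize each append with flock(1) from util-linux. A minimal sketch, assuming flock is available; the file names are hypothetical:

```shell
#!/bin/sh
# Hypothetical demo: create a couple of small input files.
demo=$(mktemp -d)
cd "$demo"
printf 'a\nb\n' > f1.txt
printf 'c\n'    > f2.txt

# Run wc in parallel, but take an exclusive lock around each append
# so concurrent workers cannot interleave lines in results.txt.
printf '%s\n' f1.txt f2.txt |
  xargs -P 4 -I{} sh -c 'wc -l "$1" | flock results.lock tee -a results.txt > /dev/null' _ {}
```

The lock is held only for the duration of each append, so the workers themselves still run concurrently.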

Is there any utility that will recursively cat two files in parallel?

I can write a script to do that, but I would have to deal with the temporary files and clean them up. So I was wondering if there is a utility that does that.
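For what it's worth, a sketch of that script: split the input list into chunks, let each worker write its own temporary file (so no locking is needed), then cat the pieces and clean up. The chunk size and file names here are arbitrary choices for illustration:

```shell
#!/bin/sh
# Hypothetical demo inputs.
demo=$(mktemp -d)
cd "$demo"
printf 'x\n'    > a.txt
printf 'y\nz\n' > b.txt

# Split the list of input files into chunks (1 name per chunk here;
# use a much larger -l in practice so temp files stay few).
tmp=$(mktemp -d)
printf '%s\n' a.txt b.txt | split -l 1 - "$tmp/chunk."

# Each chunk gets its own worker and its own output file,
# then the per-chunk results are concatenated.
for c in "$tmp"/chunk.*; do
  xargs wc -l < "$c" > "$c.out" &
done
wait
cat "$tmp"/chunk.*.out > results.txt
rm -rf "$tmp"
```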


苍景流年 2024-10-28 12:19:49


GNU parallel claims that it:

makes sure output from the commands is the same output as you would get had you run the commands sequentially

If that's the case, then I presume it should be safe to simply pipe the output to your file and let parallel handle the intermediate data.

Use the -k option to maintain the order of the output.

Update: (non-Perl solution)

Another alternative would be prll, which is implemented as shell functions with some C extensions. It is less feature-rich compared to GNU parallel but should do the job for basic use cases.

The feature listing claims:

Does internal buffering and locking to prevent mangling/interleaving of output from separate jobs.

So it should meet your needs as long as the order of output is not important.

However, note the following statement on this page:

prll generates a lot of status information on STDERR, which makes it harder to use the STDERR output of the job directly as input for another program.


Disclaimer: I've tried neither of the tools and am merely quoting from their respective docs.
