"find" vs. "ls" with GNU parallel

Posted 2024-12-07 06:52:09

I'm trying to use GNU parallel to post a lot of files to a web server. In my directory, I have some files:

file1.xml
file2.xml

and I have a shell script that looks like this:

#! /usr/bin/env bash

CMD="curl -X POST -d@$1 http://server/path"

eval $CMD

There's some other stuff in the script, but this was the simplest example. I tried to execute the following command:

ls | parallel -j2 script.sh {}

This is what the GNU parallel pages show as the "normal" way to operate on files in a directory. It seems to pass the name of the file into my script, but curl complains that it can't load the data file passed in. However, if I do:

find . -name '*.xml' | parallel -j2 script.sh {}

it works fine. Is there a difference between how ls and find are passing arguments to my script? Or do I need to do something additional in that script?

一片旧的回忆 2024-12-14 06:52:10

GNU parallel is a variant of xargs. They both have very similar interfaces, and if you're looking for help on parallel, you may have more luck looking up information about xargs.

That being said, the way they both operate is fairly simple. Both programs read input from STDIN and break it up into tokens. Each token is then passed to the provided program as an argument. By default, xargs splits on whitespace and passes as many tokens as possible to a single invocation of the program, starting a new process only when the limit is hit. GNU parallel, by default, treats each input line as one argument and runs one job per line.

Here is an example:

> echo "foo    bar \
  baz" | xargs echo
foo bar baz

There are some problems with the default behavior, so it is common to see several variations.

The first issue is that because xargs tokenizes on whitespace, any file name with whitespace in it will cause xargs to break (for parallel, only a name containing a newline is a problem). One solution is to tokenize around the NUL character instead. find even provides an option to make this easy to do:

> echo "Success!" > bad\ filename
> find . -name 'bad filename' -print0 | xargs -0 cat
Success!

The -print0 option tells find to separate file names with the NUL character instead of a newline.
The -0 option tells xargs to split its input on the NUL character.

Note that parallel is a little better than xargs in that its default behavior is to tokenize around only newlines, so there is less of a need to change the default behavior.
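
To make the tokenizing difference concrete, here is a minimal sketch that uses only xargs, so it runs without GNU parallel installed. The file name is made up for illustration, and -0 is a widely supported extension (GNU and BSD xargs) rather than something POSIX guarantees.

```shell
#!/bin/sh
# Hypothetical file name containing a space.
name='bad filename'

# Default xargs splits on blanks, so the single name becomes two arguments.
# tr strips the padding some wc implementations add around the count.
split_count=$(printf '%s\n' "$name" | xargs -n 1 echo | wc -l | tr -d ' ')

# With NUL-delimited input (-0), the name stays a single argument.
nul_count=$(printf '%s\0' "$name" | xargs -0 -n 1 echo | wc -l | tr -d ' ')

echo "whitespace-delimited: $split_count argument(s)"
echo "NUL-delimited: $nul_count argument(s)"
```

The first count should come out higher than the second, which is the breakage described above: one file name turned into two bogus arguments.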

Another common issue is that you may want to control how the arguments are passed to the program that xargs or parallel runs. If you need the argument at a specific position in the command, you can use {} to mark where it goes. parallel understands {} by default; xargs only substitutes it when you pass -I {}.

> mkdir new_dir
> find . -name '*.xml' | xargs -I {} mv {} new_dir

This will move all .xml files in the current directory and its subdirectories into the new_dir directory. Prefixing the command with echo shows what it actually breaks down into:

> find . -name '*.xml' | xargs -I {} echo mv {} new_dir
mv ./foo.xml new_dir
mv ./bar.xml new_dir
mv ./baz.xml new_dir
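
For anyone who wants to try the placeholder form safely, here is a small sketch that runs in a throwaway directory. With xargs, {} is only substituted when -I {} is given. The sample file names are invented, and I added -maxdepth 1 (a GNU/BSD find option) so that find does not descend into new_dir while files are being moved there.

```shell
#!/bin/sh
# Work in a temporary directory so nothing real gets moved.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch foo.xml bar.xml notes.txt   # hypothetical sample files
mkdir new_dir

# -I {} makes xargs run mv once per input line, substituting the line for {}.
find . -maxdepth 1 -name '*.xml' | xargs -I {} mv {} new_dir

# Show what ended up in new_dir; notes.txt should be left where it was.
ls new_dir
```

Because -I {} runs one command per input line, this is slower than batching many names into one mv, but the position of the argument is fully under your control.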

So, taking into consideration how xargs and parallel work, you should hopefully be able to see the issue with your command. find . -name '*.xml' generates a list of only the XML files to be passed to the script.sh program.

> find . -name '*.xml' | parallel -j2 echo script.sh {}
script.sh ./foo.xml
script.sh ./bar.xml
script.sh ./baz.xml

However, ls | parallel -j2 script.sh {} generates a list of ALL files in the current directory to be passed to the script.sh program.

> ls | parallel -j2 echo script.sh {}
script.sh some_directory
script.sh some_file
script.sh foo.xml
...

A more correct variant on the ls version would be as follows:

> ls *.xml | parallel -j2 script.sh {}

However, an important difference between this and the find version is that find searches through all subdirectories for files, while ls only lists the current directory. The equivalent find version of the above ls command would be as follows:

> find . -maxdepth 1 -name '*.xml'

This will only search the current directory.
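
Even when both commands select the same files, the names they print are not identical. The sketch below builds a scratch directory (all file names are made up) and captures what each command would feed to parallel. The sort is only there to make the order predictable.

```shell
#!/bin/sh
# Scratch directory with two XML files, one other file, and a subdirectory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
touch a.xml b.xml other.txt
mkdir sub && touch sub/c.xml

# The shell expands *.xml, and ls prints bare names from the current directory.
ls_out=$(ls *.xml)

# find prints paths relative to its starting point, so each one begins with ./
find_out=$(find . -maxdepth 1 -name '*.xml' | sort)

printf 'ls:\n%s\n' "$ls_out"
printf 'find:\n%s\n' "$find_out"
```

Neither list includes sub/c.xml, but the find entries carry a leading ./ that the ls entries lack.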

晨敛清荷 2024-12-14 06:52:10

Since it works with find, you probably want to see which commands GNU Parallel is running (using -v or --dryrun) and then try running the failing ones manually.

ls *.xml | parallel --dryrun -j2 script.sh
find . -maxdepth 1 -name '*.xml' | parallel --dryrun -j2 script.sh

悲欢浪云 2024-12-14 06:52:10

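I haven't used parallel, but there is a difference between ls and find . -name '*.xml'. ls lists all files and directories, whereas find . -name '*.xml' lists only the files (and directories) whose names end in .xml.
As Paul Rubel suggested, just print the value of $1 in your script to check this. You may also want to restrict find's output to regular files with the -type f option.
Hope this helps!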
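
Here is a rough sketch of what that check could look like inside the script. It prints the argument it received, skips anything that is not a readable regular file, and calls curl directly with the argument quoted instead of going through eval. The function name post_file and the CURL and URL variables are my own additions for illustration, and the URL is just the placeholder from the question.

```shell
#!/usr/bin/env bash
# Hypothetical overrides: CURL lets you swap in another command for a dry run,
# URL defaults to the placeholder address used in the question.
CURL=${CURL:-curl}
URL=${URL:-http://server/path}

post_file() {
    file=$1
    # Show exactly what was passed in (on stderr, so stdout stays clean).
    echo "argument received: '$file'" >&2
    # Skip directories, missing paths, and unreadable files.
    if [ ! -f "$file" ] || [ ! -r "$file" ]; then
        echo "skipping '$file': not a readable regular file" >&2
        return 1
    fi
    # Quote the argument and call curl directly; no eval needed.
    "$CURL" -X POST -d "@$file" "$URL"
}

# Only post when a file name was given, so the file can also be sourced.
if [ "$#" -gt 0 ]; then
    post_file "$1"
fi
```

With this in place, piping plain ls into parallel would report the directories and other stray entries it was handed rather than passing them on to curl.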

表情可笑 2024-12-14 06:52:10

Neat.

I had never used parallel before. It appears, though, that there are two of them.
One is GNU Parallel; the one installed on my system lists Tollef Fog Heen
as the author in the man pages, which suggests it is the parallel from moreutils rather than GNU Parallel.

As Paul mentioned, you should use
set -x

Also, the paradigm that you mentioned above doesn't seem to work on my parallel, rather, I have
to do the following:

$ cat ../script.sh
+ cat ../script.sh
#!/bin/bash
echo $@
$ parallel -ij2 ../script.sh {} -- $(find -name '*.xml')
++ find -name '*.xml'
+ parallel -ij2 ../script.sh '{}' -- ./b.xml ./c.xml ./a.xml ./d.xml ./e.xml
./c.xml
./b.xml
./d.xml
./a.xml
./e.xml
$ parallel -ij2 ../script.sh {} -- $(ls *.xml)
++ ls --color=auto a.xml b.xml c.xml d.xml e.xml
+ parallel -ij2 ../script.sh '{}' -- a.xml b.xml c.xml d.xml e.xml
b.xml
a.xml
d.xml
c.xml
e.xml

find does provide a different input: it prepends the relative path to the name.
Maybe that is what is messing up your script?
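
If a script does turn out to be sensitive to that prefix, one way to normalize it is shell parameter expansion. This is only a sketch with made-up values, and I am not claiming the prefix is what breaks the script in the question.

```shell
#!/bin/sh
# The same hypothetical file as ls and as find would report it.
from_ls='a.xml'
from_find='./a.xml'

# ${var#./} removes one leading "./" if present and leaves other names alone.
clean_ls=${from_ls#./}
clean_find=${from_find#./}

echo "$clean_ls"
echo "$clean_find"
```

Inside script.sh the same expansion could be applied to $1 before it is used.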
