Using GNU Parallel to process a CSV file with a header

Posted 2024-12-13 18:46:31

Is it possible to invoke GNU Parallel so that it repeats the first line of the original input on the STDIN of each child job?

I have a CSV file that contains a header line at the top. For example:

> cat large.csv
id,count
abc,123
def,456

I have a tool that can extract columns by name rather than position:

> csv_extract large.csv count
123
456

I can sum the values serially as:

> csv_extract large.csv count | awk '{ SUM += $1 } END { print SUM }'
579

The actual file I have is much larger, and the operation is more complex than summing, but the same principles apply. I'd like to use GNU Parallel to process the file, but I don't know whether it can be told to repeat the CSV header for each job.

Ideally I could run the operation with something like:

> cat large.csv | parallel --pipe --repeat-first-line "csv_extract /dev/stdin count | awk '{ SUM += \$1 } END { print SUM }'"
579

I've made up the --repeat-first-line option above to represent the functionality I cannot figure out. I've watched the YouTube videos, and read the man page, but I'm just not able to see how it can be done, if at all possible.

Thanks!

danboo

镜花水月 2024-12-20 18:46:31

Today you can use --skip-first-line and add the header back with echo:

seq 10 | parallel --skip-first-line --pipe '(echo hea,der; cat) | my_prog'

A future version will have a '--header' option that takes a regexp matching the end of your header (e.g. '\n' for one line, '\n.*\n' for two lines, or '---' for everything up to and including the first ---).

-- Edit --

The newest version of GNU Parallel can now do:

parallel --pipe --header : my_program
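
As a rough illustration of the --header : form applied to the CSV from the question, here is a minimal sketch. It assumes a GNU Parallel version that supports --header together with --pipe. Since csv_extract is the asker's own tool, a small awk helper script (extract_count.sh, a hypothetical name) stands in for it, and -N1 is added only so that the tiny sample file is actually split across more than one job.

```shell
#!/usr/bin/env bash
# Sketch only: sum the "count" column of a CSV, chunked by GNU Parallel.
# Assumes GNU Parallel with --header support is installed.

workdir=$(mktemp -d)
csv="$workdir/large.csv"

# Sample data taken from the question.
printf 'id,count\nabc,123\ndef,456\n' > "$csv"

# Hypothetical stand-in for csv_extract: reads CSV on stdin, finds the
# column named "count" in the header line, and prints that column.
cat > "$workdir/extract_count.sh" <<'EOF'
awk -F, '
  NR == 1 { for (i = 1; i <= NF; i++) if ($i == "count") col = i; next }
          { print $col }'
EOF

# --header : prepends the first input line to every chunk, so each job
# sees a valid header. -N1 gives each job a single record (demo only).
# The final awk adds up the values printed by all jobs.
total=$(parallel --pipe --header : -N1 sh "$workdir/extract_count.sh" < "$csv" |
        awk '{ sum += $1 } END { print sum }')

echo "$total"
rm -rf "$workdir"
```

With the sample data this should give the same total as the serial pipeline in the question. On a real file you would normally drop -N1 and let the default block size (or --block) decide how much data each job gets.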