Processing a CSV file with a header line using GNU Parallel
Is it possible to invoke GNU Parallel so that it repeats the first line of the original input on the STDIN of each child job?
I have a CSV file that contains a header line at the top. For example:
> cat large.csv
id,count
abc,123
def,456
I have a tool that can extract columns by name rather than position:
> csv_extract large.csv count
123
456
I can sum the values serially as:
> csv_extract large.csv count | awk '{ SUM += $1 } END { print SUM }'
579
The actual file is much larger, and the operation is more complex than summing, but the same principle applies. I'd like to use GNU Parallel to process the file, but I don't know whether it can be told to repeat the CSV header for each job.
Ideally I could run the operation with something like:
> cat large.csv | parallel --pipe --repeat-first-line "csv_extract /dev/stdin count | awk '{ SUM += \$1 } END { print SUM }'"
579
I've made up the --repeat-first-line option above to represent the functionality I cannot figure out. I've watched the YouTube videos, and read the man page, but I'm just not able to see how it can be done, if at all possible.
Thanks!
- danboo
Comments (1)
Today you can use --skip-first-line and add the header back using echo.
In a future version you will have the option '--header', which will be a regexp that matches the end of your header (e.g. '\n' for one line, '\n.*\n' for two lines, or '---' for everything up to and including the first ---).
-- Edit --
The newest version of GNU Parallel now supports this through --header.