How to parallel process a multi-line file in a while loop?
input.txt: My actual input file has 5000000 lines
A B C D4.2 E 2022-05-31
A B C D4.2 E 2022-05-31
A B F D4.2 E 2022-05-07
A B C D4.2 E 2022-05-31
X B D E2.0 F 2022-05-30
X B Y D4.2 E 2022-05-06
data.txt: This is another file I need to refer to in the while loop.
A B C D4.2 E 2022-06-31
X B D E2.0 F 2022-07-30
Here's what I need to do:
cat input.txt | while read foo bar tan ban can man
do
    KEYVALUE=$(echo "$ban" | awk -F. '{print $1}')    # key prefix, e.g. D4.2 -> D4
    END_DATE=$(egrep -w "$foo|${KEYVALUE}|$man" data.txt | awk '{print $6}')
    echo "$foo $bar $tan $ban $can $man $END_DATE"
done
Desired output:
A B C D4.2 E 2022-05-31 2022-06-31
A B C D4.2 E 2022-05-31 2022-06-31
A B F D4.2 E 2022-05-07 2022-06-31
A B C D4.2 E 2022-05-31 2022-06-31
X B D E2.0 F 2022-05-30 2022-07-30
X B Y D4.2 E 2022-05-06 2022-06-31
My major problem is that the while loop takes more than an hour to get through the 5000000 input lines. How can I process this in parallel, since each line is independent of the others and the order of lines in the output file doesn't matter? I've tried using GNU parallel based on a few discussions, but none of them helped, or maybe I'm not sure how to implement it. I am using RHEL with bash or ksh.
3 Answers
Without parallel: took 8 seconds for 5068056 lines
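A minimal sketch of this approach, assuming the join key is the part of field 4 before the dot (this reproduces the desired output above; join.awk is an illustrative name):

# join.awk -- append the end date from data.txt, keyed on the prefix of field 4
cat > join.awk <<'EOF'
NR == FNR { split($4, a, "."); end[a[1]] = $6; next }   # first file: data.txt
          { split($4, a, "."); print $0, end[a[1]] }    # then: the input lines
EOF

awk -f join.awk data.txt input.txt > output.txt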
With parallel
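A sketch reusing the same join.awk: --pipe chops stdin into blocks on line boundaries and runs one awk per block. Output order may differ, which the question says is fine:

parallel --pipe --block 10M awk -f join.awk data.txt - < input.txt > output.txt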
Using split
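A sketch of the split variant: cut the input into fixed-size chunks and run one background awk per chunk:

split -l 1000000 input.txt chunk_
for f in chunk_*; do
    awk -f join.awk data.txt "$f" > "$f.out" &   # one background job per chunk
done
wait                                             # let every job finish
cat chunk_*.out > output.txt
rm -f chunk_*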
Here is one potential solution:
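One plausible shape of such a solution, assuming the same prefix-of-field-4 lookup as above (the function name doit is illustrative):

doit() {
    awk 'NR == FNR { split($4, a, "."); end[a[1]] = $6; next }
         { split($4, a, "."); print $0, end[a[1]] }' data.txt -
}
export -f doit                 # make the function visible to parallel's shells
cat input.txt | parallel --pipe --block 10M doit > output.txt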
It should be relatively fast. You can tweak the parallel command (e.g. use --pipepart instead of --pipe) to increase performance depending on your parameters (e.g. size of each file, number of available cores, etc.).
Edit
Rough benchmarking suggests it will be significantly faster: it took under 10 seconds to process the 6M-row input file. Does this solve your problem?
If you can wrap whatever each iteration needs to do into a function or script, you could use nohup.
In the following example I simulate the iteration by reading an input file (lines.txt). I created a do_something.sh which is called either sequentially or in parallel, depending on the mode. I use date and a log file to record when each line was processed, and I simulate a 2-second processing delay on each iteration.
script.sh
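A plausible reconstruction consistent with the description above (the log file name process.log is an assumption):

#!/bin/bash
# script.sh <mode> -- "parallel" backgrounds each iteration with nohup,
# anything else runs the iterations one after the other
mode=$1
while read -r line; do
    if [ "$mode" = "parallel" ]; then
        nohup ./do_something.sh "$line" >> process.log 2>&1 &
    else
        ./do_something.sh "$line" >> process.log 2>&1
    fi
done < lines.txt
wait    # in parallel mode, don't exit until the background jobs finish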
do_something.sh
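A sketch of the worker, with the 2-second delay described above:

#!/bin/bash
# do_something.sh <line> -- log when the line was processed,
# then simulate 2 seconds of work
echo "$(date '+%Y-%m-%d %H:%M:%S') processed: $1"
sleep 2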
lines.txt
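And a tiny driver file (contents are illustrative):

line-1
line-2
line-3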
Also if you want to avoid another script, you could use this to keep just one script:
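Presumably an export -f variant along the lines of the reference below (a sketch; the extra _ argument fills $0 so the line arrives intact as $1):

#!/bin/bash
# same demo with the worker inlined; export -f makes the function
# available inside the shell that nohup launches
do_something() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') processed: $1"
    sleep 2
}
export -f do_something
while read -r line; do
    nohup bash -c 'do_something "$1"' _ "$line" >> process.log 2>&1 &
done < lines.txt
wait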
Reference: https://stackoverflow.com/a/23877183/3957754