How can I process the lines of a file in parallel in a while loop?

Posted 2025-02-05 16:06:16

input.txt: My actual input file has 5000000 lines

A B C D4.2 E 2022-05-31
A B C D4.2 E 2022-05-31
A B F D4.2 E 2022-05-07
A B C D4.2 E 2022-05-31
X B D E2.0 F 2022-05-30
X B Y D4.2 E 2022-05-06

data.txt: This is another file I need to refer to in the while loop.

A B C D4.2 E 2022-06-31
X B D E2.0 F 2022-07-30

Here's what I need to do

while read -r foo bar tan ban can man
do
  KEYVALUE=$(echo "$ban" | awk -F. '{print $1}')
  END_DATE=$(egrep -w "$foo|${KEYVALUE}|$man" data.txt | awk '{print $5,$6}')
  echo "$foo,$bar,$tan,$ban,$can,$man,${END_DATE}"
done < input.txt

Desired output:

A B C D4.2 E 2022-05-31 2022-06-31
A B C D4.2 E 2022-05-31 2022-06-31
A B F D4.2 E 2022-05-07 2022-06-31
A B C D4.2 E 2022-05-31 2022-06-31
X B D E2.0 F 2022-05-30 2022-07-30
X B Y D4.2 E 2022-05-06 2022-06-31

My main problem is that the while loop takes more than an hour to get through 500000 input lines. How can I process this in parallel, since each line is independent of the others and the order of lines in the output file doesn't matter? I've tried using GNU parallel based on a few discussions, but none of them helped, or maybe I'm not sure how to implement it. I am using RHEL with bash or ksh.
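For reference, one way to fan the per-line work out without GNU parallel is to export the loop body as a bash function and feed lines to `xargs -P`. This is only a sketch under assumptions I've made up from the samples: the lookup key is field 4 with its decimal part stripped, the function name `lookup_line` is hypothetical, and the miniature input.txt/data.txt here stand in for the real files.

```shell
#!/bin/bash
# Demo setup: tiny hypothetical stand-ins for input.txt and data.txt
dir=$(mktemp -d); cd "$dir"
printf 'A B C D4.2 E 2022-05-31\nX B D E2.0 F 2022-05-30\n' > input.txt
printf 'A B C D4.2 E 2022-06-31\nX B D E2.0 F 2022-07-30\n' > data.txt

# Per-line worker: key = field 4 with the decimal part stripped (D4.2 -> D4);
# the end date is field 6 of the data.txt line whose field 4 starts with that key.
lookup_line() {
  set -- $1                 # intentionally unquoted: split the line into fields
  key=${4%%.*}
  end_date=$(awk -v k="$key" 'index($4, k) == 1 { print $6; exit }' data.txt)
  echo "$@ $end_date"
}
export -f lookup_line

# One line per job, up to 4 jobs at a time; output order is not guaranteed,
# so sort here only to make the result stable for display.
result=$(xargs -d '\n' -n1 -P4 bash -c 'lookup_line "$0"' < input.txt | sort)
echo "$result"
```

Note this still forks one awk per line, so it mainly illustrates the fan-out mechanics; the single-pass awk answers below avoid the per-line processes entirely.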

Comments (3)

故事还在继续 2025-02-12 16:06:17

Without parallel, it took 8 seconds for 5068056 lines:

$ wc -l input.txt 
5068056 input.txt
$ time awk 'NR==FNR{a[$4]=$6} NR!=FNR{print $0, a[$4]}' data.txt input.txt  > output.txt

real    0m8.274s
user    0m5.397s
sys     0m2.869s

$ wc -l output.txt
5068056 output.txt

With parallel

time cat input.txt | parallel --pipe -q awk 'NR==FNR{a[$4]=$6; next} {print $0, a[$4]}' data.txt - > output.txt 

real    0m3.319s
user    0m9.284s
sys     0m5.990s

Using split

inputfile=input.txt
outputfile=output.txt
data=data.txt
count=10

split -n l/$count $inputfile /tmp/input$$
for file in /tmp/input$$*; do
    awk 'NR==FNR{a[$4]=$6; next} {print $0, a[$4]}' $data $file > ${file}.out &
done
wait
cat /tmp/input$$*.out > $outputfile
rm /tmp/input$$*

$ time ./split.sh

real    0m1.781s
user    0m7.244s
sys     0m1.536s
碍人泪离人颜 2025-02-12 16:06:17

Here is one potential solution:

cat script.awk
#!/usr/bin/awk -f

NR==FNR{
  n=gsub(/\..*/,"",$4)   # strip the decimal part: D4.2 -> D4
  a[n,$5]=$6; next
} (n,$5) in a {
  print $0, a[n,$5]
}
cat input.txt | parallel --pipe -q ./script.awk data.txt -
A B C D4.2 E 2022-05-31 2022-06-31
A B C D4.2 E 2022-05-31 2022-06-31
A B F D4.2 E 2022-05-07 2022-06-31
A B C D4.2 E 2022-05-31 2022-06-31
X B D E2.0 F 2022-05-30 2022-07-30
X B Y D4.2 E 2022-05-06 2022-06-31

It should be relatively fast. You can tweak the parallel command (e.g. use --pipepart instead of --pipe) to increase performance depending on your parameters (i.e. size of each file, number of available cores, etc).
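As an aside, the `(n,$5)` key in script.awk works because `n` keeps its last value from the data.txt pass; a sketch that keys on the stripped field 4 explicitly avoids that coupling. The miniature files below are hypothetical stand-ins in the question's assumed format.

```shell
#!/bin/bash
# Demo setup: tiny hypothetical input.txt and data.txt
dir=$(mktemp -d); cd "$dir"
printf 'A B C D4.2 E 2022-05-31\nX B D E2.0 F 2022-05-30\n' > input.txt
printf 'A B C D4.2 E 2022-06-31\nX B D E2.0 F 2022-07-30\n' > data.txt

# First file (NR==FNR): map the stripped field 4 (D4.2 -> D4) to the end date.
# Second file: look the date up by the same stripped key and append it.
result=$(awk '
  { k = $4; sub(/\..*/, "", k) }
  NR == FNR { a[k] = $6; next }
  k in a    { print $0, a[k] }
' data.txt input.txt)
echo "$result"
```

Keying on the value itself rather than a gsub count also keeps the lookup correct if data.txt ever contains keys that strip differently.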

Edit

Rough benchmarking suggests it will be significantly faster:

# Copy input.txt many times
for f in {1..100}; do cat input.txt >> input.txt_2; done
for f in {1..1000}; do cat input.txt_2 >> input.txt_3; done
for f in {1..10}; do cat input.txt_3 >> input.txt_4; done

du -h input.txt_4
137M    input.txt_4

wc -l input.txt_4
6000000 input.txt_4

time cat input.txt_4 | parallel --pipe -q ./script.awk data.txt - > output.txt
real    0m7.533s
user    0m22.085s
sys     0m4.494s

Took <10 seconds to process the 6M row input file. Does this solve your problem?

梦与时光遇 2025-02-12 16:06:17

If you wrap whatever you need to do on each iteration in a function, you could use nohup.

In the following example I simulated the iteration by reading an input.txt file. I created a do_something.sh which is called either sequentially or in parallel depending on the mode argument. I used date and a log file to record when each item was processed, and I simulated a 2-second processing delay on each iteration.

script.sh

#!/bin/bash
mode=$1
log_file=log.txt
echo "" > $log_file

while read folder; do
  if [ "$mode" == "parallel" ]; then
    nohup $(pwd)/do_something.sh $folder >/dev/null 2>&1 &
  else
    $(pwd)/do_something.sh $folder
  fi
done <input.txt

do_something.sh

log_file=log.txt
sleep 2
echo "$(date) : $1" >> $log_file

input.txt

aaaa
bbbb
cccc
dddd

Also if you want to avoid another script, you could use this to keep just one script:

https://stackoverflow.com/a/23877183/3957754
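The single-script approach from that link boils down to exporting a function so a child bash can see it. A minimal sketch, with `do_something` inlined from do_something.sh and a tiny hypothetical input.txt created for the demo:

```shell
#!/bin/bash
# Demo setup: a tiny hypothetical input.txt
dir=$(mktemp -d); cd "$dir"
printf 'aaaa\nbbbb\n' > input.txt
log_file=log.txt
: > "$log_file"   # truncate the log

# Inlined worker: logs a timestamp per item after a simulated 2-second delay.
do_something() {
  sleep 2
  echo "$(date) : $1" >> log.txt
}
export -f do_something

# One background job per line, then wait for all of them;
# total wall time is ~2s instead of ~2s per line.
while read -r item; do
  nohup bash -c 'do_something "$0"' "$item" >/dev/null 2>&1 &
done < input.txt
wait
cat log.txt
```

Concurrent appends to one log file are fine for short lines like these, but for heavier output it is safer to write one file per job and concatenate at the end, as in the split-based answer above.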

sample result
