OpenMP, gfortran and the curse of no speedup
I have a ridiculously parallelizable (in my mind) piece of Fortran code, so I thought I could give OpenMP a try and see if any speedup could be achieved. This is on Windows 10, MinGW64 with gcc 11.2. I have 16 cores and 256 GB of RAM on an Intel Xeon Gold 5122 @ 3.6 GHz.
The code is definitely not a big deal:
start_time = omp_get_wtime()
!$omp parallel do default(shared) private(i)
do i = 1, nF
   call something(input1, input2(i), input3, input4, output1(i, :), output2(i, :), output3(i, :))
end do
!$omp end parallel do
end_time = omp_get_wtime()
write(*, *) 'Number of threads:', omp_get_max_threads(), ', Elapsed time:', end_time-start_time
nF is on the order of 8,600. The calculations are completely independent of which index i I am looping on, and it doesn't matter in which order they are executed. There is no interaction between the inputs or the outputs at iteration i-1 and their values at iteration i. Whatever goes on inside the subroutine something is pure calculation, and not that tough either.
Now, I am not expecting magical linear speed improvements, but this is what I get:
C:\Users\User\MyProjects\Test>set OMP_NUM_THREADS=1 & test.exe
Number of threads: 1 , Elapsed time: 0.43099999427795410
C:\Users\User\MyProjects\Test>set OMP_NUM_THREADS=2 & test.exe
Number of threads: 2 , Elapsed time: 0.75899982452392578
C:\Users\User\MyProjects\Test>set OMP_NUM_THREADS=4 & test.exe
Number of threads: 4 , Elapsed time: 0.69499993324279785
C:\Users\User\MyProjects\Test>set OMP_NUM_THREADS=8 & test.exe
Number of threads: 8 , Elapsed time: 0.55299997329711914
C:\Users\User\MyProjects\Test>set OMP_NUM_THREADS=16 & test.exe
Number of threads: 16 , Elapsed time: 0.57800006866455078
Happily, no speedup. There is actually a slowdown, no matter how many threads I tell OpenMP to use. It may be that for code that takes 0.43 seconds to run it is pointless to attempt to use OpenMP, as the thread creation time will always swamp whatever improvement the threads may bring. It may also be that I am misunderstanding how all of this works.
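One way to test the thread-creation hypothesis is to start the thread team before the timed section and to time the loop several times, keeping the best result. The following is only a minimal sketch under assumptions: the function work is a hypothetical stand-in for the subroutine something, and the array out and the repetition count nrep are invented for illustration. Only nF and the OpenMP directives come from the question.

```fortran
program warmup_timing
   use omp_lib
   implicit none
   integer, parameter :: dp = kind(1.0d0)
   integer, parameter :: nF = 8600   ! loop count taken from the question
   integer, parameter :: nrep = 20   ! hypothetical number of timed repetitions
   real(dp), allocatable :: out(:)
   real(dp) :: t0, t1, best
   integer :: i, rep

   allocate(out(nF))

   ! Warm-up region: lets the runtime create its thread team before any timing.
   !$omp parallel
   !$omp end parallel

   best = huge(1.0_dp)
   do rep = 1, nrep
      t0 = omp_get_wtime()
      !$omp parallel do default(shared) private(i)
      do i = 1, nF
         out(i) = work(i)
      end do
      !$omp end parallel do
      t1 = omp_get_wtime()
      best = min(best, t1 - t0)   ! keep the fastest of the repetitions
   end do

   write(*, *) 'Number of threads:', omp_get_max_threads(), ', Best time:', best
   write(*, *) 'Checksum:', sum(out)   ! printing the result keeps the loop from being optimized away

contains

   ! Hypothetical stand-in for "something": pure arithmetic, no shared state.
   pure function work(i) result(r)
      integer, intent(in) :: i
      real(dp) :: r
      integer :: k
      r = 0.0_dp
      do k = 1, 1000
         r = r + sin(real(i + k, dp))
      end do
   end function work

end program warmup_timing
```

If the best time with several threads is still no better than with one thread, one-off thread start-up cost is unlikely to be the whole explanation, and the work done inside the real subroutine would be the next thing to examine.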
Optimization switches I use to compile the code:
-O3 -funroll-loops -march=native -fno-asynchronous-unwind-tables -fopenmp
Any explanation is most welcome, and suggestions on modifying my !$omp stuff are also welcome.
EDIT
I have edited the code by declaring the matrices output1, output2, etc. as "transposed", so it looks like this:
start_time = omp_get_wtime()
!$omp parallel do default(shared) private(i)
do i = 1, nF
   call something(input1, input2(i), input3, input4, output1(:, i), output2(:, i), output3(:, i))
end do
!$omp end parallel do
end_time = omp_get_wtime()
write(*, *) 'Number of threads:', omp_get_max_threads(), ', Elapsed time:', end_time-start_time
No changes in timings (it's possibly even worse):
C:\Users\User\MyProjects\Test>set OMP_NUM_THREADS=1 & test.exe
Number of threads: 1 , Elapsed time: 0.43000006675720215
C:\Users\User\MyProjects\Test>set OMP_NUM_THREADS=2 & test.exe
Number of threads: 2 , Elapsed time: 0.63800001144409180
C:\Users\User\MyProjects\Test>set OMP_NUM_THREADS=4 & test.exe
Number of threads: 4 , Elapsed time: 0.96600008010864258
C:\Users\User\MyProjects\Test>set OMP_NUM_THREADS=8 & test.exe
Number of threads: 8 , Elapsed time: 0.80500006675720215
C:\Users\User\MyProjects\Test>set OMP_NUM_THREADS=16 & test.exe
Number of threads: 16 , Elapsed time: 0.75200009346008301
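The motivation for the transposed declaration is that Fortran stores arrays in column-major order, so a section such as output1(i, :) is strided in memory while output1(:, i) is contiguous. A minimal sketch to confirm this with the Fortran 2008 intrinsic is_contiguous follows. The array names by_row and by_col and the vector length m are hypothetical, since the question does not show the actual declarations.

```fortran
program section_layout
   implicit none
   integer, parameter :: nF = 8600   ! loop count taken from the question
   integer, parameter :: m = 3       ! hypothetical length of each output vector
   real, allocatable :: by_row(:, :), by_col(:, :)

   allocate(by_row(nF, m), by_col(m, nF))
   by_row = 0.0
   by_col = 0.0

   ! Fortran stores arrays column by column, so a row section is strided ...
   write(*, *) 'by_row(1, :) contiguous? ', is_contiguous(by_row(1, :))
   ! ... while a column section occupies adjacent memory locations.
   write(*, *) 'by_col(:, 1) contiguous? ', is_contiguous(by_col(:, 1))
end program section_layout
```

This is expected to print F for the row section and T for the column section. Whether the strided form also causes the compiler to create temporary copies at the call depends on how the dummy arguments of something are declared, which the question does not show. gfortran's -Warray-temporaries option can report such temporaries at compile time.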