Reshape and matmul in a loop with OpenMP
I was debugging some parallel code and found that a reshape operation messes up OpenMP. Below is a demo that reproduces the issue. I am not very familiar with OpenMP yet, so I would like to know what I am doing wrong here and whether there is a better way to do things (i.e. how best to nest reshape and matmul inside do loops). I have read about OpenBLAS as a potential solution, but I would first like to understand the cause. Thanks in advance.
program unittest
  use omp_lib
  implicit none
  complex*16, save, dimension(10,10) :: testmat
  integer :: i
  real :: t0, t1, t2

  !$ call OMP_set_num_threads(12)
  !$ call OMP_set_dynamic(.FALSE.)
  testmat = 0.d0

  ! time the loop executed inside an OpenMP parallel region
  call cpu_time(t0)
  !$OMP parallel
  !$OMP DO
  do i = 1, 1000000
    testmat = reshape(reshape(testmat, (/100,1/)), (/10,10/))
  end do
  !$OMP END DO
  !$OMP end parallel
  call cpu_time(t1)

  ! the same loop on a single thread
  do i = 1, 1000000
    testmat = reshape(reshape(testmat, (/100,1/)), (/10,10/))
  end do
  call cpu_time(t2)

  print *, 'parallel time, ', t1-t0, 's, single thread time, ', t2-t1, 's'
end program unittest
Compiled with gfortran on MinGW. Output on my machine is
(with parallel) 10.01 s
(single thread) 0.328 s
Overall CPU usage registers at less than 20% in the parallel case, which probably means something is holding OpenMP up?
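For reference, the OpenMP directives only take effect when OpenMP support is enabled at build time; with gfortran that is the -fopenmp flag, so the build command was along these lines (file name assumed):
gfortran -fopenmp unittest.f90 -o unittest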
====================
Edit:
Thank you. Some clarification: the following is okay-ish, in the sense that the parallel version does not run slower (both complete in about the same amount of time),
!$OMP parallel private(testmat2)
!$OMP DO
do i = 1, 1000000
  testmat2 = testmat * 10.d0
end do
!$OMP END DO
!$OMP end parallel
but this runs much slower in parallel than on a single thread (it takes about 50x longer in parallel than in serial):
!$OMP parallel private(testmat2)
!$OMP DO
do i = 1, 1000000
  testmat2 = reshape(reshape(testmat, (/100,1/)), (/10,10/))
end do
!$OMP END DO
!$OMP end parallel
So... what is special about reshape that causes this?
====================
Answer:
First of all, a reshape is not really an operation. It is a matter of internal bookkeeping: you are telling Fortran that a 100x1 array is now to be interpreted as 10x10, or so. This basically takes zero time.
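For instance, a reshape round trip hands back exactly the same data (a minimal standalone sketch, not from the original post):
program reshape_demo
  implicit none
  integer :: i
  integer :: a(2,3), b(2,3)
  a = reshape([(i, i = 1, 6)], [2,3])    ! fill with 1..6 in column-major order
  b = reshape(reshape(a, [6,1]), [2,3])  ! round trip through a 6x1 shape
  print *, all(a == b)                   ! prints T: the element order never changed
end program reshape_demo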
Next, you have a parallel do, which means the loop iterations get divided over the available threads: the same assignment gets performed many times, by many threads at once. So what you have is pure internal bookkeeping, done over and over in parallel. I am only a little surprised that this gives such strange results.
If you want to use a parallel do, you need a loop where each iteration does something involving the loop index, for instance
v(i) = ....
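A minimal sketch of that pattern (names like v and n are placeholders, not from the question): each iteration writes only its own element, so the threads never touch the same data.
program parallel_loop_demo
  implicit none
  integer, parameter :: n = 100000
  real(8), allocatable :: v(:)
  integer :: i

  allocate(v(n))

  !$OMP parallel do
  do i = 1, n
    v(i) = sin(real(i, 8))   ! each iteration writes only v(i); nothing is shared between threads
  end do
  !$OMP end parallel do

  print *, v(1), v(n)
end program parallel_loop_demo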