OpenMP runtime fluctuations
I am currently testing OpenMP in a big loop in my FORTRAN code. The code is part of a simulation module which is called from a VB.NET user interface; this interface also does the timing measurements. So I start a simulation, and at the end the software shows me how long it took (I mention this only to show that I do not use wtime or cpu_time for the timing measurements).

Now, when I repeatedly start a simulation with my parallelized loop, I always get different simulation times, ranging in one example from 1 min 30 s to almost 3 min! The results are always correct.

I tried different schedules for the loop (static, guided, dynamic), I tried to calculate the chunks assigned to each thread manually (do i=1,N -> do i=i_start,i_end, roughly as in the sketch below), and I tried to change the number of threads taking part in the calculation of the loop - with no change in the situation. When I remove the OpenMP directives from the code, the fluctuation does not occur, so they must be the reason for this behavior.

My machine is a quad-core Intel Xeon(R) CPU X3470 @ 2.93 GHz running Windows 7. I tried to run the program with multithreading both enabled and disabled (in the BIOS), but this did not change anything either.
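For illustration, the manual chunking I mean looks roughly like this (a minimal standalone sketch with illustrative names, not the actual simulation code):

PROGRAM manual_chunks
  ! Sketch: split DO n=1,NumEl into one contiguous chunk per thread by hand.
  USE omp_lib
  IMPLICIT NONE
  INTEGER, PARAMETER :: NumEl = 1000003
  INTEGER :: n, tid, nthreads, chunk, i_start, i_end
!$OMP PARALLEL PRIVATE(tid, nthreads, chunk, i_start, i_end, n)
  tid      = omp_get_thread_num()
  nthreads = omp_get_num_threads()
  chunk    = (NumEl + nthreads - 1) / nthreads   ! ceiling division
  i_start  = tid*chunk + 1
  i_end    = MIN((tid+1)*chunk, NumEl)
  DO n = i_start, i_end
     ! ... element loop body goes here ...
  END DO
!$OMP END PARALLEL
END PROGRAM manual_chunks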
Do you have any ideas what could be going wrong? A web search for situations like this showed that similar behavior occurred in other programmers' test environments as well, but a solution or reason was never mentioned. Thanks in advance for your thoughts.

Martin

EDIT

Here's the code:
!$OMP PARALLEL DO DEFAULT(SHARED) &
!$OMP PRIVATE(n,k,nk,i,j,l,List,Vx,Vz,cS,AE1,RootCh,Ec1,Ec2,Ec3,FcE,GcE,VxE,VzE,SMuL1,SMuL2) &
!$OMP PRIVATE(W1,W2,W3,Wx,Wz,S,i1,j1,AcE,j2,ic,iB,iBound,i2) &
!$OMP FIRSTPRIVATE(NumSEL) REDUCTION(-:Cum0,Cum1) REDUCTION(+:CumR)
DO n=1, NumEl
  ! Loop on subelements
  DO k=1, Elements(n)%nCorners-2
    nk = (k-1) * 3
    NumSEL=NumSEL+1
    !
    i=Elements(n)%KX(1)
    j=Elements(n)%KX(k+1)
    l=Elements(n)%KX(k+2)
    List(1)=i
    List(2)=j
    List(3)=l
    !
    IF(Level == NLevel) THEN
      Vx(1)=Nodes(i)%VxO
      Vx(2)=Nodes(j)%VxO
      Vx(3)=Nodes(l)%VxO
      Vz(1)=Nodes(i)%VzO
      Vz(2)=Nodes(j)%VzO
      Vz(3)=Nodes(l)%VzO
    ELSE
      Vx(1)=Nodes(i)%VxN
      Vx(2)=Nodes(j)%VxN
      Vx(3)=Nodes(l)%VxN
      Vz(1)=Nodes(i)%VzN
      Vz(2)=Nodes(j)%VzN
      Vz(3)=Nodes(l)%VzN
    END IF
    !
    cS=cBound(sol,5)
    cS=(MIN(cS,Nodes(i)%Conc(sol))+MIN(cS,Nodes(j)%Conc(sol))+MIN(cS,Nodes(l)%Conc(sol)))/3.0D0
    AE1=Elements(n)%xMul(k)*Elements(n)%Area(k)*dt*Eps
    RootCh=AE1*cS*(Nodes(i)%Sink+Nodes(j)%Sink+Nodes(l)%Sink)/3.0D0
    Cum0=Cum0-AE1*(Nodes(i)%Gc1+Nodes(j)%Gc1+Nodes(l)%Gc1)/3.0D0
    Cum1=Cum1-AE1*(Nodes(i)%Fc1+Nodes(j)%Fc1+Nodes(l)%Fc1)/3.0D0
    CumR=CumR+RootCh
    Ec1=(Nodes(i)%Dispxx+Nodes(j)%Dispxx+Nodes(l)%Dispxx)/3.0D0
    Ec2=(Nodes(i)%Dispxz+Nodes(j)%Dispxz+Nodes(l)%Dispxz)/3.0D0
    Ec3=(Nodes(i)%Dispzz+Nodes(j)%Dispzz+Nodes(l)%Dispzz)/3.0D0
    !
    IF (Level == NLevel) AcE=(Nodes(i)%Ac+Nodes(j)%Ac+Nodes(l)%Ac)/3.0D0
    !
    FcE=(Nodes(i)%Fc+Nodes(j)%Fc+Nodes(l)%Fc)/3.0D0
    GcE=(Nodes(i)%Gc+Nodes(j)%Gc+Nodes(l)%Gc)/3.0D0
    VxE=(Vx(1)+Vx(2)+Vx(3))/3.0D0
    VzE=(Vz(1)+Vz(2)+Vz(3))/3.0D0
    SMul1=-Elements(n)%AMul(k)
    SMul2=Elements(n)%Area(k)/20.0D0*Elements(n)%XMul(k)
    !
    If (lUpw) THEN
      !W1=WeTab(1,NumSEl)
      !W2=WeTab(2,NumSEl)
      !W3=WeTab(3,NumSEl)
      W1=WeTab(1,(n-1)*(Elements(n)%nCorners-2)+k)
      W2=WeTab(2,(n-1)*(Elements(n)%nCorners-2)+k)
      W3=WeTab(3,(n-1)*(Elements(n)%nCorners-2)+k)
      Wx(1)=2.0D0*Vx(1)*(W2-W3)+Vx(2)*(W2-2.0D0*W3)+Vx(3)*(2.0D0*W2-W3)
      Wx(2)=Vx(1)*(2.0D0*W3-W1)+2.0D0*Vx(2)*(W3-W1)+Vx(3)*(W3-2.0D0*W1)
      Wx(3)=Vx(1)*(W1-2.0D0*W2)+Vx(2)*(2.0D0*W1-W2)+2.0D0*Vx(3)*(W1-W2)
      Wz(1)=2.0D0*Vz(1)*(W2-W3)+Vz(2)*(W2-2.0D0*W3)+Vz(3)*(2.0D0*W2-W3)
      Wz(2)=Vz(1)*(2.0D0*W3-W1)+2.0D0*Vz(2)*(W3-W1)+Vz(3)*(W3-2.0D0*W1)
      Wz(3)=Vz(1)*(W1-2.0D0*W2)+Vz(2)*(2.0D0*W1-W2)+2.0D0*Vz(3)*(W1-W2)
    END IF
    !
    DO j1=1, 3
      i1=List(j1)
      !$OMP ATOMIC
      Nodes(i1)%F=Nodes(i1)%F+Elements(n)%GMul(k)*(GcE+Nodes(i1)%Gc/3.0D0)
      IF (Level==NLevel) then
        !$OMP ATOMIC
        Nodes(i1)%DS=Nodes(i1)%DS+Elements(n)%GMul(k)*(Ace+Nodes(i1)%Ac/3.0D0)
      end if
      iBound=0
      IF (Nodes(i)%Kode/=0) THEN
        BP_Loop : DO id=1, NumBP
          IF((KXB(id)==i1) .AND. (KodCB(id) > 0)) THEN
            iBound=1
            EXIT BP_Loop
          END IF
        END DO BP_Loop
      END IF
      !
      DO j2=1, 3
        i2=List(j2)
        S(j1,j2)=SMul1*(Ec1*Elements(n)%dz(nk+j1)*Elements(n)%dz(nk+j2)+ &
                        Ec3*Elements(n)%dx(nk+j1)*Elements(n)%dx(nk+j2)+ &
                        Ec2*(Elements(n)%dz(nk+j1)*Elements(n)%dx(nk+j2)+ &
                        Elements(n)%dx(nk+j1)*Elements(n)%dz(nk+j2)))
        S(j1,j2)=S(j1,j2)-(Elements(n)%dz(nk+j2)/8.0D0*(VxE+Vx(j1)/3.0D0)+ &
                           Elements(n)%dx(nk+j2)/8.0D0*(VzE+Vz(j1)/3.0D0))*Elements(n)%xMul(k)
        IF(lUpw) S(j1,j2)=S(j1,j2)-Elements(n)%xMul(k)* &
                          (Elements(n)%dz(nk+j2)/40.0D0*Wx(j1)+ &
                           Elements(n)%dx(nk+j2)/40.0D0*Wz(j1))
        ic=1
        IF (i1==i2) ic=2
        S(j1,j2)=S(j1,j2)+SMul2*ic*(FcE+(Nodes(i1)%Fc+Nodes(i2)%Fc)/3.0D0)
        IF (iBound==1) then
          if(j2.eq.1) then
            !$OMP ATOMIC
            Nodes(i1)%Qc(sol)=Nodes(i1)%Qc(sol)-Eps*S(j1,j2)*Nodes(i2)%Conc(sol)-Eps*Elements(n)%GMul(k)*(GcE+Nodes(i1)%Gc/3.0D0)
          else
            !$OMP ATOMIC
            Nodes(i1)%Qc(sol)=Nodes(i1)%Qc(sol)-Eps*S(j1,j2)*Nodes(i2)%Conc(sol)
          end if
        end if
        IF (Level/=NLevel) THEN
          !$OMP ATOMIC
          B(i1)=B(i1)-alf*S(j1,j2)*Nodes(i2)%Conc(sol)
        ELSE
          IF (lOrt) THEN
            CALL rFIND(i1,i2,kk,NumNP,MBandD,IAD,IADN)
            iB=kk
          ELSE
            iB=MBand+i2-i1
          END IF
          !$OMP ATOMIC
          A(iB,i1)=A(iB,i1)+epsi*S(j1,j2)
        END IF
      END DO
    END DO
  END DO
END DO
!$OMP END PARALLEL DO
1 Answer
If you want to check the performance of the program, I would suggest doing the timings inside the program with the OpenMP timing functions (see the OpenMP reference sheet). So you need to do something like:
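A minimal, self-contained sketch (using omp_get_wtime from omp_lib; the dummy loop just stands in for your real one):

PROGRAM wtime_demo
  ! Bracket the parallel loop with omp_get_wtime so the measurement covers
  ! only the OpenMP region, not the VB.NET call overhead around it.
  USE omp_lib
  IMPLICIT NONE
  INTEGER, PARAMETER :: NumEl = 100000000
  INTEGER :: n
  DOUBLE PRECISION :: t_start, t_end, s
  s = 0.0D0
  t_start = omp_get_wtime()
!$OMP PARALLEL DO REDUCTION(+:s)
  DO n = 1, NumEl
     s = s + 1.0D0/n          ! stand-in for the real loop body
  END DO
!$OMP END PARALLEL DO
  t_end = omp_get_wtime()
  WRITE(*,*) 'Parallel loop wall time: ', t_end - t_start, ' s  (s =', s, ')'
END PROGRAM wtime_demo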
I sometimes find these to reflect the actual parallelization timings better. Do you use those?
As FFox says, it can simply be due to the ATOMIC statements, which are delayed by different amounts on each run. Remember that the threads are created at run time, so the layout of the threads may not be the same for each run.

With such a loop I would try to see whether you could gain speed by splitting it up. Of course this is not needed if the speedup is already around 2 for 2 threads. Just an idea.
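As one illustration of how much scattered ATOMIC updates alone can cost, here is a self-contained toy (not your loop, just the pattern); it compares per-update ATOMIC against accumulating into a thread-private array that is merged once per thread. The array f and the index pattern are purely illustrative:

PROGRAM atomic_vs_private
  ! Toy comparison: ATOMIC on every update vs. thread-private accumulation.
  USE omp_lib
  IMPLICIT NONE
  INTEGER, PARAMETER :: nnode = 100000, nel = 20000000
  DOUBLE PRECISION, ALLOCATABLE :: f(:), f_local(:)
  DOUBLE PRECISION :: t0, t1
  INTEGER :: n, idx
  ALLOCATE(f(nnode))

  ! Version 1: one ATOMIC per update (the pattern used in the question).
  f  = 0.0D0
  t0 = omp_get_wtime()
!$OMP PARALLEL DO PRIVATE(idx)
  DO n = 1, nel
     idx = MOD(n, nnode) + 1
!$OMP ATOMIC
     f(idx) = f(idx) + 1.0D0
  END DO
!$OMP END PARALLEL DO
  t1 = omp_get_wtime()
  WRITE(*,*) 'atomic updates:     ', t1 - t0, ' s'

  ! Version 2: accumulate into a private copy, one synchronized merge per thread.
  f  = 0.0D0
  t0 = omp_get_wtime()
!$OMP PARALLEL PRIVATE(f_local, n, idx)
  ALLOCATE(f_local(nnode))
  f_local = 0.0D0
!$OMP DO
  DO n = 1, nel
     idx = MOD(n, nnode) + 1
     f_local(idx) = f_local(idx) + 1.0D0
  END DO
!$OMP END DO
!$OMP CRITICAL
  f = f + f_local
!$OMP END CRITICAL
  DEALLOCATE(f_local)
!$OMP END PARALLEL
  t1 = omp_get_wtime()
  WRITE(*,*) 'per-thread buffers: ', t1 - t0, ' s'
END PROGRAM atomic_vs_private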