OpenMP F90/95 nested DO loops - trouble getting any improvement over the serial implementation

Posted 2024-11-14 07:21:46


I've done some searching but couldn't find anything that appeared to be related to my question (sorry if my question is redundant!). Anyway, as the title states, I'm having trouble getting any improvement over the serial implementation of my code. The code snippet that I need to parallelize is as follows (this is Fortran90 with OpenMP):

do n=1,lm
  do m=1,jm
    do l=1,im
      sum_u = 0
      sum_v = 0
      sum_t = 0
      do k=1,lm
        !$omp parallel do reduction (+:sum_u,sum_v,sum_t)
        do j=1,jm
          do i=1,im
            exp_smoother=exp(-(abs(i-l)/hzscl)-(abs(j-m)/hzscl)-(abs(k-n)/vscl))
            sum_u = sum_u + u_p(i,j,k) * exp_smoother
            sum_v = sum_v + v_p(i,j,k) * exp_smoother
            sum_t = sum_t + t_p(i,j,k) * exp_smoother

            sum_u_pert(l,m,n) = sum_u
            sum_v_pert(l,m,n) = sum_v
            sum_t_pert(l,m,n) = sum_t

          end do
        end do
      end do
    end do
  end do
end do

Am I running into race condition issues? Or am I simply putting the directive in the wrong place? I'm pretty new to this, so I apologize if this is an overly simplistic problem.

Anyway, without parallelization, the code is excruciatingly slow. To give an idea of the size of the problem, the lm, jm, and im indexes are 60, 401, and 501 respectively. So the parallelization is critical. Any help or links to helpful resources would be very much appreciated! I'm using xlf to compile the above code, if that's at all useful.

Thanks!
-Jen


貪欢 2024-11-21 07:21:46


The obvious place to put the OpenMP directive is the outermost loop.

For every (l,m,n), you're calculating a convolution between your perturbed variables and an exponential smoother. Each (l,m,n) calculation is completely independent of the others, so you can parallelize the outermost loop. So, for instance, the simplest version

! the accumulators must start from zero, since the loop now adds into them
sum_u_pert = 0.
sum_v_pert = 0.
sum_t_pert = 0.

! with default(none), everything referenced in the region must be listed;
! if the bounds (im,jm,lm) and scales (hzscl,vscl) are variables rather
! than named constants, they belong in the shared list as well
!$omp parallel do default(none) private(n,m,l,i,j,k,exp_smoother) &
!$omp shared(sum_u_pert,sum_v_pert,sum_t_pert,u_p,v_p,t_p,im,jm,lm,hzscl,vscl)
do n=1,lm
  do m=1,jm
    do l=1,im
      do k=1,lm
        do j=1,jm
          do i=1,im
            exp_smoother=exp(-(abs(i-l)/hzscl)-(abs(j-m)/hzscl)-(abs(k-n)/vscl))
            sum_u_pert(l,m,n) = sum_u_pert(l,m,n) + u_p(i,j,k) * exp_smoother
            sum_v_pert(l,m,n) = sum_v_pert(l,m,n) + v_p(i,j,k) * exp_smoother
            sum_t_pert(l,m,n) = sum_t_pert(l,m,n) + t_p(i,j,k) * exp_smoother
          end do
        end do
      end do
    end do
  end do
end do

gives me a ~6x speedup on 8 cores (using a much reduced problem size of 20x41x41). Given the amount of work there is to do in the loops, even at the smaller size, I assume the reason it's not an 8x speedup involves memory contention or false sharing; for further performance tuning you might want to explicitly break the sum arrays into sub-blocks for each thread, and combine them at the end (see the sketch below); but depending on the problem size, having the equivalent of an extra im x jm x lm sized array might not be desirable.
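
For example, a minimal sketch of that sub-blocking idea, shown for the u field only and assuming the same declarations and bounds as the original post; u_part, tid, and nth are illustrative names, and the thread queries come from the standard omp_lib module:

use omp_lib
real, allocatable :: u_part(:,:,:,:)
integer :: tid, nth

nth = omp_get_max_threads()
allocate(u_part(im,jm,lm,0:nth-1))    ! one im x jm x lm slab per thread
u_part = 0.0

!$omp parallel private(tid,n,m,l,k,j,i,exp_smoother)
tid = omp_get_thread_num()
!$omp do
do n=1,lm
  do m=1,jm
    do l=1,im
      do k=1,lm
        do j=1,jm
          do i=1,im
            exp_smoother=exp(-(abs(i-l)/hzscl)-(abs(j-m)/hzscl)-(abs(k-n)/vscl))
            ! each thread writes only its own slab, so threads never share cache lines
            u_part(l,m,n,tid) = u_part(l,m,n,tid) + u_p(i,j,k) * exp_smoother
          end do
        end do
      end do
    end do
  end do
end do
!$omp end do
!$omp end parallel

sum_u_pert = sum(u_part, dim=4)       ! combine the per-thread slabs serially
deallocate(u_part)

The trade-off is exactly the one mentioned above: this sketch costs one extra im x jm x lm slab per thread, which at the full 501x401x60 problem size is substantial.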

It seems like there's a lot of structure in this problem that you ought to be able to exploit to speed up even the serial case, but it's easier to say that than to find it; playing around with pen and paper, nothing comes to mind in a few minutes, but someone cleverer may spot something.

北斗星光 2024-11-21 07:21:46


What you have is a convolution. This can be done with a Fast Fourier Transform in O(N log N) time; your current algorithm is O(N^2). If you use an FFT, one core will probably be enough!
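
To make the convolution theorem concrete, here is a minimal one-dimensional sketch. FFTW's legacy Fortran interface is an assumption on my part (the post doesn't name a library; link with -lfftw3), and the lengths and values are toy choices. Both sequences are zero-padded so the circular convolution the FFT computes coincides with the linear one, and the result is divided by the length because FFTW's inverse transform is unnormalized:

program fft_conv_sketch
  implicit none
  include 'fftw3.f'                          ! defines FFTW_FORWARD, FFTW_BACKWARD, FFTW_ESTIMATE
  integer, parameter :: np = 32              ! padded length >= len(signal) + len(kernel) - 1
  double complex :: sig(np), ker(np)
  integer*8 :: psig, pker, pinv
  integer :: i

  sig = (0.d0, 0.d0)
  ker = (0.d0, 0.d0)
  sig(1:8) = (1.d0, 0.d0)                    ! toy signal
  do i = 1, 8
    ker(i) = exp(-dble(i-1)/2.d0)            ! toy exponential smoother
  end do

  ! in-place transforms; FFTW_ESTIMATE planning leaves the arrays untouched
  call dfftw_plan_dft_1d(psig, np, sig, sig, FFTW_FORWARD, FFTW_ESTIMATE)
  call dfftw_plan_dft_1d(pker, np, ker, ker, FFTW_FORWARD, FFTW_ESTIMATE)
  call dfftw_plan_dft_1d(pinv, np, sig, sig, FFTW_BACKWARD, FFTW_ESTIMATE)

  call dfftw_execute(psig)
  call dfftw_execute(pker)
  sig = sig * ker                            ! pointwise product <=> convolution
  call dfftw_execute(pinv)
  sig = sig / dble(np)                       ! undo FFTW's unnormalized inverse

  print *, real(sig(1:15))                   ! the linear convolution of the two inputs

  call dfftw_destroy_plan(psig)
  call dfftw_destroy_plan(pker)
  call dfftw_destroy_plan(pinv)
end program fft_conv_sketch

The same idea extends to three dimensions with dfftw_plan_dft_3d, applied once per field.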
