How to break out of a nested parallel (OpenMP) Fortran loop idiomatically?
Here's sequential code:
do i = 1, n
   do j = i+1, n
      if ("some_condition(i,j)") then
         result = "here's result"
         return
      end if
   end do
end do
Is there a cleaner way to execute iterations of the outer loop concurrently other than:
!$OMP PARALLEL private(i,j)
!$OMP DO
do i = 1, n
   !$OMP FLUSH(found)
   if (found) goto 10
   do j = i+1, n
      if ("some_condition(i,j)") then
         !$OMP CRITICAL
         !$OMP FLUSH(found)
         if (.not.found) then
            found = .true.
            result = "here's result"
         end if
         !$OMP FLUSH(found)
         !$OMP END CRITICAL
         goto 10
      end if
   end do
10 continue
end do
!$OMP END DO NOWAIT
!$OMP END PARALLEL
The order of iterations over the i-loop may be arbitrary as long as some result is found (it doesn't matter if it changes from run to run as long as it satisfies "some_condition").
3 Answers
It seems that your sequential code has a dependency that makes it unsuitable for parallelization. Suppose that there are multiple values of i & j that make "some condition" true -- then the order of execution of the i & j do loops determines which of these conditions is found first and sets the value of result, after which the return statement ends the search for additional cases i,j for which "some condition" is true. In the sequential code the do loops always execute in the same order, so the program behaves deterministically and the same values of i & j that make "some condition" true are always found. In a concurrent version the i iterations execute in a non-deterministic order, so from run to run a different value of i might be the first one to find a true "some condition".
Perhaps you as a programmer know that there is only one pair of i & j that results in a true "some condition"? In that case short-circuiting the execution would seem OK. But the OpenMP spec says that "No statement in the associated loops other than the DO statements can cause a branch out of the loops", so having something in the inner loop abort the outer loop isn't allowed. If there is always only one true "some condition", you could just remove the "return" and waste some CPU time by letting threads keep looking for cases where "some condition" is true after the one case has been found. That might still be faster than a sequential program. With a scalar "result" variable it is still probably non-compliant, since it has a dependency on the order of execution. You could change it into a "reduction", summing the result, or return result as a 1-D array of dimension (n). If you need to find the smallest value of i for which "some condition" is true, you could obtain that from the array result using the Fortran intrinsic function minloc.
A solution with many "flush" and "critical" directives may not be faster than the sequential version.
UPDATE: Based on the clarification that multiple results are possible and that any will do, one parallel method would be to return multiple results and let sequential code pick one out -- make "result" into a 1-D array rather than a scalar. You are allowed to short-circuit the inner j-loop because it is not "associated" with the "omp do" directive, so "result" need only be 1-D, dimensioned according to the range of i. So something like this:
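(The original snippet isn't preserved in this copy; below is a minimal sketch of the idea. The value of n, the some_condition predicate and the integer element type of results are placeholders, not part of the answer.)

program results_array_sketch
   implicit none
   integer, parameter :: n = 1000
   logical :: found(n)
   integer :: results(n)            ! one slot per i; element type is a placeholder
   integer :: i, j, first

   found   = .false.
   results = 0

   !$OMP PARALLEL DO PRIVATE(j)
   do i = 1, n
      do j = i + 1, n
         if (some_condition(i, j)) then
            results(i) = j          ! record "here's result" for this i
            found(i)   = .true.
            exit                    ! inner loop is not associated with the OMP DO, so exit is allowed
         end if
      end do
   end do
   !$OMP END PARALLEL DO

   ! sequential post-processing: pick, say, the smallest i that succeeded
   first = 0
   do i = 1, n
      if (found(i)) then
         first = i
         exit
      end if
   end do
   if (first > 0) print *, 'hit at i =', first, ', result =', results(first)

contains

   logical function some_condition(i, j)   ! placeholder predicate
      integer, intent(in) :: i, j
      some_condition = (i * j == 123456)
   end function some_condition

end program results_array_sketch

Each iteration writes only to its own element of results and found, so no synchronization is needed inside the loop; the final sequential scan (or minloc, as mentioned above) chooses which hit to use.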
Another approach entirely would be to use the TASK construct which is part of OpenMP 3.0. What you seem to be trying to do is to divide your loops across threads, compute until any thread finds an answer, then have all threads stop. Trouble is, the necessity to have all threads check a shared flag is (a) killing your performance and (b) leading you into ugly loops with BREAKS and CYCLES.
I think @M.S.B.'s answer gives very good advice on how to adapt your existing approach. But, perhaps a more natural way of tackling the problem would be for the program to create a number of tasks (perhaps one for each iteration of your innermost loop) and to dispatch those to worker threads. Once any thread reports success all threads can be sent a finalisation task and your program can continue.
This would, of course, require more re-writing of your program and probably make sequential execution worse. It will definitely require that your implementation of OpenMP supports v3.0 of the standard.
And you may need more help in this area than I can manage; I've only just started playing with OpenMP TASKS myself.
It seems $OMP DO doesn't allow breaking out of the loop early. An alternative might be to implement it by hand.

Give each thread a fixed contiguous range of indices to process

Following Guide into OpenMP: Easy multithreading programming for C++:
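(The original listing didn't survive in this copy; here is a rough sketch of the hand-split variant. It assumes an integer n and a logical function some_condition(i, j), as in the question, and records hits in a results array as suggested by @M. S. B.)

use omp_lib
logical :: found
integer :: results(n)              ! one slot per i, 0 = no hit
integer :: i, j, tid, nthreads, lo, hi

found   = .false.
results = 0

!$OMP PARALLEL PRIVATE(i, j, tid, nthreads, lo, hi)
tid      = omp_get_thread_num()
nthreads = omp_get_num_threads()
lo = 1 + (tid * n) / nthreads      ! this thread's contiguous block of i-indices
hi = ((tid + 1) * n) / nthreads

outer: do i = lo, hi
   if (found) exit outer           ! another thread already recorded a hit; stop early
   do j = i + 1, n
      if (some_condition(i, j)) then
         results(i) = j            ! record the hit for this i
         found = .true.            ! plain shared flag; strictly an ATOMIC/FLUSH would be needed,
                                   ! but correctness relies only on the results array
         exit outer
      end if
   end do
end do outer
!$OMP END PARALLEL
! any non-zero entry of results can now be picked sequentially

Because the outer do loop here is not associated with a worksharing directive (the range is split by hand), exiting it early is legal, which is what gives the speed-up once a hit is found.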
UPDATE: replaced goto by exit, introduced a results array based on @M. S. B.'s answer. If a solution exists, this approach is faster than $OMP DO due to the earlier exit.

Give each thread one iteration at a time to process
Using task directive (suggested by @High Performance Mark):
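(Again, the code itself is missing from this copy; a sketch of the task-based variant under the same assumptions as above, with n, some_condition and results as placeholders, might look like this. It needs an OpenMP 3.0 implementation.)

logical :: found
integer :: results(n)
integer :: i, j

found   = .false.
results = 0

!$OMP PARALLEL
!$OMP SINGLE
do i = 1, n
   !$OMP TASK FIRSTPRIVATE(i) PRIVATE(j) SHARED(found, results)
   if (.not. found) then            ! cheap early-out; strictly an atomic read would be cleaner
      do j = i + 1, n
         if (some_condition(i, j)) then
            results(i) = j
            found = .true.
            exit
         end if
      end do
   end if
   !$OMP END TASK
end do
!$OMP END SINGLE
!$OMP END PARALLEL                  ! all tasks have completed at the END SINGLE barrier
! pick any non-zero entry of results sequentially, as before

One thread generates a task per outer iteration, so iterations are handed out one at a time to whichever thread is free, and tasks created after a hit return almost immediately.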
This variant is 2 times faster on my tests than the outer-loop version.