Parallelizing an inner loop with OpenMP
I have three nested loops, but only the innermost one is parallelizable. The stop conditions of the outer and middle loops depend on the calculations done by the innermost loop, so I cannot change the loop order.
I put an OpenMP pragma directive just before the innermost loop, but performance with two threads is worse than with one. I guess this is because the threads are being created on every iteration of the outer loops.
Is there any way to create the threads outside the outer loops but use them only in the innermost loop?
Thanks in advance
Comments (2)
OpenMP should be using a thread-pool, so you won't be recreating threads every time you execute your loop. Strictly speaking, however, that might depend on the OpenMP implementation you are using (I know the GNU compiler uses a pool). I suggest you look for other common problems, such as false sharing.
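To illustrate the false-sharing point: the asker's code is not shown, so the following is only a hypothetical sketch (the function name and the summation task are made up). A common way to run into false sharing is to have each thread update its own slot of a small shared array inside the hot loop, because neighbouring slots tend to sit on the same cache line. Accumulating into a thread-private variable and touching shared memory once per thread avoids that.

```c
/* Hypothetical example: sum x[0..n-1].
   Each thread accumulates into its own private variable `local` and
   updates the shared `total` exactly once, so threads do not write to
   neighbouring shared memory inside the hot loop. */
double sum_with_private_accumulators(const double *x, long n)
{
    double total = 0.0;             /* shared result */

    #pragma omp parallel
    {
        double local = 0.0;         /* private to each thread */

        #pragma omp for nowait      /* no barrier needed before the update */
        for (long i = 0; i < n; ++i)
            local += x[i];

        #pragma omp atomic          /* one shared write per thread */
        total += local;
    }                               /* implicit barrier: total is complete */
    return total;
}
```

For a plain sum, a `reduction(+:total)` clause on the loop would do the same job more briefly; the explicit version is shown only to make the private accumulator visible.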
Unfortunately, current multicore computer systems are poorly suited to such fine-grained inner-loop parallelism. This is not because of a thread creation/forking issue. As Itjax pointed out, virtually all OpenMP implementations use thread pools, i.e., they create a number of threads in advance and park them until they are needed. So creating threads adds essentially no overhead.
However, parallelizing such an inner loop incurs two kinds of overhead every time the loop is entered: assigning work to the threads, and joining the threads when the loop ends.
Hence, one should minimize the number of times work is assigned to threads and threads are joined. You can reduce this overhead by increasing the amount of work the inner loop does per invocation, which may be possible through code changes such as loop unrolling.
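As for the original question of setting up the threads outside the outer loops: one commonly used pattern is to open the `parallel` region once around the outer loop and put only a work-sharing `for` on the inner loop. The sketch below is an assumption-laden illustration, not the asker's code: the function name, the halving step, and the stopping rule are all hypothetical, and it shows two loop levels rather than three. Note that the barriers inside the loop remain, so the assign/join cost discussed above is not eliminated; only re-entering the parallel region is avoided.

```c
/* Hypothetical example: halve every element of data[] until the sum of
   the elements drops below `limit` (or `max_sweeps` is reached).
   Returns the number of sweeps performed. */
int halve_until_below(double *data, long n, double limit, int max_sweeps)
{
    double sum = 0.0;               /* shared reduction target */
    int sweeps = 0;                 /* shared sweep counter */
    int done = (max_sweeps <= 0);   /* shared stop flag */

    #pragma omp parallel            /* team is set up once, outside the loops */
    {
        while (!done) {             /* every thread runs the outer loop */
            #pragma omp single
            sum = 0.0;              /* one thread resets; implicit barrier follows */

            #pragma omp for reduction(+:sum)
            for (long i = 0; i < n; ++i) {  /* only this loop is work-shared */
                data[i] *= 0.5;
                sum += data[i];
            }                       /* implicit barrier: sum is now complete */

            #pragma omp single
            {
                ++sweeps;           /* stop condition depends on the inner loop */
                if (sum < limit || sweeps >= max_sweeps)
                    done = 1;
            }                       /* implicit barrier: all threads see `done` */
        }
    }
    return sweeps;
}
```

Because the stop flag is only written inside a `single` block that ends with a barrier, every thread evaluates the same value of `done` and leaves the outer loop on the same iteration. Compiled without OpenMP support, the pragmas are ignored and the function runs sequentially with the same result.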