Parallelizing an inner loop with OpenMP

Posted on 2024-10-16 03:59:52

I have three nested loops, but only the innermost is parallelizable. The stop conditions of the outer and middle loops depend on the calculations done by the innermost loop, so I cannot change the loop order.

I placed an OpenMP pragma directive just before the innermost loop, but the performance with two threads is worse than with one. I guess this is because the threads are being created on every iteration of the outer loops.

Is there any way to create the threads outside the outer loops but use them only in the innermost loop?

Thanks in advance

Comments (2)

笑饮青盏花 2024-10-23 03:59:52

OpenMP should be using a thread pool, so you won't be recreating threads every time you execute your loop. Strictly speaking, however, that might depend on the OpenMP implementation you are using (I know the GNU compiler uses a pool). I suggest you look for other common problems, such as false sharing, where threads repeatedly write to different variables that happen to sit on the same cache line.
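To illustrate the false-sharing point, here is a minimal sketch, not taken from the question's code. It assumes a 64-byte cache line (the real size depends on the hardware), and `sum_padded` and `padded_double` are hypothetical names made up for the example. Each thread accumulates into its own padded slot, so no two threads write to the same cache line.

```c
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#else
/* Serial fallbacks so the sketch also builds without OpenMP enabled. */
static int omp_get_max_threads(void) { return 1; }
static int omp_get_thread_num(void) { return 0; }
#endif

#define CACHE_LINE 64 /* assumed cache-line size in bytes */

/* One accumulator per thread, padded so neighbours sit on different lines. */
typedef struct {
    double value;
    char pad[CACHE_LINE - sizeof(double)];
} padded_double;

/* Hypothetical helper: sums a[0..n-1] using per-thread padded slots. */
double sum_padded(const double *a, int n)
{
    int nthreads = omp_get_max_threads();
    padded_double *partial = calloc((size_t)nthreads, sizeof *partial);
    if (partial == NULL)
        return 0.0; /* allocation failed; a real program would report this */

    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < n; i++)
            partial[tid].value += a[i];
    }

    /* Combine the per-thread partial sums serially. */
    double total = 0.0;
    for (int t = 0; t < nthreads; t++)
        total += partial[t].value;
    free(partial);
    return total;
}
```

Without the `pad` member the slots would be adjacent doubles, and several threads would keep invalidating each other's cache line. In practice a `reduction(+:total)` clause on the loop gives the same result with less code; the padded array is only meant to make the memory layout visible.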

软糯酥胸 2024-10-23 03:59:52

Unfortunately, current multicore computer systems are poorly suited to such fine-grained inner-loop parallelism. This is not because of a thread creation/forking issue. As Itjax pointed out, virtually all OpenMP implementations use thread pools, i.e., they pre-create a number of threads and park them while idle. So there is effectively no thread-creation overhead.

However, parallelizing inner loops this way runs into the following two sources of overhead:

  • Dispatching jobs/tasks to threads: even if we don't need to physically create threads, we must at least assign jobs (= create logical tasks) to threads, which generally requires synchronization.
  • Joining threads: after all threads in a team finish their share of the work, they must be joined (unless the OpenMP nowait clause is used). This is typically implemented as a barrier operation, which is also synchronization-heavy.
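As a small illustration of the nowait point, here is a sketch with a hypothetical helper `scale_both`. It assumes the two arrays do not overlap, which is the only reason the first barrier can be dropped safely.

```c
/* Hypothetical helper: scales two independent, non-overlapping arrays
   inside a single parallel region. */
void scale_both(double *a, double *b, int n, double s)
{
    #pragma omp parallel
    {
        /* nowait drops the barrier after this loop; safe only because
           the second loop never touches a[]. */
        #pragma omp for nowait
        for (int i = 0; i < n; i++)
            a[i] *= s;

        /* The implicit barrier at the end of this loop is kept. */
        #pragma omp for
        for (int i = 0; i < n; i++)
            b[i] *= s;
    }
}
```

If the second loop read values written by the first, the barrier would have to stay, so nowait does not help when each pass depends on the previous one.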

Hence, one should minimize the number of times work is dispatched to and joined from threads. You can reduce this overhead by increasing the amount of work the inner loop does per invocation. This could be done with code changes such as loop unrolling.
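For what the question literally asks (one team of threads outside the outer loop, work-sharing only on the inner loop), a sketch could look like the following. The function `relax_until_converged`, its parameters, and the halving update are all made up for illustration; the real loops and stop condition would come from the asker's code. Note that this only avoids re-entering the parallel region. Every pass still goes through barriers, so it may well not be faster for a small inner loop.

```c
/* Hypothetical helper: halves every x[i] until the sum of squares drops
   below tol or max_iter passes have run. Returns the last residual. */
double relax_until_converged(double *x, int n, double tol, int max_iter,
                             int *iters_out)
{
    double residual = 0.0; /* shared by all threads in the team */

    #pragma omp parallel
    {
        int iter = 0; /* private: every thread counts the same passes */
        int done = 0;

        /* The outer loop is executed redundantly by every thread. */
        while (!done) {
            /* One thread resets the accumulator; the implicit barrier
               after single keeps the others from starting early. */
            #pragma omp single
            residual = 0.0;

            /* Only this loop is work-shared across the team. */
            #pragma omp for reduction(+:residual)
            for (int i = 0; i < n; i++) {
                x[i] *= 0.5;
                residual += x[i] * x[i];
            }
            /* Implicit barrier: residual is now complete. */

            double seen = residual;
            /* Every thread must read residual before it is reset again. */
            #pragma omp barrier
            iter++;
            done = (seen < tol) || (iter >= max_iter);
        }

        #pragma omp single
        *iters_out = iter;
    }
    return residual;
}
```

All threads evaluate the stop condition from the same reduced value, so they leave the outer loop together. The explicit barrier is there so that no thread resets `residual` for the next pass while another is still reading it.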
