OpenMP: What are the benefits of nested parallelization?
From what I understand, #pragma omp parallel and its variations basically execute the following block in a number of concurrent threads, which corresponds to the number of CPUs. When having nested parallelizations - parallel for within parallel for, parallel function within parallel function, etc. - what happens on the inner parallelization?
I'm new to OpenMP, and the case I have in mind is probably rather trivial - multiplying a vector with a matrix. This is done in two nested for loops. Assuming the number of CPUs is smaller than the number of elements in the vector, is there any benefit in trying to run the inner loop in parallel? Will the total number of threads be larger than the number of CPUs, or will the inner loop be executed sequentially?
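For concreteness, here is a minimal serial sketch of the two nested loops the question describes (my own illustration; the name matvec_serial and the row-major layout are assumptions, not code from the question):

    #include <stddef.h>

    /* ans = matrix * vec: matrix is m x n (row-major), vec has n elements, ans has m. */
    void matvec_serial(size_t m, size_t n, const double *matrix,
                       const double *vec, double *ans)
    {
        for (size_t i = 0; i < m; ++i) {      /* outer loop over rows    */
            double sum = 0.0;
            for (size_t j = 0; j < n; ++j)    /* inner loop over columns */
                sum += matrix[i * n + j] * vec[j];
            ans[i] = sum;
        }
    }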
3 Answers
(1) Nested parallelism in OpenMP:
http://docs.oracle.com/cd/E19205-01/819-5270/aewbc/index.html
You need to turn on nested parallelism by setting OMP_NESTED or calling omp_set_nested, because many implementations turn this feature off by default, and some implementations do not support nested parallelism fully. If it is turned on, whenever OpenMP meets a parallel for, it creates the number of threads defined by OMP_NUM_THREADS. So, with two levels of parallelism, the total number of threads would be N^2, where N = OMP_NUM_THREADS.
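As a rough sketch of that thread-count multiplication (my own illustration, not code from the answer; assumes the runtime honours the requested team sizes):

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        omp_set_nested(1);        /* same effect as OMP_NESTED=TRUE in the environment */
        omp_set_num_threads(4);   /* N = 4 at each level */

        #pragma omp parallel      /* outer team: 4 threads */
        {
            #pragma omp parallel  /* one inner team per outer thread -> up to 4*4 = 16 threads */
            {
                #pragma omp single
                printf("outer thread %d has an inner team of %d threads\n",
                       omp_get_ancestor_thread_num(1), omp_get_num_threads());
            }
        }
        return 0;
    }

(Newer OpenMP versions deprecate omp_set_nested in favour of omp_set_max_active_levels, but the calls above still work in common implementations.)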
Such nested parallelism will cause oversubscription (i.e., the number of busy threads is greater than the number of cores), which may degrade the speedup. In an extreme case, where nested parallelism is invoked recursively, the number of threads can balloon (e.g., thousands of threads get created), and the computer just wastes time on context switching. In such a case, you may control the number of threads dynamically via omp_set_dynamic.
(2) An example of matrix-vector multiplication: the code would look like:
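(The answer's original code block is not shown here; the following is a minimal sketch of such a matrix-vector loop with only the outer loop parallelized. The name matvec and the row-major layout are my assumptions.)

    #include <stddef.h>

    /* ans = matrix * vec: matrix is m x n (row-major), vec has n elements, ans has m. */
    void matvec(size_t m, size_t n, const double *matrix,
                const double *vec, double *ans)
    {
        #pragma omp parallel for              /* parallelize the outer loop only */
        for (long i = 0; i < (long)m; ++i) {
            double sum = 0.0;
            for (size_t j = 0; j < n; ++j)    /* inner loop runs sequentially in each thread */
                sum += matrix[(size_t)i * n + j] * vec[j];
            ans[i] = sum;
        }
    }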
In general, parallelizing the inner loop when the outer loop can be parallelized is a bad idea, because of the fork/join overhead of threads. (Although many OpenMP implementations pre-create threads, work still has to be dispatched to the threads, and an implicit barrier is invoked at the end of each parallel for.)
Your concern is the case where N < the number of CPUs. Yes, in that case the speedup would be limited by N, and allowing nested parallelism definitely has benefits.
However, the code would then cause oversubscription if N is sufficiently large. One solution I can think of is using omp_set_dynamic for the nested parallelism. But please make sure you understand how omp_set_dynamic controls the number of threads and the activity of threads; implementations may vary.
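A small sketch of what that warning means in practice (illustrative only; whether and how the runtime shrinks a team when dynamic adjustment is enabled is implementation-defined):

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        omp_set_nested(1);
        omp_set_dynamic(1);                   /* allow the runtime to shrink team sizes */

        #pragma omp parallel num_threads(64)  /* request far more threads than cores */
        {
            #pragma omp single
            printf("requested 64, got a team of %d threads (dynamic=%d)\n",
                   omp_get_num_threads(), omp_get_dynamic());
        }
        return 0;
    }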
For something like dense linear algebra, where all the potential parallelism is already laid bare in one place in nice wide for loops, you don't need nested parallelism -- if you do want to protect against the case of having (say) really narrow matrices, where the leading dimension might be smaller than the number of cores, then all you need is the collapse directive, which notionally flattens the multiple loops into one.
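A sketch of what that could look like for the matrix-vector case (my own illustration; the atomic update is only there to keep the collapsed version correct and would normally be avoided, e.g. with an array reduction):

    #include <stddef.h>

    /* collapse(2) flattens the i/j loops into one iteration space, so even a "narrow"
       matrix (m smaller than the core count) still exposes m*n parallel iterations. */
    void matvec_collapse(size_t m, size_t n, const double *matrix,
                         const double *vec, double *ans)   /* ans must be zero-initialized */
    {
        #pragma omp parallel for collapse(2)
        for (long i = 0; i < (long)m; ++i)
            for (long j = 0; j < (long)n; ++j) {
                #pragma omp atomic
                ans[i] += matrix[(size_t)i * n + (size_t)j] * vec[j];
            }
    }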
Nested parallelism is for those cases where the parallelism isn't all exposed at once -- say you want to do 2 simultaneous function evaluations, each of which could usefully utilize 4 cores, and you have an 8 core system. You call the function in a parallel section, and within the function definition there is an additional, say, parallel for.
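A sketch of that scenario (the evaluate function, array sizes, and team sizes are hypothetical; assumes nested parallelism is enabled):

    #include <omp.h>
    #include <stdio.h>

    /* Hypothetical expensive function whose own loop is parallelized internally. */
    static double evaluate(const double *data, int n)
    {
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum) num_threads(4)  /* inner team of 4 */
        for (int i = 0; i < n; ++i)
            sum += data[i] * data[i];
        return sum;
    }

    int main(void)
    {
        enum { N = 1000000 };
        static double a[N], b[N];     /* zero-initialized, just for illustration */
        double ra, rb;

        omp_set_nested(1);            /* let the inner parallel for actually fork */

        #pragma omp parallel sections num_threads(2)  /* two simultaneous evaluations */
        {
            #pragma omp section
            ra = evaluate(a, N);
            #pragma omp section
            rb = evaluate(b, N);
        }
        printf("%f %f\n", ra, rb);
        return 0;
    }

On an 8-core machine the 2 outer sections times 4 inner threads each add up to 8 busy threads, which is the kind of split this answer describes.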
At the outer level use the NUM_THREADS(num_groups) clause to set the number of threads to use. If your outer loop has a count N, and the number of processors or cores is num_cores, use num_groups = min(N, num_cores). At the inner level, you need to set the number of sub-threads for each thread group so that the total number of sub-threads equals the number of cores. So if num_cores = 8 and N = 4, then num_groups = 4, and at the lower level each thread group should use 2 sub-threads (since 2+2+2+2 = 8), so use the NUM_THREADS(2) clause. You can collect the number of sub-threads into an array with one element per outer-region thread (num_groups elements).
This strategy always makes optimal use of your cores. When N < num_cores some nested parallelisation occurs. When N >= num_cores the array of subthread counts contains all 1s and so the inner loop is effectively serial.
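A sketch of that strategy (my own illustration; the function and variable names are hypothetical, and it simplifies the per-group array described above down to a uniform sub-thread count):

    #include <omp.h>
    #include <stddef.h>

    /* ans = matrix * vec with an outer team of num_groups threads and inner teams
       sized so that num_groups * sub roughly equals the number of cores. Assumes N >= 1. */
    void nested_matvec(int N, int M, const double *matrix,
                       const double *vec, double *ans)
    {
        int num_cores  = omp_get_num_procs();
        int num_groups = (N < num_cores) ? N : num_cores;   /* min(N, num_cores)           */
        int sub        = num_cores / num_groups;            /* 1 when N >= num_cores, so
                                                               the inner loop runs serially */
        omp_set_nested(1);

        #pragma omp parallel for num_threads(num_groups)
        for (int i = 0; i < N; ++i) {
            double sum = 0.0;
            #pragma omp parallel for num_threads(sub) reduction(+:sum)
            for (int j = 0; j < M; ++j)
                sum += matrix[(size_t)i * M + j] * vec[j];
            ans[i] = sum;
        }
    }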