OpenMP: What are the benefits of nested parallelization?
From what I understand, #pragma omp parallel and its variations basically execute the following block in a number of concurrent threads, which corresponds to the number of CPUs. When having nested parallelizations - parallel for within parallel for, parallel function within parallel function, etc. - what happens on the inner parallelization?
I'm new to OpenMP, and the case I have in mind is probably rather trivial - multiplying a vector with a matrix. This is done in two nested for loops. Assuming the number of CPUs is smaller than the number of elements in the vector, is there any benefit in trying to run the inner loop in parallel? Will the total number of threads be larger than the number of CPUs, or will the inner loop be executed sequentially?
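For concreteness, here is a minimal serial sketch of the two nested loops the question describes (my own illustration; the name matvec_serial and the row-major layout are assumptions, not code from the question):

    #include <stddef.h>

    /* ans = matrix * vec: matrix is m x n (row-major), vec has n elements, ans has m. */
    void matvec_serial(size_t m, size_t n, const double *matrix,
                       const double *vec, double *ans)
    {
        for (size_t i = 0; i < m; ++i) {      /* outer loop over rows    */
            double sum = 0.0;
            for (size_t j = 0; j < n; ++j)    /* inner loop over columns */
                sum += matrix[i * n + j] * vec[j];
            ans[i] = sum;
        }
    }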
3 Answers
(1) Nested parallelism in OpenMP:
http://docs.oracle.com/cd/E19205-01/819-5270/aewbc/index.html
You need to turn on nested parallelism by setting OMP_NESTED or calling omp_set_nested, because many implementations turn this feature off by default, and some implementations do not support nested parallelism fully. If it is turned on, whenever OpenMP meets a parallel for, it creates the number of threads defined by OMP_NUM_THREADS. So, with two levels of parallelism, the total number of threads would be N^2, where N = OMP_NUM_THREADS.
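As a rough sketch of that thread-count multiplication (my own illustration, not code from the answer; assumes the runtime honours the requested team sizes):

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        omp_set_nested(1);        /* same effect as OMP_NESTED=TRUE in the environment */
        omp_set_num_threads(4);   /* N = 4 at each level */

        #pragma omp parallel      /* outer team: 4 threads */
        {
            #pragma omp parallel  /* one inner team per outer thread -> up to 4*4 = 16 threads */
            {
                #pragma omp single
                printf("outer thread %d has an inner team of %d threads\n",
                       omp_get_ancestor_thread_num(1), omp_get_num_threads());
            }
        }
        return 0;
    }

(Newer OpenMP versions deprecate omp_set_nested in favour of omp_set_max_active_levels, but the calls above still work in common implementations.)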
Such nested parallelism will cause oversubscription (i.e., the number of busy threads is greater than the number of cores), which may degrade the speedup. In an extreme case, where nested parallelism is invoked recursively, the number of threads can balloon (e.g., thousands of threads get created), and the computer just wastes time on context switching. In such a case, you may control the number of threads dynamically via omp_set_dynamic.
(2) An example of matrix-vector multiplication: the code would look like:
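(The answer's original code block is not shown here; the following is a minimal sketch of such a matrix-vector loop with only the outer loop parallelized. The name matvec and the row-major layout are my assumptions.)

    #include <stddef.h>

    /* ans = matrix * vec: matrix is m x n (row-major), vec has n elements, ans has m. */
    void matvec(size_t m, size_t n, const double *matrix,
                const double *vec, double *ans)
    {
        #pragma omp parallel for              /* parallelize the outer loop only */
        for (long i = 0; i < (long)m; ++i) {
            double sum = 0.0;
            for (size_t j = 0; j < n; ++j)    /* inner loop runs sequentially in each thread */
                sum += matrix[(size_t)i * n + j] * vec[j];
            ans[i] = sum;
        }
    }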
In general, parallelizing the inner loop when the outer loop can be parallelized is a bad idea, because of the fork/join overhead of threads. (Although many OpenMP implementations pre-create threads, work still has to be dispatched to the threads, and an implicit barrier is invoked at the end of each parallel for.)
Your concern is the case where N < the number of CPUs. Yes, in that case the speedup would be limited by N, and allowing nested parallelism definitely has benefits.
However, the code would then cause oversubscription if N is sufficiently large. One solution I can think of is using omp_set_dynamic for the nested parallelism. But please make sure you understand how omp_set_dynamic controls the number of threads and the activity of threads; implementations may vary.
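A small sketch of what that warning means in practice (illustrative only; whether and how the runtime shrinks a team when dynamic adjustment is enabled is implementation-defined):

    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        omp_set_nested(1);
        omp_set_dynamic(1);                   /* allow the runtime to shrink team sizes */

        #pragma omp parallel num_threads(64)  /* request far more threads than cores */
        {
            #pragma omp single
            printf("requested 64, got a team of %d threads (dynamic=%d)\n",
                   omp_get_num_threads(), omp_get_dynamic());
        }
        return 0;
    }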
For something like dense linear algebra, where all the potential parallelism is already laid bare in one place in nice wide for loops, you don't need nested parallelism -- if you do want to protect against the case of having (say) really narrow matrices, where the leading dimension might be smaller than the number of cores, then all you need is the collapse directive, which notionally flattens the multiple loops into one.
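A sketch of what that could look like for the matrix-vector case (my own illustration; the atomic update is only there to keep the collapsed version correct and would normally be avoided, e.g. with an array reduction):

    #include <stddef.h>

    /* collapse(2) flattens the i/j loops into one iteration space, so even a "narrow"
       matrix (m smaller than the core count) still exposes m*n parallel iterations. */
    void matvec_collapse(size_t m, size_t n, const double *matrix,
                         const double *vec, double *ans)   /* ans must be zero-initialized */
    {
        #pragma omp parallel for collapse(2)
        for (long i = 0; i < (long)m; ++i)
            for (long j = 0; j < (long)n; ++j) {
                #pragma omp atomic
                ans[i] += matrix[(size_t)i * n + (size_t)j] * vec[j];
            }
    }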
Nested parallelism is for those cases where the parallelism isn't all exposed at once -- say you want to do 2 simultaneous function evaluations, each of which could usefully utilize 4 cores, and you have an 8 core system. You call the function in a parallel section, and within the function definition there is an additional, say, parallel for.
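A sketch of that scenario (the evaluate function, array sizes, and team sizes are hypothetical; assumes nested parallelism is enabled):

    #include <omp.h>
    #include <stdio.h>

    /* Hypothetical expensive function whose own loop is parallelized internally. */
    static double evaluate(const double *data, int n)
    {
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum) num_threads(4)  /* inner team of 4 */
        for (int i = 0; i < n; ++i)
            sum += data[i] * data[i];
        return sum;
    }

    int main(void)
    {
        enum { N = 1000000 };
        static double a[N], b[N];     /* zero-initialized, just for illustration */
        double ra, rb;

        omp_set_nested(1);            /* let the inner parallel for actually fork */

        #pragma omp parallel sections num_threads(2)  /* two simultaneous evaluations */
        {
            #pragma omp section
            ra = evaluate(a, N);
            #pragma omp section
            rb = evaluate(b, N);
        }
        printf("%f %f\n", ra, rb);
        return 0;
    }

On an 8-core machine the 2 outer sections times 4 inner threads each add up to 8 busy threads, which is the kind of split this answer describes.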
At the outer level use the NUM_THREADS(num_groups) clause to set the number of threads to use. If your outer loop has a count N, and the number of processors or cores is num_cores, use num_groups = min(N, num_cores). At the inner level, you need to set the number of sub-threads for each thread group so that the total number of sub-threads equals the number of cores. So if num_cores = 8 and N = 4, then num_groups = 4, and at the lower level each thread group should use 2 sub-threads (since 2+2+2+2 = 8), so use the NUM_THREADS(2) clause. You can collect the number of sub-threads into an array with one element per outer-region thread (num_groups elements).
This strategy always makes optimal use of your cores. When N < num_cores some nested parallelisation occurs. When N >= num_cores the array of subthread counts contains all 1s and so the inner loop is effectively serial.
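A sketch of that strategy (my own illustration; the function and variable names are hypothetical, and it simplifies the per-group array described above down to a uniform sub-thread count):

    #include <omp.h>
    #include <stddef.h>

    /* ans = matrix * vec with an outer team of num_groups threads and inner teams
       sized so that num_groups * sub roughly equals the number of cores. Assumes N >= 1. */
    void nested_matvec(int N, int M, const double *matrix,
                       const double *vec, double *ans)
    {
        int num_cores  = omp_get_num_procs();
        int num_groups = (N < num_cores) ? N : num_cores;   /* min(N, num_cores)           */
        int sub        = num_cores / num_groups;            /* 1 when N >= num_cores, so
                                                               the inner loop runs serially */
        omp_set_nested(1);

        #pragma omp parallel for num_threads(num_groups)
        for (int i = 0; i < N; ++i) {
            double sum = 0.0;
            #pragma omp parallel for num_threads(sub) reduction(+:sum)
            for (int j = 0; j < M; ++j)
                sum += matrix[(size_t)i * M + j] * vec[j];
            ans[i] = sum;
        }
    }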