OpenMP GCC GOMP wasteful barrier
I have the following program. nv is around 100 and each dgemm is roughly 20x100, so there is plenty of work to go around:
#pragma omp parallel for schedule(dynamic,1)
for (int c = 0; c < int(nv); ++c) {
    omp::thread thread;                                  // current thread index
    matrix &t3_c = vv_.at(omp::num_threads()+thread);    // per-thread matrix (upper half of vv_)
    if (terms.first) {
        blas::gemm(1, t2_, vvvo_, 1, t3_c);
        blas::gemm(1, vvvo_, t2_, 1, t3_c);
    }
    matrix &t3_b = vv_[thread];                          // per-thread matrix (lower half of vv_)
    if (terms.second) {
        matrix &t2_ci = vo_[thread];
        blas::gemm(-1, t2_ci, Vjk_, 1, t3_c);
        blas::gemm(-1, t2_ci, Vkj_, 0, t3_b);
    }
}
However, with GCC 4.4 and GOMP v1, gomp_barrier_wait_end accounts for nearly 50% of the runtime. Changing GOMP_SPINCOUNT alleviates the overhead, but then only 60% of the cores are used. The same goes for OMP_WAIT_POLICY=passive. The system is Linux with 8 cores.
How can I get full utilization without the spinning/waiting overhead?
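For reference, both settings are environment variables that libgomp reads when the process starts; a tiny stand-alone sanity check (not part of the original question) to confirm they actually reach the program might look like this:

#include <cstdio>
#include <cstdlib>
#include <omp.h>

int main()
{
    // Print the wait-policy settings as the process sees them.
    const char *spin = std::getenv("GOMP_SPINCOUNT");
    const char *wait = std::getenv("OMP_WAIT_POLICY");
    std::printf("GOMP_SPINCOUNT  = %s\n", spin ? spin : "(unset)");
    std::printf("OMP_WAIT_POLICY = %s\n", wait ? wait : "(unset)");
    std::printf("omp_get_max_threads() = %d\n", omp_get_max_threads());
    return 0;
}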
2 Answers
The barrier is a symptom, not the problem. The reason there's lots of waiting at the end of the loop is that some of the threads finish well before the others, and they all wait at the end of the for loop for quite a while until everyone is done.
This is a classic load imbalance problem, which is weird here, since it's just a bunch of matrix multiplies. Are they of varying sizes? How are they laid out in memory, in terms of NUMA - are they all currently sitting in one core's cache, or are there other sharing issues? Or, more simply: are there only 9 matrices, so that the remaining 8 are doomed to be stuck waiting for whoever got the last one?
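One way to check is to time each iteration per thread. A minimal diagnostic sketch (stand-alone, using the standard OpenMP API and a placeholder for the real gemm work, so the names here are not from the question):

#include <cstdio>
#include <vector>
#include <omp.h>

// Placeholder for one iteration's real work (the gemm calls in the question).
static void do_iteration(int)
{
    volatile double x = 0;
    for (int i = 0; i < 1000000; ++i) x += i * 1e-9;
}

int main()
{
    const int nv = 100;
    std::vector<double> busy(omp_get_max_threads(), 0.0);

    #pragma omp parallel for schedule(dynamic,1)
    for (int c = 0; c < nv; ++c) {
        double t0 = omp_get_wtime();
        do_iteration(c);
        busy[omp_get_thread_num()] += omp_get_wtime() - t0;  // each thread updates only its own slot
    }

    // Very uneven totals mean the time "in the barrier" is really fast
    // threads waiting for slow ones, i.e. load imbalance.
    for (int i = 0; i < (int)busy.size(); ++i)
        std::printf("thread %d busy %.3f s\n", i, busy[i]);
    return 0;
}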
When this sort of thing happens in a larger parallel block of code, it's sometimes OK to proceed to the next block of code while some of the loop iterations aren't done yet; there you can add the nowait clause to the for, which overrides the default behaviour and gets rid of the implied barrier. Here, though, since the parallel block is exactly the size of the for loop, that can't really help.
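As an aside on the nowait clause mentioned in the first answer, here is a minimal illustrative sketch (placeholder code, not from the question) of how it is used when the parallel region is larger than the worksharing loop:

#include <cstdio>
#include <omp.h>

int main()
{
    #pragma omp parallel
    {
        // No barrier at the end of this loop because of nowait, so threads
        // that finish their iterations early move straight on.
        #pragma omp for schedule(dynamic,1) nowait
        for (int c = 0; c < 100; ++c) {
            // ... per-iteration work (e.g. the gemm calls) ...
        }

        // Work that does not depend on other threads' iterations can start
        // immediately; the only remaining implicit barrier is at the end of
        // the parallel region itself.
        std::printf("thread %d is past the loop\n", omp_get_thread_num());
    }
    return 0;
}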
Could it be that your BLAS implementation also calls OpenMP inside? Unless you only see one call to gomp_barrier_wait_end.
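If that turns out to be the case, one thing worth trying (a sketch, assuming the BLAS uses the same OpenMP runtime and honours these settings) is to disable nested parallelism, so the gemm calls inside the parallel loop run serially on their own thread instead of oversubscribing the 8 cores:

#include <cstdio>
#include <omp.h>

int main()
{
    // Allow only one level of active parallelism: the outer loop's team.
    // An OpenMP-based BLAS called from inside that team then runs serially
    // per thread rather than spawning its own nested teams.
    omp_set_nested(0);              // deprecated in OpenMP 5.0, fine on older runtimes
    omp_set_max_active_levels(1);   // OpenMP 3.0 and later

    #pragma omp parallel
    {
        #pragma omp single
        std::printf("nested: %d, max active levels: %d\n",
                    omp_get_nested(), omp_get_max_active_levels());
    }
    return 0;
}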