OpenMP GCC GOMP wasteful barrier

Posted on 2024-11-02 12:55:22


I have the following program.
nv is around 100, dgemm is 20x100 or so, so there is plenty of work to go around:

#pragma omp parallel for schedule(dynamic,1)
        for (int c = 0; c < int(nv); ++c) {
            omp::thread thread;                                               
            matrix &t3_c = vv_.at(omp::num_threads()+thread);
            if (terms.first) {
                blas::gemm(1, t2_, vvvo_, 1, t3_c);
                blas::gemm(1, vvvo_, t2_, 1, t3_c);
            }

            matrix &t3_b = vv_[thread];
            if (terms.second) {
                matrix &t2_ci = vo_[thread];
                blas::gemm(-1, t2_ci, Vjk_, 1, t3_c);
                blas::gemm(-1, t2_ci, Vkj_, 0, t3_b);
            }
        }

However, with GCC 4.4, GOMP v1, gomp_barrier_wait_end accounts for nearly 50% of the runtime. Changing GOMP_SPINCOUNT alleviates the overhead, but then only 60% of the cores are used. Same for OMP_WAIT_POLICY=passive. The system is Linux, 8 cores.

How can I get full utilization without the spinning/waiting overhead?


Comments (2)

萌逼全场 2024-11-09 12:55:22


The barrier is a symptom, not the problem. The reason that there's lots of waiting at the end of the loop is that some of the threads are done well before the others, and they all wait at the end of the for loop for quite a while until everyone's done.

This is a classic load imbalance problem, which is weird here, since it's just a bunch of matrix multiplies. Are they of varying sizes? How are they laid out in memory, in terms of NUMA stuff - are they all currently sitting in one core's cache, or are there other sharing issues? Or, more simply -- are there only 9 matrices, so that the remaining 8 are doomed to be stuck waiting for whoever got the last one?

When this sort of thing happens in a larger parallel block of code, sometimes it's OK to proceed to the next block of code while some of the loop iterations aren't done yet; there you can add the nowait clause to the for, which overrides the default behaviour and gets rid of the implied barrier. Here, though, since the parallel block is exactly the size of the for loop, that can't really help.
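For reference, here is a minimal sketch (not the asker's code; the function name and loop bodies are placeholders) of how nowait would be used when a parallel region does contain more work after the loop:

#include <omp.h>

void two_phase(int nv) {
    #pragma omp parallel
    {
        // Threads that finish their share of this loop early do not block
        // at its end, because of the nowait clause.
        #pragma omp for schedule(dynamic,1) nowait
        for (int c = 0; c < nv; ++c) {
            // ... per-iteration matrix work ...
        }

        // Early finishers start picking up iterations here right away; the
        // only barrier left is the implicit one closing the parallel region.
        #pragma omp for schedule(dynamic,1)
        for (int c = 0; c < nv; ++c) {
            // ... follow-up work ...
        }
    }
}

Note this is only safe when the second loop does not depend on the first loop having finished completely.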

若水般的淡然安静女子 2024-11-09 12:55:22


Could it be that your BLAS implementation also calls OpenMP internally? That would be ruled out if you only see a single call site for gomp_barrier_wait_end in the profile.
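If that hypothesis fits, one way to test it is to force all parallelism onto the outer loop and see whether the barrier time drops out of the profile. A sketch under assumptions: the driver function run_outer is hypothetical, the loop body stands in for the asker's gemm calls, and the right knob ultimately depends on which BLAS is actually linked.

#include <omp.h>

// Hypothetical driver: pin all OpenMP parallelism to the outer loop so that
// any OpenMP regions inside the BLAS gemm run on a single thread each.
void run_outer(int nv) {
    omp_set_nested(0);    // no nested teams (part of OpenMP 2.5, so GCC 4.4's libgomp has it)
    omp_set_dynamic(0);   // keep the outer team at its full, fixed size

    // Depending on which BLAS is linked, its internal thread count can usually
    // also be capped from the environment before the program starts, e.g.
    // OPENBLAS_NUM_THREADS=1 or MKL_NUM_THREADS=1.

    #pragma omp parallel for schedule(dynamic,1)
    for (int c = 0; c < nv; ++c) {
        // blas::gemm(...) calls exactly as in the question
    }
}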
