OpenMP GCC GOMP wasteful barrier
I have the following program. nv is around 100 and each dgemm is roughly 20x100, so there is plenty of work to go around:
#pragma omp parallel for schedule(dynamic,1)
for (int c = 0; c < int(nv); ++c) {
    omp::thread thread;                                  // current thread index
    matrix &t3_c = vv_.at(omp::num_threads()+thread);    // per-thread matrix (upper half of vv_)
    if (terms.first) {
        blas::gemm(1, t2_, vvvo_, 1, t3_c);
        blas::gemm(1, vvvo_, t2_, 1, t3_c);
    }
    matrix &t3_b = vv_[thread];                          // per-thread matrix (lower half of vv_)
    if (terms.second) {
        matrix &t2_ci = vo_[thread];
        blas::gemm(-1, t2_ci, Vjk_, 1, t3_c);
        blas::gemm(-1, t2_ci, Vkj_, 0, t3_b);
    }
}
However, with GCC 4.4 and GOMP v1, gomp_barrier_wait_end accounts for nearly 50% of the runtime. Changing GOMP_SPINCOUNT alleviates the overhead, but then only 60% of the cores are used. The same goes for OMP_WAIT_POLICY=passive. The system is Linux with 8 cores.
How can I get full utilization without the spinning/waiting overhead?
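For reference, both settings are environment variables that libgomp reads when the process starts; a tiny stand-alone sanity check (not part of the original question) to confirm they actually reach the program might look like this:

#include <cstdio>
#include <cstdlib>
#include <omp.h>

int main()
{
    // Print the wait-policy settings as the process sees them.
    const char *spin = std::getenv("GOMP_SPINCOUNT");
    const char *wait = std::getenv("OMP_WAIT_POLICY");
    std::printf("GOMP_SPINCOUNT  = %s\n", spin ? spin : "(unset)");
    std::printf("OMP_WAIT_POLICY = %s\n", wait ? wait : "(unset)");
    std::printf("omp_get_max_threads() = %d\n", omp_get_max_threads());
    return 0;
}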
2 Answers
The barrier is a symptom, not the problem. The reason there's lots of waiting at the end of the loop is that some of the threads finish well before the others, and they all wait at the end of the for loop for quite a while until everyone is done.
This is a classic load imbalance problem, which is weird here, since it's just a bunch of matrix multiplies. Are they of varying sizes? How are they laid out in memory, in terms of NUMA - are they all currently sitting in one core's cache, or are there other sharing issues? Or, more simply: are there only 9 matrices, so that the remaining 8 are doomed to be stuck waiting for whoever got the last one?
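One way to check is to time each iteration per thread. A minimal diagnostic sketch (stand-alone, using the standard OpenMP API and a placeholder for the real gemm work, so the names here are not from the question):

#include <cstdio>
#include <vector>
#include <omp.h>

// Placeholder for one iteration's real work (the gemm calls in the question).
static void do_iteration(int)
{
    volatile double x = 0;
    for (int i = 0; i < 1000000; ++i) x += i * 1e-9;
}

int main()
{
    const int nv = 100;
    std::vector<double> busy(omp_get_max_threads(), 0.0);

    #pragma omp parallel for schedule(dynamic,1)
    for (int c = 0; c < nv; ++c) {
        double t0 = omp_get_wtime();
        do_iteration(c);
        busy[omp_get_thread_num()] += omp_get_wtime() - t0;  // each thread updates only its own slot
    }

    // Very uneven totals mean the time "in the barrier" is really fast
    // threads waiting for slow ones, i.e. load imbalance.
    for (int i = 0; i < (int)busy.size(); ++i)
        std::printf("thread %d busy %.3f s\n", i, busy[i]);
    return 0;
}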
When this sort of thing happens in a larger parallel block of code, it's sometimes OK to proceed to the next block of code while some of the loop iterations aren't done yet; there you can add the nowait clause to the for, which overrides the default behaviour and gets rid of the implied barrier. Here, though, since the parallel block is exactly the size of the for loop, that can't really help.
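As an aside on the nowait clause mentioned in the first answer, here is a minimal illustrative sketch (placeholder code, not from the question) of how it is used when the parallel region is larger than the worksharing loop:

#include <cstdio>
#include <omp.h>

int main()
{
    #pragma omp parallel
    {
        // No barrier at the end of this loop because of nowait, so threads
        // that finish their iterations early move straight on.
        #pragma omp for schedule(dynamic,1) nowait
        for (int c = 0; c < 100; ++c) {
            // ... per-iteration work (e.g. the gemm calls) ...
        }

        // Work that does not depend on other threads' iterations can start
        // immediately; the only remaining implicit barrier is at the end of
        // the parallel region itself.
        std::printf("thread %d is past the loop\n", omp_get_thread_num());
    }
    return 0;
}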
Could it be that your BLAS implementation also calls OpenMP inside? Unless you only see one call to gomp_barrier_wait_end.
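If that turns out to be the case, one thing worth trying (a sketch, assuming the BLAS uses the same OpenMP runtime and honours these settings) is to disable nested parallelism, so the gemm calls inside the parallel loop run serially on their own thread instead of oversubscribing the 8 cores:

#include <cstdio>
#include <omp.h>

int main()
{
    // Allow only one level of active parallelism: the outer loop's team.
    // An OpenMP-based BLAS called from inside that team then runs serially
    // per thread rather than spawning its own nested teams.
    omp_set_nested(0);              // deprecated in OpenMP 5.0, fine on older runtimes
    omp_set_max_active_levels(1);   // OpenMP 3.0 and later

    #pragma omp parallel
    {
        #pragma omp single
        std::printf("nested: %d, max active levels: %d\n",
                    omp_get_nested(), omp_get_max_active_levels());
    }
    return 0;
}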