Reshape and matmul inside OpenMP loops

Posted 2025-01-09 06:09:31


I was debugging some parallel code and found that a reshape operation was interfering with OpenMP. Below is a demo that reproduces the issue. I am not very familiar with OpenMP yet, so I'd like to know what I am doing wrong here, and whether there is a better way to do this (i.e. how best to nest reshape and matmul inside do loops). I have looked at OpenBLAS as a potential solution, but would first like to understand the cause. Thanks in advance.

program unittest

    use omp_lib
    implicit none
    complex*16, save, dimension(10,10) :: testmat
    integer :: i
    real :: t0, t1, t2

    !$ call omp_set_num_threads(12)
    !$ call omp_set_dynamic(.false.)
    testmat = 0.d0

    call cpu_time(t0)
    !$OMP parallel
    !$OMP DO
    do i = 1, 1000000
        testmat = reshape(reshape(testmat, (/100,1/)), (/10,10/))
    end do
    !$OMP END DO
    !$OMP end parallel
    call cpu_time(t1)
    do i = 1, 1000000
        testmat = reshape(reshape(testmat, (/100,1/)), (/10,10/))
    end do
    call cpu_time(t2)
    print *, 'parallel time, ', t1-t0, ' s, single thread time, ', t2-t1, ' s'

end program unittest

Compiled with gfortran on MinGW. Output on my machine is

(with parallel) 10.01 s
(single thread) 0.328 s

CPU usage stays below 20% overall in the parallel case, which probably means something is holding up OpenMP?
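One thing to check when reading these timings: `cpu_time` returns CPU time, and inside a parallel region that is (roughly) summed over all threads, so it can inflate the parallel figure. `omp_get_wtime` measures wall-clock time instead. A small sketch of the measurement pattern (the summation loop is just illustrative work, not from the original program):

```fortran
program walltime
    use omp_lib
    implicit none
    real(8) :: t0, t1, s
    integer :: i

    s = 0.d0
    t0 = omp_get_wtime()          ! wall-clock seconds, not summed CPU time
    !$OMP parallel do reduction(+:s)
    do i = 1, 1000000
        s = s + real(i, 8)        ! each thread accumulates a private partial sum
    end do
    !$OMP end parallel do
    t1 = omp_get_wtime()

    print *, 'elapsed: ', t1 - t0, ' s, sum = ', s
end program walltime
```

With this kind of timing, the parallel and serial sections of the demo can be compared on equal footing.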

====================

Edit:

Thank you. Some clarification: the following is okay-ish, in that the parallel version does not run slower (both complete in about the same amount of time),

    !$OMP parallel private(testmat2)
    !$OMP DO
    do i=1,1000000
        testmat2 = testmat * 10.d0;
    end do
    !$OMP END DO
    !$OMP end parallel

but this runs much slower in parallel than on a single thread (it takes about 50x longer in parallel than single-threaded):

    !$OMP parallel private(testmat2)
    !$OMP DO
    do i=1,1000000
        testmat2 = reshape(reshape(testmat,(/100,1/)),(/10,10/));
    end do
    !$OMP END DO
    !$OMP end parallel

So... what is special about reshape that causes this?

Answer by 鸩远一方, 2025-01-16 06:09:31:


First of all, a reshape is not really an operation. It's a matter of internal bookkeeping: you're telling Fortran that a 100x1 array is now to be interpreted as 10x10, or so. This basically takes zero time.

Next, you have a parallel do, which means the loop iterations get divided over the available threads. Meaning: the same assignment to testmat gets done many times, from many threads at once. So now you have something that is just internal bookkeeping, and you do it redundantly and concurrently. I'm only a little surprised that that gives strange results.

If you want to use a parallel do, you need a loop where each iteration does something involving the loop index. For instance v(i) = .....
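A minimal sketch of that pattern (the array name `v` and the per-element work are illustrative, not from the original post): each iteration writes only its own element, so threads never touch the same storage and no race arises:

```fortran
program raceless
    use omp_lib
    implicit none
    integer, parameter :: n = 1000000
    real(8) :: v(n)
    integer :: i

    !$OMP parallel do
    do i = 1, n
        v(i) = sqrt(real(i, 8))   ! iteration i writes element i and nothing else
    end do
    !$OMP end parallel do

    print *, v(4)                 ! v(4) is sqrt(4.0) = 2.0
end program raceless
```

This is the shape of loop that `parallel do` is designed for: independent iterations, each identified by the loop index.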
