OpenMP, gfortran and the curse of the non-speedup

Posted on 2025-02-10 19:03:23

I have a ridiculously parallelizable (in my mind) piece of Fortran code, so I thought I could give OpenMP a try and see if any speedup could be achieved. This is on Windows 10, MinGW64 with gcc 11.2. I have 16 cores and 256 GB of RAM on an Intel Xeon Gold 5122 @ 3.6 GHz.

The code is definitely not a big deal:

  start_time = omp_get_wtime() 
  !$omp parallel do default(shared) private(i)
  do i = 1, nF   
    call something(input1, input2(i), input3, input4, output1(i, :), output2(i, :), output3(i, :))   
  end do 
  !$omp end parallel do
  end_time = omp_get_wtime()       
  
  write(*, *) 'Number of threads:', omp_get_max_threads(), ', Elapsed time:', end_time-start_time

nF is on the order of 8,600. The calculations are completely independent of the index i I am looping over, so it doesn't matter in which order they are executed. There is no interaction between the inputs or outputs at iteration i-1 and their values at iteration i. Whatever goes on inside the subroutine something is pure calculation, and not that tough either.
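One way to confirm the independence claim is to run the same loop serially and in parallel and compare the outputs. The sketch below is not the original code: the subroutine something here is a hypothetical two-argument stand-in, and the column length m = 64 is an assumed value, since the post shows neither the real subroutine nor the array sizes.

```fortran
program independence_check
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nF = 8600   ! loop count quoted in the post
  integer, parameter :: m = 64      ! assumed output length per iteration
  real(dp), allocatable :: in2(:), ser(:, :), par(:, :)
  integer :: i

  allocate(in2(nF), ser(m, nF), par(m, nF))
  do i = 1, nF
    in2(i) = real(i, dp) / real(nF, dp)
  end do

  ! Serial reference run.
  do i = 1, nF
    call something(in2(i), ser(:, i))
  end do

  ! Same loop with iterations distributed over the thread team.
  !$omp parallel do default(shared) private(i)
  do i = 1, nF
    call something(in2(i), par(:, i))
  end do
  !$omp end parallel do

  ! A small tolerance allows for differences in generated code.
  if (maxval(abs(ser - par)) <= 1.0e-12_dp) then
    write(*, *) 'Serial and parallel results agree'
  else
    write(*, *) 'Results differ: iterations may not be independent'
    error stop 1
  end if

contains

  ! Hypothetical stand-in for the real subroutine: it reads and writes
  ! only its own arguments and touches no shared state.
  pure subroutine something(x, out)
    real(dp), intent(in)  :: x
    real(dp), intent(out) :: out(:)
    integer :: k
    do k = 1, size(out)
      out(k) = sin(x * real(k, dp)) + cos(x) ** 2
    end do
  end subroutine something

end program independence_check
```

Built with gfortran -fopenmp, a disagreement here would point to hidden shared state (for example SAVE variables or module data) inside the real subroutine.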

Now, I am not expecting magical linear speed improvements, but this is what I get:

C:\Users\User\MyProjects\Test>set OMP_NUM_THREADS=1 & test.exe
 Number of threads:           1 , Elapsed time:  0.43099999427795410

C:\Users\User\MyProjects\Test>set OMP_NUM_THREADS=2 & test.exe
 Number of threads:           2 , Elapsed time:  0.75899982452392578

C:\Users\User\MyProjects\Test>set OMP_NUM_THREADS=4 & test.exe
 Number of threads:           4 , Elapsed time:  0.69499993324279785

C:\Users\User\MyProjects\Test>set OMP_NUM_THREADS=8 & test.exe
 Number of threads:           8 , Elapsed time:  0.55299997329711914

C:\Users\User\MyProjects\Test>set OMP_NUM_THREADS=16 & test.exe
 Number of threads:          16 , Elapsed time:  0.57800006866455078

Happily, no speedup. In fact there is a slowdown, no matter how many threads I tell OpenMP to use. It may be that for code that takes 0.43 seconds to run it's pointless to attempt to use OpenMP, as the thread creation time will always swamp whatever improvement the threads may bring. It may also be that I am misunderstanding how all of this works.
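The thread-creation hypothesis can be tested directly by timing parallel regions that do almost no work. The following is a minimal sketch under stated assumptions: the program name, the repetition count nrep = 1000, and the counter variable are illustrative choices, not part of the original code.

```fortran
program region_overhead
  use omp_lib
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nrep = 1000   ! assumed repetition count
  real(dp) :: t0, t1, first
  integer :: r, team_size, counter

  team_size = 0
  counter = 0

  ! First region: its cost includes creating the thread team.
  t0 = omp_get_wtime()
  !$omp parallel default(shared)
  !$omp single
  team_size = omp_get_num_threads()
  !$omp end single
  !$omp end parallel
  first = omp_get_wtime() - t0

  ! Later regions: the runtime typically reuses the existing threads,
  ! so this mostly measures fork/join and synchronization cost.
  t0 = omp_get_wtime()
  do r = 1, nrep
    !$omp parallel default(shared)
    !$omp atomic
    counter = counter + 1
    !$omp end parallel
  end do
  t1 = omp_get_wtime()

  write(*, *) 'Threads in team:          ', team_size
  write(*, *) 'First region (s):         ', first
  write(*, *) 'Later regions, average (s):', (t1 - t0) / real(nrep, dp)
  write(*, *) 'Atomic increments done:   ', counter
end program region_overhead
```

If the first-region time comes out far below the 0.1 to 0.5 s of extra time seen in the runs above, thread creation alone is unlikely to explain the slowdown, and the cause would have to be sought elsewhere, for instance inside the subroutine or in how its arguments are passed.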

Optimization switches I use to compile the code:

-O3 -funroll-loops -march=native -fno-asynchronous-unwind-tables -fopenmp

Any explanation is most welcome, and suggestions on modifying my !$omp directives are also welcome.

EDIT

I have edited the code by declaring the matrices output1, output2, etc. as "transposed", so it looks like this:

  start_time = omp_get_wtime() 
  !$omp parallel do default(shared) private(i)
  do i = 1, nF   
    call something(input1, input2(i), input3, input4, output1(:, i), output2(:, i), output3(:, i))   
  end do 
  !$omp end parallel do
  end_time = omp_get_wtime()       
  
  write(*, *) 'Number of threads:', omp_get_max_threads(), ', Elapsed time:', end_time-start_time

No improvement in the timings (if anything, they are worse):

C:\Users\User\MyProjects\Test>set OMP_NUM_THREADS=1 & test.exe
 Number of threads:           1 , Elapsed time:  0.43000006675720215

C:\Users\User\MyProjects\Test>set OMP_NUM_THREADS=2 & test.exe
 Number of threads:           2 , Elapsed time:  0.63800001144409180

C:\Users\User\MyProjects\Test>set OMP_NUM_THREADS=4 & test.exe
 Number of threads:           4 , Elapsed time:  0.96600008010864258

C:\Users\User\MyProjects\Test>set OMP_NUM_THREADS=8 & test.exe
 Number of threads:           8 , Elapsed time:  0.80500006675720215

C:\Users\User\MyProjects\Test>set OMP_NUM_THREADS=16 & test.exe
 Number of threads:          16 , Elapsed time:  0.75200009346008301
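As background for the transposition in the EDIT: Fortran stores arrays in column-major order, so a section like output1(:, i) is contiguous in memory while output1(i, :) is strided. The sketch below illustrates this with the standard is_contiguous intrinsic (Fortran 2008). The array names by_row and by_col and the length m = 64 are assumptions made for illustration only.

```fortran
program layout_check
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: nF = 8600   ! loop count quoted in the post
  integer, parameter :: m = 64      ! assumed output length per iteration
  real(dp), allocatable :: by_row(:, :), by_col(:, :)
  integer :: i

  allocate(by_row(nF, m), by_col(m, nF))
  by_row = 0.0_dp
  by_col = 0.0_dp
  i = 1

  ! Elements along the first index are adjacent in memory, so only the
  ! (:, i) section is one unbroken run of storage.
  write(*, *) 'by_row(i, :) contiguous? ', is_contiguous(by_row(i, :))
  write(*, *) 'by_col(:, i) contiguous? ', is_contiguous(by_col(:, i))

  ! Distance in bytes between consecutive elements of each section.
  write(*, *) 'Stride of by_row(i, :), bytes:', nF * storage_size(by_row) / 8
  write(*, *) 'Stride of by_col(:, i), bytes:', storage_size(by_col) / 8
end program layout_check
```

The first query reports false and the second true. Whether the strided form costs anything in the original loop depends on how something declares its dummy arguments, which the post does not show: with explicit-shape or assumed-size dummies the compiler may copy a strided section into a temporary and back, while assumed-shape dummies can usually take it directly.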
