Fortran with OpenMP on Apple M1 is slower than Fortran with MPI

Posted on 2025-01-09 05:19:34

I have a new MacBook Pro with the Apple M1 Max processor (10 cores total), running macOS 12.2.1. I used Homebrew to install gcc:

~/homebrew/bin/gcc-11 --version
gcc-11 (Homebrew GCC 11.2.0_3) 11.2.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

This package came with gfortran:

gfortran --version
GNU Fortran (Homebrew GCC 11.2.0_3) 11.2.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

It also came with mpifort:

mpifort --version
GNU Fortran (Homebrew GCC 11.2.0_3) 11.2.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I have a Fortran code that uses MPI along with OpenMP. It works well and has been used on various Linux boxes and on a supercomputer. While benchmarking the new laptop, I noticed that the overall speed of my code depends on the combination of the number of MPI tasks (np) and the number of OpenMP threads (OMP_NUM_THREADS):

np     OMP_NUM_THREADS     wall time    loop time
                           (sec)        (sec)
--------------------------------------------------
1      8                   2731         299.906
2      4                   1816         194.753
4      2                   1424         156.876   
8      1                   1415         156.372

In all cases, a total of 8 cores were used. This particular test had a large loop, executed 9 times. The code using pure OpenMP (np=1, 8 threads) is almost a factor of 2 slower than the code using pure MPI (np=8, 1 thread). I have done the same test on a Linux box (AMD Ryzen Threadripper), and there the execution time was essentially unchanged across combinations of np and OMP_NUM_THREADS for which the product np*OMP_NUM_THREADS is constant.
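
For reference, the four runs in the table can be enumerated with a small script. This is only a sketch: `./mycode` is a hypothetical executable name (the binary is not named above), `mpirun` is assumed to be the MPI launcher on this machine, and the script only prints the commands rather than running them.

```shell
#!/bin/sh
# Sketch: enumerate (np, OMP_NUM_THREADS) pairs whose product is a fixed core count.
# ASSUMPTIONS: ./mycode is a hypothetical executable name, and mpirun is assumed
# to be the MPI launcher; neither is stated in the post.

# Print every "np threads" pair with np * threads == total, doubling np from 1.
list_combos() {
    total=$1
    np=1
    while [ "$np" -le "$total" ]; do
        if [ $((total % np)) -eq 0 ]; then
            echo "$np $((total / np))"
        fi
        np=$((np * 2))
    done
}

# Dry run: print the launch command for each combination instead of executing it.
print_commands() {
    list_combos "$1" | while read -r np threads; do
        echo "OMP_NUM_THREADS=$threads mpirun -np $np ./mycode"
    done
}

print_commands 8
```

From the table, the ratio between the pure OpenMP run and the pure MPI run is 2731 / 1415 ≈ 1.93 for wall time and 299.906 / 156.372 ≈ 1.92 for loop time.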

My compile command is

gfortran -Ofast -fopenmp -march=native -mtune=native -fmax-stack-var-size=16384

for OpenMP only, and

mpifort -Ofast -fopenmp -march=native -mtune=native -fmax-stack-var-size=16384

for the MPI hybrid code. Are there compiler flags for the OpenMP version I could use to speed things up? I have a lot of related OpenMP codes that have not yet been modified to work with MPI, so it would be nice if some compiler tweaks could help.

On the other hand, is this a case of gfortran+OpenMP for Apple M1 needing more work at a deeper level than what I can do?
