OpenMP + Fortran on Apple M1 is slower than MPI + Fortran
I have a new MacBook Pro with the Apple M1 Max processor (10 cores total), running macOS 12.2.1. I used Homebrew to install gcc:
~/homebrew/bin/gcc-11 --version
gcc-11 (Homebrew GCC 11.2.0_3) 11.2.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
This package came with gfortran:
gfortran --version
GNU Fortran (Homebrew GCC 11.2.0_3) 11.2.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
It also came with mpifort:
mpifort --version
GNU Fortran (Homebrew GCC 11.2.0_3) 11.2.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
I have a Fortran code that uses MPI along with OpenMP. It works well and has been used on various Linux boxes and on a supercomputer. While benchmarking the new laptop, I noticed that the overall speed of my code depends on the combination of the number of MPI tasks (np) and the number of OpenMP threads (OMP_NUM_THREADS):
np   OMP_NUM_THREADS   wall time (sec)   loop time (sec)
--------------------------------------------------------
 1          8               2731             299.906
 2          4               1816             194.753
 4          2               1424             156.876
 8          1               1415             156.372
In all cases, a total of 8 cores were used. This particular test has one large loop, executed 9 times. The pure OpenMP run (np=1) is almost a factor of 2 slower than the pure MPI run (np=8). I ran the same test on a Linux box (AMD Ryzen Threadripper), and there the execution time was essentially unchanged across combinations of np and OMP_NUM_THREADS with the product np*OMP_NUM_THREADS held constant.
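To put numbers on the "almost a factor of 2" statement, here is a minimal Python sketch that divides each timing in the table by the corresponding pure-MPI timing. Using the np=8 run as the baseline is my own choice for illustration, as are the variable names; the timings themselves are copied straight from the table.

```python
# Timings copied from the table: (np, OMP_NUM_THREADS, wall time s, loop time s)
runs = [
    (1, 8, 2731, 299.906),
    (2, 4, 1816, 194.753),
    (4, 2, 1424, 156.876),
    (8, 1, 1415, 156.372),
]

# Assumption for illustration: the pure-MPI run (np=8, 1 thread) is the baseline.
base_wall, base_loop = runs[-1][2], runs[-1][3]

slowdowns = []
for np_tasks, threads, wall, loop in runs:
    wall_ratio = wall / base_wall   # wall time relative to pure MPI
    loop_ratio = loop / base_loop   # loop time relative to pure MPI
    slowdowns.append((np_tasks, threads, wall_ratio, loop_ratio))
    print(f"np={np_tasks} threads={threads} "
          f"wall x{wall_ratio:.2f} loop x{loop_ratio:.2f}")
```

With these numbers the pure OpenMP run takes roughly 1.93 times the pure MPI wall time and about 1.92 times its loop time, while np=4 with 2 threads is within 1% of pure MPI.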
My compile command is
gfortran -Ofast -fopenmp -march=native -mtune=native -fmax-stack-var-size=16384
for OpenMP only, and
mpifort -Ofast -fopenmp -march=native -mtune=native -fmax-stack-var-size=16384
for the MPI hybrid code. Are there compiler flags for the OpenMP version I could use to speed things up? I have a lot of related OpenMP codes that have not yet been modified to work with MPI, so it would be nice if some compiler tweaks could help.
On the other hand, is this a case of gfortran+OpenMP for Apple M1 needing more work at a deeper level than what I can do?