Tips and tricks for improving Fortran code performance

Posted on 2024-12-10 15:42:07

As part of my Ph.D. research, I am working on the development of numerical models of atmosphere and ocean circulation. These involve numerically solving systems of PDEs on the order of ~10^6 grid points, over ~10^4 time steps. Thus, a typical model simulation takes hours to a few days to complete when run with MPI on dozens of CPUs. Naturally, improving model efficiency as much as possible is important, while making sure the results are byte-for-byte identical.

While I feel quite comfortable with my Fortran programming, and am aware of quite a few tricks to make code more efficient, I feel like there is still room to improve, and there are tricks that I am not aware of.

Currently, I make sure I use as few divisions as possible, try not to use literal constants (I was taught to do this from very early on, e.g. use half=0.5 instead of 0.5 in actual computations), use as few transcendental functions as possible, etc.

What other performance-sensitive factors are there? At the moment, I am wondering about a few:

1) Does the order of mathematical operations matter? For example if I have:

a=1E-7 ; b=2E4 ; c=3E13
d=a*b*c

Would d evaluate with different efficiency depending on the order of multiplication? Nowadays, this must be compiler specific, but is there a straight answer? I notice d getting a (slightly) different value depending on the order (precision limit), but will this impact the efficiency or not?

2) Passing lots (e.g. dozens) of arrays as arguments to a subroutine versus accessing these arrays from a module within the subroutine?

3) Fortran 95 constructs (FORALL and WHERE) versus DO and IF? I know that these mattered back in the 90s when code vectorization was a big thing, but is there any difference now that modern compilers are able to vectorize explicit DO loops? (I am using PGI, Intel, and IBM compilers in my work.)

4) Raising a number to an integer power versus multiplication? E.g.:

b=a**4

or

b=a*a*a*a

I have been taught to always use the latter where possible. Does this affect efficiency and/or precision? (probably compiler dependent as well)

Please discuss and/or add any tips and tricks that you know about improving Fortran code efficiency. What else is out there? If you know anything specific about what each of the compilers above does in relation to this question, please include that as well.

Added: Note that I do not have any bottlenecks or performance issues per se. I am asking whether there are any general rules for optimizing code at the level of individual operations.

Thanks!

盗琴音 2024-12-17 15:42:07

Sorry, but all the tricks you mentioned are simply... ridiculous. More precisely, they have no meaning in practice. For instance:

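What would be the advantage of using half (=0.5) instead of 0.5?
The same goes for computing a**4 versus a*a*a*a. (a*a)**2 is yet another possibility. My personal preference is a**4, because a good compiler will automatically choose the best way.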

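For **, the only point that could matter is the difference between a**4 and a**4. (a real exponent), the latter consuming much more CPU time. But even this point means nothing without a measurement in an actual simulation.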
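The four ways of writing the fourth power mentioned above can be put side by side. This is only a minimal sketch: the program name, variable names, and the test value 1.7 are made up for illustration, and what machine code each form turns into depends on the compiler and its flags.

```fortran
program power_forms
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  real(dp) :: a, p_int, p_real, p_mul, p_sq, tol

  a = 1.7_dp                ! arbitrary test value (assumption)

  p_int  = a**4             ! integer exponent: typically expanded into multiplications
  p_real = a**4.0_dp        ! real exponent: generally goes through a general power routine
  p_mul  = a*a*a*a          ! explicit multiplications
  p_sq   = (a*a)**2         ! square of a square

  print *, 'a**4      =', p_int
  print *, 'a**4.     =', p_real
  print *, 'a*a*a*a   =', p_mul
  print *, '(a*a)**2  =', p_sq

  ! The forms may differ in the last bits, but only at rounding level.
  tol = 16.0_dp * epsilon(1.0_dp) * abs(p_mul)
  if (abs(p_int  - p_mul) > tol) error stop 'a**4 disagrees'
  if (abs(p_real - p_mul) > tol) error stop 'a**4. disagrees'
  if (abs(p_sq   - p_mul) > tol) error stop '(a*a)**2 disagrees'
end program power_forms
```

The sketch only checks that the results agree to rounding error. Whether any form is measurably slower in a real model still has to be timed there.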

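In fact, your approach is wrong. Develop your code as well as you can. After that, measure objectively the cost of the different parts of your code. Optimizing without measuring first is simply meaningless.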
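One low-tech way to measure the cost of individual parts is to bracket them with the standard SYSTEM_CLOCK intrinsic. The following is a sketch under assumptions: the two loops are stand-ins for real model sections, and the array size is arbitrary. In an MPI code, MPI_Wtime or a profiler would be the more usual tools.

```fortran
program time_sections
  use, intrinsic :: iso_fortran_env, only: real64, int64
  implicit none
  integer, parameter :: n = 2000000        ! arbitrary problem size (assumption)
  real(real64), allocatable :: x(:), y(:)
  integer(int64) :: t0, t1, t2, rate
  real(real64) :: t_setup, t_compute
  integer :: i

  allocate(x(n), y(n))

  call system_clock(t0, rate)              ! rate = clock ticks per second

  do i = 1, n                              ! "section 1": cheap setup
     x(i) = real(i, real64) / real(n, real64)
  end do
  call system_clock(t1)

  do i = 1, n                              ! "section 2": transcendental functions
     y(i) = sqrt(x(i)) + sin(x(i))
  end do
  call system_clock(t2)

  if (rate <= 0) error stop 'no usable clock'

  t_setup   = real(t1 - t0, real64) / real(rate, real64)
  t_compute = real(t2 - t1, real64) / real(rate, real64)

  print *, 'setup   (s):', t_setup
  print *, 'compute (s):', t_compute
  print *, 'checksum   :', sum(y)          ! use the result so the loops are not optimized away
end program time_sections
```

Timings from a toy program like this say nothing about a real model; the point is only the pattern of measuring each section before deciding where to spend effort.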

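If one part accounts for a high percentage of the CPU time, 50% for instance, do not forget that optimizing that part alone cannot reduce the cost of the whole code by more than a factor of two. In any case, start the optimization work with the most expensive part (the bottleneck).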
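The factor-of-two limit is the usual statement of Amdahl's law. As a sketch, let p be the fraction of run time spent in the optimized part and s the speedup achieved on that part (both symbols are introduced here for illustration). The overall speedup S is then:

```latex
S = \frac{1}{(1 - p) + \dfrac{p}{s}},
\qquad
p = 0.5:\quad \lim_{s \to \infty} S = \frac{1}{1 - 0.5} = 2
```

With a more modest s = 2 on that same 50% part, the overall gain is only S = 1/(0.5 + 0.25) ≈ 1.33.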

Do not forget, either, that the main improvements generally come from better algorithms.

你げ笑在眉眼 2024-12-17 15:42:07

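I agree with the advice that the tricks you have been taught are silly in this era. Compilers do this for you now; such micro-optimizations are unlikely to make a significant difference and may not be portable. Write clear and understandable code. Choose your algorithm carefully. One thing that can make a difference is using the indices of multi-dimensional arrays in the correct order: recasting an M x N array as N x M can help, depending on your program's data access pattern. After that, if your program is too slow, measure where the CPU time is consumed and improve only those parts. Experience shows that guessing is frequently wrong and leads to writing more opaque code for no reason. If you make a code section in which your program spends 1% of its time twice as fast, it won't make any difference.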
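The index-order point can be illustrated with a small sketch. Fortran stores arrays in column-major order, so the first index is the one that is contiguous in memory, and the inner loop should normally run over it. The array size and names below are assumptions for illustration, and an optimizing compiler may interchange the loops itself, in which case the two timings will be similar.

```fortran
program index_order
  use, intrinsic :: iso_fortran_env, only: real64, int64
  implicit none
  integer, parameter :: m = 2000, n = 2000   ! arbitrary sizes (assumption)
  real(real64), allocatable :: a(:,:)
  real(real64) :: s_fast, s_slow
  integer(int64) :: t0, t1, t2, rate
  integer :: i, j

  allocate(a(m, n))
  a = 1.0_real64

  call system_clock(t0, rate)
  s_fast = 0.0_real64
  do j = 1, n              ! outer loop over the second index
     do i = 1, m           ! inner loop over the first index: contiguous accesses
        s_fast = s_fast + a(i, j)
     end do
  end do
  call system_clock(t1)

  s_slow = 0.0_real64
  do i = 1, m              ! outer loop over the first index
     do j = 1, n           ! inner loop over the second index: stride of m elements
        s_slow = s_slow + a(i, j)
     end do
  end do
  call system_clock(t2)

  print *, 'first index innermost  (s):', real(t1 - t0, real64) / real(rate, real64)
  print *, 'second index innermost (s):', real(t2 - t1, real64) / real(rate, real64)

  ! Both loop nests add the same values, so the sums must match.
  if (s_fast /= s_slow) error stop 'sums differ'
end program index_order
```

If the hot loops of a model naturally run over the second index, swapping the array's dimensions (M x N to N x M) makes those accesses contiguous, which is the recasting described above.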

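Here are previous answers on FORALL and WHERE: "How can I ensure that my Fortran FORALL construct is being parallelized?" and "Do Fortran 95 constructs such as WHERE, FORALL and SPREAD generally result in faster parallel code?"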
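For readers comparing the two styles, here is a minimal sketch of a WHERE construct next to the explicit DO/IF loop that computes the same thing. The names and values are made up for illustration. The sketch checks equivalence only and makes no claim about which form a given compiler vectorizes better; that needs to be measured.

```fortran
program where_vs_do
  implicit none
  integer, parameter :: n = 8
  real :: x(n), y_where(n), y_do(n)
  integer :: i

  x = [(real(i) - 4.5, i = 1, n)]   ! mix of negative and positive values

  ! Array-syntax form: masked assignment with WHERE
  where (x > 0.0)
     y_where = sqrt(x)
  elsewhere
     y_where = 0.0
  end where

  ! Explicit loop form: DO with IF
  do i = 1, n
     if (x(i) > 0.0) then
        y_do(i) = sqrt(x(i))
     else
        y_do(i) = 0.0
     end if
  end do

  print *, 'results identical:', all(y_where == y_do)
  if (.not. all(y_where == y_do)) error stop 'WHERE and DO results differ'
end program where_vs_do
```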
