Tips and tricks on improving Fortran code performance
As part of my Ph.D. research, I am working on the development of numerical models of atmosphere and ocean circulation. These involve numerically solving systems of PDEs on ~10^6 grid points, over ~10^4 time steps. Thus, a typical model simulation takes hours to a few days to complete when run with MPI on dozens of CPUs. Naturally, improving model efficiency as much as possible is important, while making sure the results stay byte-for-byte identical.
While I feel quite comfortable with my Fortran programming, and am aware of quite a few tricks to make code more efficient, I feel like there is still room to improve, and there are tricks that I am not aware of.
Currently, I make sure I use as few divisions as possible, try not to use literal constants (I was taught to do this from very early on, e.g. use half=0.5 instead of 0.5 in actual computations), use as few transcendental functions as possible, etc.
What other performance sensitive factors are there? At the moment, I am wondering about a few:
1) Does the order of mathematical operations matter? For example if I have:
a=1E-7 ; b=2E4 ; c=3E13
d=a*b*c
Would d be evaluated with different efficiency depending on the order of multiplication? Nowadays, this must be compiler specific, but is there a straight answer? I notice d getting a (slightly) different value depending on the order (precision limit), but does this affect the efficiency or not?
2) Passing lots (e.g. dozens) of arrays as arguments to a subroutine versus accessing these arrays from a module within the subroutine?
3) Fortran 95 constructs (FORALL and WHERE) versus DO and IF? I know that these mattered back in the 90's when code vectorization was a big thing, but is there any difference now with modern compilers being able to vectorize explicit DO loops? (I am using PGI, Intel, and IBM compilers in my work)
4) Raising a number to an integer power versus multiplication? E.g.:
b=a**4
or
b=a*a*a*a
I have been taught to always use the latter where possible. Does this affect efficiency and/or precision? (probably compiler dependent as well)
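For point 1, here is a minimal sketch of the two groupings of a*b*c, using the values from above. The module and function names (mul_order, mul_left, mul_right) are made up for illustration. Each variant performs exactly two multiplications, so any difference should be in rounding rather than cost. The sketch assumes default compiler settings, under which the parentheses are honored; aggressive floating-point flags may relax that.

```fortran
module mul_order
  implicit none
contains
  ! Left-to-right grouping: (a*b)*c, two multiplications
  pure function mul_left(a, b, c) result(d)
    real, intent(in) :: a, b, c
    real :: d
    d = (a*b)*c
  end function mul_left

  ! Right-first grouping: a*(b*c), also two multiplications
  pure function mul_right(a, b, c) result(d)
    real, intent(in) :: a, b, c
    real :: d
    d = a*(b*c)
  end function mul_right
end module mul_order

program demo_mul_order
  use mul_order
  implicit none
  real :: a, b, c, d1, d2

  a = 1E-7 ; b = 2E4 ; c = 3E13
  d1 = mul_left(a, b, c)
  d2 = mul_right(a, b, c)

  ! Any difference printed here comes from rounding of the
  ! intermediate product, not from a different operation count.
  print *, 'left  grouping:', d1
  print *, 'right grouping:', d2
  print *, 'difference    :', d1 - d2
end program demo_mul_order
```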
Please discuss and/or add any tricks and tips that you know about improving Fortran code efficiency. What else is out there? If you know anything specific to what each of the compilers above do related to this question, please include that as well.
Added: Note that I do not have any bottlenecks or performance issues per se. I am asking whether there are any general rules for optimizing code at the level of individual operations.
Thanks!
Comments (3)
Sorry, but all the tricks you mentioned are simply ... ridiculous. More exactly, they have no meaning in practice. For instance: a**4 or a*a*a*a. (a*a)**2 would be another possibility too. My personal taste is a**4, because a good compiler will automatically choose the best way.
For **, the only point which could matter is the difference between a**4 and a**4. (real exponent), the latter being much more CPU-time consuming. But even this point makes no sense without a measurement in an actual simulation.
In fact, your approach is wrong. Develop your code as well as possible. After that, objectively measure the cost of the different parts of your code. Optimizing without measuring first is simply nonsense.
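One way to do such a measurement in isolation is sketched below. It is only a micro-benchmark under assumed conditions (array size, random inputs, names such as pow_variants are all illustrative), so the timings say nothing about a real simulation and will vary with compiler and flags. The checksum is printed so the compiler cannot discard the work.

```fortran
module pow_variants
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Integer exponent: the compiler is free to expand this into multiplies
  elemental function pow4_int(a) result(b)
    real(dp), intent(in) :: a
    real(dp) :: b
    b = a**4
  end function pow4_int

  ! Real exponent: may go through a general-purpose power routine
  elemental function pow4_real(a) result(b)
    real(dp), intent(in) :: a
    real(dp) :: b
    b = a**4.0_dp
  end function pow4_real

  ! Hand-written repeated multiplication
  elemental function pow4_mul(a) result(b)
    real(dp), intent(in) :: a
    real(dp) :: b
    b = a*a*a*a
  end function pow4_mul
end module pow_variants

program time_pow
  use, intrinsic :: iso_fortran_env, only: int64
  use pow_variants
  implicit none
  integer, parameter :: n = 5000000   ! arbitrary problem size
  real(dp), allocatable :: a(:), b(:)
  integer(int64) :: t0, t1, rate

  allocate(a(n), b(n))
  call random_number(a)
  call system_clock(count_rate=rate)

  call system_clock(t0)
  b = pow4_int(a)
  call system_clock(t1)
  print *, 'a**4    :', real(t1 - t0, dp)/real(rate, dp), 's  checksum', sum(b)

  call system_clock(t0)
  b = pow4_real(a)
  call system_clock(t1)
  print *, 'a**4.   :', real(t1 - t0, dp)/real(rate, dp), 's  checksum', sum(b)

  call system_clock(t0)
  b = pow4_mul(a)
  call system_clock(t1)
  print *, 'a*a*a*a :', real(t1 - t0, dp)/real(rate, dp), 's  checksum', sum(b)
end program time_pow
```

Running it at several optimization levels is a quick way to see whether the compiler treats the three forms differently at all.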
If a part accounts for a high percentage of the CPU time, 50% for instance, don't forget that optimizing only that part cannot divide the cost of the overall code by a factor greater than two. Anyway, start the optimization work with the most expensive part (the bottleneck).
Also, don't forget that the main improvements generally come from better algorithms.
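The factor-of-two limit is the usual Amdahl's-law arithmetic: if a fraction p of the run time is sped up by a factor s, the overall speedup is 1/((1-p) + p/s). A small sketch, with illustrative names (amdahl, overall_speedup) that are not from the original answer:

```fortran
module amdahl
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Overall speedup when a fraction p (0..1) of the run time
  ! is made s times faster (s > 0) and the rest is unchanged.
  pure function overall_speedup(p, s) result(total)
    real(dp), intent(in) :: p, s
    real(dp) :: total
    total = 1.0_dp / ((1.0_dp - p) + p/s)
  end function overall_speedup
end module amdahl

program demo_amdahl
  use amdahl
  implicit none
  ! Half of the run time made twice as fast: about 1.33x overall
  print *, overall_speedup(0.5_dp, 2.0_dp)
  ! Half of the run time made almost free: still just under 2x
  print *, overall_speedup(0.5_dp, 1.0e6_dp)
  ! 1% of the run time made twice as fast: about 1.005x overall
  print *, overall_speedup(0.01_dp, 2.0_dp)
end program demo_amdahl
```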
I second the advice that these tricks that you have been taught are silly in this era. Compilers do this for you now; such micro-optimizations are unlikely to make a significant difference and may not be portable. Write clear and understandable code. Carefully select your algorithm. One thing that can make a difference is using the indices of multi-dimensional arrays in the correct order ... recasting an M x N array to N x M can help, depending on the pattern of data access by your program. After this, if your program is too slow, measure where the CPU time is consumed and improve only those parts. Experience shows that guessing is frequently wrong and leads to writing more opaque code for no reason. If you make a code section in which your program spends 1% of its time twice as fast, it won't make any difference.
Here are previous answers on FORALL and WHERE: "How can I ensure that my Fortran FORALL construct is being parallelized?" and "Do Fortran 95 constructs such as WHERE, FORALL and SPREAD generally result in faster parallel code?"
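To illustrate the index-order point: Fortran stores arrays in column-major order, so the first index is the one that is contiguous in memory and is normally the one to put in the innermost loop. The sketch below uses made-up names and an arbitrary array size; an optimizing compiler may interchange the loops itself, so the measured gap depends on flags and hardware.

```fortran
module loop_order
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  ! Inner loop runs over the first index: contiguous memory access
  function sum_first_inner(a) result(s)
    real(dp), intent(in) :: a(:,:)
    real(dp) :: s
    integer :: i, j
    s = 0.0_dp
    do j = 1, size(a, 2)
      do i = 1, size(a, 1)
        s = s + a(i, j)
      end do
    end do
  end function sum_first_inner

  ! Inner loop runs over the second index: strided memory access
  function sum_second_inner(a) result(s)
    real(dp), intent(in) :: a(:,:)
    real(dp) :: s
    integer :: i, j
    s = 0.0_dp
    do i = 1, size(a, 1)
      do j = 1, size(a, 2)
        s = s + a(i, j)
      end do
    end do
  end function sum_second_inner
end module loop_order

program demo_loop_order
  use, intrinsic :: iso_fortran_env, only: int64
  use loop_order
  implicit none
  real(dp), allocatable :: a(:,:)
  real(dp) :: s
  integer(int64) :: t0, t1, rate

  allocate(a(4000, 4000))   ! arbitrary size, about 128 MB
  call random_number(a)
  call system_clock(count_rate=rate)

  call system_clock(t0)
  s = sum_first_inner(a)
  call system_clock(t1)
  print *, 'first index inner :', real(t1 - t0, dp)/real(rate, dp), 's  sum', s

  call system_clock(t0)
  s = sum_second_inner(a)
  call system_clock(t1)
  print *, 'second index inner:', real(t1 - t0, dp)/real(rate, dp), 's  sum', s
end program demo_loop_order
```

The two sums can differ in the last digits because the additions happen in a different order.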
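For reference, this is the kind of equivalence in question: a WHERE statement and an explicit DO/IF loop that do the same masked assignment. The names are illustrative, and the sketch makes no claim about which form a given compiler vectorizes better; that is something to measure.

```fortran
module clip_negative
  implicit none
contains
  ! Array-syntax version: masked assignment with WHERE
  function clip_where(x) result(y)
    real, intent(in) :: x(:)
    real :: y(size(x))
    y = x
    where (y < 0.0) y = 0.0
  end function clip_where

  ! Explicit-loop version: DO with an IF on each element
  function clip_do(x) result(y)
    real, intent(in) :: x(:)
    real :: y(size(x))
    integer :: i
    do i = 1, size(x)
      if (x(i) < 0.0) then
        y(i) = 0.0
      else
        y(i) = x(i)
      end if
    end do
  end function clip_do
end module clip_negative

program demo_clip
  use clip_negative
  implicit none
  real :: u(5)
  u = [-1.0, 2.0, -3.0, 4.0, 0.5]
  ! Both forms should give identical results (prints a logical true)
  print *, all(clip_where(u) == clip_do(u))
end program demo_clip
```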
You've got a-priori ideas about what to do, and some of them might actually help, but the biggest payoff is in a-posteriori analysis.
(Added: In other words, getting a*b*c into a different order might save a couple of cycles (which I doubt), while at the same time you don't know that you're not getting blind-sided by something spending 1000 cycles for no good reason.)
No matter how carefully you code it, there will be opportunities for speedup that you didn't foresee. Here's how I find them. (Some people consider this method controversial.)
It's best to start with optimization flags OFF when you do this, so the code isn't all scrambled.
Later you can turn them on and let the compiler do its thing.
Get it running under a debugger with enough of a workload so it runs for a reasonable length of time.
While it's running, manually interrupt it, and take a good hard look at what it's doing and why.
Do this several times, like 10, so you don't draw erroneous conclusions about what it's spending time on.
Here's examples of things you might find:
If you do this entire operation two or three times, you will have removed the stupid stuff that finds its way into any software when it's first written.
After that, you can turn on the optimization, parallelism, or whatever, and be confident no time is being spent on silly stuff.