How do I ensure that my Fortran FORALL construct is being parallelized?
I've been given a 2D matrix representing temperature points on the surface of a metal plate. The edges of the matrix (plate) are held constant at 20 degrees C and there is a constant heat source of 100 degrees C at one pre-defined point. All other grid points are initially set to 50 degrees C.
My goal is to take all interior grid points and compute their steady-state temperatures by iteratively averaging over the surrounding four grid points (i+1, i-1, j+1, j-1) until I reach convergence (a change of less than 0.02 degrees C between iterations).
As far as I know, the order in which I iterate over the grid points is irrelevant.
To me, this sounds like a fine time to invoke the Fortran FORALL construct and explore the joys of parallelization.
How can I ensure that the code is indeed being parallelized?
For example, I can compile this on my single-core PowerBook G4 and I would expect no improvement in speed due to parallelization. But if I compile on a Dual Core AMD Opteron, I would assume that the FORALL construct can be exploited.
Alternatively, is there a way to measure the effective parallelization of a program?
Update
In response to M.S.B.'s question, this is with gfortran version 4.4.0. Does gfortran support automatic multi-threading?
It's remarkable that the FORALL construct has been rendered obsolete by what I suppose is auto-vectorization.
Perhaps this is best for a separate question, but how does auto-vectorization work? Is the compiler able to detect that only pure functions or subroutines are being used in a loop?
3 Answers
FORALL is an assignment construct, not a looping construct. The semantics of FORALL state that the expression on the right hand side (RHS) of each assignment within the FORALL is evaluated completely before it is assigned to the left hand side (LHS). This has to be done no matter how complex the operations on the RHS, including cases where the RHS and the LHS overlap.
Most compilers punt on optimizing FORALL, both because it is difficult to optimize and because it is not commonly used. The easiest implementation is to simply allocate a temporary for the RHS, evaluate the expression and store it in the temporary, then copy the result into the LHS. Allocation and deallocation of this temporary is likely to make your code run quite slowly. It is very difficult for a compiler to automatically determine when the RHS can be evaluated without a temporary; most compilers don't make any attempt to do so. Nested DO loops turn out to be much easier to analyze and optimize.
With some compilers, you may be able to parallelize evaluation of the RHS by enclosing the FORALL with the OpenMP "workshare" directive and compiling with whatever flags are necessary to enable OpenMP, like so:
gfortran -fopenmp blah.f90 -o blah
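As a concrete illustration, here is a minimal sketch of one averaging sweep with the FORALL enclosed in a workshare construct. The array names, grid size, and heat-source location are invented for the example, not taken from the question:

    ! One Jacobi-style averaging sweep over the interior grid points,
    ! with the FORALL assignment enclosed in an OpenMP workshare block.
    program forall_workshare_sketch
        implicit none
        integer, parameter :: n = 100
        integer :: i, j
        real :: t(n, n), t_new(n, n)

        t = 50.0                          ! interior starting guess
        t(1, :) = 20.0; t(n, :) = 20.0    ! edges held at 20 C
        t(:, 1) = 20.0; t(:, n) = 20.0
        t(n/2, n/2) = 100.0               ! fixed heat source (location made up)
        t_new = t

        !$omp parallel
        !$omp workshare
        forall (i = 2:n-1, j = 2:n-1)
            t_new(i, j) = 0.25 * (t(i+1, j) + t(i-1, j) &
                                + t(i, j+1) + t(i, j-1))
        end forall
        !$omp end workshare
        !$omp end parallel

        t_new(n/2, n/2) = 100.0           ! re-pin the heat source
        print *, 'largest change this sweep:', maxval(abs(t_new - t))
    end program forall_workshare_sketch

The same sweep written as a nested DO loop under an "omp parallel do" directive is typically easier for the compiler to analyze, which is the trade-off described above.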
Note that a compliant OpenMP implementation (including at least older versions of gfortran) is not required to evaluate the RHS in parallel; it is acceptable for an implementation to evaluate the RHS as though it is enclosed in an OpenMP "single" directive. Note also that the "workshare" likely will not eliminate the temporary allocated by the RHS. This was the case with an old version of the IBM Fortran compiler on Mac OS X, for instance.
If you use the Intel Fortran Compiler, you can use a command line switch to turn on/increase the compiler's verbosity level for parallelization/vectorization. That way, during compilation/linking you will be shown something like:
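From memory, the remark looks roughly like the hypothetical line below; the file name and position are invented, and the reporting switch is -qopt-report in recent Intel compilers (-vec-report/-par-report in older ones):

    plate.f90(42): (col. 9) remark: LOOP WAS VECTORIZED.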
I admit that it has been a few years since the last time I used it, so the compiler message might actually look very different, but that's the basic idea.
The best way is to measure the clock time of the calculation. Try it with and without the parallel code. If the clock time decreases, your parallel code is working. The Fortran intrinsic system_clock, called before and after the code block, will give you the clock time. The intrinsic cpu_time will give you the CPU time, which might go up when the code is run multi-threaded, due to overhead.
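A minimal sketch of that timing pattern, with a throwaway loop standing in for the real calculation:

    ! Measure wall-clock time (system_clock) and CPU time (cpu_time)
    ! around a block of work.
    program timing_sketch
        implicit none
        integer :: count0, count1, count_rate, i
        real :: cpu0, cpu1, s

        call system_clock(count0, count_rate)
        call cpu_time(cpu0)

        s = 0.0
        do i = 1, 50000000            ! stand-in for the averaging loop
            s = s + sqrt(real(i))
        end do

        call cpu_time(cpu1)
        call system_clock(count1)

        print *, 'checksum (prevents dead-code elimination):', s
        print *, 'wall-clock seconds:', real(count1 - count0) / real(count_rate)
        print *, 'cpu seconds:       ', cpu1 - cpu0
    end program timing_sketch

If the multi-threaded build shows a lower wall-clock time but a similar or higher CPU time, the parallelization is doing its job.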
The lore is that FORALL is not as useful as was thought when it was introduced into the language -- it is more of an initialization construct. Compilers are equally adept at optimizing regular loops.
Fortran compilers vary in their abilities to implement true parallel processing without it being explicitly specified, e.g., with OpenMP or MPI. What compiler are you using?
To get automatic multi-threading, I've used ifort. Manually, I've used OpenMP. With both of these, you can compile your program with and without the parallelization and measure the difference.
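With the OpenMP route, for example, the comparison can be as simple as building the same source twice (file names hypothetical) and timing each binary, since gfortran treats the !$omp directives as plain comments when -fopenmp is absent:

    gfortran -O2 plate.f90 -o plate_serial
    gfortran -O2 -fopenmp plate.f90 -o plate_omp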