How do I ensure that my Fortran FORALL construct is being parallelized?
I've been given a 2D matrix representing temperature points on the surface of a metal plate. The edges of the matrix (plate) are held constant at 20 degrees C and there is a constant heat source of 100 degrees C at one pre-defined point. All other grid points are initially set to 50 degrees C.
My goal is to take all interior grid points and compute their steady-state temperatures by iteratively averaging over the surrounding four grid points (i+1, i-1, j+1, j-1) until I reach convergence (a change of less than 0.02 degrees C between iterations).
As far as I know, the order in which I iterate over the grid points is irrelevant.
To me, this sounds like a fine time to invoke the Fortran FORALL construct and explore the joys of parallelization.
How can I ensure that the code is indeed being parallelized?
For example, I can compile this on my single-core PowerBook G4 and I would expect no improvement in speed due to parallelization. But if I compile on a Dual Core AMD Opteron, I would assume that the FORALL construct can be exploited.
Alternatively, is there a way to measure the effective parallelization of a program?
Update
In response to M.S.B.'s question, this is with gfortran version 4.4.0. Does gfortran support automatic multi-threading?
It's remarkable that the FORALL construct has been rendered obsolete by what I suppose is auto-vectorization.
Perhaps this is best for a separate question, but how does auto-vectorization work? Is the compiler able to detect that only pure functions or subroutines are being used in a loop?
3 Answers
FORALL is an assignment construct, not a looping construct. The semantics of FORALL state that the expression on the right hand side (RHS) of each assignment within the FORALL is evaluated completely before it is assigned to the left hand side (LHS). This has to be done no matter how complex the operations on the RHS, including cases where the RHS and the LHS overlap.
Most compilers punt on optimizing FORALL, both because it is difficult to optimize and because it is not commonly used. The easiest implementation is to simply allocate a temporary for the RHS, evaluate the expression and store it in the temporary, then copy the result into the LHS. Allocation and deallocation of this temporary is likely to make your code run quite slowly. It is very difficult for a compiler to automatically determine when the RHS can be evaluated without a temporary; most compilers don't make any attempt to do so. Nested DO loops turn out to be much easier to analyze and optimize.
With some compilers, you may be able to parallelize evaluation of the RHS by enclosing the FORALL with the OpenMP "workshare" directive and compiling with whatever flags are necessary to enable OpenMP, like so:
gfortran -fopenmp blah.f90 -o blah
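As a concrete illustration, here is a minimal sketch of one averaging sweep with the FORALL enclosed in a workshare construct. The array names, grid size, and heat-source location are invented for the example, not taken from the question:

    ! One Jacobi-style averaging sweep over the interior grid points,
    ! with the FORALL assignment enclosed in an OpenMP workshare block.
    program forall_workshare_sketch
        implicit none
        integer, parameter :: n = 100
        integer :: i, j
        real :: t(n, n), t_new(n, n)

        t = 50.0                          ! interior starting guess
        t(1, :) = 20.0; t(n, :) = 20.0    ! edges held at 20 C
        t(:, 1) = 20.0; t(:, n) = 20.0
        t(n/2, n/2) = 100.0               ! fixed heat source (location made up)
        t_new = t

        !$omp parallel
        !$omp workshare
        forall (i = 2:n-1, j = 2:n-1)
            t_new(i, j) = 0.25 * (t(i+1, j) + t(i-1, j) &
                                + t(i, j+1) + t(i, j-1))
        end forall
        !$omp end workshare
        !$omp end parallel

        t_new(n/2, n/2) = 100.0           ! re-pin the heat source
        print *, 'largest change this sweep:', maxval(abs(t_new - t))
    end program forall_workshare_sketch

The same sweep written as a nested DO loop under an "omp parallel do" directive is typically easier for the compiler to analyze, which is the trade-off described above.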
Note that a compliant OpenMP implementation (including at least older versions of gfortran) is not required to evaluate the RHS in parallel; it is acceptable for an implementation to evaluate the RHS as though it is enclosed in an OpenMP "single" directive. Note also that the "workshare" likely will not eliminate the temporary allocated by the RHS. This was the case with an old version of the IBM Fortran compiler on Mac OS X, for instance.
If you use the Intel Fortran Compiler, you can use a command line switch to turn on/increase the compiler's verbosity level for parallelization/vectorization. That way, during compilation/linking you will be shown something like:
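From memory, the remark looks roughly like the hypothetical line below; the file name and position are invented, and the reporting switch is -qopt-report in recent Intel compilers (-vec-report/-par-report in older ones):

    plate.f90(42): (col. 9) remark: LOOP WAS VECTORIZED.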
I admit that it has been a few years since the last time I used it, so the compiler message might actually look very different, but that's the basic idea.
The best way is to measure the clock time of the calculation. Try it with and without the parallel code. If the clock time decreases, your parallel code is working. The Fortran intrinsic system_clock, called before and after the code block, will give you the clock time. The intrinsic cpu_time will give you the CPU time, which might go up when the code is run multi-threaded, due to overhead.
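A minimal sketch of that timing pattern, with a throwaway loop standing in for the real calculation:

    ! Measure wall-clock time (system_clock) and CPU time (cpu_time)
    ! around a block of work.
    program timing_sketch
        implicit none
        integer :: count0, count1, count_rate, i
        real :: cpu0, cpu1, s

        call system_clock(count0, count_rate)
        call cpu_time(cpu0)

        s = 0.0
        do i = 1, 50000000            ! stand-in for the averaging loop
            s = s + sqrt(real(i))
        end do

        call cpu_time(cpu1)
        call system_clock(count1)

        print *, 'checksum (prevents dead-code elimination):', s
        print *, 'wall-clock seconds:', real(count1 - count0) / real(count_rate)
        print *, 'cpu seconds:       ', cpu1 - cpu0
    end program timing_sketch

If the multi-threaded build shows a lower wall-clock time but a similar or higher CPU time, the parallelization is doing its job.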
The lore is that FORALL is not as useful as was thought when it was introduced into the language -- it is more of an initialization construct. Compilers are equally adept at optimizing regular loops.
Fortran compilers vary in their abilities to implement true parallel processing without it being explicitly specified, e.g., with OpenMP or MPI. What compiler are you using?
To get automatic multi-threading, I've used ifort. Manually, I've used OpenMP. With both of these, you can compile your program with and without the parallelization and measure the difference.
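With the OpenMP route, for example, the comparison can be as simple as building the same source twice (file names hypothetical) and timing each binary, since gfortran treats the !$omp directives as plain comments when -fopenmp is absent:

    gfortran -O2 plate.f90 -o plate_serial
    gfortran -O2 -fopenmp plate.f90 -o plate_omp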