Allocatable array performance
There is an MPI version of a program which uses COMMON blocks to store arrays that are used throughout the code. Unfortunately, there is no way to declare arrays in a COMMON block whose size would be known only at run time. So, as a workaround, I decided to move those arrays into modules, which do accept ALLOCATABLE arrays. That is, all arrays in COMMON blocks were removed and ALLOCATE was used instead. This was the only thing I changed in my program. Unfortunately, the performance of the program became awful (compared to the COMMON block implementation). As for the MPI settings, there is a single MPI process on each computational node, and each MPI process has a single thread.
I found a similar question asked here, but I don't see (don't understand :) ) how it applies to my case, where each process has a single thread. I would appreciate any help.
Here is a simple example which illustrates what I am talking about (below is pseudocode):
"SOURCE FILE":
SUBROUTINE ZEROSET()
INCLUDE 'FILE_1.INC'
INCLUDE 'FILE_2.INC'
INCLUDE 'FILE_3.INC'
....
INCLUDE 'FILE_N.INC'
ARRAY_1 = 0.0
ARRAY_2 = 0.0
ARRAY_3 = 0.0
ARRAY_4 = 0.0
...
ARRAY_N = 0.0
END SUBROUTINE
As you can see, ZEROSET() contains no parallel or MPI code. FILE_1.INC, FILE_2.INC, ..., FILE_N.INC are files in which ARRAY_1, ARRAY_2, ..., ARRAY_N are defined in COMMON blocks, something like this:
REAL ARRAY_1
COMMON /ARRAY_1/ ARRAY_1(NX, NY, NZ)
where NX, NY, NZ are compile-time constants defined with the PARAMETER statement.
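For example, such an include file might define those constants roughly like this (the values here are purely illustrative, not from the real code):

      INTEGER NX, NY, NZ
      PARAMETER (NX = 128, NY = 128, NZ = 128)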
When I switched to modules, I simply removed all the COMMON blocks, so FILE_I now looks like
REAL, ALLOCATABLE:: ARRAY_I(:,:,:)
and then changed the "INCLUDE 'FILE_I.INC'" statements above to "USE FILE_I". Actually, when the parallel program is executed, one particular process does not need the whole (NX, NY, NZ) domain, so I compute the local dimensions and then allocate ARRAY_I (only ONCE!).
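For concreteness, a sketch of what such a module and its one-time allocation could look like (SETUP_ARRAYS and NX_LOC, NY_LOC, NZ_LOC are illustrative names for the per-process sizes computed at run time, not taken from the real code):

      MODULE FILE_I
         REAL, ALLOCATABLE :: ARRAY_I(:, :, :)
      END MODULE FILE_I

      SUBROUTINE SETUP_ARRAYS(NX_LOC, NY_LOC, NZ_LOC)
         USE FILE_I
         IMPLICIT NONE
         INTEGER, INTENT(IN) :: NX_LOC, NY_LOC, NZ_LOC
         ! Allocated exactly once, with the sizes this MPI process actually needs
         IF (.NOT. ALLOCATED(ARRAY_I)) ALLOCATE(ARRAY_I(NX_LOC, NY_LOC, NZ_LOC))
      END SUBROUTINE SETUP_ARRAYS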
Subroutine ZEROSET() takes 0.18 seconds with COMMON blocks and 0.36 seconds with modules (when the array dimensions are computed at run time). So the performance worsened by a factor of two.
I hope that everything is clear now. I appreciate your help very much.
3 Answers
Using allocatable arrays in modules can often hurt performance because the compiler has no idea about sizes at compile time. You will get much better performance with many compilers with this code:
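A minimal sketch of the kind of code meant here (all names are illustrative), where the array is allocated once and then passed, together with its size, to the routine that works on it:

      SUBROUTINE DRIVER(N)
         IMPLICIT NONE
         INTEGER, INTENT(IN) :: N
         REAL, ALLOCATABLE   :: A(:, :)
         ALLOCATE(A(N, N))        ! allocate once, at run time
         CALL WORK_ON(A, N)       ! hand the array and its size to the worker
         DEALLOCATE(A)
      END SUBROUTINE DRIVER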
Then this code:
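And, again only as a sketch, the worker routine that receives A as an explicit-shape dummy argument:

      SUBROUTINE WORK_ON(A, N)
         IMPLICIT NONE
         INTEGER, INTENT(IN)    :: N
         REAL,    INTENT(INOUT) :: A(N, N)   ! explicit shape: the compiler knows A is N x N
         INTEGER :: I, J
         DO J = 1, N
            DO I = 1, N
               A(I, J) = 0.0                 ! ... do stuff here ...
            END DO
         END DO
      END SUBROUTINE WORK_ON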
The compiler will know that the array is NxN and that the do loops run over N, and it will be able to take advantage of that fact (most codes work on arrays that way). With a module-level allocatable, by contrast, after any subroutine call in "do stuff here" the compiler has to assume that array "A" might have changed size or moved to a different location in memory, and must recheck. That kills optimization.
This should get you most of your performance back.
COMMON blocks are also located at a fixed place in memory, and that allows optimizations as well.
Actually, I guess your problem here is indeed compiler-optimization based, in combination with stack vs. heap memory. Depending on the compiler you are using, it might do some more efficient memory blanking, and for a fixed chunk of memory it does not even need to check its extent and location within the subroutine. Thus, with fixed-size arrays there will be hardly any overhead involved.
Is this routine called very often, or why do you care about these 0.18 s?
If it is indeed relevant, the best option would be to get rid of the zero-setting altogether and instead, for example, peel off the first iteration of the loop and use it for the initialization. This way you do not introduce additional memory accesses just for initializing with 0. However, it would duplicate some code...
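A minimal sketch of that idea (the accumulation pattern, the name ACCUMULATE, and the array CONTRIB are assumed purely for illustration; the real loop body depends on the application):

      SUBROUTINE ACCUMULATE(ARRAY_1, CONTRIB, NX, NY, NZ, NSTEPS)
         IMPLICIT NONE
         INTEGER, INTENT(IN)  :: NX, NY, NZ, NSTEPS
         REAL,    INTENT(OUT) :: ARRAY_1(NX, NY, NZ)
         REAL,    INTENT(IN)  :: CONTRIB(NX, NY, NZ, NSTEPS)
         INTEGER :: IT
         ! The first iteration writes ARRAY_1 directly, so the separate
         ! ARRAY_1 = 0.0 pass (and its extra sweep over memory) disappears.
         ARRAY_1 = CONTRIB(:, :, :, 1)
         DO IT = 2, NSTEPS
            ARRAY_1 = ARRAY_1 + CONTRIB(:, :, :, IT)
         END DO
      END SUBROUTINE ACCUMULATE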
I can think of just these reasons when it comes to Fortran performance with arrays: