Allocatable array performance

Posted on 2024-12-03 17:43:51

There is an MPI version of a program that uses COMMON blocks to store arrays used throughout the code. Unfortunately, there is no way to declare an array in a COMMON block when its size is known only at run time. As a workaround, I decided to move those arrays into modules, which do accept ALLOCATABLE arrays. That is, all arrays in COMMON blocks were removed and ALLOCATE was used instead. This was the only thing I changed in my program. Unfortunately, the performance of the program became awful compared to the COMMON block implementation. As for the MPI settings, there is a single MPI process on each compute node, and each MPI process has a single thread.
I found a similar question asked here, but I don't see (don't understand :) ) how it applies to my case, where each process has a single thread. I appreciate any help.

Here is a simple example that illustrates what I am talking about (pseudocode below):

"SOURCE FILE":

SUBROUTINE ZEROSET()
   INCLUDE 'FILE_1.INC'
   INCLUDE 'FILE_2.INC'
   INCLUDE 'FILE_3.INC'
   ....
   INCLUDE 'FILE_N.INC'

   ARRAY_1 = 0.0
   ARRAY_2 = 0.0
   ARRAY_3 = 0.0
   ARRAY_4 = 0.0
   ...
   ARRAY_N = 0.0
END SUBROUTINE

As you can see, ZEROSET() contains no parallel or MPI code. FILE_1.INC, FILE_2.INC, ..., FILE_N.INC are the files where ARRAY_1, ARRAY_2, ..., ARRAY_N are defined in COMMON blocks, something like this:

REAL ARRAY_1
COMMON /ARRAY_1/ ARRAY_1(NX, NY, NZ)

where NX, NY, NZ are constants defined with PARAMETER statements.
When I switched to modules, I removed all the COMMON blocks, so FILE_I.INC now looks like

REAL, ALLOCATABLE:: ARRAY_I(:,:,:)

I then just changed the "INCLUDE 'FILE_I.INC'" statements above to "USE FILE_I". In fact, when the parallel program runs, a given process does not need the whole (NX, NY, NZ) domain, so I calculate the local dimensions and then allocate ARRAY_I (only ONCE!).
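A minimal sketch of the module-based setup described above may make the change easier to picture. The module name `file_i`, the local extents `nx_loc`, `ny_loc`, `nz_loc`, and their values are made up for illustration; the real program computes the extents from the domain decomposition.

```fortran
! Hypothetical stand-in for one of the FILE_I modules from the question.
module file_i
   implicit none
   real, allocatable :: array_i(:,:,:)
end module file_i

! Stand-in for ZEROSET(): USE replaces the former INCLUDE.
subroutine zeroset()
   use file_i
   implicit none
   array_i = 0.0
end subroutine zeroset

program demo
   use file_i
   implicit none
   ! Made-up local extents; assumed to come from the decomposition in practice.
   integer :: nx_loc, ny_loc, nz_loc

   nx_loc = 8
   ny_loc = 4
   nz_loc = 2
   allocate(array_i(nx_loc, ny_loc, nz_loc))   ! done exactly once

   array_i = 1.0
   call zeroset()

   ! Self-check: every element must be zero after the call.
   if (any(array_i /= 0.0)) error stop 'zeroset failed'
   print *, 'shape of array_i:', shape(array_i)
end program demo
```

Inside `zeroset`, the compiler sees only an allocatable array whose bounds and address are run-time values, which is the situation the timings below refer to.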

Subroutine ZEROSET() takes 0.18 seconds with COMMON blocks and 0.36 seconds with modules (when the array dimensions are calculated at run time). So performance got worse by a factor of two.

I hope everything is clear now. I appreciate your help very much.

Comments (3)

鼻尖触碰 2024-12-10 17:43:51

Using allocatable arrays in modules can often hurt performance because the compiler knows nothing about their sizes at compile time. With many compilers you will get much better performance with this code:

   subroutine X
   use Y  ! module Y contains the allocatable array A (N x N once allocated)
   call Z(A,N)
   end subroutine

   subroutine Z(A,N)
   integer N
   real A(N,N)  ! explicit-shape dummy: bounds are fixed inside Z
   ! do stuff here
   end subroutine

than with this code:

   subroutine X
   use Y  ! module Y contains the allocatable array A (N x N once allocated)
   ! do stuff here, directly on A
   end subroutine

In the first version, the compiler knows inside Z that the array is N x N and that the DO loops run over N, and it can take advantage of that fact (most codes work on arrays that way). In the second version, after any subroutine call in "do stuff here", the compiler has to assume that array A might have changed size or moved in memory, and recheck. That kills optimization.

This should get you most of your performance back.
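As a rough, self-contained sketch of the wrapper pattern above: the module `y`, the size `n = 100`, and the zeroing loop inside `z` are illustrative assumptions, not code from the original program.

```fortran
module y
   implicit none
   integer :: n
   real, allocatable :: a(:,:)
end module y

! Worker: a is an explicit-shape dummy, so its bounds are fixed for the call.
subroutine z(a, n)
   implicit none
   integer, intent(in) :: n
   real, intent(inout) :: a(n, n)
   integer :: i, j
   do j = 1, n
      do i = 1, n
         a(i, j) = 0.0
      end do
   end do
end subroutine z

! Thin wrapper: fetches the module array and hands it to the worker.
subroutine x()
   use y
   implicit none
   call z(a, n)
end subroutine x

program demo
   use y
   implicit none
   n = 100                 ! assumed size, normally computed at run time
   allocate(a(n, n))
   a = 1.0
   call x()
   ! Self-check: the worker must have zeroed the whole array.
   if (any(a /= 0.0)) error stop 'a was not zeroed'
   print *, 'sum(a) =', sum(a)
end program demo
```

Whether this recovers the lost time depends on the compiler and flags, so it is worth timing both variants on the actual code.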

Common blocks are also located at a fixed place in memory, which allows further optimizations.

枕花眠 2024-12-10 17:43:51

Actually, I guess your problem here, in combination with stack vs. heap memory, really comes down to compiler optimization. Depending on the compiler you are using, it might blank the memory more efficiently, and for a fixed chunk of memory it does not even need to check its extent and location within the subroutine. Thus, with fixed-size arrays there is almost no overhead involved.
Is this routine called very often, or why do you care about these 0.18 s?
If it is indeed relevant, the best option would be to get rid of the zero-setting altogether and instead, for example, split off the first iteration of the loop and use it for the initialization. That way you do not introduce additional memory accesses just to initialize with 0. However, it would duplicate some code...
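To illustrate the idea of peeling the first iteration, here is a small sketch. The accumulation loop, the array `term`, and the sizes `n` and `nsteps` are invented for the example, since the original program's loops are not shown.

```fortran
program peel_first_iteration
   implicit none
   integer, parameter :: n = 1000, nsteps = 5   ! made-up sizes
   real :: term(n, nsteps), acc_zeroed(n), acc_peeled(n)
   integer :: i, k

   ! Made-up input data for the illustration.
   do k = 1, nsteps
      do i = 1, n
         term(i, k) = real(i + k)
      end do
   end do

   ! Variant 1: separate zeroing pass, then accumulate.
   acc_zeroed = 0.0
   do k = 1, nsteps
      acc_zeroed = acc_zeroed + term(:, k)
   end do

   ! Variant 2: the first iteration assigns instead of adding,
   ! so the extra zeroing pass over the array disappears.
   acc_peeled = term(:, 1)
   do k = 2, nsteps
      acc_peeled = acc_peeled + term(:, k)
   end do

   ! Self-check: both variants must produce the same result.
   if (any(acc_zeroed /= acc_peeled)) error stop 'variants disagree'
   print *, 'both variants agree'
end program peel_first_iteration
```

The second variant saves one full write pass over the array at the cost of repeating the loop body once, which is the code duplication mentioned above.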

疯狂的代价 2024-12-10 17:43:51

I can think of just these reasons when it comes to Fortran performance with arrays:

  1. Arrays on the stack vs. the heap, though I doubt this could have a huge performance impact.
  2. Passing arrays to a subroutine, because the best way to do that depends on the array; see this page on using arrays efficiently.