MPI_Barrier not working properly
I wrote the C application below to help me understand MPI and why MPI_Barrier() isn't functioning in my huge C++ application. I was able to reproduce the problem from the huge application with a much smaller C application. Essentially, I call MPI_Barrier() inside a for loop, and MPI_Barrier() is reached by all nodes, yet after 2 iterations of the loop the program deadlocks. Any thoughts?
#include <mpi.h>
#include <stdio.h>
int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int i = 0, numprocs, rank, namelen;
    char processor_name[MPI_MAX_PROCESSOR_NAME];
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Get_processor_name(processor_name, &namelen);
    printf("%s: Rank %d of %d\n", processor_name, rank, numprocs);
    for (i = 1; i <= 100; i++) {
        if (rank == 0) printf("Before barrier (%d:%s)\n", i, processor_name);
        MPI_Barrier(MPI_COMM_WORLD);
        if (rank == 0) printf("After barrier (%d:%s)\n", i, processor_name);
    }
    MPI_Finalize();
    return 0;
}
The output:
alienone: Rank 1 of 4
alienfive: Rank 3 of 4
alienfour: Rank 2 of 4
alientwo: Rank 0 of 4
Before barrier (1:alientwo)
After barrier (1:alientwo)
Before barrier (2:alientwo)
After barrier (2:alientwo)
Before barrier (3:alientwo)
I am using GCC 4.4 and Open MPI 1.3 from the Ubuntu 10.10 repositories.
Also, in my huge C++ application, MPI broadcasts don't work: only half the nodes receive the broadcast, while the others are stuck waiting for it.
Thank you in advance for any help or insights!
Update: Upgraded to Open MPI 1.4.4, compiled from source into /usr/local/.
Update: Attaching GDB to the running processes yields an interesting result. It looks to me like all nodes have died at the MPI barrier, yet MPI still thinks they are running:
0x00007fc235cbd1c8 in __poll (fds=0x15ee360, nfds=8, timeout=<value optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:83
83 ../sysdeps/unix/sysv/linux/poll.c: No such file or directory.
in ../sysdeps/unix/sysv/linux/poll.c
(gdb) bt
#0 0x00007fc235cbd1c8 in __poll (fds=0x15ee360, nfds=8, timeout=<value optimized out>) at ../sysdeps/unix/sysv/linux/poll.c:83
#1 0x00007fc236a45141 in poll_dispatch () from /usr/local/lib/libopen-pal.so.0
#2 0x00007fc236a43f89 in opal_event_base_loop () from /usr/local/lib/libopen-pal.so.0
#3 0x00007fc236a38119 in opal_progress () from /usr/local/lib/libopen-pal.so.0
#4 0x00007fc236eff525 in ompi_request_default_wait_all () from /usr/local/lib/libmpi.so.0
#5 0x00007fc23141ad76 in ompi_coll_tuned_sendrecv_actual () from /usr/local/lib/openmpi/mca_coll_tuned.so
#6 0x00007fc2314247ce in ompi_coll_tuned_barrier_intra_recursivedoubling () from /usr/local/lib/openmpi/mca_coll_tuned.so
#7 0x00007fc236f15f12 in PMPI_Barrier () from /usr/local/lib/libmpi.so.0
#8 0x0000000000400b32 in main (argc=1, argv=0x7fff5883da58) at barrier_test.c:14
(gdb)
Update: I also have this code:
#include <mpi.h>
#include <stdio.h>
#include <math.h>
int main(int argc, char *argv[]) {
    int n = 400, myid, numprocs, i;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    printf("MPI Rank %i of %i.\n", myid, numprocs);
    while (1) {
        h = 1.0 / (double) n;
        sum = 0.0;
        for (i = myid + 1; i <= n; i += numprocs) {
            x = h * ((double)i - 0.5);
            sum += (4.0 / (1.0 + x*x));
        }
        mypi = h * sum;
        MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (myid == 0)
            printf("pi is approximately %.16f, Error is %.16f\n", pi, fabs(pi - PI25DT));
    }
    MPI_Finalize();
    return 0;
}
Despite the infinite loop, the printf() inside the loop produces output only once:
mpirun -n 24 --machinefile /etc/machines a.out
MPI Rank 0 of 24.
MPI Rank 3 of 24.
MPI Rank 1 of 24.
MPI Rank 4 of 24.
MPI Rank 17 of 24.
MPI Rank 15 of 24.
MPI Rank 5 of 24.
MPI Rank 7 of 24.
MPI Rank 16 of 24.
MPI Rank 2 of 24.
MPI Rank 11 of 24.
MPI Rank 9 of 24.
MPI Rank 8 of 24.
MPI Rank 20 of 24.
MPI Rank 23 of 24.
MPI Rank 19 of 24.
MPI Rank 12 of 24.
MPI Rank 13 of 24.
MPI Rank 21 of 24.
MPI Rank 6 of 24.
MPI Rank 10 of 24.
MPI Rank 18 of 24.
MPI Rank 22 of 24.
MPI Rank 14 of 24.
pi is approximately 3.1415931744231269, Error is 0.0000005208333338
Any thoughts?
2 Answers
MPI_Barrier() in Open MPI sometimes hangs when processes reach the barrier at different times after the last barrier, but as far as I can see that is not your case. Anyway, try using MPI_Reduce() instead of, or before, the real call to MPI_Barrier(). That is not a direct equivalent of a barrier, but any synchronous call with almost no payload that involves all processes in a communicator should work like one. I haven't seen this behaviour of MPI_Barrier() in LAM/MPI, MPICH2, or even WMPI, but it was a real issue with Open MPI.
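A minimal sketch of that workaround in C. The helper name is hypothetical, and MPI_Allreduce() is used here instead of MPI_Reduce() so that every rank, not only the root, has to wait for the combined result:

#include <mpi.h>

/* Tiny reduction with a single int payload: every rank in the communicator
   must participate before any rank gets the result, so the call can stand in
   for MPI_Barrier(). */
static void reduce_as_barrier(MPI_Comm comm)
{
    int dummy = 0, result = 0;
    MPI_Allreduce(&dummy, &result, 1, MPI_INT, MPI_SUM, comm);
}

In the test program from the question, it would be called as reduce_as_barrier(MPI_COMM_WORLD) in place of, or right before, the MPI_Barrier(MPI_COMM_WORLD) call inside the loop.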
What interconnect do you have? Is it a specialised one like InfiniBand or Myrinet, or are you just using plain TCP over Ethernet? Do you have more than one configured network interface when running with the TCP transport?
Besides, Open MPI is modular: there are many modules that provide algorithms implementing the various collective operations. You can try to fiddle with them using MCA parameters. For example, you can start debugging your application's behaviour by increasing the verbosity of the btl component, passing mpirun something like --mca btl_base_verbose 30.
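Applied to the barrier test from the question (the binary name barrier_test is an assumption; the machinefile is taken from the mpirun line shown earlier), that would look like:

mpirun -n 4 --machinefile /etc/machines --mca btl_base_verbose 30 ./barrier_test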
Look in the verbose output for messages about connection attempts that fail or time out. In that case some (or all) nodes have more than one configured network interface that is up, but not all nodes are reachable through all of the interfaces. This might happen, for example, if the nodes run a recent Linux distribution with Xen support enabled by default (RHEL?) or have other virtualisation software installed that brings up virtual network interfaces.
By default Open MPI is lazy, that is, connections are opened on demand. The first send/receive communication may succeed if the right interface is picked, but subsequent operations are likely to pick one of the alternate paths in order to maximise the bandwidth. If the other node is unreachable through the second interface, a timeout is likely to occur and the communication will fail, since Open MPI will consider the other node down or problematic.
The solution is to isolate the non-connecting networks or network interfaces using MCA parameters of the TCP btl module, for example:
- restrict Open MPI to a specific IP network: --mca btl_tcp_if_include 192.168.2.0/24
- restrict it to specific network interfaces: --mca btl_tcp_if_include eth0,eth1
- or exclude the problematic interfaces (in that case the loopback lo must be excluded too): --mca btl_tcp_if_exclude lo,virt0
Refer to the Open MPI run-time TCP tuning FAQ for more details.
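For example, assuming eth0 is the interface that actually connects the machines (check with ip addr or ifconfig on each node; both eth0 and the barrier_test binary name are assumptions here), the barrier test could be pinned to it like this:

mpirun -n 4 --machinefile /etc/machines --mca btl_tcp_if_include eth0 ./barrier_test

If the loop then completes all 100 iterations, the hang was indeed caused by Open MPI trying to communicate over an interface on which the other nodes are not reachable.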