Open MPI/MPICH: what happens if a node terminates?
I would like to know what happens if a node of an Open MPI/MPICH2 cluster terminates. Is there some mechanism that tolerates this case and lets execution continue?
Thanks for your answers
Heinrich
Comments (2)
Not really: MPI doesn't provide out-of-the-box fault tolerance. You could write your programs to deal with the failure of a process, but most of us don't; we live with our programs crashing when the hardware dies. This situation is changing with the emergence of supercomputers with hundreds of thousands of processors and a mean time between failures on the order of seconds.
Note that one feature that has existed since the MPI 1.x days is the ability to set an error handler; see, e.g.,
http://www.mpi-forum.org/docs/mpi-11-html/node148.html
As Mark notes, most of us just use MPI_ERRORS_ARE_FATAL (which is the default) because our algorithms are very state-heavy and can't easily be recovered (except through checkpointing, which most of us do anyway).
But that need not be the case; you can have the MPI functions return the error messages and try to recover as best you can.
There are a few fault-tolerant MPI packages out there, such as http://icl.cs.utk.edu/ftmpi/ (which is kind of old and only implements MPI 1.2 functionality). More recently, http://osl.iu.edu/research/ft/cifts/ is one approach being put into Open MPI as a separate project, and there is also an OS-level checkpoint/restart package, BLCR, which may be of interest.
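BLCR checkpoints at the operating-system level; the application-level checkpointing mentioned earlier can be much simpler. The plain C sketch below is one possible shape for it, not taken from any of the packages above. The SolverState struct, the function names, and the file layout are all hypothetical; in an MPI program each rank would typically write its own file (for instance with the rank number in the file name). It assumes POSIX rename semantics, where renaming over an existing file replaces it atomically.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical solver state; a real code would save its own arrays. */
typedef struct {
    int step;
    double value;
} SolverState;

/* Write the state to "<path>.tmp", then rename it over `path`, so a
 * crash in the middle of a write never leaves a truncated checkpoint.
 * Returns 0 on success, -1 on failure. */
int save_checkpoint(const char *path, const SolverState *s)
{
    char tmp[512];
    FILE *f;
    int n;

    assert(path != NULL && s != NULL);
    n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || n >= (int)sizeof tmp)
        return -1;
    f = fopen(tmp, "wb");
    if (f == NULL)
        return -1;
    if (fwrite(s, sizeof *s, 1, f) != 1) {
        fclose(f);
        remove(tmp);
        return -1;
    }
    if (fclose(f) != 0) {
        remove(tmp);
        return -1;
    }
    return rename(tmp, path) == 0 ? 0 : -1;
}

/* Read a checkpoint into `s`. Returns 0 on success, -1 if the file is
 * missing or incomplete; `s` is left untouched on failure. */
int load_checkpoint(const char *path, SolverState *s)
{
    SolverState loaded;
    FILE *f;
    int ok;

    assert(path != NULL && s != NULL);
    f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    ok = fread(&loaded, sizeof loaded, 1, f) == 1;
    fclose(f);
    if (!ok)
        return -1;
    *s = loaded;
    return 0;
}

/* Run until `total` steps are done, resuming from `path` if a
 * checkpoint exists and saving one every `interval` steps. */
double run_steps(const char *path, int total, int interval)
{
    SolverState s = {0, 0.0};

    assert(interval > 0);
    load_checkpoint(path, &s);          /* on failure, start from scratch */
    while (s.step < total) {
        s.value += 1.0 / (s.step + 1);  /* stand-in for one step of real work */
        s.step++;
        if (s.step % interval == 0)
            save_checkpoint(path, &s);
    }
    return s.value;
}
```

If the job is killed partway through, calling run_steps again with the same path picks up from the last saved step instead of step zero, so at most `interval` steps of work are lost.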
The MPI-3 forum is discussing a standard fault-tolerance API in MPI, so the pace of such projects is accelerating.