Fault tolerance in MPICH/OpenMPI
I have two questions:
Q1. Is there a more efficient way to handle error situations in MPI, other than checkpoint/rollback? I see that if a node "dies", the program halts abruptly. Is there any way to go ahead with the execution after a node dies? (It is not a problem if this comes at the cost of accuracy.)
Q2. I read in "http://stackoverflow.com/questions/144309/what-is-the-best-mpi-implementation" that OpenMPI has better fault tolerance and that MPICH-2 has recently come up with similar features. Does anybody know what they are and how to use them? Is it a "mode"? Can they help in the situation stated in Q1?
Kindly reply. Thank you.
Comments (1)
MPI, in all implementations, has for a while now had the ability to continue after an error. The default is to die (that is, the default error handler is MPI_ERRORS_ARE_FATAL), but that can be changed (e.g., see the discussion here). The standard doesn't currently go much beyond that, though; that is, it's hard to recover and continue after such an error. If your program is sufficiently simple, such as some sort of master-worker setup, it may be possible to continue this way.
The MPI Forum is currently working on what will become MPI-3, and error handling and fault tolerance will be an important component of the new standard (there's a working group dedicated to the topic). Until that work is complete, however, the only way to get stronger fault tolerance out of MPI is to use earlier, nonstandard extensions. FT-MPI was a project that developed a very robust MPI, but unfortunately it's based on MPI-1.2, a very early version of the standard. The claim here is that they're now working with OpenMPI, but I don't know what's become of that. There's also MPICH-V, based on MPI-2, but that's more checkpoint-restart based than what I think you're looking for.
Updated to add: fault tolerance didn't make it into MPI-3, but the working group continues its work, and the expectation is that something will result from that before too long.