Fault tolerance in MPICH/OpenMPI

Posted on 2024-10-25 20:41:34

I have two questions:

Q1. Is there a more efficient way to handle errors in MPI than checkpoint/rollback? I see that if a node "dies", the program halts abruptly. Is there any way to continue execution after a node dies? (It's fine if this comes at the cost of accuracy.)

Q2. I read in "http://stackoverflow.com/questions/144309/what-is-the-best-mpi-implementation" that OpenMPI has better fault tolerance and that MPICH-2 has recently added similar features. Does anybody know what they are and how to use them? Is it a "mode"? Can they help in the situation described in Q1?

Kindly reply. Thank you.

Comments (1)

紧拥背影 2024-11-01 20:41:36

All MPI implementations have long been able to continue after an error. The default is to die (that is, the default error handler is MPI_ERRORS_ARE_FATAL), but that can be changed (e.g., see the discussion here). The standard doesn't currently go much beyond that, though; it's hard to recover and continue after such an error. If your program is sufficiently simple, such as some sort of master-worker setup, it may be possible to continue this way.

The MPI Forum is currently working on what will become MPI-3, and error handling and fault tolerance will be an important component of the new standard (there's a working group dedicated to the topic). Until that work is complete, however, the only way to get stronger fault tolerance out of MPI is to use earlier, nonstandard extensions. FT-MPI was a project that developed a very robust MPI, but unfortunately it's based on MPI 1.2, a very early version of the standard. The claim here is that they're now working with OpenMPI, but I don't know what's become of that. There's also MPICH-V, based on MPI-2, but that's more checkpoint-restart based than what I think you're looking for.

Updated to add: Fault tolerance didn't make it into MPI-3, but the working group continues its work and the expectation is that something will result from that before too long.
