Do MPI blocking calls (MPI_Send/Recv) have a time limit?
I am submitting MPI jobs on my university cluster. With larger programs, I have noticed that my program crashes during one of my final communication routines, with almost no helpful error message:
mpirun noticed that process rank 0 with PID 5466 on node red0005 exited on signal 9 (Killed).
The only helpful thing in all of that is that rank 0 caused the problem. The final communication routine works as follows (where <--> means MPI_Send/MPI_Recv):
rank 0    rank 1    rank 2    rank 3   ...   rank n
  |         <-->      <-->      <-->          <-->
  |
  |
  |
  V
----------------------MPI_Barrier()------------------
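For concreteness, here is a minimal sketch of that pattern in C. The message size, tag, and neighbor pairing are assumptions for illustration, not my actual code:

#include <mpi.h>
#include <stdlib.h>

#define N 1024                      /* hypothetical message size */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double *sendbuf = calloc(N, sizeof(double));
    double *recvbuf = calloc(N, sizeof(double));

    if (rank > 0) {
        /* Ranks 1..n pair up with a neighbor; rank 0 skips the
           exchange and goes straight to the barrier. */
        int partner = (rank % 2 == 1) ? rank + 1 : rank - 1;
        if (partner < size) {
            /* A combined send/receive avoids the deadlock of two
               blocking MPI_Send calls waiting on each other. */
            MPI_Sendrecv(sendbuf, N, MPI_DOUBLE, partner, 0,
                         recvbuf, N, MPI_DOUBLE, partner, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
    }

    /* Rank 0 arrives here first and waits for every other rank. */
    MPI_Barrier(MPI_COMM_WORLD);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}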
My guess is that rank 0 hits MPI_Barrier(), waits for a very long period (570-1200 s), and then causes an exception. Alternatively, the machines might be running out of memory. When my local machine runs out of memory, I get a very detailed out-of-memory warning, but I have no idea what is going on on the remote machine. Any ideas what this might mean?
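One way to test the out-of-memory theory would be to log each rank's peak resident set size just before this routine. A rough sketch using the POSIX getrusage call (the helper name is hypothetical):

#include <mpi.h>
#include <stdio.h>
#include <sys/resource.h>

/* Print each rank's peak resident set size; call this right before
   the communication routine that precedes the crash. */
static void report_peak_rss(MPI_Comm comm)
{
    int rank;
    struct rusage ru;
    MPI_Comm_rank(comm, &rank);
    getrusage(RUSAGE_SELF, &ru);
    /* On Linux, ru_maxrss is reported in kilobytes. */
    printf("rank %d: peak RSS = %ld KiB\n", rank, ru.ru_maxrss);
    fflush(stdout);
}

If one rank's peak climbed toward the node's memory limit, that would point at the kernel OOM killer or a scheduler memory limit rather than at MPI itself.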
It's most definitely not a timeout; MPI blocking routines have no such built-in time limit. If your cluster has a different MPI library (or the same MPI library compiled with a different compiler) or a different startup mechanism, give that a try. It's probably an issue with the library (or a bug in your program).