Using mpirun on OS X on a single machine
I have trouble using mpirun in single-machine mode on OS X. When I run my program with mpirun -np 5 my_program, I get the following error output:
[...-MacBook-Pro.local:85936] [0,0,0] ORTE_ERROR_LOG: Timeout in file /SourceCache/openmpi/openmpi-8/openmpi/orte/mca/pls/base/pls_base_orted_cmds.c at line 275
[...-MacBook-Pro.local:85936] [0,0,0] ORTE_ERROR_LOG: Timeout in file /SourceCache/openmpi/openmpi-8/openmpi/orte/mca/pls/rsh/pls_rsh_module.c at line 1158
[...-MacBook-Pro.local:85936] [0,0,0] ORTE_ERROR_LOG: Timeout in file /SourceCache/openmpi/openmpi-8/openmpi/orte/mca/errmgr/hnp/errmgr_hnp.c at line 90
mpirun noticed that job rank 1 with PID 85940 on node ...-MacBook-Pro.local exited on signal 6 (Abort trap).
2 additional processes aborted (not shown)
Apparently, mpirun uses rsh by default for connecting to machines. I tried using ssh instead, but it didn't help:
mpirun --mca pls_rsh_agent ssh -np 5 my_program
Then I tried using the shared-memory (sm) BTL, which didn't help either:
mpirun --mca btl self,sm -np 5 my_program
Finally, I tried using a machine file to specify that I only want to use localhost, which didn't help either:
mpirun -np 5 -machinefile machinefile.local my_program
Here, machinefile.local contains only localhost on its single line.
In all of the above cases, I get the above timeout error.
Also, I verified that my Mac OS X firewall wasn't running and that I could ssh into my machine.
Comments (2)
So it looks like you're using a version of OpenMPI from fink, is that right? Do you still have the original 1.2.x MPI in /usr/bin and /usr/lib? The first place to look for weird launching issues is conflicting versions of the MPI libraries.
First try something simple like /usr/bin/mpirun -np 5 hostname, and then do the same thing with your fink mpirun, wherever it is: /path/to/fink/mpirun -np 5 hostname, just to make sure both MPI launchers work on a non-MPI program. Then run ldd on my_program (on OS X, otool -L is the equivalent); which libraries is it linking to? Use the appropriate mpirun for those libraries, and see if that works.
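The checks above could be scripted roughly as follows. This is only a sketch: the function names check_launcher and linked_mpi_libs are made up for illustration, and it assumes otool is available on OS X (it falls back to ldd elsewhere).

```shell
#!/bin/sh
# Hypothetical helper: run one mpirun launcher on a non-MPI program
# (hostname) and report whether the launch itself succeeds.
check_launcher() {
    launcher=$1
    if [ ! -x "$launcher" ]; then
        echo "skip: $launcher (not found or not executable)"
        return 0
    fi
    if "$launcher" -np 5 hostname >/dev/null 2>&1; then
        echo "ok: $launcher"
    else
        echo "FAILED: $launcher"
    fi
}

# Hypothetical helper: list the MPI-related libraries a binary links to.
# otool -L is the OS X tool; ldd is used where otool is not installed.
linked_mpi_libs() {
    binary=$1
    if command -v otool >/dev/null 2>&1; then
        otool -L "$binary" | grep -i mpi
    elif command -v ldd >/dev/null 2>&1; then
        ldd "$binary" | grep -i mpi
    else
        echo "neither otool nor ldd found" >&2
        return 1
    fi
}
```

After sourcing the file, you would call check_launcher /usr/bin/mpirun, then check_launcher with the path to your fink mpirun, and finally linked_mpi_libs ./my_program to see which installation the binary was built against.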
Check your firewall and make sure it allows mpirun to establish inbound and outbound connections.
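One way to check the firewall from a terminal, as a rough sketch: it assumes the application firewall tool lives at /usr/libexec/ApplicationFirewall/socketfilterfw and accepts --getglobalstate, which may not hold on every OS X release (and may need sudo on some). The SOCKETFILTERFW variable and the firewall_state function are names invented for this example.

```shell
#!/bin/sh
# Assumed location of the OS X application firewall tool; override
# SOCKETFILTERFW if it lives elsewhere on your system.
SOCKETFILTERFW=${SOCKETFILTERFW:-/usr/libexec/ApplicationFirewall/socketfilterfw}

# Hypothetical helper: print the global firewall state, or say that
# the tool could not be found.
firewall_state() {
    if [ ! -x "$SOCKETFILTERFW" ]; then
        echo "unknown: $SOCKETFILTERFW not found"
        return 2
    fi
    "$SOCKETFILTERFW" --getglobalstate
}
```

Calling firewall_state prints whatever the tool reports. If the firewall turns out to be enabled, allow mpirun (and orted) under the firewall settings in System Preferences and retry the launch.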