mpi_comm_spawn on remote nodes
How does one use MPI_Comm_spawn to start worker processes on remote nodes?
Using OpenMPI 1.4.3, I've tried this code:
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "host", "node2");
MPI_Comm intercom;
MPI_Comm_spawn("worker",
               MPI_ARGV_NULL,
               nprocs,
               info,
               0,
               MPI_COMM_SELF,
               &intercom,
               MPI_ERRCODES_IGNORE);
But that fails with this error message:
--------------------------------------------------------------------------
There are no allocated resources for the application
  worker
that match the requested mapping:

Verify that you have mapped the allocated resources properly using the
--host or --hostfile specification.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
If I replace the "node2" with the name of my local machine, then it works fine. If I ssh into node2 and run the same thing there (with "node2" in the info dictionary) then it also works fine.
I don't want to start the parent process with mpirun, so I'm just looking for a way to dynamically spawn processes on remote nodes. Is this possible?
Comments (1)
I'm not sure why you don't want to start it with mpirun. You're implicitly starting up the whole MPI machinery anyway as soon as you hit MPI_Init(); launching through mpirun just lets you pass it options rather than relying on the defaults.
The issue here is simply that when the MPI library starts up (at MPI_Init()) it doesn't see any other hosts available, because you haven't given it any via the --host or --hostfile options to mpirun. It won't launch processes elsewhere just on your say-so (indeed, MPI_Comm_spawn doesn't require a "host" key in the info, so in general it wouldn't even know where to go otherwise), so it fails.
So you'll need to do
mpirun --host myhost,host2 -np 1 ./parentjob
or, more generally, provide a hostfile, preferably with a number of slots available, and launch the job this way:
mpirun --hostfile mpihosts.txt -np 1 ./parentjob
This is a feature, not a bug: it's now MPI's job to figure out where the workers go, and if you don't specify a host explicitly in the info, it'll try to put them in the most underutilized place. It also means you don't have to recompile to change the hosts you spawn to.
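The contents of mpihosts.txt are not shown above. As a rough sketch, an Open MPI hostfile lists one host per line, optionally followed by a slots=N count. The host names below are taken from the --host example; the slot counts are assumptions chosen only for illustration and should match the cores you actually want to use on each machine.

```
# mpihosts.txt (slot counts are hypothetical)
myhost slots=1
host2 slots=4
```

With a file like this, the parent started with -np 1 would occupy one slot, leaving the slots on host2 free for the processes created by MPI_Comm_spawn.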