MPI_Comm_spawn on remote nodes

Posted 2024-10-03 21:43:17

How does one use MPI_Comm_spawn to start worker processes on remote nodes?

Using OpenMPI 1.4.3, I've tried this code:

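/* nprocs is an int defined elsewhere: the number of workers to spawn */
MPI_Info info;
MPI_Info_create(&info);
MPI_Info_set(info, "host", "node2");   /* ask for placement on node2 */
MPI_Comm intercom;
MPI_Comm_spawn("worker",
        MPI_ARGV_NULL,
        nprocs,
        info,
        0,                             /* root rank within the spawning communicator */
        MPI_COMM_SELF,
        &intercom,
        MPI_ERRCODES_IGNORE);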

But that fails with this error message:

--------------------------------------------------------------------------
There are no allocated resources for the application 
  worker
that match the requested mapping:


Verify that you have mapped the allocated resources properly using the 
--host or --hostfile specification.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------

If I replace "node2" with the name of my local machine, it works fine. If I ssh into node2 and run the same thing there (with "node2" in the info dictionary), it also works fine.

I don't want to start the parent process with mpirun, so I'm just looking for a way to dynamically spawn processes on remote nodes. Is this possible?

Comments (1)

So尛奶瓶 2024-10-10 21:43:17

"I don't want to start the parent process with mpirun, so I'm just looking for a way to dynamically spawn processes on remote nodes. Is this possible?"

I'm not sure why you don't want to start it with mpirun. You're implicitly starting up the whole MPI machinery anyway as soon as you hit MPI_Init(); launching with mpirun just lets you pass it options rather than relying on the defaults.

The issue here is simply that when the MPI library starts up (at MPI_Init()), it doesn't see any other hosts available, because you haven't given it any through mpirun's --host or --hostfile options. It won't launch processes elsewhere just on your say-so (indeed, MPI_Comm_spawn doesn't require a "host" info key, so in general it wouldn't even know where to go otherwise), so the spawn fails.

So you'll need to run
mpirun --host myhost,host2 -np 1 ./parentjob
or, more generally, provide a hostfile, preferably with a number of slots listed for each host

myhost slots=1
host2 slots=8
host3 slots=8

and, with that saved as mpihosts.txt, launch the job this way:
mpirun --hostfile mpihosts.txt -np 1 ./parentjob

This is a feature, not a bug: it is now MPI's job to figure out where the workers go, and if you don't specify a host explicitly in the info, it will try to put them in the most underutilized place. It also means you don't have to recompile to change the hosts you spawn to.
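The failure in the question comes down to asking for a host that is not in the set of hosts MPI was given. As a rough illustration of that check, here is a toy pre-flight sketch in plain C. It is not part of Open MPI and does not reproduce its hostfile parser: hostfile_slots is a hypothetical helper that scans hostfile-style text (lines such as "host2 slots=8") and reports whether a host is listed, and with how many slots. Treating a missing slots= entry as 0 is a choice made for this sketch only.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not an Open MPI API.
 * Scans hostfile-style text (lines such as "host2 slots=8") for `host`.
 * Returns the slots= value if present, 0 if the host is listed without a
 * slots= entry, and -1 if the host is not listed at all. */
int hostfile_slots(const char *text, const char *host)
{
    assert(text != NULL && host != NULL);
    const size_t hlen = strlen(host);
    const char *line = text;

    while (*line != '\0') {
        const char *nl = strchr(line, '\n');
        const char *end = nl ? nl : line + strlen(line);

        /* Skip leading blanks, then compare the first whole token. */
        const char *p = line;
        while (p < end && (*p == ' ' || *p == '\t'))
            p++;
        size_t rem = (size_t)(end - p);
        int match = rem >= hlen && strncmp(p, host, hlen) == 0 &&
                    (rem == hlen || p[hlen] == ' ' || p[hlen] == '\t');

        if (match) {
            /* Look for a blank followed by "slots=" on this line only,
             * so that a key such as max_slots= is not mistaken for it. */
            for (const char *q = p + hlen; q + 7 <= end; q++) {
                if ((*q == ' ' || *q == '\t') &&
                    strncmp(q + 1, "slots=", 6) == 0)
                    return atoi(q + 7);
            }
            return 0; /* listed, but no explicit slot count */
        }
        line = nl ? nl + 1 : end;
    }
    return -1; /* host does not appear in the text */
}
```

With the three-line hostfile from the answer as input, hostfile_slots(text, "host2") gives 8, while a host that is missing from the file, such as the question's node2, gives -1. A parent program could run a check like this on the "host" value before calling MPI_Comm_spawn and print a clearer message than the launcher's.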
