mpiexec using wrong number of CPUs
I am trying to set up an MPI cluster, but I have the problem that the CPU counts I put in the mpd.hosts file are not used correctly.
I have three Ubuntu servers.
opteron with 48 Cores
calc1 with 8 Cores
calc2 with 8 Cores.
My mpd.hosts looks like:
opteron:46
calc1:6
calc2:6
After booting (mpdboot -n 3 -f mpd.hosts) the system is running, and mpdtrace lists all three hosts.
But running a program like "mpiexec -n 58 raxmlHPC-MPI ..." results in calc1 and calc2 getting too many jobs while opteron gets too few.
What am I doing wrong?
Regards
Bjoern
2 Answers
I found a workaround.
I used the additional parameter "-machinefile /path/to/mpd.hosts" for the mpiexec command. And now, all nodes are running correctly.
One problem I ran into was the following error message:
... MPIU_SHMW_Seg_create_attach_templ(671): open failed - No such file or directory ...
To fix it, I had to set the environment variable
MPICH_NO_LOCAL=1
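
For reference, a rough sketch of the full workaround, assuming the /path/to/mpd.hosts file and the 58-process raxmlHPC-MPI run from the question (58 = 46 + 6 + 6 slots):

  # start the MPD ring on all three hosts
  mpdboot -n 3 -f /path/to/mpd.hosts
  # work around the "open failed" shared-memory bug; depending on your setup you may
  # also need to propagate the variable to the other nodes (e.g. via mpiexec's -genv)
  export MPICH_NO_LOCAL=1
  # pass the same host file to mpiexec so the per-host process counts are honoured
  mpiexec -machinefile /path/to/mpd.hosts -n 58 raxmlHPC-MPI ...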
As you figured out, you must pass the machinefile to both mpdboot and mpiexec in order to use per-host process counts. The "open failed" issue is a known bug in MPD, the process manager you are using. Note that the MPICH_NO_LOCAL=1 workaround will work, but will probably result in a big performance penalty for intranode communication.

You are clearly using MPICH2 (or an MPICH2 derivative), but it's not clear what version you are using. If you can, I would strongly recommend upgrading to either MPICH2 1.2.1p1 or (better yet) 1.3.1. Both of these releases include a newer process manager called hydra that is much faster and more robust. In 1.3.1, hydra is the default process manager. It doesn't require an mpdboot phase, and it supports a $HYDRA_HOST_FILE environment variable so that you don't have to specify the machine file on every mpiexec.
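
For completeness, a minimal sketch of the Hydra equivalent, assuming the same host file (Hydra host files use the same host:count syntax) and the same 58-process job:

  # no mpdboot phase is needed with Hydra
  export HYDRA_HOST_FILE=/path/to/mpd.hosts
  mpiexec -n 58 raxmlHPC-MPI ...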