OpenMPI 无法识别多个节点?
我正在尝试在集群上并行运行 Julia 脚本。 集群使用 Moab 和 Torque 作为调度程序和资源管理器。 由于 SSH 似乎受到限制,我使用 MPI 进行多处理。
我抛出以下作业,请求 3 个节点:
#!/bin/bash
#PBS -l walltime=1:00:00
#PBS -l pmem=10gb
#PBS -l nodes=3:ppn=1
#PBS -j oe
#PBS -A open
#PBS -o (some path)
#PBS -e (some path)
cd (some path)
echo ""
echo "JOB Started on $(hostname -s) at $(date)"
echo ""
module purge
module use (some path)/modules
module load julia
module load openmpi
mpirun -np 3 -display-allocation julia --project=. "(some path)/test.jl"
echo ""
echo "JOB ended at $(date)"
但是如果我查看输出脚本,它似乎只识别一个节点,comp-bc-0384
:
JOB Started on comp-bc-0384 at Sat Mar 19 22:05:12 EDT 2022
====================== ALLOCATED NODES ======================
comp-bc-0384: slots=24 max_slots=0 slots_inuse=0 state=UP
=================================================================
--------------------------------------------------------------------------
[[12308,1],2]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: comp-bc-0384
Another transport will be used instead, although this may result in
lower performance.
NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
[comp-bc-0384.acib.production.int.aci.ics.psu.edu:10656] 2 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[comp-bc-0384.acib.production.int.aci.ics.psu.edu:10656] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
10.214858 seconds (116.21 k allocations: 6.110 MiB)
JOB ended at Sat Mar 19 22:05:36 EDT 2022
我期待的是 ALLOCATED NODES
部分显示我分配到的其他节点。 过去有一个类似的问题(openMPI/mpich2 不能在多个节点上运行< /a>) 表明它与主机文件有关。 因此,我还尝试了 mpirun -hostfile $PBS_NODEFILE -np 3 -display-allocation julia --project=。 “(某个路径)/test.jl”。然后它返回以下内容:
JOB Started on comp-bc-0384 at Sat Mar 19 22:16:15 EDT 2022
Host key verification failed.
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
JOB ended at Sat Mar 19 22:16:16 EDT 2022
这可能是什么原因?
I am trying to run a Julia script in paralell on a cluster.
The cluster uses Moab and Torque for the scheduler and resource manager.
Since SSH seems to be restricted, I use MPI for multiprocessing.
I throw the following job, requesting for 3 nodes:
#!/bin/bash
#PBS -l walltime=1:00:00
#PBS -l pmem=10gb
#PBS -l nodes=3:ppn=1
#PBS -j oe
#PBS -A open
#PBS -o (some path)
#PBS -e (some path)
cd (some path)
echo ""
echo "JOB Started on $(hostname -s) at $(date)"
echo ""
module purge
module use (some path)/modules
module load julia
module load openmpi
mpirun -np 3 -display-allocation julia --project=. "(some path)/test.jl"
echo ""
echo "JOB ended at $(date)"
But it if I look at the output script, it seems that it recognizes only one node, comp-bc-0384
:
JOB Started on comp-bc-0384 at Sat Mar 19 22:05:12 EDT 2022
====================== ALLOCATED NODES ======================
comp-bc-0384: slots=24 max_slots=0 slots_inuse=0 state=UP
=================================================================
--------------------------------------------------------------------------
[[12308,1],2]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: comp-bc-0384
Another transport will be used instead, although this may result in
lower performance.
NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
[comp-bc-0384.acib.production.int.aci.ics.psu.edu:10656] 2 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[comp-bc-0384.acib.production.int.aci.ics.psu.edu:10656] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
10.214858 seconds (116.21 k allocations: 6.110 MiB)
JOB ended at Sat Mar 19 22:05:36 EDT 2022
I was expecting the ALLOCATED NODES
section to display the other node(s) I was assigned to.
A similar question in the past (openMPI/mpich2 doesn't run on multiple nodes) suggests that it has something to do with host file.
Therefore I also tried with mpirun -hostfile $PBS_NODEFILE -np 3 -display-allocation julia --project=. "(some path)/test.jl"
. It then returns the following:
JOB Started on comp-bc-0384 at Sat Mar 19 22:16:15 EDT 2022
Host key verification failed.
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
--------------------------------------------------------------------------
JOB ended at Sat Mar 19 22:16:16 EDT 2022
What could be the cause here?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论