OpenMPI does not recognize multiple nodes?

Posted on 2025-01-15 16:08:41

I am trying to run a Julia script in parallel on a cluster.
The cluster uses Moab and Torque for the scheduler and resource manager.
Since SSH seems to be restricted, I use MPI for multiprocessing.

I submit the following job, requesting 3 nodes:

#!/bin/bash
#PBS -l walltime=1:00:00
#PBS -l pmem=10gb   
#PBS -l nodes=3:ppn=1
#PBS -j oe
#PBS -A open
#PBS -o (some path)
#PBS -e (some path)

cd (some path)
echo ""
echo "JOB Started on $(hostname -s) at $(date)"

echo ""
module purge
module use (some path)/modules
module load julia
module load openmpi
mpirun -np 3 -display-allocation julia --project=.  "(some path)/test.jl"

echo ""
echo "JOB ended at $(date)"

But when I look at the job output, it seems that only one node, comp-bc-0384, is recognized:

JOB Started on comp-bc-0384 at Sat Mar 19 22:05:12 EDT 2022


======================   ALLOCATED NODES   ======================
    comp-bc-0384: slots=24 max_slots=0 slots_inuse=0 state=UP
=================================================================
--------------------------------------------------------------------------
[[12308,1],2]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: comp-bc-0384

Another transport will be used instead, although this may result in
lower performance.

NOTE: You can disable this warning by setting the MCA parameter
btl_base_warn_component_unused to 0.
--------------------------------------------------------------------------
[comp-bc-0384.acib.production.int.aci.ics.psu.edu:10656] 2 more processes have sent help message help-mpi-btl-base.txt / btl:no-nics
[comp-bc-0384.acib.production.int.aci.ics.psu.edu:10656] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
 10.214858 seconds (116.21 k allocations: 6.110 MiB)

JOB ended at Sat Mar 19 22:05:36 EDT 2022
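
To compare what Torque actually allocated with what mpirun reports, a small check along these lines could be placed in the job script before the mpirun line. This is only a sketch and not part of the job above: summarize_nodefile is a hypothetical helper name, and it assumes the usual Torque convention that $PBS_NODEFILE contains one line per allocated slot.

```shell
#!/bin/bash
# Print each distinct host in a Torque node file with its slot count.
# Assumption: one line per allocated slot, so a host listed N times has N slots.
summarize_nodefile() {
    local nodefile="$1"
    if [ ! -r "$nodefile" ]; then
        echo "node file not readable: $nodefile" >&2
        return 1
    fi
    # Group identical host names and report how often each one appears.
    sort "$nodefile" | uniq -c | awk '{ print $2 ": slots=" $1 }'
}

# Inside a job, PBS_NODEFILE is set by Torque; outside a job this is skipped.
if [ -n "${PBS_NODEFILE:-}" ]; then
    echo "Hosts allocated by the scheduler:"
    summarize_nodefile "$PBS_NODEFILE"
fi
```

If this lists three hosts while -display-allocation lists one, the scheduler granted the nodes and the mismatch is on the mpirun side.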

I was expecting the ALLOCATED NODES section to list all of the nodes I was assigned, not just one.
A similar question in the past (openMPI/mpich2 doesn't run on multiple nodes) suggests that this has something to do with the host file.
Therefore, I also tried mpirun -hostfile $PBS_NODEFILE -np 3 -display-allocation julia --project=. "(some path)/test.jl". That run returns the following:

JOB Started on comp-bc-0384 at Sat Mar 19 22:16:15 EDT 2022

Host key verification failed.
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------

JOB ended at Sat Mar 19 22:16:16 EDT 2022
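
"Host key verification failed." is an SSH client message, which suggests that mpirun tried to start the remote daemons over ssh. One thing that may be worth checking is whether the loaded Open MPI module was built with Torque (tm) support, since a build with it can read the allocation and launch through Torque without ssh. The following is a sketch only: has_tm_support is a hypothetical helper name, and the "MCA plm: tm" / "MCA ras: tm" line shape is an assumption based on typical ompi_info output.

```shell
#!/bin/bash
# Read `ompi_info` output on stdin and succeed if a Torque (tm) launcher (plm)
# or allocator (ras) component is listed.
# Assumption: component lines look like "MCA plm: tm (MCA v..., ...)".
has_tm_support() {
    grep -Eq 'MCA (plm|ras): tm([[:space:]]|$)'
}

# Only run the check where ompi_info is available (after `module load openmpi`).
if command -v ompi_info >/dev/null 2>&1; then
    if ompi_info | has_tm_support; then
        echo "tm components found: this build should be able to use the Torque allocation"
    else
        echo "no tm components found: this build probably falls back to ssh"
    fi
fi
```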

What could be the cause here?
