SGE MPI jobs only run on a specific set of hosts even though we have many nodes in the pool

Posted on 2025-01-22 08:40:42

We are seeing a strange issue in our SGE GPU queue. There are plenty of nodes available in the queue, but whenever we launch MPI parallel jobs, they always go to the same set of nodes (in our case, the same 4 GPU nodes). Once those nodes are saturated, further jobs remain in the "qw" state and do not progress. The remaining nodes in the queue are healthy and have identical settings.

This is our ppn4 config and job submission command:

qconf -sp ppn4
pe_name                ppn4
slots                  999999
used_slots             0
bound_slots            0
user_lists             NONE                  
xuser_lists            NONE                  
start_proc_args        NONE
stop_proc_args         NONE
per_pe_task_prolog     NONE
per_pe_task_epilog     NONE
allocation_rule        4
control_slaves         TRUE
job_is_first_task      FALSE
urgency_slots          min
accounting_summary     TRUE
daemon_forks_slaves    FALSE
master_forks_slaves    FALSE

mpirun -pe ppn4 16 -l gpu=4 -l <queue name> <job submission script>

Thank you
CS
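
For anyone reading the numbers above, here is a minimal sketch of the arithmetic they imply. The values come from the post (`-pe ppn4 16`, `allocation_rule 4`, `-l gpu=4`). The per-host GPU figure rests on an assumption the post does not confirm: that `gpu` is defined as a per-slot consumable, in which case SGE multiplies the request by the number of slots the job holds on each host. The variable names are just for illustration.

```shell
#!/bin/sh
# Values taken from the question.
total_slots=16        # from "-pe ppn4 16"
slots_per_host=4      # from "allocation_rule 4"
gpu_request=4         # from "-l gpu=4"

# A fixed integer allocation_rule places exactly that many slots on each host,
# so the job needs total_slots / slots_per_host hosts.
hosts_needed=$((total_slots / slots_per_host))

# ASSUMPTION: "gpu" is a per-slot consumable (consumable YES in the complex
# definition). The request is then multiplied by the slots held on each host.
gpus_per_host=$((slots_per_host * gpu_request))

echo "hosts needed: $hosts_needed"
echo "GPUs required per host if gpu is per-slot: $gpus_per_host"
```

If that assumption holds, only hosts that advertise at least that many GPUs could ever be selected, which might explain why the jobs keep landing on the same nodes. The consumable column in the output of `qconf -sc` should show how `gpu` is actually defined, and `qstat -j <job id>` may list scheduling reasons for a job stuck in "qw" if scheduler job info is enabled.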