SGE MPI jobs run only on a specific set of hosts even though we have many nodes in the pool
We are seeing a strange issue in our SGE GPU queue. Plenty of nodes are available in the queue, but whenever we launch MPI parallel jobs they always go to the same set of nodes (in our case, the same 4 GPU nodes). Once those nodes are saturated, further jobs remain in the "qw" state and do not progress. The remaining nodes in the queue are healthy and have identical settings.
This is our ppn4 PE configuration and job submission command:
qconf -sp ppn4
pe_name ppn4
slots 999999
used_slots 0
bound_slots 0
user_lists NONE
xuser_lists NONE
start_proc_args NONE
stop_proc_args NONE
per_pe_task_prolog NONE
per_pe_task_epilog NONE
allocation_rule 4
control_slaves TRUE
job_is_first_task FALSE
urgency_slots min
accounting_summary TRUE
daemon_forks_slaves FALSE
master_forks_slaves FALSE
mpirun -pe ppn4 16 -l gpu=4 -l <queue name> <job submissionscript>
Thank you
CS
Comments (1)
I suppose you have already solved the issue, but just in case.
In your command
mpirun -pe ppn4 16 ....
16 is the total number of slots the job will use across the cluster under the selected PE. With allocation_rule 4, the PE places 4 slots on each node, so the 16 slots you are requesting work out to 4 nodes x 4 slots. You have to increase that slot count in order to load more nodes.
Best,
V
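The arithmetic in the answer above can be sketched as a small shell helper. This is only an illustration, assuming a PE with a fixed integer allocation_rule (slots per host); `hosts_for_slots` is a hypothetical function written for this example, not an SGE command, and it does not query the scheduler.

```shell
#!/bin/sh
# hosts_for_slots TOTAL_SLOTS ALLOCATION_RULE
# Prints how many hosts one job occupies when the PE places a fixed
# number of slots (ALLOCATION_RULE) on every host.
# Hypothetical helper for illustration only; not part of SGE.
hosts_for_slots() {
    total=$1
    per_host=$2
    # With a fixed integer allocation_rule the slot request has to be
    # a multiple of it, so reject anything that does not divide evenly.
    if [ $((total % per_host)) -ne 0 ]; then
        echo "error: $total slots is not a multiple of $per_host" >&2
        return 1
    fi
    echo $((total / per_host))
}

hosts_for_slots 16 4   # the request from the question: -pe ppn4 16
hosts_for_slots 32 4   # doubling the slot count
```

For the request in the question this prints 4, matching the 4 GPU nodes each job lands on; a request of 32 slots under the same PE would span 8 nodes.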