Is there any reason Slurm would not go beyond a certain number of nodes?

Posted on 2025-02-07 20:42:18, 821 characters, 3 views, 0 comments

I want to run about 400 jobs, out of about 2,000 array tasks, on Slurm on GCP.

The Slurm directives in my bash script and my slurm.config settings are as follows.

run.sh

#SBATCH -o ./out/vs.%j.out
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH -W
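For reference, the directives above do not themselves request an array, so the array range presumably comes from the submit command. A minimal sketch of how such a submission is commonly written, assuming "about 400 jobs from about 2,000 arrays" means a 2,000-task array throttled to 400 concurrent tasks (the counts and the variable names are illustrative, not taken from the post):

```shell
#!/bin/sh
# Assumed values: the post only says "about 2,000" and "about 400".
total_tasks=2000
max_concurrent=400

# sbatch's --array accepts "first-last%limit"; the %limit part caps how
# many array tasks may run at the same time.
array_spec="0-$((total_tasks - 1))%${max_concurrent}"

# Build the command and print it instead of running it, so this sketch
# has no side effects on a cluster.
submit_cmd="sbatch --array=${array_spec} run.sh"
printf '%s\n' "$submit_cmd"
# prints: sbatch --array=0-1999%400 run.sh
```

Note that `-W` (`--wait`) in run.sh makes `sbatch` block until the job finishes, which for an array means until every task has ended.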

slurm.config

MaxArraySize=50000
MaxJobCount=50000

#COMPUTE NODE
NodeName=DEFAULT CPUs=16 RealMemory=63216 State=UNKNOWN
NodeName=node-0-[0-599] State=CLOUD

Currently, 100 nodes are in use by work other than this task.

When I submit this task, only about 130-150 nodes' worth of tasks run in total; the rest are never executed.

Are there any additional parameters I need to set?

-- additional error log

[2022-06-20T01:18:41.294] error: get_addr_info: getaddrinfo() failed: Name or service not known
[2022-06-20T01:18:41.294] error: slurm_set_addr: Unable to resolve "node-333"
[2022-06-20T01:18:41.294] error: fwd_tree_thread: can't find address for host node-333, check slurm.conf
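One thing that can be checked directly from the excerpts is whether the host named in the log falls inside the node range from the config. A minimal sketch, assuming the `NodeName=node-0-[0-599]` line shown is the only node definition (the post shows only an excerpt, so there may be others); `host_in_range` is a hypothetical helper written for this illustration, not a Slurm command:

```shell
#!/bin/sh
# host_in_range HOST PREFIX LOW HIGH
# Succeeds if HOST is PREFIX followed by an integer in [LOW, HIGH].
host_in_range() {
  case "$1" in
    "$2"*) idx=${1#"$2"} ;;      # strip the prefix, keep the suffix
    *) return 1 ;;               # prefix does not match at all
  esac
  case "$idx" in
    ''|*[!0-9]*) return 1 ;;     # suffix must be digits only
  esac
  [ "$idx" -ge "$3" ] && [ "$idx" -le "$4" ]
}

# "node-333" is the name from the error log; "node-0-333" is shown for contrast.
for host in node-333 node-0-333; do
  if host_in_range "$host" "node-0-" 0 599; then
    echo "$host: inside node-0-[0-599]"
  else
    echo "$host: NOT inside node-0-[0-599]"
  fi
done
```

Under that assumption the name in the log does not match the configured pattern. On a machine with Slurm installed, `scontrol show hostnames 'node-0-[0-599]'` lists the expanded names, and `getent hosts <name>` shows whether a given name resolves.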
