What could be causing Slurm not to use more than a certain number of nodes?
I want to run about 400 jobs in Slurm on GCP from an array of about 2,000 tasks.
The SBATCH settings in my batch script and my slurm.conf settings are as follows.
run.sh
#SBATCH -o ./out/vs.%j.out
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH -W
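For context, a job array of this shape would be submitted roughly as sketched below; the index range and the %400 concurrency throttle are illustrative, not taken from the actual submission.
# submit ~2,000 array tasks, with at most 400 running at the same time
sbatch --array=0-1999%400 run.sh
Since --ntasks=1 and --cpus-per-task=16 match the 16-CPU nodes, each array task occupies a full node, so 400 concurrently running tasks would need 400 free nodes.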
slurm.conf
MaxArraySize=50000
MaxJobCount=50000
#COMPUTE NODE
NodeName=DEFAULT CPUs=16 RealMemory=63216 State=UNKNOWN
NodeName=node-0-[0-599] State=CLOUD
Currently, 100 nodes are already in use for work other than this task.
When this job array runs, only about 130-150 tasks (nodes) in total are executed, and the rest are not.
Are there any additional parameters that need to be set?
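For reference, the reason pending tasks are held back and the effective limits can be inspected with standard Slurm commands; a minimal sketch (output omitted):
# list pending array tasks one per line, with the reason they are not starting (%R)
squeue -t PENDING -r -o "%.18i %.9P %.8T %R"
# dump the limits the running slurmctld actually uses
scontrol show config | grep -Ei 'MaxJobCount|MaxArraySize|TreeWidth'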
-- additional error log
[2022-06-20T01:18:41.294] error: get_addr_info: getaddrinfo() failed: Name or service not known
[2022-06-20T01:18:41.294] error: slurm_set_addr: Unable to resolve "node-333"
[2022-06-20T01:18:41.294] error: fwd_tree_thread: can't find address for host node-333, check slurm.conf
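For reference, slurm.conf also has cloud-related options that come up in connection with this kind of name-resolution error; the lines below are an illustrative sketch, not copied from this cluster, and whether they apply here is part of the question.
# illustrative cloud-related settings (values are examples only)
SlurmctldParameters=cloud_dns          # resolve cloud node names via DNS instead of a cached NodeAddr
CommunicationParameters=NoAddrCache    # do not cache node addresses that change as cloud nodes are recreated
TreeWidth=600                          # at >= 600, the controller contacts all defined nodes directly instead of forwarding through other nodes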
Comment (1)
I found a workaround for the additional error.
https://groups.google.com/g/slurm-users/c/y-QZKDbYfIk
Following that article, you can edit slurmctld.service / slurmd.service / slurmdbd.service.
However, the limit on the number of nodes that actually run jobs still remains.
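A minimal sketch of the mechanics of that edit, without reproducing the specific override from the thread (what goes into the override depends on the setup):
# open an override for the unit; repeat for slurmd.service and slurmdbd.service
sudo systemctl edit slurmctld.service
# ...add the override settings described in the linked thread...
sudo systemctl daemon-reload
sudo systemctl restart slurmctld.service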