Slurm: new worker nodes
I want to build a cluster environment where nodes are automatically created and deleted. The jobs are to be distributed to the various nodes using Slurm.
Two questions:
- Is there an agent or similar for the Slurm workers so that the nodes automatically register with the head node?
- Is it possible to change the Slurm config file during runtime? (since new worker nodes could be added or deleted).

Comments (1)
You would need to restart the Slurm daemons (slurmctld on the head node and slurmd on the workers) for changes to slurm.conf to take effect, which could be problematic for jobs that are running. If the Slurm control daemon finds that a node's slurm.conf differs from its own (checksum mismatch), you may see errors (job failures or worse). See the official docs on adding nodes: https://slurm.schedmd.com/faq.html#add_nodes.
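Since the mismatch problem comes from copies of slurm.conf drifting apart, one way to guard against it is to generate the node lines from a single source and compare copies before restarting anything. Here is a minimal sketch of that idea in Python. The node names, CPU counts, the partition name `batch`, and the helper names are all hypothetical, and the SHA-256 comparison is only a stand-in for illustration; it is not the hash Slurm computes internally.

```python
import hashlib


def render_node_lines(nodes, partition="batch"):
    """Render NodeName/PartitionName lines for a slurm.conf fragment.

    `nodes` maps a hostname to its CPU count (hypothetical values).
    """
    lines = [
        f"NodeName={name} CPUs={cpus} State=UNKNOWN"
        for name, cpus in sorted(nodes.items())
    ]
    names = ",".join(sorted(nodes))
    lines.append(f"PartitionName={partition} Nodes={names} Default=YES State=UP")
    return "\n".join(lines) + "\n"


def digest(text):
    """Return a SHA-256 hex digest of the config text (illustrative only)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def configs_match(head_conf, worker_conf):
    """True if two copies of the config text are byte-for-byte identical."""
    return digest(head_conf) == digest(worker_conf)


# Config before and after a third worker is added to the cluster.
old = render_node_lines({"worker1": 4, "worker2": 4})
new = render_node_lines({"worker1": 4, "worker2": 4, "worker3": 8})

# A worker still holding `old` would not match a head node holding `new`.
print(configs_match(old, new))
print(new, end="")
```

The point of the design is that every node gets its config from the same rendered text, so a check like `configs_match` can tell you whether a copy is stale before you restart the daemons.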