Slurm控制器无法连接到工人,状态被设置为未知
我正在尝试设置一个由Slurm管理的小集群。控制器也是计算节点。 /etc/slurm/slurm.conf
IN Config In 是:
NodeName=controller,node[01-02] RealMemory=250000 Sockets=1 CoresPerSocket=32 ThreadsPerCore=2 State=UNKNOWN
PartitionName=compute Nodes=ALL Default=YES MaxTime=INFINITE State=UP
运行sinfo
时,我会得到:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up infinite 2 unk* node[01-02]
compute* up infinite 1 idle controller
但是,在每个节点上运行slurmd -c
时我得到:
NodeName=node01 CPUs=64 Boards=1 SocketsPerBoard=1 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=257655
UpTime=0-00:30:44
另一个节点上相同。我允许端口6817
和6818
(默认的slurm端口)在所有计算机上(对于TCP-我假设是协议)。我还检查了/etc/slurm/slurm.conf
和/etc/slurm/slurmdbd.conf
是相同的, 。
无论如何是否可以调试与给定机器的连接?
事先感谢您的任何帮助。
I am trying to setup a small cluster, managed with SLURM. The controller is also a compute node. The config in /etc/slurm/slurm.conf
is:
NodeName=controller,node[01-02] RealMemory=250000 Sockets=1 CoresPerSocket=32 ThreadsPerCore=2 State=UNKNOWN
PartitionName=compute Nodes=ALL Default=YES MaxTime=INFINITE State=UP
When running sinfo
I get:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
compute* up infinite 2 unk* node[01-02]
compute* up infinite 1 idle controller
However, when running slurmd -C
on each node I get:
NodeName=node01 CPUs=64 Boards=1 SocketsPerBoard=1 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=257655
UpTime=0-00:30:44
The same on the other node. I have allowed the ports 6817
and 6818
(the default slurm ports) on all machines (for TCP - which I assume is the protocol). I have also checked that the /etc/slurm/slurm.conf
and /etc/slurm/slurmdbd.conf
are the same, along with the munge keys (this works).
Is there anyway to debug the connection to a given machine?
Thanks in advance for any help.
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
data:image/s3,"s3://crabby-images/d5906/d59060df4059a6cc364216c4d63ceec29ef7fe66" alt="扫码二维码加入Web技术交流群"
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(1)
我能够浏览日志文件,发现连接已被阻止。群集使用Fedora,因此我使用此链接使用此链接将每台计算机添加到防火墙受信任的列表中 - -7“> CentOS 7中的白名单源IP地址
这些更新的防火墙设置似乎并未立即应用,因此我不得不重新启动所有机器,现在Slurm正常运行。
I was able to go through the log files and found out the connections were being blocked. The cluster is using Fedora and so I added each machine to the firewall trusted list using this link - whitelist source ip addresses in centos 7
These updated firewall settings did not seem to be applied straight away so I had to restart all machines and now SLURM is functioning correctly.