Slurm controller cannot connect to workers, state is set to unknown

Posted on 2025-02-07 21:13:46

I am trying to set up a small cluster managed with Slurm. The controller is also a compute node. The node and partition definitions in /etc/slurm/slurm.conf are:

NodeName=controller,node[01-02] RealMemory=250000 Sockets=1 CoresPerSocket=32 ThreadsPerCore=2 State=UNKNOWN
PartitionName=compute Nodes=ALL Default=YES MaxTime=INFINITE State=UP

When running sinfo I get:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up   infinite      2   unk* node[01-02]
compute*     up   infinite      1   idle controller

However, when running slurmd -C on each node I get:

NodeName=node01 CPUs=64 Boards=1 SocketsPerBoard=1 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=257655
UpTime=0-00:30:44

The output is the same on the other node. I have opened ports 6817 and 6818 (the default Slurm ports) on all machines for TCP, which I assume is the protocol Slurm uses. I have also checked that /etc/slurm/slurm.conf and /etc/slurm/slurmdbd.conf are identical on all machines, along with the munge keys (munge authentication works).
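
One way to double-check the port claim above is a small reachability probe. This is only a sketch: the function names port_open and report_ports are hypothetical, the host names are the ones from the config above, and it assumes the default ports (6817 for slurmctld on the controller, 6818 for slurmd on the compute nodes) have not been changed in slurm.conf.

```shell
#!/usr/bin/env bash
# Sketch only: probe TCP ports from the current machine.
# port_open and report_ports are hypothetical helper names.

# Succeeds (exit 0) if a TCP connection to host:port opens within 3 seconds.
# Relies on bash's /dev/tcp redirection, so it needs no extra packages.
port_open() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Prints "<host>:<port> open" or "<host>:<port> unreachable" for each pair.
report_ports() {
    local pair host port
    for pair in "$@"; do
        host=${pair%%:*}
        port=${pair##*:}
        if port_open "$host" "$port"; then
            echo "$host:$port open"
        else
            echo "$host:$port unreachable"
        fi
    done
}

# Run on a compute node to test the controller (slurmctld, assumed port 6817):
#   report_ports controller:6817
# Run on the controller to test the compute nodes (slurmd, assumed port 6818):
#   report_ports node01:6818 node02:6818
```

An "unreachable" result does not by itself distinguish a firewall rule from a daemon that is not running, since a refused connection looks the same in both cases.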

Is there any way to debug the connection to a given machine?

Thanks in advance for any help.
