Slurm controller cannot connect to workers, state is set to unknown

Posted on 2025-02-07 21:13:46

I am trying to set up a small cluster managed with Slurm. The controller is also a compute node. The config in /etc/slurm/slurm.conf is:

NodeName=controller,node[01-02] RealMemory=250000 Sockets=1 CoresPerSocket=32 ThreadsPerCore=2 State=UNKNOWN
PartitionName=compute Nodes=ALL Default=YES MaxTime=INFINITE State=UP

When running sinfo I get:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
compute*     up   infinite      2   unk* node[01-02]
compute*     up   infinite      1   idle controller

However, when running slurmd -C on each node I get:

NodeName=node01 CPUs=64 Boards=1 SocketsPerBoard=1 CoresPerSocket=32 ThreadsPerCore=2 RealMemory=257655
UpTime=0-00:30:44

The output is the same on the other node. I have allowed ports 6817 and 6818 (the default Slurm ports) on all machines for TCP, which I assume is the protocol. I have also checked that /etc/slurm/slurm.conf and /etc/slurm/slurmdbd.conf are identical on all machines, along with the munge keys (munge works).

Is there any way to debug the connection to a given machine?

Thanks in advance for any help.


Comments (1)

人海汹涌 2025-02-14 21:13:46

I went through the log files and found that the connections were being blocked. The cluster runs Fedora, so I added each machine to the firewall's trusted list, following the guide "Whitelist source IP addresses in CentOS 7".
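For anyone repeating this, here is a minimal sketch of what "adding each machine to the trusted list" might look like with firewalld. It only builds and prints the firewall-cmd invocations and does not run them. The addresses are placeholders from the documentation range and the helper name trusted_source_commands is made up for illustration; the post does not give the actual IPs or the exact commands used.

```python
import ipaddress


def trusted_source_commands(addresses, zone="trusted"):
    """Build firewall-cmd argument lists that add each address to a zone.

    Hypothetical helper: it only returns the commands, it never executes them.
    """
    commands = []
    for addr in addresses:
        # Reject malformed addresses early (raises ValueError).
        ipaddress.ip_address(addr)
        commands.append(
            ["firewall-cmd", "--permanent", f"--zone={zone}", f"--add-source={addr}"]
        )
    # --permanent only changes the saved configuration; the running
    # firewall picks the rules up after a reload.
    commands.append(["firewall-cmd", "--reload"])
    return commands


if __name__ == "__main__":
    # Placeholder addresses (assumption), one per cluster machine.
    for cmd in trusted_source_commands(["192.0.2.10", "192.0.2.11", "192.0.2.12"]):
        print(" ".join(cmd))
```

The printed lines would be run as root on every machine. The trailing reload is there because rules added with --permanent do not affect the running firewall until it is reloaded.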

These updated firewall settings did not seem to be applied straight away, so I had to restart all machines. Slurm is now functioning correctly.
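On the original question of how to debug the connection to a given machine: one option is a plain TCP probe of the Slurm ports from each side, which can also confirm that a firewall change has taken effect without a reboot. The sketch below uses only the Python standard library. It assumes the default ports from the question (6817 for slurmctld on the controller, 6818 for slurmd on the compute nodes); the function names are illustrative.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def probe(targets, timeout=3.0):
    """Probe (host, port) pairs and return {(host, port): reachable}."""
    return {(host, port): port_open(host, port, timeout) for host, port in targets}


def report(targets, timeout=3.0):
    """Print one line per target so blocked ports stand out."""
    for (host, port), ok in probe(targets, timeout).items():
        print(f"{host}:{port} {'open' if ok else 'BLOCKED or not listening'}")
```

From the controller, something like report([("node01", 6818), ("node02", 6818)]) checks that slurmd is reachable, and from a compute node report([("controller", 6817)]) checks the path back to slurmctld. A False result only says the connection failed; the slurmctld and slurmd logs are still needed to tell a firewall block from a daemon that is not running.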
