Slurm cluster: configuring a node where not all cores have the same number of threads

Posted on 2025-02-03 14:58:10

I've got a new node that I'm trying to add to my Slurm cluster. The cores on the new machine do not all have the same number of threads: 6 cores have 2 threads each and 4 cores have 1 thread each, a total of 16 CPUs. This is shown by lscpu -e:

CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE    MAXMHZ   MINMHZ
  0    0      0    0 0:0:0:0          yes 6300.0000 800.0000
  1    0      0    0 0:0:0:0          yes 6300.0000 800.0000
  2    0      0    1 1:1:1:0          yes 6300.0000 800.0000
  3    0      0    1 1:1:1:0          yes 6300.0000 800.0000
  4    0      0    2 2:2:2:0          yes 6300.0000 800.0000
  5    0      0    2 2:2:2:0          yes 6300.0000 800.0000
  6    0      0    3 3:3:3:0          yes 6300.0000 800.0000
  7    0      0    3 3:3:3:0          yes 6300.0000 800.0000
  8    0      0    4 4:4:4:0          yes 6300.0000 800.0000
  9    0      0    4 4:4:4:0          yes 6300.0000 800.0000
 10    0      0    5 5:5:5:0          yes 6300.0000 800.0000
 11    0      0    5 5:5:5:0          yes 6300.0000 800.0000
 12    0      0    6 6:6:6:0          yes 3600.0000 800.0000
 13    0      0    7 7:7:6:0          yes 3600.0000 800.0000
 14    0      0    8 8:8:6:0          yes 3600.0000 800.0000
 15    0      0    9 9:9:6:0          yes 3600.0000 800.0000
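The per-core thread counts can be read off that table by grouping logical CPUs by their (SOCKET, CORE) pair. The following is a minimal sketch, assuming the `lscpu -e` output has been pasted into a string (trimmed here to the first four columns); the helper name `threads_per_core` is hypothetical and not part of any Slurm or util-linux tool.

```python
from collections import Counter

# First four columns of the `lscpu -e` output posted above.
LSCPU_E = """\
CPU NODE SOCKET CORE
  0    0      0    0
  1    0      0    0
  2    0      0    1
  3    0      0    1
  4    0      0    2
  5    0      0    2
  6    0      0    3
  7    0      0    3
  8    0      0    4
  9    0      0    4
 10    0      0    5
 11    0      0    5
 12    0      0    6
 13    0      0    7
 14    0      0    8
 15    0      0    9
"""


def threads_per_core(lscpu_text):
    """Map each (socket, core) pair to the number of logical CPUs on it."""
    lines = lscpu_text.strip().splitlines()
    header = lines[0].split()
    socket_col = header.index("SOCKET")
    core_col = header.index("CORE")
    counts = Counter()
    for line in lines[1:]:
        fields = line.split()
        counts[(fields[socket_col], fields[core_col])] += 1
    return counts


counts = threads_per_core(LSCPU_E)
histogram = Counter(counts.values())  # threads-per-core -> number of cores
print(dict(histogram))       # {2: 6, 1: 4}
print(sum(counts.values()))  # 16
```

Six cores carry two logical CPUs and four carry one, so no single ThreadsPerCore value describes the whole node.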

When adding a node to slurm.conf, I usually just copy the values over from lscpu. For my new machine, they are:

CPU(s):                          16
On-line CPU(s) list:             0-15
Thread(s) per core:              1
Core(s) per socket:              10
Socket(s):                       1

I appended the following to slurm.conf: NodeName=MYNODE CPUs=16 SocketsPerBoard=1 CoresPerSocket=10 ThreadsPerCore=1. However, this raised the following error:

error: NodeNames=MYNODE CPUs=16 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
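Taken literally, the error message says that CPUs must equal one of three products. The sketch below only restates that message as arithmetic; it is a reading of the error text, not Slurm's actual implementation, and the function name `cpus_match_topology` is hypothetical.

```python
def cpus_match_topology(cpus, sockets, cores_per_socket, threads_per_core):
    """Return True if `cpus` equals one of the three products named in the error."""
    candidates = {
        sockets,
        sockets * cores_per_socket,
        sockets * cores_per_socket * threads_per_core,
    }
    return cpus in candidates


# Values from the node definition above: the candidates are 1, 10 and 10.
print(cpus_match_topology(16, 1, 10, 1))  # False
# CPUs=10 would be consistent with the same topology values.
print(cpus_match_topology(10, 1, 10, 1))  # True
```

With Sockets=1, CoresPerSocket=10 and ThreadsPerCore=1, none of the products is 16, which is consistent with the "Resetting CPUs" message.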

Internally, it seems that Slurm expects every core on a node to have the same number of threads. How can I correctly configure slurm.conf for my new node?

Comments (1)

月下客 2025-02-10 14:58:10

Try removing SocketsPerBoard=1 CoresPerSocket=10 ThreadsPerCore=1 and specifying only NodeName=MYNODE CPUs=16. If you specify CPUs together with Sockets, CoresPerSocket, etc., Slurm will try to reconcile the CPUs value with them. If you leave them out, Slurm will accept the CPUs value you give it.
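As an illustration of that edit, here is a minimal sketch that drops the topology parameters from a node definition line and keeps everything else. The helper `strip_topology` and the `TOPOLOGY_KEYS` set are hypothetical names for this example, and the set covers only the parameters mentioned in this thread.

```python
# Topology parameters mentioned in the question and answer (compared case-insensitively).
TOPOLOGY_KEYS = {"socketsperboard", "sockets", "corespersocket", "threadspercore"}


def strip_topology(node_line):
    """Remove topology key=value tokens from a slurm.conf node line."""
    kept = [
        token
        for token in node_line.split()
        if token.split("=", 1)[0].lower() not in TOPOLOGY_KEYS
    ]
    return " ".join(kept)


original = (
    "NodeName=MYNODE CPUs=16 SocketsPerBoard=1 "
    "CoresPerSocket=10 ThreadsPerCore=1"
)
print(strip_topology(original))  # NodeName=MYNODE CPUs=16
```

The resulting line is the one suggested above; whether Slurm accepts it on a given installation should be confirmed on the node itself.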
