Slurm cluster: configuring a node where not all cores have the same number of threads
I've got a new node that I'm trying to add to my Slurm cluster. The cores on the new machine do not all have the same number of threads: 6 cores have 2 threads each and 4 cores have 1 thread each, for a total of 16 CPUs. This is shown by lscpu -e:
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ
0 0 0 0 0:0:0:0 yes 6300.0000 800.0000
1 0 0 0 0:0:0:0 yes 6300.0000 800.0000
2 0 0 1 1:1:1:0 yes 6300.0000 800.0000
3 0 0 1 1:1:1:0 yes 6300.0000 800.0000
4 0 0 2 2:2:2:0 yes 6300.0000 800.0000
5 0 0 2 2:2:2:0 yes 6300.0000 800.0000
6 0 0 3 3:3:3:0 yes 6300.0000 800.0000
7 0 0 3 3:3:3:0 yes 6300.0000 800.0000
8 0 0 4 4:4:4:0 yes 6300.0000 800.0000
9 0 0 4 4:4:4:0 yes 6300.0000 800.0000
10 0 0 5 5:5:5:0 yes 6300.0000 800.0000
11 0 0 5 5:5:5:0 yes 6300.0000 800.0000
12 0 0 6 6:6:6:0 yes 3600.0000 800.0000
13 0 0 7 7:7:6:0 yes 3600.0000 800.0000
14 0 0 8 8:8:6:0 yes 3600.0000 800.0000
15 0 0 9 9:9:6:0 yes 3600.0000 800.0000
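The per-core thread counts can be tallied from that listing rather than read off by eye. The following is a minimal sketch, assuming the column layout shown above; the names `LSCPU_E`, `threads_per_core`, and `summarize` are made up for illustration, and the text is pasted in rather than collected by running lscpu.

```python
from collections import Counter

# `lscpu -e` output pasted from the listing above (not collected live).
LSCPU_E = """\
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ
0 0 0 0 0:0:0:0 yes 6300.0000 800.0000
1 0 0 0 0:0:0:0 yes 6300.0000 800.0000
2 0 0 1 1:1:1:0 yes 6300.0000 800.0000
3 0 0 1 1:1:1:0 yes 6300.0000 800.0000
4 0 0 2 2:2:2:0 yes 6300.0000 800.0000
5 0 0 2 2:2:2:0 yes 6300.0000 800.0000
6 0 0 3 3:3:3:0 yes 6300.0000 800.0000
7 0 0 3 3:3:3:0 yes 6300.0000 800.0000
8 0 0 4 4:4:4:0 yes 6300.0000 800.0000
9 0 0 4 4:4:4:0 yes 6300.0000 800.0000
10 0 0 5 5:5:5:0 yes 6300.0000 800.0000
11 0 0 5 5:5:5:0 yes 6300.0000 800.0000
12 0 0 6 6:6:6:0 yes 3600.0000 800.0000
13 0 0 7 7:7:6:0 yes 3600.0000 800.0000
14 0 0 8 8:8:6:0 yes 3600.0000 800.0000
15 0 0 9 9:9:6:0 yes 3600.0000 800.0000
"""


def threads_per_core(lscpu_text):
    """Map each CORE id to the number of logical CPUs listed for it."""
    lines = lscpu_text.strip().splitlines()
    # Locate the CORE column by name so the sketch does not depend on its position.
    core_col = lines[0].split().index("CORE")
    return Counter(int(line.split()[core_col]) for line in lines[1:])


def summarize(lscpu_text):
    """Return {threads on a core: number of cores with that many threads}."""
    return dict(Counter(threads_per_core(lscpu_text).values()))


counts = threads_per_core(LSCPU_E)
summary = summarize(LSCPU_E)
print(summary)               # threads-per-core -> number of cores
print(sum(counts.values()))  # total logical CPUs
```

For this listing the summary shows six cores with two threads and four cores with one thread, which adds up to the 16 logical CPUs.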
When appending to slurm.conf, I usually just copy over the info from lscpu. For my new machine, the info is:
CPU(s): 16
On-line CPU(s) list: 0-15
Thread(s) per core: 1
Core(s) per socket: 10
Socket(s): 1
I appended the following to slurm.conf:
NodeName=MYNODE CPUs=16 SocketsPerBoard=1 CoresPerSocket=10 ThreadsPerCore=1
However, this raised the following error:
error: NodeNames=MYNODE CPUs=16 match no Sockets, Sockets*CoresPerSocket or Sockets*CoresPerSocket*ThreadsPerCore. Resetting CPUs.
Internally, it seems that Slurm expects every core on a node to have the same number of threads. How can I correctly configure slurm.conf for my new node?
Comments (1)
Try removing SocketsPerBoard=1 CoresPerSocket=10 ThreadsPerCore=1 and specifying just NodeName=MYNODE CPUs=16. If you specify both CPUs and Sockets, CoresPerSocket, etc., Slurm will try to make sense of the CPUs value against them. With the values in the question, 16 matches none of Sockets (1), Sockets*CoresPerSocket (10), or Sockets*CoresPerSocket*ThreadsPerCore (10), which is why the error appears. If you do not specify them, Slurm will accept the CPUs value you give it.
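The rule quoted in the error message can be checked by hand before editing slurm.conf. The snippet below is only a sketch of that rule as the message words it; it is not taken from Slurm's source, and the function name `matching_products` is hypothetical.

```python
def matching_products(cpus, sockets, cores_per_socket, threads_per_core):
    """Return the names of the topology products that equal `cpus`.

    An empty list corresponds to the "match no Sockets, ..." error text.
    This mirrors the wording of the error message only (an assumption),
    not Slurm's actual implementation.
    """
    candidates = {
        "Sockets": sockets,
        "Sockets*CoresPerSocket": sockets * cores_per_socket,
        "Sockets*CoresPerSocket*ThreadsPerCore": (
            sockets * cores_per_socket * threads_per_core
        ),
    }
    return [name for name, value in candidates.items() if value == cpus]


# Values from the question: CPUs=16 with 1 socket, 10 cores, 1 thread per core.
question_result = matching_products(16, 1, 10, 1)
print(question_result)  # empty list: 16 equals neither 1 nor 10

# For comparison, CPUs=10 with the same topology values does line up.
consistent_result = matching_products(10, 1, 10, 1)
print(consistent_result)
```

With the topology keywords removed from the node line, there is nothing for CPUs=16 to be compared against, which is the situation the answer recommends.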