Binding more processes than CPUs error in Slurm OpenMPI
I am trying to run a job that uses explicit message passing between nodes on Slurm (i.e. not just running parallel jobs) but am getting a recurring error that "a request was made to bind to that would result in binding more processes than cpus on a resource". Briefly, my code scatters an array of parameters across 128 processes, calculates the likelihood of those parameters, and gathers the sum of those likelihood values back to the root node (a minimal sketch of this pattern follows the sbatch file below). I got the error when executing the code using the following sbatch file:
#!/bin/bash
#SBATCH --job-name=linesearch
#SBATCH --output=ls_%j.txt
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=16
#SBATCH --partition=broadwl
#SBATCH --mem-per-cpu=2000
#SBATCH --time=18:00:00
# Load the default OpenMPI module.
module load openmpi
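# Launch 8 processes per node (8 ranks x 16 nodes = 128 ranks total).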
mpiexec -N 8 ./linesearch
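For context, here is a minimal sketch in MPI C of the scatter/sum-reduce pattern described above. The parameter array, the likelihood function, and the one-parameter-per-rank layout are hypothetical stand-ins, not the actual linesearch code:

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the real likelihood computation. */
static double likelihood(double param) {
    return -param * param; /* e.g. a log-likelihood term */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Root prepares one parameter per rank (assumed layout). */
    double *params = NULL;
    if (rank == 0) {
        params = malloc(size * sizeof(double));
        for (int i = 0; i < size; i++) params[i] = (double)i;
    }

    /* Scatter one parameter to each rank. */
    double p;
    MPI_Scatter(params, 1, MPI_DOUBLE, &p, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Each rank computes its likelihood contribution. */
    double local = likelihood(p);

    /* Sum all contributions back on the root. */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("summed likelihood = %f\n", total);
        free(params);
    }

    MPI_Finalize();
    return 0;
}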
I thought that using mpiexec -N 8 would explicitly assign 8 processes per node, fewer than the 16 requested by --ntasks-per-node (16 nodes x 8 = 128 ranks against an allocation of 256 tasks). Following a response to a different Stack Overflow thread, I thought this method, although an inefficient use of the allocated cores, would avoid this error, but it didn't resolve the issue.
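As an aside (not from the original post), Open MPI's --report-bindings option can be added to the mpiexec line to print where each launched rank is actually bound, which helps diagnose this class of error:

mpiexec --report-bindings -N 8 ./linesearch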
The full error message, if useful, is as follows:
A request was made to bind to that would result in binding more
processes than cpus on a resource:
Bind to: NONE:IF-SUPPORTED
Node: XXXXXX
#processes: 4
#cpus: 3
You can override this protection by adding the "overload-allowed"
option to your binding directive.
The processes that I'm executing can be memory intensive, so I don't necessarily want to use the overload override, at the risk of jobs terminating after exhausting their memory allocation.
1 Answer
Note that I was loading module openmpi v2.0.1 [retired]. However, changing the sbatch file to bind to socket with only -np 128 tasks resolved this issue. sbatch file:
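The answer's sbatch listing did not survive extraction; a plausible reconstruction, assuming the same directives as the question with only the mpiexec line changed as described (the --bind-to socket placement is an assumption), would be:

#!/bin/bash
#SBATCH --job-name=linesearch
#SBATCH --output=ls_%j.txt
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=16
#SBATCH --partition=broadwl
#SBATCH --mem-per-cpu=2000
#SBATCH --time=18:00:00
# Load the default OpenMPI module.
module load openmpi
# Launch all 128 ranks and bind them to sockets rather than cores.
mpiexec -np 128 --bind-to socket ./linesearch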
An alternative solution is to use --bind-to core --map-by core in the mpiexec statement to bind each process to a core.
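Applied to the job above, that alternative would look like the following (the -np count is assumed to match the first solution):

mpiexec -np 128 --bind-to core --map-by core ./linesearch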