Binding more processes than CPUs error in Slurm OpenMPI


I am trying to run a job that uses explicit message passing between nodes on SLURM (i.e. not just running parallel jobs) but am getting a recurring error that "a request was made to bind to that would result in binding more processes than cpus on a resource". Briefly, my code requires sending an array of parameters across 128 nodes, calculating a likelihood of those parameters, and gathering the sum of those likelihood values back to the root node. I got the error when executing the code using the following sbatch file:

#!/bin/bash

#SBATCH --job-name=linesearch
#SBATCH --output=ls_%j.txt
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=16
#SBATCH --partition=broadwl
#SBATCH --mem-per-cpu=2000
#SBATCH --time=18:00:00

# Load the default OpenMPI module.
module load openmpi

mpiexec -N 8 ./linesearch

I thought that using -N 8 would explicitly assign only 8 processes per node against the 16 --ntasks-per-node requested from Slurm. Following a response to a different Stack Overflow thread, I expected this approach, although an inefficient use of the allocated processors, to avoid the error, but it didn't resolve the issue.
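For reference, here is a minimal sketch (not taken from the original post) of an sbatch file in which the Slurm request and the mpiexec launch geometry agree, asking Slurm for exactly the 8 tasks per node that mpiexec then starts; the --map-by ppr:8:node option is assumed to be available in the Open MPI version loaded by the module:

#!/bin/bash

#SBATCH --job-name=linesearch
#SBATCH --output=ls_%j.txt
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=8
#SBATCH --partition=broadwl
#SBATCH --mem-per-cpu=2000
#SBATCH --time=18:00:00

# Load the default OpenMPI module.
module load openmpi

# Launch the same 128 ranks (8 per node x 16 nodes) that Slurm allocated;
# --map-by ppr:8:node assumes Open MPI 1.8 or newer.
mpiexec -np 128 --map-by ppr:8:node ./linesearch

With the allocation and the launch line consistent, the binding check should no longer see more processes than allocated CPUs on any node.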

The full error message, if useful, is as follows:

A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     NONE:IF-SUPPORTED
   Node:        XXXXXX
   #processes:  4
   #cpus:       3

You can override this protection by adding the "overload-allowed"
option to your binding directive.

The processes that I'm executing can be memory intensive, so I don't necessarily want to use the overload override at the risk of jobs terminating after exhausting their memory allocation.
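Before settling on a layout, it can also help to confirm how many CPUs and how much memory a broadwl node actually provides, so that the per-node task count and --mem-per-cpu together fit within one node. A quick check from a login node might look like this (sketch; the node name is a placeholder):

# List node name, CPU count, and memory (MB) for the broadwl partition.
sinfo -p broadwl -N -o "%N %c %m" | sort -u | head

# Inspect a single node in detail (replace XXXXXX with a real node name).
scontrol show node XXXXXX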


Answer by 给不了的爱 (2025-01-31 13:33:53):


Note that I was loading the module openmpi v2.0.1 [retired]. However, changing the sbatch file to bind to socket with just -np 128 tasks resolved this issue.

sbatch file:

#!/bin/bash

#SBATCH --job-name=linesearch
#SBATCH --output=ls_%j.txt
#SBATCH --nodes=16
#SBATCH --ntasks=128
#SBATCH --partition=broadwl
#SBATCH --mem-per-cpu=2000
#SBATCH --time=18:00:00

# Load the default OpenMPI module.
module load openmpi

mpiexec -np 128 ./execs/linesearch $1 $2

An alternative solution is to use --bind-to core --map-by core in the mpiexec statement to bind each process to a core.
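For completeness, a sketch of that alternative launch line, keeping the same allocation and executable path as in the sbatch file above:

# Same 128 ranks, but each rank pinned to its own core.
mpiexec -np 128 --bind-to core --map-by core ./execs/linesearch $1 $2

This pins each of the 128 ranks to a dedicated core instead of letting them float within a socket.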
