Binding more processes than CPUs error in Slurm OpenMPI


I am trying to run a job that uses explicit message passing between nodes on SLURM (i.e. not just running parallel jobs) but am getting a recurring error that "a request was made to bind to that would result in binding more processes than cpus on a resource". Briefly, my code requires sending an array of parameters across 128 nodes, calculating a likelihood of those parameters, and gathering the sum of those likelihood values back to the root node. I got the error when executing the code using the following sbatch file:

#!/bin/bash

#SBATCH --job-name=linesearch
#SBATCH --output=ls_%j.txt
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=16
#SBATCH --partition=broadwl
#SBATCH --mem-per-cpu=2000
#SBATCH --time=18:00:00

# Load the default OpenMPI module.
module load openmpi

mpiexec -N 8 ./linesearch

I thought that using -N 8 would explicitly assign only 8 processes per node against the 16 --ntasks-per-node requested from Slurm. Following a response to a different Stack Overflow thread, I expected this approach, although an inefficient use of the allocated processors, to avoid the error, but it didn't resolve the issue.
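For reference, here is a minimal sketch (not taken from the original post) of an sbatch file in which the Slurm request and the mpiexec launch geometry agree, asking Slurm for exactly the 8 tasks per node that mpiexec then starts; the --map-by ppr:8:node option is assumed to be available in the Open MPI version loaded by the module:

#!/bin/bash

#SBATCH --job-name=linesearch
#SBATCH --output=ls_%j.txt
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=8
#SBATCH --partition=broadwl
#SBATCH --mem-per-cpu=2000
#SBATCH --time=18:00:00

# Load the default OpenMPI module.
module load openmpi

# Launch the same 128 ranks (8 per node x 16 nodes) that Slurm allocated;
# --map-by ppr:8:node assumes Open MPI 1.8 or newer.
mpiexec -np 128 --map-by ppr:8:node ./linesearch

With the allocation and the launch line consistent, the binding check should no longer see more processes than allocated CPUs on any node.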

The full error message, if useful, is as follows:

A request was made to bind to that would result in binding more
processes than cpus on a resource:

   Bind to:     NONE:IF-SUPPORTED
   Node:        XXXXXX
   #processes:  4
   #cpus:       3

You can override this protection by adding the "overload-allowed"
option to your binding directive.

The processes that I'm executing can be memory intensive, so I don't necessarily want to use the overload override at the risk of jobs terminating after exhausting their memory allocation.
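Before settling on a layout, it can also help to confirm how many CPUs and how much memory a broadwl node actually provides, so that the per-node task count and --mem-per-cpu together fit within one node. A quick check from a login node might look like this (sketch; the node name is a placeholder):

# List node name, CPU count, and memory (MB) for the broadwl partition.
sinfo -p broadwl -N -o "%N %c %m" | sort -u | head

# Inspect a single node in detail (replace XXXXXX with a real node name).
scontrol show node XXXXXX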


Answer by 给不了的爱 (2025-01-31 13:33:53):


Note that I was loading the module openmpi v2.0.1 [retired]. However, changing the sbatch file to bind to socket with just -np 128 tasks resolved this issue.

sbatch file:

#!/bin/bash

#SBATCH --job-name=linesearch
#SBATCH --output=ls_%j.txt
#SBATCH --nodes=16
#SBATCH --ntasks=128
#SBATCH --partition=broadwl
#SBATCH --mem-per-cpu=2000
#SBATCH --time=18:00:00

# Load the default OpenMPI module.
module load openmpi

mpiexec -np 128 ./execs/linesearch $1 $2

An alternative solution is to use --bind-to core --map-by core in the mpiexec statement to bind each process to a core.
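For completeness, a sketch of that alternative launch line, keeping the same allocation and executable path as in the sbatch file above:

# Same 128 ranks, but each rank pinned to its own core.
mpiexec -np 128 --bind-to core --map-by core ./execs/linesearch $1 $2

This pins each of the 128 ranks to a dedicated core instead of letting them float within a socket.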
