Run 2 Slurm jobs only when both have been allocated resources

Posted 2025-02-10 17:08:48


One job is submitted to get hold of 4 GPUs. The second is submitted to get hold of the next 4 GPUs (on a different node). How can I ensure that both jobs run at the same time so that they can eventually synchronise (PyTorch DDP)?

Having an extra script check the available resources does the trick; however, other jobs might take priority because they have already been in the queue, rather than waiting...

The particular partition I am using does not allow for a request of 2 nodes directly.

I am also aware of the --dependency flag; however, this can only be used as a completion check on the first job.
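
For reference, a minimal sketch of the completion-check usage referred to here (the script names and captured job ID are placeholders):

jobid1=$(sbatch --parsable job1.sh)            # submit the first 4-GPU job, capture its job ID
sbatch --dependency=afterok:${jobid1} job2.sh  # second job is eligible only after the first finishes successfully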


Comments (1)

吻安 2025-02-17 17:08:48


The simple answer is to be more explicit with Slurm.

# one fold per GPU within the same allocation; idx selects the visible GPU
idx=0; export CUDA_VISIBLE_DEVICES=$idx; python -u run_pos.py --fold=1 &
idx=1; export CUDA_VISIBLE_DEVICES=$idx; python -u run_pos.py --fold=2 &
idx=2; export CUDA_VISIBLE_DEVICES=$idx; python -u run_pos.py --fold=3 &

wait

srun examples

Jobs will be allocated specific generic resources as needed to satisfy the request. If the job is suspended, those resources do not become available for use by other jobs.

Job steps can be allocated generic resources from those allocated to the job using the --gres option with the srun command as described above. By default, a job step will be allocated all of the generic resources allocated to the job. If desired, the job step may explicitly specify a different generic resource count than the job. This design choice was based upon a scenario where each job executes many job steps. If job steps were granted access to all generic resources by default, some job steps would need to explicitly specify zero generic resource counts, which we considered more confusing. The job step can be allocated specific generic resources and those resources will not be available to other job steps. A simple example is shown below.

Flags explained

  1. --gres - Generic resources required per node
  2. --gpus - GPUs required per job
  3. --gpus-per-node - GPUs required per node. Equivalent to the --gres option for GPUs.
  4. --gpus-per-socket - GPUs required per socket. Requires the job to specify a task socket.
  5. --gpus-per-task - GPUs required per task. Requires the job to specify a task count.
  6. --cpus-per-gpu - Count of CPUs allocated per GPU.
  7. --gpu-bind - Define how tasks are bound to GPUs.
  8. --gpu-freq - Specify GPU frequency and/or GPU memory frequency.
  9. --mem-per-gpu - Memory allocated per GPU.

#!/bin/bash
#
# gres_test.bash
# Submit as follows:
# sbatch --gres=gpu:4 -n4 -N1-1 gres_test.bash
#
srun --gres=gpu:2 -n2 --exclusive show_device.sh &
srun --gres=gpu:1 -n1 --exclusive show_device.sh &
srun --gres=gpu:1 -n1 --exclusive show_device.sh &
wait
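
As an illustrative sketch of how several of the flags listed above combine in one submission (the script body and executable name here are hypothetical, not taken from the Slurm documentation):

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --gpus-per-task=1        # one GPU for each of the 4 tasks (4 GPUs in total)
#SBATCH --cpus-per-gpu=10        # 10 CPU cores per allocated GPU
#SBATCH --gpu-bind=closest       # bind each task to the GPU closest to its cores

srun ./my_gpu_app                # hypothetical executable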

Another example:

srun --gres=gpu:1 bash -c 'CUDA_VISIBLE_DEVICES=$SLURM_PROCID env' | grep CUDA_VISIBLE

CUDA_VISIBLE_DEVICES=1
CUDA_VISIBLE_DEVICES=0

You can further automate this with a bash script:

#!/bin/bash
#SBATCH --qos=maxjobs
#SBATCH -N 1
#SBATCH --exclusive

# launch one run per GPU, each from its own subdirectory (directories 0..3 must already exist)
for i in `seq 0 3`; do
    cd ${i}
    export CUDA_VISIBLE_DEVICES=$i   # expose only GPU $i to this run
    python gpu_code.py &
    cd ..
done
wait   # keep the allocation until all background runs have finished
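
A usage sketch, assuming the script above is saved as run_folds.sh (a filename I am inventing) and the node has 4 GPUs:

sbatch --gres=gpu:4 run_folds.sh   # the four background runs then share one exclusive 4-GPU allocation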

The complex but better answer...

The Multi-Process Service (MPS) is an implementation variant compatible with the CUDA programming interface. The MPS execution architecture is designed to let co-operative multi-process CUDA applications, generally MPI jobs, use the Hyper-Q functionality on the most recent NVIDIA GPUs. Hyper-Q allows CUDA kernels to be processed simultaneously on the same GPU; this can improve performance when the GPU's compute capacity is underused by a single application process.

CUDA MPS is included by default in the different CUDA modules available to the users.

For a multi-GPU MPI batch job, the usage of CUDA MPS can be activated with the -C mps option. However, the node must be exclusively reserved via the --exclusive option.

For an execution via the default gpu partition (nodes with 40 physical cores and 4 GPUs) using only one node:

mps_multi_gpu_mpi.slurm

#!/bin/bash
#SBATCH --job-name=gpu_cuda_mps_multi_mpi      # name of job
#SBATCH --ntasks=40                            # total number of MPI tasks
#SBATCH --ntasks-per-node=40                   # number of MPI tasks per node (all physical cores)
#SBATCH --gres=gpu:4                           # number of GPUs per node (all GPUs)
#SBATCH --cpus-per-task=1                      # number of cores per task
# /!\ Caution: in Slurm vocabulary, "multithread" refers to hyperthreading.
#SBATCH --hint=nomultithread                   # hyperthreading deactivated
#SBATCH --time=00:10:00                        # maximum execution time requested (HH:MM:SS)
#SBATCH --output=gpu_cuda_mps_multi_mpi%j.out  # name of output file
#SBATCH --error=gpu_cuda_mps_multi_mpi%j.out   # name of error file (here, common with the output)
#SBATCH --exclusive                            # exclusively reserves the node
#SBATCH -C mps                                 # activates MPS

# cleans out modules loaded in interactive and inherited by default
module purge

# loads modules
module load ...

# echo of launched commands
set -x

# execution of the code: 4 GPUs for 40 MPI tasks
srun ./executable_multi_gpu_mpi

Submit script via the sbatch command:

sbatch mps_multi_gpu_mpi.slurm

Similarly, you can execute your job on an entire node of the gpu_p2 partition (nodes with 24 physical cores and 8 GPUs) by specifying:

#SBATCH --partition=gpu_p2           # GPU partition requested
#SBATCH --ntasks=24                  # total number of MPI tasks
#SBATCH --ntasks-per-node=24         # number of MPI tasks per node (all physical cores)
#SBATCH --gres=gpu:8                 # number of GPUs per node (all GPUs)
#SBATCH --cpus-per-task=1            # number of cores per task

Be careful: even if you use only part of the node, it has to be reserved in exclusive mode. In particular, this means that the entire node is invoiced.

I recommend that you compile and execute your code in the same environment by loading the same modules. In this example, I assume that the executable_multi_gpu_mpi executable file is found in the submission directory, i.e. the directory in which the sbatch command is entered.

The calculation output file, gpu_cuda_mps_multi_mpi<numero_job>.out, is also found in the submission directory. It is created at the start of the job execution: Editing or modifying it while the job is running can disrupt the execution.

The module purge is made necessary by the Slurm default behaviour: any modules which are loaded in your environment at the moment you launch sbatch are passed to the submitted job, making the execution of your job dependent on what you have done previously.

PROTIP: To avoid errors in the automatic task distribution, I recommend using srun to execute your code instead of mpirun. This guarantees a distribution which conforms to the specifications of the resources you requested in the submission file.
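
In the batch script this simply means keeping the srun launch, for example (the commented-out mpirun line is shown only as the pattern to avoid):

srun ./executable_multi_gpu_mpi              # tasks placed according to the #SBATCH resources
# mpirun -np 40 ./executable_multi_gpu_mpi   # avoid: may not respect Slurm's task/CPU binding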

Misc.
Jobs have default resource limits defined in Slurm per partition and per QoS (Quality of Service). You can modify the limits or specify another partition and/or QoS, as shown in the documentation detailing the partitions and QoS.
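
To see what those per-partition and per-QoS limits actually are on your cluster, the plain Slurm query commands are usually enough:

scontrol show partition          # limits and defaults of every partition visible to you
sacctmgr show qos                # QoS definitions, priorities and limits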

That was exhaustive, I HOPE THAT HELPS!
