I was wondering if I could ask something about running Slurm jobs in parallel. (Please note that I am new to Slurm and Linux and have only started using it 2 days ago...)
As per the instructions in the image from this tutorial (source: https://hpc.nmsu.edu/discovery/slurm/tasks/serial-execution/),
I have designed the following bash script:
#!/bin/bash
#SBATCH --job-name fmriGLM # to give each job a different name
#SBATCH --nodes=1
#SBATCH -t 16:00:00 # Time for running job
#SBATCH -o /scratch/connectome/dyhan316/fmri_preprocessing/FINAL_loop_over_all/output_fmri_glm.o%j # %j : job id
#SBATCH -e /scratch/connectome/dyhan316/fmri_preprocessing/FINAL_loop_over_all/error_fmri_glm.e%j
pwd; hostname; date
#SBATCH --ntasks=30
#SBATCH --mem-per-cpu=3000MB
#SBATCH --cpus-per-task=1
for num in {0..29}
do
srun --ntasks=1 python FINAL_ARGPARSE_RUN.py --n_division 30 --start_num ${num} &
done
wait
Then, I ran sbatch as follows: sbatch test_bash
However, when I view the outputs, it is apparent that only one of the sruns in the bash script is being executed... Could anyone tell me where I went wrong and how I can fix it?
**Update: when I look at the error file, I get the following: srun: Job 43969 step creation temporarily disabled, retrying. I searched the internet and it says this could be caused by not specifying the memory, so there isn't enough memory left for the second job... but I thought I had already specified the memory when I set --mem-per-cpu=3000MB?
**Update: I have tried changing the code as described here: Why are my slurm job steps not launching in parallel?, but it still didn't work.
**Potentially pertinent information: our node has about 96 cores, which seems odd compared to tutorials that say one node has something like 4 cores.
Thank you!!
2 Answers
Try adding --exclusive to the srun command line; this will instruct srun to use a sub-allocation and work as you intended. Note that the --exclusive option has a different meaning in this context than when used with sbatch. Note also that different versions of Slurm have a distinct canonical way of doing this, but using --exclusive should work across most versions.
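For example, the loop in the question's script would become something like the following (a sketch that reuses the asker's script name and arguments, which are not part of this answer):
for num in {0..29}
do
    # --exclusive makes srun take a dedicated sub-allocation for this step,
    # so the 30 steps can run side by side instead of one after another
    srun --exclusive --ntasks=1 python FINAL_ARGPARSE_RUN.py --n_division 30 --start_num ${num} &
done
wait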
Even though you have solved your problem, which turned out to be something else, and you had already specified --mem-per-cpu=3000MB in your sbatch script, I would like to add that in my case my Slurm setup doesn't allow --mem-per-cpu in sbatch, only --mem. So the srun command will still allocate all the memory and block the subsequent steps. The key for me is to specify --mem-per-cpu (or --mem) in the srun command.
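As a concrete sketch of that last point, the memory request would go on the srun line inside the loop (the 3000MB value is simply carried over from the question's script, not a recommendation):
# requesting memory per step keeps one step from claiming the whole job's memory
# and blocking the remaining steps
srun --exclusive --ntasks=1 --mem-per-cpu=3000MB python FINAL_ARGPARSE_RUN.py --n_division 30 --start_num ${num} &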