PyTorch detecting more than 1 GPU from a SLURM job
I'm using SLURM to assign some GPU nodes from a supercomputer for an ML job I have.
Everything works fine for a single GPU (1 node), but when I set up the SLURM script for more than one, the Python script still seems to detect only 1 GPU. First of all, I'm certain that the nodes are being reserved, based on the output right below:
Now, I'm not sure what I'm doing wrong, but this is my SLURM file:
#!/bin/bash
#SBATCH --time=00:03:00
#SBATCH -N 4
#SBATCH -C TitanX
#SBATCH --gres=gpu:1
#SBATCH -o myfile.out
# Load GPU drivers
module load cuda11.1/toolkit
module load cuDNN/cuda11.1
# This loads the anaconda virtual environment with our packages
source /home/user/.bashrc
conda activate env_37
CUDA_VISIBLE_DEVICES=0,1,2,3
# Run the actual experiment
python train.py --name gm --workers 4 --shuffle --keep_step 1000 --decay_step 1000
And my Python script for calling the GPUs:
os.environ['CUDA_VISIBLE_DEVICES'] = "0,1,2,3"
model.cuda()
model.train()
# adding for multiple GPUs
if torch.cuda.device_count() > 1:
print("Let's use", torch.cuda.device_count(), "GPUs!")
model = nn.DataParallel(model, gpu_ids = [0,1,2,3])
# is cuda being used?
print(torch.cuda.is_available())
print(torch.cuda.current_device())
print(torch.cuda.device(0))
print(torch.cuda.device_count())
print(torch.cuda.get_device_name(0))
The output of device_count() gives me 1. I'm still not sure whether the problem lies with my Python script or with the SLURM script. The highest available CUDA toolkit version on the supercomputer is 11.1 (as far as I know), and I loaded this module in the SLURM script; however, the conda environment running the Python script uses a higher CUDA version. But since the code already worked with 1 GPU, I doubt that this is the problem. I hope someone can help!
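In case it helps, here is a minimal sketch of the diagnostics I could print at the top of train.py; the SLURM_* variable names are a guess and may not all be set on this cluster:
import os
import torch

# Diagnostic sketch: show what the job environment and PyTorch actually see.
# The SLURM_* names are assumptions; not every SLURM setup exports all of them.
for var in ("CUDA_VISIBLE_DEVICES", "SLURM_JOB_NODELIST", "SLURM_JOB_GPUS", "SLURM_GPUS_ON_NODE"):
    print(var, "=", os.environ.get(var))

print("is_available:", torch.cuda.is_available())
print("device_count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))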
1 Answer
You are allocating 1 GPU to your sbatch job with
#SBATCH --gres=gpu:1
If you have 4 GPUs to use, you should change this option to
#SBATCH --gres=gpu:4
See the documentation on gres for more info: https://slurm.schedmd.com/gres.html
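As a sketch, the adjusted header could look like the following, assuming the four TitanX GPUs should all come from a single node (nn.DataParallel only uses the GPUs of the node the script runs on) and that a node actually carries four of them; SLURM then normally sets CUDA_VISIBLE_DEVICES for the allocated GPUs on its own:
#!/bin/bash
#SBATCH --time=00:03:00
#SBATCH -N 1                 # one node, assuming it hosts all four GPUs; DataParallel cannot span nodes
#SBATCH -C TitanX
#SBATCH --gres=gpu:4         # request 4 GPUs on that node instead of 1
#SBATCH -o myfile.out
# No manual CUDA_VISIBLE_DEVICES assignment needed: SLURM normally exposes the allocated GPUs itself.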