PyTorch: detecting more than 1 GPU from a Slurm job

Posted on 2025-01-18 14:03:19


I'm using SLURM to assign some GPU nodes from a supercomputer for an ML job I have.

Everything works fine for a single GPU (1 node), but when I set up the SLURM script for more than 1 GPU, the Python script still seems to detect only 1 GPU. First of all, I'm certain that the nodes are being reserved, based on the output right below:

[Screenshot omitted: SLURM output showing the reserved nodes.]

Now, I'm not sure what I am doing wrong, but this is my SLURM file:

#!/bin/bash
#SBATCH --time=00:03:00
#SBATCH -N 4
#SBATCH -C TitanX
#SBATCH --gres=gpu:1
#SBATCH -o myfile.out

# Load GPU drivers
module load cuda11.1/toolkit
module load cuDNN/cuda11.1

# This loads the anaconda virtual environment with our packages
source /home/user/.bashrc
conda activate env_37

CUDA_VISIBLE_DEVICES=0,1,2,3

# Run the actual experiment
python train.py --name gm --workers 4 --shuffle --keep_step 1000 --decay_step 1000

And my Python script for calling the GPUs:

os.environ['CUDA_VISIBLE_DEVICES'] = "0,1,2,3"
model.cuda()
model.train()
    
# adding for multiple GPUs    
if torch.cuda.device_count() > 1:
    print("Let's use", torch.cuda.device_count(), "GPUs!")
    model = nn.DataParallel(model, gpu_ids = [0,1,2,3])
    
# is cuda being used?
print(torch.cuda.is_available())
print(torch.cuda.current_device())
print(torch.cuda.device(0))
print(torch.cuda.device_count())
print(torch.cuda.get_device_name(0))

The output of device_count() gives me 1. I'm still not sure whether the problem lies with my Python script or with the SLURM script. The highest CUDA toolkit version available on the supercomputer is 11.1 (as far as I know), and I load that module in the SLURM script; however, the conda environment running the Python script uses a higher CUDA version. But since the code already works on 1 GPU, I doubt that this is the problem. I hope someone can help!
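As a sanity check, here is a minimal diagnostic sketch I could drop into the SLURM script right before the python call, to print what the allocated job actually exposes (it assumes the same modules and conda environment loaded above; SLURM_GPUS_ON_NODE may be unset on older SLURM versions):

# Diagnostic sketch: show what the job actually sees before training starts
echo "Node: $(hostname)"
echo "SLURM_JOB_NODELIST:   $SLURM_JOB_NODELIST"
echo "SLURM_GPUS_ON_NODE:   ${SLURM_GPUS_ON_NODE:-unset}"
echo "CUDA_VISIBLE_DEVICES: ${CUDA_VISIBLE_DEVICES:-unset}"
# List the GPUs the driver exposes on this node
nvidia-smi -L
# Ask PyTorch directly, using the same environment as train.py
python -c "import torch; print('torch sees', torch.cuda.device_count(), 'GPU(s)')"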


Comments (1)

塔塔猫 2025-01-25 14:03:20

You are allocating 1 GPU to your sbatch job with #SBATCH --gres=gpu:1

If you have 4 GPUs to use, you should change this option to #SBATCH --gres=gpu:4

See the documentation on gres for more info: https://slurm.schedmd.com/gres.html
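
A minimal sketch of the adjusted header under that suggestion (the rest of the script is assumed unchanged). Note that --gres is a per-node request, so combined with -N 4 it asks for 4 GPUs on each of the 4 allocated nodes:

#!/bin/bash
#SBATCH --time=00:03:00
#SBATCH -N 4
#SBATCH -C TitanX
# --gres counts GPUs per node: with -N 4 this requests 4 GPUs on every node
#SBATCH --gres=gpu:4
#SBATCH -o myfile.out

If the goal is simply 4 GPUs visible to one python process (which is all nn.DataParallel can use, since it only spans the GPUs of a single node), requesting -N 1 together with --gres=gpu:4 would keep the whole allocation on one machine.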
