PyTorch detecting more than 1 GPU from a SLURM job
I'm using SLURM to assign some GPU nodes from a supercomputer for an ML job I have.
Everything works fine for a single GPU (1 node), but when I set up the SLURM script for more than one, the Python script still seems to detect only 1 GPU. First of all, I'm certain that the nodes are being reserved, based on the output right below:
Now, I'm not sure what I'm doing wrong, but this is my SLURM file:
#!/bin/bash
#SBATCH --time=00:03:00
#SBATCH -N 4
#SBATCH -C TitanX
#SBATCH --gres=gpu:1
#SBATCH -o myfile.out
# Load GPU drivers
module load cuda11.1/toolkit
module load cuDNN/cuda11.1
# This loads the anaconda virtual environment with our packages
source /home/user/.bashrc
conda activate env_37
CUDA_VISIBLE_DEVICES=0,1,2,3
# Run the actual experiment
python train.py --name gm --workers 4 --shuffle --keep_step 1000 --decay_step 1000
And my Python script for calling the GPUs:
os.environ['CUDA_VISIBLE_DEVICES'] = "0,1,2,3"
model.cuda()
model.train()
# adding for multiple GPUs
if torch.cuda.device_count() > 1:
print("Let's use", torch.cuda.device_count(), "GPUs!")
model = nn.DataParallel(model, gpu_ids = [0,1,2,3])
# is cuda being used?
print(torch.cuda.is_available())
print(torch.cuda.current_device())
print(torch.cuda.device(0))
print(torch.cuda.device_count())
print(torch.cuda.get_device_name(0))
The output of device_count() gives me 1. I'm still not sure whether the problem lies with my Python script or with the SLURM script. The highest available CUDA toolkit version on the supercomputer is 11.1 (as far as I know), and I loaded this module in the SLURM script; however, the conda environment running the Python script uses a higher CUDA version. But since the code already worked with 1 GPU, I doubt that this is the problem. I hope someone can help!
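In case it helps, here is a minimal sketch of the diagnostics I could print at the top of train.py; the SLURM_* variable names are a guess and may not all be set on this cluster:
import os
import torch

# Diagnostic sketch: show what the job environment and PyTorch actually see.
# The SLURM_* names are assumptions; not every SLURM setup exports all of them.
for var in ("CUDA_VISIBLE_DEVICES", "SLURM_JOB_NODELIST", "SLURM_JOB_GPUS", "SLURM_GPUS_ON_NODE"):
    print(var, "=", os.environ.get(var))

print("is_available:", torch.cuda.is_available())
print("device_count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))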
1 Answer
You are allocating 1 GPU to your sbatch job with
#SBATCH --gres=gpu:1
If you have 4 GPUs to use, you should change this option to
#SBATCH --gres=gpu:4
See the documentation on gres for more info: https://slurm.schedmd.com/gres.html
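As a sketch, the adjusted header could look like the following, assuming the four TitanX GPUs should all come from a single node (nn.DataParallel only uses the GPUs of the node the script runs on) and that a node actually carries four of them; SLURM then normally sets CUDA_VISIBLE_DEVICES for the allocated GPUs on its own:
#!/bin/bash
#SBATCH --time=00:03:00
#SBATCH -N 1                 # one node, assuming it hosts all four GPUs; DataParallel cannot span nodes
#SBATCH -C TitanX
#SBATCH --gres=gpu:4         # request 4 GPUs on that node instead of 1
#SBATCH -o myfile.out
# No manual CUDA_VISIBLE_DEVICES assignment needed: SLURM normally exposes the allocated GPUs itself.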