CUDA out of memory in PyTorch: strange behavior and how to solve it


I am trying to train a model (a ConvLSTM) with PyTorch for video prediction. My "videos" are lists of 350x350 images (frames); I give N frames as input and need to predict N frames as output.

To train my network I am currently using an NVIDIA V100 with 12 cores and 12 GB of RAM. My inputs and outputs are 5th-order tensors of size [batch_size, n_frames, channels, height, width].

The problem is that even with everything set to the bare minimum, i.e. num_workers=1, batch_size=1, and a training set of a single video sample of 16 frames (note that it works with fewer than 16 frames), I get this error:

RuntimeError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 0; 32.00 GiB total capacity; 28.14 GiB already allocated; 96.54 MiB free; 28.55 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
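Reserved (28.55 GiB) and allocated (28.14 GiB) are almost equal here, so fragmentation is probably not the main culprit, but the max_split_size_mb option mentioned at the end of the message can be tried by setting an environment variable before the first CUDA allocation (the 128 MiB threshold below is only an illustrative value):

import os

# Must be set before anything allocates on the GPU (ideally at the very top of the script).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"  # illustrative value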

I really don't know how to get rid of this error. My training set has more than 10K samples of 20 frames each, and I cannot even run my code with a single sample.
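For scale, a single sample is tiny compared to the ~28 GiB being used, so the memory must be going to intermediate activations rather than to the input data itself. A back-of-the-envelope check, assuming 3-channel frames in float32 (the channel count is not stated above, so this is an assumption):

# Rough size of one sample, assuming 3 channels and float32 (4 bytes per element).
batch_size, n_frames, channels, height, width = 1, 20, 3, 350, 350
size_mib = batch_size * n_frames * channels * height * width * 4 / 1024**2
print(f"one sample ~ {size_mib:.1f} MiB")  # ~ 28.0 MiB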

I saw that this error is common, but I cannot find a way to solve it (since everything is already at the minimum). I tried to inspect my memory usage with nvidia-smi and torch.cuda.memory_summary(), but could not get any insight from them. Also, torch.cuda.empty_cache() does not help.
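A minimal logging helper along these lines (assuming a single visible GPU) can at least show between which steps the usage jumps, e.g. before and after the forward pass and again after backward:

import torch

def log_gpu_memory(tag=""):
    # memory_allocated: bytes currently held by live tensors
    # memory_reserved: total bytes claimed from the driver by PyTorch's caching allocator
    allocated = torch.cuda.memory_allocated() / 1024**2
    reserved = torch.cuda.memory_reserved() / 1024**2
    print(f"[{tag}] allocated={allocated:.1f} MiB, reserved={reserved:.1f} MiB")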

I see, however, that PyTorch allocates a lot of space for itself; is that normal? In the code I don't do anything unusual: I just check whether CUDA is available as the device and then set

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
if multi_gpu:
    model = nn.DataParallel(model)  # wrap the model for multi-GPU data parallelism

The error is raised at this point in my training loop:

prediction = model(inputs, 
            input_frames  = train_data.n_frames_input, 
            future_frames = train_data.n_frames_output, 
            output_frames = train_data.n_frames_output,      
            teacher_forcing = True, 
            scheduled_sampling_ratio = scheduled_sampling_ratio)
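Since anything below 16 frames fits, my guess is that the activations kept for backpropagation grow with n_frames at 350x350 resolution. One common way to roughly halve activation memory, assuming the ConvLSTM is numerically stable in float16, is automatic mixed precision; the sketch below wraps the same call, and train_loader, targets, criterion and optimizer are assumed names that do not appear in the post:

import torch

scaler = torch.cuda.amp.GradScaler()  # scales the loss to avoid float16 underflow

for inputs, targets in train_loader:  # train_loader, criterion, optimizer: assumed names
    inputs, targets = inputs.to(device), targets.to(device)
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # run the forward pass in float16 where safe
        prediction = model(inputs,
                input_frames  = train_data.n_frames_input,
                future_frames = train_data.n_frames_output,
                output_frames = train_data.n_frames_output,
                teacher_forcing = True,
                scheduled_sampling_ratio = scheduled_sampling_ratio)
        loss = criterion(prediction, targets)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()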

Any insights?

EDIT: resizing the pictures to lower dimensions (half of the original resolution) seems to help.
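For reference, a minimal way to do that downscaling in the data pipeline, assuming each frame is loaded as a PIL image or a [C, H, W] tensor (175x175 is simply half of 350x350):

from torchvision import transforms

# Halving the spatial resolution cuts per-frame memory by roughly 4x.
resize = transforms.Resize((175, 175))
frame = resize(frame)  # `frame` is an assumed per-frame PIL image or [C, H, W] tensor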
