CUDA out of memory in PyTorch: strange behavior and how to solve it
I am trying to train a model (a convLSTM) in PyTorch for video prediction. My "videos" are lists of 350x350 images (frames); I give N frames as input and need to predict N frames as output.
To train my network I am currently using an NVIDIA V100 with 12 cores and 12 GB of RAM. My inputs and outputs are 5th-order tensors of size [batch_size, n_frames, channels, height, width].
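For clarity, one input batch has roughly this shape (the single channel below is only illustrative, the exact channel count doesn't matter for the question):

import torch

# [batch_size, n_frames, channels, height, width]
dummy_input = torch.randn(1, 16, 1, 350, 350)
print(dummy_input.shape)  # torch.Size([1, 16, 1, 350, 350])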
The problem is that even with everything set to the bare minimum, i.e.
num_workers=1, batch_size=1, training-set size = 1 (just one video sample of 16 frames! Note that it does work with fewer than 16 frames.)
I get this error:
RuntimeError: CUDA out of memory. Tried to allocate 24.00 MiB (GPU 0; 32.00 GiB total capacity; 28.14 GiB already allocated; 96.54 MiB free; 28.55 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.
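(My understanding is that the max_split_size_mb mentioned in the message is an option of the CUDA caching allocator, set through the PYTORCH_CUDA_ALLOC_CONF environment variable before the first CUDA allocation; the value below is only an illustrative guess, not something I have verified helps here:)

import os

# Must be set before any CUDA memory is allocated; 128 is just an example value.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"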
I really don't know how to get rid of this error. My training set is more than 10K samples of 20 frames each, and I cannot even run my code with a single sample.
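Concretely, the loader is configured with roughly these minimal settings (a sketch; my actual Dataset class is omitted here):

from torch.utils.data import DataLoader

# train_data is my video Dataset; batch_size and num_workers are at the minimum described above.
train_loader = DataLoader(train_data, batch_size=1, num_workers=1, shuffle=True)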
I saw that this error is common, but I cannot find a way to solve it (since everything is already at the minimum). I tried to check my memory usage with nvidia-smi and torch.cuda.memory_summary(), but could not get any insight from them. Also, torch.cuda.empty_cache() does not help.
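For reference, this is roughly how I inspect memory (a sketch; the exact places where I call these in my script differ):

import torch

print(torch.cuda.memory_allocated() / 1024**2, "MiB allocated")  # memory held by live tensors
print(torch.cuda.memory_reserved() / 1024**2, "MiB reserved")    # memory cached by the allocator
print(torch.cuda.memory_summary())                               # detailed per-pool breakdown
torch.cuda.empty_cache()                                         # frees cached, unused blocks back to the driver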
I see, however, that PyTorch allocates a lot of space for itself; is that normal? In the code I don't do anything weird: I just check whether CUDA is available as device, and then set
model.to(device)
if multi_gpu:
    model = nn.DataParallel(model)
The error is raised at this point in my training loop:
prediction = model(inputs,
                   input_frames=train_data.n_frames_input,
                   future_frames=train_data.n_frames_output,
                   output_frames=train_data.n_frames_output,
                   teacher_forcing=True,
                   scheduled_sampling_ratio=scheduled_sampling_ratio)
Any insights?
EDIT: resizing the pictures to lower dimensions (half of the original resolution) seems to help.
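(By "resizing" I mean something like the following in the dataset transform; a sketch, my actual preprocessing code differs:)

from torchvision import transforms

# Halve the spatial resolution: 350x350 -> 175x175 (illustrative values).
resize = transforms.Compose([
    transforms.Resize((175, 175)),
    transforms.ToTensor(),
])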