CUDA out of memory when fine-tuning GPT2
RuntimeError: CUDA out of memory. Tried to allocate 144.00 MiB (GPU 0; 11.17 GiB total capacity; 10.49 GiB already allocated; 13.81 MiB free; 10.56 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
This is the error I am getting. I have tried playing around with the batch size, but to no avail. I am training on Google Colab.
This is the piece of code concerned with the error:
training_args = TrainingArguments(
output_dir="/content/",
num_train_epochs=EPOCHS,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
# gradient_accumulation_steps=BATCH_UPDATE,
evaluation_strategy="epoch",
save_strategy='epoch',
fp16=True,
fp16_opt_level=APEX_OPT_LEVEL,
warmup_steps=WARMUP_STEPS,
learning_rate=LR,
adam_epsilon=EPS,
weight_decay=0.01,
save_total_limit=1,
load_best_model_at_end=True,
)
Any solution?
Which model are you using? Just the standard gpt-2 from huggingface? I fine-tuned that model before on my own GPU, which has only 6GB, and was able to use a batch_size of 8 without a problem. I would try each of the following:

- batch_size - you already tried it, but did you go all the way down to a batch_size of 1? Does the problem occur even then? (A sketch that trades a smaller batch size for gradient accumulation follows after this list.)
- !nvidia-smi -L to see which GPU was allocated to you. If you see that you got a card with less than 24GB, set the notebook settings to None and then back to GPU to get a new one, or use Manage Sessions -> Terminate Session and then reallocate. Try a few times until you get a good GPU; your code might not work with 16GB or less but might just work with 24GB. In general, clearing your resources is a good idea in case something large is already loaded and is causing this problem in the first place. (A small in-notebook check is sketched below as well.)
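If a batch_size of 1 does fit but trains too slowly, one option is to combine a small per-device batch size with gradient accumulation, which the commented-out gradient_accumulation_steps line in the question already hints at. A minimal sketch, reusing the constants from the question (EPOCHS, LR, EPS, WARMUP_STEPS, APEX_OPT_LEVEL); the concrete values 4 and 4 are only illustrative and keep the same effective batch size of 16:

training_args = TrainingArguments(
    output_dir="/content/",
    num_train_epochs=EPOCHS,
    per_device_train_batch_size=4,   # smaller per-step batch to fit in GPU memory
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,   # 4 steps x batch of 4 = effective batch size of 16
    evaluation_strategy="epoch",
    save_strategy='epoch',
    fp16=True,
    fp16_opt_level=APEX_OPT_LEVEL,
    warmup_steps=WARMUP_STEPS,
    learning_rate=LR,
    adam_epsilon=EPS,
    weight_decay=0.01,
    save_total_limit=1,
    load_best_model_at_end=True,
)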
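Besides !nvidia-smi -L, you can also check from inside the notebook which card you got and how much of its memory PyTorch is using. A small check, assuming the session already has a GPU runtime attached:

import torch

# Name and total memory of the GPU Colab allocated to this session.
props = torch.cuda.get_device_properties(0)
print(f"{props.name}: {props.total_memory / 1024**3:.1f} GiB total")

# Short summary of the caching allocator (allocated vs. reserved memory).
print(torch.cuda.memory_summary(abbreviated=True))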
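The error message itself also suggests setting max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF to reduce fragmentation when reserved memory is much larger than allocated memory. A minimal sketch; the value 128 is just an assumption, and the variable has to be set before the first CUDA allocation (for example at the top of the notebook):

import os

# Limit the size of blocks the caching allocator keeps around unsplit,
# which can reduce fragmentation when "reserved" far exceeds "allocated".
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"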