DeepSpeed optimizer issue



I am trying to use the DeepSpeed optimizer so that I can fit more into memory and speed up my training. I keep running into this problem with the optimizer, and I am not sure what is causing it. The error is below:

Traceback (most recent call last):
  File "02232022_kat_repr_train.py", line 89, in <module>
    first_exp.run_experiment()
  File "experiments.py", line 28, in run_experiment
    self.trainer.train(self.hyperparameters["epochs"], self.hyperparameters["eval_period"])
  File "trainer.py", line 90, in train
    self.optimizer.step()
  File "/anaconda3/envs/mmlm/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1633, in step
    self.check_overflow()
  File "/anaconda3/envs/mmlm/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1913, in check_overflow
    self._check_overflow(partition_gradients)
  File "/anaconda3/envs/mmlm/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1818, in _check_overflow
    self.overflow = self.has_overflow(partition_gradients)
  File "/anaconda3/envs/mmlm/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1837, in has_overflow
    overflow = self.local_overflow if self.cpu_offload else self.has_overflow_partitioned_grads_serial(
  File "/anaconda3/envs/mmlm/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1830, in has_overflow_partitioned_grads_serial
    for j, grad in enumerate(self.averaged_gradients[i]):
KeyError: 0

My training code is pretty standard:

outputs = self.model(**batch)
loss = self.loss_function(outputs)
loss = self.loss_aggregation(loss)
epoch_loss += loss
loss.backward()
self.optimizer.backward(loss)
self.model.zero_grad()
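
For comparison, my understanding of the documented DeepSpeed pattern is that both backward and step go through the engine returned by deepspeed.initialize, rather than calling loss.backward() on the raw loss. A rough sketch of that pattern (model, optimizer, lr_scheduler, ds_config, data_loader, and loss_function are placeholder names, not my actual code):

import deepspeed

# model, optimizer, lr_scheduler, ds_config, data_loader, loss_function assumed defined elsewhere
model_engine, optimizer, _, lr_scheduler = deepspeed.initialize(
    model=model, optimizer=optimizer, lr_scheduler=lr_scheduler, config_params=ds_config)

for batch in data_loader:
    outputs = model_engine(**batch)
    loss = loss_function(outputs)
    model_engine.backward(loss)   # the engine scales the loss and reduces gradients
    model_engine.step()           # the engine steps the optimizer and zeroes gradients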

I initialize DeepSpeed like this:

self.model, self.optimizer, _, self.lr_scheduler = ds.initialize(
    model=self.model,
    config_params=self.deepspeed_config,
    optimizer=self.optimizer,
    lr_scheduler=self.lr_scheduler)

I am using the example configurations from here: https://github.com/microsoft/DeepSpeedExamples/blob/master/Megatron-LM/scripts/ds_zero2_config.json
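
For context, a minimal ZeRO stage-2 configuration along those lines, passed as a dict through config_params, looks roughly like this (the values here are illustrative, not copied from that file):

ds_config = {
    "train_batch_size": 32,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
}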

Does anyone know why I am getting this issue?

Thank you!
