DeepSpeed optimizer issue
I am trying to use the DeepSpeed optimizer so that I can fit more into memory and speed up my training. I keep running into a problem with the optimizer and I am not sure what is causing it. The error is below:
Traceback (most recent call last):
File "02232022_kat_repr_train.py", line 89, in <module>
first_exp.run_experiment()
File "experiments.py", line 28, in run_experiment
self.trainer.train(self.hyperparameters["epochs"], self.hyperparameters["eval_period"])
File "trainer.py", line 90, in train
self.optimizer.step()
File "/anaconda3/envs/mmlm/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1633, in step
self.check_overflow()
File "/anaconda3/envs/mmlm/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1913, in check_overflow
self._check_overflow(partition_gradients)
File "/anaconda3/envs/mmlm/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1818, in _check_overflow
self.overflow = self.has_overflow(partition_gradients)
File "/anaconda3/envs/mmlm/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1837, in has_overflow
overflow = self.local_overflow if self.cpu_offload else self.has_overflow_partitioned_grads_serial(
File "/anaconda3/envs/mmlm/lib/python3.8/site-packages/deepspeed/runtime/zero/stage_1_and_2.py", line 1830, in has_overflow_partitioned_grads_serial
for j, grad in enumerate(self.averaged_gradients[i]):
KeyError: 0
My training code is pretty standard:
outputs = self.model(**batch)
loss = self.loss_function(outputs)
loss = self.loss_aggregation(loss)
epoch_loss += loss
loss.backward()
self.optimizer.backward(loss)
self.model.zero_grad()
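For comparison, the engine-driven loop from the DeepSpeed getting-started examples looks roughly like this (a minimal sketch based on their docs, not my actual code; data_loader and loss_function are placeholders):
# Sketch of the loop DeepSpeed's docs describe; model_engine is the first value returned by deepspeed.initialize
for batch in data_loader:
    outputs = model_engine(**batch)
    loss = loss_function(outputs)
    model_engine.backward(loss)   # the engine handles gradient scaling and zeroing internally
    model_engine.step()           # the engine runs the optimizer and lr scheduler step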
I initialize DeepSpeed like this:
self.model, self.optimizer, _, self.lr_scheduler = ds.initialize(model=self.model, config_params=self.deepspeed_config, optimizer=self.optimizer, lr_scheduler=self.lr_scheduler)
I am using the example configuration from here: https://github.com/microsoft/DeepSpeedExamples/blob/master/Megatron-LM/scripts/ds_zero2_config.json
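For context, a ZeRO stage-2 config passed as a dict generally has roughly this shape (an abbreviated, illustrative sketch with placeholder values, not the exact contents of the linked file):
# Illustrative ZeRO stage-2 DeepSpeed config (placeholder values, not the linked example verbatim)
deepspeed_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
}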
Does anyone know why I am getting this issue?
Thank you!