StableBaselines3 - Why does calling "model.learn(50,000)" twice not give the same result as calling "model.learn(100,000)" once?

Posted on 2025-02-12 20:56:14

I am working on a Reinforcement Learning problem in StableBaselines3.

I am trying to understand why this code:

model = MaskablePPO(MaskableActorCriticPolicy, env, verbose=1, learning_rate=0.0003, gamma=0.975, seed=10, batch_size=256, clip_range=0.2)
model.learn(100000)

does not give exactly the same result as this code:

model = MaskablePPO(MaskableActorCriticPolicy, env, verbose=1, learning_rate=0.0003, gamma=0.975, seed=10, batch_size=256, clip_range=0.2)
model.learn(50000)
model.learn(50000)

I say they don't give the same result because, in both cases, I evaluated the model on a test set in a for-loop and the performance differed. Since I set deterministic=True in that loop and did not change the seed, the different performance must mean the networks are different, which means the training process was different.

I was under the impression that if I run model.learn() on an existing model, it would just pick up training where it previously left off, but I guess that's incorrect.

Can someone help me understand why those two situations deliver different results?
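
One factor that may contribute to the difference described above can be checked with plain arithmetic. PPO-style algorithms in Stable-Baselines3 collect experience in whole rollouts of n_steps steps per environment, so a learn() call can overshoot the requested total. The sketch below assumes the default n_steps=2048 and a single environment, neither of which is stated in the question; steps_actually_collected is a hypothetical helper written for illustration, not a library function.

```python
import math


def steps_actually_collected(total_timesteps, n_steps=2048, n_envs=1):
    """Hypothetical helper: round the requested budget up to whole rollouts.

    Assumes training keeps collecting full rollouts of n_steps * n_envs
    steps until the step counter reaches total_timesteps.
    """
    rollout = n_steps * n_envs
    return math.ceil(total_timesteps / rollout) * rollout


# One call with a budget of 100,000 steps.
once = steps_actually_collected(100_000)

# Two calls with a budget of 50,000 steps each, assuming the step
# counter starts from zero on each call.
twice = 2 * steps_actually_collected(50_000)

print("single call:", once)
print("two calls:  ", twice)
print("difference: ", twice - once)
```

Under those assumptions the two runs would not see the same number of environment steps or gradient updates, so identical networks would not be expected. learn() also accepts a reset_num_timesteps argument (default True), which may affect how a second call continues; whether changing it makes the two runs match exactly is not something this sketch establishes.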
