StableBaselines3 - Why does calling model.learn(50,000) twice not give the same result as calling model.learn(100,000) once?
I am working on a Reinforcement Learning problem in StableBaselines3.
I am trying to understand why this code:
model = MaskablePPO(MaskableActorCriticPolicy, env, verbose=1, learning_rate=0.0003, gamma=0.975, seed=10, batch_size=256, clip_range=0.2)
model.learn(100000)
does not give the exact same result as this code:
model = MaskablePPO(MaskableActorCriticPolicy, env, verbose=1, learning_rate=0.0003, gamma=0.975, seed=10, batch_size=256, clip_range=0.2)
model.learn(50000)
model.learn(50000)
I say they don't give the same results because, in both cases, I evaluated the model on a test set through a for-loop, and the performance differed. Given that I set deterministic=True in that loop and didn't change the seed, the different performance must mean the networks are different, which in turn means the training process was different.
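For reference, the test loop is roughly the sketch below (test_env, n_episodes, and the gymnasium-style reset/step signature are stand-ins for my actual setup; get_action_masks is the sb3-contrib helper):

import numpy as np
from sb3_contrib.common.maskable.utils import get_action_masks

def evaluate(model, test_env, n_episodes=10):
    # Roll the trained model out deterministically and return the mean episode reward.
    episode_rewards = []
    for _ in range(n_episodes):
        obs, _ = test_env.reset()  # gymnasium-style API assumed
        done = False
        total = 0.0
        while not done:
            mask = get_action_masks(test_env)
            # deterministic=True always picks the argmax action, so any
            # change in results can only come from the trained weights.
            action, _ = model.predict(obs, action_masks=mask, deterministic=True)
            obs, reward, terminated, truncated, _ = test_env.step(action)
            done = terminated or truncated
            total += reward
        episode_rewards.append(total)
    return float(np.mean(episode_rewards))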
I was under the impression that calling model.learn() again on an existing model would simply pick up training where it previously left off, but I guess that's incorrect.
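For completeness: learn() also takes a reset_num_timesteps argument (default True), which resets the internal timestep counter, and with it anything scheduled on training progress, on every call. My learning_rate and clip_range are constants, so I don't know whether that matters here, but a continued run would be written like this:

# Keep the internal timestep counter across calls instead of
# resetting it; reset_num_timesteps defaults to True.
model.learn(50000)
model.learn(50000, reset_num_timesteps=False)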
Can someone help me understand why those two situations deliver different results?