Is there any benefit to using distinctly different models for the Actor and the Critic?
In Actor-Critic methods the Actor and Critic are assigned two complementary, but different goals. I'm trying to understand whether the differences between these goals (updating a policy and updating a value function) are large enough to warrant different models for the Actor and the Critic, or whether they are of similar enough complexity that the same model should be reused for simplicity. I realize this is probably situational, but I'm not sure in what way. For example, does the balance shift as model complexity grows?
Please let me know if there are any rules of thumb for this, or if you know of a specific publication that addresses the issue.
Comments (1)
The empirical results suggest the exact opposite: it is important to have the same network doing both (up to some final layer/head). The main reason is that learning the value network (critic) provides a signal for shaping the representation used by the policy (actor) that would otherwise be nearly impossible to get.
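Concretely, the "same network up to a final layer/head" design usually looks like the sketch below. This is a minimal PyTorch example assuming a discrete action space; the class name, layer sizes, and activations are illustrative choices of mine, not something from the original answer.

```python
# A minimal sketch of a shared-trunk actor-critic network (illustrative only).
import torch
import torch.nn as nn


class SharedActorCritic(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        # Shared trunk: gradients from BOTH the policy loss and the value loss
        # flow through these layers, so the critic's learning signal also
        # shapes the features the actor uses.
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.policy_head = nn.Linear(hidden, n_actions)  # actor: action logits
        self.value_head = nn.Linear(hidden, 1)           # critic: state value

    def forward(self, obs: torch.Tensor):
        features = self.trunk(obs)
        return self.policy_head(features), self.value_head(features).squeeze(-1)
```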
In fact, if you think about it, these are extremely similar goals, since for an optimal deterministic policy

$$\pi^*(s) = \arg\max_a \big[\, r(s, a) + \gamma V^*(T(s, a)) \,\big],$$

where T is the transition dynamics.
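To see how the critic's signal reaches the actor's representation in practice, a typical simplified joint update sums the two losses so that one backward pass shapes the shared trunk. The dummy data, shapes, and loss coefficient below are placeholder assumptions built on the sketch above.

```python
# Illustrative joint update using the SharedActorCritic sketch above.
import torch

model = SharedActorCritic(obs_dim=4, n_actions=2)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

obs = torch.randn(32, 4)              # batch of observations (dummy data)
actions = torch.randint(0, 2, (32,))  # actions taken in the rollout
advantages = torch.randn(32)          # advantage estimates (e.g. from GAE)
value_targets = torch.randn(32)       # bootstrapped return targets

logits, values = model(obs)
dist = torch.distributions.Categorical(logits=logits)

policy_loss = -(dist.log_prob(actions) * advantages).mean()  # actor objective
value_loss = (values - value_targets).pow(2).mean()          # critic objective
loss = policy_loss + 0.5 * value_loss  # both terms back-propagate through the shared trunk

optimizer.zero_grad()
loss.backward()
optimizer.step()
```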