How to learn a reward function in a Markov Decision Process
What's the appropriate way to update your R(s) function during Q-learning? For example, say an agent visits state s1 five times, and receives rewards [0,0,1,1,0]. Should I calculate the mean reward, e.g. R(s1) = sum([0,0,1,1,0])/5? Or should I use a moving average that gives greater weight to the more recent reward values received for that state? Most of the descriptions of Q-learning I've read treat R(s) as some sort of constant, and never seem to cover how you might learn this value over time as experience is accumulated.
EDIT: I may be confusing the R(s) in Q-Learning with R(s,s') in a Markov Decision Process. The question remains similar. When learning an MDP, what's the best way to update R(s,s')?
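For concreteness, here is a minimal sketch (not from the original post) of the two updates the question contrasts, an incremental sample average and an exponential moving average, applied to an estimate of R(s, s'). The `RewardEstimator` class and its names are illustrative assumptions only:

```python
from collections import defaultdict

class RewardEstimator:
    """Estimate R(s, s') from observed rewards."""

    def __init__(self, alpha=None):
        # alpha=None -> plain sample average; alpha in (0, 1] -> exponential moving average
        self.alpha = alpha
        self.estimates = defaultdict(float)  # (s, s') -> estimated reward
        self.counts = defaultdict(int)       # (s, s') -> number of observations

    def update(self, s, s_next, reward):
        key = (s, s_next)
        self.counts[key] += 1
        # step size 1/n reproduces the sample mean; a constant alpha weights recent rewards more
        step = (1.0 / self.counts[key]) if self.alpha is None else self.alpha
        self.estimates[key] += step * (reward - self.estimates[key])
        return self.estimates[key]

# The five rewards from the question, all observed on the same transition:
est = RewardEstimator()
for r in [0, 0, 1, 1, 0]:
    est.update("s1", "s2", r)
print(est.estimates[("s1", "s2")])  # ~0.4, the sample mean 2/5
```

Passing a constant `alpha` (e.g. `RewardEstimator(alpha=0.1)`) gives the moving-average behaviour, which tracks non-stationary rewards at the cost of never fully converging.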
2 Answers
Q-Learning keeps a running estimate of the action value for each state under the greedy policy. It updates these estimates from the reward observed at each step, and the value of a state under the greedy policy equals the value of its best action. The canonical description of Q-Learning is given in Reinforcement Learning: An Introduction.
There is no "best" way to update, but SARSA is a good default. SARSA is similar to Q-Learning, except that it learns the value of the policy it actually follows, rather than of the greedy policy.
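As a rough illustration of the two update rules named in this answer, here is a minimal sketch assuming a tabular setting and an epsilon-greedy behaviour policy; the function names and the alpha/gamma/epsilon values are illustrative, not from the answer:

```python
import random
from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.99, 0.1   # illustrative hyperparameters
Q = defaultdict(float)                   # (state, action) -> estimated action value

def epsilon_greedy(state, actions):
    # behaviour policy used while learning
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def q_learning_update(s, a, r, s_next, actions):
    # off-policy: bootstrap from the best (greedy) action in the next state
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(s, a, r, s_next, a_next):
    # on-policy: bootstrap from the action the agent actually takes next
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```

Note that the sampled reward r enters the target directly; neither rule maintains a separate estimate of R(s).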
In standard model-free RL (like Q-learning), you do not learn the reward function. What you learn is the value function or Q-value function. Rewards are obtained by interacting with the environment, and you estimate the expected discounted sum of rewards accumulated over time for state-action pairs.
If you use a model-based approach, the situation is different: you try to learn a model of the environment, that is, the transition and reward functions. But that is not the case for Q-learning.
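As a hedged sketch of what that model-based alternative might look like, the following estimates the transition probabilities and the reward function from observed transitions by simple counting and averaging; all names here are assumptions for illustration:

```python
from collections import defaultdict

transition_counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': visit count}
reward_sums = defaultdict(float)                           # (s, a, s') -> summed reward
reward_counts = defaultdict(int)                           # (s, a, s') -> visit count

def observe(s, a, r, s_next):
    # record one piece of experience
    transition_counts[(s, a)][s_next] += 1
    reward_sums[(s, a, s_next)] += r
    reward_counts[(s, a, s_next)] += 1

def estimated_transition(s, a, s_next):
    # empirical probability of landing in s_next after taking a in s
    total = sum(transition_counts[(s, a)].values())
    return transition_counts[(s, a)][s_next] / total if total else 0.0

def estimated_reward(s, a, s_next):
    # empirical mean reward for the (s, a, s') transition
    n = reward_counts[(s, a, s_next)]
    return reward_sums[(s, a, s_next)] / n if n else 0.0
```

These estimates could then be fed into a planning method such as value iteration, which is where a model-based approach would use the learned R(s, a, s').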