There is a pattern to dealing with most MDP problems, but I think you've probably omitted some information from the problem description; most likely it has to do with the state you're trying to reach, or with the way an episode ends (what happens if you run off the edge of the grid). I've done my best to answer your questions, and I've appended a primer on the process I use to deal with these types of problems.
Firstly, utility is a fairly abstract measure of how much you want to be in a given state. It's definitely possible to have two states with equal utility, even when you measure utility with a simple heuristic (Euclidean or Manhattan distance). In this case, I assume that utility values and rewards are interchangeable.
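As a quick illustration of that point, a Manhattan-distance heuristic happily produces ties. This is just a sketch; the goal cell and the two example states are my own assumptions, not values from your question:

```python
# Sketch of a distance-based utility heuristic on a grid world.
# The goal cell and example states below are assumptions for illustration.

def manhattan_utility(state, goal):
    """Higher (less negative) utility the closer a cell is to the goal."""
    (r, c), (gr, gc) = state, goal
    return -(abs(r - gr) + abs(c - gc))  # negate distance so closer = better

# Two different states can end up with exactly the same utility:
print(manhattan_utility((0, 1), (2, 2)))  # -3
print(manhattan_utility((1, 0), (2, 2)))  # -3
```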
In the long term, the objective in these types of problems tends to be: how do you maximise your expected (long-term) reward? The discount factor, gamma, controls how much emphasis you place on the current state versus where you would like to end up. Effectively, you can think of gamma as a spectrum going from 'do the thing that benefits me most in this timestep' at one extreme to 'explore all my options, and go back to the best one' at the other. Sutton and Barto, in their book on reinforcement learning, have some really nice explanations of how this works.
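To make gamma's role concrete, here is a minimal sketch of the discounted return it controls; the reward sequence is invented purely for illustration:

```python
# Sketch: how gamma trades off immediate versus future reward.
# The reward sequence here is an invented example.

def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite episode."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [0, 0, 0, 10]                  # one large reward, four steps away
print(discounted_return(rewards, 0.1))   # ~0.01 -> myopic, barely values it
print(discounted_return(rewards, 0.9))   # ~7.29 -> far-sighted, values it highly
```

With gamma near 0 the agent only cares about the immediate step; with gamma near 1 it is willing to work towards rewards that are several steps away.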
Before you get started, go back through the question and make sure that you can confidently answer the following questions.
So, what are the answers to those questions?
How can we check that this makes sense for this problem?
Edit: answering the request for the transition probabilities to the target state. The notation below assumes
ad. 1) Probably it is not the case that the robot always has to move -- i.e., those 30% mean "ah, now I rest a bit" or "there was no power to move at all".
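If that reading is right, the transition model would send the robot where it was told with probability 0.7 and leave it in place with probability 0.3. A rough sketch, where the grid size, the edge handling, and the 0.7/0.3 split are all assumptions rather than anything stated in the question:

```python
# Sketch of one possible transition model: the commanded move succeeds with
# probability 0.7, and the robot stays put with probability 0.3.
# Grid size and "stay in place at the edge" handling are assumptions.

GRID_ROWS, GRID_COLS = 4, 4
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def transitions(state, action):
    """Return a list of (next_state, probability) pairs."""
    r, c = state
    dr, dc = ACTIONS[action]
    # Clamp to the grid: trying to run off the edge leaves the robot in place,
    # in which case both outcomes below coincide on the same state.
    nr = min(max(r + dr, 0), GRID_ROWS - 1)
    nc = min(max(c + dc, 0), GRID_COLS - 1)
    return [((nr, nc), 0.7), ((r, c), 0.3)]

print(transitions((0, 0), "right"))  # [((0, 1), 0.7), ((0, 0), 0.3)]
```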
I've formulated this problem as a Finite-Horizon Markov Decision Process and solved it via Policy Iteration. To the right of each iteration, there is a color-coded grid representation of the recommended actions for each state as well as the original reward grid/matrix.
Review the final policy/strategy at Stage 4. Does it agree with your intuition?
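For anyone who wants to play with the idea, below is a minimal, self-contained sketch of the same kind of computation: a small finite-horizon grid-world MDP solved by backward induction over a fixed number of stages. The 3x3 grid, the reward values, the 0.7/0.3 transition model, and the horizon of 4 are all my own assumptions for illustration; they are not the numbers behind the grids described above:

```python
# Sketch: finite-horizon grid-world MDP solved by backward induction.
# The reward grid, the 0.7/0.3 dynamics, and the horizon are assumptions.

ROWS, COLS = 3, 3
HORIZON = 4
GAMMA = 1.0  # finite-horizon problems are often left undiscounted
REWARD = [[-1, -1, -1],
          [-1, -1, -1],
          [-1, -1, 10]]  # assumed goal reward in the bottom-right cell

ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step_outcomes(state, action):
    """Assumed dynamics: move as commanded with prob 0.7, stay put with prob 0.3."""
    r, c = state
    dr, dc = ACTIONS[action]
    nr = min(max(r + dr, 0), ROWS - 1)
    nc = min(max(c + dc, 0), COLS - 1)
    return [((nr, nc), 0.7), ((r, c), 0.3)]

# V[state] = best expected reward achievable with `stage` steps still to go.
V = {(r, c): 0.0 for r in range(ROWS) for c in range(COLS)}
policy = {}
for stage in range(1, HORIZON + 1):
    new_V = {}
    for s in V:
        best_action, best_value = None, float("-inf")
        for a in ACTIONS:
            q = sum(p * (REWARD[ns[0]][ns[1]] + GAMMA * V[ns])
                    for ns, p in step_outcomes(s, a))
            if q > best_value:
                best_action, best_value = a, q
        new_V[s] = best_value
        policy[s] = best_action  # action recommended with `stage` steps left
    V = new_V

print("Policy with", HORIZON, "steps to go:")
for r in range(ROWS):
    print([policy[(r, c)] for c in range(COLS)])
```

Printing the policy row by row gives a grid of recommended actions, loosely analogous to the colour-coded grids described above.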