马尔可夫决策过程问题

Posted 2024-08-19 08:11:09

Comments (3)

予囚 2024-08-26 08:11:09

There is a pattern to dealing with most MDP problems, but I think you've probably omitted some information from the problem description; most likely it has to do with the state you're trying to reach, or with the way an episode ends (what happens if you run off the edge of the grid). I've done my best to answer your questions, but I've also appended a primer on the process I use to deal with these types of problems.

Firstly, utility is a fairly abstract measure of how much you want to be in a given state. It's definitely possible to have two states with equal utility, even when you measure utility with simple heuristics (Euclidean or Manhattan distance). In this case, I assume that utility and reward are interchangeable.

In the long term, the objective in these types of problems tends to be: how do you maximise your expected (long-term) reward? The discount factor, gamma, controls how much emphasis you place on the current state versus where you would like to end up - effectively you can think of gamma as a spectrum going from 'do the thing that benefits me most in this timestep' to, at the other extreme, 'explore all my options, and go back to the best one'. Sutton and Barto, in their book on reinforcement learning, have some really nice explanations of how this works.
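To make the effect of gamma concrete, here is a small sketch with toy numbers of my own (not taken from the question), showing how the same reward sequence is valued under a small and a large discount factor:

# Toy illustration of the discount factor gamma; the rewards are made up.
rewards = [1.0, 0.0, 0.0, 10.0]   # hypothetical reward received at steps 0..3

def discounted_return(rewards, gamma):
    # Sum of gamma**t * r_t over the sequence.
    return sum(gamma ** t * r for t, r in enumerate(rewards))

print(discounted_return(rewards, 0.1))   # ~1.01: near-sighted, the step-0 reward dominates
print(discounted_return(rewards, 0.9))   # ~8.29: far-sighted, the delayed 10 matters most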


Before you get started, go back through the question and make sure that you can confidently answer the following questions.

  1. What is a state? How many states are there?
  2. What is an action? How many actions are there?
  3. If you start in state u, and you apply an action a, what is the probability of reaching a new state v?

So the answers to the questions?

  1. A state is a vector (x,y). The grid is 5 by 5, so there are 25 states.
  2. There are four possible actions, {E, N, S, W}.
  3. The probability of successfully reaching an adjacent state after applying a suitable action is 0.7; the probability of not moving (staying in the same state) is 0.3. Assuming (0,0) is the top left cell and (4,4) is the bottom right cell, the following table shows a small subset of all of the possible transitions, and a short code sketch that generates the full model follows the table.
Start State Action           Final State    Probability
---------------------------------------------------
(0,0)           E               (0,0)          0.3
(0,0)           E               (1,0)          0.7
(0,0)           E               (2,0)          0
...
(0,0)           E               (0,1)          0
... 
(0,0)           E               (4,4)          0
(0,0)           N               (0,0)          0.3
...
(4,4)           W               (3,4)          0.7
(4,4)           W               (4,4)          0.3
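Rather than writing the table out by hand, the transition model can be generated programmatically. The sketch below is only my reading of the problem: the 0.7/0.3 split comes from the question, but the rule that an action which would leave the grid keeps the agent in place is an assumption, since the post doesn't say what happens at the edges, and the N/S direction vectors are chosen so the probabilities in the Edit further down come out right.

# Sketch of the transition model, assuming (0,0) is top-left and (4,4) is
# bottom-right. The edge behaviour (staying put when a move would leave the
# grid) is an assumption, not something stated in the question.
GRID = 5
ACTIONS = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}

def transition(u, a):
    # Return {v: P(v | u, a)} for state u = (x, y) and action a.
    x, y = u
    dx, dy = ACTIONS[a]
    v = (x + dx, y + dy)
    if not (0 <= v[0] < GRID and 0 <= v[1] < GRID):
        return {u: 1.0}           # assumed: bumping the edge means no move
    return {v: 0.7, u: 0.3}       # 0.7 the move succeeds, 0.3 stay put

print(transition((0, 0), "E"))    # {(1, 0): 0.7, (0, 0): 0.3}, matching the table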

How can we check that this makes sense for this problem?

  1. Check that the table covers an appropriate number of cases. On a 5 by 5 grid there are 25 states and 4 actions, so there should be 100 start state / action pairs.
  2. Check to make sure that for a start state / action pair, only two entries have a non-zero probability of occurring. A short snippet automating both checks follows this list.
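Continuing the sketch above (this relies on transition(), GRID and ACTIONS from the earlier snippet, and on the assumed edge rule, under which an edge pair can have a single non-zero entry rather than two):

from itertools import product

states = list(product(range(GRID), range(GRID)))
pairs = [(u, a) for u in states for a in ACTIONS]
assert len(pairs) == 100                          # 25 states x 4 actions

for u, a in pairs:
    dist = transition(u, a)
    assert abs(sum(dist.values()) - 1.0) < 1e-9   # outgoing probabilities sum to 1
    assert 1 <= len(dist) <= 2                    # at most two non-zero outcomes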

Edit: answering the request for the transition probabilities into the target state; a quick programmatic check follows the list. The notation below assumes

  • v is the final state
  • u is the source state
  • a is the action; where it isn't mentioned, it's implied that the action applied isn't relevant.
P( v=(3,3) | u=(2,3), a=E ) = 0.7
P( v=(3,3) | u=(4,3), a=W ) = 0.7
P( v=(3,3) | u=(3,2), a=N ) = 0.7
P( v=(3,3) | u=(3,4), a=S ) = 0.7
P( v=(3,3) | u=(3,3) ) = 0.3
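These values can also be read straight off the transition() sketch above (its N/S vectors were picked so that these numbers reproduce):

# Querying the sketched model for the target state (3, 3).
print(transition((2, 3), "E").get((3, 3), 0.0))   # 0.7
print(transition((4, 3), "W").get((3, 3), 0.0))   # 0.7
print(transition((3, 2), "N").get((3, 3), 0.0))   # 0.7
print(transition((3, 4), "S").get((3, 3), 0.0))   # 0.7
print(transition((3, 3), "E").get((3, 3), 0.0))   # 0.3, the same for any action from (3,3)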
往日情怀 2024-08-26 08:11:09

ad 1) Probably it is not the case that the robot always has to move -- i.e. those 30% are "ah, now I rest a bit" or "there was no power to move at all".

埋情葬爱 2024-08-26 08:11:09

I've formulated this problem as a Finite-Horizon Markov Decision Process and solved it via Policy Iteration. To the right of each iteration, there is a color-coded grid representation of the recommended actions for each state as well as the original reward grid/matrix.

Review the final policy/strategy at Stage 4. Does it agree with your intuition?

[Images: the original reward grid/matrix and color-coded grids of the recommended actions at each iteration, Stages 1-4.]
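For anyone who wants to reproduce something like this without the original code, below is a minimal backward-induction sketch for the finite-horizon solve. It is not the implementation behind the figures: the reward grid used there isn't given in the post, so the reward values are placeholders, and the transition rule (0.7 success, 0.3 stay, staying in place at the edges) is my reading of the question.

# Minimal backward-induction sketch of a finite-horizon solve on the 5x5 grid.
GRID, HORIZON = 5, 4
ACTIONS = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}
STATES = [(x, y) for x in range(GRID) for y in range(GRID)]
REWARD = {u: 0.0 for u in STATES}   # placeholder: substitute the actual reward grid
REWARD[(3, 3)] = 1.0                # assumed goal reward, for illustration only

def transition(u, a):
    # {v: P(v | u, a)}: the move succeeds w.p. 0.7, otherwise stay put.
    x, y = u
    dx, dy = ACTIONS[a]
    v = (x + dx, y + dy)
    if not (0 <= v[0] < GRID and 0 <= v[1] < GRID):
        return {u: 1.0}             # assumed: a move off the grid leaves you in place
    return {v: 0.7, u: 0.3}

# Work backwards from the horizon: V is the optimal value-to-go and policy the
# greedy action at each state for the current number of remaining stages.
V = {u: 0.0 for u in STATES}
policy = {}
for stage in range(HORIZON):
    V_new = {}
    for u in STATES:
        q = {a: REWARD[u] + sum(p * V[v] for v, p in transition(u, a).items())
             for a in ACTIONS}
        best = max(q, key=q.get)
        V_new[u], policy[u] = q[best], best
    V = V_new

print(policy[(2, 3)])               # 'E' with the placeholder reward: head for (3,3)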
