Reducing the number of Markov states in reinforcement learning
I've started toying with reinforcement learning (using the Sutton book). What I fail to fully understand is the paradox between having to reduce the Markov state space on the one hand, while on the other hand not making assumptions about what's important and what's not.
Background
E.g., in the checkers example, Sutton says that one should not assign rewards to certain actions in the game, such as capturing an opponent's piece. He claims this may optimize the AI for taking pieces rather than winning the game. Thus, rewards should only be given for the result you want to achieve (e.g., winning the game).
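In code, that advice could look something like the sketch below; the `game` object with its `is_over()` and `winner` members is a hypothetical interface, not something from the book:

```python
def reward(game, player):
    """Terminal-only reward, following Sutton's advice.

    No credit is given for intermediate events such as capturing a piece;
    the agent is only rewarded for the outcome of the whole game.
    The `game` interface here is hypothetical.
    """
    if not game.is_over():
        return 0.0                        # captures earn nothing by themselves
    return 1.0 if game.winner == player else -1.0
```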
Question 1
Assume a (Texas hold'em) poker AI with a Markov state consisting only of the player's hand and the cards on the table. This has around 52*51*50*49*48*47*46 / (1*2*3*4*5*6*7) states. Now assume we want the AI to take the players' money pools and their bets into consideration. This makes the Markov state space approach an "infinite number of combinations" if we assume 8 players each holding between $1 and $200,000.
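For reference, that product is the number of 7-card combinations, C(52, 7), which a quick Python check confirms to be roughly 1.3 × 10^8:

```python
import math

# Number of distinct 7-card combinations (2 hole cards + 5 board cards),
# ignoring order: C(52, 7).
n_states = math.comb(52, 7)
print(n_states)  # 133784560, roughly 1.3e8
```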
Question 2
One state-reduction strategy could be to divide each player's cash into poor, medium, or rich. This drastically reduces our state space; however, how do I know a) that 3 groups are sufficient, and b) what the discriminating limits for each group should be?
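As a concrete illustration of that strategy, the discretization could be as simple as the sketch below; the $1,000 and $10,000 cut-offs are arbitrary placeholders, which is exactly the problem raised in b):

```python
def cash_bucket(cash, limits=(1_000, 10_000)):
    """Map a raw cash amount to one of three coarse groups.

    The cut-off values are arbitrary placeholders; choosing them well
    is exactly the open question in 2b.
    """
    poor_limit, rich_limit = limits
    if cash < poor_limit:
        return "poor"
    if cash < rich_limit:
        return "medium"
    return "rich"
```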
cheers,
3 Answers
The general approach is to use function approximation to reduce the state space when it gets too large. The key here is that you are generalizing rewards between states that are similar. Of course, this requires that you come up with meaningful features by exploiting domain knowledge. There are, unfortunately, no algorithms that solve both the feature selection problem and the control problem together with any guarantees about optimality (in polynomial time), nor do we expect any to be invented.
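As a rough sketch of what that looks like in practice, here is a minimal linear (semi-gradient) TD(0) value update over a hand-crafted feature vector; the feature function itself is the part that requires domain knowledge, and everything here is illustrative rather than a prescribed implementation:

```python
import numpy as np

def td0_update(w, phi_s, reward, phi_s_next, alpha=0.01, gamma=0.99):
    """One semi-gradient TD(0) step for a linear value function v(s) = w . phi(s).

    States with similar feature vectors share weights, so reward information
    generalizes between them. Pass an all-zero phi_s_next for terminal states.
    """
    td_error = reward + gamma * (w @ phi_s_next) - (w @ phi_s)
    return w + alpha * td_error * phi_s
```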
To answer your questions: 1) even beginner-level performance in 8-player limit Texas hold'em is far beyond current state-of-the-art research. Check out the current "world's best computer poker player" research at http://poker.cs.ualberta.ca/. That said, you could try to divide up the space into arbitrary binary features like: (player[1].cash > 1000) 0:1, (player[1].cash > 2500) 0:1, etc.
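Those indicator features could be collected into a vector along these lines (a sketch; the thresholds and the `players` structure are made up), which could then serve as `phi_s` in a linear approximator like the one sketched above:

```python
import numpy as np

def cash_features(players, thresholds=(1_000, 2_500, 10_000)):
    """Binary indicators of the form (player[i].cash > t) -> 0/1.

    `players` is assumed to be a list of objects with a `cash` attribute;
    the thresholds are arbitrary examples, as in the answer above.
    """
    return np.array([
        1.0 if p.cash > t else 0.0
        for p in players
        for t in thresholds
    ])
```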
2) It's hard to know how good your representation is; usually people just run it until it starts to converge and see how well it performs...
A proposed approach to reducing the state space in RL is to use a state-action hierarchy. Instead of having a single state variable X, you would break it up into smaller variables, say x1, x2, x3. Then you measure their transition frequencies and determine dependencies between them (e.g. x1 usually changes when x2=abc). You can then form a policy describing how best to transition the faster-changing variables in order to change the slower-changing ones, and thereby maximize the reward.
This approach is still relatively new, and I'm not aware of any public implementations of it. However, there are several papers proposing possible implementations. The MAXQ algorithm assumes a human-defined hierarchy, whereas the HEXQ algorithm describes a method of learning the hierarchy as well as the policies.
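A very rough sketch of the first step, measuring how often one variable changes conditioned on another, might look like this; the variable names and the trajectory format are assumptions, not part of MAXQ or HEXQ itself:

```python
from collections import defaultdict

def change_frequencies(trajectory):
    """Estimate how often x1 changes, grouped by the current value of x2.

    `trajectory` is assumed to be a list of dicts such as
    {"x1": ..., "x2": ..., "x3": ...} observed at successive time steps.
    HEXQ-style methods use statistics like these to order variables from
    fast-changing to slow-changing before building the hierarchy.
    """
    changes = defaultdict(int)
    totals = defaultdict(int)
    for prev, curr in zip(trajectory, trajectory[1:]):
        totals[prev["x2"]] += 1
        if curr["x1"] != prev["x1"]:
            changes[prev["x2"]] += 1
    return {value: changes[value] / totals[value] for value in totals}
```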
To add to @fairidox's point (which I agree with): if the pot of cash players have is between $1 and $200,000 then, given function approximation as they say, it is logical that a player would not alter their behaviour whether they are in a state of having $1000 or $1001. Now, as you mention in question 2, there is a problem with choosing these arbitrary boundaries.
The same logic applies to a robot in a physical space, where its state in one position is likely to have the same value as a state 1mm to the right. If we did not approximate, then we could argue that we should also have a new state for 0.5mm, 0.25mm, etc. Sutton and Barto (and the David Silver lectures) discuss tile coding for this, which I think may also be appropriate for your poker example.
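For a one-dimensional quantity like a player's cash, tile coding amounts to several overlapping discretizations, each offset slightly; a minimal sketch (tile width, number of tilings, and range are all arbitrary choices) could be:

```python
import numpy as np

def tile_code_1d(x, n_tilings=4, tile_width=1_000.0, x_max=200_000.0):
    """Tile-code a scalar in [0, x_max] into one active tile per tiling.

    Each tiling is a uniform grid of width `tile_width`, shifted by a
    fraction of the tile width, so nearby values (e.g. $1000 vs $1001)
    activate mostly the same tiles. All the constants are arbitrary.
    """
    tiles_per_tiling = int(np.ceil(x_max / tile_width)) + 1
    features = np.zeros(n_tilings * tiles_per_tiling)
    for t in range(n_tilings):
        offset = t * tile_width / n_tilings
        index = int((x + offset) // tile_width)
        features[t * tiles_per_tiling + index] = 1.0
    return features
```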