Improving Q-Learning
I am currently using Q-Learning to try to teach a bot how to move in a room filled with walls/obstacles. It must start from any position in the room and get to the goal state (this might be, for example, the tile that has a door).

Currently, when it wants to move to another tile, it goes to that tile, but I was thinking that in the future I might add a random chance of ending up on a different tile instead. It can only move up, down, left and right. Reaching the goal state yields +100 and all other actions yield 0.

I am using the algorithm found here, which can be seen in the image below.
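To make the setup concrete, here is a rough sketch of how I understand the tabular update applying to my grid. The grid size, walls, parameter values and names (`step`, `Q`, etc.) are just my own placeholders for illustration, not taken from the linked algorithm:

```python
import random

# A rough sketch of tabular Q-Learning on a small placeholder grid.
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]    # up, down, left, right
ROWS, COLS = 10, 10                             # small grid, just for illustration
GOAL = (9, 9)
WALLS = {(3, 3), (3, 4), (3, 5)}                # example obstacles
ALPHA, GAMMA, EPISODES = 0.5, 0.9, 2000

# Q[state] is a list of four action values.
Q = {(r, c): [0.0] * 4 for r in range(ROWS) for c in range(COLS)}

def step(state, a_idx):
    """Apply an action; bumping into a wall or the border leaves the bot in place."""
    dr, dc = ACTIONS[a_idx]
    nxt = (state[0] + dr, state[1] + dc)
    if nxt in WALLS or not (0 <= nxt[0] < ROWS and 0 <= nxt[1] < COLS):
        nxt = state
    reward = 100.0 if nxt == GOAL else 0.0
    return nxt, reward

for _ in range(EPISODES):
    s = (random.randrange(ROWS), random.randrange(COLS))    # start anywhere...
    while s in WALLS:                                        # ...but not inside an obstacle
        s = (random.randrange(ROWS), random.randrange(COLS))
    while s != GOAL:
        a = random.randrange(4)                              # purely random exploration
        s2, r = step(s, a)
        # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
        Q[s][a] += ALPHA * (r + GAMMA * max(Q[s2]) - Q[s][a])
        s = s2
```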
Now, regarding this, I have some questions:
- When using Q-Learning, a bit like with Neural Networks, must I make a distinction between a learning phase and a using phase? I mean, it seems that what they show in the first picture is a learning one and what is in the second picture is a using one.

- I read somewhere that it would take an infinite number of steps to reach the optimal Q-values table. Is that true? I'd say it isn't, but I must be missing something here.

I've also heard about TD (Temporal Differences), which seems to be represented by the following expression:

Q(a, s) = Q(a, s) + alpha * [R(a, s) + gamma * max_{a'} Q(a', s') - Q(a, s)]

which, for alpha = 1, seems to be the same as the one shown first in the picture. What difference does that gamma make here?

- I have run into some complications when I try a very big room (300x200 pixels, for example). As it essentially runs randomly, if the room is very big then it will take a lot of time to go randomly from the first state to the goal state. What methods can I use to speed it up? I thought of maybe having a table filled with trues and falses, recording whether in the current episode I have already been in a given state or not. If I have been there, I'd discard that move; if not, I'd go there. If I had already been in all the neighbouring states, I'd pick a random one. That way it would be just like what I am doing now, except that I'd repeat states less often than I currently do (a rough sketch of this idea follows after this list).

- I'd like to try something other than my lookup table for Q-values, so I was thinking of using Neural Networks with back-propagation for this. I will probably try having one Neural Network per action (up, down, left, right), as that seems to be what yields the best results. Are there any other methods (besides SVMs, which seem way too hard to implement myself) that I could use and implement that would give me a good Q-value function approximation?

- Do you think Genetic Algorithms would yield good results in this situation, using the Q-values matrix as the basis for them? How could I test my fitness function? I get the impression that GAs are generally used for things that are far more random/complex. If we look carefully, we will notice that the Q-values follow a clear trend: the higher values are near the goal and they get lower the farther away you are from it. Trying to reach that conclusion by GA would probably take way too long?
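For the speed-up idea above, this is roughly the action-selection rule I have in mind. It is only a sketch that reuses the placeholder `ACTIONS` and `step` from my earlier snippet; `visited` would be a set that is cleared at the start of every episode and updated with each tile the bot lands on:

```python
import random

def pick_action(state, visited):
    """Prefer moves that lead to tiles not yet visited in this episode;
    if every neighbouring tile has already been visited, fall back to a random move."""
    unvisited_moves = []
    for a_idx in range(len(ACTIONS)):
        nxt, _ = step(state, a_idx)          # peek at where this action would land
        if nxt not in visited:
            unvisited_moves.append(a_idx)
    if unvisited_moves:
        return random.choice(unvisited_moves)
    return random.randrange(len(ACTIONS))    # all neighbours seen: behave as before
```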
I'm not an expert on the topic, but I'll take a crack at responding directly to your many questions.
[BTW, I should get multiple +reps for each question!... Just kidding: if I were in it "for the SO reps", I'd stay clear of posting something that will get a grand total of 20 views, with half of those visitors having only a rough idea of the concepts at hand.]
1) Q-Learning a two-phase thing?
Yes, Q-Learning implies two phases, a learning phase and an action phase. As with many automated learning algorithms it is possible to "keep on learning" while in the action phase.
2) Infinite number of steps for an optimal Q matrix?
I'm not sure where the statement that it requires an infinite number of learning cycles to learn an optimal Q matrix comes from. To be sure (and unless the alpha and gamma factors are incorrect), the algorithm converges, if only at a possibly very slow rate. This prompts me to skip ahead and comment on your idea of a 300x200 game space, and well... YES!, for such a space, and given the reward model, it will take what seems like an infinity to get an "optimal" Q table. Now, it may be that mathematically the algorithm never reaches the optimal nirvana, but for practical solutions, working on the asymptote is just good enough.
3) Role of gamma in TD model
This indicates the importance of deferred rewards on a path (here, with your model, literally a path) towards higher rewards. It generally prevents the algorithm from getting stuck in local maxima of the solution space, at the cost of making learning even slower...
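To give a rough feel for what gamma does here (a back-of-the-envelope illustration of my own, assuming +100 only at the goal, 0 everywhere else, and deterministic moves): the best achievable Q value of a tile whose shortest path to the goal is d moves works out to 100 * gamma^(d-1), so gamma controls how quickly the pull of the goal fades with distance.

```python
# Best achievable Q value versus distance-to-goal for two discount factors,
# assuming +100 only at the goal, 0 elsewhere, and deterministic moves.
for gamma in (0.5, 0.9):
    values = [round(100 * gamma ** (d - 1), 1) for d in range(1, 8)]
    print(f"gamma={gamma}: {values}")
# gamma=0.5: [100.0, 50.0, 25.0, 12.5, 6.2, 3.1, 1.6]
# gamma=0.9: [100.0, 90.0, 81.0, 72.9, 65.6, 59.0, 53.1]
```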
4) Suggestions to help with learning a big maze
At the risk of betraying the nature of Q-Learning, you can start the robot at increasing distances from the goal. This will help it improve the Q matrix in the area of the states surrounding the goal first, and then leverage this partially learned Q matrix as you pick initial states, randomly, within an increasing radius from the goal.
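A rough sketch of the kind of start-state schedule I have in mind (the `manhattan` helper, the radius schedule and all names are mine, purely illustrative; the Manhattan distance ignores walls, which is good enough for a schedule):

```python
import random

def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def start_state_near_goal(radius, rows, cols, goal, walls):
    """Pick a random start tile no more than `radius` moves (ignoring walls) from the goal."""
    candidates = [(r, c) for r in range(rows) for c in range(cols)
                  if (r, c) not in walls and (r, c) != goal
                  and manhattan((r, c), goal) <= radius]
    return random.choice(candidates)

def start_state_schedule(episodes, rows, cols, goal, walls):
    """Yield one start state per episode, beginning close to the goal and
    widening the allowed radius as training progresses."""
    for episode in range(episodes):
        radius = 2 + episode // 100          # radius grows every 100 episodes
        yield start_state_near_goal(radius, rows, cols, goal, walls)
```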
Another, riskier approach (and indeed one that may further belie the true nature of Q-Learning) would be to change the R matrix to provide increasingly high rewards at random places located at decreasing distances from the goal. The downside to this approach is that it may introduce many opportunities for local maxima in the solution space, where the algorithm may get stuck if the learning rate and other factors are not tweaked properly.
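Purely as an illustration of this second idea (the number of bonus tiles and their values are arbitrary, and as said, this changes the problem the bot is actually solving):

```python
import random

def place_bonus_rewards(goal, rows, cols, walls, n_bonuses=5, max_bonus=50.0):
    """Scatter a few intermediate rewards over random free tiles, worth more
    the closer they are to the goal; these would be added on top of the R matrix."""
    free_tiles = [(r, c) for r in range(rows) for c in range(cols)
                  if (r, c) not in walls and (r, c) != goal]
    bonuses = {}
    for tile in random.sample(free_tiles, n_bonuses):
        distance = abs(tile[0] - goal[0]) + abs(tile[1] - goal[1])
        bonuses[tile] = max_bonus / (1.0 + distance)     # higher bonus nearer the goal
    return bonuses
```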
Both of these approaches, in particular the latter, can be interpreted as you (the designer) "wiring in" a solution. Others will say that this is merely a way of introducing a dash of DP into the mix...
5) Neural Net (NN) 6) Genetic Algorithm (GA)
No opinion about adding NN or GA into the mix.
I have probably made enough of a fool of myself with some of the less-than-mathematically-accurate statements above. ;-)
- You should try changing the alpha and gamma values. They are important parameters.
- Try more episodes.
- Change the exploration rate. Too much exploration is not good, and not enough exploration is not good either.
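For example, a simple epsilon-greedy action choice with a decaying exploration rate could look like this (the schedule and the names are only one possible choice):

```python
import random

def epsilon_greedy(q_row, episode, n_actions=4,
                   eps_start=1.0, eps_end=0.05, decay=0.001):
    """Explore a lot early on, then gradually exploit the learned Q values."""
    epsilon = max(eps_end, eps_start - decay * episode)
    if random.random() < epsilon:
        return random.randrange(n_actions)                  # explore: random action
    return max(range(n_actions), key=lambda a: q_row[a])    # exploit: greedy action
```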