Q-learning and never-ending episodes


Let's imagine we have an (x,y) plane where a robot can move. Now we define the middle of our world as the goal state, which means that we are going to give a reward of 100 to our robot once it reaches that state.

Now, let's say that there are 4 states (which I will call A, B, C, D) that can lead to the goal state.

The first time we are in A and go to the goal state, we will update our Q-values table as follows:

Q(state = A, action = going to goal state) = 100 + 0
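For reference, that number comes from the standard tabular Q-learning update, written here in the same informal notation as above (a sketch that assumes the table starts at zero and, for simplicity, a learning rate of 1, so each entry jumps straight to its target):

Q(s, a) <- Q(s, a) + alpha * (reward + gamma * max over a' of Q(s', a') - Q(s, a))

With everything initialized to 0, reward = 100 and the goal state's Q-values still 0, the target for (A, going to goal state) is 100 + gamma * 0 = 100.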

One of two things can happen. I can end the episode here and start a new one where the robot has to find the goal state again, or I can continue exploring the world even after finding the goal state. If I try the second option, though, I run into a problem. If I am in the goal state and go back to state A, its Q-value will be the following:

Q(state = goalState, action = going to A) = 0 + gamma * 100

Now, if I try to go from A to the goal state again:

Q(state = A, action = going to goal state) = 100 + gamma * (gamma * 100)

Which means that if I keep doing this, since 0 <= gamma <= 1, both Q-values are going to rise forever.
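To see what repeating those two updates does to the numbers, here is a small sketch of the back-and-forth described above, using the simplified update Q(s, a) = reward + gamma * max Q(s', .) (i.e. a learning rate of 1) and assuming these two entries are the best actions in their respective states:

```python
gamma = 0.9
q_a_to_goal = 0.0    # Q(A, going to goal state); this transition pays 100
q_goal_to_a = 0.0    # Q(goalState, going to A); this transition pays 0

for step in range(1, 51):
    q_a_to_goal = 100 + gamma * q_goal_to_a   # bootstrap off the goal state's value
    q_goal_to_a = 0 + gamma * q_a_to_goal     # bootstrap off A's value
    if step % 10 == 0:
        print(step, round(q_a_to_goal, 2), round(q_goal_to_a, 2))

# Both entries rise with every pass. With gamma < 1 they approach the fixed
# point 100 / (1 - gamma**2) for Q(A -> goal); only with gamma = 1 would
# they keep growing without bound.
print("fixed point for Q(A -> goal):", 100 / (1 - gamma**2))
```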

Is this the expected behavior of Q-learning? Am I doing something wrong? If this is the expected behavior, can't it lead to problems? I know that, probabilistically, all 4 states (A, B, C and D) will grow at the same rate, but even so it bothers me a little to have them growing forever.

The idea behind allowing the agent to continue exploring even after finding the goal is that the nearer it is to the goal state, the more likely it is to be in states whose values can already be updated.
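In code, the choice between ending the episode at the goal and continuing to explore usually shows up only in how the update target is computed. Below is a minimal sketch of that rule; the function name and the dict-of-NumPy-arrays Q-table are my own illustration, not something from the question.

```python
import numpy as np

def q_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.9):
    """One tabular Q-learning update.

    Q is assumed to map each state to a NumPy array of action values.
    If `done` is True (the robot just reached the goal and the episode
    ends there), we do not bootstrap, so the goal state's own values
    never feed back into the table. If we keep exploring instead, the
    target includes gamma times the best value of the next state.
    """
    target = r if done else r + gamma * np.max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])
```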


Comments (1)

衣神在巴黎 2024-08-20 17:09:35


This is as expected, since the Q estimate isn't the expected reward; it's the expected return, which is the (possibly gamma-discounted) amount of reward I'd expect to reap from that state/action if I started there and followed my policy until the end of the episode, or forever.

If you give me some buttons, and one of those buttons always produces $1 when pressed, then the true expected reward for pressing that button is $1. But the true expected return for pressing the button is an infinite number of dollars, assuming I get an infinite number of chances to push it.
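Written out for the button example, that return is just a geometric series (a small worked sum, using the same gamma as in the question):

return = 1 + gamma * 1 + gamma^2 * 1 + ... = 1 / (1 - gamma)    for gamma < 1

With no discounting (gamma = 1) the sum has no finite limit, which is the infinite-dollars case; with gamma < 1 the return is finite, but still much larger than the one-step reward of $1.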
