QLearning and never-ending episodes

Posted on 2024-08-13 17:09:35

Let's imagine we have an (x,y) plane where a robot can move. Now we define the middle of our world as the goal state, which means that we are going to give a reward of 100 to our robot once it reaches that state.

Now, let's say that there are 4 states (which I will call A, B, C, D) that can lead to the goal state.

The first time we are in A and go to the goal state, we will update our Q-value table as follows:

Q(state = A, action = going to goal state) = 100 + 0

One of two things can happen. I can end the episode here and start a new one in which the robot has to find the goal state again, or I can keep exploring the world even after finding the goal state. If I try the second option, though, I see a problem. If I am in the goal state and go back to state A, the Q-value will be the following:

Q(state = goalState, action = going to A) = 0 + gamma * 100

Now, if I go from A to the goal state again:

Q(state = A, action = going to goal state) = 100 + gamma * (gamma * 100)

Which means that if I keep doing this, since 0 <= gamma <= 1, both Q-values are going to rise forever.
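
Here is a minimal sketch of the back-and-forth updates I'm describing, assuming a deterministic world, the simplified rule Q(s, a) = r + gamma * max_a' Q(s', a') (i.e. learning rate 1), and an arbitrary gamma of 0.9; the state/action labels are just for illustration:

```python
# Tiny tabular sketch of bouncing between state A and the goal state.
# Assumptions: deterministic transitions, learning rate 1, and the
# simplified update Q(s, a) = r + gamma * max_a' Q(s', a').

gamma = 0.9  # illustrative discount factor

# Q-values for the two state/action pairs in the walkthrough.
Q = {
    ("A", "to_goal"): 0.0,
    ("goal", "to_A"): 0.0,
}

def reward(state, action):
    # Reaching the goal state yields 100; every other transition yields 0.
    return 100.0 if action == "to_goal" else 0.0

def best_next(state):
    # max_a' Q(s', a') over the actions recorded for the next state.
    return max(q for (s, _), q in Q.items() if s == state)

# Bounce between A and the goal repeatedly and watch the Q-values grow.
# (With gamma < 1 the growth is bounded by the geometric series
# 100 / (1 - gamma**2); with gamma = 1 it grows without bound.)
for step in range(10):
    # A -> goal
    Q[("A", "to_goal")] = reward("A", "to_goal") + gamma * best_next("goal")
    # goal -> A
    Q[("goal", "to_A")] = reward("goal", "to_A") + gamma * best_next("A")
    print(step, round(Q[("A", "to_goal")], 2), round(Q[("goal", "to_A")], 2))
```

The first two iterations reproduce the numbers above: 100, then 0 + gamma * 100, then 100 + gamma * (gamma * 100), and so on.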

Is this the expected behavior of QLearning? Am I doing something wrong? If this is the expected behavior, can't it lead to problems? I know that, probabilistically, all 4 states (A, B, C and D) will grow at the same rate, but even so it kind of bugs me to have them growing forever.

The idea behind allowing the agent to continue exploring even after finding the goal is that the nearer it is to the goal state, the more likely it is to be in states that can be updated at that moment.


Comments (1)

衣神在巴黎 2024-08-20 17:09:35

This is as expected, since the Q estimate isn't the expected reward; it's the expected return, which is the (possibly discounted via gamma) amount of reward I'd expect to reap from that state/action if I started there and followed my policy until the end of the episode, or forever.

If you give me some buttons, and one of those buttons always produces $1 when pressed, then the true expected reward for pressing that button is $1. But the true expected return for pressing the button is an infinite number of dollars, assuming I get an infinite number of chances to push a button.
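
A quick numeric illustration of that reward/return distinction, assuming $1 per press and a few example discount factors (the numbers are only illustrative):

```python
# Discounted return from pressing the $1 button at every step:
# G = sum over t of gamma**t * 1.

def discounted_return(gamma, steps):
    return sum(gamma ** t for t in range(steps))

for gamma in (0.9, 0.99, 1.0):
    print(gamma, [round(discounted_return(gamma, n), 2) for n in (10, 100, 1000)])

# With gamma < 1 the return approaches the finite limit 1 / (1 - gamma);
# with gamma = 1 it grows without bound, matching the "infinite dollars" case.
```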
