What is the difference between Q-learning and SARSA?

Posted on 2024-11-27 00:47:38

Although I know that SARSA is on-policy while Q-learning is off-policy, when looking at their formulas it's hard (for me) to see any difference between these two algorithms.

According to the book Reinforcement Learning: An Introduction (by Sutton and Barto), in the SARSA algorithm, given a policy, the corresponding action-value function Q (for state s and action a at timestep t), i.e. Q(s_t, a_t), can be updated as follows:

Q(s_t, a_t) = Q(s_t, a_t) + α*(r_t + γ*Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t))

On the other hand, the update step for the Q-learning algorithm is the following

Q(s_t, a_t) = Q(s_t, a_t) + α*(r_t + γ*max_a Q(s_{t+1}, a) - Q(s_t, a_t))

which can also be written as

Q(s_t, a_t) = (1 - α) * Q(s_t, a_t) + α * (r_t + γ*max_a Q(s_{t+1}, a))

where γ (gamma) is the discount factor and r_t is the reward received from the environment at timestep t.

Is the difference between these two algorithms simply that SARSA looks up the Q-value of the next action actually taken, while Q-learning looks up the maximum Q-value over the next actions?
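To make the comparison concrete, here is a minimal tabular sketch of the two updates (the function names and the dict-of-(state, action) representation are my own illustration, not from the book); the only line that differs is which next-state value is used to bootstrap:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # SARSA bootstraps on the action the behaviour policy actually chose next.
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    # Q-learning bootstraps on the best action in the next state, regardless of
    # which action the behaviour policy actually takes next.
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# (Terminal-state handling is omitted for brevity; Q can be a
# collections.defaultdict(float) mapping (state, action) pairs to values.)
```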

TLDR (and my own answer)

Thanks to everyone who has answered this question since I first asked it. I've made a GitHub repo playing with Q-learning and empirically understood what the difference is. It all amounts to how you select your next best action, which from an algorithmic standpoint can be a mean, a max, or the best action, depending on how you choose to implement it.

The other main difference is when this selection happens (e.g., online vs. offline) and how/why that affects learning. If you are reading this in 2019 and are more of a hands-on person, playing with an RL toy problem is probably the best way to understand the differences.

One last important note is that both Sutton & Barto as well as Wikipedia often have mixed, confusing or wrong formulaic representations with regard to the next state's best/max action and reward: r(t+1) is in fact r(t).
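To make the "mean, max or best action" distinction above concrete, here is a small sketch of three ways the bootstrap value for the next state can be formed (the helper names are my own; the "mean" variant corresponds roughly to Expected SARSA's expectation under an ε-greedy policy):

```python
def sampled_value(Q, s_next, a_next):            # SARSA: the action actually taken next
    return Q[(s_next, a_next)]

def max_value(Q, s_next, actions):               # Q-learning: greedy max over next actions
    return max(Q[(s_next, a)] for a in actions)

def expected_value(Q, s_next, actions, eps):     # Expected SARSA: ε-greedy expectation
    vals = [Q[(s_next, a)] for a in actions]
    return eps * sum(vals) / len(vals) + (1 - eps) * max(vals)
```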

Comments (8)

悲凉≈ 2024-12-04 00:47:38

When I was learning this part, I found it very confusing too, so I put together the two pseudocode listings from R. Sutton and A. G. Barto, hoping to make the difference clearer.

[Image: the SARSA and Q-learning pseudocode from Sutton & Barto, shown side by side, with the differing lines highlighted in blue boxes and numbered]

Blue boxes highlight the part where the two algorithms actually differ. Numbers highlight the more detailed difference to be explained later.

TL;DR:

|             | SARSA | Q-learning |
|:-----------:|:-----:|:----------:|
| Choosing A' |   π   |      π     |
| Updating Q  |   π   |      μ     |

where π is an ε-greedy policy (i.e. ε > 0, with exploration), and μ is a greedy policy (i.e. ε == 0, NO exploration).

  1. Q-learning uses different policies for choosing the next action A' and for updating Q. In other words, it tries to evaluate (and update towards) the greedy policy μ while following another policy π to generate experience, so it's an off-policy algorithm.

  2. In contrast, SARSA uses π all the time, hence it is an on-policy algorithm.

More detailed explanation:

  1. The most important difference between the two is how Q is updated after each action. SARSA uses the Q-value of A', the action actually drawn from the ε-greedy policy, so the update follows that policy exactly. In contrast, Q-learning uses the maximum Q-value over all possible actions for the next state. This makes it look like it is following a greedy policy with ε = 0, i.e. NO exploration in this part.

  2. However, when actually taking an action, Q-learning still uses the action drawn from the ε-greedy policy. This is why "Choose A ..." is inside the repeat loop.

  3. Following the loop logic in Q-learning, the next action that is actually executed is still drawn from the ε-greedy policy (at the top of the next iteration).
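Here is a rough code rendering of the two pseudocode boxes (the env.reset()/env.step() interface, the eps_greedy helper, and treating Q as a dict are my own assumptions, not code from the book); note where A' is chosen and which value is used in the update:

```python
import random

def eps_greedy(Q, state, actions, eps):
    if random.random() < eps:
        return random.choice(actions)                      # explore
    return max(actions, key=lambda a: Q[(state, a)])       # exploit

def sarsa_episode(env, Q, actions, alpha, gamma, eps):
    s = env.reset()
    a = eps_greedy(Q, s, actions, eps)                     # A chosen by π (ε-greedy)
    done = False
    while not done:
        s2, r, done = env.step(a)
        a2 = eps_greedy(Q, s2, actions, eps)               # A' also chosen by π ...
        target = r if done else r + gamma * Q[(s2, a2)]    # ... and used in the update
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s, a = s2, a2                                      # A' is what gets executed next

def q_learning_episode(env, Q, actions, alpha, gamma, eps):
    s = env.reset()
    done = False
    while not done:
        a = eps_greedy(Q, s, actions, eps)                 # A chosen by π, inside the loop
        s2, r, done = env.step(a)
        best = max(Q[(s2, b)] for b in actions)            # update uses μ (greedy max), which
        target = r if done else r + gamma * best           # need not be the action taken next
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2

# Q can be a collections.defaultdict(float); env is any environment exposing
# reset() -> state and step(action) -> (next_state, reward, done).
```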

等风来 2024-12-04 00:47:38

Yes, this is the only difference. On-policy SARSA learns action values relative to the policy it follows, while off-policy Q-learning does it relative to the greedy policy. Under some common conditions, they both converge to the real value function, but at different rates. Q-learning tends to converge a little slower, but has the capability to continue learning while changing policies. Also, Q-learning is not guaranteed to converge when combined with linear approximation.

In practical terms, under an ε-greedy policy, Q-learning computes the difference between Q(s,a) and the maximum action value, while SARSA computes the difference between Q(s,a) and a weighted sum of the average action value and the maximum (in expectation over the ε-greedy action choice):

Q-learning: Q(s_{t+1}, a_{t+1}) = max_a Q(s_{t+1}, a)

SARSA: Q(s_{t+1}, a_{t+1}) = ε·mean_a Q(s_{t+1}, a) + (1-ε)·max_a Q(s_{t+1}, a)
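A quick sanity check of that decomposition (my own sketch, not part of the original answer): sample the value SARSA bootstraps on under an ε-greedy choice of a_{t+1} many times, and compare the empirical average with ε·mean + (1-ε)·max.

```python
import random

random.seed(0)
q_next = {"left": 1.0, "right": 3.0, "stay": 2.0}   # made-up Q(s_{t+1}, ·) values
eps = 0.2

def sarsa_bootstrap_sample():
    # With probability ε pick a uniformly random action, otherwise the greedy one,
    # and return the Q-value SARSA would bootstrap on for that sampled action.
    if random.random() < eps:
        return random.choice(list(q_next.values()))
    return max(q_next.values())

empirical = sum(sarsa_bootstrap_sample() for _ in range(100_000)) / 100_000
closed_form = eps * sum(q_next.values()) / len(q_next) + (1 - eps) * max(q_next.values())
print(empirical, closed_form)   # both should be close to 2.8 for these values
```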

z祗昰~ 2024-12-04 00:47:38

What is the difference mathematically?

As is already described in most other answers, the difference between the two updates mathematically is indeed that, when updating the Q-value for a state-action pair (S_t, A_t):

  • Sarsa uses the behaviour policy (meaning, the policy used by the agent to generate experience in the environment, which is typically epsilon-greedy) to select an additional action A_{t+1}, and then uses Q(S_{t+1}, A_{t+1}) (discounted by gamma) as the expected future return in the computation of the update target.
  • Q-learning does not use the behaviour policy to select an additional action A_{t+1}. Instead, it estimates the expected future return in the update rule as max_A Q(S_{t+1}, A). The max operator used here can be viewed as "following" the completely greedy policy. The agent is not actually following the greedy policy though; it only says, in the update rule, "suppose I were to start following the greedy policy from now on, what would my expected future return be then?".

What does this mean intuitively?

As mentioned in other answers, the difference described above means, using technical terminology, that Sarsa is an on-policy learning algorithm, and Q-learning is an off-policy learning algorithm.

In the limit (given an infinite amount of time to generate experience and learn), and under some additional assumptions, this means that Sarsa and Q-learning converge to different solutions / "optimal" policies:

  • Sarsa will converge to a solution that is optimal under the assumption that we keep following the same policy that was used to generate the experience. This will often be a policy with some element of (rather "stupid") randomness, like epsilon-greedy, because otherwise we are unable to guarantee that we'll converge to anything at all.
  • Q-Learning will converge to a solution that is optimal under the assumption that, after generating experience and training, we switch over to the greedy policy.

When to use which algorithm?

An algorithm like Sarsa is typically preferable in situations where we care about the agent's performance during the process of learning / generating experience. Consider, for example, that the agent is an expensive robot that will break if it falls down a cliff. We'd rather not have it fall down too often during the learning process, because it is expensive. Therefore, we care about its performance during the learning process. However, we also know that we need it to act randomly sometimes (e.g. epsilon-greedy). This means that it is highly dangerous for the robot to be walking alongside the cliff, because it may decide to act randomly (with probability epsilon) and fall down. So, we'd prefer it to quickly learn that it's dangerous to be close to the cliff; even if a greedy policy would be able to walk right alongside it without falling, we know that we're following an epsilon-greedy policy with randomness, and we care about optimizing our performance given that we know that we'll be stupid sometimes. This is a situation where Sarsa would be preferable.

An algorithm like Q-learning would be preferable in situations where we do not care about the agent's performance during the training process, but we just want it to learn an optimal greedy policy that we'll switch to eventually. Consider, for example, that we play a few practice games (where we don't mind losing due to randomness sometimes), and afterwards play an important tournament (where we'll stop learning and switch over from epsilon-greedy to the greedy policy). This is where Q-learning would be better.
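The cliff-walking gridworld from Sutton & Barto is the standard illustration of exactly this trade-off. Below is a minimal sketch of that experiment (the grid layout, hyperparameters, and helper names are my own simplifications, not the book's code); with the same ε-greedy behaviour, SARSA typically ends up with a better average online return because it learns the safer path away from the cliff, while Q-learning learns the greedy path along the cliff edge and keeps falling off during training.

```python
import random
from collections import defaultdict

ROWS, COLS = 4, 12
START, GOAL = (3, 0), (3, 11)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(state, action):
    r, c = state
    dr, dc = ACTIONS[action]
    r = min(max(r + dr, 0), ROWS - 1)
    c = min(max(c + dc, 0), COLS - 1)
    if r == 3 and 0 < c < 11:                  # stepped into the cliff: big penalty, back to start
        return START, -100.0, False
    return (r, c), -1.0, (r, c) == GOAL

def eps_greedy(Q, s, eps):
    if random.random() < eps:
        return random.randrange(len(ACTIONS))
    values = [Q[(s, a)] for a in range(len(ACTIONS))]
    return values.index(max(values))

def train(algo, episodes=500, alpha=0.5, gamma=1.0, eps=0.1):
    Q = defaultdict(float)
    returns = []
    for _ in range(episodes):
        s, total, done = START, 0.0, False
        a = eps_greedy(Q, s, eps)
        while not done:
            s2, r, done = step(s, a)
            total += r
            a2 = eps_greedy(Q, s2, eps)                    # action the agent will actually take
            if algo == "sarsa":
                target = r + (0.0 if done else gamma * Q[(s2, a2)])
            else:                                          # Q-learning
                best = max(Q[(s2, b)] for b in range(len(ACTIONS)))
                target = r + (0.0 if done else gamma * best)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
        returns.append(total)
    return sum(returns[-100:]) / 100                       # average online return, last 100 episodes

random.seed(0)
print("SARSA      :", train("sarsa"))       # tends to be higher (less negative): safer path
print("Q-learning :", train("qlearning"))   # tends to be lower: ε-greedy exploration falls off the cliff
```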

浮光之海 2024-12-04 00:47:38

There's an index mistake in your formula for Q-learning; see page 148 of Sutton and Barto's book.

Q(s_t, a_t) <-- Q(s_t, a_t) + alpha * [r_{t+1} + gamma * max_a Q(s_{t+1}, a) - Q(s_t, a_t)]

The typo is in the argument of the max:

the indexes are s_{t+1} and a, while in your question they are s_{t+1} and a_{t+1} (these are correct for SARSA).

Hope this helps a bit.

耶耶耶 2024-12-04 00:47:38

In Q-Learning

This is yours:
Q-learning: Q(S_t, A_t) = Q(S_t, A_t) + α [ R_{t+1} + discount * max Q(S_{t+1}, A_t) - Q(S_t, A_t) ]

It should be changed to:
Q-learning: Q(S_t, A_t) = Q(S_t, A_t) + α [ R_{t+1} + discount * max_a Q(S_{t+1}, a) - Q(S_t, A_t) ]

As you said, you have to find the maximum Q-value for the update equation by varying a; then you will have a new Q(S_t, A_t). Note carefully that the a which gives you the maximum Q-value is not the next action. At this stage, you only know the next state (S_{t+1}), and before going to the next round, you want to update S_t to S_{t+1} (S_t <-- S_{t+1}).

For each loop:

  • choose A_t from S_t using the Q-values

  • take A_t and observe R_{t+1} and S_{t+1}

  • update the Q-value using the equation above

  • S_t <-- S_{t+1}

until S_t is terminal.
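A small sketch of a single Q-learning step (the function name, arguments, and the defaultdict Q are my own, for illustration) that makes this point explicit: the maximising action only shapes the update target, while the action actually executed next is chosen separately by the ε-greedy behaviour policy and need not be that argmax.

```python
import random
from collections import defaultdict

def q_learning_step(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99, eps=0.1):
    # The target uses the maximising action in s_next ...
    greedy_next = max(actions, key=lambda b: Q[(s_next, b)])
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, greedy_next)] - Q[(s, a)])
    # ... but the action actually executed next is drawn separately from the
    # ε-greedy behaviour policy, and need not equal greedy_next.
    a_next = random.choice(actions) if random.random() < eps else greedy_next
    return greedy_next, a_next

# Tiny usage example with made-up states and actions:
Q = defaultdict(float)
print(q_learning_step(Q, s="s0", a=0, r=-1.0, s_next="s1", actions=[0, 1, 2]))
```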

笑红尘 2024-12-04 00:47:38

The only difference between SARSA and Q-learning is that SARSA updates using the next action taken under the current policy, while Q-learning updates using the action with maximum utility in the next state.

有木有妳兜一样 2024-12-04 00:47:38

Both SARSA and Q-learning agents follow an ε-greedy policy to interact with the environment.

The SARSA agent updates its Q-function using the next timestep's Q-value for whatever action the policy provides (mostly still greedy, but a random action is also accepted). The policy being executed and the policy being updated towards are the same.

The Q-learning agent updates its Q-function using only the action that brings the maximum next-state Q-value (totally greedy with respect to the Q-values). The policy being executed and the policy being updated towards are different.

Hence, SARSA is on-policy, Q-learning is off-policy.

财迷小姐 2024-12-04 00:47:38

I didn't read any book; I just looked at their implications. Q-learning just focuses on the (action grid), while SARSA learning focuses on (state to state): it observes the action list for s and s' and then updates the (state-to-state grid).
