What is the difference between Q-learning and SARSA?

Posted on 2024-11-27 00:47:38

Although I know that SARSA is on-policy while Q-learning is off-policy, when looking at their formulas it's hard (for me) to see any difference between these two algorithms.

According to the book Reinforcement Learning: An Introduction (by Sutton and Barto), in the SARSA algorithm, given a policy, the corresponding action-value function Q (for state s and action a at timestep t), i.e. Q(s_t, a_t), can be updated as follows:

Q(s_t, a_t) = Q(s_t, a_t) + α*(r_t + γ*Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t))

On the other hand, the update step for the Q-learning algorithm is the following

Q(s_t, a_t) = Q(s_t, a_t) + α*(r_t + γ*max_a Q(s_{t+1}, a) - Q(s_t, a_t))

which can also be written as

Q(s_t, a_t) = (1 - α) * Q(s_t, a_t) + α * (r_t + γ*max_a Q(s_{t+1}, a))

where γ (gamma) is the discount factor and r_t is the reward received from the environment at timestep t.

Is the difference between these two algorithms simply that SARSA looks up the Q-value of the next action actually taken, while Q-learning looks up the maximum Q-value over the next actions?
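To make the comparison concrete, here is a minimal tabular sketch of the two updates (the function names and the dict-of-(state, action) representation are my own illustration, not from the book); the only line that differs is which next-state value is used to bootstrap:

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # SARSA bootstraps on the action the behaviour policy actually chose next.
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    # Q-learning bootstraps on the best action in the next state, regardless of
    # which action the behaviour policy actually takes next.
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# (Terminal-state handling is omitted for brevity; Q can be a
# collections.defaultdict(float) mapping (state, action) pairs to values.)
```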

TLDR (and my own answer)

Thanks to everyone who has answered this question since I first asked it. I've made a GitHub repo playing with Q-learning and empirically understood what the difference is. It all amounts to how you select your next best action, which from an algorithmic standpoint can be a mean, a max, or the best action, depending on how you choose to implement it.

The other main difference is when this selection happens (e.g., online vs. offline) and how/why that affects learning. If you are reading this in 2019 and are more of a hands-on person, playing with an RL toy problem is probably the best way to understand the differences.

One last important note is that both Sutton & Barto as well as Wikipedia often have mixed, confusing or wrong formulaic representations with regard to the next state's best/max action and reward: r(t+1) is in fact r(t).
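To make the "mean, max or best action" distinction above concrete, here is a small sketch of three ways the bootstrap value for the next state can be formed (the helper names are my own; the "mean" variant corresponds roughly to Expected SARSA's expectation under an ε-greedy policy):

```python
def sampled_value(Q, s_next, a_next):            # SARSA: the action actually taken next
    return Q[(s_next, a_next)]

def max_value(Q, s_next, actions):               # Q-learning: greedy max over next actions
    return max(Q[(s_next, a)] for a in actions)

def expected_value(Q, s_next, actions, eps):     # Expected SARSA: ε-greedy expectation
    vals = [Q[(s_next, a)] for a in actions]
    return eps * sum(vals) / len(vals) + (1 - eps) * max(vals)
```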

Comments (8)

悲凉≈ 2024-12-04 00:47:38

When I was learning this part, I found it very confusing too, so I put together the two pseudocode listings from R. Sutton and A. G. Barto, hoping to make the difference clearer.

[Image: the SARSA and Q-learning pseudocode from Sutton & Barto, shown side by side, with the differing lines highlighted in blue boxes and numbered]

Blue boxes highlight the part where the two algorithms actually differ. Numbers highlight the more detailed difference to be explained later.

TL;DR:

|             | SARSA | Q-learning |
|:-----------:|:-----:|:----------:|
| Choosing A' |   π   |      π     |
| Updating Q  |   π   |      μ     |

where π is an ε-greedy policy (i.e. ε > 0, with exploration), and μ is a greedy policy (i.e. ε == 0, NO exploration).

  1. Q-learning uses different policies for choosing the next action A' and for updating Q. In other words, it tries to evaluate (and update towards) the greedy policy μ while following another policy π to generate experience, so it's an off-policy algorithm.

  2. In contrast, SARSA uses π all the time, hence it is an on-policy algorithm.

More detailed explanation:

  1. The most important difference between the two is how Q is updated after each action. SARSA uses the Q-value of A', the action actually drawn from the ε-greedy policy, so the update follows that policy exactly. In contrast, Q-learning uses the maximum Q-value over all possible actions for the next state. This makes it look like it is following a greedy policy with ε = 0, i.e. NO exploration in this part.

  2. However, when actually taking an action, Q-learning still uses the action drawn from the ε-greedy policy. This is why "Choose A ..." is inside the repeat loop.

  3. Following the loop logic in Q-learning, the next action that is actually executed is still drawn from the ε-greedy policy (at the top of the next iteration).
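Here is a rough code rendering of the two pseudocode boxes (the env.reset()/env.step() interface, the eps_greedy helper, and treating Q as a dict are my own assumptions, not code from the book); note where A' is chosen and which value is used in the update:

```python
import random

def eps_greedy(Q, state, actions, eps):
    if random.random() < eps:
        return random.choice(actions)                      # explore
    return max(actions, key=lambda a: Q[(state, a)])       # exploit

def sarsa_episode(env, Q, actions, alpha, gamma, eps):
    s = env.reset()
    a = eps_greedy(Q, s, actions, eps)                     # A chosen by π (ε-greedy)
    done = False
    while not done:
        s2, r, done = env.step(a)
        a2 = eps_greedy(Q, s2, actions, eps)               # A' also chosen by π ...
        target = r if done else r + gamma * Q[(s2, a2)]    # ... and used in the update
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s, a = s2, a2                                      # A' is what gets executed next

def q_learning_episode(env, Q, actions, alpha, gamma, eps):
    s = env.reset()
    done = False
    while not done:
        a = eps_greedy(Q, s, actions, eps)                 # A chosen by π, inside the loop
        s2, r, done = env.step(a)
        best = max(Q[(s2, b)] for b in actions)            # update uses μ (greedy max), which
        target = r if done else r + gamma * best           # need not be the action taken next
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s2

# Q can be a collections.defaultdict(float); env is any environment exposing
# reset() -> state and step(action) -> (next_state, reward, done).
```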

等风来 2024-12-04 00:47:38

Yes, this is the only difference. On-policy SARSA learns action values relative to the policy it follows, while off-policy Q-learning does it relative to the greedy policy. Under some common conditions, they both converge to the real value function, but at different rates. Q-learning tends to converge a little slower, but has the capability to continue learning while changing policies. Also, Q-learning is not guaranteed to converge when combined with linear approximation.

In practical terms, under an ε-greedy policy, Q-learning computes the difference between Q(s,a) and the maximum action value, while SARSA computes the difference between Q(s,a) and a weighted sum of the average action value and the maximum (in expectation over the ε-greedy action choice):

Q-learning: Q(s_{t+1}, a_{t+1}) = max_a Q(s_{t+1}, a)

SARSA: Q(s_{t+1}, a_{t+1}) = ε·mean_a Q(s_{t+1}, a) + (1-ε)·max_a Q(s_{t+1}, a)
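A quick sanity check of that decomposition (my own sketch, not part of the original answer): sample the value SARSA bootstraps on under an ε-greedy choice of a_{t+1} many times, and compare the empirical average with ε·mean + (1-ε)·max.

```python
import random

random.seed(0)
q_next = {"left": 1.0, "right": 3.0, "stay": 2.0}   # made-up Q(s_{t+1}, ·) values
eps = 0.2

def sarsa_bootstrap_sample():
    # With probability ε pick a uniformly random action, otherwise the greedy one,
    # and return the Q-value SARSA would bootstrap on for that sampled action.
    if random.random() < eps:
        return random.choice(list(q_next.values()))
    return max(q_next.values())

empirical = sum(sarsa_bootstrap_sample() for _ in range(100_000)) / 100_000
closed_form = eps * sum(q_next.values()) / len(q_next) + (1 - eps) * max(q_next.values())
print(empirical, closed_form)   # both should be close to 2.8 for these values
```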

z祗昰~ 2024-12-04 00:47:38

What is the difference mathematically?

As is already described in most other answers, the difference between the two updates mathematically is indeed that, when updating the Q-value for a state-action pair (S_t, A_t):

  • Sarsa uses the behaviour policy (meaning, the policy used by the agent to generate experience in the environment, which is typically epsilon-greedy) to select an additional action A_{t+1}, and then uses Q(S_{t+1}, A_{t+1}) (discounted by gamma) as the expected future return in the computation of the update target.
  • Q-learning does not use the behaviour policy to select an additional action A_{t+1}. Instead, it estimates the expected future return in the update rule as max_A Q(S_{t+1}, A). The max operator used here can be viewed as "following" the completely greedy policy. The agent is not actually following the greedy policy though; it only says, in the update rule, "suppose I were to start following the greedy policy from now on, what would my expected future return be then?".

What does this mean intuitively?

As mentioned in other answers, the difference described above means, using technical terminology, that Sarsa is an on-policy learning algorithm, and Q-learning is an off-policy learning algorithm.

In the limit (given an infinite amount of time to generate experience and learn), and under some additional assumptions, this means that Sarsa and Q-learning converge to different solutions / "optimal" policies:

  • Sarsa will converge to a solution that is optimal under the assumption that we keep following the same policy that was used to generate the experience. This will often be a policy with some element of (rather "stupid") randomness, like epsilon-greedy, because otherwise we are unable to guarantee that we'll converge to anything at all.
  • Q-Learning will converge to a solution that is optimal under the assumption that, after generating experience and training, we switch over to the greedy policy.

When to use which algorithm?

An algorithm like Sarsa is typically preferable in situations where we care about the agent's performance during the process of learning / generating experience. Consider, for example, that the agent is an expensive robot that will break if it falls down a cliff. We'd rather not have it fall down too often during the learning process, because it is expensive. Therefore, we care about its performance during the learning process. However, we also know that we need it to act randomly sometimes (e.g. epsilon-greedy). This means that it is highly dangerous for the robot to be walking alongside the cliff, because it may decide to act randomly (with probability epsilon) and fall down. So, we'd prefer it to quickly learn that it's dangerous to be close to the cliff; even if a greedy policy would be able to walk right alongside it without falling, we know that we're following an epsilon-greedy policy with randomness, and we care about optimizing our performance given that we know that we'll be stupid sometimes. This is a situation where Sarsa would be preferable.

An algorithm like Q-learning would be preferable in situations where we do not care about the agent's performance during the training process, but we just want it to learn an optimal greedy policy that we'll switch to eventually. Consider, for example, that we play a few practice games (where we don't mind losing due to randomness sometimes), and afterwards play an important tournament (where we'll stop learning and switch over from epsilon-greedy to the greedy policy). This is where Q-learning would be better.
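The cliff-walking gridworld from Sutton & Barto is the standard illustration of exactly this trade-off. Below is a minimal sketch of that experiment (the grid layout, hyperparameters, and helper names are my own simplifications, not the book's code); with the same ε-greedy behaviour, SARSA typically ends up with a better average online return because it learns the safer path away from the cliff, while Q-learning learns the greedy path along the cliff edge and keeps falling off during training.

```python
import random
from collections import defaultdict

ROWS, COLS = 4, 12
START, GOAL = (3, 0), (3, 11)
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(state, action):
    r, c = state
    dr, dc = ACTIONS[action]
    r = min(max(r + dr, 0), ROWS - 1)
    c = min(max(c + dc, 0), COLS - 1)
    if r == 3 and 0 < c < 11:                  # stepped into the cliff: big penalty, back to start
        return START, -100.0, False
    return (r, c), -1.0, (r, c) == GOAL

def eps_greedy(Q, s, eps):
    if random.random() < eps:
        return random.randrange(len(ACTIONS))
    values = [Q[(s, a)] for a in range(len(ACTIONS))]
    return values.index(max(values))

def train(algo, episodes=500, alpha=0.5, gamma=1.0, eps=0.1):
    Q = defaultdict(float)
    returns = []
    for _ in range(episodes):
        s, total, done = START, 0.0, False
        a = eps_greedy(Q, s, eps)
        while not done:
            s2, r, done = step(s, a)
            total += r
            a2 = eps_greedy(Q, s2, eps)                    # action the agent will actually take
            if algo == "sarsa":
                target = r + (0.0 if done else gamma * Q[(s2, a2)])
            else:                                          # Q-learning
                best = max(Q[(s2, b)] for b in range(len(ACTIONS)))
                target = r + (0.0 if done else gamma * best)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
        returns.append(total)
    return sum(returns[-100:]) / 100                       # average online return, last 100 episodes

random.seed(0)
print("SARSA      :", train("sarsa"))       # tends to be higher (less negative): safer path
print("Q-learning :", train("qlearning"))   # tends to be lower: ε-greedy exploration falls off the cliff
```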

浮光之海 2024-12-04 00:47:38

There's an index mistake in your formula for Q-learning; see page 148 of Sutton and Barto's book.

Q(s_t, a_t) <-- Q(s_t, a_t) + alpha * [r_{t+1} + gamma * max_a Q(s_{t+1}, a) - Q(s_t, a_t)]

The typo is in the argument of the max:

the indexes are s_{t+1} and a, while in your question they are s_{t+1} and a_{t+1} (these are correct for SARSA).

Hope this helps a bit.

耶耶耶 2024-12-04 00:47:38

In Q-Learning

This is yours:
Q-learning: Q(S_t, A_t) = Q(S_t, A_t) + α [ R_{t+1} + discount * max Q(S_{t+1}, A_t) - Q(S_t, A_t) ]

It should be changed to:
Q-learning: Q(S_t, A_t) = Q(S_t, A_t) + α [ R_{t+1} + discount * max_a Q(S_{t+1}, a) - Q(S_t, A_t) ]

As you said, you have to find the maximum Q-value for the update equation by varying a; then you will have a new Q(S_t, A_t). Note carefully that the a which gives you the maximum Q-value is not the next action. At this stage, you only know the next state (S_{t+1}), and before going to the next round, you want to update S_t to S_{t+1} (S_t <-- S_{t+1}).

For each loop:

  • choose A_t from S_t using the Q-values

  • take A_t and observe R_{t+1} and S_{t+1}

  • update the Q-value using the equation above

  • S_t <-- S_{t+1}

until S_t is terminal.
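A small sketch of a single Q-learning step (the function name, arguments, and the defaultdict Q are my own, for illustration) that makes this point explicit: the maximising action only shapes the update target, while the action actually executed next is chosen separately by the ε-greedy behaviour policy and need not be that argmax.

```python
import random
from collections import defaultdict

def q_learning_step(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99, eps=0.1):
    # The target uses the maximising action in s_next ...
    greedy_next = max(actions, key=lambda b: Q[(s_next, b)])
    Q[(s, a)] += alpha * (r + gamma * Q[(s_next, greedy_next)] - Q[(s, a)])
    # ... but the action actually executed next is drawn separately from the
    # ε-greedy behaviour policy, and need not equal greedy_next.
    a_next = random.choice(actions) if random.random() < eps else greedy_next
    return greedy_next, a_next

# Tiny usage example with made-up states and actions:
Q = defaultdict(float)
print(q_learning_step(Q, s="s0", a=0, r=-1.0, s_next="s1", actions=[0, 1, 2]))
```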

笑红尘 2024-12-04 00:47:38

The only difference between SARSA and Q-learning is that SARSA updates using the next action taken under the current policy, while Q-learning updates using the action with maximum utility in the next state.

有木有妳兜一样 2024-12-04 00:47:38

Both SARSA and Q-learning agents follow an ε-greedy policy to interact with the environment.

The SARSA agent updates its Q-function using the next timestep's Q-value for whatever action the policy provides (mostly still greedy, but a random action is also accepted). The policy being executed and the policy being updated towards are the same.

The Q-learning agent updates its Q-function using only the action that brings the maximum next-state Q-value (totally greedy with respect to the Q-values). The policy being executed and the policy being updated towards are different.

Hence, SARSA is on-policy, Q-learning is off-policy.

财迷小姐 2024-12-04 00:47:38

I didn't read any book; I just looked at their implications. Q-learning just focuses on the (action grid), while SARSA learning focuses on (state to state): it observes the action list for s and s' and then updates the (state-to-state grid).
