What is the difference between the objective function (SA) and the value function (RL)?

Posted on 2025-02-04 12:29:11


In Simulated Annealing (SA), the objective function E(s) defines the transition probability of moving from one state s to another state s'. Ideally, the minimum of the objective function corresponds to the optimal solution.
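For concreteness, here is a minimal sketch (my own illustration, not part of the original question) of the Metropolis-style acceptance rule I have in mind, where E(s) drives the transition probability; `T` is the current temperature and the neighbour-generation step is omitted:

```python
import math
import random

def accept_move(E_current, E_candidate, T):
    """Metropolis acceptance rule used in SA: always accept an improving move,
    otherwise accept with probability exp(-(E(s') - E(s)) / T)."""
    delta = E_candidate - E_current
    if delta <= 0:
        return True
    return random.random() < math.exp(-delta / T)

# Example: a worse candidate with delta = 2 at temperature T = 1 is accepted
# with probability exp(-2) ≈ 0.135.
```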

In Reinforcement Learning (RL), we have a value function v(s) that tells us how good it is to be in the current state s.

There is also a function that assigns a value to a combination of the current state and an action, but I don't want to compare that to SA.

So my question is: what is the difference between E(s) and v(s)?


Comments (1)

活雷疯 2025-02-11 12:29:11


Simulated Annealing (SA) and Reinforcement Learning (RL) algorithms are meant to solve different classes of problems. The former is meant to find a global optimum, while the latter is meant to find a policy that maximizes a reward (not directly a reward nor a state). More precisely, in RL, agents take actions based on a reward and their current state (feedback). The policy of an agent can be seen as a map defining the probability of taking an action given a state, and the value function defines how good it is to be in a state considering all future actions.
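In standard RL notation (my addition, not in the original answer), "how good it is to be in a state considering all future actions" is the expected discounted return when following the policy π from state s:

```latex
V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, R_{t+1} \;\middle|\; S_0 = s\right]
```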

RL algorithms can be applied to optimize the policy of an agent in a game as long as you can attribute a score to the players. The reward is typically the score difference between two time steps (i.e. rounds). For many games, chess for example, an opponent can impact the state of the agent, and the agent can only react to it based on a feedback loop. The goal in such a case is to find the sequence of actions that maximizes the chance to win. Using SA naively for such a problem does not make much sense: there is no point in searching for the best global state. In fact, if we tried to apply SA in this case, a good opponent would quickly prevent SA from converging to a good global optimum. SA does not consider the opponent and does not care about the sequence of actions; only the final result matters in SA.
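As a minimal illustration of that reward choice (the `scores` values are hypothetical):

```python
# Cumulative scores observed at the end of each round (hypothetical numbers).
scores = [0, 3, 3, 7, 12]

# Reward at round t = score difference between two consecutive time steps.
rewards = [scores[t] - scores[t - 1] for t in range(1, len(scores))]
print(rewards)  # [3, 0, 4, 5]
```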

Alternatively, if you want to find the minimum of a differentiable mathematical function (e.g. a high-order polynomial), then RL algorithms are quite useless (and inefficient) because they focus on finding an optimal policy, which you do not need (though an optimal policy can help find a global optimum, SA is already good at that); you only want the optimal state (and possibly its associated objective value).

Another key difference is that, AFAIK, E(s) is predefined in SA, while V(s) is generally unknown and must be found by RL algorithms. This is a huge difference since, in practice, V(s) depends on the policy, which the RL algorithm also needs to find. If V(s) is known, then the policy can be trivially deduced (the agent performs the action that maximizes the expected value of the next state), and if an optimal policy is known, then V(s) can be computed (or approximated) from the induced Markov chain.
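To make the last two points concrete, here is a small tabular sketch, assuming a known transition tensor `P[s, a, s']`, a reward matrix `R[s, a]` and a discount factor `gamma` (all names are illustrative, not from the answer): deriving a greedy policy from a known V(s), and computing V(s) for a fixed policy by iterating the Bellman expectation equation over the Markov chain the policy induces.

```python
import numpy as np

def greedy_policy(V, P, R, gamma):
    """Given a value function V, pick in each state the action that
    maximizes the expected one-step return (the 'trivially deduced' policy)."""
    q = R + gamma * np.einsum("sat,t->sa", P, V)   # one-step backup Q(s, a) from V
    return q.argmax(axis=1)

def evaluate_policy(policy, P, R, gamma, n_iter=500):
    """Given a fixed policy, compute V(s) by iterating the Bellman expectation
    equation over the Markov chain induced by that policy."""
    n_states = P.shape[0]
    V = np.zeros(n_states)
    for _ in range(n_iter):
        P_pi = P[np.arange(n_states), policy]      # transition matrix under the policy
        R_pi = R[np.arange(n_states), policy]      # expected reward under the policy
        V = R_pi + gamma * P_pi @ V
    return V
```

In a real RL setting, P and R are usually not known, which is exactly why V(s) has to be estimated from interaction, as the answer points out.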
