Train Stable Baselines3 with examples?

Posted on 2025-02-08 06:27:50

For my basic evaluation of learning algorithms I defined a custom environment. In the standard examples for Stable Baselines3, learning always seems to be initiated by Stable Baselines automatically (Stable Baselines chooses random actions itself and evaluates the rewards). The standard learning seems to be done like this:

model.learn(total_timesteps=10000)

and this will try out different actions and optimize the action-observation relations while learning.
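
For illustration, a minimal sketch of this standard workflow (using CartPole-v1 only as a stand-in for my custom environment; my real setup differs):

import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")            # stand-in for the custom environment
model = PPO("MlpPolicy", env, verbose=0)

# SB3 collects its own experience: the current policy picks the actions,
# the environment returns the observations and rewards.
model.learn(total_timesteps=10_000)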

I would like to try out a really basic approach: for my custom environment I would generate lists of examples specifying which actions should be taken in certain relevant situations (so there is a list of predefined observation-action-reward entries).

And I would like to train the model with this list.
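
To make the idea concrete, such a list might look roughly like this (the values and shapes are purely illustrative, not from my real environment):

import numpy as np

# Hypothetical hand-written examples: each entry pairs an observation with
# the action that should be taken there and the reward that was obtained.
demonstrations = [
    # (observation,                           action, reward)
    (np.array([0.1, -0.3], dtype=np.float32), 1,      1.0),
    (np.array([0.4,  0.2], dtype=np.float32), 0,      1.0),
]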

What would be the most appropriate way to implement this with Stable Baselines3 (using PyTorch)?

Additional information:
Maybe the sense of the question can be compared to the idea, in the case of an Atari game, of not always training on a whole game sequence at once (from start to end of the game, then restarting until training ends), but instead training the agent only on some more specific, representative situations of importance.
Or in chess: it seems to make a huge difference whether an agent selects moves randomly (or otherwise chooses them itself) or follows moves played by masters in particularly interesting situations.

Maybe one could put the lists into the environment's reactions themselves (so e.g. train the agent with environment 1 for 1000 steps, then with environment 2 for 1000 steps, and so on). This could be a solution; a sketch of that idea follows below.
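
As far as I can tell, SB3 allows swapping the environment between learn() calls, so a sketch of this variant might look like this (CartPole-v1 again stands in for the hand-crafted "situation" environments):

import gymnasium as gym
from stable_baselines3 import PPO

# Two environments standing in for custom environments built around the lists.
env_1 = gym.make("CartPole-v1")
env_2 = gym.make("CartPole-v1")

model = PPO("MlpPolicy", env_1, verbose=0)
model.learn(total_timesteps=1_000)

# Swap in the second environment and continue training without resetting
# the timestep counter.
model.set_env(env_2)
model.learn(total_timesteps=1_000, reset_num_timesteps=False)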

But the problem would be that Stable Baselines would still choose the actions itself, so it could not learn a complete sequence of "correct" (or, as in chess, masterfully chosen) steps in order.

So again, the practical question is: is it possible (and how) to bring Stable Baselines to train on predefined actions instead of self-chosen ones while training/learning?

Comments (1)

俯瞰星空 2025-02-15 06:27:50

Imitation learning is essentially what you are looking for. There is an imitation library that sits on top of Stable Baselines3 that you can use to achieve this.

See this example on how to create a policy that mimics expert behavior to train the network. The behavior in this case comes from a set of action sequences, or rollouts. In the example the rollouts come from an expertly trained policy, but you can probably create a hand-written one. See this on how to create a rollout.
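
Roughly, a minimal behavior-cloning sketch along the lines of the imitation library's quickstart might look like this (assuming a recent version of the library; the hand-written demonstrations are purely illustrative, and batch_size is lowered only because the toy dataset is tiny):

import numpy as np
import gymnasium as gym
from imitation.algorithms import bc
from imitation.data.types import Transitions

rng = np.random.default_rng(0)
env = gym.make("CartPole-v1")   # stand-in for your custom environment

# Hand-written demonstrations packed into the library's Transitions container:
# one chosen action per observation.
obs = np.array([[0.0, 0.0,  0.01, 0.0],
                [0.0, 0.1, -0.01, 0.0]], dtype=np.float32)
acts = np.array([1, 0])
infos = np.array([{}, {}])
next_obs = obs.copy()
dones = np.array([False, False])
transitions = Transitions(obs=obs, acts=acts, infos=infos,
                          next_obs=next_obs, dones=dones)

# Behavior cloning: supervised learning of the observation -> action mapping.
bc_trainer = bc.BC(
    observation_space=env.observation_space,
    action_space=env.action_space,
    demonstrations=transitions,
    rng=rng,
    batch_size=2,
)
bc_trainer.train(n_epochs=10)

In the library's own examples the transitions are instead produced from a trained expert policy via imitation.data.rollout.rollout(), but a hand-constructed Transitions object like the one above should work the same way for predefined observation-action pairs.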
