Is it possible to expose the replay buffer in stable-baselines3's A2C to include human judgement?

Posted on 2025-01-20 23:20:08


I am using the A2C (Advantage Actor Critic) implementation from the stable-baselines3 package (package link here) to solve a reinforcement learning problem where the reward is +1 or 0. I have an automatic mechanism that allocates a reward to a choice in a given state. However, that automatic mechanism is not good enough at rewarding my choices. I have concluded that human judgement (a human sitting down and rewarding the choices) would be better.

Now, I want to incorporate this human judgement into the A2C framework during training.

This is my understanding of how A2C works:

Let's say there are N timesteps in 1 episode. The trajectory is stored in an experience replay buffer: [(S1, A1, R1), (S2, A2, R2), ...], which is used to train the actor and critic neural networks at the end of the episode.

Can I access this buffer that is sent to the neural networks for training? Or is there an alternative way to introduce a human in the loop in the A2C framework?
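For reference: in stable-baselines3, A2C is on-policy, so the collected transitions live in a RolloutBuffer (exposed as model.rollout_buffer and flushed every n_steps) rather than in an experience replay buffer. Below is a minimal sketch of reading that buffer from a callback; the class name InspectRolloutCallback is my own. Note that returns and advantages are already computed by the time _on_rollout_end runs, so rewriting rewards there would not cleanly change the update; the environment's step(), as in the answer below, is the simpler place to intervene.

```python
# A minimal sketch of inspecting A2C's rollout buffer via a stable-baselines3
# callback; the class name InspectRolloutCallback is my own invention.
import numpy as np
from stable_baselines3.common.callbacks import BaseCallback

class InspectRolloutCallback(BaseCallback):
    def _on_step(self) -> bool:
        # Required hook; returning True keeps training going.
        return True

    def _on_rollout_end(self) -> None:
        # model.rollout_buffer holds the last n_steps transitions per env.
        buffer = self.model.rollout_buffer
        print("rewards collected in this rollout:", np.squeeze(buffer.rewards))

# Usage (hypothetical):
# model = A2C("MlpPolicy", "CartPole-v1", n_steps=5)
# model.learn(total_timesteps=1_000, callback=InspectRolloutCallback())
```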


Comments (1)

守护在此方 2025-01-27 23:20:08


Of course! The environment is a simple Python script in which, somewhere at the end of env.step, the reward is calculated and returned, and then added, along with the state and the action, to the replay buffer.

You could then manually insert the reward value each time an action is taken, using simple I/O commands.
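As a concrete illustration of that I/O idea, here is a minimal sketch of a wrapper whose step() discards the automatic reward and asks a human to type 0 or 1 instead. The wrapper name HumanRewardWrapper, the prompt text, and MyEnv are my own placeholders, and it assumes the Gymnasium 5-tuple step API used by recent stable-baselines3 versions.

```python
# A minimal sketch (not part of stable-baselines3): a Gymnasium wrapper whose
# step() replaces the automatic reward with one typed in by a human.
import gymnasium as gym

class HumanRewardWrapper(gym.Wrapper):
    def step(self, action):
        obs, auto_reward, terminated, truncated, info = self.env.step(action)
        # Show the human whatever context they need to judge the choice.
        print(f"action taken: {action} (automatic reward would be {auto_reward})")
        # Block until the human types a valid reward.
        while True:
            answer = input("Enter human reward (0 or 1): ").strip()
            if answer in ("0", "1"):
                return obs, float(answer), terminated, truncated, info

# Usage (hypothetical MyEnv):
# from stable_baselines3 import A2C
# model = A2C("MlpPolicy", HumanRewardWrapper(MyEnv()), n_steps=5)
# model.learn(total_timesteps=1_000)
```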

However, Deep Reinforcement Learning usually requires hundreds of thousands of iterations (experience) before learning something useful (unless the environment is simple enough).
