Reinforcement Learning Toy Project

Posted 2024-09-01 13:47:18

My toy project to learn & apply Reinforcement Learning is (a rough interface sketch follows this list):
- An agent tries to reach a goal state "safely" and "quickly"...
- But there are projectiles and rockets launched at the agent along the way.
- The agent can determine the rockets' positions (with some noise) only when they are "near".
- The agent must then learn to avoid crashing into these rockets.
- The agent has fuel, rechargeable over time, which is consumed as the agent moves.
- Continuous actions: accelerating forward and turning by an angle.
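
A minimal sketch of how this setup might look as a simulation interface; the class name, sensor range, fuel numbers, and reward below are illustrative assumptions, not part of the actual task:

```python
# Hypothetical interface sketch -- names, numbers, and reward are assumptions.
import numpy as np

class RocketAvoidanceEnv:
    """Toy task: reach a goal quickly and safely while dodging rockets, with limited fuel."""

    def __init__(self, sensor_range=5.0, obs_noise_std=0.3):
        self.sensor_range = sensor_range    # rockets are only observed when "near"
        self.obs_noise_std = obs_noise_std  # noise added to observed rocket positions
        self.reset()

    def reset(self):
        self.agent_pos = np.zeros(2)
        self.agent_heading = 0.0
        self.fuel = 1.0
        self.rockets = np.random.uniform(-10.0, 10.0, size=(3, 2))
        return self._observe()

    def _observe(self):
        # Only rockets within sensor_range are visible, and only with noise.
        dists = np.linalg.norm(self.rockets - self.agent_pos, axis=1)
        visible = self.rockets[dists < self.sensor_range]
        noisy = visible + np.random.normal(0.0, self.obs_noise_std, visible.shape)
        return {"pos": self.agent_pos.copy(), "heading": self.agent_heading,
                "fuel": self.fuel, "rockets": noisy}

    def step(self, action):
        # Continuous action: (throttle in [0, 1], turn angle in radians).
        throttle, turn = action
        self.agent_heading += turn
        self.agent_pos += throttle * np.array([np.cos(self.agent_heading),
                                               np.sin(self.agent_heading)])
        self.fuel = min(1.0, self.fuel - 0.05 * throttle + 0.01)  # burn fuel, recharge slowly
        reward = -0.1  # small time penalty so the agent prefers reaching the goal quickly
        done = False   # ...move rockets, check collisions and goal-reaching here
        return self._observe(), reward, done
```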

I need some hints and names of RL algorithms that suit this case:
- I think it is a POMDP, but can I model it as an MDP and just ignore the noise?
- If it is a POMDP, what is the recommended way to evaluate the probabilities?
- Which is better to use in this case: value functions or policy iteration?
- Can I use a neural network to model the environment dynamics instead of explicit equations?
- If so, is there a specific type/model of NN to recommend?
- I think the actions must be discretized, right?

I know it will take time and effort to learn such a topic, but I am eager to.
If you cannot answer all of the questions, feel free to answer just some of them.
Thanks


Comments (2)

陌若浮生 2024-09-08 13:47:18

If this is your first experiment with reinforcement learning, I would recommend starting with something much simpler than this. You can start simple to get the hang of things and then move on to a more complicated project like this one. I still have trouble with POMDPs, and I have been working in RL for quite a while now. I'll try to answer what questions I can.

I think it is a POMDP, but can I model it as an MDP and just ignore the noise?

Yes. POMDP stands for Partially Observable Markov Decision Process. The "partially observable" part refers to the fact that the agent can't know its state perfectly, but can estimate it based on observations. In your case, the rocket locations would be observations that carry some noise, and based on its previous knowledge the agent could update its belief about where the missiles are. That adds a lot of complexity. It would be much easier to treat the missile locations as exact and not deal with the uncertainty; then you would not have to use POMDPs.

If it is a POMDP, what is the recommended way to evaluate the probabilities?

I don't fully understand your question, but you would use some form of Bayes' rule. That is, you would have a distribution that is your belief state (the probability of being in any given state); that would be your prior, and based on an observation you would adjust it to obtain a posterior distribution. Look into Bayes' rule if you need more information.
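
For a concrete (and heavily simplified) picture of that update, here is a sketch of a discrete Bayes filter over a grid of candidate rocket positions; the grid, Gaussian sensor model, and noise level are my own assumptions:

```python
# Sketch of a discrete Bayes-filter update over candidate rocket positions.
import numpy as np

def update_belief(belief, observation, grid, obs_noise_std=0.3):
    """belief: prior probability per grid cell; observation: one noisy rocket position."""
    # Likelihood: how plausible the observation is if the rocket sat in each cell.
    dists = np.linalg.norm(grid - observation, axis=1)
    likelihood = np.exp(-0.5 * (dists / obs_noise_std) ** 2)
    # Bayes' rule: posterior is proportional to likelihood times prior; then normalize.
    posterior = likelihood * belief
    return posterior / posterior.sum()

# Usage: start from a uniform prior and refine it with each noisy observation.
grid = np.array([[x, y] for x in range(10) for y in range(10)], dtype=float)
belief = np.full(len(grid), 1.0 / len(grid))
belief = update_belief(belief, observation=np.array([3.2, 4.1]), grid=grid)
```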

Which is better to use in this case: value functions or policy iteration?

Most of my experience has been with value functions, and I find them relatively easy to use and understand. But I don't know what else to tell you; this is probably your choice, and I would have to spend time working on the project myself to make a better recommendation.

Can I use a neural network to model the environment dynamics instead of explicit equations? If so, is there a specific type/model of NN to recommend?

I don't know anything about using NN to model environments, sorry.

I think the actions must be discretized, right?

Yes. You would have to have a discrete list of actions and a discrete list of states. Generally the algorithm chooses the best action for any given state, and for the simplest algorithms (something like Q-learning) you just keep track of a value for every state-action pair.
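
As a rough illustration of that idea (not the GitHub code mentioned below), a tabular Q-learning loop might look like this; the environment interface (hashable states, `env.actions`, `env.step` returning `(next_state, reward, done)`) and the hyperparameters are assumptions:

```python
# Tabular Q-learning sketch: one value per (state, action) pair.
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)  # Q[(state, action)] -> estimated return
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy selection over the discrete action list.
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Move the estimate toward reward + discounted best value of the next state.
            best_next = max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```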

If you are just learning all of this, I would recommend the Sutton and Barto text. Also, if you want to see a simple example of an RL algorithm, I have a very simple base class and an example using it up on GitHub (written in Python). The abstract_rl class is meant to be extended for RL tasks but is very simple. simple_rl.py is an example of a simple task (a grid with one position as the goal, using Q-learning as the algorithm) that uses base_rl; it can be run and prints some graphs showing reward over time. Neither is very complex, but if you are just getting started they may help give you some ideas. I hope this helps. Let me know if you have any more, or more specific, questions.

别念他 2024-09-08 13:47:18

Can I use a neural network to model the environment dynamics instead of explicit equations? If so, is there a specific type/model of NN to recommend?

Yes, neural networks are known to be able to solve prediction tasks. However, it depends on how you define your task: do you want your algorithm to learn to control the environment only when it knows the environment's dynamics, or do you want it to learn to solve the task in an unknown environment? The latter case is more common in RL.
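
To make the first option concrete, a learned dynamics model can be a small feed-forward network trained to predict the next state from the current state and action. The sketch below assumes PyTorch and made-up state/action sizes:

```python
# Sketch of a learned dynamics model: a small feed-forward network that predicts
# the next state from (state, action). PyTorch and the sizes are assumptions.
import torch
import torch.nn as nn

state_dim, action_dim = 6, 2  # assumed dimensions for this toy problem

dynamics_model = nn.Sequential(
    nn.Linear(state_dim + action_dim, 64),
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Linear(64, state_dim),  # predicted next state
)
optimizer = torch.optim.Adam(dynamics_model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(states, actions, next_states):
    """One supervised step on transition tuples collected from the real environment."""
    pred = dynamics_model(torch.cat([states, actions], dim=-1))
    loss = loss_fn(pred, next_states)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```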

I think the actions must be discretized, right?

No, they don't have to be. For example, actor-critic solutions work with continuous actions. I have also heard about RL based on Gaussian processes. Plenty of material on both approaches is easy to find via Google.
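
For illustration, the usual way actor-critic methods handle continuous actions is a Gaussian policy head that outputs a mean and standard deviation per action dimension. The sketch below (PyTorch, sizes assumed) shows only the actor; the critic and the training loop are omitted:

```python
# Sketch of a Gaussian policy head for continuous actions, as used by many
# actor-critic methods. Layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class GaussianActor(nn.Module):
    def __init__(self, state_dim=6, action_dim=2):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, 64), nn.Tanh())
        self.mean = nn.Linear(64, action_dim)                 # mean of each action dimension
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # learned log standard deviation

    def forward(self, state):
        h = self.body(state)
        dist = torch.distributions.Normal(self.mean(h), self.log_std.exp())
        action = dist.sample()  # continuous action, e.g. (throttle, turn angle)
        return action, dist.log_prob(action).sum(-1)  # log-prob used in the policy-gradient update
```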
