How to discretise continuous observation and action spaces in Python?
My professor has asked me to apply a Policy Iteration method on the Pendulum-V1 gym environment in OpenAI.
Pendulum-V1 has the following environment:

Observation

Type: Box(3)

| Num | Observation | Min  | Max |
|-----|-------------|------|-----|
| 0   | cos(theta)  | -1.0 | 1.0 |
| 1   | sin(theta)  | -1.0 | 1.0 |
| 2   | theta dot   | -8.0 | 8.0 |
Actions

Type: Box(1)

| Num | Action       | Min  | Max |
|-----|--------------|------|-----|
| 0   | Joint effort | -2.0 | 2.0 |
From my understanding, Policy Iteration requires a discrete action space, a discrete observation space, and known transition probabilities, as in the Frozen Lake OpenAI environment. I know there are methods designed for Box-type data over a continuous range, but the requirement is to apply a "correct" Policy Iteration method and explain why it doesn't work.
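For concreteness, this is the kind of tabular model I understand Policy Iteration to need, in the same `P[s][a] = [(prob, next_state, reward, done), ...]` format that FrozenLake exposes through `env.unwrapped.P`. The function below is my own sketch (parameter names and defaults are my guesses), not tested code:

```python
import numpy as np

# Tabular policy iteration over a FrozenLake-style model:
# P[s][a] is a list of (prob, next_state, reward, done) tuples.
def policy_iteration(P, n_states, n_actions, gamma=0.9, tol=1e-8):
    policy = np.zeros(n_states, dtype=int)
    V = np.zeros(n_states)
    while True:
        # --- policy evaluation: iterate V under the current policy ---
        while True:
            delta = 0.0
            for s in range(n_states):
                v = sum(p * (r + gamma * V[s2] * (not done))
                        for p, s2, r, done in P[s][policy[s]])
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < tol:
                break
        # --- policy improvement: act greedily with respect to V ---
        stable = True
        for s in range(n_states):
            q = [sum(p * (r + gamma * V[s2] * (not done))
                     for p, s2, r, done in P[s][a])
                 for a in range(n_actions)]
            best = int(np.argmax(q))
            if best != policy[s]:
                stable = False
            policy[s] = best
        if stable:
            return policy, V
```

On FrozenLake this would be called with something like `policy_iteration(env.unwrapped.P, env.observation_space.n, env.action_space.n)`, but Pendulum-V1 has no such `P` table, which is the part I am stuck on.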
Does anyone have a source, know of a code repo, or could you help me with how I would discretise the action and observation data and apply Policy Iteration to it? Everything I have read tells me this is a bad way to solve the problem, and I cannot find anyone who has actually implemented this method on Pendulum-V1.
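What I have in mind is something along these lines: bin theta and theta dot, bin the torque, and build a deterministic transition table by forcing the pendulum's internal state to each bin centre before stepping. The bin counts, the use of `env.unwrapped.state`, and the helper names are my own guesses, so please correct me if this is the wrong approach:

```python
import gym
import numpy as np

N_THETA, N_THETA_DOT, N_ACTIONS = 31, 31, 5   # assumed bin counts

theta_centres = np.linspace(-np.pi, np.pi, N_THETA)
theta_dot_centres = np.linspace(-8.0, 8.0, N_THETA_DOT)
torques = np.linspace(-2.0, 2.0, N_ACTIONS)   # discrete torque levels

def obs_to_state(obs):
    """Map [cos(theta), sin(theta), theta_dot] to one state index
    by snapping to the nearest bin centre."""
    theta = np.arctan2(obs[1], obs[0])
    i = int(np.argmin(np.abs(theta_centres - theta)))
    j = int(np.argmin(np.abs(theta_dot_centres - obs[2])))
    return i * N_THETA_DOT + j

def build_model(env):
    """Deterministic transition table in the FrozenLake P[s][a] format:
    one (prob=1.0, next_state, reward, done) entry per state-action pair."""
    n_states = N_THETA * N_THETA_DOT
    P = {s: {a: [] for a in range(N_ACTIONS)} for s in range(n_states)}
    for i, th in enumerate(theta_centres):
        for j, thd in enumerate(theta_dot_centres):
            s = i * N_THETA_DOT + j
            for a, torque in enumerate(torques):
                env.reset()
                # Pendulum stores its state as (theta, theta_dot);
                # overwrite it so we can probe the dynamics from this bin.
                env.unwrapped.state = np.array([th, thd])
                obs, reward, *rest = env.step(np.array([torque]))
                P[s][a] = [(1.0, obs_to_state(obs), float(reward), False)]
    return P, n_states

env = gym.make("Pendulum-v1")
P, n_states = build_model(env)
policy, V = policy_iteration(P, n_states, N_ACTIONS)   # from the sketch above
```

I assume the quality of the resulting policy depends heavily on how fine the bins are, which may be what the "explain why it doesn't work" part of the assignment is getting at, but I would like to confirm that this is at least a sensible way to set the problem up.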