Q-table representation with nested lists as states and tuples as actions



How can I create a Q-table, when my states are lists and actions are tuples?

Example of states for N = 3

[[1], [2], [3]]
[[1], [2, 3]]
[[1], [3, 2]]
[[2], [3, 1]]
[[1, 2, 3]]

Example of actions for those states

[[1], [2], [3]] -> (1, 2), (1, 3), (2, 1), (2, 3), (3, 1), (3, 2)
[[1], [2, 3]] -> (1, 2), (2, 0), (2, 1)
[[1], [3, 2]] -> (1, 3), (3, 0), (3, 1)
[[2], [3, 1]] -> (2, 3), (3, 0), (3, 2)
[[1, 2, 3]] -> (1, 0)

I was wondering about

# q_table = {state: {action: q_value}}

But I don't think that's a good design.


2 Answers

梦晓ヶ微光ヅ倾城 2025-01-26 00:28:05


1. Should your states really be of type list?

list is a mutable type. tuple is the equivalent immutable type. Do you mutate your states during learning? I doubt it.

In any case, if you use list, you cannot use it as a dictionary key (because it is mutable).

2. Otherwise this is a pretty good representation

In a reinforcement learning context, you’ll want to

  1. get a specific value for Q
  2. look at the Q values for all possible actions in a specific state (to find the maximal Q)

Your representation allows you to do both of these with minimal complexity, and is pretty clear. So it is a good representation.
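
As a rough sketch of both points (the freeze_state helper below is just illustrative, not a standard function): freeze each nested-list state into a tuple of tuples so it can be used as a dictionary key, and then both lookups are one-liners.

def freeze_state(state):
    # Convert a nested-list state such as [[1], [2, 3]] into ((1,), (2, 3))
    # so it is hashable and can serve as a dictionary key.
    return tuple(tuple(group) for group in state)

q_table = {}

state = [[1], [2, 3]]
action = (1, 2)

key = freeze_state(state)
q_table.setdefault(key, {})[action] = 0.0

q_value = q_table[key][action]                          # a specific Q value
best_action = max(q_table[key], key=q_table[key].get)   # argmax over actions in this state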

暖风昔人 2025-01-26 00:28:05


Using a nested dictionary is actually a reasonable design choice for custom tabular reinforcement learning---it's called tabular for a reason :)

You could use defaultdict to initialize the q-table to a certain value, e.g., 0.

from collections import defaultdict

default_q_value = 0.0  # initial estimate for every unseen state-action pair

q = defaultdict(lambda: defaultdict(lambda: default_q_value))

or without defaultdict:

q = {s: {a: default_q_value for a in actions} for s in states}  # assumes states and actions are enumerated up front

It is then convenient to perform the update, taking the max over next-state actions like so:

best_next_state_val = max(q[next_state].values())
q[state][action] += alpha * (reward + gamma * best_next_state_val - q[state][action])

One thing I'd watch out for is that if you train an agent using a Q-table like this, it will pick the same action every time whenever all of the action values are equal (such as right after the Q-function is initialized).
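
One way around that (a sketch only; pick_action is a made-up helper name) is to break ties randomly among the actions whose Q values are currently maximal:

import random

def pick_action(q, state):
    # q[state] is assumed to be a dict {action: q_value}.
    # Break ties at random so a freshly initialized table does not
    # always return the same action for a given state.
    best = max(q[state].values())
    candidates = [a for a, v in q[state].items() if v == best]
    return random.choice(candidates)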

Finally, if you don't want to use dictionaries, you can map state and action tuples to indices, store the mapping in a dictionary, and use a lookup when you pass the state/action to your environment implementation. You can then use them as indices into a 2D NumPy array.
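
For instance, a sketch under the assumption that all states and actions can be enumerated up front (reusing the N = 3 example from the question):

import numpy as np

states = [((1,), (2,), (3,)),
          ((1,), (2, 3)),
          ((1,), (3, 2)),
          ((2,), (3, 1)),
          ((1, 2, 3),)]
actions = [(1, 2), (1, 3), (2, 1), (2, 3), (3, 1), (3, 2), (2, 0), (3, 0), (1, 0)]

state_index = {s: i for i, s in enumerate(states)}    # state tuple -> row
action_index = {a: i for i, a in enumerate(actions)}  # action tuple -> column

q = np.zeros((len(states), len(actions)))

# Lookups and updates go through the index mappings.
s_i = state_index[((1,), (2, 3))]
a_i = action_index[(1, 2)]
q[s_i, a_i] += 0.1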
