Stable Baselines 3 - setting the Q-values manually
What I have done
I'm using the DQN algorithm from Stable Baselines 3 for a two-player board game. In this game, 40 moves are available, but once a move is made, it can't be played again.
I trained my first model against an opponent that chooses its moves randomly. If the model makes an invalid move, I give it a negative reward equal to the maximum score one can obtain, and I stop the game.
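As an illustration of that reward scheme (this is not the original environment code, just a minimal gym sketch; MAX_SCORE and the placeholder scoring logic are assumptions):

import gym
import numpy as np

MAX_SCORE = 40  # assumption: the maximum score reachable in the game

class BoardEnv(gym.Env):
    """Illustrative skeleton: 40 lines, each of which can be drawn only once."""

    def __init__(self):
        super().__init__()
        self.action_space = gym.spaces.Discrete(40)
        self.observation_space = gym.spaces.MultiBinary(40)
        self.state = np.zeros(40, dtype=np.int8)

    def reset(self):
        self.state = np.zeros(40, dtype=np.int8)
        return self.state.copy()

    def step(self, action):
        if self.state[action] == 1:
            # Invalid move: maximum negative reward and the episode ends.
            return self.state.copy(), -float(MAX_SCORE), True, {}
        self.state[action] = 1
        # Placeholder scoring: the real game logic is not part of the post.
        reward = 0.0
        done = bool(self.state.all())
        return self.state.copy(), reward, done, {}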
The issue
Once that was done, I trained a new model against the one I obtained from the first run. Unfortunately, the training process eventually gets blocked because the opponent seems to loop on an invalid move. This means that, despite everything I tried in the first training, the first model still predicts invalid moves. Here is the code for the "dumb" opponent:
while self.dumb_turn:
    # The opponent chooses a move
    chosen_line, _states = model2.predict(self.state, deterministic=True)
    # We check whether the move is valid; if not, predict again
    while line_exist(chosen_line, self.state):
        chosen_line, _states = model2.predict(self.state, deterministic=True)
    # Once a valid move is found, we register it by adding it to the state
    self.state[chosen_line] = 1
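For reference, the blocking behaviour follows from predict() being a pure function of the observation when deterministic=True: if the greedy action for a given state is invalid, calling predict() again on the same state returns the same action forever. A small self-contained check (CartPole is used only as a stand-in model):

import numpy as np
from stable_baselines3 import DQN

model = DQN("MlpPolicy", "CartPole-v1")
obs = np.zeros(4, dtype=np.float32)  # any fixed observation

a1, _ = model.predict(obs, deterministic=True)
a2, _ = model.predict(obs, deterministic=True)
# Same observation + deterministic=True -> same greedy action every time,
# which is why the inner while-loop above never escapes an invalid move.
assert a1 == a2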
What I would like to do but don't know how
A solution would be to manually set the Q-values of the invalid moves to -inf, so that the opponent avoids those moves and the training algorithm does not get stuck. I've been told how to access these values:
import torch as th
from stable_baselines3 import DQN

model = DQN("MlpPolicy", "CartPole-v1")
env = model.get_env()
obs = env.reset()
with th.no_grad():
    obs_tensor, _ = model.q_net.obs_to_tensor(obs)
    q_values = model.q_net(obs_tensor)
But I don't know how to set them to -infinity.
If somebody could help me, I would be very grateful.
Comments (1)
I recently had a similar problem in which I needed to directly alter the Q-values produced by the RL model during training in order to influence its actions.
To do this I overrode some methods of the library:
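A minimal sketch of what such an override could look like in Stable Baselines 3: a QNetwork subclass that sets the Q-values of already-played moves to -inf before the greedy argmax. It assumes the observation is the 40-entry board vector itself, so a move is invalid exactly where the corresponding entry is already 1; MaskedQNetwork and MaskedDQNPolicy are illustrative names, not part of the library:

import torch as th
from stable_baselines3.dqn.policies import DQNPolicy, QNetwork

class MaskedQNetwork(QNetwork):
    def _predict(self, observation, deterministic=True):
        q_values = self(observation)
        # Assumption: the observation is the 40-entry board vector, so an
        # entry equal to 1 marks a line that has already been played.
        invalid = observation.bool()
        q_values = q_values.masked_fill(invalid, -float("inf"))
        return q_values.argmax(dim=1).reshape(-1)

class MaskedDQNPolicy(DQNPolicy):
    def make_q_net(self):
        # Same construction as the parent class, but with the masked network.
        net_args = self._update_features_extractor(self.net_args, features_extractor=None)
        return MaskedQNetwork(**net_args).to(self.device)

# Usage (assuming env is the 40-move board environment):
#     from stable_baselines3 import DQN
#     model = DQN(MaskedDQNPolicy, env)

Note that this only masks the greedy action selection at prediction time; random exploration steps during training are not affected by it.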
Personally I don't like this approach very much, and I would suggest that you first try some "more natural" alternatives, for example also giving your model some kind of history of which actions have already been selected as input, to help it learn that previously selected actions should be avoided.
For example, you could enrich the RL model's input with an additional binary mask in which the moves already chosen have their corresponding bit set to 1 (in this case you would need to modify the gym environment).
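A minimal sketch of that idea, assuming the base environment uses a Box observation space and exposes the already-played moves through a hypothetical attribute named played (a length-40 binary array); the wrapper simply appends the mask to every observation:

import gym
import numpy as np

class ActionMaskWrapper(gym.ObservationWrapper):
    """Append a binary 'already played' mask to each observation.

    Assumptions: the wrapped env has a Box observation space and exposes a
    length-40 array named 'played' with 1 for moves that were already chosen.
    """

    def __init__(self, env):
        super().__init__(env)
        low = np.concatenate([env.observation_space.low, np.zeros(40, dtype=np.float32)])
        high = np.concatenate([env.observation_space.high, np.ones(40, dtype=np.float32)])
        self.observation_space = gym.spaces.Box(low=low, high=high, dtype=np.float32)

    def observation(self, obs):
        mask = np.asarray(self.env.played, dtype=np.float32)
        return np.concatenate([np.asarray(obs, dtype=np.float32), mask])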