Stable Baselines 3 - setting q_values "manually"

Published on 2025-01-17 20:33:56

What I have done

I'm using the DQN algorithm from Stable Baselines 3 for a two-player board-type game. In this game, 40 moves are available, but once a move has been made, it can't be played again.

I trained my first model against an opponent that picks its moves at random. If the model makes an invalid move, I give it a negative reward equal to the maximum score one can obtain and stop the game.

The issue

Once that was done, I trained a new model against the one obtained from the first run. Unfortunately, the training process eventually gets blocked because the opponent seems to loop on an invalid move. This means that, despite everything I tried during the first training, the first model still predicts invalid moves. Here's the code for the "dumb" opponent:

while(self.dumb_turn):
    # The opponent chooses a move
    chosen_line, _states = model2.predict(self.state, deterministic=True)
    # We check whether the move is valid or not
    # Note: with deterministic=True, predict returns the same move for the same
    # state, so this inner loop never terminates once an invalid move is predicted
    while(line_exist(chosen_line, self.state)):
        chosen_line, _states = model2.predict(self.state, deterministic=True)
    # Once a valid move is made, we register it and add it to the state
    self.state[chosen_line] = 1

What I would like to do but don't know how

A solution would be to manually set the Q-values of the invalid moves to -inf, so that the opponent avoids those moves and the training algorithm does not get stuck. I've been told how to access these values:

import torch as th
from stable_baselines3 import DQN

model = DQN("MlpPolicy", "CartPole-v1")
env = model.get_env()

obs = env.reset()
with th.no_grad():
    obs_tensor, _ = model.q_net.obs_to_tensor(obs)
    q_values = model.q_net(obs_tensor)

But I don't know how to set them to -infinity.
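
To make my question more concrete: what I imagine is something along the lines of this untested sketch (`invalid_actions` being a placeholder for the indices of moves already played), but I don't know whether hacking the tensor like this is the right way to do it, or how to make the model itself use the masked values:

import torch as th

# Untested sketch: pick the greedy action after masking out invalid moves.
# `invalid_actions` is a placeholder for the list of already-played move indices.
def predict_valid(model, obs, invalid_actions):
    with th.no_grad():
        obs_tensor, _ = model.q_net.obs_to_tensor(obs)
        q_values = model.q_net(obs_tensor)            # shape (1, n_actions)
        q_values[0, invalid_actions] = -float("inf")  # rule out invalid moves
        return int(q_values.argmax(dim=1).item())     # greedy action among valid moves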

If somebody could help me, I would be very grateful.

Comments (1)

野生奥特曼 2025-01-24 20:33:57

I recently had a similar problem in which I needed to directly alter the q-values produced by the RL model during training in order to influence its actions.

To do this I overrode some methods of the library:

# Imports
import torch as th
from stable_baselines3.dqn.policies import QNetwork, DQNPolicy

# Override some methods of the QNetwork class used by the DQN model in order to set
# the q-values of some actions to a negative value

# Two possible methods to override:
# Override _predict ---> alters q-values only during predictions but not during training
# Override forward  ---> alters q-values also during training (Attention: here we are working with batches of q-values)

class QNetwork_modified(QNetwork):

    def forward(self, obs: th.Tensor) -> th.Tensor:
        """
        Predict the q-values.
        :param obs: Observation
        :return: The estimated Q-Value for each action.
        """
        # Compute the q-values using the QNetwork
        q_values = self.q_net(self.extract_features(obs))
        # For each observation in the training batch:
        for i in range(obs.shape[0]):
            # Here you can alter q_values[i]
            pass

        return q_values


# Override the make_q_net method of the DQN policy used by the DQN model so that it uses the new Q-network

class DQNPolicy_modified(DQNPolicy):
    def make_q_net(self) -> QNetwork:
        # Make sure we always have separate networks for feature extractors etc.
        net_args = self._update_features_extractor(self.net_args, features_extractor=None)
        return QNetwork_modified(**net_args).to(self.device)



model = DQN(DQNPolicy_modified, env, verbose=1)
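
For example (just a sketch, assuming each observation row is, or contains, the binary occupancy vector of your 40 lines, where 1 marks an already-played move), the loop body could look like the snippet below; using a large finite negative value instead of -inf keeps the training loss well-behaved:

# Sketch only: assumes obs[i] is a binary vector where 1 marks an already-played move
for i in range(obs.shape[0]):
    already_played = obs[i] > 0.5          # boolean mask of taken lines
    q_values[i][already_played] = -1e8     # effectively rules these actions out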

Personally I don't like this approach very much, and I would suggest first trying some "more natural" alternatives, such as also feeding your model some kind of history of which actions have already been selected, to help it learn that previously chosen actions should be avoided.
For example, you could enrich the input of the RL model with an additional binary mask in which the moves already chosen have their corresponding bit set to 1 (in this case you would have to modify the gym environment).
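
A rough sketch of that last idea, assuming your base observation space is a 1-D Box and your action space is Discrete (the wrapper name and shapes are placeholders to adapt):

import numpy as np
import gym
from gym import spaces

class AlreadyChosenMaskWrapper(gym.ObservationWrapper):
    """Appends a binary 'already chosen' mask to the base observation (sketch only)."""

    def __init__(self, env):
        super().__init__(env)
        n = env.action_space.n
        base = env.observation_space
        self.observation_space = spaces.Box(
            low=np.concatenate([base.low, np.zeros(n)]).astype(np.float32),
            high=np.concatenate([base.high, np.ones(n)]).astype(np.float32),
            dtype=np.float32,
        )
        self._mask = np.zeros(n, dtype=np.float32)

    def reset(self, **kwargs):
        self._mask[:] = 0.0
        return super().reset(**kwargs)

    def step(self, action):
        self._mask[action] = 1.0   # remember that this move has been played
        return super().step(action)

    def observation(self, obs):
        # Base observation followed by the binary mask of already-chosen moves
        return np.concatenate([obs, self._mask]).astype(np.float32)

Note that in your two-player setup the opponent's moves happen inside the environment's step, so you would also have to set the corresponding bits there (or build the mask directly from your self.state vector, which already records played lines).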
