RL + optimization: how can this be done better?

I am learning about how to optimize using reinforcement learning. I have chosen the problem of maximum matching in a bipartite graph as I can easily compute the true optimum.

Recall that a matching in a graph is a subset of the edges where no two edges are incident on the same node/vertex. The goal is to find the largest such subset.
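
To make the matching property concrete, here is a minimal sketch of a checker (the is_matching helper is purely illustrative and not part of the code below; it assumes g is an igraph graph and edge_ids is an iterable of edge indices):

# Illustrative helper (not part of the original code): returns True if the
# given edge indices form a matching in the igraph graph g.
def is_matching(g, edge_ids):
    used_nodes = set()
    for eid in edge_ids:
        e = g.es[eid]
        if e.source in used_nodes or e.target in used_nodes:
            return False  # two edges would share an endpoint
        used_nodes.update((e.source, e.target))
    return True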

I show my full code below but first let me explain parts of it.

num_variables = 1000
g = ig.Graph.Random_Bipartite(num_variables, num_variables, p=3/num_variables)
g_matching = g.maximum_bipartite_matching()
print("Matching size", len([v for v in g_matching.matching if v < num_variables and v != -1]))

This makes a random bipartite graph with 1000 nodes in each of the two sets of nodes. It then prints out the size of the true maximum matching.
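
As an aside on how that count works: as far as I can tell, g_matching.matching is a list where entry i is the partner vertex of vertex i, or -1 if vertex i is unmatched. With the two parts numbered 0..num_variables-1 and num_variables..2*num_variables-1, counting the entries that point back into the first part counts each matched pair exactly once:

# Equivalent restatement of the count above; the commented-out assert assumes
# igraph's Matching object supports len() for the number of matched pairs.
matched_pairs = sum(1 for v in g_matching.matching if v != -1 and v < num_variables)
# assert matched_pairs == len(g_matching)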

In the code below, self.agent_pos is an array representing the current matching. Its length is the number of edges in the original graph, and there is a 1 at index i if edge i is included and a 0 otherwise. self.matching is the set of edges in the growing matching. self.matching_nodes is the set of nodes in the growing matching, which is used to check whether a particular edge can be added.

import igraph as ig
from tqdm import tqdm
import numpy as np
import gym
from gym import spaces

from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

class MaxMatchEnv(gym.Env):
    metadata = {'render.modes': ['console']}
    def __init__(self, array_length=10):
        super(MaxMatchEnv, self).__init__()
        # array_length is the number of edges in the graph (one action per edge)
        self.array_length = array_length
        self.agent_pos = [0]*array_length
        self.action_space = spaces.Discrete(array_length)
        self.observation_space = spaces.Box(low=0, high=1, shape=(array_length,), dtype=np.uint8)
        self.matching = set()  # set of edges
        self.matching_nodes = set() # set of node ids (ints)
        self.matching_size = len([v for v in g_matching.matching if v < num_variables and v != -1])
        self.best_found = 0
        
    def reset(self):
        # Reset to an empty matching
        self.time = 0
        #print(self.agent_pos)
        self.agent_pos = [0]*self.array_length
        self.matching = set()
        self.matching_nodes = set()
        return np.array(self.agent_pos)
    
        
    def step(self, action):
        self.time += 1 
        reward = 0
        edge = g.es[action]
        if not(edge.source in self.matching_nodes or edge.target in self.matching_nodes):
            self.matching.add(edge)
            self.matching_nodes.add(edge.source)
            self.matching_nodes.add(edge.target)
            self.agent_pos[action] = 1
            if sum(self.agent_pos) > self.best_found:
                self.best_found = sum(self.agent_pos)
                print("New max", self.best_found)
            reward = 1
        elif self.agent_pos[action] == 1:
            #print("Removing edge", action)
            self.matching_nodes.remove(edge.source)
            self.matching_nodes.remove(edge.target)
            self.matching.remove(edge)
            self.agent_pos[action] = 0
            reward = -1
        done = sum(self.agent_pos) == self.matching_size
        info = {}
        return np.array(self.agent_pos), reward, done, info

    def render(self, mode='console'):
        print(sum(self.agent_pos))

    def close(self):
        pass


if __name__ == '__main__':
 
    num_variables = 1000
    g = ig.Graph.Random_Bipartite(num_variables, num_variables, p=3/num_variables)
    g_matching = g.maximum_bipartite_matching()
    print("Matching size", len([v for v in g_matching.matching if v < num_variables and v != -1]))

    env = make_vec_env(lambda: MaxMatchEnv(array_length=len(g.es)), n_envs=12)

    model = PPO('MlpPolicy', env, verbose=1).learn(10000000)

There are a number of problems with this, but the main one is that it doesn't optimize well. The code reaches a matching of just over 550 and then stops improving, whereas the true optimum is over 900 (it is printed out by the code at the start).

The main question is:

How can this be done better so that it gets to a better matching?

A subsidiary question is, how can I print the best matching found so far? My attempt using self.best_found to maintain the best score does not work as it seems to be reset regularly.

Changes that haven't helped

  • Changing PPO for DQN makes only a marginal difference.
  • I tried changing the code so that done is True after 1000 steps.

The change is as follows:

if self.time == 1000:
    done = True
else:
    done = False
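
For clarity, this is roughly where that change sits in step() (a sketch; the assumption is that it replaces the original done = sum(self.agent_pos) == self.matching_size line, and that the if/else collapses to a single assignment):

# Assumed placement inside step(), replacing the matching-size check;
# the if/else above is equivalent to this one-liner.
done = (self.time == 1000)
info = {}
return np.array(self.agent_pos), reward, done, info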

Having added print(max(env.get_attr("best_found"))) in place of print("New max", self.best_found), this change to done shows no advantage at all.

Comments (2)

逆流 2025-02-01 05:22:27

To print the max for each environment you can use the get_attr method from Stable Baselines. More info in their official docs.

For example, the lines below will print the max for each of the 12 environments and then the maximum across all environments.

print(env.get_attr("best_found"))
print(max(env.get_attr("best_found")))

Regarding why it does not converge, it could be due to a badly chosen reward, although looking at your reward choice it seems sensible. I added a debug print to your code to see if some step led to done = True, but it seems that the environment never reaches that state. I think that for the model to learn it would be good to have multiple sequences of actions leading to a state with done = True, which would mean that the model gets to experience the end of an episode. I did not study the problem in your code in detail, but maybe this information can help you debug the issue.

If we compare this problem with other environments like CartPole, episodes there end with done = True, and that helps the model learn a better policy (in your case you could limit the number of actions per episode instead of running the same episode forever; see the sketch below). That could help the model avoid getting stuck in a local optimum, as you give it the opportunity to "try again" in a new episode.
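
To make the episode-length idea concrete, here is a minimal sketch; the MaxMatchEnvCapped subclass and its max_steps parameter are hypothetical names, not something from the question:

# Hedged sketch: cap the episode length so PPO regularly sees episode ends.
# MaxMatchEnvCapped and max_steps are illustrative, not from the question's code.
class MaxMatchEnvCapped(MaxMatchEnv):
    def __init__(self, array_length=10, max_steps=1000):
        super().__init__(array_length=array_length)
        self.max_steps = max_steps

    def step(self, action):
        obs, reward, done, info = super().step(action)
        # Also end the episode after max_steps actions, even if the optimum
        # matching size has not been reached.
        done = done or self.time >= self.max_steps
        return obs, reward, done, info

env = make_vec_env(lambda: MaxMatchEnvCapped(array_length=len(g.es), max_steps=1000), n_envs=12)

Wrapping the unmodified environment in gym.wrappers.TimeLimit(env, max_episode_steps=1000) should achieve the same effect.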

半岛未凉 2025-02-01 05:22:27

If I'm reading this right, then you aren't performing any hyperparameter tuning. How do you know your learning rate / policy update rate / any other variable parameters are a good fit for the problem at hand?

It looks like stable baselines has this functionality built-in:
https://stable-baselines.readthedocs.io/en/master/guide/rl_zoo.html?highlight=tune#hyperparameter-optimization
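
As a rough illustration of what such a search could look like, here is a sketch that uses Optuna directly rather than the RL Zoo scripts linked above; it assumes Optuna is installed, reuses MaxMatchEnv and g from the question, and all ranges and budgets are arbitrary guesses:

# Hedged sketch of a hyperparameter search with Optuna; ranges and the
# training budget are illustrative, not tuned recommendations.
import optuna
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

def objective(trial):
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)
    gamma = trial.suggest_float("gamma", 0.9, 0.9999)
    n_steps = trial.suggest_categorical("n_steps", [256, 512, 1024, 2048])
    env = make_vec_env(lambda: MaxMatchEnv(array_length=len(g.es)), n_envs=4)
    PPO("MlpPolicy", env, learning_rate=lr, gamma=gamma,
        n_steps=n_steps, verbose=0).learn(200_000)
    # Score the trial by the best matching size any of the environments found.
    return max(env.get_attr("best_found"))

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print(study.best_params)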
