RL + optimization: how to do it better?
I am learning how to use reinforcement learning for optimization. I have chosen the problem of maximum matching in a bipartite graph.
Recall that a matching in a graph is a subset of edges such that no two edges share a node/vertex. The goal is to find the largest such subset.
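As a tiny illustration (a toy graph, separate from the actual problem instance below):

# Toy example: a 2+2 bipartite graph with edges (0,2), (0,3) and (1,3).
# {(0,2), (1,3)} is a maximum matching of size 2: no two of its edges share a vertex.
import igraph as ig
toy = ig.Graph.Bipartite([0, 0, 1, 1], [(0, 2), (0, 3), (1, 3)])
toy_matching = toy.maximum_bipartite_matching()
print(toy_matching.matching)  # matched partner of each vertex, or -1 if unmatched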
I show my full code below, but first let me explain parts of it.
num_variables = 1000
g = ig.Graph.Random_Bipartite(num_variables, num_variables, p=3/num_variables)
g_matching = g.maximum_bipartite_matching()
print("Matching size", len([v for v in g_matching.matching if v < num_variables and v != -1]))
This makes a random bipartite graph with num_variables nodes in each of the two parts. It then prints out the size of the true maximum matching.
In the code below, self.agent_pos is an array representing the current matching. Its length is the number of edges in the original graph, and index i holds a 1 if edge i is included and a 0 otherwise. self.matching is the set of edges in the growing matching. self.matching_nodes is the set of nodes in the growing matching, which is used to check whether a particular edge can be added or not.
import igraph as ig
from tqdm import tqdm
import numpy as np
import gym
from gym import spaces
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env
class MaxMatchEnv(gym.Env):
    metadata = {'render.modes': ['console']}

    def __init__(self, array_length=10):
        super(MaxMatchEnv, self).__init__()
        # One entry per edge of the graph: actions pick an edge, observations show the matching
        self.array_length = array_length
        self.agent_pos = [0]*array_length
        self.action_space = spaces.Discrete(array_length)
        self.observation_space = spaces.Box(low=0, high=1, shape=(array_length,), dtype=np.uint8)
        self.matching = set()        # set of edges in the current matching
        self.matching_nodes = set()  # set of node ids (ints) covered by the matching
        self.matching_size = len([v for v in g_matching.matching if v < num_variables and v != -1])
        self.best_found = 0

    def reset(self):
        # Start a new episode with an empty matching
        self.time = 0
        #print(self.agent_pos)
        self.agent_pos = [0]*self.array_length
        self.matching = set()
        self.matching_nodes = set()
        return np.array(self.agent_pos)

    def step(self, action):
        self.time += 1
        reward = 0
        edge = g.es[action]
        if not(edge.source in self.matching_nodes or edge.target in self.matching_nodes):
            # Neither endpoint is covered yet, so the edge can be added
            self.matching.add(edge)
            self.matching_nodes.add(edge.source)
            self.matching_nodes.add(edge.target)
            self.agent_pos[action] = 1
            if sum(self.agent_pos) > self.best_found:
                self.best_found = sum(self.agent_pos)
                print("New max", self.best_found)
            reward = 1
        elif self.agent_pos[action] == 1:
            # The edge is already in the matching, so remove it
            #print("Removing edge", action)
            self.matching_nodes.remove(edge.source)
            self.matching_nodes.remove(edge.target)
            self.matching.remove(edge)
            self.agent_pos[action] = 0
            reward = -1
        done = sum(self.agent_pos) == self.matching_size
        info = {}
        return np.array(self.agent_pos), reward, done, info

    def render(self, mode='console'):
        print(sum(self.agent_pos))

    def close(self):
        pass


if __name__ == '__main__':
    num_variables = 1000
    g = ig.Graph.Random_Bipartite(num_variables, num_variables, p=3/num_variables)
    g_matching = g.maximum_bipartite_matching()
    print("Matching size", len([v for v in g_matching.matching if v < num_variables and v != -1]))

    env = make_vec_env(lambda: MaxMatchEnv(array_length=len(g.es)), n_envs=12)
    model = PPO('MlpPolicy', env, verbose=1).learn(10000000)
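As a quick sanity check (a throwaway sketch, assuming g and g_matching have already been built as above), the environment can be stepped by hand without any agent:

# Manual check of the environment dynamics (no RL agent involved):
# adding a free edge should give +1, picking the same edge again should give -1.
check_env = MaxMatchEnv(array_length=len(g.es))
obs = check_env.reset()
obs, reward, done, info = check_env.step(0)   # add edge 0 to the empty matching
print(reward)                                 # expected: 1
obs, reward, done, info = check_env.step(0)   # edge 0 is already in the matching
print(reward)                                 # expected: -1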
There are a number of problems, but the main one is that it does not optimize well. The code only ever reaches a matching of a little over 550 and then stops improving, while the true optimum is over 900 (printed out by the code at the start).
The main question is:
How can I do this better so that it gets to a better matching?
A secondary question is: how can I print the best matching found so far? My attempt to use self.best_found to keep track of the best score does not work, as it seems to be reset regularly.
Changes that did not help
- Swapping PPO for DQN makes only a marginal difference.
- I tried changing the code so that done is True after 1000 steps. The change is as follows:
if self.time == 1000:
    done = True
else:
    done = False
With this change to done, adding print(max(env.get_attr("best_found"))) in place of print("New max", self.best_found) gives no advantage at all.
2 Answers
To print the max for each environment you can use the get_attr method from Stable Baselines. More info in their official docs. For example, the lines below will print the max for each of the 12 environments and then the maximum across all environments.
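Something along these lines should work (a minimal sketch, assuming the vectorized env from the question and its best_found attribute):

# Query the per-environment best_found attribute through the VecEnv wrapper,
# then take the maximum over all sub-environments.
per_env_best = env.get_attr("best_found")   # one value per sub-environment (12 here)
print("Best per environment:", per_env_best)
print("Best across all environments:", max(per_env_best))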
Regarding why it does not converge, it could be due to a badly chosen reward, although looking at your reward choice it seems sensible. I added a debug print to your code to see if some step leads to done = True, but it seems that the environment never reaches that state. I think that for the model to learn, it would be good to have multiple sequences of actions leading to a state with done = True, which would mean the model gets to experience the end of an episode. I did not study the problem in your code in detail, but maybe this information can help you debug your issue.
If we compare the problem with other environments like CartPole, there we have episodes that end with done = True, and that helps the model learn a better policy. In your case you could limit the number of actions per episode instead of running the same episode forever. That could help the model avoid getting stuck in a local optimum, since you give it the opportunity to "try again" in a new episode.
If I'm reading this right, then you aren't performing any hyperparameter tuning. How do you know your learning rate / policy update rate / any other variable parameters are a good fit for the problem at hand?
It looks like stable baselines has this functionality built-in:
https://stable-baselines.readthedocs.io/en/master/guide/rl_zoo.html?highlight=tune#hyperparameter-optimization
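As a rough illustration (the values below are arbitrary placeholders, not tuned recommendations), PPO's constructor already exposes the usual knobs directly:

# Overriding a few PPO hyperparameters by hand; the values are placeholders
# meant only to show where the knobs live.
model = PPO(
    'MlpPolicy',
    env,
    learning_rate=1e-4,   # default is 3e-4
    n_steps=4096,         # rollout length per environment before each update (default 2048)
    ent_coef=0.01,        # entropy bonus to encourage exploration (default 0.0)
    verbose=1,
).learn(1_000_000)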