Reinforcement Learning Example
Environment: there are 25 total turns, and two types of actions: build CS and build CI.
Goal: find the maximum number of CIs (buildings) that can be built within the given number of turns, specifically using machine learning / reinforcement learning.
Note: even though CS are technically buildings, I do not count them toward the total number of buildings. Keep this in mind when reading my code: "buildings" means only the CIs built.
Formula: BPT (buildings per turn) = floor(CS / 4) + 5. For every 4 CS built, the number of CIs you can build per turn increases by 1 (you start at 5); a minimal sketch of this rule follows the example below.
For example:
turn 1: build 5 CI (bpt: 5) (total buildings: 5)
turn 2: build 1 CS (bpt: 5) (total buildings: 5)
turn 3: build 1 CS (bpt: 5) (total buildings: 5)
turn 4: build 1 CS (bpt: 5) (total buildings: 5)
turn 5: build 1 CS (bpt: 6) (total buildings: 5)
turn 6: build 6 CI (bpt: 6) (total buildings: 11) (increased by BPT 6)
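A minimal sketch of the BPT rule replaying the six turns above (the bpt helper name is just for illustration):

import math

def bpt(cs: int) -> int:
    # buildings per turn: floor(CS / 4) + 5
    return math.floor(cs / 4) + 5

cs, buildings = 0, 0
buildings += bpt(cs)   # turn 1: build 5 CI, total 5
cs += 4                # turns 2-5: build 1 CS each, bpt rises to 6
buildings += bpt(cs)   # turn 6: build 6 CI, total 11
print(cs, buildings)   # 4 11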
My overall goal is to reach turn 25 and see what the maximum number of CIs built can be. In addition to that, I want to know the steps, and the order in which to take them, that produce that best case.
My code below seems to achieve this during training, but it fails when I attempt to use the trained model. My understanding is that after all episodes are completed, my q_values table should be able to map out the best possible path.
Unfortunately, what happens is that my final q_values table appears to hold all the same values, and np.argmax simply selects the 0th index for every decision (the inspection snippet after the code shows this). What I have noticed is that during training my model correctly identifies the best solution, but for some reason my final q_values table doesn't reflect it.
One important note: at turn 25, the maximum number of buildings should be 126 if played correctly. The first 4 turns should be CS and the rest CI, which maximizes the total; the short check right below confirms this.
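That optimum can be verified independently of the RL code with a small dynamic-programming sketch over (turns left, CS built); the best and best_plan names are just for illustration:

import math
from functools import lru_cache

@lru_cache(maxsize=None)
def best(turns_left: int, cs: int) -> int:
    # maximum CIs still buildable from this state
    if turns_left == 0:
        return 0
    build_ci = (math.floor(cs / 4) + 5) + best(turns_left - 1, cs)
    build_cs = best(turns_left - 1, cs + 1)
    return max(build_ci, build_cs)

def best_plan(turns: int) -> list:
    # recover one optimal action sequence by following best() greedily
    cs, plan = 0, []
    for t in range(turns, 0, -1):
        if (math.floor(cs / 4) + 5) + best(t - 1, cs) >= best(t - 1, cs + 1):
            plan.append("build ci")
        else:
            plan.append("build cs")
            cs += 1
    return plan

print(best(25, 0))    # 126
print(best_plan(25))  # ['build cs'] * 4 followed by ['build ci'] * 21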
import numpy as np
import math
import pdb


class AI:
    def __init__(self, turns: int, learning_rate: float, discount_factor: float, actions: list, q_values: list):
        '''
        turns: max number of turns an agent can take
        learning_rate: the rate at which the agent should learn
        discount_factor: the decay applied to future rewards
        actions: the actions the agent can take
        q_values: a table of action-value estimates indicating which action to take at any given state
        history_cs: state -> number of CS built
        history_ci: state -> number of CI built (buildings)
        '''
        # default values
        self.state = 0
        self.cs = 0
        self.buildings = 0
        self.max_buildings = 0
        self.history_cs = []
        self.history_ci = []
        self.turns = turns
        self.learning_rate = learning_rate
        self.discount_factor = discount_factor
        self.actions = actions
        self.q_values = q_values

    def reset(self):
        ''' Resets the default values back to their original values '''
        self.state = 0
        self.cs = 0
        self.buildings = 0
        self.history_cs = []
        self.history_ci = []

    def get_reward(self) -> int:
        ''' The reward is based on the number of buildings created '''
        return self.buildings

    def is_game_over(self) -> bool:
        ''' Determines whether all turns have been used '''
        return self.state == self.turns

    def get_bpt(self, cs: int) -> int:
        ''' Determines the current buildings per turn '''
        return math.floor(cs / 4) + 5

    def get_next_action(self, epsilon: float) -> int:
        '''
        Returns the greedy (highest-valued) action with probability epsilon,
        otherwise a random action.
        '''
        if np.random.random() < epsilon:
            return np.argmax(self.q_values[self.state])
        else:
            return np.random.randint(2)

    def get_next_state(self, action_index: int) -> int:
        ''' Executes the chosen action and returns the next state '''
        if self.actions[action_index] == "build ci":
            new_buildings = self.get_bpt(self.cs)
            self.buildings += new_buildings
            self.history_ci.append({self.state: new_buildings})
        elif self.actions[action_index] == "build cs":
            self.cs += 1
            self.history_cs.append({self.state: 1})
        self.state += 1
        return self.state

    def print_best_path(self):
        self.reset()
        while not self.is_game_over():
            action_index = self.get_next_action(1.)
            if action_index == 0:
                print("build ci")
            else:
                print("build cs")
            self.get_next_state(action_index)
        print(f"total construction sites: {self.cs}")
        print(f"total buildings: {self.buildings}")


TURNS = 25

ai = AI(turns=TURNS,
        learning_rate=0.9,
        discount_factor=0.9,
        actions=["build ci", "build cs"],
        q_values=np.zeros((TURNS + 1, 1, 2)))

for episode in range(100000):
    ai.reset()
    action_index = None
    while not ai.is_game_over():
        action_index = ai.get_next_action(.9)
        old_state = ai.state
        next_state = ai.get_next_state(action_index)
        if ai.buildings < ai.max_buildings:
            reward = -10
        else:
            reward = -1
        old_q_value = ai.q_values[old_state, 0, action_index]
        temporal_difference = reward + (ai.discount_factor * np.max(ai.q_values[next_state])) - old_q_value
        new_q_value = old_q_value + (ai.learning_rate * temporal_difference)
        ai.q_values[old_state, 0, action_index] = new_q_value
    if ai.buildings > ai.max_buildings:
        ai.max_buildings = ai.buildings
        print(f"\nepisode: {episode}")
        print(ai.history_cs)
        print(ai.history_ci)
        print(f"total construction sites: {ai.cs}")
        print(f"total buildings: {ai.buildings}")
        # if ai.buildings == 126:
        #     print(ai.q_values)
        # pdb.set_trace()

# ai.print_best_path()
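To see the symptom directly, the learned table can be dumped per turn once the training loop finishes (an ad-hoc inspection snippet, not part of the original code):

for t in range(TURNS):
    q_ci, q_cs = ai.q_values[t, 0]
    greedy = ai.actions[np.argmax(ai.q_values[t, 0])]
    print(f"turn {t:2d}  q(build ci)={q_ci:8.3f}  q(build cs)={q_cs:8.3f}  greedy: {greedy}")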