PPO
- class PPO(model, clip_param=0.1, value_loss_coef=0.5, entropy_coef=0.01, initial_lr=0.00025, eps=1e-05, max_grad_norm=0.5, use_clipped_value_loss=True, norm_adv=True, continuous_action=False)
Bases:
Algorithm
- __init__(model, clip_param=0.1, value_loss_coef=0.5, entropy_coef=0.01, initial_lr=0.00025, eps=1e-05, max_grad_norm=0.5, use_clipped_value_loss=True, norm_adv=True, continuous_action=False)
PPO algorithm
- Parameters:
model (parl.Model) – forward network of actor and critic.
clip_param (float) – epsilon in clipping loss.
value_loss_coef (float) – value function loss coefficient in the optimization objective.
entropy_coef (float) – policy entropy coefficient in the optimization objective.
initial_lr (float) – learning rate.
eps (float) – Adam optimizer epsilon.
max_grad_norm (float) – max gradient norm for gradient clipping.
use_clipped_value_loss (bool) – whether or not to use a clipped loss for the value function.
norm_adv (bool) – whether or not to use advantages normalization.
continuous_action (bool) – whether the environment has a continuous action space.
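Below is a minimal usage sketch, not taken verbatim from the PARL examples: it builds a small actor-critic parl.Model whose policy() and value() heads follow the interface used in the official PARL PPO examples (treat that interface, the class name ActorCriticModel, and all dimensions as illustrative assumptions) and wraps it with the PPO algorithm.

```python
# Minimal sketch: the policy(obs)/value(obs) model interface and all sizes
# below are assumptions for illustration, not part of this API reference.
import paddle
import paddle.nn as nn
import parl
from parl.algorithms import PPO


class ActorCriticModel(parl.Model):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.fc_pi = nn.Linear(obs_dim, act_dim)  # actor head: action logits
        self.fc_v = nn.Linear(obs_dim, 1)         # critic head: state value

    def policy(self, obs):
        return self.fc_pi(obs)                    # logits of a discrete policy

    def value(self, obs):
        return self.fc_v(obs)                     # V(s), shape [batch_size, 1]


model = ActorCriticModel(obs_dim=4, act_dim=2)
alg = PPO(model,
          clip_param=0.1,           # epsilon in the clipped surrogate loss
          value_loss_coef=0.5,
          entropy_coef=0.01,
          initial_lr=2.5e-4,
          continuous_action=False)  # discrete action space in this sketch
```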
- learn(batch_obs, batch_action, batch_value, batch_return, batch_logprob, batch_adv, lr=None)
Update the model with the PPO algorithm.
- Parameters:
batch_obs (paddle.Tensor) – shape([batch_size] + obs_shape)
batch_action (paddle.Tensor) – shape([batch_size] + action_shape)
batch_value (paddle.Tensor) – shape([batch_size])
batch_return (paddle.Tensor) – shape([batch_size])
batch_logprob (paddle.Tensor) – shape([batch_size])
batch_adv (paddle.Tensor) – shape([batch_size])
lr (float, optional) – learning rate used for this update
- Returns:
value_loss (float): value loss
action_loss (float): policy loss
entropy_loss (float): entropy loss
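A hedged example of a single update step, continuing the sketch above: the batch tensors are random placeholders standing in for data collected by a rollout buffer (old values, returns, old log-probabilities, advantage estimates), and the action dtype depends on your action space.

```python
# Continues the sketch above; random placeholders replace real rollout data.
batch_size, obs_dim, act_dim = 32, 4, 2
batch_obs = paddle.randn([batch_size, obs_dim])
batch_action = paddle.randint(0, act_dim, [batch_size])  # discrete actions
batch_value = paddle.randn([batch_size])      # V(s) recorded at rollout time
batch_return = paddle.randn([batch_size])     # e.g. GAE returns
batch_logprob = paddle.randn([batch_size])    # log pi_old(a|s)
batch_adv = paddle.randn([batch_size])        # advantage estimates

value_loss, action_loss, entropy_loss = alg.learn(
    batch_obs, batch_action, batch_value, batch_return,
    batch_logprob, batch_adv)
```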
- predict(obs)
Use the model to predict an action.
- Parameters:
obs (paddle.Tensor) – observation, shape([batch_size] + obs_shape)
- Returns:
action (paddle.Tensor): action, shape([batch_size] + action_shape). Note that in the discrete case, the argmax along the last axis is taken as the action.
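For evaluation, predict returns the greedy action. A short continuation of the sketch above; the observation is a random placeholder:

```python
# Greedy action for evaluation; obs is a random placeholder with a batch dim.
obs = paddle.randn([1, 4])
action = alg.predict(obs)   # discrete case: argmax over the policy output
```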
- sample(obs)
Define the sampling process. This function returns an action sampled from the action distribution.
- Parameters:
obs (paddle.Tensor) – observation, shape([batch_size] + obs_shape)
- Returns:
value (paddle.Tensor): value, shape([batch_size, 1])
action (paddle.Tensor): action, shape([batch_size] + action_shape)
action_log_probs (paddle.Tensor): action log probabilities, shape([batch_size])
action_entropy (paddle.Tensor): action entropy, shape([batch_size])
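During rollout collection, sample draws stochastic actions and also returns the quantities that learn later consumes. A hedged continuation of the sketch above, with placeholder observations:

```python
# Stochastic sampling used when collecting rollouts (placeholder observations).
obs = paddle.randn([8, 4])
value, action, action_log_probs, action_entropy = alg.sample(obs)
# value:            [8, 1]  critic estimate V(s)
# action:           [8]     sampled actions
# action_log_probs: [8]     log pi(a|s) of the sampled actions
# action_entropy:   [8]     per-sample policy entropy
```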