
PPO


class PPO(model, clip_param=0.1, value_loss_coef=0.5, entropy_coef=0.01, initial_lr=0.00025, eps=1e-05, max_grad_norm=0.5, use_clipped_value_loss=True, norm_adv=True, continuous_action=False)[source]

Bases: Algorithm

__init__(model, clip_param=0.1, value_loss_coef=0.5, entropy_coef=0.01, initial_lr=0.00025, eps=1e-05, max_grad_norm=0.5, use_clipped_value_loss=True, norm_adv=True, continuous_action=False)[source]

PPO algorithm

Parameters:
  • model (parl.Model) – forward network of actor and critic.

  • clip_param (float) – epsilon used in the clipped surrogate (policy) loss.

  • value_loss_coef (float) – value function loss coefficient in the optimization objective.

  • entropy_coef (float) – policy entropy coefficient in the optimization objective.

  • initial_lr (float) – initial learning rate of the Adam optimizer.

  • eps (float) – Adam optimizer epsilon.

  • max_grad_norm (float) – max gradient norm for gradient clipping.

  • use_clipped_value_loss (bool) – whether or not to use a clipped loss for the value function.

  • norm_adv (bool) – whether or not to normalize the advantages.

  • continuous_action (bool) – whether the environment has a continuous action space.
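A minimal construction sketch follows. The ActorCritic network, its policy/value head interface, the observation/action dimensions (4 and 2), and the import path parl.algorithms.PPO are illustrative assumptions, not part of this reference; check PARL's PPO examples for the exact model interface your PARL version expects.

import parl
import torch.nn as nn

# Hypothetical actor-critic model (assumption): one parl.Model holding both
# the policy head and the value head of the agent.
class ActorCritic(parl.Model):
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.policy_net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
        self.value_net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))

    def policy(self, obs):
        # Logits of a discrete policy (continuous_action=False below).
        return self.policy_net(obs)

    def value(self, obs):
        return self.value_net(obs).squeeze(-1)

model = ActorCritic(obs_dim=4, act_dim=2)   # illustrative dimensions
alg = parl.algorithms.PPO(
    model,
    clip_param=0.1,
    value_loss_coef=0.5,
    entropy_coef=0.01,
    initial_lr=2.5e-4,
    continuous_action=False)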

learn(batch_obs, batch_action, batch_value, batch_return, batch_logprob, batch_adv, lr=None)[source]

Update the model with the PPO algorithm.

Parameters:
  • batch_obs (torch.Tensor) – shape([batch_size] + obs_shape)

  • batch_action (torch.Tensor) – shape([batch_size] + action_shape)

  • batch_value (torch.Tensor) – shape([batch_size])

  • batch_return (torch.Tensor) – shape([batch_size])

  • batch_logprob (torch.Tensor) – shape([batch_size])

  • batch_adv (torch.Tensor) – shape([batch_size])

  • lr (torch.Tensor) – learning rate for the current update step.

返回:

  value_loss (float): value function loss

  action_loss (float): policy loss

  entropy_loss (float): entropy loss
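A hedged sketch of one update call, assuming alg is the PPO instance from the construction sketch above and that learn returns the three losses listed; the batch size and dimensions are illustrative, and in practice these tensors come from a rollout buffer rather than random data.

import torch

batch_size, obs_dim = 32, 4                        # illustrative only
batch_obs = torch.randn(batch_size, obs_dim)
batch_action = torch.randint(0, 2, (batch_size,))  # discrete actions
batch_value = torch.randn(batch_size)
batch_return = torch.randn(batch_size)
batch_logprob = torch.randn(batch_size)
batch_adv = torch.randn(batch_size)

value_loss, action_loss, entropy_loss = alg.learn(
    batch_obs, batch_action, batch_value, batch_return,
    batch_logprob, batch_adv)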

predict(obs)[source]

Use the model to predict the action.

参数:

obs (torch tensor) – observation, shape([batch_size] + obs_shape)

返回:

action, shape([batch_size] + action_shape); note that in the discrete case we take the argmax along the last axis as the action

返回类型:

action (torch tensor)
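A hedged usage sketch, reusing the alg instance from the construction sketch above; the single observation of dimension 4 is illustrative.

import torch

obs = torch.randn(1, 4)      # one observation, obs_dim=4 (illustrative)
action = alg.predict(obs)    # discrete case: argmax over the last axis, shape([1])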

sample(obs)[source]

Define the sampling process. This function returns the action sampled from the action distribution.

参数:

obs (torch tensor) – observation, shape([batch_size] + obs_shape)

返回:

  value (torch tensor): value, shape([batch_size, 1])

  action (torch tensor): action, shape([batch_size] + action_shape)

  action_log_probs (torch tensor): action log probs, shape([batch_size])

  action_entropy (torch tensor): action entropy, shape([batch_size])
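A hedged usage sketch for the rollout phase, again reusing the alg instance from the construction sketch above; the batch of 8 observations is illustrative and the shapes in the comment follow the list above.

import torch

obs = torch.randn(8, 4)      # batch of 8 observations (illustrative)
value, action, action_log_probs, action_entropy = alg.sample(obs)
# value: shape([8, 1]); action: shape([8] + action_shape);
# action_log_probs and action_entropy: shape([8])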

value(obs)[source]

Use the model to predict the value of the observation.

参数:

obs (torch tensor) – observation, shape([batch_size] + obs_shape)

返回:

value of obs, shape([batch_size])

返回类型:

value (torch tensor)
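A hedged usage sketch, reusing the alg instance from the construction sketch above; shapes are illustrative.

import torch

obs = torch.randn(8, 4)      # illustrative batch of observations
values = alg.value(obs)      # predicted values of obs, shape([8])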
