Deep reinforcement learning: 1-step TD doesn't converge

Published 2025-01-27 16:11:17

Is there some trick to getting 1-step TD (temporal difference) prediction to converge with a neural net? The network is a simple feed forward network using ReLU. I've got the network working for Q-learning in the following way:

  gamma = 0.9
  q0 = model.predict(X0[times+1])
  q1 = model.predict(X1[times+1])
  q2 = model.predict(X2[times+1])
  q_Opt = np.min(np.concatenate((q0,q1,q2),axis=1),axis=1)
  # Use negative rewards because rewards are negative
  target = -np.array(rewards)[times] + gamma * q_Opt

Where X0, X1, and X2 are MNIST image features with actions 0, 1, and 2 concatenated onto them respectively. This method converges. What I'm trying that doesn't work:
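For reference, the greedy backup in the working Q-learning snippet can be reduced to a toy numpy computation. The values below are hypothetical stand-ins for the `model.predict` outputs; the point is just the shape of the target: negate the (negative) rewards into costs, then add the discounted minimum over the three per-action predictions.

```python
import numpy as np

gamma = 0.9

# Hypothetical Q-value predictions for actions 0, 1, 2 at the next
# states (stand-ins for model.predict(X0[times+1]), etc.); each is
# an (n_samples, 1) column as Keras predict would return.
q0 = np.array([[1.0], [2.0]])
q1 = np.array([[0.5], [3.0]])
q2 = np.array([[1.5], [1.0]])

# Greedy backup over actions: with costs (negated rewards), the
# best action is the one that minimizes the predicted Q-value.
q_opt = np.min(np.concatenate((q0, q1, q2), axis=1), axis=1)

rewards = np.array([-1.0, -2.0])  # hypothetical raw (negative) rewards

# Negate the rewards so the target is cost + discounted minimal future cost
target = -rewards + gamma * q_opt
print(target)  # [1.45 2.9 ]
```

This is the same computation as the snippet above, just with concrete numbers so the per-row min over actions is easy to verify by hand.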

  # What I'm trying that doesn't work
  v_hat_next = model.predict(X[time_steps+1])
  target = -np.array(rewards)[times] + gamma * v_hat_next

  history = model.fit(X[times], target, batch_size=128, epochs=10, verbose=1)

This method doesn't converge at all and in fact gives identical state values for every state. Any idea what I'm doing wrong? Is there some trick to setting up the target? The target is supposed to be R_{t+1} + γ·v̂(S_{t+1}, w), and I thought that's what I've done here.
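As a sanity check on the formula itself, the TD(0) target R_{t+1} + γ·v̂(S_{t+1}, w) reduces to a one-line numpy computation. The values below are hypothetical (stand-ins for `model.predict(X[times+1])` and the reward array); the rewards are negated as in the question's code, so the result is a cost-to-go target.

```python
import numpy as np

gamma = 0.9

# Hypothetical state-value predictions for the successor states S_{t+1}
# (stand-in for model.predict(X[times+1]))
v_hat_next = np.array([0.5, 1.0, 0.0])

# Hypothetical raw rewards; negative, as stated in the question
rewards = np.array([-1.0, -2.0, -0.5])

# TD(0) target: R_{t+1} + gamma * v_hat(S_{t+1}, w), with rewards
# negated into costs as in the question's code
target = -rewards + gamma * v_hat_next
print(target)  # [1.45 2.9  0.5 ]
```

Each element is just the negated reward plus the discounted bootstrapped value of the next state, which matches the formula; whatever is going wrong is therefore in the indexing or the fitting loop rather than in the target expression itself.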
