Deep reinforcement learning: 1-step TD prediction does not converge
Is there some trick to getting 1-step TD (temporal-difference) prediction to converge with a neural net? The network is a simple feed-forward network using ReLU. I've got the network working for Q-learning in the following way:
import numpy as np

gamma = 0.9

# Predicted Q-values at the next time step for each of the three actions
q0 = model.predict(X0[times+1])
q1 = model.predict(X1[times+1])
q2 = model.predict(X2[times+1])
# Minimum over the three actions' predicted values
q_Opt = np.min(np.concatenate((q0, q1, q2), axis=1), axis=1)

# Use negative rewards because rewards are negative
target = -np.array(rewards)[times] + gamma * q_Opt
Here X0, X1, and X2 are MNIST image features with the actions 0, 1, and 2 concatenated onto them, respectively. This method converges.
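For concreteness, the concatenation is done roughly like this (an illustrative sketch, not my exact code: features stands in for the MNIST image features, and a scalar action column is shown, though a one-hot encoding would work the same way):

# features: MNIST image features for each visited state, shape (N, num_features)
def with_action(feats, action):
    # Append the chosen action (0, 1, or 2) as one extra input column
    action_col = np.full((len(feats), 1), action, dtype=feats.dtype)
    return np.concatenate((feats, action_col), axis=1)

X0 = with_action(features, 0)
X1 = with_action(features, 1)
X2 = with_action(features, 2)

What I'm trying that doesn't work: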
# What I'm trying that doesn't work
v_hat_next = model.predict(X[time_steps+1])   # v_hat(S_{t+1}) from the current network
target = -np.array(rewards)[times] + gamma * v_hat_next
history = model.fit(X[times], target, batch_size=128, epochs=10, verbose=1)
This method doesn't converge at all and in fact gives identical state values for every state. Any idea what I'm doing wrong? Is there some trick to setting up the target? The target is supposed to be $R_{t+1} + \gamma\,\hat{v}(S_{t+1}, \mathbf{w})$, and I thought that's what I've done here.
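For reference, the full semi-gradient TD(0) update that this target comes from (in the usual Sutton-and-Barto notation, where $\mathbf{w}$ are the network weights and $\alpha$ is the step size) is

$$\mathbf{w} \leftarrow \mathbf{w} + \alpha \left[ R_{t+1} + \gamma\,\hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w}) \right] \nabla_{\mathbf{w}}\,\hat{v}(S_t, \mathbf{w}),$$

and fitting the network on the target above is intended to approximate this update.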