Deep reinforcement learning: 1-step TD prediction does not converge
Is there some trick to getting 1-step TD (temporal-difference) prediction to converge with a neural net? The network is a simple feed-forward network using ReLU. I've got the network working for Q-learning in the following way:
import numpy as np

gamma = 0.9

# Predicted Q-values at the next time step for each of the three actions
q0 = model.predict(X0[times+1])
q1 = model.predict(X1[times+1])
q2 = model.predict(X2[times+1])
# Minimum over the three actions' predicted values
q_Opt = np.min(np.concatenate((q0, q1, q2), axis=1), axis=1)

# Use negative rewards because rewards are negative
target = -np.array(rewards)[times] + gamma * q_Opt
Here X0, X1, and X2 are MNIST image features with the actions 0, 1, and 2 concatenated onto them, respectively. This method converges.
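For concreteness, the concatenation is done roughly like this (an illustrative sketch, not my exact code: features stands in for the MNIST image features, and a scalar action column is shown, though a one-hot encoding would work the same way):

# features: MNIST image features for each visited state, shape (N, num_features)
def with_action(feats, action):
    # Append the chosen action (0, 1, or 2) as one extra input column
    action_col = np.full((len(feats), 1), action, dtype=feats.dtype)
    return np.concatenate((feats, action_col), axis=1)

X0 = with_action(features, 0)
X1 = with_action(features, 1)
X2 = with_action(features, 2)

What I'm trying that doesn't work: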
# What I'm trying that doesn't work
v_hat_next = model.predict(X[time_steps+1])   # v_hat(S_{t+1}) from the current network
target = -np.array(rewards)[times] + gamma * v_hat_next
history = model.fit(X[times], target, batch_size=128, epochs=10, verbose=1)
This method doesn't converge at all and in fact gives identical state values for every state. Any idea what I'm doing wrong? Is there some trick to setting up the target? The target is supposed to be $R_{t+1} + \gamma\,\hat{v}(S_{t+1}, \mathbf{w})$, and I thought that's what I've done here.
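For reference, the full semi-gradient TD(0) update that this target comes from (in the usual Sutton-and-Barto notation, where $\mathbf{w}$ are the network weights and $\alpha$ is the step size) is

$$\mathbf{w} \leftarrow \mathbf{w} + \alpha \left[ R_{t+1} + \gamma\,\hat{v}(S_{t+1}, \mathbf{w}) - \hat{v}(S_t, \mathbf{w}) \right] \nabla_{\mathbf{w}}\,\hat{v}(S_t, \mathbf{w}),$$

and fitting the network on the target above is intended to approximate this update.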