Need help with a reward function in reinforcement learning

Posted on 2025-02-10 09:23:56

I've created an RL agent to trade an artificial custom financial asset (complete code). This is my dataframe (environment), made of a 'Close' price and 'Volume':

import pandas as pd

closes = []
volumes = []
# Build 16 identical triangle waves: 30 rising ticks followed by 30 falling ticks
for i in range(0, 16):
    for inc in range(0, 30):
        closes.append(1 + 0.00005 * inc)
        volumes.append(2 + 0.00008 * inc)
    for dec in range(0, 30):
        closes.append(1.00145 - 0.00005 * dec)
        volumes.append(2.00240 - 0.00008 * dec)

raw_df = pd.DataFrame(zip(closes, volumes), columns=['close', 'volume'])

I'm making my data stationary using differencing (df - df.shift(1)). There are three actions: Sell, Buy and Hold.
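For concreteness, here is a minimal sketch of that differencing step, assuming the raw_df built above (dropping the NaN first row is my own choice, not necessarily what the original code does):

# Difference 'close' and 'volume' to make the series stationary (df - df.shift(1)).
# Dropping the first row (NaN after shifting) is an assumption for this sketch.
stationary_df = (raw_df - raw_df.shift(1)).dropna().reset_index(drop=True)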

And this is the observation returned after each step: 'close', 'volume', trade_length, total_episode_profit, current_profit, current_action (trading or watching).
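As an illustration only, one way the observation described above could be assembled; build_observation and its arguments are hypothetical names, and encoding current_action as a number (e.g. 1 = trading, 0 = watching) is an assumption:

import numpy as np

def build_observation(df, step, trade_length, total_episode_profit,
                      current_profit, current_action):
    # Hypothetical helper: concatenate the differenced 'close'/'volume' row
    # with the four extra state variables listed above (6 features total,
    # matching input_shape = df_ep.shape[1] + 4 in the network below).
    row = df.iloc[step][['close', 'volume']].to_numpy(dtype=np.float32)
    extras = np.array([trade_length, total_episode_profit,
                       current_profit, current_action], dtype=np.float32)
    return np.concatenate([row, extras])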

There is an open and close trade cost equal to 1, there is a 0.5 penalty for watching the market and doing nothing, the holding reward is equal to close[-1] - close[-2], and the sell reward is equal to the total profit or loss of the trading position.
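To make that reward scheme concrete, here is a sketch that follows the description above; the helper name, the position bookkeeping, and the treatment of invalid actions are my own assumptions, not the original environment code:

# Illustrative sketch of the described reward scheme (assumptions noted in comments).
TRADE_COST = 1.0      # charged when a trade is opened or closed
WATCH_PENALTY = 0.5   # charged for watching the market and doing nothing

def step_reward(action, in_trade, closes, entry_price):
    last, prev = closes[-1], closes[-2]
    if action == 'buy' and not in_trade:
        return -TRADE_COST                        # cost of opening the trade
    if action == 'hold' and in_trade:
        return last - prev                        # per-step holding reward
    if action == 'sell' and in_trade:
        return (last - entry_price) - TRADE_COST  # realized profit/loss minus closing cost
    return -WATCH_PENALTY                         # watching / doing nothing (assumed fallback)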

And here is my NN structure:

# Imports assumed for a Keras version that exposes adam_v2; adjust to your install.
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.optimizers import adam_v2

model = Sequential()
model.add(Dense(10, activation='tanh', input_shape=(env.df_ep.shape[1] + 4,)))
model.add(Dropout(0.2))
model.add(Dense(8))
model.add(Dropout(0.2))
model.add(Dense(env.ACTION_SPACE_SIZE, activation='linear'))
model.compile(loss='mse', optimizer=adam_v2.Adam(learning_rate=0.001), metrics=['accuracy'])

The problem is that after lots of episodes (about 6000) the RL agent stops learning and just opens a trade at the start and holds it until the end! But this is a really simple financial asset and a simple environment, not a real financial asset, so I think it should be easy to learn. I guess the problem is with my reward function.

Here are some plots of episodes:

[episode plot images not included]
