Transformer with audio features (Multi-Head-Attention), validation accuracy always the same


I am having trouble creating a Transformer model. Whatever I change in the parameters, I always get 11.86% validation accuracy, and it does not change even if I train the model with only one input. The accuracy changes only if I change the size of the validation data. I tried to follow this guide. I have 500 audio files and I extracted 20 MFCC features from each, so the data has shape (500, 20, 1). The labels are the emotions belonging to these audio files. Since I already have the features, I did not use any embedding or tokenization. Here is the code right now:

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers


class TransformerBlock(layers.Layer):
    def __init__(self, key_dim, num_heads, ff_dim, rate=0.1):
        super(TransformerBlock, self).__init__()
        self.att = layers.MultiHeadAttention(num_heads=num_heads, key_dim=key_dim)
        self.ffn = keras.Sequential(
            [layers.Dense(ff_dim, activation="relu"), layers.Dense(key_dim)]
        )
        self.layernorm1 = layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = layers.Dropout(rate)
        self.dropout2 = layers.Dropout(rate)

    def call(self, inputs, training=None):
        # Self-attention over the sequence, then a position-wise feed-forward
        # network, each followed by dropout, a residual connection and layer norm.
        attn_output = self.att(inputs, inputs)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(inputs + attn_output)
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        return self.layernorm2(out1 + ffn_output)


key_dim = 2    # Embedding size for each token
num_heads = 2  # Number of attention heads
ff_dim = 32    # Hidden layer size in the feed-forward network inside the transformer

inputs = layers.Input(shape=(20, 1))  # 20 MFCC features, 1 channel per feature
transformer_block = TransformerBlock(key_dim, num_heads, ff_dim)
x = transformer_block(inputs)
x = layers.GlobalAveragePooling1D()(x)
x = layers.Dropout(0.1)(x)
x = layers.Dense(20, activation="relu")(x)
x = layers.Dropout(0.1)(x)
outputs = layers.Dense(7, activation="softmax")(x)  # 7 emotion classes

model = keras.Model(inputs=inputs, outputs=outputs)
optimizer = tf.keras.optimizers.Adam(
    learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-7, amsgrad=False, name="Adam"
)
model.compile(optimizer=optimizer, loss="categorical_crossentropy", metrics=["accuracy"])

history = model.fit(
    x_train, y_train, batch_size=16, epochs=15, validation_data=(x_val, y_val)
)
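One detail worth spelling out, since the model is compiled with categorical_crossentropy: that loss expects one-hot encoded labels, whereas integer class ids would need sparse_categorical_crossentropy. Below is a minimal sketch of both options (it assumes the emotion labels start out as integers 0-6; y_train_int and y_val_int are placeholder names, not variables from the code above):

from tensorflow.keras.utils import to_categorical

# Option 1: one-hot encode integer labels so they match categorical_crossentropy.
y_train = to_categorical(y_train_int, num_classes=7)  # shape (n_samples, 7)
y_val = to_categorical(y_val_int, num_classes=7)

# Option 2: keep the integer labels and switch to the sparse loss instead.
# model.compile(optimizer=optimizer,
#               loss="sparse_categorical_crossentropy",
#               metrics=["accuracy"])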

With the code above, training runs, but the results look like this:

30/30 [==============================] - 1s 17ms/step - loss: 1.9461 - accuracy: 0.1213 - val_loss: 1.9471 - val_accuracy: 0.1186
Epoch 2/15
30/30 [==============================] - 0s 11ms/step - loss: 1.9457 - accuracy: 0.1489 - val_loss: 1.9479 - val_accuracy: 0.1186
Epoch 3/15
30/30 [==============================] - 0s 11ms/step - loss: 1.9456 - accuracy: 0.1489 - val_loss: 1.9489 - val_accuracy: 0.1186
Epoch 4/15
30/30 [==============================] - 0s 11ms/step - loss: 1.9452 - accuracy: 0.1489 - val_loss: 1.9501 - val_accuracy: 0.1186
Epoch 5/15
30/30 [==============================] - 0s 11ms/step - loss: 1.9451 - accuracy: 0.1489 - val_loss: 1.9510 - val_accuracy: 0.1186
Epoch 6/15
30/30 [==============================] - 0s 11ms/step - loss: 1.9450 - accuracy: 0.1277 - val_loss: 1.9523 - val_accuracy: 0.1186

Validation accuracy stays at 11.86% no matter which parameters I change, and even when I change the size of the training data. I think I made a mistake while building the model, but I cannot find the problem. At first I tried normalization and one-hot encoding, but once I saw that training on a single sample still gives 11.86% accuracy, I concluded that the problem is completely independent of the data. There should be a problem in how the model is built, but I cannot see it.
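A validation accuracy stuck at exactly 11.86% usually means the model predicts the same class for every sample, and that class makes up 11.86% of the validation set. A quick way to check this (a sketch; it assumes y_val is one-hot encoded, matching the categorical_crossentropy loss):

import numpy as np

# True class distribution in the validation set.
true_classes = np.argmax(y_val, axis=-1)
print(np.unique(true_classes, return_counts=True))

# Predicted class distribution: if the model has collapsed to one class,
# nearly every prediction will share the same class id.
pred_classes = np.argmax(model.predict(x_val), axis=-1)
print(np.unique(pred_classes, return_counts=True))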

This is the model summary:

(model summary image)

Edit 1: I tried increasing the number of units in the dense layers, but the result did not change.

Edit 2: I trained a simple CNN model on the same data and got exactly 11.86% accuracy with it too, so I am not sure what the problem is. Maybe the data has a problem?
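Since the same 11.86% shows up with a completely different architecture, a basic check of the arrays themselves may be the next step, for example (a sketch, assuming x_train, y_train, x_val and y_val are NumPy arrays):

import numpy as np

# Shapes, dtypes, NaN counts, and value ranges of every array fed to the model.
for name, arr in [("x_train", x_train), ("y_train", y_train),
                  ("x_val", x_val), ("y_val", y_val)]:
    nans = np.isnan(arr).sum() if np.issubdtype(arr.dtype, np.floating) else 0
    print(name, arr.shape, arr.dtype, "NaNs:", nans,
          "min:", arr.min(), "max:", arr.max())

# MFCC coefficients can have very different scales; standardizing them with
# statistics computed on the training set alone is one common preprocessing step.
mean = x_train.mean(axis=0, keepdims=True)
std = x_train.std(axis=0, keepdims=True) + 1e-8
x_train = (x_train - mean) / std
x_val = (x_val - mean) / std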
