How can I improve the accuracy of my LSTM model (Keras)?

I am currently trying to create an LSTM network that takes data from many MIDI files (a digital format representing musical notes) and predicts what the next note in a musical sequence will be. I have tokenized the MIDI data into a simpler integer time-series format using the following functions:

tokenizer_map = {}  # module-level dict mapping each note string to its integer token

def tokenize_stream(notes):

    tokenized_array = []
    current_token = 0

    # Join each note's fields into a single string; assign a new token the
    # first time a string is seen, otherwise reuse the existing token.
    for x in notes:
        string_version = ' '.join(x)
        if string_version in tokenizer_map:
            tokenized_array.append(tokenizer_map[string_version])
        else:
            tokenizer_map[string_version] = current_token
            tokenized_array.append(current_token)
            current_token += 1

    return tokenized_array

def data_to_time_series(data, window_size):

    numpy_array = np.array(data)

    X = []
    Y = []

    # Slide a window of `window_size` tokens along the stream; each window
    # becomes one sample and the token that follows it becomes the label.
    for i in range(len(numpy_array) - window_size):
        row = [[a] for a in numpy_array[i: i + window_size]]
        X.append(row)
        label = numpy_array[i + window_size]
        Y.append(label)
    return np.array(X), np.array(Y)

These functions turn the note names into tokens like this:

Raw Data:[1,3,2,2,2,1,4]

And then turn them into time-series format like this:

Data:[1,3,2,2,2,1] -> Label:[4]
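
To make the pipeline concrete, here is a minimal sketch of how these two helpers fit together on a made-up stream of notes (the note names and the window size here are illustrative only):

import numpy as np

# tokenizer_map is assumed to start out empty, as defined above
toy_notes = [['C4'], ['E4'], ['D4'], ['D4'], ['D4'], ['C4'], ['F4']]

tokens = tokenize_stream(toy_notes)
print(tokens)                      # [0, 1, 2, 2, 2, 0, 3]

X, Y = data_to_time_series(tokens, window_size=3)
print(X.shape, Y.shape)            # (4, 3, 1) and (4,) -> four windows of three tokens each
print(X[0].ravel(), '->', Y[0])    # [0 1 2] -> 2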

Below is the actual data for two MIDI files' worth of notes:

Input Data (X):
[
[[1][3][1]...[6][6][6]]

[[3][1][0]...[6][6][1]]

[[1][0][1]...[6][1][6]]

...

[[1][2][2]...[8][1][2]]

[[2][2][1]...[1][2][8]]

[[2][1][0]...[2][8][3]]
]

Labels (Y):
[1 6 1 1 6 6 1 6 1 1 3 1 3 0 3 4 3 4 1 6 6 1 6 1 2 1 3 1 3 0 3 4 3 4 5 1 6
1 3 1 3 0 6 3 3 2 3 3 3 4 5 1 3 0 6 3 3 2 3 1 3 3 0 0 5 3 2 5 5 3 1 5 5 5
9 7 0 5 3 2 5 5 0 3 6 6 6 1 6 1 1 3 1 3 0 3 4 3 4 1 6 1 6 3 2 6 6 2 6 3 2
6 8 3 3 3 3 2 8 3 3 8 3 2 8 3 1 1 0 3 2 6 1 2 2 1 0 8 4 8 3 8 4 1 1 8 1 2
8 3 6 3 1 1 1 1 1 5 1 5 4 3 3 5 1 2 5 6 6 2 5 3 1 0 0 3 1 1 1 1 1 5 1 5 4
3 3 2 5 3 5 0 3 6 3 4 6 1 1 5 1 3 1 1 1 1 1 5 1 4 5 4 3 3 5 1 2 5 6 6 2 5
3 1 0 0 3 1 1 1 1 1 5 1 4 5 4 3 3 2 5 3 5 0 3 6 3 4 6 1 1 5 1 6 3 2 6 6 2
6 3 2 6 8 3 3 3 3 2 8 3 3 8 3 2 8 3 1 1 0 3 2 6 1 2 2 1 0 8 4 8 3 8 4 1 1
8 1 2 8 3 6]

This data is collected and put in the correct format with this function:

import os
import random

import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical  # or keras.utils, depending on the Keras version

def get_data(path, look_back, train_size_v, number_of_midi_files):

    files = []
    count = 0
    countMax = number_of_midi_files

    # Collect up to `number_of_midi_files` MIDI file names from the directory
    for i in os.listdir(path):
        if count == countMax:
            break
        if i.endswith(".mid"):
            files.append(i)
            count += 1

    random.shuffle(files)

    # Add the information from each note in the MIDI files to an array
    # (read_midi is a helper defined elsewhere that returns the note list for one file)
    notes_array = np.array([read_midi(path + i) for i in files])

    # converting 2D array into 1D array
    notes = [element for note_ in notes_array for element in note_]

    # Tokenize the list of notes
    tokens = tokenize_stream(notes)

    unique_notes = list(set(tokens))
    print("Unique Notes: " + str(len(unique_notes)))

    # Transform the data into time series format
    X, Y = data_to_time_series(tokens, look_back)

    n_vocab = len(set(tokens))

    # Split into train / validation / test sets and one-hot encode the labels
    X_train, X_remainder, Y_train, Y_remainder = train_test_split(X, Y, train_size=train_size_v)
    X_val, X_test, Y_val, Y_test = train_test_split(X_remainder, Y_remainder, test_size=0.5)

    Y_train = to_categorical(Y_train, n_vocab)
    Y_val = to_categorical(Y_val, n_vocab)
    Y_test = to_categorical(Y_test, n_vocab)

    return n_vocab, X, Y, X_train, Y_train, X_val, Y_val, X_test, Y_test
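
For reference, to_categorical turns each integer label into a one-hot vector of width n_vocab, so the label shape matches the softmax output; a quick sketch (the import path may differ for standalone Keras):

from tensorflow.keras.utils import to_categorical

print(to_categorical([4], 12))
# [[0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.]]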

The data is then used so that the LSTM learns what the next note in the sequence should be from the label attached to each series of integers, as seen above.

This data is then used to train a very simple LSTM model, but I am having no luck with the model's accuracy. Here is the model I am using:

def build_model(model_input, model_labels, n_vocab, learning_rate):
    model = Sequential()
    model.add(LSTM(10, activation='relu', input_shape=(model_input.shape[1], model_input.shape[2])))
    model.add(Dense(n_vocab))
    model.add(Activation('softmax'))

    model.compile(loss='categorical_crossentropy', optimizer=RMSprop(learning_rate=learning_rate), metrics=['accuracy'])

    model.summary()
    return model

I am using a single LSTM layer followed by a Dense layer with a softmax activation to output a probability for each possible note. In this instance, there are only 12 possible notes, so n_vocab is 12.
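
For clarity, this is roughly how a trained model of this shape would be used to predict the next token (a minimal sketch; `model` and `X_test` stand in for a trained model and the test windows from the data above):

import numpy as np

window = X_test[:1]                         # one window of `look_back` tokens, shape (1, look_back, 1)
probs = model.predict(window)               # softmax output, shape (1, n_vocab)
next_token = int(np.argmax(probs, axis=-1)[0])
print(next_token)                           # predicted token for the next note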

I then train the model as follows:

def train_model(model_input, model_labels, val_input, val_labels, epochs_v, look_back, n_vocab, learning_rate):
    filepath = "music_model_2/"

    earlyStopping = EarlyStopping(monitor='val_loss', patience=10, verbose=0, mode='min')
    mcp_save = ModelCheckpoint(filepath, save_best_only=True, monitor='val_loss', mode='min')
    # Note: `epsilon` was renamed to `min_delta` in newer Keras versions
    reduce_lr_loss = ReduceLROnPlateau(monitor='val_loss', factor=0.1, patience=7, verbose=1, min_delta=1e-4, mode='min')

    model = build_model(model_input, model_labels, n_vocab, learning_rate)

    history = model.fit(model_input, model_labels, validation_data=(val_input, val_labels),
                        batch_size=128, epochs=epochs_v,
                        callbacks=[earlyStopping, mcp_save, reduce_lr_loss]).history

    return history

Finally, in my main function, I am building, training and evaluating the model as follows:

def main():

    path = 'C_Major_Midi/'
    look_back = 16 # size of lookback for the timeseries data 
    epochs = 40 # Number of epochs the model runs for
    training_data_split = 0.8 # The percentage split of the training and test data
    number_of_midi_files = 50 # The number of midi files used to create the time series data
    learning_rate = 0.001 # The learning rate of the model
    batch_size = 128 # Batch size used for the model

    n_vocab, X, Y, X_train, Y_train, X_val, Y_val, X_test, Y_test = get_data(path, look_back, training_data_split, number_of_midi_files)

    history = train_model(X_train, Y_train, X_val, Y_val, epochs, look_back, n_vocab, learning_rate)


    model = load_model("music_model_2")


    test_loss, test_acc = model.evaluate(X_test, Y_test)

    print('Test Loss: {}'.format(test_loss))

    print('Test Accuracy: {}'.format(test_acc))
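
Curves like the ones referenced below can be produced from the history dict returned by train_model, for example with a minimal matplotlib sketch like this (assuming a TF2-era Keras where the metric key is 'accuracy' rather than 'acc'):

import matplotlib.pyplot as plt

plt.plot(history['accuracy'], label='train accuracy')
plt.plot(history['val_accuracy'], label='val accuracy')
plt.xlabel('epoch')
plt.legend()
plt.show()

plt.plot(history['loss'], label='train loss')
plt.plot(history['val_loss'], label='val loss')
plt.xlabel('epoch')
plt.legend()
plt.show()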

The output of the model with the hyperparameters shown here is as follows:

Model: "sequential_13"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lstm_13 (LSTM)               (None, 10)                480       
_________________________________________________________________
dense_13 (Dense)             (None, 12)                132       
_________________________________________________________________
activation_13 (Activation)   (None, 12)                0         
=================================================================
Total params: 612
Trainable params: 612
Non-trainable params: 0

The results from running the model and evaluating it on test data are as follows.

[Plot: Model Accuracy]

[Plot: Model Loss]

Here is a tweaked set of parameters and a tweaked model, to show that the changes I'm making do little to improve the accuracy.

model = Sequential()
model.add(LSTM(128, activation = 'relu', input_shape=(model_input.shape[1], model_input.shape[2])))
model.add(Dense(n_vocab))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', optimizer= RMSprop(learning_rate=learning_rate), metrics=['accuracy'])


look_back = 5
epochs = 100
training_data_split = 0.8
number_of_midi_files = 40
learning_rate = 0.001
batch_size = 50

[Plot: Model Accuracy 2]

[Plot: Model Loss 2]

Here is a further model I wrote that is more complex and runs for more epochs. It still plateaus at just over 0.3 accuracy.

model = Sequential()

#   First LSTM Layer
model.add(LSTM(128, input_shape=(model_input.shape[1], model_input.shape[2]), return_sequences=True))
model.add(Dropout(0.3))

#   Second LSTM Layer
model.add(LSTM(256, return_sequences=True))
model.add(Dropout(0.5))

#   First Hidden Layer
model.add(Dense(256))
model.add(Dropout(0.3))

#   Second Hidden Layer
model.add(Dense(256))
model.add(Dropout(0.5))

#   Flatten data shape
model.add(Flatten())

#   Final Output Layer
model.add(Dense(n_vocab))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', optimizer=adam_v2.Adam(learning_rate=learning_rate, decay=1e-6), metrics=['accuracy'])


path = 'C_Major_Midi/'
look_back = 10
epochs = 200
training_data_split = 0.8
number_of_midi_files = 1000
learning_rate = 0.001
batch_size = 128

[Plot: Model Accuracy 3]

[Plot: Model Loss 3]

And here is a final model that shows good alignment between the training and test data but still plateaus at 0.3 accuracy.

model = Sequential()
model.add(LSTM(50, activation = 'relu', input_shape=(model_input.shape[1], model_input.shape[2])))
model.add(Dropout(0.3))
model.add(Dense(n_vocab))
model.add(Activation('softmax'))

model.compile(loss='categorical_crossentropy', optimizer= RMSprop(learning_rate=learning_rate), metrics=['accuracy'])


look_back = 7
epochs = 50
training_data_split = 0.8
number_of_midi_files = 100
learning_rate = 0.001
batch_size = 128

[Plot: Model Accuracy 4]

[Plot: Model Loss 4]

As shown, I have tried using larger quantities of data, different optimizers, and more/fewer LSTM layers and units within them. Whatever I tweak, I can never achieve an accuracy score much over 0.3. This is my first big machine learning project, so it is very likely I'm making some stupid errors, but I would like someone with experience to let me know why my accuracy is plateauing so quickly.

Thanks so so much xx
