Word2Vec + LSTM: good training and validation, but poor test performance

Posted on 2025-01-23 05:13:07


Currently I'm training my Word2Vec + LSTM for Twitter sentiment analysis. I use the pre-trained GoogleNewsVectorNegative300 word embeddings. The reason I used the pre-trained GoogleNewsVectorNegative300 is that the performance was much worse when I trained my own Word2Vec on my own dataset. The problem is that my training process had validation accuracy and loss stuck at 0.88 and 0.34 respectively, yet the test results are much worse, and my confusion matrix also seems wrong. Here are the processes that I have done before fitting the model.

Text pre-processing (a minimal sketch of these steps follows the list):

  1. Lowercasing
  2. Remove hashtags, mentions, URLs, numbers, words changed into numbers, non-ASCII characters, and the retweet marker "RT"
  3. Expand contractions
  4. Replace negations with antonyms
  5. Remove punctuation
  6. Remove stopwords
  7. Lemmatization
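
A minimal sketch of how these steps might be chained, assuming NLTK for stopwords and lemmatization and simple regexes for the removal steps (the contraction-expansion and antonym-replacement steps are only stubbed; this is an illustration, not the exact code used here):

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)

STOPWORDS = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = text.lower()                                 # 1. lowercasing
    text = re.sub(r'http\S+|www\.\S+', ' ', text)       # 2. URLs
    text = re.sub(r'[@#]\w+', ' ', text)                # 2. mentions and hashtags
    text = re.sub(r'\brt\b', ' ', text)                 # 2. retweet marker "RT"
    text = re.sub(r'[^\x00-\x7f]', ' ', text)           # 2. non-ASCII characters
    text = re.sub(r'\d+', ' ', text)                    # 2. numbers
    # 3. expand contractions and 4. replace negations with antonyms (omitted here)
    text = re.sub(r'[^\w\s]', ' ', text)                # 5. punctuation
    tokens = [t for t in text.split() if t not in STOPWORDS]  # 6. stopwords
    tokens = [lemmatizer.lemmatize(t) for t in tokens]         # 7. lemmatization
    return ' '.join(tokens)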

I split my dataset into 90:10 for train:test as follows:

from sklearn.model_selection import train_test_split

def split_data(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, 
                                                        y,
                                                        train_size=0.9, 
                                                        test_size=0.1, 
                                                        stratify=y,
                                                        random_state=0)
    return X_train, X_test, y_train, y_test
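
For reference, the split is presumably called on the raw texts and labels along these lines (the dataframe and column names are assumptions based on the later snippets):

X_train, X_test, y_train, y_test = split_data(df_pfizer['text'].values,
                                              df_pfizer['sentiment'].values)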

The resulting training split has 2060 samples: 708 in the positive sentiment class, 837 in the negative sentiment class, and 515 in the neutral sentiment class.

Then, I applied text augmentation, namely EDA (Easy Data Augmentation), to all the training data as follows:

# EDA here is assumed to be the Easy Data Augmentation implementation from the
# `textaugment` package (or an equivalent class exposing the same four methods)
from textaugment import EDA

class TextAugmentation:
    def __init__(self):
        self.augmenter = EDA()

    def replace_synonym(self, text):
        augmented_text_portion = int(len(text)*0.1) 
        synonym_replaced = self.augmenter.synonym_replacement(text, n=augmented_text_portion)
        return synonym_replaced

    def random_insert(self, text):
        augmented_text_portion = int(len(text)*0.1) 
        random_inserted = self.augmenter.random_insertion(text, n=augmented_text_portion)
        return random_inserted

    def random_swap(self, text):
        augmented_text_portion = int(len(text)*0.1)
        random_swaped = self.augmenter.random_swap(text, n=augmented_text_portion)
        return random_swaped

    def random_delete(self, text):
        random_deleted = self.augmenter.random_deletion(text, p=0.5)
        return random_deleted

text_augmentation = TextAugmentation()
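
The exact call that applies the augmenters is not shown, but since the training set grows exactly fivefold (2060 -> 10300), each original tweet is presumably kept together with one copy per EDA operation, roughly like this (an illustrative sketch, not the actual code):

def augment_training_texts(texts, labels):
    # Keep the original plus one variant per EDA operation (5x the data)
    aug_texts, aug_labels = [], []
    for text, label in zip(texts, labels):
        variants = [
            text,
            text_augmentation.replace_synonym(text),
            text_augmentation.random_insert(text),
            text_augmentation.random_swap(text),
            text_augmentation.random_delete(text),
        ]
        aug_texts.extend(variants)
        aug_labels.extend([label] * len(variants))
    return aug_texts, aug_labels

Note that if text is a raw string rather than a token list, int(len(text)*0.1) counts characters, so n ends up much larger than 10% of the words.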

After augmentation, the training set has 10300 samples: 3540 in the positive sentiment class, 4185 in the negative sentiment class, and 2575 in the neutral sentiment class.

Then, I tokenized and padded the sequences and encoded the labels as follows:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

# Tokenize the sequence
pfizer_tokenizer = Tokenizer(oov_token='OOV')
pfizer_tokenizer.fit_on_texts(df_pfizer_train['text'].values)

X_pfizer_train_tokenized = pfizer_tokenizer.texts_to_sequences(df_pfizer_train['text'].values)
X_pfizer_test_tokenized = pfizer_tokenizer.texts_to_sequences(df_pfizer_test['text'].values)

# Pad the sequence
X_pfizer_train_padded = pad_sequences(X_pfizer_train_tokenized, maxlen=100)
X_pfizer_test_padded = pad_sequences(X_pfizer_test_tokenized, maxlen=100)

pfizer_max_length = 100
pfizer_num_words = len(pfizer_tokenizer.word_index) + 1

# Encode label
y_pfizer_train_encoded = df_pfizer_train['sentiment'].factorize()[0]
y_pfizer_test_encoded = df_pfizer_test['sentiment'].factorize()[0]

y_pfizer_train_category = to_categorical(y_pfizer_train_encoded)
y_pfizer_test_category = to_categorical(y_pfizer_test_encoded)

This results in 8869 unique words and a maximum sequence length of 100.
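
The embedding matrix passed to the model below (SINOVAC_EMBEDDING_MATRIX) is not shown; it is presumably built from the pre-trained GoogleNewsVectorNegative300 vectors roughly as follows (a sketch using gensim; the file path and variable names are assumptions):

import numpy as np
from gensim.models import KeyedVectors

# Load the pre-trained 300-dimensional GoogleNews vectors
google_news_kv = KeyedVectors.load_word2vec_format(
    'GoogleNews-vectors-negative300.bin', binary=True)

embedding_dim = 300
num_words = len(pfizer_tokenizer.word_index) + 1
embedding_matrix = np.zeros((num_words, embedding_dim))

for word, index in pfizer_tokenizer.word_index.items():
    if word in google_news_kv:   # words missing from GoogleNews stay all-zero
        embedding_matrix[index] = google_news_kv[word]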

Finally, I fit the model using the pre-trained GoogleNewsVectorNegative300 word embeddings (only their weights are used) together with the LSTM, and I split my training data again, holding out 10% for validation, as follows:

from tensorflow.keras.layers import (Input, Embedding, LSTM, BatchNormalization,
                                     Dense, Dropout)
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

# Build single LSTM model
def build_lstm_model(embedding_matrix, max_sequence_length):
    # Input layer
    input_layer = Input(shape=(max_sequence_length,), dtype='int32')
    
    # Word embedding layer
    embedding_layer = Embedding(input_dim=embedding_matrix.shape[0],
                                output_dim=embedding_matrix.shape[1],
                                weights=[embedding_matrix],
                                input_length=max_sequence_length,
                                trainable=True)(input_layer)
    
    # LSTM model layer
    lstm_layer = LSTM(units=128,
                      dropout=0.5,
                      return_sequences=True)(embedding_layer)
    batch_normalization = BatchNormalization()(lstm_layer)
    
    lstm_layer = LSTM(units=128,
                      dropout=0.5,
                      return_sequences=False)(batch_normalization)
    batch_normalization = BatchNormalization()(lstm_layer)

    # Dense model layer
    dense_layer = Dense(units=128, activation='relu')(batch_normalization)
    dropout_layer = Dropout(rate=0.5)(dense_layer)
    batch_normalization = BatchNormalization()(dropout_layer)
    
    output_layer = Dense(units=3, activation='softmax')(batch_normalization)

    lstm_model = Model(inputs=input_layer, outputs=output_layer)

    return lstm_model

# Building single LSTM model
sinovac_lstm_model = build_lstm_model(SINOVAC_EMBEDDING_MATRIX, SINOVAC_MAX_SEQUENCE)
sinovac_lstm_model.summary()
sinovac_lstm_model.compile(loss='categorical_crossentropy',
                               optimizer=Adam(learning_rate=0.001),
                               metrics=['accuracy'])
sinovac_lstm_history = sinovac_lstm_model.fit(x=X_sinovac_train,
                                                  y=y_sinovac_train,
                                                  batch_size=64,
                                                  epochs=20,
                                                  validation_split=0.1,
                                                  verbose=1)
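
The evaluation code is also not shown; the test accuracy and confusion matrix would typically be produced like this (a sketch; X_sinovac_test and y_sinovac_test are the assumed test-set counterparts of the training arrays, and the label encoding must map classes identically for train and test, which factorize() applied separately to each split does not guarantee):

import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Overall test loss/accuracy
test_loss, test_acc = sinovac_lstm_model.evaluate(X_sinovac_test, y_sinovac_test, verbose=0)
print(f'test loss: {test_loss:.4f}, test accuracy: {test_acc:.4f}')

# Confusion matrix from the argmax of the softmax outputs vs. the one-hot labels
y_pred = np.argmax(sinovac_lstm_model.predict(X_sinovac_test), axis=1)
y_true = np.argmax(y_sinovac_test, axis=1)
print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred))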


The training result:
[image: training history plot not reproduced]

The evaluation result:
[images: evaluation output and confusion matrix not reproduced]

I really need some suggestions or insights on how to get good accuracy on my test set.

Comments (1)

孤独患者 2025-01-30 05:13:07

Without reviewing everything, a few high-order things that may be limiting your results:

  • The GoogleNews vectors were trained on media-outlet news stories from 2012 and earlier. Tweets in 2020+ use a very different style of language. I wouldn't necessarily expect those pretrained vectors, from a different era & domain-of-writing, to be very good at modeling the words you'll need. A well-trained word2vec model (using plenty of modern tweet data, with good preprocessing/tokenization & parameterization choices) has a good chance of working better, so you may want to revisit that choice (a minimal training sketch follows this list).

  • The preprocessing of the GoogleNews training texts, while as far as I can tell never fully documented, did not appear to flatten all casing, nor remove stopwords, nor involve lemmatization. It didn't mutate obvious negations into antonyms, but it did perform a statistical combination of some single words into multigram tokens. So some of your steps are potentially causing your tokens to have less concordance with that set's vectors – even throwing away info, like inflectional variations of words, that could be beneficially retained. Be sure every step you're taking is worth the trouble – and note that a sufficient modern word2vec model, built on Tweets using the same preprocessing for both the word2vec training and the later steps, would match vocabularies perfectly.

  • Both the word2vec model, and any deeper neural network, often need lots of data to train well, and avoid overfitting. Even disregarding the 900 million parameters from GoogleNews, you're trying to train ~130k parameters – at least 520KB of state – from an initial set of merely 2060 tweet-sized texts (maybe 100KB of data). Models that learn generalizable things tend to be compressions of the data, in some sense, and a model that's much larger than the training data brings risk of severe overfitting. (Your mechanistic process for replacing words with synonyms may not be really giving the model any info that the word-vector similarity between synonyms didn't already provide.) So: consider shrinking your model, and getting much more training data - potentially even from other domains than your main classification interest, as long as the use-of-language is similar.
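
For the first point above, a minimal sketch of training a word2vec model on your own tweet corpus with gensim (parameter values are illustrative starting points, not tuned recommendations):

from gensim.models import Word2Vec

# In practice this should be a large corpus of tweets, preprocessed and tokenized
# the same way as the texts later fed to the classifier; a toy corpus is shown
# only so the snippet runs
tokenized_tweets = [
    ['vaccine', 'works', 'great'],
    ['side', 'effects', 'worry', 'me'],
    ['no', 'strong', 'opinion', 'yet'],
]

w2v_model = Word2Vec(
    sentences=tokenized_tweets,
    vector_size=100,   # gensim < 4.0 calls this `size`
    window=5,
    min_count=1,       # raise to 2-5 on a real, larger corpus
    sg=1,              # skip-gram tends to work well on short texts
    epochs=10,         # gensim < 4.0 calls this `iter`
    workers=4,
)

# w2v_model.wv (a KeyedVectors) can then replace the GoogleNews vectors when
# building the Keras embedding matrix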
