Keras semantic-similarity model from pre-trained embeddings
I want to implement a Keras model that predicts the similarity between two sentences from their word embeddings, as follows (the full script is included at the end):
- Load word embedding models, e.g., word2vec and fastText.
- Generate the samples (x1 and x2) by computing the average word vector over all words in each sentence. If two or more models are used, take the arithmetic mean of all embeddings, i.e., compute a meta-embedding.
- Concatenate x1 and x2 before feeding them to the network.
- Compile (and evaluate) the Keras model.
The whole script is as follows:
import numpy as np
from gensim.models import Word2Vec
from keras.layers import Dense
from keras.models import Sequential
from sklearn.model_selection import train_test_split


def encoder_vector(v: str, model: Word2Vec) -> np.ndarray:
    wv_dim = model.vector_size
    if v in model.wv:
        return model.wv[v]
    else:
        return np.zeros(wv_dim)


def encoder_words_avg(words: list[str], model: Word2Vec) -> np.ndarray:
    dim = model.vector_size
    words = [word for word in words if word in model.wv]
    if len(words) >= 1:
        return np.mean(model.wv[words], axis=0)
    else:
        return np.zeros(dim)


def load_samples(mappings, w2v_model, fast_model):
    dim = w2v_model.vector_size
    num = len(mappings)
    X1 = np.zeros((num, dim))
    X2 = np.zeros((num, dim))
    y = np.zeros((num, 1))
    for i in range(num):
        mapping = mappings[i].split("|")
        sentence_1, sentence_2 = mapping[1:]
        e = np.zeros((2, dim))
        # Compute meta-embedding by averaging all embeddings.
        e[0, :] = encoder_words_avg(words=sentence_1.split(), model=w2v_model)
        e[1, :] = encoder_words_avg(words=sentence_1.split(), model=fast_model)
        X1[i] = e.mean(axis=0)
        e[0, :] = encoder_words_avg(words=sentence_2.split(), model=w2v_model)
        e[1, :] = encoder_words_avg(words=sentence_2.split(), model=fast_model)
        X2[i] = e.mean(axis=0)
        # Pairs whose id starts with "-" are negative (dissimilar) examples.
        y[i] = 0.0 if mapping[0].startswith("-") else 1.0
    return X1, X2, y


def baseline_model(X_train, X_test, y_train, y_test):
    model = Sequential()
    model.add(
        Dense(
            200,
            input_shape=(X_train.shape[1],),
            activation="relu",
            kernel_initializer="he_uniform",
        )
    )
    model.add(Dense(1, activation="sigmoid"))
    model.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X_train, y_train, batch_size=8, epochs=14)
    # Evaluate the trained model on the train and test data.
    _, train_acc = model.evaluate(X_train, y_train, verbose=0)
    _, test_acc = model.evaluate(X_test, y_test, verbose=0)
    print("Train: %.3f, Test: %.3f\n" % (train_acc, test_acc))
    return model


def main():
    w2v_model = Word2Vec.load("")
    fast_model = Word2Vec.load("")
    mappings = [
        "1|boiled chicken egg|hen egg whole boiled",
        "2|tomato|tomato substance",
        "3|sweet potatoes|potato chip",
        "-1|watering plants|cornsalad plant",
        "-2|butter|butane",
        "-3|olive plant|black olives",
    ]
    X1, X2, y = load_samples(mappings, w2v_model=w2v_model, fast_model=fast_model)
    # Concatenate both arrays into one before feeding to the network.
    X = np.concatenate([X1, X2], axis=1)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    model = baseline_model(X_train, X_test, y_train, y_test)
    model.summary()


if __name__ == "__main__":
    main()
The script above seems to work, but the prediction results are poor even when only Word2Vec is used (which makes me think there may be a problem with the Keras model...). Any ideas on how to improve the results? Am I doing something wrong?
Thank you.
1 Answer
It's unclear what you're intending to predict.
Do you want your Keras NN to report the same value as the precise cosine-similarity calculation, between the two text summary vectors, would report? If so, why not just... do the calculation? It's not something I'd necessarily expect a neural architecture to approximate better.
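For reference, that direct calculation needs no training at all. A minimal sketch, reusing the question's encoder_words_avg helper and an already-loaded w2v_model (both assumed from the script above):

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Plain cosine similarity; returns 0.0 if either sentence vector is
    # all zeros (e.g. every word was out of vocabulary).
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0

# Hypothetical usage:
# v1 = encoder_words_avg("boiled chicken egg".split(), model=w2v_model)
# v2 = encoder_words_avg("hen egg whole boiled".split(), model=w2v_model)
# print(cosine_similarity(v1, v2))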
Alternatively, if your tiny 6-pair dataset is the target:
Your existing 'gold standard' answers don't seem obviously correct to me. Superficially, 'olive plant' & 'black olives' seem nearly as 'similar' as 'tomato' & 'tomato substance'. Similarly, 'watering plants' & 'cornsalad plant' about-as-similar as 'sweet potatoes' & 'potato chip'.
A mere 6 examples (maybe 5 after train/test split?) is both inadequate to usefully train a larger neural classifier, and to the extent the classifier might be easily trained (indeed 'overfit') to the 5 training examples, it won't necessarily have learned anything generalizable to the one hold-out example (which is using vectors quite far from the training texts). (With such a paucity of training data, & testing using inputs that might be arbitrarily different than the training data, "very poor" performance would be expected. Neural nets require lots of varied training examples!)
Finally, the strategy of creating combined-embeddings-by-averaging, as investigated by your linked paper, is another atypical practice that seems fishy to me. Even if it could offer some benefits, there's no reason to mix that atypical, somewhat non-intuitive extra practice into your experiment before even having things work with a more typical and simple baseline approach, for comparison, to be sure the extra 'meta'/averaging is worth the complication.
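To make that comparison concrete, the simpler baseline would drop the meta-averaging entirely and build the features from a single model. A rough sketch of a single-model variant of the question's load_samples (names taken from the script above):

def load_samples_single(mappings, model):
    # Same "id|sentence1|sentence2" format as in the question, but with
    # only one embedding model and no meta-embedding averaging.
    dim = model.vector_size
    num = len(mappings)
    X1 = np.zeros((num, dim))
    X2 = np.zeros((num, dim))
    y = np.zeros((num, 1))
    for i, line in enumerate(mappings):
        label, sentence_1, sentence_2 = line.split("|")
        X1[i] = encoder_words_avg(words=sentence_1.split(), model=model)
        X2[i] = encoder_words_avg(words=sentence_2.split(), model=model)
        y[i] = 0.0 if label.startswith("-") else 1.0
    return X1, X2, y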
The paper itself doesn't really show any advantage over concatenation, which has a stronger theoretical basis (preserving each model's full independent spaces) than averaging, except by a tiny amount in 1-of-6 tests. Further, the average of GloVe & CBOW performs the same or worse than GloVe alone on 3 of their 6 evaluations, and only minimally better on the other 3. That implies to me the outperformance might be mainly random jitter introduced by the extra steps, and the averaging is, at best, a cheap option to consider as a tiny boost, not a generally-better approach.
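If the two models are combined at all, concatenation keeps each model's space intact rather than mixing them. A minimal sketch of a per-sentence concatenated encoding, again reusing the question's encoder_words_avg (the resulting dimensionality is simply the sum of the two models' sizes):

def encoder_words_concat(words, w2v_model, fast_model):
    # Concatenate the two sentence vectors instead of averaging them,
    # giving a (w2v_dim + fast_dim)-dimensional feature per sentence.
    v_w2v = encoder_words_avg(words, model=w2v_model)
    v_fast = encoder_words_avg(words, model=fast_model)
    return np.concatenate([v_w2v, v_fast])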
The paper also leaves many natural related questions unaddressed.
There's other work suggesting post-training transformations of word-vector spaces may improve performance on downstream tasks – see for example 'All But The Top' – so which steps, exactly, get which advantages is important to distinguish.
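For context, the 'All But The Top' post-processing works on a whole embedding matrix: remove the common mean vector, then subtract each vector's projection onto the top few principal components (the number of removed components is a hyperparameter, roughly dimension / 100 in the paper). A rough numpy sketch of that idea, not tied to the question's script:

def all_but_the_top(embeddings: np.ndarray, n_components: int) -> np.ndarray:
    # embeddings: (num_words, dim) matrix of word vectors.
    # 1. Remove the common mean vector.
    centered = embeddings - embeddings.mean(axis=0)
    # 2. Find the top principal directions of the centered vectors.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:n_components]  # shape: (n_components, dim)
    # 3. Subtract each vector's projection onto those directions.
    return centered - centered @ top.T @ top

Whether such a transformation helps here would still have to be measured against the plain averaged-vector baseline.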