Keras semantic-similarity model from pre-trained embeddings
I want to implement a Keras model that predicts the similarity between two sentences from their word embeddings, as follows (the full script is included at the end):
- Load word embedding models, e.g., word2vec and fastText.
- Generate the samples (x1 and x2) by computing the average word vector over all words in each sentence. If two or more models are used, take the arithmetic mean of all embeddings, i.e., compute a meta-embedding.
- Concatenate x1 and x2 before feeding them to the network.
- Compile (and evaluate) the Keras model.
The whole script is as follows:
import numpy as np
from gensim.models import Word2Vec
from keras.layers import Dense
from keras.models import Sequential
from sklearn.model_selection import train_test_split


def encoder_vector(v: str, model: Word2Vec) -> np.ndarray:
    wv_dim = model.vector_size
    if v in model.wv:
        return model.wv[v]
    else:
        return np.zeros(wv_dim)


def encoder_words_avg(words: list[str], model: Word2Vec) -> np.ndarray:
    dim = model.vector_size
    words = [word for word in words if word in model.wv]
    if len(words) >= 1:
        return np.mean(model.wv[words], axis=0)
    else:
        return np.zeros(dim)


def load_samples(mappings, w2v_model, fast_model):
    dim = w2v_model.vector_size
    num = len(mappings)
    X1 = np.zeros((num, dim))
    X2 = np.zeros((num, dim))
    y = np.zeros((num, 1))
    for i in range(num):
        mapping = mappings[i].split("|")
        sentence_1, sentence_2 = mapping[1:]
        e = np.zeros((2, dim))
        # Compute meta-embedding by averaging all embeddings.
        e[0, :] = encoder_words_avg(words=sentence_1.split(), model=w2v_model)
        e[1, :] = encoder_words_avg(words=sentence_1.split(), model=fast_model)
        X1[i] = e.mean(axis=0)
        e[0, :] = encoder_words_avg(words=sentence_2.split(), model=w2v_model)
        e[1, :] = encoder_words_avg(words=sentence_2.split(), model=fast_model)
        X2[i] = e.mean(axis=0)
        # Pairs whose id starts with "-" are negative (dissimilar) examples.
        y[i] = 0.0 if mapping[0].startswith("-") else 1.0
    return X1, X2, y


def baseline_model(X_train, X_test, y_train, y_test):
    model = Sequential()
    model.add(
        Dense(
            200,
            input_shape=(X_train.shape[1],),
            activation="relu",
            kernel_initializer="he_uniform",
        )
    )
    model.add(Dense(1, activation="sigmoid"))
    model.compile(optimizer="sgd", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X_train, y_train, batch_size=8, epochs=14)
    # Evaluate the trained model on the train and test data.
    _, train_acc = model.evaluate(X_train, y_train, verbose=0)
    _, test_acc = model.evaluate(X_test, y_test, verbose=0)
    print("Train: %.3f, Test: %.3f\n" % (train_acc, test_acc))
    return model


def main():
    w2v_model = Word2Vec.load("")
    fast_model = Word2Vec.load("")
    mappings = [
        "1|boiled chicken egg|hen egg whole boiled",
        "2|tomato|tomato substance",
        "3|sweet potatoes|potato chip",
        "-1|watering plants|cornsalad plant",
        "-2|butter|butane",
        "-3|olive plant|black olives",
    ]
    X1, X2, y = load_samples(mappings, w2v_model=w2v_model, fast_model=fast_model)
    # Concatenate both arrays into one before feeding to the network.
    X = np.concatenate([X1, X2], axis=1)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    model = baseline_model(X_train, X_test, y_train, y_test)
    model.summary()


if __name__ == "__main__":
    main()
The script above seems to work, but the prediction results are poor even when only Word2Vec is used (which makes me think there may be a problem with the Keras model...). Any ideas on how to improve the results? Am I doing something wrong?
Thank you.
1 Answer
It's unclear what you're intending to predict.
Do you want your Keras NN to report the same value as the precise cosine-similarity calculation, between the two text summary vectors, would report? If so, why not just... do the calculation? It's not something I'd necessarily expect a neural architecture to approximate better.
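For reference, that direct calculation needs no training at all. A minimal sketch, reusing the question's encoder_words_avg helper and an already-loaded w2v_model (both assumed from the script above):

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Plain cosine similarity; returns 0.0 if either sentence vector is
    # all zeros (e.g. every word was out of vocabulary).
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0

# Hypothetical usage:
# v1 = encoder_words_avg("boiled chicken egg".split(), model=w2v_model)
# v2 = encoder_words_avg("hen egg whole boiled".split(), model=w2v_model)
# print(cosine_similarity(v1, v2))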
Alternatively, if your tiny 6-pair dataset is the target:
Your existing 'gold standard' answers don't seem obviously correct to me. Superficially, 'olive plant' & 'black olives' seem nearly as 'similar' as 'tomato' & 'tomato substance'. Similarly, 'watering plants' & 'cornsalad plant' about-as-similar as 'sweet potatoes' & 'potato chip'.
A mere 6 examples (maybe 5 after train/test split?) is both inadequate to usefully train a larger neural classifier, and to the extent the classifier might be easily trained (indeed 'overfit') to the 5 training examples, it won't necessarily have learned anything generalizable to the one hold-out example (which is using vectors quite far from the training texts). (With such a paucity of training data, & testing using inputs that might be arbitrarily different than the training data, "very poor" performance would be expected. Neural nets require lots of varied training examples!)
Finally, the strategy of creating combined-embeddings-by-averaging, as investigated by your linked paper, is another atypical practice that seems fishy to me. Even if it could offer some benefits, there's no reason to mix that atypical, somewhat non-intuitive extra practice into your experiment before even having things work with a more typical and simple baseline approach, for comparison, to be sure the extra 'meta'/averaging is worth the complication.
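To make that comparison concrete, the simpler baseline would drop the meta-averaging entirely and build the features from a single model. A rough sketch of a single-model variant of the question's load_samples (names taken from the script above):

def load_samples_single(mappings, model):
    # Same "id|sentence1|sentence2" format as in the question, but with
    # only one embedding model and no meta-embedding averaging.
    dim = model.vector_size
    num = len(mappings)
    X1 = np.zeros((num, dim))
    X2 = np.zeros((num, dim))
    y = np.zeros((num, 1))
    for i, line in enumerate(mappings):
        label, sentence_1, sentence_2 = line.split("|")
        X1[i] = encoder_words_avg(words=sentence_1.split(), model=model)
        X2[i] = encoder_words_avg(words=sentence_2.split(), model=model)
        y[i] = 0.0 if label.startswith("-") else 1.0
    return X1, X2, y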
The paper itself doesn't really show any advantage over concatenation, which has a stronger theoretical basis (preserving each model's full independent spaces) than averaging, except by a tiny amount in 1-of-6 tests. Further, the average of GloVe & CBOW performs the same or worse than GloVe alone on 3 of their 6 evaluations, and only minimally better on the other 3. That implies to me the outperformance might be mainly random jitter introduced by the extra steps, and the averaging is, at best, a cheap option to consider as a tiny boost, not a generally-better approach.
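If the two models are combined at all, concatenation keeps each model's space intact rather than mixing them. A minimal sketch of a per-sentence concatenated encoding, again reusing the question's encoder_words_avg (the resulting dimensionality is simply the sum of the two models' sizes):

def encoder_words_concat(words, w2v_model, fast_model):
    # Concatenate the two sentence vectors instead of averaging them,
    # giving a (w2v_dim + fast_dim)-dimensional feature per sentence.
    v_w2v = encoder_words_avg(words, model=w2v_model)
    v_fast = encoder_words_avg(words, model=fast_model)
    return np.concatenate([v_w2v, v_fast])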
The paper also leaves many natural related questions unaddressed.
There's other work suggesting post-training transformations of word-vector spaces may improve performance on downstream tasks – see for example 'All But The Top' – so which steps, exactly, get which advantages is important to distinguish.
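For context, the 'All But The Top' post-processing works on a whole embedding matrix: remove the common mean vector, then subtract each vector's projection onto the top few principal components (the number of removed components is a hyperparameter, roughly dimension / 100 in the paper). A rough numpy sketch of that idea, not tied to the question's script:

def all_but_the_top(embeddings: np.ndarray, n_components: int) -> np.ndarray:
    # embeddings: (num_words, dim) matrix of word vectors.
    # 1. Remove the common mean vector.
    centered = embeddings - embeddings.mean(axis=0)
    # 2. Find the top principal directions of the centered vectors.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    top = vt[:n_components]  # shape: (n_components, dim)
    # 3. Subtract each vector's projection onto those directions.
    return centered - centered @ top.T @ top

Whether such a transformation helps here would still have to be measured against the plain averaged-vector baseline.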