Why does the Word2Vec most_similar() function return many 0.99 similarity values?
I'm trying to apply a word2vec model on a review dataset. First of all, I apply preprocessing to the dataset:
df = df.text.apply(gensim.utils.simple_preprocess)
and this is the dataset that I get:
0 [understand, location, low, score, look, mcdon...
3 [listen, it, morning, tired, maybe, hangry, ma...
6 [super, cool, bathroom, door, open, foot, nugg...
19 [cant, find, better, mcdonalds, know, getting,...
27 [night, went, mcdonalds, best, mcdonalds, expe...
...
1677 [mcdonalds, app, order, arrived, line, drive, ...
1693 [correct, order, filled, promptly, expecting, ...
1694 [wow, fantastic, eatery, high, quality, ive, e...
1704 [let, tell, eat, lot, mcchickens, best, ive, m...
1716 [entertaining, staff, ive, come, mcdees, servi...
Name: text, Length: 283, dtype: object
Now I create the Word2Vec model and train it:
model = gensim.models.Word2Vec(sentences=df, vector_size=200, window=10, min_count=1, workers=6)
model.train(df, total_examples=model.corpus_count, epochs=model.epochs)
print(model.wv.most_similar("service",topn=10))
What I don't understand is why the function most_similar() returns so many similarities around 0.99:
[('like', 0.9999310970306396), ('mcdonalds', 0.9999251961708069), ('food', 0.9999234080314636), ('order', 0.999918520450592), ('fries', 0.9999175667762756), ('got', 0.999911367893219), ('window', 0.9999082088470459), ('way', 0.9999075531959534), ('it', 0.9999069571495056), ('meal', 0.9999067783355713)]
What am I doing wrong?
Comments (2)
You're right that's not normal.
It is unlikely that your df is the proper format Word2Vec expects. It needs a re-iterable Python sequence, where each item is a list of string tokens.

Try displaying next(iter(df)) to see the 1st item in df as Word2Vec would see it when iterating. Does it look like a good piece of training data?
Separately, regarding your code:

- min_count=1 is always a bad idea with Word2Vec: rare words can't get good vectors, but in aggregate they act a lot like random noise that makes nearby words harder to train. Generally, the default min_count=5 shouldn't be lowered unless you're sure that will help your results, which you can check by comparing the default's effects against lower values. And if too much of your vocabulary seems to disappear because words don't appear even a measly 5 times, you likely have too little data for this data-hungry algorithm. (In that case, you may want to shrink the vector_size and/or increase the epochs to get the most out of the minimal data.)

- If you supply a corpus as sentences in the Word2Vec() construction, you don't need to call .train(): the constructor will already have used that corpus fully. (You only need to call the independent, internal .build_vocab() & .train() steps if you didn't supply a corpus at construction time.)

I highly recommend you enable logging to at least the INFO level for the relevant classes (either all of Gensim, or just Word2Vec). Then you'll see useful logging/progress info which, if you read it over, will tend to reveal problems like the redundant second training here. (That redundant training isn't the cause of your main problem, though.)
According to the official doc:
Since you pass this df as your sentence base in the sentences parameter, gensim just calculates the analogies and distances of words across the different sentences (dataframe rows). I'm not sure whether your dataframe contains "service"; if it does, the result words are simply the words whose values are closest to "service" in their sentences.
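As a quick sanity check on that point (a sketch assuming the Gensim 4.x model trained above), you can confirm whether "service" actually made it into the trained vocabulary:

# Verify "service" survived preprocessing and the min_count cutoff.
print("service" in model.wv.key_to_index)
if "service" in model.wv.key_to_index:
    print(model.wv.get_vecattr("service", "count"))  # its corpus frequency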