Why does the Word2Vec function return many 0.99 values?

I'm trying to apply a word2vec model to a review dataset. First, I apply preprocessing to the dataset:

df=df.text.apply(gensim.utils.simple_preprocess)

and this is the dataset that I get:

0       [understand, location, low, score, look, mcdon...
3       [listen, it, morning, tired, maybe, hangry, ma...
6       [super, cool, bathroom, door, open, foot, nugg...
19      [cant, find, better, mcdonalds, know, getting,...
27      [night, went, mcdonalds, best, mcdonalds, expe...
                              ...
1677    [mcdonalds, app, order, arrived, line, drive, ...
1693    [correct, order, filled, promptly, expecting, ...
1694    [wow, fantastic, eatery, high, quality, ive, e...
1704    [let, tell, eat, lot, mcchickens, best, ive, m...
1716    [entertaining, staff, ive, come, mcdees, servi...
Name: text, Length: 283, dtype: object

Now I create the Word2Vec model and train it:

model = gensim.models.Word2Vec(sentences=df, vector_size=200, window=10, min_count=1, workers=6)
model.train(df,total_examples=model.corpus_count,epochs=model.epochs)
print(model.wv.most_similar("service",topn=10))

What I don't understand is why the function most_similar() returns so many similarities around 0.99.

[('like', 0.9999310970306396), ('mcdonalds', 0.9999251961708069), ('food', 0.9999234080314636), ('order', 0.999918520450592), ('fries', 0.9999175667762756), ('got', 0.999911367893219), ('window', 0.9999082088470459), ('way', 0.9999075531959534), ('it', 0.9999069571495056), ('meal', 0.9999067783355713)]

What am I doing wrong?

Comments (2)

她说她爱他 2025-01-17 12:53:06

You're right, that's not normal.

It is unlikely that your df is the proper format Word2Vec expects. It needs a re-iterable Python sequence, where each item is a list of string tokens.

Try displaying next(iter(df)) to see the first item in df as it would be iterated over by Word2Vec. Does it look like a good piece of training data?
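
For instance, a quick format check might look like this (a minimal sketch; the expected shape is a list of string tokens per item):

first_item = next(iter(df))
print(type(first_item))  # should be a list
print(first_item[:10])   # should show string tokens, e.g. ['understand', 'location', ...]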

Separately, regarding your code:

  • min_count=1 is always a bad idea with Word2Vec - rare words can't get good vectors, but do, in aggregate, act a lot like random noise that makes nearby words harder to train. Generally, the default min_count=5 shouldn't be lowered unless you're sure doing so helps your results, because you've compared its effects against lower values. And if too much of your vocabulary seems to disappear because words don't appear even a measly 5 times, you likely have too little data for this data-hungry algorithm.
  • Only 283 texts are unlikely to be enough training data unless each text has tens of thousands of tokens. (And even if it were possible to squeeze some results from this far-smaller-than-ideal corpus, you might need to shrink the vector_size and/or increase the epochs to get the most out of the minimal data.)
  • If you supply a corpus to sentences in the Word2Vec() construction, you don't need to call .train(). It will have already automatically used that corpus fully as part of the constructor. (You only need to call the independent, internal .build_vocab() & .train() steps if you didn't supply a corpus at construction time.) See the sketch after this list.
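
Putting those points together, a corrected flow might look like this (a minimal sketch; vector_size, epochs, and the other values are illustrative guesses, not tuned recommendations):

import gensim

corpus = df.tolist()  # a re-iterable list of token lists, as Word2Vec expects
model = gensim.models.Word2Vec(
    sentences=corpus,
    vector_size=100,  # smaller than 200, since the corpus is tiny
    window=10,
    min_count=5,      # keep the default, dropping rare words
    epochs=20,        # extra passes to get more out of minimal data
    workers=6,
)
# No separate model.train() call: passing sentences= already trained the model.
print(model.wv.most_similar("service", topn=10))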

I highly recommend you enable logging to at least the INFO level for the relevant classes (either all Gensim or just Word2Vec). Then you'll see useful logging/progress info which, if you read it over, will tend to reveal problems like the redundant second training here. (That redundant training isn't the cause of your main problem, though.)
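
One common way to turn that on, following the pattern used throughout gensim's documentation:

import logging

# Emit gensim's INFO-level progress messages to the console.
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)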

怪异←思 2025-01-17 12:53:06

According to the official doc:

Find the top-N most similar words. ... 
This method computes cosine similarity between a simple mean of the projection weight 
vectors of the given words and the vectors for each word in the model. The method 
corresponds to the word-analogy and distance scripts in the original word2vec 
implementation. ...

Since you passed this df as your sentence base in the sentences parameter, gensim just computes the analogies and distances of words across the different sentences (the dataframe rows). I'm not sure whether your dataframe contains "service"; if it does, the returned words are simply those whose vectors are closest to the vector for "service" in those sentences.
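
A quick way to check (a minimal sketch, assuming gensim 4.x, where the vocabulary lives in model.wv.key_to_index):

# Was "service" frequent enough to survive min_count and enter the vocabulary?
print("service" in model.wv.key_to_index)

# If so, inspect a pairwise cosine similarity directly ("food" appears in the
# question's output, so it should also be in the vocabulary).
if "service" in model.wv.key_to_index:
    print(model.wv.similarity("service", "food"))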
