Why does the Word2Vec most_similar() function return many 0.99 similarity values?
I'm trying to apply a word2vec model on a review dataset. First of all, I apply preprocessing to the dataset:
df = df.text.apply(gensim.utils.simple_preprocess)
and this is the dataset that I get:
0 [understand, location, low, score, look, mcdon...
3 [listen, it, morning, tired, maybe, hangry, ma...
6 [super, cool, bathroom, door, open, foot, nugg...
19 [cant, find, better, mcdonalds, know, getting,...
27 [night, went, mcdonalds, best, mcdonalds, expe...
...
1677 [mcdonalds, app, order, arrived, line, drive, ...
1693 [correct, order, filled, promptly, expecting, ...
1694 [wow, fantastic, eatery, high, quality, ive, e...
1704 [let, tell, eat, lot, mcchickens, best, ive, m...
1716 [entertaining, staff, ive, come, mcdees, servi...
Name: text, Length: 283, dtype: object
Now I create the Word2Vec model and train it:
model = gensim.models.Word2Vec(sentences=df, vector_size=200, window=10, min_count=1, workers=6)
model.train(df, total_examples=model.corpus_count, epochs=model.epochs)
print(model.wv.most_similar("service",topn=10))
What I don't understand is why the function most_similar() returns so many similarities around 0.99:
[('like', 0.9999310970306396), ('mcdonalds', 0.9999251961708069), ('food', 0.9999234080314636), ('order', 0.999918520450592), ('fries', 0.9999175667762756), ('got', 0.999911367893219), ('window', 0.9999082088470459), ('way', 0.9999075531959534), ('it', 0.9999069571495056), ('meal', 0.9999067783355713)]
What am I doing wrong?
Comments (2)
You're right that's not normal.
It is unlikely that your df is the proper format Word2Vec expects. It needs a re-iterable Python sequence, where each item is a list of string tokens.

Try displaying next(iter(df)) to see the 1st item in df as Word2Vec would see it when iterating. Does it look like a good piece of training data?
Separately, regarding your code:

- min_count=1 is always a bad idea with Word2Vec: rare words can't get good vectors, but in aggregate they act a lot like random noise that makes nearby words harder to train. Generally, the default min_count=5 shouldn't be lowered unless you're sure that will help your results, which you can check by comparing the default's effects against lower values. And if too much of your vocabulary seems to disappear because words don't appear even a measly 5 times, you likely have too little data for this data-hungry algorithm. (In that case, you may want to shrink the vector_size and/or increase the epochs to get the most out of the minimal data.)

- If you supply a corpus as sentences in the Word2Vec() construction, you don't need to call .train(): the constructor will already have used that corpus fully. (You only need to call the independent, internal .build_vocab() & .train() steps if you didn't supply a corpus at construction time.)

I highly recommend you enable logging to at least the INFO level for the relevant classes (either all of Gensim, or just Word2Vec). Then you'll see useful logging/progress info which, if you read it over, will tend to reveal problems like the redundant second training here. (That redundant training isn't the cause of your main problem, though.)
According to the official doc:
Since you pass this df as your sentence base in the sentences parameter, gensim just calculates the analogies and distances of words across the different sentences (dataframe rows). I'm not sure whether your dataframe contains "service"; if it does, the result words are simply the words whose values are closest to "service" in their sentences.
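As a quick sanity check on that point (a sketch assuming the Gensim 4.x model trained above), you can confirm whether "service" actually made it into the trained vocabulary:

# Verify "service" survived preprocessing and the min_count cutoff.
print("service" in model.wv.key_to_index)
if "service" in model.wv.key_to_index:
    print(model.wv.get_vecattr("service", "count"))  # its corpus frequency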