Creating a pairwise similarity matrix for a word list using Gensim FastText in Python

Posted 2025-02-04 18:35:54 · 857 characters · 3 views · 0 comments


I have a list of words, and I need to create a pairwise similarity matrix using the Fasttext word embedding. This is what I am currently doing:

import numpy as np
from gensim.models import fasttext as ft
from sklearn.metrics import pairwise_distances

path='cc.en.300.bin'
model=ft.load_facebook_vectors(path, encoding='utf-8')

wordlist = [x for x in df_['word']]  # list of words from dataframe

wordlist_vec = [model[x] for x in df_['word']]  #get word vector
wd_arr = np.array(wordlist_vec).reshape(-1, 1)  # reshape to compute pairwise distance

distances = pairwise_distances(wd_arr, wd_arr, metric=model.similarity)  # pairwise distance matrix

This should yield a pairwise distance matrix using Gensim's cosine similarity function. Unfortunately, I get a memory error:

Unable to allocate 1013. GiB for an array with shape (368700, 368700) and data type float64

I guess it's because it tries to store all the word vectors in memory (we are talking about ~1100 words, tops).

I am not sure how to proceed here. Is there a native Gensim function to create a similarity matrix starting from a list of words? Alternatively, what could be a clever way to get it?


Comments (1)

记忆で 2025-02-11 18:35:54


The error clearly indicates that pairwise_distances() has been given 368,700 items whose distances should be calculated with 368,700 other items.

That would take (368700^2) * 8 bytes ≈ 1013 GiB of RAM to calculate, which your machine likely does not have, hence the error.

If you think it should be only "~1100 words, tops", take a look at your interim values – wordlist, wordlist_vec, & wd_arr – to make sure each is the size/shape/contents you intend.
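The shape blow-up is easy to reproduce in plain NumPy. A sketch (the 1229 × 300 figures are an inference from the error message, since 1229 × 300 = 368,700):

```python
import numpy as np

# Stand-in for ~1229 FastText vectors of dimension 300 (cc.en.300).
vecs = np.zeros((1229, 300))

# reshape(-1, 1) turns every vector component into its own row,
# so 1229 vectors become 368,700 one-element "samples".
flat = vecs.reshape(-1, 1)
print(flat.shape)  # (368700, 1)
```

pairwise_distances() then sees 368,700 items and tries to allocate a 368,700 × 368,700 result.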

(You may run into another issue when you fix that, though: I don't think model.similarity is of the exact type expected by the metric parameter of pairwise_distances().)
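Once wd_arr keeps its (n_words, n_dims) shape, scikit-learn's built-in cosine metric computes the whole matrix in one vectorized call, with no Python callback per pair. A minimal sketch, with random vectors standing in for the model[x] lookups (the 1100-word count is taken from the question):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
wd_arr = rng.standard_normal((1100, 300))  # keep (n_words, n_dims); no reshape

# Full pairwise cosine-similarity matrix in one vectorized call;
# 1100 x 1100 float64 values is only ~9.7 MB.
sim = cosine_similarity(wd_arr)
print(sim.shape)  # (1100, 1100)
```

If you want distances instead of similarities, pairwise_distances(wd_arr, metric='cosine') gives 1 − similarity with the same memory footprint.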
