Efficient way to compute the similarity of multiple documents with spaCy



I have around 10k docs (mostly 1-2 sentences) and want to find, for each of them, the ten most similar docs in a collection of 60k docs. For this I want to use the spaCy library. Due to the large number of docs this needs to be efficient, so my first idea was to compute the document vector (https://spacy.io/api/doc#vector) for each of the 60k docs as well as for each of the 10k docs and save them in two matrices. These two matrices can be multiplied to get the dot product, which can be interpreted as the similarity.
Now, I have basically two questions:

  1. Is this actually the most efficient way, or is there a clever trick that can speed up this process?
  2. If there is no other clever way, I was wondering whether there is at least a clever way to speed up computing the matrices of document vectors. Currently I am using a for loop, which obviously is not exactly fast:
import spacy
import numpy as np

nlp = spacy.load('en_core_web_lg')
doc_matrix = np.zeros((len(train_list), 300))
for i in range(len(train_list)):
    doc = nlp(train_list[i])  # train_list contains the individual documents
    doc_matrix[i] = doc.vector

Is there, for example, a way to parallelize this?


Answers (2)

伪心 2025-01-23 19:48:41

Don't do a big matrix operation; instead, put your document vectors in an approximate nearest neighbors store (Annoy is easy to use) and query the nearest items for each vector.

Doing a big matrix operation will do n * n comparisons, but using approximate nearest neighbors techniques will partition the space to perform many fewer calculations. That's much more important for the overall runtime than anything you do with spaCy.
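
A minimal sketch of that idea using Annoy; the matrix names, the random stand-in data, and the tree count are illustrative and not part of the original answer:

import numpy as np
from annoy import AnnoyIndex

dim = 300  # en_core_web_lg document vectors are 300-dimensional
index = AnnoyIndex(dim, 'angular')  # angular distance corresponds to cosine similarity

# collection_matrix / query_matrix stand in for the (60k, 300) and (10k, 300)
# vector matrices described in the question; random data is used here only
# so that the sketch runs on its own.
collection_matrix = np.random.rand(60000, dim).astype('float32')
query_matrix = np.random.rand(10000, dim).astype('float32')

for i, vec in enumerate(collection_matrix):
    index.add_item(i, vec)
index.build(10)  # 10 trees; more trees give better recall but a slower build

# For each query vector, the indices of the 10 nearest collection docs.
top_10 = [index.get_nns_by_vector(vec, 10) for vec in query_matrix]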

That said, also check the spaCy speed FAQ.

失退 2025-01-23 19:48:41

I personally have never worked with sentence similarity/vectors in spaCy directly, so I can't tell you for sure about your first question; there might be some clever way to do this that is more native to spaCy or the usual way to do it.

For generally speeding up the spaCy processing:

  1. Disable components you don't need, such as named entity recognition, part-of-speech tagging, etc.
  2. Use processed_docs = nlp.pipe(train_list) instead of calling nlp inside the loop, then access the results with for doc in processed_docs: (or doc = next(processed_docs)) inside the loop. You can tune the pipe() parameters to speed it up even more, depending on your hardware; see the documentation. A sketch combining both points follows this list.
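
A minimal sketch of both suggestions combined, assuming the en_core_web_lg model from the question; the disabled component names, the batch_size and n_process values, and the stand-in data are illustrative:

import spacy
import numpy as np

# Load the model without the components that are not needed for doc.vector.
nlp = spacy.load('en_core_web_lg', disable=['ner', 'tagger', 'parser'])

train_list = ["A short example sentence.", "Another tiny document."]  # stand-in data

doc_matrix = np.zeros((len(train_list), 300))
# nlp.pipe batches the texts; n_process > 1 spreads the work over CPU cores.
for i, doc in enumerate(nlp.pipe(train_list, batch_size=1000, n_process=2)):
    doc_matrix[i] = doc.vector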

For your actual "find the n most similar" problem:

This problem is not NLP- or spaCy-specific but a general one. There are a lot of resources online on how to optimize this for numpy vectors; you are basically looking for the n nearest datapoints within a large dataset (10000) of high-dimensional (300) data. Check out this thread for some general ideas, or this thread for how to perform this kind of search (in this case a k-nearest-neighbours search) on numpy data.
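
As a rough illustration of that kind of brute-force nearest-neighbour search in numpy (the matrix names and the random stand-in data are made up for this sketch), the rows can be L2-normalised so that the dot product equals cosine similarity, and np.argpartition then picks the top 10 without a full sort:

import numpy as np

# Stand-ins for the query and collection document-vector matrices.
query_matrix = np.random.rand(1000, 300)
collection_matrix = np.random.rand(6000, 300)

# Normalise the rows so that the dot product equals cosine similarity.
q = query_matrix / np.linalg.norm(query_matrix, axis=1, keepdims=True)
c = collection_matrix / np.linalg.norm(collection_matrix, axis=1, keepdims=True)

sims = q @ c.T  # (n_queries, n_collection) similarity matrix

# Indices of the 10 most similar collection docs per query (unsorted within each row's top 10).
top_10 = np.argpartition(sims, -10, axis=1)[:, -10:]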

Generally, you should also not forget that in a large dataset (unless it has been filtered) there are going to be documents/sentences that are duplicates or near-duplicates (differing only by a comma or so), so you might want to apply some filtering before performing the search.
