Efficient way to compute the similarity of multiple documents with spaCy



I have around 10k docs (mostly 1-2 sentences) and want to find, for each of them, the ten most similar docs in a collection of 60k docs. For this I want to use the spaCy library. Due to the large number of docs this needs to be efficient, so my first idea was to compute the document vector (https://spacy.io/api/doc#vector) for each of the 60k docs as well as for each of the 10k docs and save them in two matrices. These two matrices can be multiplied to get the dot product, which can be interpreted as the similarity.
Now, I have basically two questions:

  1. Is this actually the most efficient way, or is there a clever trick that can speed up this process?
  2. If there is no other clever way, I was wondering whether there is at least a clever way to speed up computing the matrices of document vectors. Currently I am using a for loop, which obviously is not exactly fast:
import spacy
import numpy as np

nlp = spacy.load('en_core_web_lg')
doc_matrix = np.zeros((len(train_list), 300))
for i in range(len(train_list)):
    doc = nlp(train_list[i])  # train_list contains the individual documents
    doc_matrix[i] = doc.vector

Is there, for example, a way to parallelize this?


Answers (2)

伪心 2025-01-23 19:48:41

Don't do a big matrix operation; instead, put your document vectors in an approximate nearest neighbors store (Annoy is easy to use) and query the nearest items for each vector.

Doing a big matrix operation will do n * n comparisons, but using approximate nearest neighbors techniques will partition the space to perform many fewer calculations. That's much more important for the overall runtime than anything you do with spaCy.
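
A minimal sketch of that idea using Annoy; the matrix names, the random stand-in data, and the tree count are illustrative and not part of the original answer:

import numpy as np
from annoy import AnnoyIndex

dim = 300  # en_core_web_lg document vectors are 300-dimensional
index = AnnoyIndex(dim, 'angular')  # angular distance corresponds to cosine similarity

# collection_matrix / query_matrix stand in for the (60k, 300) and (10k, 300)
# vector matrices described in the question; random data is used here only
# so that the sketch runs on its own.
collection_matrix = np.random.rand(60000, dim).astype('float32')
query_matrix = np.random.rand(10000, dim).astype('float32')

for i, vec in enumerate(collection_matrix):
    index.add_item(i, vec)
index.build(10)  # 10 trees; more trees give better recall but a slower build

# For each query vector, the indices of the 10 nearest collection docs.
top_10 = [index.get_nns_by_vector(vec, 10) for vec in query_matrix]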

That said, also check the spaCy speed FAQ.

失退 2025-01-23 19:48:41

I personally have never worked with sentence similarity/vectors in spaCy directly, so I can't tell you for sure about your first question; there might be some clever way to do this that is more native to spaCy or the usual way to do it.

For generally speeding up the spaCy processing:

  1. Disable components you don't need, such as named entity recognition, part-of-speech tagging, etc.
  2. Use processed_docs = nlp.pipe(train_list) instead of calling nlp inside the loop, then access the results with for doc in processed_docs: (or doc = next(processed_docs)) inside the loop. You can tune the pipe() parameters to speed it up even more, depending on your hardware; see the documentation. A sketch combining both points follows this list.
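
A minimal sketch of both suggestions combined, assuming the en_core_web_lg model from the question; the disabled component names, the batch_size and n_process values, and the stand-in data are illustrative:

import spacy
import numpy as np

# Load the model without the components that are not needed for doc.vector.
nlp = spacy.load('en_core_web_lg', disable=['ner', 'tagger', 'parser'])

train_list = ["A short example sentence.", "Another tiny document."]  # stand-in data

doc_matrix = np.zeros((len(train_list), 300))
# nlp.pipe batches the texts; n_process > 1 spreads the work over CPU cores.
for i, doc in enumerate(nlp.pipe(train_list, batch_size=1000, n_process=2)):
    doc_matrix[i] = doc.vector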

For your actual "find the n most similar" problem:

This problem is not NLP- or spaCy-specific but a general one. There are a lot of resources online on how to optimize this for numpy vectors; you are basically looking for the n nearest datapoints within a large dataset (10000) of high-dimensional (300) data. Check out this thread for some general ideas, or this thread for how to perform this kind of search (in this case a k-nearest-neighbours search) on numpy data.
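
As a rough illustration of that kind of brute-force nearest-neighbour search in numpy (the matrix names and the random stand-in data are made up for this sketch), the rows can be L2-normalised so that the dot product equals cosine similarity, and np.argpartition then picks the top 10 without a full sort:

import numpy as np

# Stand-ins for the query and collection document-vector matrices.
query_matrix = np.random.rand(1000, 300)
collection_matrix = np.random.rand(6000, 300)

# Normalise the rows so that the dot product equals cosine similarity.
q = query_matrix / np.linalg.norm(query_matrix, axis=1, keepdims=True)
c = collection_matrix / np.linalg.norm(collection_matrix, axis=1, keepdims=True)

sims = q @ c.T  # (n_queries, n_collection) similarity matrix

# Indices of the 10 most similar collection docs per query (unsorted within each row's top 10).
top_10 = np.argpartition(sims, -10, axis=1)[:, -10:]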

Generally, you should also not forget that in a large dataset (unless it has been filtered) there are going to be documents/sentences that are duplicates or near-duplicates (differing only by a comma or so), so you might want to apply some filtering before performing the search.
