Efficiently calculating cosine similarity

Posted on 2025-01-19 11:52:28

I have a bank of about 100k strings and when I get a new string, I want to match it to the most similar string.

My thoughts were to use tf-idf (makes sense as keywords are quite important), then match using the cosine distance. Is there an efficient way to do this using pandas/scikit-learn/scipy etc? I'm currently doing this:

df['cosine_distance'] = df.apply(lambda x: cosine_distances(x["tf-idf"], x["new_string"]), axis=1)

which is obviously quite slow. I was thinking of maybe a KD-tree, but it takes a lot of memory as the tf-idf vectors have a dimension of 2000.

我们的影子 2025-01-26 11:52:28

Consider using vectorized computations rather than looping over DataFrame rows (which is very slow and should be avoided).

I'm not sure how the arrays are represented in the dataframe, so make sure you're starting out with two arrays of the same shape.

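import numpy as np
from numpy.linalg import norm

# Assumes each cell holds a dense 1-D NumPy array; stack them into 2-D arrays
# of identical shape (n_rows, n_features). Column names follow the question.
arr_a = np.vstack(df["tf-idf"].to_numpy())
arr_b = np.vstack(df["new_string"].to_numpy())
# Row-wise dot products divided by the product of the row norms.
cos_sim = np.einsum('ij,ij->i', arr_a, arr_b) / (norm(arr_a, axis=1) * norm(arr_b, axis=1))
df["cosine_distance"] = 1 - cos_sim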

This code computes the cosine distance directly with vectorized operations (see the numpy.einsum documentation) and will typically run orders of magnitude faster than the DataFrame.apply() approach.
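For the situation described in the question, where a single new string is compared against the whole bank, the same idea can be reduced to one matrix-vector product. The following is only a minimal sketch under some assumptions: the bank is already available as a dense 2-D NumPy array of tf-idf vectors, the new string has been vectorized with the same vocabulary, and the function name most_similar and the toy data are made up for illustration.

```python
import numpy as np

def most_similar(bank, query):
    """Return (index, cosine similarity) of the bank row closest to query.

    bank  : (n_strings, n_features) dense array of tf-idf vectors (assumed layout)
    query : (n_features,) tf-idf vector of the new string
    """
    # Norms of every bank row; these can be cached between queries.
    bank_norms = np.linalg.norm(bank, axis=1)
    query_norm = np.linalg.norm(query)
    denom = bank_norms * query_norm
    # Guard against all-zero vectors to avoid division by zero.
    denom = np.where(denom == 0, 1.0, denom)
    # One matrix-vector product yields the dot product with every bank row.
    sims = (bank @ query) / denom
    best = int(np.argmax(sims))
    return best, float(sims[best])

# Toy data standing in for real tf-idf vectors (hypothetical values).
bank = np.array([[1.0, 0.0, 0.0],
                 [0.0, 2.0, 2.0],
                 [3.0, 0.0, 4.0]])
query = np.array([0.0, 1.0, 1.0])
best_index, best_sim = most_similar(bank, query)
print(best_index, best_sim)
```

A dense float64 array of 100k rows by 2000 columns takes about 1.6 GB, so if memory is a concern it may be worth keeping the tf-idf vectors in a scipy.sparse matrix instead; the matrix-vector product works the same way there.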
