Computing cosine similarity efficiently
I have a bank of about 100k strings and when I get a new string, I want to match it to the most similar string.
My idea is to use tf-idf (which makes sense, as keywords are quite important) and then match using cosine distance. Is there an efficient way to do this using pandas/scikit-learn/scipy etc.? I'm currently doing this:
df['cosine_distance'] = df.apply(lambda x: cosine_distances(x["tf-idf"], x["new_string"]), axis=1)
which is obviously quite slow. I was thinking of maybe using a KD-tree, but it would take a lot of memory, since the tf-idf vectors have 2000 dimensions.
Comments (1)
Consider using vectorized computations rather than looping over DataFrame rows (which is very slow and should be avoided).
I'm not sure how the arrays are represented in the dataframe, so make sure you're starting out with two arrays of the same shape.
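As a rough illustration of getting the shapes in order, the sketch below stacks per-row vectors into one 2-D array. It assumes each cell holds a dense 1-D (or 1 x d) vector; `stack_vectors` and the sample data are hypothetical names made up for this example, not anything from the question.

```python
import numpy as np

def stack_vectors(vectors):
    """Stack a sequence of dense vectors into one (n, d) float array.

    Hypothetical helper: assumes every item is a dense 1-D or (1, d) vector.
    Sparse matrices would need to be handled differently.
    """
    rows = [np.asarray(v, dtype=float).ravel() for v in vectors]
    lengths = {r.shape[0] for r in rows}
    if len(lengths) != 1:
        raise ValueError(f"vectors have differing lengths: {sorted(lengths)}")
    return np.vstack(rows)

# Made-up cells standing in for a DataFrame column of tf-idf vectors.
cells = [
    np.array([0.1, 0.0, 0.9]),
    np.array([[0.5, 0.5, 0.0]]),  # a (1, d) row is flattened to (d,)
    [0.0, 1.0, 0.0],              # plain lists work too
]
bank = stack_vectors(cells)
print(bank.shape)
```

With a real DataFrame, something like `stack_vectors(df["tf-idf"])` would presumably play the same role, provided the column holds dense vectors.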
Calculating the cosine distance directly with vector operations (for example with numpy.einsum) will run orders of magnitude faster than the DataFrame.apply() method.
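The answer's original code did not survive on this page, so what follows is only a minimal sketch of the kind of vectorized computation it describes, not the author's code. It assumes the bank is a dense (n, d) NumPy array and the new string's tf-idf vector is a dense (d,) array; `cosine_distances_to_query` and the sample data are hypothetical.

```python
import numpy as np

def cosine_distances_to_query(bank, query):
    """Cosine distance between every row of `bank` (n, d) and `query` (d,)."""
    bank = np.asarray(bank, dtype=float)
    query = np.asarray(query, dtype=float)

    # Dot product of each bank row with the query, in one call.
    dots = np.einsum("ij,j->i", bank, query)
    # Row-wise Euclidean norms of the bank, and the norm of the query.
    bank_norms = np.sqrt(np.einsum("ij,ij->i", bank, bank))
    query_norm = np.sqrt(np.einsum("j,j->", query, query))

    denom = bank_norms * query_norm
    # Assumption: an all-zero vector gets similarity 0, i.e. distance 1.
    similarity = np.divide(dots, denom, out=np.zeros_like(dots), where=denom > 0)
    return 1.0 - similarity

# Made-up data: three bank vectors and one query vector.
bank = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 1.0],
    [2.0, 0.0, 0.1],
])
query = np.array([1.0, 0.0, 0.0])

distances = cosine_distances_to_query(bank, query)
best = int(np.argmin(distances))  # index of the most similar bank entry
print(best, distances)
```

The whole bank is compared against the query in a handful of array operations, so no Python-level loop over rows is needed. At 100k rows by 2000 columns a dense float64 array takes roughly 1.6 GB, so a sparse representation may be worth considering if memory is tight.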