向量的余弦相似度,< O(n^2) 复杂度

发布于 2024-09-11 11:17:47 字数 525 浏览 10 评论 0 原文

浏览此网站是否有类似问题,我发现: http://math.nist.gov/javanumerics /jama/ 和这个: http ://sujitpal.blogspot.com/2008/09/ir-math-with-java-similarity-measures.html

但是,这些运行时间似乎为 O(n^2)。我一直在进行一些文档聚类,并注意到即使处理小型文档集,这种复杂程度也是不可行的。鉴于,对于点积,我们只需要两个向量中包含的向量项,应该可以将向量放入树中,从而计算具有 n log n 复杂度的点积,其中 n 是2 个文档中的 1 个。

我错过了什么吗?有没有一个java库可以做到这一点?

谢谢

Having looked around this site for similar issues, I found this: http://math.nist.gov/javanumerics/jama/ and this: http://sujitpal.blogspot.com/2008/09/ir-math-with-java-similarity-measures.html

However, it seems these run in O(n^2). I've been doing some document clustering and noticed this level of complexity wasn't feasible when dealing with even small document sets. Given, for the dot product, we only need the vector terms contained in both vectors it should be possible to put the vectors in a tree and thus compute the dot product with n log n complexity, where n is the lowest number of unique terms in 1 of the 2 documents.

Am I missing something? Is there a java library which does this?

thanks

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(4

谁把谁当真 2024-09-18 11:17:47

如果将向量元素存储在哈希表中,那么查找只是 log n ,不是吗?循环遍历较小文档中的所有键,看看它们是否存在于较大文档中..?

If you store the vector elements in a hashtable, lookup is only log n anyway, no? Loop over all keys in the smaller document and see if they exist in the larger one..?

简单气质女生网名 2024-09-18 11:17:47

Hashmap 很好,但它可能会占用大量内存。

如果向量存储为按键排序的键值对,则向量乘法可以在 O(n) 内完成:您只需并行迭代两个向量(例如在合并排序算法中使用相同的迭代)。乘法的伪代码:

i = 0
j = 0
result = 0
while i < length(vec1) && j < length(vec2):
  if vec1[i].key == vec2[j].key:
    result = result + vec1[i].value * vec2[j].value
  else if vec1[i].key < vec2[j].key:
    i = i + 1
  else
    j = j + 1

Hashmap is good, but it might take a lot of memory.

If your vectors are stored as key-value pairs sorted by key then vector multiplication can be done in O(n): you just have to iterate in parallel over both vectors (the same iteration is used e.g. in merge sort algorithm). The pseudocode for multiplication:

i = 0
j = 0
result = 0
while i < length(vec1) && j < length(vec2):
  if vec1[i].key == vec2[j].key:
    result = result + vec1[i].value * vec2[j].value
  else if vec1[i].key < vec2[j].key:
    i = i + 1
  else
    j = j + 1
黎歌 2024-09-18 11:17:47

如果您计划使用余弦相似度作为查找相似文档集群的方法,您可能需要考虑研究 局部敏感哈希,一种基于哈希的方法,专门考虑到​​这一点而设计。直观上,LSH 对向量进行哈希处理的方式是,很可能将相似的元素放入同一个桶中,而将相距较远的元素放入不同的桶中。有些 LSH 方案使用余弦相似度作为其基础距离,因此要查找集群,您可以使用 LSH 将事物放入桶中,然后仅计算同一桶中元素的成对距离。在最坏的情况下,这将是二次的(如果一切都落在同一个桶中),但更有可能的是你的工作会大幅下降。

希望这有帮助!

If you are planning on using cosine similarity as a way of finding clusters of similar documents, you may want to consider looking into locality-sensitive hashing, a hash-based approach that was designed specifically with this in mind. Intuitively, LSH hashes the vectors in a way that with high probability places similar elements into the same bucket and distant elements into different buckets. There are LSH schemes that use cosine similarity as their underlying distance, so to find clusters you use LSH to drop things into buckets and then only compute the pairwise distances of elements in the same bucket. In the worst case this will be quadratic (if everything falls in the same bucket), but it's much more likely that you'll have a significant dropoff in work.

Hope this helps!

深空失忆 2024-09-18 11:17:47

如果您只想向大小为 n 的集合中的每个项目推荐有限的项目,例如 m 个项目,则复杂度不需要是 n^2,而是 m*n。由于 m 是常数,因此复杂度是线性的。

你可以查看项目 simbase https://github.com/guokr/simbase ,它是一个向量相似性nosql数据库。

Simbase 使用以下概念:

  • 向量集:一组向量
  • 基础:向量的基础,一个向量集中的向量具有相同的基础
  • 推荐:具有相同基础的两个向量集之间的单向二元关系

If you only want to recommend limited items, for example m items, to every item in a set with size of n, the complexity need not to be n^2, but m*n. Since m is a constant, the complexity is linear.

You can check with the project simbase https://github.com/guokr/simbase , it is a vector similarity nosql database.

Simbase use below concepts:

  • Vector set: a set of vectors
  • Basis: the basis for vectors, vectors in one vector set have same basis
  • Recommendation: a one-direction binary relationship between two vector sets which have the same basis
~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文