Having looked around this site for similar issues, I found this: http://math.nist.gov/javanumerics/jama/ and this: http://sujitpal.blogspot.com/2008/09/ir-math-with-java-similarity-measures.html
However, these seem to run in O(n^2). I've been doing some document clustering and noticed this level of complexity wasn't feasible when dealing with even small document sets. Given that, for the dot product, we only need the terms contained in both vectors, it should be possible to put the vectors in a tree and thus compute the dot product in O(n log n), where n is the number of unique terms in the smaller of the two documents.
Am I missing something? Is there a java library which does this?
thanks
If you store the vector elements in a hashtable, lookup is only O(1) on average anyway, no? Loop over all keys in the smaller document and see if they exist in the larger one..?
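A minimal sketch of this suggestion, assuming the vectors are stored as term-to-weight `HashMap`s (the class and method names here are hypothetical, not from any particular library):

```java
import java.util.HashMap;
import java.util.Map;

// Sparse dot product using hash lookups: iterate over the smaller
// map and probe the larger one. Expected time is O(min(|a|, |b|)),
// since HashMap get() is O(1) on average.
public class HashDot {
    public static double dot(Map<String, Double> a, Map<String, Double> b) {
        Map<String, Double> small = a.size() <= b.size() ? a : b;
        Map<String, Double> large = (small == a) ? b : a;
        double sum = 0.0;
        for (Map.Entry<String, Double> e : small.entrySet()) {
            Double w = large.get(e.getKey()); // null if the term is absent
            if (w != null) {
                sum += e.getValue() * w;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<String, Double> a = new HashMap<>();
        a.put("apple", 2.0);
        a.put("pear", 3.0);
        Map<String, Double> b = new HashMap<>();
        b.put("pear", 4.0);
        b.put("plum", 1.0);
        System.out.println(dot(a, b)); // only "pear" is shared: 3.0 * 4.0 = 12.0
    }
}
```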
Hashmap is good, but it might take a lot of memory.
If your vectors are stored as key-value pairs sorted by key, then vector multiplication can be done in O(n): you just iterate over both vectors in parallel (the same iteration used, e.g., in the merge step of merge sort). Pseudocode for the multiplication:
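The merge-style traversal described above can be sketched in Java like this (assuming each sparse vector is a pair of parallel arrays, term IDs sorted ascending alongside their weights):

```java
// Sparse dot product by parallel iteration over two term-sorted
// vectors, O(|a| + |b|) total. Only terms present in both vectors
// contribute to the sum.
public class SparseDot {
    public static double dot(int[] aIds, double[] aVals,
                             int[] bIds, double[] bVals) {
        double sum = 0.0;
        int i = 0, j = 0;
        while (i < aIds.length && j < bIds.length) {
            if (aIds[i] == bIds[j]) {        // term present in both vectors
                sum += aVals[i] * bVals[j];
                i++;
                j++;
            } else if (aIds[i] < bIds[j]) {  // advance the side that lags
                i++;
            } else {
                j++;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] aIds = {1, 3, 7};
        double[] aVals = {2.0, 1.0, 4.0};
        int[] bIds = {3, 7, 9};
        double[] bVals = {5.0, 0.5, 3.0};
        // shared terms 3 and 7: 1.0*5.0 + 4.0*0.5 = 7.0
        System.out.println(dot(aIds, aVals, bIds, bVals));
    }
}
```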
If you are planning on using cosine similarity as a way of finding clusters of similar documents, you may want to consider looking into locality-sensitive hashing, a hash-based approach that was designed specifically with this in mind. Intuitively, LSH hashes the vectors in a way that with high probability places similar elements into the same bucket and distant elements into different buckets. There are LSH schemes that use cosine similarity as their underlying distance, so to find clusters you use LSH to drop things into buckets and then only compute the pairwise distances of elements in the same bucket. In the worst case this will be quadratic (if everything falls in the same bucket), but it's much more likely that you'll have a significant dropoff in work.
Hope this helps!
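One classic LSH scheme for cosine similarity is random-hyperplane hashing: each random hyperplane contributes one sign bit, so vectors with a small angle between them tend to share the same bucket signature. A minimal sketch (the class name and API are hypothetical, not from a specific library):

```java
import java.util.Random;

// Random-hyperplane LSH for cosine similarity: the bucket signature
// is a k-bit integer where bit i records which side of random
// hyperplane i the vector falls on. The probability that a single
// hyperplane separates two vectors is theta/pi, where theta is the
// angle between them, so similar vectors usually share a bucket.
public class CosineLsh {
    private final double[][] planes; // k random hyperplane normals

    public CosineLsh(int dim, int k, long seed) {
        Random rnd = new Random(seed);
        planes = new double[k][dim];
        for (int i = 0; i < k; i++) {
            for (int d = 0; d < dim; d++) {
                planes[i][d] = rnd.nextGaussian();
            }
        }
    }

    // Compute the k-bit bucket signature for a dense vector.
    public int signature(double[] v) {
        int sig = 0;
        for (int i = 0; i < planes.length; i++) {
            double dot = 0.0;
            for (int d = 0; d < v.length; d++) {
                dot += planes[i][d] * v[d];
            }
            if (dot >= 0) {
                sig |= (1 << i);
            }
        }
        return sig;
    }

    public static void main(String[] args) {
        CosineLsh lsh = new CosineLsh(3, 8, 42L);
        double[] a = {1.0, 2.0, 0.5};
        double[] b = {1.1, 2.1, 0.4}; // nearly parallel to a
        System.out.println(lsh.signature(a));
        System.out.println(lsh.signature(b)); // likely the same bucket as a
    }
}
```

To cluster, you would group documents by signature and compute exact pairwise cosine similarity only within each bucket; in practice several independent hash tables are used to reduce the chance of missing a near neighbor.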
If you only want to recommend a limited number of items, say m, for every item in a set of size n, the complexity need not be n^2 but m*n. Since m is a constant, the complexity is linear.
You can check out the project simbase https://github.com/guokr/simbase , a vector-similarity NoSQL database.

Simbase uses the following concepts: