How do I find the 2 closest points among 500,000 points in a 100-dimensional space?
I have a database with 500,000 points in a 100 dimensional space, and I want to find the closest 2 points. How do I do it?
Update: The space is Euclidean, sorry. And thanks for all the answers. BTW, this is not homework.
Comments (5)
There's a chapter in Introduction to Algorithms devoted to finding the two closest points in two-dimensional space in O(n*logn) time. You can check it out on Google Books. In fact, I recommend it to everyone, as the way they apply the divide-and-conquer technique to this problem is very simple, elegant and impressive.
Although it can't be extended directly to your problem (as the constant 7 would be replaced with 2^101 - 1), it should be just fine for most datasets. So, if you have reasonably random input, it will give you O(n*logn*m) complexity, where n is the number of points and m is the number of dimensions.
Edit: That's all assuming you have a Euclidean space, i.e., the length of a vector v is sqrt(v0^2 + v1^2 + v2^2 + ...). If you can choose the metric, however, there could be other options to optimize the algorithm.
Use a kd tree. You're looking at a nearest neighbor problem and there are highly optimized data structures for handling this exact class of problems.
http://en.wikipedia.org/wiki/Kd-tree
P.S. Fun problem!
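To make the kd-tree suggestion concrete, here is a minimal pure-Python sketch, not a tuned implementation. The names `build_kdtree`, `nearest` and `closest_pair_kdtree` are hypothetical, points are assumed to be equal-length tuples, and the closest pair is found by asking for each point's nearest neighbour other than itself. In practice a library implementation (for example SciPy's `scipy.spatial.cKDTree`) would likely be a better starting point.

```python
import math


def build_kdtree(points, indices=None, depth=0):
    """Build a kd-tree over point indices, splitting on the median of one axis."""
    if indices is None:
        indices = list(range(len(points)))
    if not indices:
        return None
    axis = depth % len(points[0])          # cycle through the axes
    indices = sorted(indices, key=lambda i: points[i][axis])
    mid = len(indices) // 2
    return {
        "index": indices[mid],
        "axis": axis,
        "left": build_kdtree(points, indices[:mid], depth + 1),
        "right": build_kdtree(points, indices[mid + 1:], depth + 1),
    }


def nearest(node, points, q, best=(math.inf, -1)):
    """Return (distance, index) of the point nearest to points[q], excluding q."""
    if node is None:
        return best
    i, axis = node["index"], node["axis"]
    if i != q:
        d = math.dist(points[i], points[q])
        if d < best[0]:
            best = (d, i)
    diff = points[q][axis] - points[i][axis]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, points, q, best)
    # Visit the far side only if the splitting plane is closer than the best hit.
    if abs(diff) < best[0]:
        best = nearest(far, points, q, best)
    return best


def closest_pair_kdtree(points):
    """Return (distance, i, j) for the closest pair, by index."""
    tree = build_kdtree(points)
    best = (math.inf, -1, -1)
    for q in range(len(points)):
        d, i = nearest(tree, points, q)
        if d < best[0]:
            best = (d, q, i)
    return best
```

One caveat, as far as I understand it: the pruning test on the splitting plane rejects fewer and fewer subtrees as the number of dimensions grows, so in 100 dimensions the search may end up visiting most of the tree.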
You could try the ANN library, but that only gives reliable results up to 20 dimensions.
Run PCA on your data to reduce the vectors from 100 dimensions to, say, 20 dimensions. Then build a KD-tree for nearest-neighbor search and get the 2 closest neighbors based on Euclidean distance.
Generally, if the number of dimensions is very large, you have to use either a brute-force approach (parallel + distributed/MapReduce) or a clustering-based approach.
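As a baseline for the brute-force option, a single-machine sketch might look like the following. The function names are invented for this example, and the early-abandon trick (stop summing a pair's squared differences once the running total can no longer beat the best pair found so far) is my own addition rather than something taken from the answer.

```python
import math


def sq_dist_bounded(p, q, bound):
    """Squared Euclidean distance, or math.inf once it is known to reach bound."""
    total = 0.0
    for a, b in zip(p, q):
        total += (a - b) ** 2
        if total >= bound:
            return math.inf    # cannot beat the current best, stop early
    return total


def closest_pair_brute_force(points):
    """Return (distance, i, j) for the closest pair, checking all pairs."""
    best_sq, best_i, best_j = math.inf, -1, -1
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            s = sq_dist_bounded(points[i], points[j], best_sq)
            if s < best_sq:
                best_sq, best_i, best_j = s, i, j
    # Compare squared distances throughout; take the square root once at the end.
    return math.sqrt(best_sq), best_i, best_j
```

For 500,000 points this is 500,000 × 499,999 / 2 ≈ 1.25 × 10^11 pairs, so plain Python would be far too slow. The outer loop over `i` is independent per row, though, which is what makes the parallel/distributed split straightforward: each worker takes a range of `i` values and the per-worker minima are combined at the end.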
Use the data structure known as a KD-tree. You'll need to allocate a lot of memory, but you may discover an optimization or two along the way based on your data.
http://en.wikipedia.org/wiki/Kd-tree
My friend was working on his PhD thesis years ago when he encountered a similar problem. His work was on the order of 1M points across 10 dimensions. We built a kd-tree library to solve it. We may be able to dig up the code if you want to contact us offline.
Here's his published paper:
http://www.elec.qmul.ac.uk/people/josh/documents/ReissSelbieSandler-WIAMIS2003.pdf