How do I find the closest 2 points among 500,000 points in a 100-dimensional space?

Posted 2024-09-27 11:03:08

I have a database with 500,000 points in a 100-dimensional space, and I want to find the closest 2 points. How do I do it?

Update: The space is Euclidean, sorry. And thanks for all the answers. BTW, this is not homework.


Comments (5)

亚希 2024-10-04 11:03:08

There's a chapter in Introduction to Algorithms devoted to finding the two closest points in two-dimensional space in O(n*logn) time. You can check it out on Google Books. In fact, I suggest it for everyone, as the way they apply the divide-and-conquer technique to this problem is simple, elegant and impressive.

Although it can't be extended directly to your problem (as the constant 7, the number of strip neighbors that must be checked in two dimensions, would be replaced with 2^101 - 1), it should be just fine for most datasets. So, if you have reasonably random input, it will give you O(n*logn*m) complexity, where n is the number of points and m is the number of dimensions.

edit
That's all assuming you have a Euclidean space. I.e., the length of a vector v is sqrt(v0^2 + v1^2 + v2^2 + ...). If you can choose the metric, however, there could be other options to optimize the algorithm.
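For reference, here is a minimal Python sketch of the two-dimensional version of that divide-and-conquer algorithm (standard library only; a sketch of the book's scheme, not a drop-in for the 100-dimensional case). The strip check is where the constant 7 comes from; in d dimensions it would grow to roughly 2^(d+1) - 1 neighbors.

```python
import math

def closest_pair(points):
    """Closest pair of 2-D points in O(n log n), following the
    divide-and-conquer scheme from Introduction to Algorithms."""
    n = len(points)
    assert n >= 2
    px = sorted(range(n), key=lambda i: points[i])      # indices by (x, y)
    py = sorted(range(n), key=lambda i: points[i][1])   # indices by y

    def dist(i, j):
        return math.hypot(points[i][0] - points[j][0],
                          points[i][1] - points[j][1])

    def solve(px, py):
        m = len(px)
        if m <= 3:  # base case: brute-force the few remaining points
            return min(((dist(a, b), a, b)
                        for k, a in enumerate(px) for b in px[k + 1:]),
                       key=lambda t: t[0])
        half = m // 2
        left = set(px[:half])
        best = min(solve(px[:half], [i for i in py if i in left]),
                   solve(px[half:], [i for i in py if i not in left]),
                   key=lambda t: t[0])
        # strip: points within `best` distance of the dividing line,
        # still sorted by y
        mid_x = points[px[half]][0]
        strip = [i for i in py if abs(points[i][0] - mid_x) < best[0]]
        for k, a in enumerate(strip):
            # in 2-D only the next 7 strip points can be closer;
            # in d dimensions this constant blows up exponentially
            for b in strip[k + 1:k + 8]:
                d = dist(a, b)
                if d < best[0]:
                    best = (d, a, b)
        return best

    d, i, j = solve(px, py)
    return d, points[i], points[j]

print(closest_pair([(0, 0), (5, 4), (3, 1), (9, 9), (2.5, 1.5)]))
```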

瘫痪情歌 2024-10-04 11:03:08

Use a kd-tree. You're looking at a nearest neighbor problem, and there are highly optimized data structures for handling this exact class of problems.

http://en.wikipedia.org/wiki/Kd-tree

P.S. Fun problem!
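A minimal sketch of that approach, assuming SciPy's cKDTree (an assumption; the answer doesn't name a library). One caveat: in 100 dimensions a k-d tree's pruning degrades and queries approach brute-force cost, which is why other answers suggest reducing the dimensionality first.

```python
# Minimal closest-pair sketch with a k-d tree (SciPy assumed).
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.random((10_000, 100))  # stand-in for the real 500,000 points

tree = cKDTree(points)
# k=2: the nearest hit for each query point is the point itself,
# so the second hit is its true nearest neighbor
dists, idx = tree.query(points, k=2)

i = int(np.argmin(dists[:, 1]))
j = int(idx[i, 1])
print("closest pair:", i, j, "distance:", dists[i, 1])
```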

浪菊怪哟 2024-10-04 11:03:08

You could try the ANN library, but that only gives reliable results up to about 20 dimensions.
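ANN itself is a C++ library. As a rough Python illustration of the same idea, approximate nearest neighbor search, here is a sketch using the Annoy library instead (Annoy is a stand-in chosen for illustration, not the library this answer names). Being approximate, it can miss the true closest pair, so treat the result as a candidate.

```python
# Approximate nearest neighbor sketch using Annoy as a stand-in for
# the ANN (C++) library mentioned above.
import random
from annoy import AnnoyIndex

dim = 100
index = AnnoyIndex(dim, 'euclidean')
for i in range(10_000):  # stand-in for the real 500,000 points
    index.add_item(i, [random.random() for _ in range(dim)])
index.build(10)  # more trees -> better recall, slower build

best = (float('inf'), None, None)
for i in range(index.get_n_items()):
    ids, ds = index.get_nns_by_item(i, 2, include_distances=True)
    if ds[1] < best[0]:  # ids[0] is the point itself at distance 0
        best = (ds[1], i, ids[1])
print("approximate closest pair:", best)
```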

ぃ双果 2024-10-04 11:03:08

Run PCA on your data to convert the vectors from 100 dimensions down to, say, 20 dimensions. Then build a k-d tree (KD-Tree) and get the closest 2 neighbors based on Euclidean distance.

Generally, if the number of dimensions is very large, you have to use either a brute-force approach (parallel + distributed/MapReduce) or a clustering-based approach.
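A minimal sketch of this pipeline, assuming scikit-learn for the PCA step and SciPy for the tree (both are assumptions; the answer doesn't name libraries). Since PCA discards variance, distances in the reduced space are only approximate, so it's worth re-checking the winning pair in the original 100-dimensional space.

```python
# PCA to 20 dimensions, then a k-d tree closest-pair search.
import numpy as np
from sklearn.decomposition import PCA
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X = rng.random((10_000, 100))  # stand-in for the 500,000 points

X20 = PCA(n_components=20).fit_transform(X)  # 100 -> 20 dimensions

tree = cKDTree(X20)
dists, idx = tree.query(X20, k=2)  # k=2: first hit is the point itself
i = int(np.argmin(dists[:, 1]))
j = int(idx[i, 1])

# PCA is lossy, so verify the candidate pair in the original space
print("candidate pair:", i, j, "true distance:", np.linalg.norm(X[i] - X[j]))
```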

很酷又爱笑 2024-10-04 11:03:08

Use the data structure known as a KD-TREE. You'll need to allocate a lot of memory, but you may discover an optimization or two along the way based on your data.

http://en.wikipedia.org/wiki/Kd-tree

My friend was working on his PhD thesis years ago when he encountered a similar problem. His work was on the order of 1M points across 10 dimensions. We built a kd-tree library to solve it. We may be able to dig up the code if you want to contact us offline.

Here's his published paper:
http://www.elec.qmul.ac.uk/people/josh/documents/ReissSelbieSandler-WIAMIS2003.pdf
