查找 KD 树中所有节点的 KNN 的有效方法
我目前正在尝试找到平衡 KD 树(K=2)的所有节点的 K 最近邻。
我的实现是 Wikipedia 文章 中代码的变体,并且找到 KNN 的速度相当快任何节点O(log N)。
问题在于我需要找到每个节点的KNN。想出大约O(N log N)如果我迭代每个节点并执行搜索。
有更有效的方法吗?
I'm currently attempting to find K Nearest Neighbor of all nodes of a balanced KD-Tree (with K=2).
My implementation is a variation of the code from the Wikipedia article and it's decently fast to find KNN of any node O(log N).
The problem lies with the fact that I need to find KNN of each node. Coming up with about O(N log N) if I iterate over each node and perform the search.
Is there a more efficient way to do this?
如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。
绑定邮箱获取回复消息
由于您还没有绑定你的真实邮箱,如果其他用户或者作者回复了您的评论,将不能在第一时间通知您!
发布评论
评论(4)
我使用覆盖树来解决这个问题。这是链接: http://hunch.net/~jl/projects/cover_tree/ cover_tree.html
在50M大小的数据集中(全部为kNN查询,k=100),覆盖树的创建时间为5.5秒,查询时间为120秒。 Ann lib 创建树花费了 3.3 秒,查询花费了 138 秒。
更新:最近邻居不是对称关系。考虑一下:A(0,0) B(1,0) C(3,0)。 B 是 C 最近的,而 C 不是 B 最近的
I used cover tree for this problem. Here is the link: http://hunch.net/~jl/projects/cover_tree/cover_tree.html
In a data set for 50M size(All kNN query, k=100), cover tree took 5.5s for creation, and 120s for querying. Ann lib took 3.3s for creating the tree, and 138s for querying.
updated:The nearest neighbor is not a symmetric relation. Consider this:A(0,0) B(1,0) C(3,0). B is the nearest for C while C is not the nearest one for B
如果节点本身是查询点,那么搜索时间可能会更低。您可以从回溯阶段开始,并且测试的第一个节点已经在查询点附近。然后很快就可以修剪大面积的树。
最近邻居是对称关系(如果 n1 是 n2 的最近邻居,则同样适用于 n2),因此您只需要搜索一半节点,跳过所有已标记为最近邻居的节点。只是一个想法。
您还可以尝试 KD-Tree BBF(Best-Bin First)搜索,这将帮助您更快地搜索最近的节点(bin)。我已经用 C# 实现了这个,所以如果您对源代码感兴趣,请写信给我。
当然,实际运行时间取决于数据集中的维数、KD 树结构和点的分布。
点的聚类也可能是合适的。
If the nodes themselves are query points, then the search time might be lower. You can start with backtracking stage and the first nodes tested are already near the query point. Then large areas of the tree can be pruned soon.
The nearest neighbor is a symmetric relation (if n1 is a nearest neighbor of n2, the same applies to n2) so you only need to search half the nodes skipping all the nodes already marked as nearest neighbors. Just an idea.
You can also try KD-Tree BBF (Best-Bin First) search, which will help you search the nearest nodes (bins) sooner. I have implemented this in C#, so write me if you are interested in the source code.
Of course, the actual running time depends on dimensionality, KD-Tree structure and distribution of points in your dataset.
Clustering of the points might also be appropriate.
要搜索的术语是knn join。更准确地说,您可能想要进行自连接。
也许这些搜索结果有帮助:
我只见过 R* 树的 knn 连接算法。然而,在我自己的实验中,它们无法胜过重复查询。我可能会遗漏一些实现想法。但一般来说,为树连接适当地保存数据比单个 knn 查询要棘手得多。
The term to search for is knn join. More precisely, you probably want to do a self-join.
Maybe these search results help:
I have only seen knn join algorithms for the R*-tree. However, in my own experiments, they were not able to outperform a repeated query. I might be missing some implementation ideas. But in general, holding the data appropriately for a tree join is much more tricky than a single knn query.
根据您的需要,您可能想要尝试近似的技术。有关详细信息,请查看 Arya 和 Mount 关于该主题的工作。关键论文位于此处。 BigO 复杂性的详细信息位于他们的 '98 论文中。
该作品的图解如下所示:
来源:http://www.cs.umd.edu/~mount/ANN/Images/annspeckle.gif
我已经在包含数十万个元素的高维数据集上使用了他们的库。它比我发现的任何其他东西都快。该库可以处理精确搜索和近似搜索。该软件包包含一些 CLI 实用程序,您可以使用它们轻松地试验数据集;甚至可视化 kd 树(见上文)。
FWIW:我使用了 R 绑定。
来自 ANN 手册:
Depending on your needs, you may want to experiment with approximate techniques. For details, checkout Arya and Mount's work on the subject. A key paper is here. BigO complexity details are located in their '98 paper.
A graphical illustration of the work is shown below:
Source: http://www.cs.umd.edu/~mount/ANN/Images/annspeckle.gif
I have used their library on very high dimensional datasets with hundreds of thousands of elements. It's faster than anything else I found. The library handles both exact and approximate searches. The package contains some CLI utilities that you can use to easily experiment with your dataset; and even visualize the kd-tree ( see above ).
FWIW: I used the R Bindings.
From ANN's manual: