Incremental Nearest Neighbor Algorithm in Python

Published 2024-10-04 07:08:55


Comments (3)

栀子花开つ 2024-10-11 07:08:55


This is way late, but for posterity:

There is actually a technique for converting batch-processed algorithms like KD-Tree into incremental algorithms: it's called a static-to-dynamic transformation.

To generate an incremental variant of a KD-Tree, you store a set of trees instead of just one tree. When there are N elements in your nearest-neighbor structure, your structure will have a tree for each "1" bit in the binary representation of N. Moreover, if tree T_i corresponds to the i-th bit of N, then tree T_i contains 2^i elements.

So, if you have 11 elements in your structure, then N = 11, or 1011 in binary, and therefore you have three trees - T_3, T_1, and T_0 - with 8 elements, 2 elements, and 1 element, respectively.

Now, let's insert an element e into our structure. After insertion, we'll have 12 elements, or 1100 in binary. Comparing the new and the previous binary string, we see that T_3 doesn't change, we have a new tree T_2 with 4 elements, and trees T_1 and T_0 get deleted. We construct the new tree T_2 by doing a batch insertion of e along with all the elements in the trees "below" T_2, which are T_1 and T_0.

In this way, we create an incremental point query structure from a static base structure. There is, however, an asymptotic slowdown in "incrementalizing" static structures like this in the form of an extra log(N) factor:

  • inserting N elements into the structure: O(N log(N) log(N))
  • nearest-neighbor query on a structure with N elements: O(log(N) log(N))
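
To make the scheme above concrete, here is a minimal sketch (mine, not the answer's code) that uses scipy.spatial.cKDTree as the static building block; the class name DynamicKDTree and the slot bookkeeping are illustrative assumptions, not part of any library API:

    import numpy as np
    from scipy.spatial import cKDTree  # static kd-tree used as the building block

    class DynamicKDTree:
        """Static-to-dynamic transformation: one static kd-tree per '1' bit
        of N, where the tree in slot i holds exactly 2**i points."""

        def __init__(self):
            self.slots = []  # slots[i] is None or (points array, cKDTree built over it)

        def insert(self, point):
            carry = [np.atleast_2d(np.asarray(point, dtype=float))]
            i = 0
            # Binary-counter carry: merge equal-size slots until an empty slot is found.
            while i < len(self.slots) and self.slots[i] is not None:
                carry.append(self.slots[i][0])
                self.slots[i] = None
                i += 1
            merged = np.vstack(carry)  # exactly 2**i points end up here
            if i == len(self.slots):
                self.slots.append(None)
            self.slots[i] = (merged, cKDTree(merged))  # one batch build for the new tree

        def query(self, point, k=1):
            # Query every live tree and keep the k closest candidates overall.
            candidates = []
            for slot in self.slots:
                if slot is None:
                    continue
                pts, tree = slot
                dist, idx = tree.query(point, k=min(k, len(pts)))
                for d, j in zip(np.atleast_1d(dist), np.atleast_1d(idx)):
                    candidates.append((d, pts[j]))
            candidates.sort(key=lambda c: c[0])
            return candidates[:k]  # list of (distance, point) pairs

For example, after eleven insert calls the slots for bits 3, 1, and 0 hold trees of 8, 2, and 1 points, matching the N = 11 example above; a twelfth insert collapses the two smallest trees into a new 4-point tree.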
压抑⊿情绪 2024-10-11 07:08:55


I think the problem with incremental construction of a KD-tree or KNN-tree is, as you've alluded to in a comment, that the tree will eventually become unbalanced and you can't do simple tree rotation to fix balance problems and keep consistency. At the minimum, the re-balancing task is not trivial and one would definitely not want to do it at each insertion. Often, one will choose to build a tree with a batch method, insert a bunch of new points and allow the tree to become unbalanced up to a point, and then re-balance it.

A very similar thing to do is to build the data structure in batch for M points, use it for M' points, and then re-build the data structure in batch with M+M' points. Since re-balancing is not the normal, fast operation we are familiar with for trees, rebuilding is not necessarily slow in comparison, and in some cases it can be faster (depending on the order in which points enter your incremental algorithm).

That being said, the amount of code you have to write, the difficulty of debugging it, and the effort others need to understand it can all be significantly smaller if you take the rebuild approach. If you do so, you can use a batch method and keep an external list of the points not yet inserted into the tree. A brute-force check can be used to ensure that none of these is closer than the points found in the tree.
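
As a rough illustration of that rebuild-plus-buffer idea (my sketch, not code from any of the links below; the class name BufferedKDTree and the rebuild threshold are made up), using scipy.spatial.cKDTree for the batch-built tree:

    import numpy as np
    from scipy.spatial import cKDTree

    class BufferedKDTree:
        """Batch-built kd-tree plus an external list of not-yet-inserted points;
        the tree is rebuilt in batch once the buffer grows past a threshold."""

        def __init__(self, points, rebuild_at=256):
            self.points = np.asarray(points, dtype=float)
            self.tree = cKDTree(self.points)
            self.buffer = []               # points added since the last rebuild
            self.rebuild_at = rebuild_at   # illustrative threshold; tune for your data

        def insert(self, point):
            self.buffer.append(np.asarray(point, dtype=float))
            if len(self.buffer) >= self.rebuild_at:
                # Fold the buffer into the main set and rebuild in batch (M + M' points).
                self.points = np.vstack([self.points, np.array(self.buffer)])
                self.tree = cKDTree(self.points)
                self.buffer = []

        def nearest(self, point):
            point = np.asarray(point, dtype=float)
            best_dist, best_idx = self.tree.query(point)  # best match inside the tree
            best_point = self.points[best_idx]
            # Brute force over the buffer so nothing there is missed.
            for p in self.buffer:
                d = np.linalg.norm(p - point)
                if d < best_dist:
                    best_dist, best_point = d, p
            return best_dist, best_point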

Some links to Python implementations/discussions are below, but I haven't found any that explicitly claim to be incremental. Good luck.

http://www.scipy.org/Cookbook/KDTree

http://cgi.di.uoa.gr/~compgeom/pycgalvisual/kdppython.shtml

http://sites.google.com/site/mikescoderama/Home/kd-tree-knn

http://en.wikipedia.org/wiki/Kd-tree

Note: My comments here apply to high-dimensional spaces. If you're working in 2D or 3D, what I've said may not be appropriate. (If you're working in very high-dimensional spaces, use brute force or an approximate nearest neighbor method.)

铁轨上的流浪者 2024-10-11 07:08:55


There is. The SciPy Cookbook website includes a complete implementation of a kNN algorithm that can be updated incrementally.

Maybe a few lines of background would be helpful for anyone interested but not familiar with the terminology.

A kNN engine is powered by either of two data representations--the pairwise distances between all points in the dataset stored in a multi-dimensional array (a distance matrix), or a kd-tree, which just stores the data points themselves in a multi-dimensional binary tree.

There are only two operations that a kd-tree-based kNN algorithm needs: you create the tree from the dataset (analogous to the training step performed in batch mode in other ML algorithms), and you search the tree to find 'nearest neighbors' (analogous to the testing step).

Online or incremental training in the context of a kNN algorithm (provided it's based on a kd-tree) means inserting nodes into an already-built kd-tree.
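
For anyone unfamiliar with what "inserting a node into a kd-tree" looks like in code, here is a minimal, generic sketch (my illustration, not the Cookbook implementation the answer refers to):

    class KDNode:
        """Minimal kd-tree node: a point plus left/right children."""
        def __init__(self, point):
            self.point = point
            self.left = None
            self.right = None

    def insert(root, point, depth=0, k=2):
        """Insert a point by walking down the tree, cycling the split axis with depth.
        Repeated ad-hoc insertions like this are what gradually unbalance the tree."""
        if root is None:
            return KDNode(point)
        axis = depth % k
        if point[axis] < root.point[axis]:
            root.left = insert(root.left, point, depth + 1, k)
        else:
            root.right = insert(root.right, point, depth + 1, k)
        return root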

Back to the kd-tree implementation in the SciPy Cookbook: the specific lines of code responsible for node insertion appear after the comment line "insert node in kd-tree" (in fact, all of the code after that comment is directed at node insertion).

Finally, there is a kd-tree implementation in the spatial module of the SciPy library (the scipy.spatial module) called KDTree (scipy.spatial.KDTree), but I don't believe it supports node insertion; at least, no such function appears in the docs (I haven't looked at the source).
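
For comparison, the batch build-then-query pattern that scipy.spatial.KDTree does support looks like this (the data here is random and purely illustrative; as noted above, adding points apparently means rebuilding the tree from scratch):

    import numpy as np
    from scipy.spatial import KDTree

    data = np.random.rand(1000, 3)                # 1000 points in 3-D
    tree = KDTree(data)                           # batch construction ("training")
    dist, idx = tree.query([0.5, 0.5, 0.5], k=3)  # three nearest neighbors ("testing")

    # No insertion API is documented, so adding points means rebuilding:
    data = np.vstack([data, [[0.1, 0.2, 0.3]]])
    tree = KDTree(data)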
