关于空间索引的好书/文章

发布于 2024-10-31 09:36:25 字数 1539 浏览 5 评论 0原文

如果你对这篇内容有疑问,欢迎到本站社区发帖提问 参与讨论,获取更多帮助,或者扫码二维码加入 Web 技术交流群。

扫码二维码加入Web技术交流群

发布评论

需要 登录 才能够评论, 你可以免费 注册 一个本站的账号。

评论(2

情泪▽动烟 2024-11-07 09:36:25

我曾经使用一种自制的四叉树进行空间索引(早在我学习单词“四叉树”)。对于普通类型的空间数据(我处理街道地图数据),它们创建速度快,查询速度快,但在查询期间扫描太多叶节点。具体来说,在合理的节点大小 (50-100) 下,我的四叉树往往会为点查询生成大约 300 个结果,即应用 3-6 个叶节点(非常粗略的估计;结果变化很大。)

现在,我首选的数据结构是R*树。我自己编写并测试了一个实现,取得了非常好的结果。与我的四叉树代码相比,我构建 R*tree 的代码非常慢,但叶节点上的边界框最终组织得很好;至少一半的查询空间仅由一个叶节点回答(即,如果您进行随机点查询,则很有可能仅返回单个叶节点),并且大约 90% 的空间被覆盖两个或更少的节点。因此,对于 80 个元素的节点大小,我通常会从点查询中获得 80 或 160 个结果,平均值接近 160(因为一些查询确实返回 3-5 个节点)。即使在地图上密集的城市地区也是如此。

我知道这一点是因为我为 R* 树及其内部的图形对象编写了一个可视化工具,并在大型数据集(600,000 个路段)上对其进行了测试。它在点数据(以及边界框很少重叠的其他数据)上表现得更好。如果您实现 R* 树,我强烈建议您将结果可视化,因为当我编写我的树时,它有多个错误降低了树的效率(不影响正确性),并且我能够调整一些决策以获得更好的结果。请务必在大型数据集上进行测试,因为它将揭示小数据集无法揭示的问题。它可能有助于减少用于测试的树的扇出(节点大小),以查看树在几层深度时的工作情况。

我很乐意向您提供源代码,但需要获得雇主的许可。你知道是怎么回事。在我的实现中,我支持强制重新插入,但是我的 PickSplit 和
插入惩罚已被调整。

原始论文,R* 树:一种高效且稳健的点访问方法和矩形,由于某种原因缺少点(“i”上没有句点和点)。另外,他们的术语有点奇怪,例如,当他们说“边距”时,他们的意思是“周长”。

如果您需要可修改的数据结构,R* 树是一个不错的选择。如果您在首次创建树后不需要对其进行修改,请考虑批量加载算法。如果批量加载后只需要对树进行少量修改,普通的 R 树算法就足够了。请注意,R*-tree 和 R-tree 数据在结构上是相同的;只有插入(也许还有删除?我忘了)的算法不同。 R树是1984年的原始数据结构;这是R-的链接树纸

kd-tree 看上去效率很高,实现起来也不是太难,但是只能用对于点数据。

顺便说一句,我如此关注叶节点的原因是

  1. 我需要处理基于磁盘的空间索引。通常,您可以将所有内部节点缓存在内存中,因为它们只占索引大小的一小部分;因此,与未缓存的叶节点所需的时间相比,扫描它们所需的时间很小。
  2. 我通过不在空间索引中存储元素的边界框来节省大量空间,这意味着我必须实际测试每个元素的原始几何形状才能回答查询。因此,最小化所触及的叶节点的数量就更为重要。

I used to use a kind of home-grown QuadTree for spatial indexing (well before I learned the word "quadtree"). For ordinary kinds of spatial data (I deal with street map data), they are fast to create and fast to query, but they scan too many leaf nodes during queries. Specifically, with reasonable node sizes (50-100), my quadtree tended to produce around 300 results for a point query, i.e. 3-6 leaf nodes apply (very rough ballpark; results are highly variable.)

Nowadays, my preferred data structure is the the R*tree. I wrote and tested an implementation myself that obtained very good results. My code for building an R*tree is very slow compared to my QuadTree code, but the bounding boxes on the leaf nodes end up very well organized; at least half of the query space is answered by only one leaf node (i.e. if you do a random point query, there is a good chance that only a single leaf node is returned), and something like 90% of the space is covered by two nodes or less. So with a node size of 80 elements, I'd typically get 80 or 160 results from a point query, with the average closer to 160 (since a few queries do return 3-5 nodes). This holds true even in dense urban areas of the map.

I know this because I wrote a visualizer for my R* tree and the graphical objects inside it, and I tested it on a large dataset (600,000 road segments). It performs even better on point data (and other data in which bounding boxes rarely overlap). If you implement an R* tree I urge you to visualize the results, because when I wrote mine it had multiple bugs that lowered the efficiency of the tree (without affecting correctness), and I was able to tweak some of the decision-making to get better results. Be sure to test on a large dataset, as it will reveal problems that a small dataset does not. It may help to decrease the fan-out (node size) of the tree for testing, to see how well the tree works when it is several levels deep.

I'd be happy to give you the source code except that I would need my employer's permission. You know how it is. In my implementation I support forced reinsertion, but my PickSplit and
insertion penalty have been tweaked.

The original paper, The R* tree: An Efficient and Robust Access Method for Points and Rectangles, is missing dots for some reason (no periods and no dots on the "i"s). Also, their terminology is a bit weird, e.g. when they say "margin", what they mean is "perimeter".

The R* tree is a good choice if you need a data structure that can be modified. If you don't need to modify the tree after you first create it, consider bulk loading algorithms. If you only need to modify the tree a small amount after bulk loading, ordinary R-tree algorithms will be good enough. Note that R*-tree and R-tree data is structurally identical; only the algorithms for insertion (and maybe deletion? I forget) are different. R-tree is the original data structure from 1984; here's a link to the R-tree paper.

The kd-tree looks efficient and not too difficult to implement, but it can only be used for point data.

By the way, the reason I focus on leaf nodes so much is that

  1. I need to deal with disk-based spatial indexes. You can generally cache all the inner nodes in memory because they are a tiny fraction of the index size; therefore the time it takes to scan them is tiny compared to the time required for a leaf node that is not cached.
  2. I save a lot of space by not storing bounding boxes for the elements in the spatial index, which means I have to actually test the original geometry of each element to answer a query. Thus it's even more important to minimize the number of leaf nodes touched.
你列表最软的妹 2024-11-07 09:36:25

我开发了一种基于象限的快速搜索算法,并于几年前将其发布在 ddj.com 上。也许您感兴趣:

加速搜索最近的线路
http://drdobbs.com/windows/198900559

I developed a algorithm for quadrant based fast search and publushed it on ddj.com a couple of years ago. Maybe it's interesting for you:

Accelerated Search For the Nearest Line
http://drdobbs.com/windows/198900559

~没有更多了~
我们使用 Cookies 和其他技术来定制您的体验包括您的登录状态等。通过阅读我们的 隐私政策 了解更多相关信息。 单击 接受 或继续使用网站,即表示您同意使用 Cookies 和您的相关数据。
原文