Optimizing nearest-neighbor queries on 70 million extremely high-density spatial point-cloud records in SQL Server 2008
I have about 75 million records in a SQL Server 2008 R2 Express database. Each is a lat/long corresponding to some value. The table has a geography column. I am trying to find the single nearest neighbor for a given latitude/longitude (point). I already have a query with a spatial index in place, but depending on where the record is in the database, say the first quarter or the last quarter, the query can take about 3 to 30 seconds to find the nearest neighbor. I feel this can be optimized to give a much faster result by tuning the query or the spatial index.
Right now the spatial index is applied with default settings. Here is what my table and query look like.
CREATE TABLE lidar(
[id] [bigint] IDENTITY(1,1) NOT NULL,
[POINTID] [int] NOT NULL,
[GRID_CODE] [numeric](17, 8) NULL,
[geom] [geography] NULL,
CONSTRAINT [PK_lidar_1] PRIMARY KEY CLUSTERED ([id] ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
The spatial index I am using:
CREATE SPATIAL INDEX [SPATIAL_lidar] ON [dbo].[lidar] ([geom]) USING GEOGRAPHY_GRID
WITH (
GRIDS =(LEVEL_1 = MEDIUM,LEVEL_2 = MEDIUM,LEVEL_3 = MEDIUM,LEVEL_4 = MEDIUM),
CELLS_PER_OBJECT = 16, PAD_INDEX = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF,
ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
Here is the query I am using:
declare @ms_at geography = 'POINT (-95.66 30.04)';
select TOP(1) nearPoints.geom.STAsText() as latlon
from
(
select r.geom
from lidar r With(Index(SPATIAL_lidar))
where r.geom.STIntersects(@ms_at.STBuffer(1000)) = 1
) nearPoints
Here is a sample of the lat/longs in my database, to give an idea of the accuracy and density. All 70 million records are for one city (lidar data).
POINT (-95.669434934023087 30.049513838913736)
Now this query gives me results as I described above, but I want to improve the performance as much as possible. My guess is that by tweaking the default values of the spatial index I might be able to improve the performance. Any clues on this?
I tried varying the buffer from 10 to 1000, but with almost the same results.
Also any other suggestions to improve the performance are welcome.
Here is the system I am using right now:
Windows 7 64bit Professional
Intel(R) Core(TM)2 Quad CPU Q9650 @ 3.00GHz (4 CPUs), ~3.0GHz
Ram: 8 GB
NVIDIA GeForce 9500 GT
4 Answers
Have you tried playing around with the index settings? I generally use a higher cells-per-object value and HIGH for all the grid levels. See below:
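(The exact settings this answer pointed to are not preserved in this copy of the page; a rebuild along those lines might look like the sketch below. HIGH at every level and CELLS_PER_OBJECT = 64 are illustrative values, not the answerer's exact numbers.)

CREATE SPATIAL INDEX [SPATIAL_lidar] ON [dbo].[lidar] ([geom]) USING GEOGRAPHY_GRID
WITH (
GRIDS = (LEVEL_1 = HIGH, LEVEL_2 = HIGH, LEVEL_3 = HIGH, LEVEL_4 = HIGH),
CELLS_PER_OBJECT = 64,   -- illustrative; the allowed range is 1-8192
DROP_EXISTING = ON,      -- rebuild the existing index in place
SORT_IN_TEMPDB = ON) ON [PRIMARY]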
Read this article on spatial indexing.
My guess is that your primary filter is very inefficient. You probably need your grid density to be HIGH for the first level, since your points are very dense. One problem I've been struggling with is how to make LEVEL_1 have a density greater than 256 (HIGH). I'm amazed Microsoft forced us to only have 3 options for grid density.
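One way to check how inefficient the primary filter actually is (not part of the original answer, just a hedged suggestion) is SQL Server's built-in diagnostic procedure for geography indexes; the query sample below simply reuses the point from the question:

-- Reports properties of the spatial index, including primary and internal filter
-- efficiency, for a sample query object.
EXEC sp_help_spatial_geography_index
@tabname = 'dbo.lidar',
@indexname = 'SPATIAL_lidar',
@verboseoutput = 1,
@query_sample = 'POINT (-95.66 30.04)';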
Sorry, but this is not a SQL answer, but a way of getting predictable performance assuming certain constraints on your data.
How often is the data changing? If possible, could you pre-calculate a graph of every entity's 5 nearest neighbors and use that to speed up your selection?
If this data is mostly read only, then...
How evenly are these points distributed? If the distribution is fairly even and well known, then you could roll your own spatial mapping by bucketing every coordinate and index in a hash table.
If you don't need to have the data in the database, move it out to a memory-mapped file for fast hash lookups. (70m records should easily fit in memory.)
I have used this architecture to generate sub-millisecond lookups for display advertising and search engine relevance.
==Elaboration==
You simply create a grid of fixed-size squares (like a chessboard), map each point into the grid, and create a list of the objects which belong in each of the grid boxes. If you adjust the size of each box correctly, you should on average have 5-50 points in each square. This is in principle a quad-tree, but without the tree, for simplicity.
For each bucket which is empty after you have scattered all the data into buckets, you add information about which nearby buckets contain data.
You can now number each bucket left-to-right, line-by-line, so that each bucket has a unique number which can be calculated from the coordinates, and you insert each bucket into a hash table, or, if space permits, just into a straight lookup table.
Now when you have your query, you simply calculate which bucket the query point maps to, and you will get either a list of objects in that bucket, or an 'empty' bucket which contains pointers to the nearest buckets that have content.
That will give you your first candidate list of objects, and now you simply have to run through it and see which one is the closest.
In 99% of cases that would be it, but if you are worried that there are candidates in the neighboring buckets which are actually closer, just check the 8 surrounding buckets and see if you can find anything closer there.
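The answer describes an in-memory structure rather than SQL, but the same cell numbering can be sketched in T-SQL for illustration. Everything below is an assumption made for the sketch: the bounding-box origin, the cell size, and the dbo.lidar_grid table name are made up, and empty cells still need the widening step the answer describes.

-- Precompute a cell id for every point (origin and cell size are illustrative).
DECLARE @lon0 float = -96.0, @lat0 float = 29.5, @cell float = 0.001;

SELECT id,
CAST((geom.Long - @lon0) / @cell AS int) AS cell_x,
CAST((geom.Lat  - @lat0) / @cell AS int) AS cell_y
INTO dbo.lidar_grid
FROM dbo.lidar;

CREATE CLUSTERED INDEX IX_lidar_grid ON dbo.lidar_grid (cell_x, cell_y);

-- Lookup: map the query point to its cell and scan that cell plus the 8 neighbours.
DECLARE @q geography = 'POINT (-95.66 30.04)';
DECLARE @qx int = CAST((@q.Long - @lon0) / @cell AS int),
        @qy int = CAST((@q.Lat  - @lat0) / @cell AS int);

SELECT TOP (1) l.geom.STAsText() AS latlon
FROM dbo.lidar_grid g
JOIN dbo.lidar l ON l.id = g.id
WHERE g.cell_x BETWEEN @qx - 1 AND @qx + 1
AND g.cell_y BETWEEN @qy - 1 AND @qy + 1
ORDER BY l.geom.STDistance(@q);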
If you now also want to get a list of all the closest objects, then also calculate a simple network of the 5 nearest neighbours for each object, so you will end up with a data structure like A->{B,C,D,E,F}, B->{A,D,G,H,I}, C->{A,J,K,G,M}....
That will form a simple network which you can then traverse with a variation of Dijkstra's algorithm to get all the points adjacent to your nearest point.
Building the data structures will take time, but once done, and done right, a lookup and returning a dataset can be done in sub-milliseconds (not including any HTTP or off-box communication, of course).
Hope this helps.
For the regular nearest neighbor query on SQL Server 2008, try the approach that Isaac has documented on his blog which uses a numbers table to increase the bounds of the lookup until enough candidates have been found. The other recommendation would be to try varying your grid densities - HHHH or HHMM would likely work better for dense points.
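For reference, the expanding-range pattern from that post looks roughly like the sketch below. It assumes a helper table dbo.Numbers with a single integer column n = 0, 1, 2, ... (a couple of dozen rows is plenty); the starting radius of 1000 meters is just an illustrative value.

DECLARE @g geography = 'POINT (-95.66 30.04)';
DECLARE @start float = 1000;  -- initial search radius in meters, doubled each step

WITH NearestPoints AS
(
SELECT TOP (1) WITH TIES
n.n, l.id, l.geom, l.geom.STDistance(@g) AS dist
FROM dbo.Numbers n
JOIN dbo.lidar l WITH (INDEX(SPATIAL_lidar))
ON l.geom.STDistance(@g) < @start * POWER(2, n.n)
ORDER BY n.n
)
SELECT TOP (1) id, geom.STAsText() AS latlon, dist
FROM NearestPoints
ORDER BY n, dist;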
In SQL Server Denali, this functionality for optimized nearest neighbor queries will also be supported natively using regular SELECT TOP ... ORDER BY syntax.
As a side note, in your example it looks like you are missing an ORDER BY distance clause in your query to go along with the TOP? Your current query will return a point that is within 1000 meters of the target, but not necessarily the nearest one.
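A minimal correction along those lines might look like this (keeping the question's buffer as the index-friendly filter; the 1000-meter radius is simply the value from the question):

declare @ms_at geography = 'POINT (-95.66 30.04)';

select TOP (1) nearPoints.geom.STAsText() as latlon
from
(
select r.geom
from lidar r with (index(SPATIAL_lidar))
where r.geom.STIntersects(@ms_at.STBuffer(1000)) = 1
) nearPoints
order by nearPoints.geom.STDistance(@ms_at);  -- order candidates by true distance so TOP(1) is the nearest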